[python-package] Add test for converting a ctypes int64 pointer array to a NumPy array
#7071
Conversation
jameslamb left a comment
Thanks for working on this @nicklamiller
Before I review this... did you try to do this through LightGBM's public API?
From #7031:
> Tests should only use lightgbm's public API, unless that is very difficult or expensive. Any function whose name begins with a `_` is considered private.
This particular internal function is so small and simple, I think it'd be preferable to have tests which cover it via the public API. That'd give us more coverage of lightgbm under whatever conditions lead to this function being invoked. I'm not exactly sure what code paths do that, probably passing int64 arrays for label or something... you'd have to do a bit of investigation.
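(Illustrative sketch, not from this PR: the kind of direct, private-helper test being discouraged here. The `cptr`/`length` keyword signature is taken from the call chain shown later in this thread; everything else is assumed.)

```python
import ctypes

import numpy as np

# private helper: its name begins with "_", so tests should prefer the public API
from lightgbm.basic import _cint64_array_to_numpy

# build a small C array of int64 values and a pointer to it
data = (ctypes.c_int64 * 3)(1, 2, 3)
cptr = ctypes.cast(data, ctypes.POINTER(ctypes.c_int64))

arr = _cint64_array_to_numpy(cptr=cptr, length=3)
np.testing.assert_array_equal(arr, np.array([1, 2, 3], dtype=np.int64))
```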
This reverts commit 511070f.
@jameslamb thank you very much for the detailed instructions in #7031, and sorry for completely missing that point! Testing through the public API, to mimic how functionality is called in the wild by users, makes sense.

**Newly covered lines**

Running:

```shell
pytest \
  --cov=lightgbm \
  --cov-report="term" \
  --cov-report="html:htmlcov" \
  tests/python_package_test/test_engine.py
```

and viewing the coverage report.

Some additional notes/thoughts:
**`_cint64_array_to_numpy` call chain from `Booster(...).predict(...)`**
```text
Booster.predict(pred_contrib=True)
  ↓
_InnerPredictor.predict(pred_contrib=True)
  ↓
__pred_for_csr(csr, predict_type=_C_API_PREDICT_CONTRIB)
  ↓
__inner_predict_csr_sparse(csr, predict_type=_C_API_PREDICT_CONTRIB)
  ↓
_LIB.LGBM_BoosterPredictSparseOutput()  # C API call
  ↓
__create_sparse_native(csr, out_ptr_indptr, ...)
  ↓
_cint64_array_to_numpy(cptr=out_ptr_indptr, length=indptr_len)
```
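(Illustrative sketch, not the PR's actual test; the dataset and parameters are made up. It drives the chain above purely through the public API by requesting feature contributions on CSR input.)

```python
import numpy as np
import scipy.sparse
import lightgbm as lgb

rng = np.random.default_rng(42)
X = scipy.sparse.csr_matrix(rng.random((100, 5)))
y = rng.integers(0, 2, size=100)

booster = lgb.train(
    {"objective": "binary", "verbose": -1},
    lgb.Dataset(X, label=y),
    num_boost_round=5,
)

# pred_contrib=True on CSR input takes the sparse C API path above; the
# returned indptr comes back as an int64 pointer and is converted to a
# NumPy array by _cint64_array_to_numpy.
contribs = booster.predict(X, pred_contrib=True)
print(type(contribs))  # a scipy.sparse matrix of per-feature contributions
```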
jameslamb left a comment
Thanks so much for your effort here and for figuring out a code path through the public API that relies on this function!!! I really appreciate your thorough explanation, as always.
I've merged in latest master to trigger a new CI run, let's see if that issue in the CUDA jobs persists.
I've put up a few suggestions for changes to the tests. In general, when testing LightGBM with very small datasets, it's important to ensure that a non-empty model (multiple trees and splits) is trained. There are code paths that are skipped when the model doesn't have any splits.
```python
n_samples = 100
n_features = 5

X, y = make_multilabel_classification(
    n_samples=n_samples, sparse=True, n_features=n_features, n_classes=1, n_labels=2
)
```
Suggested change:

```diff
-n_samples = 100
-n_features = 5
-X, y = make_multilabel_classification(
-    n_samples=n_samples, sparse=True, n_features=n_features, n_classes=1, n_labels=2
-)
+X, y = make_multilabel_classification(
+    n_samples=100, sparse=True, n_features=5, n_classes=1, n_labels=2
+)
```
Since these values are only used one time in the code, let's just hard-code them here instead of introducing this indirection.
| params = {"objective": "binary", "num_leaves": 7, "learning_rate": 0.1, "verbose": -1} | ||
| booster = lgb.Booster(params, train_data) | ||
|
|
||
| for _ in range(5): | ||
| booster.update() |
| params = {"objective": "binary", "num_leaves": 7, "learning_rate": 0.1, "verbose": -1} | |
| booster = lgb.Booster(params, train_data) | |
| for _ in range(5): | |
| booster.update() | |
| params = { | |
| "objective": "binary", | |
| "num_leaves": 7, | |
| "min_data_in_bin": 1, | |
| "min_data_in_leaf": 1, | |
| "seed": 708, | |
| "verbose": -1 | |
| } | |
| booster = lgb.train(params, train_set=train_data, num_boost_round=5) | |
| assert booster.num_trees() == 5 |
Instead of manually implementing the training loop, let's just use `lgb.train()`.
And because this is such a small `Dataset`, let's set `min_data_*` parameters to ensure that the trained model actually has some trees and splits.
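(Hypothetical addition, not part of the suggestion above: since `num_trees()` also counts trees without splits, the dumped model structure can confirm that splits were actually made.)

```python
# each entry in tree_info describes one tree; num_leaves > 1 means the
# tree contains at least one split
model = booster.dump_model()
assert all(tree["num_leaves"] > 1 for tree in model["tree_info"])
```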
@jameslamb Ah I see, that makes sense; I've gone ahead and added these suggestions. It looks like the CUDA jobs were still failing even after merging the latest master, and specifically due to …
jameslamb left a comment
Thanks for your work on this!
I think skipping this test is totally fine. Exciting that adding this testing actually found a bug! Even if it's intentional that the library doesn't currently support int64 sparse matrix indices for the CUDA variant, it's still a bug that it results in a segfault instead of a nice exception.
Could you add a short bug report issue documenting that? It'd be nice to eventually either support this or at least eliminate the segfault.
I left one other very small suggestion. Please address that and update to latest master... otherwise I think this is ready to merge 😁
@jameslamb thanks very much for the review!
This is now being tracked in #7101.
That issue looks perfect, thank you so much!!! Don't worry about the failing R-package tests here, I bet it is still some fallout from recent R-devel changes (#7099). Hopefully it will be resolved by a re-run in a few hours or tomorrow.


Contributes to: #7031
Adds a test for `_cint64_array_to_numpy` (`python-package/lightgbm/basic.py`, lines 504 to 509 at commit `5dbfcdc`).
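(The permalinked source lines are not captured here; below is a hedged reconstruction of what such a helper looks like. The type check and error message are assumptions.)

```python
import ctypes

import numpy as np


def _cint64_array_to_numpy(*, cptr: "ctypes._Pointer", length: int) -> np.ndarray:
    """Convert a ctypes int64 pointer array to a NumPy array (sketch)."""
    if isinstance(cptr, ctypes.POINTER(ctypes.c_int64)):
        # copy so the result owns memory independent of the C-side buffer
        return np.ctypeslib.as_array(cptr, shape=(length,)).copy()
    else:
        raise RuntimeError("Expected int64 pointer")
```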