
Conversation

@nicklamiller
Contributor

@nicklamiller nicklamiller commented Oct 24, 2025

Contributes to: #7031

Adds test for _cint64_array_to_numpy:

```python
def _cint64_array_to_numpy(*, cptr: "ctypes._Pointer", length: int) -> np.ndarray:
    """Convert a ctypes int pointer array to a numpy array."""
    if isinstance(cptr, ctypes.POINTER(ctypes.c_int64)):
        return np.ctypeslib.as_array(cptr, shape=(length,)).copy()
    else:
        raise RuntimeError("Expected int64 pointer")
```
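For illustration, the conversion this helper performs can be reproduced with plain ctypes and NumPy. This is a minimal sketch independent of LightGBM's internals; the buffer here is a stand-in for the pointer LightGBM's C API would return:

```python
import ctypes

import numpy as np

# Stand-in for the pointer returned by LightGBM's C API: a small int64 buffer.
buf = (ctypes.c_int64 * 4)(0, 2, 5, 7)
cptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_int64))

# The isinstance check in the helper passes for this pointer type...
assert isinstance(cptr, ctypes.POINTER(ctypes.c_int64))

# ...and the conversion views the C buffer, then copies it so the
# resulting NumPy array owns its own memory.
arr = np.ctypeslib.as_array(cptr, shape=(4,)).copy()
print(arr.tolist())  # [0, 2, 5, 7]
```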

@nicklamiller nicklamiller changed the title Add test for converting a ctypes int64 pointer array to a NumPy array [python-package] Add test for converting a ctypes int64 pointer array to a NumPy array Oct 24, 2025
Collaborator

@jameslamb jameslamb left a comment


Thanks for working on this @nicklamiller

Before I review this... did you try to do this through LightGBM's public API?

From #7031:

Tests should only use lightgbm's public API, unless that is very difficult or expensive. Any function whose name begins with a _ is considered private.

This particular internal function is so small and simple, I think it'd be preferable to have tests which cover it via the public API. That'd give us more coverage of lightgbm under whatever conditions lead to this function being invoked. I'm not exactly sure what code paths do that, probably passing int64 arrays for label or something similar... you'd have to do a bit of investigation.

@nicklamiller
Contributor Author

did you try to do this through LightGBM's public API?

@jameslamb thank you very much for the detailed instructions in #7031 and sorry for completely missing that point! Testing through the public API to mimic how functionality is called in the wild by users makes sense, and _cint64_array_to_numpy is now tested through Booster(...).predict(...).

Newly covered lines

Running:

```shell
pytest \
    --cov=lightgbm \
    --cov-report="term" \
    --cov-report="html:htmlcov" \
    tests/python_package_test/test_engine.py
```

and viewing the coverage report for `python-package/lightgbm/basic.py` (where `_cint64_array_to_numpy` is defined):

on `master`: (coverage screenshot: `coverage_master`)

on this PR's branch: (coverage screenshot: `coverage_pr`)

Some additional notes/thoughts:

  • `_cint64_array_to_numpy` only ever gets called if `pred_contrib=True` in `Booster(...).predict(...)`; see the call chain below.
  • Even though `_cint64_array_to_numpy` is defined in `basic.py`, I noticed some tests in `test_engine.py` were already directly testing `Booster(...).predict(...)` with `pred_contrib=True`, so I placed my test in `test_engine.py` instead of `test_basic.py`. One notable example of such a test is `test_contribs_sparse`:
```python
def test_contribs_sparse():
    n_features = 20
    n_samples = 100
    # generate CSR sparse dataset
    X, y = make_multilabel_classification(
        n_samples=n_samples, sparse=True, n_features=n_features, n_classes=1, n_labels=2
    )
    y = y.flatten()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
    params = {
        "objective": "binary",
        "verbose": -1,
    }
    lgb_train = lgb.Dataset(X_train, y_train)
    gbm = lgb.train(params, lgb_train, num_boost_round=20)
    contribs_csr = gbm.predict(X_test, pred_contrib=True)
    assert isspmatrix_csr(contribs_csr)
    # convert data to dense and get back same contribs
    contribs_dense = gbm.predict(X_test.toarray(), pred_contrib=True)
    # validate the values are the same
    if platform.machine() == "aarch64":
        np.testing.assert_allclose(contribs_csr.toarray(), contribs_dense, rtol=1, atol=1e-12)
    else:
        np.testing.assert_allclose(contribs_csr.toarray(), contribs_dense)
    assert np.linalg.norm(gbm.predict(X_test, raw_score=True) - np.sum(contribs_dense, axis=1)) < 1e-4
    # validate using CSC matrix
    X_test_csc = X_test.tocsc()
    contribs_csc = gbm.predict(X_test_csc, pred_contrib=True)
    assert isspmatrix_csc(contribs_csc)
    # validate the values are the same
    if platform.machine() == "aarch64":
        np.testing.assert_allclose(contribs_csc.toarray(), contribs_dense, rtol=1, atol=1e-12)
    else:
        np.testing.assert_allclose(contribs_csc.toarray(), contribs_dense)
```
    • I debated appending to this test, but given that it was already testing quite a few things, I decided to add `test_predict_contrib_int64` as a separate one.
  • There are some failing CI jobs for older CUDA versions. It's not immediately obvious to me whether these are related to my changes; I'll dig a bit deeper.
`_cint64_array_to_numpy` call chain from Booster(...).predict(...)

```
Booster.predict(pred_contrib=True)
    ↓
_InnerPredictor.predict(pred_contrib=True)
    ↓
__pred_for_csr(csr, predict_type=_C_API_PREDICT_CONTRIB)
    ↓
__inner_predict_csr_sparse(csr, predict_type=_C_API_PREDICT_CONTRIB)
    ↓
_LIB.LGBM_BoosterPredictSparseOutput()  # C API call
    ↓
__create_sparse_native(csr, out_ptr_indptr, ...)
    ↓
_cint64_array_to_numpy(cptr=out_ptr_indptr, length=indptr_len)
```
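As a side note on the last step of the chain: the `.copy()` matters because `np.ctypeslib.as_array` on a pointer only creates a view over C-owned memory, which LightGBM later frees. A minimal sketch of the view-vs-copy difference, using plain ctypes and NumPy rather than LightGBM itself:

```python
import ctypes

import numpy as np

buf = (ctypes.c_int64 * 3)(1, 2, 3)
cptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_int64))

view = np.ctypeslib.as_array(cptr, shape=(3,))  # aliases the C buffer
copy = view.copy()                              # owns independent memory

buf[0] = 99  # mutate the underlying C buffer
print(view[0], copy[0])  # 99 1 -- the view reflects the change, the copy does not
```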

Collaborator

@jameslamb jameslamb left a comment


Thanks so much for your effort here and for figuring out a code path through the public API that relies on this function!!! I really appreciate your thorough explanation, as always.

I've merged in latest master to trigger a new CI run, let's see if that issue in the CUDA jobs persists.

I've put up a few suggestions for changes to the tests. In general, when testing LightGBM with very small datasets, it's important to ensure that a non-empty model (multiple trees and splits) is trained. There are code paths that are skipped when the model doesn't have any splits.

Comment on lines 2017 to 2022
```python
n_samples = 100
n_features = 5

X, y = make_multilabel_classification(
    n_samples=n_samples, sparse=True, n_features=n_features, n_classes=1, n_labels=2
)
```

Suggested change:

```diff
-n_samples = 100
-n_features = 5
-X, y = make_multilabel_classification(
-    n_samples=n_samples, sparse=True, n_features=n_features, n_classes=1, n_labels=2
-)
+X, y = make_multilabel_classification(
+    n_samples=100, sparse=True, n_features=5, n_classes=1, n_labels=2
+)
```

Since these values are only used one time in the code, let's just hard-code them here instead of introducing this indirection.

Comment on lines 2028 to 2032
```python
params = {"objective": "binary", "num_leaves": 7, "learning_rate": 0.1, "verbose": -1}
booster = lgb.Booster(params, train_data)

for _ in range(5):
    booster.update()
```

Suggested change:

```diff
-params = {"objective": "binary", "num_leaves": 7, "learning_rate": 0.1, "verbose": -1}
-booster = lgb.Booster(params, train_data)
-for _ in range(5):
-    booster.update()
+params = {
+    "objective": "binary",
+    "num_leaves": 7,
+    "min_data_in_bin": 1,
+    "min_data_in_leaf": 1,
+    "seed": 708,
+    "verbose": -1,
+}
+booster = lgb.train(params, train_set=train_data, num_boost_round=5)
+assert booster.num_trees() == 5
```

Instead of manually implementing the training loop, let's just use lgb.train().

And because this is such a small Dataset, let's set min_data_* parameters to ensure that the trained model actually has some trees and splits.


@nicklamiller
Contributor Author

when testing LightGBM with very small datasets, it's important to ensure that a non-empty model (multiple trees and splits) is trained. There are code paths that are skipped when the model doesn't have any splits.

@jameslamb Ah, I see; that makes sense. I've gone ahead and added these suggestions.

It looks like the CUDA jobs were still failing even after merging the latest master, specifically due to test_predict_contrib_int64 (error). So it appears there's an issue with 64-bit indices in sparse matrices, as used in this test, on the CUDA jobs. For that reason, I've simply skipped this test for CUDA jobs, but please let me know if we want to make sure this functionality works when CUDA is used.
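For context on what the test feeds LightGBM, here is one way to obtain a CSR matrix with 64-bit indices. This is an illustrative sketch assuming SciPy, not the exact code from the PR:

```python
import numpy as np
from scipy import sparse

# SciPy CSR matrices typically use int32 indices by default; widening
# them to int64 is what exercises the 64-bit sparse code path.
X = sparse.random(10, 5, density=0.5, format="csr", random_state=42)
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
print(X.indptr.dtype)  # int64
```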

Collaborator

@jameslamb jameslamb left a comment


Thanks for your work on this!

I think skipping this test is totally fine. Exciting that adding this testing actually found a bug! Even if it's intentional that the library doesn't currently support int64 sparse matrix indices for the CUDA variant, it's still a bug that it results in a segfault instead of a nice exception.

Could you add a short bug report issue documenting that? It'd be nice to eventually either support this or at least eliminate the segfault.

I left one other very small suggestion. Please address that and update to latest master... otherwise I think this is ready to merge 😁

@nicklamiller
Contributor Author

@jameslamb thanks very much for the review!

Could you add a short bug report issue documenting that? It'd be nice to eventually either support this or at least eliminate the segfault.

This is now being tracked in #7101.

@jameslamb
Collaborator

That issue looks perfect, thank you so much!!!

Don't worry about the failing R-package tests here, I bet it is still some fallout from recent R-devel changes (#7099). Hopefully will be resolved by a re-run in a few hours or tomorrow.

@jameslamb jameslamb merged commit d5e449d into microsoft:master Dec 10, 2025
100 of 110 checks passed
@nicklamiller nicklamiller deleted the cint64-arr-test branch December 10, 2025 05:40
