Fix/sklearn test compatibility #1340
Conversation
It is unclear how to create the condition under which this test is supposed to pass; even after running the test suite two or three times, it still does not pass.
There are some minor changes to the docstrings. I am not sure it is useful to keep testing it this way, so for now I will disable the test on newer versions.
The loss has been renamed. The model's performance also seems to have changed slightly for the same seed, so I decided to compare with the lower precision that was already used on Windows systems.
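As a hedged illustration (not the actual test code; the values and tolerance below are made up), a looser comparison could look like this:

```python
import pytest

# Hypothetical values: compare the model's score against a reference with a
# looser absolute tolerance, as was already done for Windows runners.
expected_score = 0.92
observed_score = 0.9231
assert observed_score == pytest.approx(expected_score, abs=1e-2)
```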
edit: no longer relevant. Test failures seem to be caused by exceeding the 10-minute time limit we have set. I am not sure why these tests take that long, but it looks like they fail and are retried. Their failure is likely deterministic, however, so I will go ahead and disable retries so the test fails hard instead, with (hopefully) a more meaningful error message.
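A sketch of what disabling retries could look like, assuming the retries come from a flaky-test marker such as pytest-rerunfailures (that is an assumption on my part; the test name below is hypothetical):

```python
import pytest

# Hypothetical test: reruns=0 makes it fail hard on the first error
# instead of being silently retried by the rerun plugin.
@pytest.mark.flaky(reruns=0)
def test_run_and_upload():
    ...
```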
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #1340      +/-   ##
============================================
+ Coverage    32.94%   83.95%   +51.00%
============================================
  Files           38       38
  Lines         5254     5259        +5
============================================
+ Hits          1731     4415     +2684
+ Misses        3523      844     -2679

☔ View full report in Codecov by Sentry.
Tests are incredibly slow and I am not sure why. This also happens locally, so it's not just the runners; I suspect it's the server. edit: it looks like the culprit is mostly the production server :( (most, if not all, of these long tests use the production server)
NumPy 2.0 cleaned up its namespace.
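For context, a minimal illustration of the kind of namespace cleanup meant here (the aliases below are examples I chose, not necessarily the ones that affected this code):

```python
import numpy as np

# Illustrative only: a couple of NumPy 1.x aliases that no longer exist in
# the NumPy 2.0 namespace, and their replacements.
x = np.nan         # np.NaN was removed; use np.nan
dt = np.float64    # np.float_ was removed; use np.float64
print(x, dt)
```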
edit: message outdated, the PR now fully supports NumPy 2.0. Old message: On NumPy 2.0 support / scikit-learn support (from #1341):
But I am currently leaning towards putting in a bit of extra time to make sure we can test on later scikit-learn versions.
There is a breaking change to the way 'mode' works, which breaks scikit-learn internals.
It seems to me that run.evaluations is set only when the run is fetched, and whether it has evaluations depends on server state. So if the server resolves the traces between the initial fetch and the trace check, you can end up calling len(run.evaluations) while evaluations is None.
Scikit-learn or NumPy changed the typing of the parameters (seen in a masked array; I am not sure whether it also happens outside of that). Convert these values back to Python builtins.
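A minimal sketch of such a conversion, assuming the values arrive as NumPy scalar types (the helper name is mine, not the one in the PR):

```python
import numpy as np

def to_builtin(value):
    # NumPy scalar types (np.int64, np.float64, np.bool_, ...) expose .item(),
    # which returns the corresponding Python builtin (int, float, bool).
    if isinstance(value, np.generic):
        return value.item()
    return value

assert isinstance(to_builtin(np.int64(3)), int)
assert isinstance(to_builtin(np.float64(0.5)), float)
```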
  )

- if upload_flow and flow_id is None:
+ if upload_flow and flow_id is False:
flow_exists does not return None but False when a flow does not exist.
I find flow_id is False easier to parse than not flow_id, since flow_id can be either an integer or a boolean. While both are correct, the check is clearer to me this way.
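A minimal sketch of the corrected logic, assuming flow_exists returns the flow id as an int when the flow is on the server and False otherwise (the wrapper function is mine, for illustration only):

```python
import openml

def ensure_flow_id(flow, upload_flow):
    # flow_exists returns the flow id (an int) if the flow is on the server,
    # and False (not None) if it does not exist.
    flow_id = openml.flows.flow_exists(flow.name, flow.external_version)
    if upload_flow and flow_id is False:
        flow.publish()        # the flow is not on the server yet, so upload it
        flow_id = flow.flow_id
    return flow_id
```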
This bug would fail a unit test only the first time around; later unit tests would upload the flow anyway, which means that on the second run this path is no longer taken and the tests can pass. Because there are other tests which update server state and need to be run a few times, this error was never caught: it was just part of the initial sea of reds after a server reset.
runs-on: ${{ matrix.os }}
strategy:
  matrix:
    python-version: ["3.8"]
I rewrote the matrix to be effectively the same, but also testing each scikit-learn 1.x release.
The only difference is that the default Python is now 3.9, since that is supported by all scikit-learn versions (except 0.23), which makes things easier.
  - name: Install numpy for Python 3.8
    # Python 3.8 & scikit-learn<0.24 requires numpy<=1.23.5
-   if: ${{ matrix.python-version == '3.8' && contains(fromJSON('["0.23.1", "0.22.2", "0.21.2"]'), matrix.scikit-learn) }}
+   if: ${{ matrix.python-version == '3.8' && matrix.scikit-learn == '0.23.1' }}
The older versions are no longer included in the run matrix anyway.
We don't want to serialize the value as np.nan; we want to include the NaN directly, as an indication that the parameter was left unset.
time.sleep(10)
continue

run = openml.runs.get_run(run_id, ignore_cache=True)
Sometimes tests fail with the old code because run.evaluations is None, but I am not able to reproduce this locally. However, the failures do seem fewer now that I moved where the run object is loaded (which makes sense: initially there is a race condition where the trace may not yet be processed during the get_run call but is by the time get_run_trace is called). Additionally, I changed the check from a length check to a None check. As far as I can tell, evaluations should be either a non-empty dictionary or None, so the length check doesn't make a lot of sense (the behavior was probably different historically). I kept an assert with the old check just to make sure my assumption is correct (and we get an error if it isn't).
The default behavior when no evaluations are present is for run.evaluations to be None, so it makes sense to check for that instead. As far as I can tell, run.evaluations should always contain some items when it is not None, but I added an assert just in case.
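A hedged sketch of the adjusted polling logic (the wrapper function and retry count are mine, for illustration only):

```python
import time
import openml

def wait_for_evaluations(run_id, max_polls=10):
    for _ in range(max_polls):
        run = openml.runs.get_run(run_id, ignore_cache=True)
        if run.evaluations is None:
            # The server has not processed the evaluations yet; poll again.
            time.sleep(10)
            continue
        # Guard for the assumption that a non-None value is a non-empty dict.
        assert len(run.evaluations) > 0
        return run
    raise RuntimeError(f"Run {run_id} has no evaluations after {max_polls} polls")
```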
@LennartPurucker I pinged you for a review since at this point I am unsure what to do. The work is done, kind of. Sometimes the tests still fail sporadically because the pytest workers get killed on GitHub Actions; I have not had this happen locally, and it's not consistent: they can also pass. At this point there are already large improvements to the tests and CI, so it's probably already worth merging. From there, I can work on both new features/updates and debugging, but then at least that can happen in parallel and be discussed in separate, smaller PRs. What do you think? edit: I suspect this may also be due to the timeout, though previously I have had clearer messages when the timeout functionality was triggered. I'll remove the timeouts so we can compare CI results. edit2: The one (sporadically) failing test now fails due to an internal timeout. I think it's fairly safe to say this has to be a server-side issue, since the same test passes on other instances.
I suspect they "crash" workers. This of course introduces the risk of hanging processes... But I cannot reproduce the issue locally.
LennartPurucker left a comment
LGTM!
I agree and think we should merge this. The changes are all reasonable and important improvements.
This PR updates the tests to be compatible with newer versions of scikit-learn. In the process, it fixes some bugs and adds NumPy 2.0 support. I updated the CI matrix to include newer versions of scikit-learn. For the most part, openml-python works fine on newer versions; it was just that the unit tests themselves had hard dependencies on specifics. I extended/updated them in most cases, but in a few it was a little unreasonable (they need to be refactored to make them less sensitive to small scikit-learn API changes). I have already spent quite a bit of time on this, so I hope the proposed changes are good enough for now, so that we can work on a 0.15.0 release.
The docs fail in the check-links step. I propose to fix this in a separate PR.