Conversation
|
Should also fix #898. |
openml/datasets/dataset.py
Outdated
| # The file is likely corrupt, see #780. | ||
| # We deal with this when loading the data in `_load_data`. | ||
| return data_pickle_file, data_feather_file, feather_attribute_file | ||
| except Exception: |
There was a problem hiding this comment.
Could you specifically catch ModuleNotFoundError and any others that are expected? I would prefer not to use a catch-all as if new issues arise, we could possibly just fix them instead.
There was a problem hiding this comment.
You can add a check for the error of #898 too. And document in comments in one sentence why an error is expected. Including the issue number is good, but since they are easily summarized it is nice to be able to have the documentation within the code, e.g.
catch ModuleNotFoundError:
# 780: Pickled dataframe is likely of pandas<1.0 while attempting to load with pandas>=1.0
return ...
catch ValueError:
(maybe check for the specific message)
# 898: Pickled dataframe pickled with protocol 5 (Py3.8), but loaded with protocol 4 (<Py3.8).
return ...
There was a problem hiding this comment.
Good point, I just updated the code.
Codecov Report
@@ Coverage Diff @@
## develop #925 +/- ##
===========================================
- Coverage 88.05% 87.65% -0.41%
===========================================
Files 37 37
Lines 4363 4383 +20
===========================================
Hits 3842 3842
- Misses 521 541 +20
Continue to review full report at Codecov.
|
PGijsbers
left a comment
There was a problem hiding this comment.
See attached comments
PGijsbers
left a comment
There was a problem hiding this comment.
It might be worth to have another look to see if we can centralize data loading logic so we do not have to catch the same error in two places. That said, I am also fine merging the PR like this, one step at a time.
Closes #918
Issue #918 was caused by having a pickle file created with pandas < 1.0 and then being read with pandas >= 1.0. pandas changed the internals of handling categorical values and therefore the dataframe cannot be unpickled any more. This PR changes the behavior of OpenML-Python, and instead of crashing when failing on unpickling the dataset, it reads the data from arff and emits a warning that the pickle is not readable.