Skip to content

Conversation

@Neeratyoy
Copy link
Contributor

Reference Issue

PR #777.

What does this PR implement/fix? Explain your changes.

Makes the changes/improvements suggested by @PGijsbers that were not addressed prior to merge.

@PGijsbers
Copy link
Collaborator

Checks manually cancelled.

@Neeratyoy
Copy link
Contributor Author

Neeratyoy commented Oct 26, 2020

@PGijsbers wrt to why an experimental HistGradientBoostingClassifier is used in this example:

To handle mixed-type data frames with missing values, we need to create a transformation which imputes these missing values for both categorical and continuous types. Under the current OpenML setup, we cannot create a flow with two instances of the same class, a SimpleImputer in this case. So far, for various unit tests, we have been handling it via an alias such as this.

Creating an alias class within the testing script has issues while serializing since it has got no __version__ attribute. For our test cases, it works probably since it is imported from within openml and the __version__ is shared possibly.
To prevent importing from openml.testing in a documentation example, it was decided to bypass the need of a second imputer. @mfeurer suggested the HistGradientBoostingClassifier since it can handle numeric missing values implicitly. Therefore, for this example, a passthrough works.

@Neeratyoy Neeratyoy marked this pull request as ready for review October 26, 2020 14:38
@Neeratyoy Neeratyoy mentioned this pull request Oct 26, 2020
@PGijsbers
Copy link
Collaborator

That makes sense (even though it is a bit unfortunate). However, why not consider creating the example on a task which does not have two types of missing values? While all examples implicitly depend on some version(s) of scikit-learn, it feels using an experimental feature has exceptionally narrow support. Old versions (<=0.20) won't have this, future versions won't have this.

@Neeratyoy
Copy link
Contributor Author

Neeratyoy commented Oct 26, 2020

However, why not consider creating the example on a task which does not have two types of missing values?

Yeah I was wondering why that was not done here but then I think we need to change the design of the example.

At this point, it samples 3 tasks randomly from the suite. Do we then fix this task list? If so, do we provide a justification?

Either way, I shall try and find 3 tasks where missing values if present is restricted to one of the types. We can then re-iterate and replace the HistGradientBoostingClassifier with a more stable classifier.

On further checking, this very task is already the case where missing values exist for categorical columns but not numeric columns. Replacing the experimental model with an RF expectedly worked fine. I pushed that version since your concern with this example being compatible across versions is an apt assessment.

@mfeurer and I were earlier trying to come up with an example that allows the generic case of handling missing values across types. Given the constraints of the setup, the pipeline with the experimental model was the only way out we could think of.

Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I'll approve once I see that the tests are green.

@PGijsbers PGijsbers merged commit 4923e5b into develop Oct 29, 2020
@PGijsbers PGijsbers deleted the fix_pr777 branch October 29, 2020 10:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants