
ColumnTransformer causes OpenML to create a new setup for otherwise identical pipelines #773


Description

@hp2500

Hi there,
I am running experiments where it is important that I can compare the performance of the same setup across different tasks. Generally this works fine, but whenever a task has mixed data types in its feature matrix I run into issues.

Specifically, I am using a ColumnTransformer to apply different preprocessing steps to the categorical and numerical features. I run the same pipeline on each task, but OpenML creates a new setup for each of these runs, which makes it impossible to find the runs that belong to a given pipeline / parameter configuration.

The only thing that changes from run to run, as far as I can tell, is the set of numeric and categorical columns, and hence the indicator masks I pass to the ColumnTransformer. My suspicion is that this causes OpenML to treat each pipeline as a different pipeline and to create a new setup for each run, even though the pipeline is technically the same.
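
To make the suspicion concrete, here is a small self-contained sketch (not tied to any OpenML call; the masks are hypothetical): building the same pipeline for two different column layouts yields different hyperparameter values, which I assume is what OpenML serializes into the setup.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC

def build_pipe(cat_mask):
    # same pipeline as in the example below, parameterized by the categorical mask
    num_mask = [not c for c in cat_mask]
    preprocessor = ColumnTransformer(
        transformers=[('num', make_pipeline(StandardScaler()), num_mask),
                      ('cat', make_pipeline(OneHotEncoder(handle_unknown='ignore')), cat_mask)])
    return make_pipeline(preprocessor, SVC(gamma='scale', random_state=1))

# hypothetical masks for two tasks whose columns are laid out differently
pipe_a = build_pipe([True, False, False])
pipe_b = build_pipe([False, True, False])

# the masks show up as hyperparameters of the ColumnTransformer step
masks_a = [cols for _, _, cols in pipe_a.get_params()['columntransformer__transformers']]
masks_b = [cols for _, _, cols in pipe_b.get_params()['columntransformer__transformers']]
print(masks_a == masks_b)   # False, even though the pipelines are otherwise identical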

Here is a simple example:

import openml
from sklearn.svm import SVC 
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder 

# get task
task = openml.tasks.get_task(3022)

# get dataset object 
data = openml.datasets.get_dataset(task.dataset_id)

# get relevant info from dataset object
X, y, categorical_indicator, attribute_names = data.get_data(dataset_format='array',
                                                            target=data.default_target_attribute)
# make indicator masks
cat = categorical_indicator
num = [not k for k in categorical_indicator]

# make columntransformer
numeric_transformer = make_pipeline(StandardScaler())
categorical_transformer = make_pipeline(OneHotEncoder(handle_unknown='ignore'))
preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, num),
                  ('cat', categorical_transformer, cat)])

# make pipeline
clf = SVC(gamma='scale', random_state=1)
pipe = make_pipeline(preprocessor, clf)

# run task
run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
run.publish()

# check setup 
openml.runs.get_run(run.run_id).setup_id

8231820

# get task
task = openml.tasks.get_task(23)

# get dataset object 
data = openml.datasets.get_dataset(task.dataset_id)

# get relevant info from dataset object
X, y, categorical_indicator, attribute_names = data.get_data(dataset_format='array',
                                                            target=data.default_target_attribute)
# make indicator masks
cat = categorical_indicator
num = [not k for k in categorical_indicator]

# make columntransformer
numeric_transformer = make_pipeline(StandardScaler())
categorical_transformer = make_pipeline(OneHotEncoder(handle_unknown='ignore'))
preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, num),
                  ('cat', categorical_transformer, cat)])

# make pipeline
clf = SVC(gamma='scale', random_state=1)
pipe = make_pipeline(preprocessor, clf)

# run task
run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
run.publish()

# check setup 
openml.runs.get_run(run.run_id).setup_id

8231821

This does not happen when I leave out the ColumnTransformer (even though I really should be using it):

# get task
task = openml.tasks.get_task(3022)

# get dataset object 
data = openml.datasets.get_dataset(task.dataset_id)

# make pipeline
clf = SVC(gamma='scale', random_state=1)
pipe = make_pipeline(StandardScaler(), clf)

# run task
run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
run.publish()

# check setup 
openml.runs.get_run(run.run_id).setup_id

8231791

# get task
task = openml.tasks.get_task(23)

# get dataset object 
data = openml.datasets.get_dataset(task.dataset_id)

# make pipeline
clf = SVC(gamma='scale', random_state=1)
pipe = make_pipeline(StandardScaler(), clf)

# run task
run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
run.publish()

# check setup 
openml.runs.get_run(run.run_id).setup_id

8231791

Is this behavior intended?
Is there a non-hacky solution to this problem?
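
One direction I can imagine (purely a sketch on my side, assuming the data is loaded as a pandas DataFrame and a scikit-learn version that provides make_column_selector) is to select columns by dtype instead of passing a per-dataset mask, so the ColumnTransformer's hyperparameters stay identical across tasks. Whether openml-python can serialize such a callable selector into a setup is part of my question.

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC

# dtype-based selectors are configured identically for every task, so the pipeline's
# hyperparameters would no longer encode dataset-specific column indices
preprocessor = ColumnTransformer(
    transformers=[('num', make_pipeline(StandardScaler()),
                   make_column_selector(dtype_exclude='category')),
                  ('cat', make_pipeline(OneHotEncoder(handle_unknown='ignore')),
                   make_column_selector(dtype_include='category'))])
pipe = make_pipeline(preprocessor, SVC(gamma='scale', random_state=1))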

