
ColumnTransformer causes OpenML to create a new setup for otherwise identical pipelines #773


Description

@hp2500

Hi there,
I am running experiments where it is important that I can compare the performance of the same setup across different tasks. Generally this works fine, but whenever a task has mixed data types in its feature matrix I run into issues.

Specifically, I am using a ColumnTransformer to apply different preprocessing steps to the categorical and numerical features. I run the same pipeline on each task, but OpenML creates a new setup for each of these runs, which makes it impossible to find the runs that belong to a given pipeline / parameter configuration.

The only thing that changes from run to run, as far as I can tell, is the set of numeric and categorical columns, and hence the indicator masks I pass to the ColumnTransformer. My suspicion is that this causes OpenML to treat each pipeline as a different pipeline and to create a new setup for each run, even though the pipeline is technically the same.
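
To make the suspicion concrete, here is a small self-contained sketch (not tied to any OpenML call; the masks are hypothetical): building the same pipeline for two different column layouts yields different hyperparameter values, which I assume is what OpenML serializes into the setup.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC

def build_pipe(cat_mask):
    # same pipeline as in the example below, parameterized by the categorical mask
    num_mask = [not c for c in cat_mask]
    preprocessor = ColumnTransformer(
        transformers=[('num', make_pipeline(StandardScaler()), num_mask),
                      ('cat', make_pipeline(OneHotEncoder(handle_unknown='ignore')), cat_mask)])
    return make_pipeline(preprocessor, SVC(gamma='scale', random_state=1))

# hypothetical masks for two tasks whose columns are laid out differently
pipe_a = build_pipe([True, False, False])
pipe_b = build_pipe([False, True, False])

# the masks show up as hyperparameters of the ColumnTransformer step
masks_a = [cols for _, _, cols in pipe_a.get_params()['columntransformer__transformers']]
masks_b = [cols for _, _, cols in pipe_b.get_params()['columntransformer__transformers']]
print(masks_a == masks_b)   # False, even though the pipelines are otherwise identical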

Here is a simple example:

import openml
from sklearn.svm import SVC 
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder 

# get task
task = openml.tasks.get_task(3022)

# get dataset object 
data = openml.datasets.get_dataset(task.dataset_id)

# get relevant info from dataset object
X, y, categorical_indicator, attribute_names = data.get_data(dataset_format='array',
                                                            target=data.default_target_attribute)
# make indicator masks
cat = categorical_indicator
num = [not k for k in categorical_indicator]

# make columntransformer
numeric_transformer = make_pipeline(StandardScaler())
categorical_transformer = make_pipeline(OneHotEncoder(handle_unknown='ignore'))
preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, num),
                  ('cat', categorical_transformer, cat)])

# make pipeline
clf = SVC(gamma='scale', random_state=1)
pipe = make_pipeline(preprocessor, clf)

# run task
run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
run.publish()

# check setup 
openml.runs.get_run(run.run_id).setup_id

8231820

# get task
task = openml.tasks.get_task(23)

# get dataset object 
data = openml.datasets.get_dataset(task.dataset_id)

# get relevant info from dataset object
X, y, categorical_indicator, attribute_names = data.get_data(dataset_format='array',
                                                            target=data.default_target_attribute)
# make indicator masks
cat = categorical_indicator
num = [not k for k in categorical_indicator]

# make columntransformer
numeric_transformer = make_pipeline(StandardScaler())
categorical_transformer = make_pipeline(OneHotEncoder(handle_unknown='ignore'))
preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, num),
                  ('cat', categorical_transformer, cat)])

# make pipeline
clf = SVC(gamma='scale', random_state=1)
pipe = make_pipeline(preprocessor, clf)

# run task
run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
run.publish()

# check setup 
openml.runs.get_run(run.run_id).setup_id

8231821

This does not happen when I leave out the ColumnTransformer (even though I really should be using it):

# get task
task = openml.tasks.get_task(3022)

# get dataset object 
data = openml.datasets.get_dataset(task.dataset_id)

# make pipeline
clf = SVC(gamma='scale', random_state=1)
pipe = make_pipeline(StandardScaler(), clf)

# run task
run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
run.publish()

# check setup 
openml.runs.get_run(run.run_id).setup_id

8231791

# get task
task = openml.tasks.get_task(23)

# get dataset object 
data = openml.datasets.get_dataset(task.dataset_id)

# make pipeline
clf = SVC(gamma='scale', random_state=1)
pipe = make_pipeline(StandardScaler(), clf)

# run task
run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
run.publish()

# check setup 
openml.runs.get_run(run.run_id).setup_id

8231791

Is this behavior intended?
Is there a non-hacky solution to this problem?
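
One direction I can imagine (purely a sketch on my side, assuming the data is loaded as a pandas DataFrame and a scikit-learn version that provides make_column_selector) is to select columns by dtype instead of passing a per-dataset mask, so the ColumnTransformer's hyperparameters stay identical across tasks. Whether openml-python can serialize such a callable selector into a setup is part of my question.

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC

# dtype-based selectors are configured identically for every task, so the pipeline's
# hyperparameters would no longer encode dataset-specific column indices
preprocessor = ColumnTransformer(
    transformers=[('num', make_pipeline(StandardScaler()),
                   make_column_selector(dtype_exclude='category')),
                  ('cat', make_pipeline(OneHotEncoder(handle_unknown='ignore')),
                   make_column_selector(dtype_include='category'))])
pipe = make_pipeline(preprocessor, SVC(gamma='scale', random_state=1))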

