Description
Hi there,
I am running experiments where it is important that I can compare the performance of the same setup across different tasks. In general this works fine, but whenever I run a task whose feature matrix contains mixed data types I run into problems.
Specifically, I use a ColumnTransformer to apply different preprocessing steps to the categorical and the numerical features. I run the same pipeline on every task, but OpenML creates a new setup for each of these runs, which makes it impossible to find the runs that belong to a given pipeline / parameter configuration.
The only thing that changes from run to run, as far as I can tell, are the columns of the numerical and categorical features, and hence the indicator masks that I pass to the ColumnTransformer. My suspicion is that this causes OpenML to treat each pipeline as a different pipeline and to create a new setup for every run, even though the pipeline is conceptually the same.
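To check this suspicion, the indicator masks of the two tasks used below can be compared directly; a minimal sketch, using the same get_data call as in the example:

import openml

# Compare the categorical indicator masks of the two tasks used below.
# If they differ, the column lists passed to the ColumnTransformer differ too.
for task_id in (3022, 23):
    task = openml.tasks.get_task(task_id)
    data = openml.datasets.get_dataset(task.dataset_id)
    _, _, categorical_indicator, _ = data.get_data(
        dataset_format='array', target=data.default_target_attribute)
    print(task_id, categorical_indicator)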
Here is a simple example:
import openml
from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# get task
task = openml.tasks.get_task(3022)
# get dataset object
data = openml.datasets.get_dataset(task.dataset_id)
# get relevant info from dataset object
X, y, categorical_indicator, attribute_names = data.get_data(
    dataset_format='array', target=data.default_target_attribute)
# make indicator masks
cat = categorical_indicator
num = [not k for k in categorical_indicator]
# make columntransformer
numeric_transformer = make_pipeline(StandardScaler())
categorical_transformer = make_pipeline(OneHotEncoder(handle_unknown='ignore'))
preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, num), ('cat', categorical_transformer, cat)])
# make pipeline
clf = SVC(gamma='scale', random_state=1)
pipe = make_pipeline(preprocessor, clf)
# run task
run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
run.publish()
# check setup
openml.runs.get_run(run.run_id).setup_id  # 8231820
# get task
task = openml.tasks.get_task(23)
# get dataset object
data = openml.datasets.get_dataset(task.dataset_id)
# get relevant info from dataset object
X, y, categorical_indicator, attribute_names = data.get_data(
    dataset_format='array', target=data.default_target_attribute)
# make indicator masks
cat = categorical_indicator
num = [not k for k in categorical_indicator]
# make columntransformer
numeric_transformer = make_pipeline(StandardScaler())
categorical_transformer = make_pipeline(OneHotEncoder(handle_unknown='ignore'))
preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, num), ('cat', categorical_transformer, cat)])
# make pipeline
clf = SVC(gamma='scale', random_state=1)
pipe = make_pipeline(preprocessor, clf)
# run task
run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
run.publish()
# check setup
openml.runs.get_run(run.run_id).setup_id  # 8231821
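Under the assumption that the two setups can be compared via openml.setups.get_setup (so the attribute names below may be slightly off), a quick diff of the two setups should show that only the ColumnTransformer's column specification differs:

import openml

# Fetch both setups and print every parameter whose value differs;
# my expectation is that only the transformers / column lists are different.
setup_a = openml.setups.get_setup(8231820)
setup_b = openml.setups.get_setup(8231821)
for pid, param in setup_a.parameters.items():
    other = setup_b.parameters.get(pid)
    if other is not None and param.value != other.value:
        print(param.parameter_name)
        print('  task 3022:', param.value)
        print('  task 23:  ', other.value)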
This does not happen when I am not using the ColumnTransformer (even though I should be using one):
# get task
task = openml.tasks.get_task(3022)
# get dataset object
data = openml.datasets.get_dataset(task.dataset_id)
# make pipeline
clf = SVC(gamma='scale', random_state=1)
pipe = make_pipeline(StandardScaler(), clf)
# run task
run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
run.publish()
# check setup
openml.runs.get_run(run.run_id).setup_id  # 8231791
# get task
task = openml.tasks.get_task(23)
# get dataset object
data = openml.datasets.get_dataset(task.dataset_id)
# make pipeline
clf = SVC(gamma='scale', random_state=1)
pipe = make_pipeline(StandardScaler(), clf)
# run task
run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
run.publish()
# check setup
openml.runs.get_run(run.run_id).setup_id  # 8231791
Is this behavior intended?
Is there a non-hacky solution to this problem?
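One workaround I can think of is to select the columns by dtype inside the pipeline (e.g. with sklearn.compose.make_column_selector, available in scikit-learn >= 0.22), so that the pipeline object itself is identical for every task. I am not sure whether this plays well with the openml scikit-learn extension, so treat the following only as a sketch:

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC

# The column selectors resolve the columns at fit time from the dataframe's
# dtypes, so no task-specific boolean masks end up in the pipeline definition.
# This requires feeding a dataframe (dataset_format='dataframe') instead of
# a numpy array.
preprocessor = ColumnTransformer(transformers=[
    ('num', make_pipeline(StandardScaler()),
     make_column_selector(dtype_exclude='category')),
    ('cat', make_pipeline(OneHotEncoder(handle_unknown='ignore')),
     make_column_selector(dtype_include='category')),
])
pipe = make_pipeline(preprocessor, SVC(gamma='scale', random_state=1))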