
I'm using scikit-learn to tune a model's hyper-parameters, and I'm using a pipeline to chain the preprocessing with the estimator. A simple version of my problem would look like this:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

_ = grid.fit(X=np.random.rand(10, 3),
             y=np.random.randint(2, size=(10,)))

In my case the preprocessing (what would be StandardScaler() in the toy example) is time consuming, and I'm not tuning any of its parameters.

So, when I execute the example, StandardScaler is executed 8 times: 2 (fit/transform on the training split plus transform on the validation split) * 2 CV folds * 2 values of C. But every time StandardScaler is executed for a different value of the parameter C it returns the same output, so it would be much more efficient to compute it once and then just run the estimator part of the pipeline.

I can manually split the pipeline between the preprocessing (no hyper-parameters tuned) and the estimator. But to apply the preprocessing correctly, I should fit it on the training set only. So I would have to implement the splits manually and not use GridSearchCV at all.

Is there a simple/standard way to avoid repeating the preprocessing while using GridSearchCV?


5 Answers


Update: ideally, the answer below should not be used, as it leads to data leakage, as discussed in the comments. In this answer, GridSearchCV tunes the hyperparameters on data already preprocessed by StandardScaler, which is not correct. In most situations that should not matter much, but algorithms that are very sensitive to scaling will give misleading results.


Essentially, GridSearchCV is also an estimator, implementing fit() and predict() methods, so it can be used as a step in the pipeline.

So instead of:

grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

Do this:

clf = make_pipeline(StandardScaler(), 
                    GridSearchCV(LogisticRegression(),
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))

clf.fit(X, y)
clf.predict(X)

What this does is call StandardScaler() only once, for one call to clf.fit(), instead of the multiple calls you described.

Edit:

Changed refit to True when GridSearchCV is used inside a pipeline. As mentioned in the documentation:

refit : boolean, default=True Refit the best estimator with the entire dataset. If “False”, it is impossible to make predictions using this GridSearchCV instance after fitting.

If refit=False, clf.fit() will have no useful effect, because the GridSearchCV object inside the pipeline is not refitted on the data, so there is nothing to predict with afterwards. When refit=True, GridSearchCV is refitted with the best-scoring parameter combination on the whole data passed to fit().

So refit=False is only appropriate if you build the pipeline just to look at the grid-search scores. If you want to call clf.predict(), refit=True must be used, otherwise a NotFittedError will be raised.
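Putting this together, a minimal runnable sketch of the approach on the question's toy data could look as follows (note that, as the update above warns, StandardScaler is now fitted on all of X, including the inner validation folds):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.rand(10, 3)
y = np.random.randint(2, size=(10,))

# GridSearchCV is the last pipeline step, so the scaler is fitted once per call
# to clf.fit() instead of once per (fold, parameter) combination.
clf = make_pipeline(
    StandardScaler(),
    GridSearchCV(LogisticRegression(),
                 # plain 'C': the estimator inside GridSearchCV is not a pipeline here
                 param_grid={'C': [0.1, 10.]},
                 cv=2,
                 refit=True))

clf.fit(X, y)
print(clf.predict(X))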

Comments:

  • I didn't think about using GridSearchCV in the pipe itself; sounds like a brilliant idea. Thanks a lot! Commented Apr 12, 2017 at 12:36
  • @MarcGarcia But do make sure to set refit=True, else it will throw an error when calling clf.predict() Commented Apr 12, 2017 at 13:12
  • Doesn't this technique use all the data in StandardScaler() instead of just the training set? I don't see how it avoids doing the splits manually. Commented May 12, 2017 at 23:51
  • @VivekKumar OK, I see that. But then during fit(), GridSearchCV will tune the hyperparameters by CV on data already preprocessed by StandardScaler(), so StandardScaler() will also have been fitted on the validation folds of GridSearchCV (not the test set passed to predict()), which isn't correct for me, because the validation set shouldn't be preprocessed. Commented May 13, 2017 at 8:25
  • @ShashwatSiddhant param_grid in your case goes inside the GridSearchCV; it has nothing to do with make_pipeline here. So in your case, param_grid should only contain 'C' and 'gamma'. Commented Dec 2, 2019 at 9:46

For those who stumbled upon a slightly different problem, which I had as well.

Suppose you have this pipeline:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1))])

Then, when specifying parameters, you need to prefix them with the name you gave the estimator step, 'clf', followed by a double underscore ('clf__'). So the parameter grid is going to be:

params={'clf__max_features':[0.3, 0.5, 0.7],
        'clf__min_samples_leaf':[1, 2, 3],
        'clf__max_depth':[None]
        }
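To double-check which parameter names are available, and to wire this grid into the search, here is a short sketch (cv=3 and the X_train/y_train names are placeholders, not from the original answer):

from sklearn.model_selection import GridSearchCV

# List every tunable parameter name, including the 'vectorizer__' and 'clf__' prefixes
print(sorted(classifier.get_params().keys()))

grid = GridSearchCV(classifier, param_grid=params, cv=3)
# grid.fit(X_train, y_train)  # X_train: raw documents, y_train: labels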

It is not possible to do this in scikit-learn 0.18.1 (the current version at the time of writing). A fix has been proposed on the GitHub project:

https://github.com/scikit-learn/scikit-learn/issues/8830

https://github.com/scikit-learn/scikit-learn/pull/8322


Use the memory argument to make_pipeline, e.g. together with a temporary directory:


import shutil
import tempfile

cache_dir = tempfile.mkdtemp()
... make_pipeline(..., memory=cache_dir) ...

# after GridSearchCV has finished
shutil.rmtree(cache_dir)
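Applied to the question's toy example, a minimal sketch of this caching approach could look as follows (it assumes a scikit-learn version that supports the memory parameter of make_pipeline, i.e. 0.19 or later); the cached StandardScaler is only refitted when the training fold changes, not for every value of C:

import shutil
import tempfile
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

cache_dir = tempfile.mkdtemp()

grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression(),
                                  memory=cache_dir),  # cache fitted transformers on disk
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

grid.fit(X=np.random.rand(10, 3),
         y=np.random.randint(2, size=(10,)))

shutil.rmtree(cache_dir)  # remove the cache when the search is done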

I joined the party late, but here is a new solution/insight using Pipeline():

  • a sub-pipeline containing your model (regressor/classifier) as a single component
  • a main pipeline made of the routine components:
    • pre-processing components, e.g. scaler, dimensionality reduction, etc.
    • your refitted GridSearchCV(regressor, param) with the desired/best params for your model (note: don't forget refit=True), based on @Vivek Kumar's remark (ref)
# Build an end-to-end pipeline and supply the data to the regression model, training and fitting within the main pipeline.
# This avoids leaking the test/validation set into the train set.

# Create and train the sub-pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

sgd_subpipeline = Pipeline(steps=[  # ('scaler', MinMaxScaler()),  # better not to rescale inside the sub-pipeline
                                  ('SGD', SGDRegressor(random_state=0)),
])

# Define the hyperparameter grid
param_grid = {
    'SGD__loss':     ['squared_error', 'epsilon_insensitive', 'squared_epsilon_insensitive', 'huber'],
    'SGD__penalty':  ['l2', 'l1', 'elasticnet'],
    'SGD__alpha':    [0.0001, 0.001, 0.01],
    'SGD__l1_ratio': [0.15, 0.25, 0.5]
}

# Perform grid search
grid_search = GridSearchCV(sgd_subpipeline, param_grid, cv=5, n_jobs=-1, verbose=True, refit=True)
grid_search.fit(X_train, y_train)

# Get the best model
best_sgd_reg = grid_search.best_estimator_

# Print the best hyperparameters
print('=========================================[Best Hyperparameters info]=====================================')
print(grid_search.best_params_)

# Summarize the best result
print('Best CV score: %.3f'  % grid_search.best_score_)
print('Best Config: %s' % grid_search.best_params_)
print('==========================================================================================================')

# Create the main pipeline by chaining the scaler with the refitted GridSearchCV sub-pipeline

sgd_pipeline = Pipeline(steps=[('scaler', MinMaxScaler()),  # better to rescale outside the sub-pipeline
                               ('SGD', grid_search),
])

# Fit the best model on the training data within the pipeline (like fitting any model/transformer)

sgd_pipeline.fit(X_train, y_train)

#--------------------------------------------------------------
# Displaying a Pipeline with a Preprocessing Step and Regression
from sklearn import set_config
set_config(display="text")

[text representation of the fitted pipeline]

Alternatively, you can use TransformedTargetRegressor (specifically if you need to de-scale y, as @mloning commented here) and chain this component together with your regression model (ref). Note:

  • you don't need to set the transformer argument unless you need to de-scale y; in that case, check the related posts 1, 2, 3, 4
  • pay attention to this remark about not scaling y, since:

... With scaling y you actually lose your units....

  • here, it is recommended to (see the sketch after this quote):

... Do the transformation outside the pipeline. ...
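A minimal sketch of what doing the transformation outside the pipeline could look like, assuming a simple log transform of the target (np.log1p/np.expm1 are just illustrative choices, and X_test is a placeholder for held-out data):

import numpy as np

y_train_log = np.log1p(y_train)                    # transform y outside the pipeline
sgd_pipeline.fit(X_train, y_train_log)             # train on the transformed target
y_pred = np.expm1(sgd_pipeline.predict(X_test))    # invert the transform on the predictions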

# The sub-pipeline, hyperparameter grid, and grid search are identical to the
# ones defined above (sgd_subpipeline, param_grid, grid_search), so they are
# not repeated here.



# Create the main pipeline: a scaler followed by a TransformedTargetRegressor wrapping the refitted grid search
from sklearn.compose import TransformedTargetRegressor

TTR_sgd_pipeline = Pipeline(steps=[('scaler', MinMaxScaler()),  # better to rescale outside the sub-pipeline
                                   ('TTR', TransformedTargetRegressor(regressor=grid_search,
                                                                      # transformer=MinMaxScaler(),
                                                                      # func=np.log, inverse_func=np.exp,
                                                                      check_inverse=False))
])



# Fit the best model on the training data within the pipeline (like fitting any model/transformer)
TTR_sgd_pipeline.fit(X_train, y_train)

#--------------------------------------------------------------
# Displaying a Pipeline with a Preprocessing Step and Regression
from sklearn import set_config
set_config(display="diagram")

[diagram of the fitted pipeline]
