Grid Searching in all the Right Places
During my last week at the Zipfian Academy, I came across a pretty awesome snippet of code. We have been doing a lot of machine learning over the last few weeks, and my workflow up to this point could be described as a seemingly infinite loop of training on data and tweaking parameters.
Grid Searching with Scikit Learn
One solution to the inefficient tweak/train workflow is a grid search. Although feature selection and a preliminary exploration of the parameter space are still necessary, I will no longer be endlessly tweaking parameters by hand. As it turns out, scikit-learn has a very elegant way to evaluate a model over a grid of parameters, with k-fold cross-validation at each point, via sklearn.grid_search.GridSearchCV.
But before I run the grid search, I need to import all the necessary tools and load in the data. The data needs to be of type numpy.ndarray. Personally, I like to do all of my feature scrubbing in another function, which I will call get_data() in this case.
from pprint import pprint
from datetime import datetime as dt
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = get_data()  # user-defined loading/scrubbing function
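Note that get_data() is not part of scikit-learn; it is just a placeholder for whatever loading and scrubbing your data needs. Here is a minimal sketch of what it might look like, assuming the data lives in a CSV with a binary label column (both the file name and the column name below are hypothetical):

import numpy as np
import pandas as pd

def get_data():
    # hypothetical example: load a CSV and split it into features and target
    df = pd.read_csv("data.csv")     # assumed file name
    y = df.pop("label").values       # assumed binary target column
    X = df.values                    # remaining columns become the feature matrix
    return X.astype(np.float64), y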
Both pprint and datetime will be useful for displaying the results of the search. In this example I am using sklearn.metrics.roc_auc_score, a ROC area-under-the-curve metric with which to compare model runs.
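If you haven't used it before, roc_auc_score takes the true binary labels and predicted scores and returns the area under the ROC curve. A quick sanity check:

from sklearn.metrics import roc_auc_score

# three of the four positive/negative pairs are ranked correctly, so AUC = 0.75
roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75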
Setting up the models and their parameters
To set up the grid search, you select models by instantiating a Pipeline object, and select parameters by defining a dictionary whose keys take the form "model__kwarg" (the pipeline step name and the estimator keyword argument, joined by a double underscore) and whose values are tuples of the settings to try.
Additionally, in order to prepare the data for training I will use sklearn.cross_validation.train_test_split to split the data into train and test numpy arrays.
pipeline = Pipeline([
    # ('clf', LogisticRegression()),
    ('kNN', KNeighborsClassifier()),
])

parameters = {
    # 'clf__C': (1, 10),
    # 'clf__penalty': ('l1', 'l2'),
    'kNN__n_neighbors': (5, 10, 30),
}

# split data into train and test sets
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.10, random_state=1)
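The same naming convention scales to pipelines with more than one step. For illustration only (this is not part of the search above), a sketch that puts a scaling step in front of the classifier, with parameters searched on both steps:

from sklearn.preprocessing import StandardScaler

# each step gets a name; parameter keys are prefixed with that step's name
pipeline_scaled = Pipeline([
    ('scale', StandardScaler()),
    ('kNN', KNeighborsClassifier()),
])

parameters_scaled = {
    'scale__with_mean': (True, False),  # searched on the scaler
    'kNN__n_neighbors': (5, 10, 30),    # searched on the classifier
}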
Running the search
Next, initialize GridSearchCV with a few keyword parameters. n_jobs may seem like an innocuous little keyword, but looks are deceiving: n_jobs = -1 allows you to parallelize your grid search across all of the cores on your machine. Alternatively, you can set n_jobs to -2 to use all but one of your machine's cores; think of it as an index into the list of cores on your system.
Also note that multiprocessing requires the grid search to run inside a "__main__"-protected block.
if __name__ == "__main__":
    # initialize grid search
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring="roc_auc")

    print("\nPerforming grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)

    # time the fit
    t0 = dt.now()
    grid_search.fit(xtrain, ytrain)
    print("done in {}\n".format(dt.now() - t0))
Well, what now?
Conveniently, GridSearchCV stores the best score in the best_score_ instance attribute, and the estimator fit with the winning parameters in the best_estimator_ attribute. They can be accessed and sent to stdout like so:
print("\nBest score: {:0.3f}".format(grid_search.best_score_))
print("\nBest parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
print("\t{}: {}".format(param_name, best_parameters[param_name]))
Re-fit
And that’s it. Pretty neat, right?
One final note: if you want to make predictions with the optimal parameter set, pass refit = True when instantiating GridSearchCV to re-fit the best estimator on the entire training set.
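With the re-fit estimator in hand, the grid search object can predict directly, which makes it easy to score the winning model on the held-out test set. A quick sketch, assuming a binary target so that roc_auc_score applies:

# score the re-fit best estimator on the held-out test set
probs = grid_search.predict_proba(xtest)[:, 1]  # probability of the positive class
print("test ROC AUC: {:0.3f}".format(roc_auc_score(ytest, probs)))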
Happy grid searching!