# dask-searchcv¶

Tools for performing hyperparameter optimization of Scikit-Learn models using Dask.

## Introduction¶

This library provides implementations of Scikit-Learn’s `GridSearchCV`

and
`RandomizedSearchCV`

. They implement many (but not all) of the same
parameters, and should be a drop-in replacement for the subset that they do
implement. For certain problems, these implementations can be more efficient
than those in Scikit-Learn, as they can avoid expensive repeated computations.

For more information, see this blogpost.

## Highlights¶

- Drop-in replacement for Scikit-Learn’s
`GridSearchCV`

and`RandomizedSearchCV`

. - Flexible Backends. Hyperparameter optimization can be done in parallel using threads, processes, or distributed across a cluster.
- Works well with Dask collections. Dask
arrays, dataframes, and delayed can be passed to
`fit`

. - Avoid repeated work. Candidate estimators with
identical parameters and inputs will only be fit once. For
composite-estimators such as
`Pipeline`

this can be significantly more efficient as it can avoid expensive repeated computations.

## Install¶

Dask-searchcv is available via `conda`

or `pip`

:

```
# Install with conda
$ conda install dask-searchcv -c conda-forge
# Install with pip
$ pip install dask-searchcv
```

## Walkthrough¶

### Drop-In Replacement¶

Dask-searchcv provides (almost) drop-in replacements for Scikit-Learn’s
`GridSearchCV`

and `RandomizedSearchCV`

. With the exception of a few
keyword arguments, the api’s are exactly the same, and often only an import
change is necessary:

```
from sklearn.datasets import load_digits
from sklearn.svm import SVC
# Fit with dask-searchcv
from dask_searchcv import GridSearchCV
param_space = {'C': [1e-4, 1, 1e4],
'gamma': [1e-3, 1, 1e3],
'class_weight': [None, 'balanced']}
model = SVC(kernel='rbf')
digits = load_digits()
search = GridSearchCV(model, param_space, cv=3)
search.fit(digits.data, digits.target)
```

### Flexible Backends¶

Dask-searchcv can use any of the dask schedulers. By default the threaded scheduler is used, but this can easily be swapped out for the multiprocessing or distributed scheduler:

```
# Distribute grid-search across a cluster
from dask.distributed import Client
scheduler_address = '127.0.0.1:8786'
client = Client(scheduler_address)
search.fit(digits.data, digits.target)
```

### Works Well With Dask Collections¶

Dask collections such as `dask.array`

, `dask.dataframe`

and
`dask.delayed`

can be passed to `fit`

. This means you can use dask to do
your data loading and preprocessing as well, allowing for a clean workflow.
This also allows you to work with remote data on a cluster without ever having
to pull it locally to your computer:

```
import dask.dataframe as dd
# Load data from s3
df = dd.read_csv('s3://bucket-name/my-data-*.csv')
# Do some preprocessing steps
df['x2'] = df.x - df.x.mean()
# ...
# Pass to fit without ever leaving the cluster
search.fit(df[['x', 'x2']], df['y'])
```

### Avoid Repeated Work¶

When searching over composite estimators like `sklearn.pipeline.Pipeline`

or
`sklearn.pipeline.FeatureUnion`

, dask-searchcv will avoid fitting the same
estimator + parameter + data combination more than once. For pipelines with
expensive early steps this can be faster, as repeated work is avoided.

For example, given the following 3-stage pipeline and grid (modified from this scikit-learn example).

```
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier())])
grid = {'vect__ngram_range': [(1, 1)],
'tfidf__norm': ['l1', 'l2'],
'clf__alpha': [1e-3, 1e-4, 1e-5]}
```

the Scikit-Learn grid-search implementation looks something like (simplified):

```
scores = []
for ngram_range in parameters['vect__ngram_range']:
for norm in parameters['tfidf__norm']:
for alpha in parameters['clf__alpha']:
vect = CountVectorizer(ngram_range=ngram_range)
X2 = vect.fit_transform(X, y)
tfidf = TfidfTransformer(norm=norm)
X3 = tfidf.fit_transform(X2, y)
clf = SGDClassifier(alpha=alpha)
clf.fit(X3, y)
scores.append(clf.score(X3, y))
best = choose_best_parameters(scores, parameters)
```

As a directed acyclic graph, this might look like:

In contrast, the dask version looks more like:

```
scores = []
for ngram_range in parameters['vect__ngram_range']:
vect = CountVectorizer(ngram_range=ngram_range)
X2 = vect.fit_transform(X, y)
for norm in parameters['tfidf__norm']:
tfidf = TfidfTransformer(norm=norm)
X3 = tfidf.fit_transform(X2, y)
for alpha in parameters['clf__alpha']:
clf = SGDClassifier(alpha=alpha)
clf.fit(X3, y)
scores.append(clf.score(X3, y))
best = choose_best_parameters(scores, parameters)
```

With a corresponding directed acyclic graph:

Looking closely, you can see that the Scikit-Learn version ends up fitting earlier steps in the pipeline multiple times with the same parameters and data. Due to the increased flexibility of Dask over Joblib, we’re able to merge these tasks in the graph and only perform the fit step once for any parameter/data/estimator combination. For pipelines that have relatively expensive early steps, this can be a big win when performing a grid search.