2018-06-01 19:10 Algorithm engineer, Guilin University of Electronic Technology


Python机器学习基础教程 (Introduction to Machine Learning with Python), Chapter 6

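First, Pipeline. My workflow chains several components together, for example SelectKBest and RandomForestClassifier: after selecting the 100 best features, I run my random forest model to check whether the more refined feature set performs better. Pipeline is the tool that strings these operations (feature selection and transformation, then the random forest classifier) into one coherent workflow.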

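So why use a Pipeline instead of keeping the steps separate? The main reasons are:

1. The code is more readable;

2. There is less bookkeeping: I no longer have to track by hand which transformation and evaluation steps the data fed to the model has gone through;

3. Pipeline components are easy to add, remove, or modify (plug and play);

4. Most importantly, the whole workflow can be managed conveniently with GridSearchCV.

The following code sets up a Pipeline and runs SelectKBest and RandomForestClassifier.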

import sklearn.ensemble
import sklearn.feature_selection
import sklearn.metrics
import sklearn.model_selection
import sklearn.pipeline

# X and y are the feature matrix and labels, assumed to be loaded already
select = sklearn.feature_selection.SelectKBest(k=100)
clf = sklearn.ensemble.RandomForestClassifier()

steps = [('feature_selection', select),
         ('random_forest', clf)]

pipeline = sklearn.pipeline.Pipeline(steps)

# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.33, random_state=42)

### fit your pipeline on X_train and y_train
pipeline.fit(X_train, y_train)

### call pipeline.predict() on your X_test data to make a set of test predictions
y_prediction = pipeline.predict(X_test)

### test your predictions using sklearn.metrics.classification_report()
report = sklearn.metrics.classification_report(y_test, y_prediction)

### and print the report
print(report)

             precision    recall  f1-score   support

          0       0.80      0.56      0.66      5007
          1       0.12      0.00      0.01       942
          2       0.69      0.92      0.79      7119

avg / total       0.69      0.72      0.68     13068

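In the workflow I build a list of steps for feature selection and model fitting, such as SelectKBest and RandomForestClassifier, and put them into a Pipeline so that they form one coherent process. Components in the pipeline can be swapped in and out as needed; the final step, for instance, is the estimator that makes the predictions.

Incidentally, I changed the way I evaluate the model slightly: here I use sklearn.metrics.classification_report(), which gives more information (per-class precision, recall, and F1) than the cross_val_score used in the previous post. It requires setting the train/test split by hand, whereas cross_val_score builds the training and test folds automatically.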
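To make the difference between the two evaluation styles concrete, here is a small sketch. It is not the data from this post: I am assuming the iris dataset as a stand-in, with k=2 and n_estimators=50 chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Stand-in data (assumption): iris instead of the dataset used in the post.
X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("feature_selection", SelectKBest(k=2)),
    ("random_forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Style 1: one manual split, then a per-class precision/recall/F1 report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)
pipeline.fit(X_train, y_train)
report = classification_report(y_test, pipeline.predict(X_test))
print(report)

# Style 2: cross_val_score does the splitting itself and returns one score per fold.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores, scores.mean())
```

The report gives a breakdown per class from a single split; cross_val_score gives a single number per fold but averages over several splits.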

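Now for GridSearchCV. Selecting the top 100 features was an arbitrary choice on my part; likewise, the default parameters of RandomForestClassifier may not give the best performance.

So instead of fixing 100 features by hand, I now use grid search to choose the number of features together with the random forest parameters, here n_estimators and min_samples_split. This spares me the tedious work of tuning parameters and step combinations by hand: every combination in the grid is tried, so the search is not limited to whatever setting I happened to try first, and the best setting within the grid is found.

GridSearchCV builds a grid of all parameter combinations, tries each one, and returns the model with the best combination.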

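import sklearn.metrics
import sklearn.model_selection

# keys follow the <step name>__<parameter name> convention
parameters = dict(feature_selection__k=[100, 200],
                  random_forest__n_estimators=[50, 100, 200],
                  random_forest__min_samples_split=[2, 3, 4, 5, 10])

# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in sklearn.model_selection
cv = sklearn.model_selection.GridSearchCV(pipeline, param_grid=parameters)

cv.fit(X_train, y_train)
y_predictions = cv.predict(X_test)

report = sklearn.metrics.classification_report(y_test, y_predictions)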
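To get a feel for how much work the grid above implies, the combinations can be enumerated without fitting anything. This sketch uses sklearn.model_selection.ParameterGrid, which expands a parameter dict into its combinations; the variable names n_candidates and first are mine.

```python
from sklearn.model_selection import ParameterGrid

# The same grid as above: 2 values of k, 3 of n_estimators, 5 of min_samples_split.
parameters = dict(
    feature_selection__k=[100, 200],
    random_forest__n_estimators=[50, 100, 200],
    random_forest__min_samples_split=[2, 3, 4, 5, 10],
)

grid = ParameterGrid(parameters)
n_candidates = len(grid)      # 2 * 3 * 5 combinations
first = list(grid)[0]         # one combination is a plain dict of settings

print(n_candidates)
print(first)
```

Each candidate is fitted once per cross-validation fold, so with k folds the search costs n_candidates × k fits, plus one final refit of the best model.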

Here is an example of using sklearn's Pipeline:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

clf = RandomForestClassifier()

steps = [("my_classifier", clf)]

### "my_classifier" is the name of the random forest classifier in the steps list; min_samples_split is the associated sklearn parameter that I want to vary
parameters = dict(my_classifier__min_samples_split=[2, 3, 4, 5])

pipe = Pipeline(steps)

cv = GridSearchCV(pipe, param_grid=parameters)

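Once the grid is set up, the power of GridSearchCV is that it runs every parameter combination, evaluating each with cross-validation (k=3 in the scikit-learn version used here; the default became 5-fold in version 0.22). Afterwards I can let GridSearchCV predict with the "best" parameter set automatically (that is, with the best model it tried), or I can explicitly ask for the best model or the best parameters through the best_estimator_ and best_params_ attributes. Parameter search is time-consuming, but the result is a much better understanding of how the model behaves across its parameters.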
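A short sketch of reading the results back after the search. The iris dataset, n_estimators=20, and cv=3 are assumptions I picked to keep the run fast; the attributes best_params_, best_score_, best_estimator_, and cv_results_ are part of GridSearchCV.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)  # stand-in data (assumption)

pipe = Pipeline([
    ("my_classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])
parameters = {"my_classifier__min_samples_split": [2, 3, 4, 5]}

# cv=3 is set explicitly so the fold count does not depend on the installed version.
search = GridSearchCV(pipe, param_grid=parameters, cv=3)
search.fit(X, y)

best_params = search.best_params_        # dict with the winning setting
best_score = search.best_score_          # mean cross-validated score of the winner
best_model = search.best_estimator_      # pipeline refitted on all of X, y
mean_scores = search.cv_results_["mean_test_score"]  # one mean score per candidate

# Calling predict on the search object delegates to best_estimator_.
predictions = search.predict(X[:5])
print(best_params, best_score)
```

Looking at mean_scores for all candidates, rather than only at the winner, is what gives the picture of how sensitive the model is to each parameter.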

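Of course, GridSearchCV can also be used on a single estimator rather than on a Pipeline. For example, I could tune SelectKBest or RandomForestClassifier on its own. But because the steps sometimes interact, optimizing the whole pipeline at once is useful and worth exploring. Finally, GridSearchCV cross-validates every step automatically, so steps such as feature extraction are cross-validated too, not just the final algorithm.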
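For comparison, a sketch of GridSearchCV wrapped directly around one estimator. The only visible difference is that the grid keys are plain parameter names, with no step-name prefix. The dataset and the grid values are again assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)  # stand-in data (assumption)

# No Pipeline here, so the key is just "min_samples_split".
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"min_samples_split": [2, 5, 10]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```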

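To make data processing convenient, sklearn.pipeline offers two modes: serial and parallel.

1. Serial, implemented by the Pipeline class

The processing flow is set with the steps parameter. Each step is a ('key', value) tuple, where key is the name you give the step and value is the corresponding estimator instance; the steps are passed in as a list. Every one of the first n-1 steps must implement transform; for the last step it is optional, and the last step is usually the model. The pipeline exposes the methods of its final step.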


In [42]: from sklearn.pipeline import Pipeline
...: from sklearn.svm import SVC
...: from sklearn.decomposition import PCA
...: pipe=Pipeline(steps=[('pca',PCA()),('svc',SVC())])
...:
...: from sklearn.datasets import load_iris
...: iris=load_iris()
...: pipe.fit(iris.data,iris.target)
...:
Out[42]:
Pipeline(steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None,
random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('svc', SVC(C=1.0, cache_size=200,
class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))])

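Training yields a model that can be used directly for prediction. At prediction time the data passes through the transformations starting from step 1, so no extra code is needed to transform the data being predicted. pipe.score(X, Y) returns the model's accuracy on the data X (here, the training set).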


In [46]: pipe.predict(iris.data)
Out[46]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [47]: iris.target
Out[47]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
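To show what the pipeline does on my behalf at prediction time, this sketch repeats the prediction by hand: it pulls the fitted steps out through named_steps, applies the PCA transform, and then calls the SVC. The variable names pipe_pred, manual_pred, same, and accuracy are mine.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
pipe = Pipeline(steps=[("pca", PCA()), ("svc", SVC())])
pipe.fit(iris.data, iris.target)

# One call: the pipeline transforms with PCA, then predicts with SVC.
pipe_pred = pipe.predict(iris.data)

# The same thing step by step, using the fitted components.
pca = pipe.named_steps["pca"]
svc = pipe.named_steps["svc"]
manual_pred = svc.predict(pca.transform(iris.data))

same = np.array_equal(pipe_pred, manual_pred)

# score() on a classifier pipeline is the mean accuracy on the given data.
accuracy = pipe.score(iris.data, iris.target)
print(same, accuracy)
```

Note that this accuracy is measured on the training data, so it is optimistic; a held-out set or cross-validation gives a fairer number.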

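The make_pipeline function is a shorthand for the Pipeline constructor: you pass only the instance for each step, without naming it, and each step is automatically named after its class in lowercase.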


In [50]: make_pipeline(StandardScaler(),GaussianNB())
Out[50]: Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=
True, with_std=True)), ('gaussiannb', GaussianNB(priors=None))])
In [51]: p=make_pipeline(StandardScaler(),GaussianNB())
In [52]: p.steps
Out[52]:
[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
('gaussiannb', GaussianNB(priors=None))]

Parameters of each step can be reset with set_params, using the form <step name>__<parameter name>=value.


In [59]: p.set_params(standardscaler__with_mean=False)
Out[59]: Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=
False, with_std=True)), ('gaussiannb', GaussianNB(priors=None))])
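The two sessions above omit their imports, so here is a self-contained version that can be pasted and run. It also reads the value back with get_params, which uses the same double-underscore keys; the variable names are mine.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

p = make_pipeline(StandardScaler(), GaussianNB())

# Step names are generated from the lowercased class names.
step_names = [name for name, _ in p.steps]

# <step name>__<parameter name> addresses a parameter inside a step.
p.set_params(standardscaler__with_mean=False)

with_mean = p.get_params()["standardscaler__with_mean"]
on_step = p.named_steps["standardscaler"].with_mean
print(step_names, with_mean, on_step)
```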

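2. Parallel, implemented by FeatureUnion
FeatureUnion is likewise configured with (key, value) pairs, and its parameters are set with set_params. The difference is that each step is computed separately and FeatureUnion then concatenates the results; it returns an array and does not expose the methods of a final estimator. When some data needs to be standardized, some log-transformed, and some one-hot encoded, producing several groups of features from which the important ones are then selected, FeatureUnion comes in very handy.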
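A minimal sketch of FeatureUnion follows. The choice of transformers (PCA plus SelectKBest) and the iris dataset are my assumptions for illustration, not the example from the original post. Each transformer sees the full input, and the outputs are stacked side by side.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

union = FeatureUnion([
    ("pca", PCA(n_components=2)),      # contributes 2 columns
    ("select", SelectKBest(k=1)),      # contributes 1 column
])

# Each transformer is fitted on X independently; the results are concatenated column-wise.
combined = union.fit_transform(X, y)
print(combined.shape)

# set_params uses the same <name>__<parameter> convention as Pipeline.
union.set_params(pca__n_components=3)
wider = union.fit_transform(X, y)
print(wider.shape)

# FeatureUnion is a transformer only; it has no predict method.
has_predict = hasattr(union, "predict")
```

Because the result is just a feature array, a FeatureUnion is typically placed as a step inside a Pipeline, with the model as the final step.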
