[Study Notes] Ensemble Learning (5): Bagging
Datawhale Group Study, Session 27: Ensemble Learning
This round of study follows the instructional videos of the tutor 萌弟.
This post is a study log; questions and discussion are welcome at any time.
Some parts may still be incomplete and will be filled in gradually as my knowledge accumulates.
Started: July 22, 2021
Last updated: July 25, 2021 (Task 5: Bagging)
I. Bootstrap Sampling
1. Brief Description
- Sampling with replacement (bootstrap sampling): from a given sample, draw n observations to form a new sample set, repeat this m times to obtain m parameter estimates, and use them to estimate the variance of the estimate.
2. Meaning
- Suppose a population $A$ follows an unknown distribution $F$, and $X=\{x_1, x_2, \dots, x_n\}$ is a sample from $A$; the empirical distribution $\hat F$ of the sample $X$ is an estimate of $F$. Here $\phi$ denotes a numerical characteristic of the distribution $F$, and $\psi$ denotes a numerical characteristic of the sampling distribution of the statistic $\hat\phi$.
- The goal is to estimate $\phi$ with the statistic $\hat\phi = g(X)$. Estimating $\psi$ by random simulation, and thereby obtaining a numerical characteristic of the statistic (for example, using $\psi = \sqrt{Var(\bar X)}$ to compute the standard error of $\bar X$), is what is called the bootstrap method.
3. Steps
- Draw from $\hat F$, with replacement, $B$ independent samples $Y^{(b)}$ of size $n$; each $Y^{(b)}$ is called a bootstrap sample, where $b = 1, \dots, B$.
- For each bootstrap sample $Y^{(b)}$, apply the usual estimation method to compute $\hat\phi^{(b)}$ as an estimate of the numerical characteristic $\phi$ of the population distribution $F$; this yields $B$ values of the statistic.
- Each $\hat\phi^{(b)}$ in $\hat\phi = \{\hat\phi^{(1)}, \dots, \hat\phi^{(B)}\}$ is an i.i.d. sample under $\hat F$, so the usual estimation method can again be used to compute $\hat\psi$ as an estimate of the distributional characteristic $\psi$ of $\hat F$. A small simulation sketch is given below.
- For a more thorough treatment, see Li Dongfeng's《统计计算》(Statistical Computing).
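A minimal sketch of the steps above, assuming the numerical characteristic of interest is the standard error of the sample mean; the exponential population, the sample size, and B are made-up values purely for illustration.
import numpy as np

rng = np.random.default_rng(2021)
x = rng.exponential(scale=2.0, size=100)             # the original sample X (the population F is unknown)

B = 1000                                             # number of bootstrap samples
boot_means = np.empty(B)
for b in range(B):
    y_b = rng.choice(x, size=len(x), replace=True)   # bootstrap sample Y^(b), drawn from F_hat
    boot_means[b] = y_b.mean()                       # phi_hat^(b): the estimate from the b-th sample

se_boot = boot_means.std(ddof=1)                     # psi_hat: bootstrap estimate of the standard error of x_bar
print('bootstrap SE of the mean:', se_boot)
print('classical SE estimate   :', x.std(ddof=1) / np.sqrt(len(x)))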
II. The Idea Behind Bagging
1. Where Bootstrap comes in
Draw new samples from the dataset repeatedly and with replacement, with each observation selected with uniform probability; train a model (sub-model) on each new sample, and collect the predictions of the multiple sub-models.
2. Where Aggregating comes in
- The overall model is usually a linear combination of the sub-models.
- For classification problems, voting is typically used; it comes in two flavors, hard voting and soft voting (a small sketch follows this list).
  - Hard voting: predict the class that appears most often among the sub-models' predicted labels.
  - Soft voting: predict the class whose (probability-weighted) average predicted probability over all sub-models is largest.
  - Soft voting is generally better than hard voting, but it depends on the use case. Soft voting requires probability-based models that can output class probabilities; models such as association-rule or hierarchical-clustering models cannot use soft voting.
- For regression problems, the usual choice is simply the average of all sub-models' predictions.
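A minimal numeric sketch of the two voting rules; the three sub-models and their class probabilities below are hypothetical values made up for illustration.
import numpy as np

# Hypothetical predicted probabilities of 3 sub-models over 3 classes for one sample.
proba = np.array([[0.4, 0.5, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.3, 0.4, 0.3]])

# Hard voting: each sub-model casts one vote for its most probable class.
labels = proba.argmax(axis=1)                          # predicted labels: [1, 0, 1]
hard_vote = np.bincount(labels, minlength=3).argmax()  # majority class

# Soft voting: average the probabilities across sub-models, then take the argmax.
soft_vote = proba.mean(axis=0).argmax()

print('hard voting ->', hard_vote)   # class 1 (two of three votes)
print('soft voting ->', soft_vote)   # class 0 (largest average probability)
In sklearn the same two rules correspond to voting='hard' and voting='soft' in sklearn.ensemble.VotingClassifier.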
3. Implementation in sklearn
- Classes to call: sklearn.ensemble.BaggingClassifier and sklearn.ensemble.BaggingRegressor (a minimal usage sketch follows this list)
- Random forest: sklearn.ensemble.RandomForestClassifier and sklearn.ensemble.RandomForestRegressor
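A minimal usage sketch of BaggingClassifier; the iris toy dataset, the decision-tree base estimator, and the numbers here are assumptions chosen only for illustration.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 10 decision trees, each trained on a bootstrap sample, aggregated by voting.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)
bag.fit(X_tr, y_tr)
print('test accuracy:', bag.score(X_te, y_te))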
III. Accuracy Analysis of Bagging
1. Derivation
- The derivation mainly follows the article listed in the references.
- Let the base model be $f$; bootstrap sampling from the given sample $X$ is used to train $m$ base models.
- The final model is a linear combination of the $m$ base models, written $F = \sum\limits_{i=1}^{m} r_i f_i$, where $r_i$ is the weight of the $i$-th model.
- Since each $f_i$ is trained on samples drawn from the population $X$, we write $E(f_i) = \mu$ and $Var(f_i) = \sigma^2$.
$$
\begin{aligned}
E(F) &= E\Big(\sum_{i=1}^{m} r_i f_i\Big) = \sum_{i=1}^{m} r_i E(f_i) \\
Var(F) &= Var\Big(\sum_{i=1}^{m} r_i f_i\Big) \\
&= \sum_{i=1}^{m} Var(r_i f_i) + \sum_{i \ne j}^{m} cov(r_i f_i,\, r_j f_j) \\
&= \sum_{i=1}^{m} r_i^2 Var(f_i) + \sum_{i \ne j}^{m} \rho_{ij} r_i r_j \sqrt{Var(f_i)}\sqrt{Var(f_j)}
\end{aligned}
$$
- From the bootstrap sampling steps, the numerical characteristics of $F$ reflect those of the original data, and the samples received by each $f_i$ are independent, so the covariance terms vanish ($\rho_{ij} = 0$). This gives:
$$
\begin{aligned}
E(F) &= \mu \sum_{i=1}^{m} r_i \\
Var(F) &= \sigma^2 \sum_{i=1}^{m} r_i^2
\end{aligned}
$$
2. Analysis via the Bias-Variance Decomposition
- From the bias-variance decomposition, the test error is driven mainly by the variance $Var$ and the squared bias $Bias^2$; the noise term $Var(\epsilon)$ depends on the task itself and is left out of the discussion. For this model, $E(F)$ reflects the $Bias$ and $Var(F)$ reflects the $Var$.
$$
\begin{aligned}
\Big[\frac{E(F)}{E(f)}\Big]^2 &= \Big(\sum_{i=1}^{m} r_i\Big)^2 \\
\frac{Var(F)}{Var(f)} &= \sum_{i=1}^{m} r_i^2 \\
\big[E(F) - E(f)\big]^2 &= \Big(1 - \sum_{i=1}^{m} r_i\Big)^2 \mu^2 \\
Var(f) - Var(F) &= \Big(1 - \sum_{i=1}^{m} r_i^2\Big)\sigma^2
\end{aligned}
$$
- Compared with the base model, Bagging lowers the variance but may also increase the bias. In practice the weights in a Bagging model satisfy $\sum\limits_{i=1}^{m} r_i \approx 1$, so the $Bias$ of the Bagging model stays very close to that of the base model; the reduction in $Var$ therefore outweighs any increase in $Bias$, and the test error decreases (i.e., a little bias is traded away to reduce the test error).
3. Example
- Suppose the Bagging model sets every base-model weight $r_i$ to $\frac{1}{m}$. Then:
$$
\begin{aligned}
E(F) &= \mu \sum_{i=1}^{m}\frac{1}{m} = \mu \\
Var(F) &= \sigma^2 \sum_{i=1}^{m}\frac{1}{m^2} = \frac{1}{m}\sigma^2
\end{aligned}
$$
- Clearly, the $Bias$ of the Bagging model is then determined by the base model's $Bias$, while its $Var$ is far smaller than the base model's, so Bagging generalizes better. A tiny simulation of this case is sketched below.
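A tiny simulation of the $r_i = \frac{1}{m}$ case (the normal sub-model distribution and the values of $\mu$, $\sigma$, $m$ are assumptions for illustration): averaging $m$ independent sub-model predictions leaves the mean essentially unchanged while shrinking the variance by roughly $\frac{1}{m}$.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m = 3.0, 2.0, 10          # E(f_i), sd of each sub-model, number of sub-models

# Simulate 100000 realizations of m independent sub-model predictions f_i ~ (mu, sigma^2).
f = rng.normal(mu, sigma, size=(100_000, m))
F = f.mean(axis=1)                   # Bagging prediction with r_i = 1/m

print('E(F)   ~', F.mean())          # close to mu: the bias is unchanged
print('Var(F) ~', F.var())           # close to sigma^2 / m = 0.4: variance shrinks by 1/m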
4. Choosing the Base Model
- The derivation above shows that Bagging changes the $Bias$ very little, which means Bagging depends heavily on the base model in that respect. Bagging therefore uses strong base models (low $Bias$, high $Var$) and improves accuracy by reducing $Var$ through the linear combination of base models; a quick empirical check of this is sketched below.
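One quick way to check this tendency is to bag a high-variance base model (a fully grown tree) and a high-bias one (a depth-1 stump) and compare cross-validated scores; a rough sketch on a synthetic dataset follows (the dataset, sizes, and seeds are assumptions, and the exact numbers will vary).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A synthetic classification problem, purely for illustration.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

for name, base in [('deep tree (low bias, high variance)', DecisionTreeClassifier(random_state=0)),
                   ('stump (high bias, low variance)', DecisionTreeClassifier(max_depth=1, random_state=0))]:
    single = cross_val_score(base, X, y, cv=5).mean()
    bagged = cross_val_score(BaggingClassifier(base, n_estimators=50, random_state=0), X, y, cv=5).mean()
    print('%-38s single=%.3f  bagged=%.3f' % (name, single, bagged))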
IV. Applying Bagging
1. Data Source
The digits handwritten-digit dataset is well known; its class labels are the digits 0–9, making it a clear-cut classification problem. Each sample has 64 features, i.e., an 8*8 matrix of pixel values that encodes the corresponding digit label.
- Import the dataset:
from sklearn.datasets import load_digits
data = load_digits()
- Get the variables (note that data here is a Bunch object; see the official docs or the dataset analysis in the referenced article):
# Two ways to get the variables; they are slightly different
x1 = data.images  # shape (1797, 8, 8); the last two dimensions form the pixel matrix, useful for displaying images
x2 = data.data    # shape (1797, 64); the flat numeric matrix commonly used for machine learning
- Plot the images, following the method on the official site, with a further customized display function.
# Define the image display function
def digit_plot(n_rows, n_clos, data_set, image_list):
    # Check that the input is valid
    if n_clos * n_rows != len(image_list):
        print('Grid size does not match the number of images, please re-enter!')
        return
    # Set up the plotting grid
    _, axes = plt.subplots(nrows=n_rows, ncols=n_clos)
    # Only one digit image to draw
    if len(image_list) == 1:
        axes.imshow(data_set.images[image_list[0] - 1], cmap='Greys_r', interpolation='nearest')
        axes.set_title('Training: %i' % data_set.target[image_list[0] - 1])
        plt.show()
        return
    # Draw multiple digit images
    ax_list = []  # initialize the list of axes
    for ax in axes:
        if isinstance(ax, np.ndarray):  # axes may be a 2-D array and needs further unpacking
            for ax_i in ax:
                ax_list.append(ax_i)
        else:
            ax_list.append(ax)
    for ax, image_no in zip(ax_list, image_list):
        ax.set_axis_off()
        ax.imshow(data_set.images[image_no - 1], cmap='Greys_r', interpolation='nearest')
        ax.set_title('Training: %i' % data_set.target[image_no - 1])
    plt.show()
    return
# Call the plotting function: show images 1, 4, 6 and 9 on a 2*2 grid
digit_plot(2, 2, data, [1, 4, 6, 9])
- The resulting images are displayed as follows:
2. Model Performance Metric
- This experiment is a classification problem, so the corresponding classification metrics are used (see the previous post); here the model is evaluated mainly with f1_score (macro).
  - f1_score:
    - The harmonic mean of recall (Recall) and precision (Precision).
    - For multi-class problems, Recall and Precision have to be converted to their macro or micro versions:
$$
F1_{macro} = 2 \times \frac{Precision_{macro} \times Recall_{macro}}{Precision_{macro} + Recall_{macro}}, \qquad
F1_{micro} = 2 \times \frac{Precision_{micro} \times Recall_{micro}}{Precision_{micro} + Recall_{micro}}
$$
- How to call it:
from sklearn.metrics import f1_score
f1_score(y_test, y_pred, average='macro')
3. Variable Selection and Data Splitting
- Split into a training set and a test set
  - The test set is used to evaluate model performance
- Cross-validation scheme
  - StratifiedShuffleSplit is used for stratified sampling, which guarantees every class has enough samples
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
# Split out the variables
x = pd.DataFrame(data['data'], columns=data['feature_names'])
y = pd.DataFrame(data['target'], columns=['target'])
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2021)
# Set up the cross-validation scheme (used for parameter tuning)
ran = 2021  # set the random seed
np.random.seed(ran)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=ran)
4. Model Selection
- In this experiment we face a multi-class classification problem; the most intuitive choices are K-nearest neighbors or a decision tree. The analysis below is therefore based mainly on KNN and decision trees.
- K-nearest neighbors (KNN)
  - In short: given a test sample, find the k nearest training points and predict the class that appears most often among those k points (hard voting).
  - The algorithmic difficulty lies in searching for the nearest points; a KD-tree is commonly used.
  - Class to call: sklearn.neighbors.KNeighborsClassifier
  - Main hyperparameters:
    - n_neighbors: the value of K, i.e., the number of neighbors
    - weights: how the neighbors are weighted, usually uniform (equal weights) or distance (by distance)
    - p: which $L_p$ distance to use in the distance/weight computation
- Decision tree
  - Partitions the feature space into a series of simple regions by hierarchical, recursive splitting.
  - Common split criteria: the Gini index and cross-entropy.
  - The algorithmic difficulty lies in growing and pruning the tree.
  - Class to call: sklearn.tree.DecisionTreeClassifier
  - Main hyperparameters:
    - criterion: the impurity measure used when splitting on a feature, e.g. gini or entropy
    - max_depth: the maximum depth of the tree
    - min_samples_leaf: the minimum number of samples required at a leaf node
    - min_samples_split: the minimum number of samples required to split an internal node
# Model selection (KNN)
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
print('KNN score:', f1_score(y_test, knn.predict(x_test), average='macro'))
# Model selection (DT)
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
print('Decision tree score:', f1_score(y_test, dt.predict(x_test), average='macro'))
5. Hyperparameter Optimization
- The models above were trained only with default hyperparameters, and different hyperparameters affect performance. To compare their effect, the training set is further split into training and validation folds and tuning is done with cross-validation. The tuning and coding approach follow Article 1 and Article 2 in the references:
  - For each single parameter, start with a wide range and a large step, then gradually shrink the step and the range.
  - Use GridSearch to search the narrowed range for the optimal parameters (see the earlier post).
- Define a function for visualizing the relationship between a parameter and the accuracy (a similar approach was used in Notes 3).
# Define a tuning-visualization function (used to narrow the search range)
def param_plot(estimator, x_, y_, cv_, target_param_=None, params=None):
    score_all_ = []
    for key, values in target_param_.items():
        for value in values:
            tg = {key: value}
            if params:
                # md_ = estimator(**params, **tg, random_state=rand_seed)
                md_ = estimator(**params, **tg)
            else:
                # md_ = estimator(**tg, random_state=rand_seed)
                md_ = estimator(**tg)
            score_ = cross_val_score(md_, x_, y_, cv=cv_).mean()
            score_all_.append([value, score_])
    score_all_ = np.array(score_all_)
    max_ind_ = np.where(score_all_ == np.max(score_all_[:, 1]))[0][0]
    max_param_ = score_all_[max_ind_][0]
    plt.plot(score_all_[:, 0], score_all_[:, 1])
    plt.axvline(max_param_, linestyle='--', color='k')
    plt.show()
    print('Best parameter and score:', score_all_[max_ind_])
    return score_all_[max_ind_]
- KNN tuning
  - Wide-range search
# Searching the KNN hyperparameter ranges
param1 = {'n_neighbors': [i for i in range(1, 20)]}  # find the best range for k
other_param1 = {'weights': 'distance', 'p': 2}
best_parma1 = param_plot(KNeighborsClassifier, x_train, y_train.values.reshape(-1, ),
                         cv_=cv, target_param_=param1, params=other_param1)
param2 = {'p': [i for i in range(1, 10)]}  # find the best range for p
other_param2 = {'weights': 'distance', 'n_neighbors': 1}
best_parma2 = param_plot(KNeighborsClassifier, x_train, y_train.values.reshape(-1, ),
                         cv_=cv, target_param_=param2, params=other_param2)
- The plots show that the optimal n_neighbors lies between 1 and 3, and the optimal p also lies between 1 and 3, so GridSearch can be used for the hyperparameter optimization.
  - GridSearchCV tuning
# KNN-GridSearch
grid_param = {'n_neighbors': [i for i in range(1, 5)],
              'weights': ['distance'],
              'p': [i for i in range(1, 3)]}
md1 = GridSearchCV(KNeighborsClassifier(), param_grid=grid_param, cv=cv)
md1.fit(x_train, y_train.values.reshape(-1,))
print("Best parameters:", md1.best_params_)
print("KNN score:", f1_score(y_test, md1.predict(x_test), average='macro'))
- Decision tree tuning
  - Wide-range search
# max_depth
param1 = {'max_depth': [i for i in range(10, 501, 10)]}
other_param1 = {'criterion': 'gini', 'random_state': ran}
param_plot(DecisionTreeClassifier, x_train, y_train.values.reshape(-1, ),
           cv_=cv, target_param_=param1, params=other_param1)
param2 = {'max_depth': [i for i in range(5, 30)]}
other_param2 = {'criterion': 'gini', 'random_state': ran}
param_plot(DecisionTreeClassifier, x_train, y_train.values.reshape(-1, ),
           cv_=cv, target_param_=param2, params=other_param2)
# min_samples_split
param3 = {'min_samples_split': [i for i in range(5, 101, 5)]}
other_param3 = {'criterion': 'gini', 'random_state': ran, 'max_depth': 13}
param_plot(DecisionTreeClassifier, x_train, y_train.values.reshape(-1, ),
           cv_=cv, target_param_=param3, params=other_param3)
param4 = {'min_samples_split': [i for i in range(2, 20)]}
other_param4 = {'criterion': 'gini', 'random_state': ran, 'max_depth': 13}
param_plot(DecisionTreeClassifier, x_train, y_train.values.reshape(-1, ),
           cv_=cv, target_param_=param4, params=other_param4)
# min_samples_leaf
param5 = {'min_samples_leaf': [i for i in range(2, 40)]}
other_param5 = {'criterion': 'gini', 'random_state': ran, 'max_depth': 13, 'min_samples_split': 2}
param_plot(DecisionTreeClassifier, x_train, y_train.values.reshape(-1, ),
           cv_=cv, target_param_=param5, params=other_param5)
  - After the narrow ranges are determined, use GridSearchCV()
# GridSearch
grid_param2 = {'max_depth': [i for i in range(10, 16)],
               'min_samples_split': [i for i in range(2, 8)],
               'min_samples_leaf': [i for i in range(2, 10)]}
md2 = GridSearchCV(DecisionTreeClassifier(criterion='gini', random_state=ran),
                   param_grid=grid_param2, cv=cv)
md2.fit(x_train, y_train.values.reshape(-1, ))
print("Best parameters:", md2.best_params_)
print("Decision tree score:", f1_score(y_test, md2.predict(x_test), average='macro'))
6. Bagging the Models
- For classification, the ensemble is a linear combination of classifiers: the predictions of the multiple classifiers are combined by voting.
- Class to call: sklearn.ensemble.BaggingClassifier
- KNN-Bagging:
  - Search for the rough range of the optimal n_estimators
# KNN-Bagging
param1 = {'n_estimators': [i for i in range(10, 100, 10)]}
other_param1 = {'base_estimator': KNeighborsClassifier(**md1.best_params_),
                'random_state': ran, 'n_jobs': -1}
param_plot(BaggingClassifier, x_train, y_train.values.reshape(-1, ),
           cv_=cv, target_param_=param1, params=other_param1)
param2 = {'n_estimators': [i for i in range(10, 50, 1)]}
other_param2 = {'base_estimator': KNeighborsClassifier(**md1.best_params_),
                'random_state': ran, 'n_jobs': -1}
param_plot(BaggingClassifier, x_train, y_train.values.reshape(-1, ),
           cv_=cv, target_param_=param2, params=other_param2)
  - GridSearchCV() search
knn_bagging = BaggingClassifier(KNeighborsClassifier(**md1.best_params_),
                                random_state=ran, n_jobs=-1)
grid_param3 = {'n_estimators': [i for i in range(25, 30)]}
md3 = GridSearchCV(knn_bagging, param_grid=grid_param3, cv=cv, n_jobs=-1)
md3.fit(x_train, y_train.values.reshape(-1, ))
print("Best parameters:", md3.best_params_)
print("KNN-Bagging score:", f1_score(y_test, md3.predict(x_test), average='macro'))
- Decision-tree-Bagging:
  - Search for the rough range of the optimal n_estimators
param1 = {'n_estimators': [i for i in range(10, 100, 10)]}
other_param1 = {'base_estimator': DecisionTreeClassifier(**md2.best_params_,
                                                         criterion='gini', random_state=ran),
                'random_state': ran, 'n_jobs': -1}
param_plot(BaggingClassifier, x_train, y_train.values.reshape(-1, ),
           cv_=cv, target_param_=param1, params=other_param1)
param2 = {'n_estimators': [i for i in range(20, 41)]}
other_param2 = {'base_estimator': DecisionTreeClassifier(**md2.best_params_,
                                                         criterion='gini', random_state=ran),
                'random_state': ran, 'n_jobs': -1}
param_plot(BaggingClassifier, x_train, y_train.values.reshape(-1, ),
           cv_=cv, target_param_=param2, params=other_param2)
  - GridSearchCV() search
dt_bagging = BaggingClassifier(DecisionTreeClassifier(**md2.best_params_,
                                                      criterion='gini', random_state=ran),
                               random_state=ran, n_jobs=-1)
grid_param4 = {'n_estimators': [i for i in range(25, 51)]}
md4 = GridSearchCV(dt_bagging, param_grid=grid_param4, cv=cv)
md4.fit(x_train, y_train.values.reshape(-1, ))
print("Best parameters:", md4.best_params_)
print("Decision-tree-Bagging score:", f1_score(y_test, md4.predict(x_test), average='macro'))
- Random forest: a special form of Bagging over decision trees (each split additionally considers only a random subset of the features).
# Random forest
param1 = {'n_estimators': [i for i in range(10, 300, 10)]}
other_param1 = {**md2.best_params_, 'random_state': ran, 'n_jobs': -1}
param_plot(RandomForestClassifier, x_train, y_train.values.reshape(-1, ),
           cv_=cv, target_param_=param1, params=other_param1)
random_forest = RandomForestClassifier(**md2.best_params_, random_state=ran, n_jobs=-1)
grid_param5 = {'n_estimators': [i for i in range(50, 101)]}
md5 = GridSearchCV(random_forest, param_grid=grid_param5, cv=cv, n_jobs=-1)
md5.fit(x_train, y_train.values.reshape(-1, ))
print("Best parameters:", md5.best_params_)
print("Random forest score:", f1_score(y_test, md5.predict(x_test), average='macro'))
7. Model Comparison
- Judging by the final scores, Bagging effectively improves model accuracy relative to the base models.
# Final model evaluation
print("%-15s\t:%-20s" % ("KNN score", f1_score(y_test, md1.predict(x_test), average='macro')))
print("%-15s\t:%-20s" % ("Decision tree score", f1_score(y_test, md2.predict(x_test), average='macro')))
print("%-15s\t:%-20s" % ("KNN-Bagging score", f1_score(y_test, md3.predict(x_test), average='macro')))
print("%-15s\t:%-20s" % ("Decision-tree-Bagging score", f1_score(y_test, md4.predict(x_test), average='macro')))
print("%-15s\t:%-20s" % ("Random forest score", f1_score(y_test, md5.predict(x_test), average='macro')))
V. References
- https://github.com/datawhalechina/ensemble-learning
- https://www.bilibili.com/video/BV1Mb4y1o7ck?t=470
- https://www.bilibili.com/video/BV1X64y1m71o?t=2972
- https://zhuanlan.zhihu.com/p/86263786
- https://www.math.pku.edu.cn/teachers/lidf/docs/statcomp/html/_statcompbook/sim-bootstrap.html
- https://zhuanlan.zhihu.com/p/103136609
- https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html#sphx-glr-auto-examples-classification-plot-digits-classification-py
- https://www.cnblogs.com/lyr999736/p/10665572.html
VI. Full Code for This Post
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
import os
# Set the project path and the random seed
os.chdir('D:/NewProject/datawhale/ensemble/task5')
np.random.seed(2021)
# Load the dataset
data = load_digits()
data.keys()
# Get the digit images
x1 = data.images
x2 = data.data
# Define the image display function
def digit_plot(n_rows, n_clos, data_set, image_list):
    # Check that the input is valid
if n_clos * n_rows != len(image_list):
        print('Grid size does not match the number of images, please re-enter!')
return
    # Set up the plotting grid
_, axes = plt.subplots(nrows=n_rows, ncols=n_clos)
    # Only one digit image to draw
if len(image_list) == 1:
axes.imshow(data_set.images[image_list[0] - 1],
cmap='Greys_r', interpolation='nearest')
axes.set_title('Training: %i' % data_set.target[image_list[0] - 1])
plt.show()
return
    # Draw multiple digit images
    ax_list = []  # initialize the list of axes
for ax in axes:
        if isinstance(ax, np.ndarray):  # axes may be a 2-D array and needs further unpacking
for ax_i in ax:
ax_list.append(ax_i)
else:
ax_list.append(ax)
for ax, image_no in zip(ax_list, image_list):
ax.set_axis_off()
ax.imshow(data_set.images[image_no - 1],
cmap='Greys_r', interpolation='nearest')
ax.set_title('Training: %i' % data_set.target[image_no - 1])
plt.show()
return
# View the digit images
digit_plot(2, 2, data, [1, 4, 6, 9])
# Split out the variables
x = pd.DataFrame(data['data'], columns=data['feature_names'])
y = pd.DataFrame(data['target'], columns=['target'])
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2021, shuffle=True)
# Set the cross-validation scheme and the random seed
ran = 2021
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=ran)
# Model selection (KNN)
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
print('KNN score:', f1_score(y_test, knn.predict(x_test), average='macro'))
# Model selection (DT)
dt = DecisionTreeClassifier(random_state=ran)
dt.fit(x_train, y_train)
print('Decision tree score:', f1_score(y_test, dt.predict(x_test), average='macro'))
# Define a tuning-visualization function (used to narrow the search range)
def param_plot(estimator, x_, y_, cv_, target_param_=None, params=None):
score_all_ = []
for key, values in target_param_.items():
for value in values:
tg = {
key: value}
if params:
# md_ = estimator(**params, **tg, random_state=rand_seed)
md_ = estimator(**params, **tg)
else:
# md_ = estimator(**tg, random_state=rand_seed)
md_ = estimator(**tg)
score_ = cross_val_score(md_, x_, y_, cv=cv_).mean()
score_all_.append([value, score_])
score_all_ = np.array(score_all_)
max_ind_ = np.where(score_all_ == np.max(score_all_[:, 1]))[0][0]
max_param_ = score_all_[max_ind_][0]
plt.plot(score_all_[:, 0], score_all_[:, 1])
plt.axvline(max_param_, linestyle='--', color='k')
plt.title('Parameter Search: {}'.format(list(target_param_.keys())[0]))
plt.show()
    print('Best parameter and score:', score_all_[max_ind_])
return score_all_[max_ind_]
# KNN tuning
param1 = {
    'n_neighbors': [i for i in range(1, 20)]}  # find the best range for k
other_param1 = {
'weights': 'distance', 'p': 2}
best_parma1 = param_plot(KNeighborsClassifier, x_train, y_train.values.reshape(-1, ),
cv_=cv, target_param_=param1, params=other_param1)
param2 = {
    'p': [i for i in range(1, 10)]}  # find the best range for p
other_param2 = {
'weights': 'distance', 'n_neighbors': 1}
best_parma2 = param_plot(KNeighborsClassifier, x_train, y_train.values.reshape(-1, ),
cv_=cv, target_param_=param2, params=other_param2)
# KNN-GridSearch
grid_param1 = {
'n_neighbors': [i for i in range(1, 5)],
'weights': ['distance'],
'p': [i for i in range(1, 3)]}
md1 = GridSearchCV(KNeighborsClassifier(), param_grid=grid_param1, cv=cv)
md1.fit(x_train, y_train.values.reshape(-1, ))
print("Best parameters:", md1.best_params_)
print("KNN score:", f1_score(y_test, md1.predict(x_test), average='macro'))
# Decision tree tuning
# max_depth
param1 = {
'max_depth': [i for i in range(10, 501, 10)]}
other_param1 = {
'criterion': 'gini', 'random_state': ran}
param_plot(DecisionTreeClassifier, x_train, y_train.values.reshape(-1, ),
cv_=cv, target_param_=param1, params=other_param1)
param2 = {
'max_depth': [i for i in range(5, 30)]}
other_param2 = {
'criterion': 'gini', 'random_state': ran}
param_plot(DecisionTreeClassifier, x_train, y_train.values.reshape(-1, ),
cv_=cv, target_param_=param2, params=other_param2)
# min_samples_split
param3 = {
'min_samples_split': [i for i in range(5, 101, 5)]}
other_param3 = {
'criterion': 'gini', 'random_state': ran, 'max_depth': 13}
param_plot(DecisionTreeClassifier, x_train, y_train.values.reshape(-1, ),
cv_=cv, target_param_=param3, params=other_param3)
param4 = {
'min_samples_split': [i for i in range(2, 20)]}
other_param4 = {
'criterion': 'gini', 'random_state': ran, 'max_depth': 13}
param_plot(DecisionTreeClassifier, x_train, y_train.values.reshape(-1, ),
cv_=cv, target_param_=param4, params=other_param4)
# min_samples_leaf
param5 = {
'min_samples_leaf': [i for i in range(2, 40)]}
other_param5 = {
'criterion': 'gini', 'random_state': ran,
'max_depth': 13, 'min_samples_split': 2}
param_plot(DecisionTreeClassifier, x_train, y_train.values.reshape(-1, ),
cv_=cv, target_param_=param5, params=other_param5)
# GridSearch
grid_param2 = {
'max_depth': [i for i in range(10, 16)],
'min_samples_split': [i for i in range(2, 8)],
'min_samples_leaf': [i for i in range(2, 10)]}
md2 = GridSearchCV(DecisionTreeClassifier(criterion='gini', random_state=ran),
param_grid=grid_param2, cv=cv)
md2.fit(x_train, y_train.values.reshape(-1, ))
print("Best parameters:", md2.best_params_)
print("Decision tree score:", f1_score(y_test, md2.predict(x_test), average='macro'))
# KNN-Bagging
param1 = {
'n_estimators': [i for i in range(10, 100, 10)]}
other_param1 = {
'base_estimator': KNeighborsClassifier(**md1.best_params_),
'random_state': ran,
'n_jobs': -1}
param_plot(BaggingClassifier, x_train, y_train.values.reshape(-1, ),
cv_=cv, target_param_=param1, params=other_param1)
param2 = {
'n_estimators': [i for i in range(10, 50, 1)]}
other_param2 = {
'base_estimator': KNeighborsClassifier(**md1.best_params_),
'random_state': ran,
'n_jobs': -1}
param_plot(BaggingClassifier, x_train, y_train.values.reshape(-1, ),
cv_=cv, target_param_=param2, params=other_param2)
knn_bagging = BaggingClassifier(KNeighborsClassifier(**md1.best_params_),
random_state=ran, n_jobs=-1)
grid_param3 = {
'n_estimators': [i for i in range(25, 30)]}
md3 = GridSearchCV(knn_bagging, param_grid=grid_param3, cv=cv, n_jobs=-1)
md3.fit(x_train, y_train.values.reshape(-1, ))
print("Best parameters:", md3.best_params_)
print("KNN-Bagging score:", f1_score(y_test, md3.predict(x_test), average='macro'))
# DT-Bagging
param1 = {
'n_estimators': [i for i in range(10, 100, 10)]}
other_param1 = {
'base_estimator': DecisionTreeClassifier(**md2.best_params_,
criterion='gini', random_state=ran),
'random_state': ran,
'n_jobs': -1}
param_plot(BaggingClassifier, x_train, y_train.values.reshape(-1, ),
cv_=cv, target_param_=param1, params=other_param1)
param2 = {
'n_estimators': [i for i in range(20, 41)]}
other_param2 = {
'base_estimator': DecisionTreeClassifier(**md2.best_params_,
criterion='gini', random_state=ran),
'random_state': ran,
'n_jobs': -1}
param_plot(BaggingClassifier, x_train, y_train.values.reshape(-1, ),
cv_=cv, target_param_=param2, params=other_param2)
dt_bagging = BaggingClassifier(DecisionTreeClassifier(**md2.best_params_,
criterion='gini', random_state=ran),
random_state=ran, n_jobs=-1)
grid_param4 = {
'n_estimators': [i for i in range(25, 51)]}
md4 = GridSearchCV(dt_bagging, param_grid=grid_param4, cv=cv, n_jobs=-1)
md4.fit(x_train, y_train.values.reshape(-1, ))
print("Best parameters:", md4.best_params_)
print("Decision-tree-Bagging score:", f1_score(y_test, md4.predict(x_test), average='macro'))
# Random forest
param1 = {
'n_estimators': [i for i in range(10, 300, 10)]}
other_param1 = {
**md2.best_params_,
'random_state': ran,
'n_jobs': -1}
param_plot(RandomForestClassifier, x_train, y_train.values.reshape(-1, ),
cv_=cv, target_param_=param1, params=other_param1)
random_forest = RandomForestClassifier(**md2.best_params_, random_state=ran, n_jobs=-1)
grid_param5 = {
'n_estimators': [i for i in range(50, 101)]}
md5 = GridSearchCV(random_forest, param_grid=grid_param5, cv=cv, n_jobs=-1)
md5.fit(x_train, y_train.values.reshape(-1, ))
print("Best parameters:", md5.best_params_)
print("Random forest score:", f1_score(y_test, md5.predict(x_test), average='macro'))
# Final model evaluation
print("%-15s\t:%-20s" % ("KNN score", f1_score(y_test, md1.predict(x_test), average='macro')))
print("%-15s\t:%-20s" % ("Decision tree score", f1_score(y_test, md2.predict(x_test), average='macro')))
print("%-15s\t:%-20s" % ("KNN-Bagging score", f1_score(y_test, md3.predict(x_test), average='macro')))
print("%-15s\t:%-20s" % ("Decision-tree-Bagging score", f1_score(y_test, md4.predict(x_test), average='macro')))
print("%-15s\t:%-20s" % ("Random forest score", f1_score(y_test, md5.predict(x_test), average='macro')))