Grid Search and Cross Validation


1. Background

We first introduce grid search and cross validation, and then put them into practice on a loan dataset, with whether a user's loan becomes overdue as the target, using grid search and cross validation to improve model accuracy.
For the feature engineering, see the earlier article on feature engineering for loan-overdue prediction; this article works directly with the processed data.

2. Grid Search and Cross Validation

2.1 What Is Grid Search

Grid search is a method that optimizes model performance by exhaustively trying every combination of a given set of hyperparameter values.

2.2 Why Grid Search

Take a model with two hyperparameters: parameter a has 3 candidate values and parameter b has 4. Listing every combination gives a 3×4 table in which each cell is one grid point, and the search visits every cell in turn, hence the name Grid Search.

[Figure 1: a 3×4 parameter grid (image from the web)]
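As a minimal sketch, enumerating such a grid is nothing more than taking the Cartesian product of the candidate lists (the candidate values below are made up purely for illustration):

from itertools import product

a_candidates = [0.1, 1, 10]    # 3 candidate values for parameter a
b_candidates = [1, 2, 3, 4]    # 4 candidate values for parameter b
for a, b in product(a_candidates, b_candidates):
    print("a={}, b={}".format(a, b))  # 3 * 4 = 12 cells visited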

To use grid search, we split the training set once more into a training set and a validation set. The original data then ends up in three parts: a training set, a validation set, and a test set. The training set is used to fit the model, the validation set to tune the hyperparameters, and the test set to measure how well the final model performs.

[Figure 2: train / validation / test split (image from the web)]

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = load_iris()
# first hold out a test set, then split the remainder into training and validation sets
X_trainval,X_test,y_trainval,y_test = train_test_split(iris.data,iris.target,random_state=0)
X_train,X_val,y_train,y_val = train_test_split(X_trainval,y_trainval,random_state=1)
print("Size of training set:{} size of validation set:{} size of testing set:{}".format(
    X_train.shape[0],X_val.shape[0],X_test.shape[0]))

####   grid search start
best_score = 0.0
# try every combination of gamma and C, keeping the one with the best validation score
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        svm = SVC(gamma=gamma,C=C)
        svm.fit(X_train,y_train)
        score = svm.score(X_val,y_val)
        if score > best_score:
            best_score = score
            best_parameters = {'gamma':gamma,'C':C}
svm = SVC(**best_parameters) # rebuild a model with the best parameters
svm.fit(X_trainval,y_trainval) # retrain on training + validation data; more data usually helps
test_score = svm.score(X_test,y_test) # evaluate on the test set

####   grid search end
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Best score on test set:{:.2f}".format(test_score))

Output

Size of training set:84 size of validation set:28 size of testing set:38
Best score on validation set:0.96
Best parameters:{'gamma': 0.001, 'C': 10}
Best score on test set:0.92

The result of this plain grid search depends heavily on how the initial data happen to be split. To reduce this randomness, we use cross validation.
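To see how much the split alone matters, here is a minimal sketch: the same SVC, scored on differently seeded train/test splits of iris, can land on noticeably different accuracies (the seed values are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
# same model, different splits: the test accuracy moves with the seed
for seed in [0, 1, 2, 3, 4]:
    X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, random_state=seed)
    score = SVC(gamma=0.001, C=10).fit(X_tr, y_tr).score(X_te, y_te)
    print("random_state={}: test accuracy={:.2f}".format(seed, score))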

2.3 Cross Validation

Take K-fold cross validation as an example of how the algorithm works. K-fold cross validation uses all of the data in the training set: we split the training set into K equal folds (K=10 is a common choice), hold out each fold in turn as the validation set, and train on the remaining K-1 folds.

[Figure 3: K-fold cross validation (image from the web)]
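Before the full example, a minimal sketch of the fold mechanics, using K=5 on ten dummy samples just to keep the printout short; KFold hands back index arrays, and every sample lands in the validation fold exactly once:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)
kfold = KFold(n_splits=5)
for i, (train_idx, val_idx) in enumerate(kfold.split(X)):
    print("fold {}: train={} validation={}".format(i, train_idx, val_idx))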
An example of predicting breast cancer with a decision tree:

from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(
    data['data'], data['target'], train_size=0.8, random_state=0)

classifier = DecisionTreeClassifier(random_state=0)
parameters = {'max_depth': range(1, 6)}    # candidate depths to search
scoring_fnc = make_scorer(accuracy_score)  # score each fold by accuracy
kfold = KFold(n_splits=10)                 # 10-fold cross validation

grid = GridSearchCV(classifier, parameters, scoring_fnc, cv=kfold)
grid = grid.fit(X_train, y_train)
clf = grid.best_estimator_   # the estimator refit with the best parameters

print('best score: %f'%grid.best_score_)
print('best parameters:')
for key in parameters.keys():
    print('%s: %d'%(key, clf.get_params()[key]))

print('test score: %f'%clf.score(X_test, y_test))

import pandas as pd
pd.DataFrame(grid.cv_results_).T   # inspect the full cross-validation results
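The transposed frame lists one column per parameter setting, with rows such as mean_test_score, std_test_score, and rank_test_score, which is handy for seeing how stable each setting is across folds.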

2.4 Grid Search with Cross Validation

from sklearn.model_selection import cross_val_score

best_score = 0.0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        svm = SVC(gamma=gamma,C=C)
        scores = cross_val_score(svm,X_trainval,y_trainval,cv=5) # 5-fold cross validation
        score = scores.mean() # average the scores across the folds
        if score > best_score:
            best_score = score
            best_parameters = {"gamma":gamma,"C":C}
svm = SVC(**best_parameters)
svm.fit(X_trainval,y_trainval)
test_score = svm.score(X_test,y_test)
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Score on testing set:{:.2f}".format(test_score))

Output

Best score on validation set:0.97
Best parameters:{'gamma': 0.01, 'C': 100}
Score on testing set:0.97
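Note the cost: the two nested loops above train 6 × 6 = 36 parameter combinations × 5 folds = 180 models before the final refit. The next section shows how sklearn packages this whole pattern into a single class.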

2.5 Tuning with sklearn's GridSearchCV Class

sklearn provides exactly such a class, GridSearchCV, which combines cross validation with grid search.

from sklearn.model_selection import GridSearchCV

# list the parameters to tune and their candidate values
param_grid = {"gamma":[0.001,0.01,0.1,1,10,100],
             "C":[0.001,0.01,0.1,1,10,100]}
print("Parameters:{}".format(param_grid))

grid_search = GridSearchCV(SVC(),param_grid,cv=5) # instantiate a GridSearchCV object
X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,random_state=10)
grid_search.fit(X_train,y_train) # search for the best parameters, then refit a new SVC estimator with them
print("Test set score:{:.2f}".format(grid_search.score(X_test,y_test)))
print("Best parameters:{}".format(grid_search.best_params_))
print("Best score on train set:{:.2f}".format(grid_search.best_score_))

3. Applying Grid Search and Cross Validation to Loan Overdue Prediction

Below we tune each model in turn, using 5-fold cross validation.

# number of cross-validation folds
n_fold = 5
# scoring metric: area under the ROC curve
scoring = 'roc_auc'

3.1 Logistic Regression

from sklearn.linear_model import LogisticRegression

# liblinear supports both the l1 and l2 penalties searched below
lr = LogisticRegression(solver='liblinear')
# list the parameters to tune and their candidate values
param_grid = {'C': [1e-3,0.01,0.1,1,10,100,1e3], 'penalty':['l1', 'l2']}
print("Parameters:{}".format(param_grid))

# instantiate a GridSearchCV object
grid_search = GridSearchCV(lr, param_grid = param_grid,cv = n_fold, scoring = scoring)
# fit: search for the best parameters
grid_search.fit(X_train, y_train)

print("Test set score:{:.2f}".format(grid_search.score(X_test,y_test)))
print("Best parameters:{}".format(grid_search.best_params_))
print("Best score on train set:{:.2f}".format(grid_search.best_score_))
print('Best score on test set:', grid_search.score(X_test, y_test))

Output

Parameters:{'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000.0], 'penalty': ['l1', 'l2']}
Test set score:0.79
Best parameters:{'C': 0.1, 'penalty': 'l1'}
Best score on train set:0.79
Best score on test set: 0.789409606215

3.2 SVM

3.2.1 Linear SVM

from sklearn import svm

svm_linear = svm.SVC(kernel = 'linear', probability=True)
# list the parameters to tune and their candidate values
param_grid = {'C':[0.01,0.1,1]}
print("Parameters:{}".format(param_grid))

# instantiate a GridSearchCV object
grid_search = GridSearchCV(svm_linear, param_grid = param_grid,cv = n_fold, scoring = scoring)
# fit: search for the best parameters
grid_search.fit(X_train, y_train)

print("Test set score:{:.2f}".format(grid_search.score(X_test,y_test)))
print("Best parameters:{}".format(grid_search.best_params_))
print("Best score on train set:{:.2f}".format(grid_search.best_score_))
print('Best score on test set:', grid_search.score(X_test, y_test))

Output

Parameters:{'C': [0.01, 0.1, 1]}
Test set score:0.80
Best parameters:{'C': 0.01}
Best score on train set:0.79
Best score on test set: 0.801052130876

3.2.2 Gaussian (RBF) SVM

svm_rbf = svm.SVC(probability=True)  # the default kernel is 'rbf'
# list the parameters to tune and their candidate values
param_grid = {'gamma':[0.01,0.1,1,10],
              'C':[0.01,0.1,1]}
print("Parameters:{}".format(param_grid))

# instantiate a GridSearchCV object
grid_search = GridSearchCV(svm_rbf, param_grid = param_grid,cv = n_fold, scoring = scoring)
# fit: search for the best parameters
grid_search.fit(X_train, y_train)

print("Test set score:{:.2f}".format(grid_search.score(X_test,y_test)))
print("Best parameters:{}".format(grid_search.best_params_))
print("Best score on train set:{:.2f}".format(grid_search.best_score_))
print('Best score on test set:', grid_search.score(X_test, y_test))

Output

Parameters:{'gamma': [0.01, 0.1, 1, 10], 'C': [0.01, 0.1, 1]}
Test set score:0.79
Best parameters:{'C': 0.1, 'gamma': 0.01}
Best score on train set:0.78
Best score on test set: 0.789709729662

3.3 Decision Tree

A decision tree has many hyperparameters; here we tune only the depth (a wider grid is sketched after the output below).

dt = DecisionTreeClassifier()  # max_depth is set by the grid below
# list the parameters to tune and their candidate values
param_grid = {'max_depth': [3, 4, 5, 6, 7]}
print("Parameters:{}".format(param_grid))

# instantiate a GridSearchCV object
grid_search = GridSearchCV(dt, param_grid = param_grid,cv = n_fold, scoring = scoring)
# fit: search for the best parameters
grid_search.fit(X_train, y_train)

print("Test set score:{:.2f}".format(grid_search.score(X_test,y_test)))
print("Best parameters:{}".format(grid_search.best_params_))
print("Best score on train set:{:.2f}".format(grid_search.best_score_))
print('Best score on test set:', grid_search.score(X_test, y_test))

Output

Parameters:{'max_depth': [3, 4, 5, 6, 7]}
Test set score:0.74
Best parameters:{'max_depth': 4}
Best score on train set:0.73
Best score on test set: 0.736358539928
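Depth is not the only influential tree hyperparameter. As a sketch (the candidate values below are illustrative, not tuned), the grid could be widened to include leaf and split sizes; note that GridSearchCV fits one model per combination per fold, so this 5 × 3 × 3 grid costs 45 combinations × 5 folds = 225 fits:

# an illustrative wider grid; reuses n_fold and scoring from above
param_grid = {
    'max_depth': [3, 4, 5, 6, 7],
    'min_samples_split': [2, 10, 50],
    'min_samples_leaf': [1, 5, 20],
}
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid = param_grid,
                           cv = n_fold, scoring = scoring)
grid_search.fit(X_train, y_train)
print("Best parameters:{}".format(grid_search.best_params_))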

4. Summary

The above is only a first pass at tuning, meant to illustrate how grid search and cross validation are applied to hyperparameter search.


Further Reading

[1] sklearn.model_selection.GridSearchCV
[2] sklearn.model_selection.KFold

