Ten-Fold Cross-Validation in Python: Machine Learning (12), Cross-Validation Examples


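1 Introduction to cross-validation

1.1 What is cross-validation

The basic idea of cross-validation is to partition the original dataset in some way: one part serves as the training set and the other as the validation set (or test set). The classifier is first trained on the training set, and the resulting model is then tested on the validation set. The result is used as the measure of the classifier's performance. (Source: Baike encyclopedia)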

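1.2 Why cross-validation is needed

Suppose an unknown model has one or more free parameters, and there is a dataset (the training set) that reflects the characteristics the model should capture. Fitting is the process of adjusting the model's parameters so that the model reflects the training set as closely as possible.

If independent samples drawn from the same source as the training data are set aside as a validation set, then overfitting caused by a training set that is too small or by badly chosen parameters will show up in the results on the validation set.

In short, cross-validation is a method for estimating how well a fitted model will predict.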
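The point about overfitting can be illustrated with a small sketch. The synthetic dataset from `make_classification`, the choice of an unpruned decision tree, and all parameter values below are assumptions made for illustration and are not part of the original article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration): 20% of labels are flipped
X, y = make_classification(n_samples=200, n_features=20, n_informative=2,
                           flip_y=0.2, random_state=0)

# Hold out half of the samples as a validation set
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)

# An unpruned tree can memorize the training set, noise included
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # accuracy on data the model has seen
val_acc = tree.score(X_va, y_va)    # accuracy on held-out data
print("train accuracy:", train_acc)
print("validation accuracy:", val_acc)
```

The training accuracy alone would suggest a perfect model. The gap between the two numbers is what the validation set reveals.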

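2 Common cross-validation methods

2.1 Holdout validation

The original data is randomly split into two groups, one used as the training set and one as the validation set. The classifier is trained on the training set and the model is then checked against the validation set. The resulting classification accuracy is recorded as the performance measure of the classifier.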

Python code:

from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 2, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=5)

print("X_train:\n", X_train)
print("y_train:\n", y_train)
print("X_test:\n", X_test)
print("y_test:\n", y_test)

Output:

X_train:
 [[5 6]
 [7 8]]
y_train:
 [2 1]
X_test:
 [[1 2]
 [3 4]]
y_test:
 [1 2]

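A better holdout approach splits the original data into three parts: a training set, a validation set, and a test set. The training set is used to train the candidate models, and the validation set is used for model selection. Because the test set is used in neither training nor model selection, it is unseen data as far as the model is concerned, so it can be used to estimate the model's ability to generalize.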


Figure: Steps of the holdout method
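A three-way split can be sketched by calling `train_test_split` twice. The toy data and the 60/20/20 proportions are assumptions chosen for illustration. The original article does not specify any proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 100 samples, 2 features, binary labels
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Step 1: hold out 20% of all samples as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: split the remaining 80% into training and validation sets;
# 25% of the remainder equals 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Models would be fitted on `X_train`, compared on `X_val`, and the chosen model evaluated once on `X_test`.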

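The advantage of this method is its simplicity: the original data only has to be randomly split into two groups. Strictly speaking, however, the hold-out method is not really cross-validation, because nothing is actually "crossed". Since the data is split at random, the accuracy on the validation set depends heavily on how the split happens to fall, so a result obtained this way is not very convincing.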
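The dependence on the split can be observed by repeating the holdout procedure with different random seeds. The iris dataset, the logistic regression model, and the seed values below are assumptions used for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(5):
    # Same data, same model; only the random split changes
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.5, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_va, y_va))

# The validation accuracy usually differs from one split to the next
print(scores)
```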

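2.2 K-fold cross-validation

In K-fold cross-validation the training set is partitioned into K subsamples. One subsample is held out as validation data and the other K-1 subsamples are used for training. The procedure is repeated K times so that each subsample serves as validation data exactly once, and the K results are averaged (or combined in some other way) to give a single estimate. The advantage of this method is that every sample is used for both training and validation, and each sample is validated exactly once. 10-fold cross-validation is the most commonly used.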


Figure: 10-fold cross-validation

Python code:

from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 2, 1])

kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Output:

Train: [2 3] Validation: [0 1]
Train: [0 1] Validation: [2 3]
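The loop above only prints the indices. A minimal sketch of the full 10-fold procedure, including the averaging step, might look like the following. The iris dataset, the k-nearest-neighbors classifier, and the use of `cross_val_score` are assumptions for illustration and do not come from the original article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# shuffle=True matters here because the iris samples are ordered by class
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy value per fold; each sample is validated exactly once
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=kf)

# The mean over the 10 folds is the single performance estimate
print("fold scores:", scores)
print("mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```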

In addition, sklearn also provides RepeatedKFold and StratifiedKFold.

from sklearn.model_selection import RepeatedKFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=2652124)
for train_index, test_index in rkf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
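RepeatedKFold runs K-fold `n_repeats` times, each time with a different random partition. The sketch below checks the resulting number of splits and how often each sample is used for testing. The toy data and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples

rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

# Total number of train/test splits is n_splits * n_repeats
n_total = rkf.get_n_splits(X)
print(n_total)  # 15

# Count how many times each sample lands in a test fold
test_counts = np.zeros(len(X), dtype=int)
for train_index, test_index in rkf.split(X):
    test_counts[test_index] += 1
print(test_counts)  # every sample is tested once per repeat
```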

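StratifiedKFold performs stratified sampling for imbalanced data. Stratified sampling keeps the class proportions of the original dataset in every subset. For example, if the original dataset has positive:negative = 3:1, that ratio should also hold in each subset.

from sklearn.model_selection import StratifiedKFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)  # number of splits (2)
print(skf)

# Unlike KFold, split() needs y so that class proportions can be preserved
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]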
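The 3:1 example from the text can be checked directly. The toy labels below (12 positive, 4 negative) are an assumption constructed to match that ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 12 positive and 4 negative samples: positive:negative = 3:1
y = np.array([1] * 12 + [0] * 4)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=4)
fold_counts = []
for train_index, test_index in skf.split(X, y):
    pos = int(np.sum(y[test_index] == 1))
    neg = int(np.sum(y[test_index] == 0))
    fold_counts.append((pos, neg))

print(fold_counts)  # each test fold holds 3 positives and 1 negative
```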

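2.3 Leave-One-Out Cross Validation

As the name suggests, leave-one-out cross-validation (LOOCV) uses a single sample from the original data as validation data and keeps the rest as training data. This is repeated until every sample has served as validation data once. It is in fact equivalent to K-fold cross-validation with K equal to the number of samples.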

Python code:

from sklearn.model_selection import LeaveOneOut
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 2, 1])

loo = LeaveOneOut()
loo.get_n_splits(X)  # one split per sample (4)
for train_index, test_index in loo.split(X):
    print("train:", train_index, "validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Output:

train: [1 2 3] validation: [0]
train: [0 2 3] validation: [1]
train: [0 1 3] validation: [2]
train: [0 1 2] validation: [3]
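The equivalence with K-fold can be verified on the same four samples. This is a small check and assumes KFold is used without shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Collect the (train, validation) index lists from both splitters
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X)]
kf_splits = [(tr.tolist(), te.tolist())
             for tr, te in KFold(n_splits=len(X)).split(X)]

# With K equal to the number of samples, the splits are identical
print(loo_splits == kf_splits)  # True
```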

3 Cross-validation example

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold

# Data IO and generation
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
X, y = X[y != 2], y[y != 2]  # keep two classes for a binary problem
n_samples, n_features = X.shape

# Add noisy features
random_state = np.random.RandomState(0)
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Classification and ROC analysis
# Run classifier with cross-validation and plot ROC curves
cv = StratifiedKFold(n_splits=6)
classifier = svm.SVC(kernel='linear', probability=True,
                     random_state=random_state)

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

i = 0
for train, test in cv.split(X, y):
    probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    # np.interp replaces the scipy.interp alias, which newer SciPy no longer provides
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i += 1

plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
         label='Chance', alpha=.8)

# Mean ROC curve over all folds
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)

# Shade one standard deviation around the mean curve
std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
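When only the per-fold AUC values are needed and not the plot, a shorter route is possible. The sketch below uses `cross_val_score` with `scoring="roc_auc"`. This shortcut is an assumption added for illustration and is not part of the original example. It also omits the noisy features, so the scores will differ from those in the plot.

```python
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same binary subset of iris as in the example above, without added noise
iris = datasets.load_iris()
X, y = iris.data, iris.target
X, y = X[y != 2], y[y != 2]

cv = StratifiedKFold(n_splits=6)
clf = svm.SVC(kernel="linear")

# One AUC value per fold, computed from the classifier's decision scores
aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print("AUC per fold:", aucs)
print("mean AUC: %.2f, std: %.2f" % (aucs.mean(), aucs.std()))
```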


Author: 致Great

Link: https://www.jianshu.com/p/8767ef42ee47
