
Classification with Imbalanced Data

This example shows how to perform classification when one class has many more observations than another. You use the RUSBoost algorithm first, because it is designed to handle this case. Another way to handle imbalanced data is to use the name-value pair arguments 'Prior' or 'Cost'. For details, see Handle Imbalanced Data or Unequal Misclassification Costs in Classification Ensembles.
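As a brief illustration of that alternative (a sketch added here, not part of the original example), you can pass a misclassification cost matrix to fitcensemble; the variables X and Y and the 10x penalty below are illustrative placeholders.

% Sketch: penalize misclassifying a rare class via the 'Cost' argument.
% X, Y, and the 10x penalty are illustrative placeholders.
C = ones(2) - eye(2);    % baseline: unit cost for every misclassification
C(2,:) = 10*C(2,:);      % errors on true class 2 (the rare class) cost 10x more
mdl = fitcensemble(X,Y,'Cost',C);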

This example uses the "Cover type" data from the UCI machine learning archive, https://archive.ics.uci.edu/ml/datasets/Covertype. The data classifies types of forest (ground cover), based on predictors such as elevation, soil type, and distance to water. The data has over 500,000 observations and over 50 predictors, so training and using a classifier is time consuming.

Blackard and Dean [1] describe a neural net classification of this data. They quote a 70.6% classification accuracy. RUSBoost obtains over 81% classification accuracy.

Obtain the Data

Import the data into your workspace. Extract the last data column into a variable named Y.

gunzip('https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz')
load covtype.data
Y = covtype(:,end);
covtype(:,end) = [];

Examine the Response Data

tabulate(Y)
  Value    Count    Percent
      1   211840     36.46%
      2   283301     48.76%
      3    35754      6.15%
      4     2747      0.47%
      5     9493      1.63%
      6    17367      2.99%
      7    20510      3.53%

There are hundreds of thousands of data points. Those of class 4 make up less than 0.5% of the total. This imbalance indicates that RUSBoost is an appropriate algorithm.
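To quantify the imbalance programmatically, a short check such as the following works; this snippet is an addition to the example, and counting the integer labels with histcounts is just one way to do it.

% Quantify the imbalance: fraction of observations in each of the 7 classes
counts = histcounts(Y,1:8);     % counts for the integer class labels 1 through 7
fractions = counts/numel(Y)     % class 4's share comes out below 0.005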

Partition the Data for Quality Assessment

Use half the data to fit a classifier, and half to examine the quality of the resulting classifier.

rng(10,'twister') % For reproducibility
part = cvpartition(Y,'Holdout',0.5);
istrain = training(part); % Data for fitting
istest = test(part); % Data for quality assessment
tabulate(Y(istrain))
  Value    Count    Percent
      1   105919     36.46%
      2   141651     48.76%
      3    17877      6.15%
      4     1374      0.47%
      5     4747      1.63%
      6     8684      2.99%
      7    10254      3.53%

Create the Ensemble

Use deep trees for higher ensemble accuracy. To do so, set the trees to have a maximal number of decision splits of N, where N is the number of observations in the training sample. Set LearnRate to 0.1 in order to achieve higher accuracy. The data is large, and, with deep trees, creating the ensemble is time consuming.

N = sum(istrain); % Number of observations in the training sample
t = templateTree('MaxNumSplits',N);
tic
rusTree = fitcensemble(covtype(istrain,:),Y(istrain),'Method','RUSBoost', ...
    'NumLearningCycles',1000,'Learners',t,'LearnRate',0.1,'NPrint',100);
Training RUSBoost...
Grown weak learners: 100
Grown weak learners: 200
Grown weak learners: 300
Grown weak learners: 400
Grown weak learners: 500
Grown weak learners: 600
Grown weak learners: 700
Grown weak learners: 800
Grown weak learners: 900
Grown weak learners: 1000
toc
Elapsed time is 242.836734 seconds.

Inspect the Classification Error

Plot the classification error against the number of members in the ensemble.

figure;
tic
plot(loss(rusTree,covtype(istest,:),Y(istest),'mode','cumulative'));
toc
Elapsed time is 164.470086 seconds.
grid on;
xlabel('Number of trees');
ylabel('Test classification error');

The ensemble achieves a classification error of under 20% using 116 or more trees. For 500 or more trees, the classification error decreases at a slower rate.
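If you want to locate that 20% threshold programmatically rather than reading it off the plot, a sketch like the following works. This snippet is an addition to the example; note that it recomputes the cumulative loss curve, which takes a few minutes on this data set.

% Sketch: find the smallest ensemble size whose cumulative test error
% falls below 20%. This recomputes the loss curve plotted above.
err = loss(rusTree,covtype(istest,:),Y(istest),'mode','cumulative');
nTrees = find(err < 0.20,1)   % expected to return 116, per the plot above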

Examine the confusion matrix for each class as a percentage of the true class.

tic
Yfit = predict(rusTree,covtype(istest,:));
toc
Elapsed time is 132.353489 seconds.
confusionchart(Y(istest),Yfit,'Normalization','row-normalized','RowSummary','row-normalized')

All classes except class 2 have over 90% classification accuracy. But class 2 makes up close to half the data, so the overall accuracy is not that high.
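To read those per-class accuracies numerically instead of from the chart, you can compute them from the confusion matrix; this snippet is an addition to the example.

% Per-class accuracy (recall): diagonal of the confusion matrix divided
% by each row total, where rows correspond to the true classes.
cm = confusionmat(Y(istest),Yfit);
perClassAccuracy = diag(cm)./sum(cm,2)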

Compact the Ensemble

The ensemble is large. Remove the data using the compact method.

cmpctRus = compact(rusTree);
sz(1) = whos('rusTree');
sz(2) = whos('cmpctRus');
[sz(1).bytes sz(2).bytes]
ans = 1×2
10^9 ×
    1.6579    0.9423

The compacted ensemble is about half the size of the original.

Remove half the trees from cmpctRus. This action is likely to have a minimal effect on the predictive performance, based on the observation that 500 out of 1000 trees give nearly optimal accuracy.

cmpctRus = removeLearners(cmpctRus,[500:1000]);
sz(3) = whos('cmpctRus');
sz(3).bytes
ans = 452868660

The reduced compact ensemble takes about a quarter of the memory of the full ensemble. Its overall loss rate is under 19%:

L = loss(cmpctRus,covtype(istest,:),Y(istest))
L = 0.1833

The predictive accuracy on new data might differ, because the ensemble accuracy might be biased. The bias arises because the same data used for assessing the ensemble was used for reducing the ensemble size. To obtain an unbiased estimate of the requisite ensemble size, you should use cross validation. However, that procedure is time consuming.
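A minimal sketch of that cross-validated check, assuming you can afford the run time; the 5 folds and 500 learning cycles below are illustrative choices added here, not settings from the original example.

% Sketch: cross-validate a RUSBoost ensemble on the training set and use the
% cumulative cross-validated loss to estimate the requisite ensemble size.
cvMdl = fitcensemble(covtype(istrain,:),Y(istrain),'Method','RUSBoost', ...
    'NumLearningCycles',500,'Learners',t,'LearnRate',0.1,'KFold',5);
cvLossCurve = kfoldLoss(cvMdl,'Mode','cumulative');
nTreesCV = find(cvLossCurve < 0.20,1)   % smallest size reaching under 20% error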

References

[1] Blackard, J. A., and D. J. Dean. "Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables." Computers and Electronics in Agriculture, Vol. 24, Issue 3, 1999, pp. 131–151.
