
Label Data Using Semi-Supervised Learning Techniques

This example shows how to use graph-based and self-training semi-supervised learning techniques to label data.

Semi-supervised learning combines aspects of supervised learning, where all of the training data is labeled, and unsupervised learning, where true labels are unknown. That is, some training observations are labeled, but the vast majority are unlabeled. Semi-supervised learning methods try to leverage the underlying structure of the data to fit labels to the unlabeled data.

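Statistics and Machine Learning Toolbox™ provides these semi-supervised learning functions for classification:

  • fitsemigraph constructs a similarity graph with labeled and unlabeled observations as nodes, and distributes label information from the labeled observations to the unlabeled observations.

  • fitsemiself iteratively trains a classifier on the data. First, the function trains a classifier on the labeled data alone, and then uses that classifier to make label predictions for the unlabeled data. fitsemiself provides scores for the predictions, and then treats the predictions as true labels for the next training cycle of the classifier if the scores are above a certain threshold. This process repeats until the label predictions converge.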
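The self-training loop described above can be sketched in plain MATLAB. This is only an illustration, not the fitsemiself implementation: the base learner is a hypothetical nearest-centroid classifier, and the score threshold (thresh), the iteration cap (maxIter), and the synthetic data are all assumptions chosen for the sketch.

```matlab
% Minimal self-training sketch with a hypothetical nearest-centroid learner.
% This is NOT how fitsemiself works internally; it only illustrates the loop.
rng(0)                                        % for reproducibility
X    = [randn(10,2) - 2; randn(10,2) + 2];    % labeled points
y    = [ones(10,1); 2*ones(10,1)];            % their labels (1 or 2)
newX = [randn(50,2) - 2; randn(50,2) + 2];    % unlabeled points
thresh  = 0.9;                                % assumed score threshold
maxIter = 20;                                 % assumed iteration cap

fitted = zeros(size(newX,1),1);               % 0 means "no pseudo-label yet"
for iter = 1:maxIter
    % Train on the labeled data plus the currently accepted pseudo-labels.
    keep   = fitted > 0;
    trainX = [X; newX(keep,:)];
    trainY = [y; fitted(keep)];
    c1 = mean(trainX(trainY == 1,:),1);       % class centroids
    c2 = mean(trainX(trainY == 2,:),1);

    % Score the unlabeled points: softmax over negative squared distances.
    d1 = sum((newX - c1).^2,2);
    d2 = sum((newX - c2).^2,2);
    s1 = 1 ./ (1 + exp(d1 - d2));             % score for class 1
    scores = [s1, 1 - s1];
    [maxScore,pred] = max(scores,[],2);

    % Accept only confident predictions as pseudo-labels.
    newFitted = zeros(size(fitted));
    confident = maxScore >= thresh;
    newFitted(confident) = pred(confident);
    if isequal(newFitted,fitted)              % pseudo-labels converged
        break
    end
    fitted = newFitted;
end
finalLabels = pred;                           % predictions for all of newX
```

The pseudo-labels are recomputed from scratch in every pass, so a point accepted early can be dropped later if its score falls below the threshold. Other designs keep accepted pseudo-labels permanently; which one a given library uses is not something this sketch claims to reproduce.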

Generate Data

Generate data from two half-moon shapes. Determine which moon new points belong to by using graph-based and self-training semi-supervised techniques.

Create the custom function twomoons (shown at the end of this example). This function takes an input argument n and creates n points in each of two interlaced half-moons: a top moon that is concave down and a bottom moon that is concave up.

Generate a set of 40 labeled data points by using the twomoons function. Each point in X is in one of the two moons, with the corresponding moon label stored in the vector label.

rng('default') % For reproducibility
[X,label] = twomoons(20);

Visualize the points by using a scatter plot. Points in the same moon have the same color.

scatter(X(:,1),X(:,2),[],label,'filled')
title("Labeled Data")

[Figure: scatter plot titled "Labeled Data".]

Generate a set of 400 unlabeled data points by using the twomoons function. Each point in newX belongs to one of the two moons, but the corresponding moon label is unknown.

newX = twomoons(200);

Label Data Using Graph-Based Method

Label the unlabeled data in newX by using a semi-supervised graph-based method. By default, fitsemigraph constructs a similarity graph from the data in X and newX, and uses a label propagation technique to fit labels to newX.

graphMdl = fitsemigraph(X,label,newX)
graphMdl = 
  SemiSupervisedGraphModel with properties:

             FittedLabels: [400x1 double]
              LabelScores: [400x2 double]
               ClassNames: [1 2]
             ResponseName: 'Y'
    CategoricalPredictors: []
                   Method: 'labelpropagation'

The function returns a SemiSupervisedGraphModel object whose FittedLabels property contains the fitted labels for the unlabeled data and whose LabelScores property contains the associated label scores.
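To give a feel for what label propagation over a similarity graph involves, here is a minimal plain-MATLAB sketch. It is an illustration under stated assumptions, not the fitsemigraph implementation: the graph is fully connected with a Gaussian kernel, the kernel scale sigma is an arbitrary choice, and the tiny data set is made up for the example.

```matlab
% Minimal label propagation sketch on a fully connected Gaussian similarity
% graph. Illustrative only; not the fitsemigraph implementation.
X     = [0 0; 0.2 0.1; 3 3; 3.1 2.9];    % labeled points (made up)
label = [1; 1; 2; 2];
newX  = [0.1 0.3; 2.8 3.2; 0.4 0.2];     % unlabeled points (made up)
sigma = 1;                               % assumed kernel scale

allX = [X; newX];
nL = size(X,1);
nU = size(newX,1);

% Pairwise squared distances and Gaussian similarities.
sq = sum(allX.^2,2);
D2 = max(sq + sq' - 2*(allX*allX'),0);
W  = exp(-D2./(2*sigma^2));
P  = W ./ sum(W,2);                      % row-normalized transition matrix

% One-hot scores for labeled nodes; uniform scores for unlabeled nodes.
classes = unique(label);
F0 = double(label == classes');
F  = [F0; ones(nU,numel(classes))./numel(classes)];

for iter = 1:200
    Fnew = P*F;                          % spread scores along graph edges
    Fnew(1:nL,:) = F0;                   % clamp the known labels
    if max(abs(Fnew(:) - F(:))) < 1e-8   % stop when scores stop changing
        break
    end
    F = Fnew;
end

labelScores = F(nL+1:end,:);             % one row of class scores per point
[~,idx] = max(labelScores,[],2);
fittedLabels = classes(idx);             % analogous in spirit to FittedLabels
```

Each unlabeled point ends up with a row of class scores that sums to 1, and the fitted label is the class with the largest score. Points that sit between the two groups receive scores closer to 0.5, which is the kind of uncertainty the transparency in the plots of this example conveys.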

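Visualize the fitted label results by using a scatter plot. Use the fitted labels to set the color of the points, and use the maximum label scores to set the transparency of the points. Points with less transparency are labeled with greater confidence.

maxGraphScores = max(graphMdl.LabelScores,[],2);
rescaledGraphScores = rescale(maxGraphScores,0.05,0.95);
scatter(newX(:,1),newX(:,2),[],graphMdl.FittedLabels,'filled', ...
    'MarkerFaceAlpha','flat','AlphaData',rescaledGraphScores);
title(["Fitted Labels for Unlabeled Data","(Graph-Based)"])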
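As a small aside on the rescaling step, the following sketch shows how maximum label scores map to transparency values. The score matrix here is hypothetical and chosen only to make the mapping easy to follow.

```matlab
% How rescale maps maximum scores to alpha values (hypothetical numbers).
scores = [0.50 0.50; 0.75 0.25; 1.00 0.00];  % made-up two-class label scores
maxScores = max(scores,[],2);                % confidence of each point
alphaVals = rescale(maxScores,0.05,0.95);    % smallest -> 0.05, largest -> 0.95
disp([maxScores alphaVals])
```

Because the output range is 0.05 to 0.95 rather than 0 to 1, even the least confident point stays faintly visible and the most confident point is not fully opaque.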

[Figure: scatter plot titled "Fitted Labels for Unlabeled Data (Graph-Based)".]

This method seems to label the newX points accurately. The two moons are visually distinct, and the points that are labeled with the most uncertainty lie on the boundary between the two shapes.

Label Data Using Self-Training Method

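Label the unlabeled data in newX by using a semi-supervised self-training method. By default, fitsemiself uses a support vector machine (SVM) model with a Gaussian kernel to label the data iteratively.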

selfSVMMdl = fitsemiself(X,label,newX)
selfSVMMdl = 
  SemiSupervisedSelfTrainingModel with properties:

             FittedLabels: [400x1 double]
              LabelScores: [400x2 double]
               ClassNames: [1 2]
             ResponseName: 'Y'
    CategoricalPredictors: []
                  Learner: [1x1 classreg.learning.classif.CompactClassificationSVM]

The function returns a SemiSupervisedSelfTrainingModel object whose FittedLabels property contains the fitted labels for the unlabeled data and whose LabelScores property contains the associated label scores.

Visualize the fitted label results by using a scatter plot. As before, use the fitted labels to set the color of the points, and use the maximum label scores to set the transparency of the points.

maxSVMScores = max(selfSVMMdl.LabelScores,[],2);
rescaledSVMScores = rescale(maxSVMScores,0.05,0.95);
scatter(newX(:,1),newX(:,2),[],selfSVMMdl.FittedLabels,'filled', ...
    'MarkerFaceAlpha','flat','AlphaData',rescaledSVMScores);
title(["Fitted Labels for Unlabeled Data","(Self-Training: SVM)"])

[Figure: scatter plot titled "Fitted Labels for Unlabeled Data (Self-Training: SVM)".]

This method, with an SVM learner, also seems to label the newX points accurately. The two moons are visually distinct, and the points that are labeled with the most uncertainty lie on the boundary between the two shapes.

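However, some learners might not label the unlabeled data as effectively. For example, use a tree model instead of the default SVM model to label the data in newX.

selfTreeMdl = fitsemiself(X,label,newX,'Learner','tree');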

Visualize the fitted label results.

maxTreeScores = max(selfTreeMdl.LabelScores,[],2);
rescaledTreeScores = rescale(maxTreeScores,0.05,0.95);
scatter(newX(:,1),newX(:,2),[],selfTreeMdl.FittedLabels,'filled', ...
    'MarkerFaceAlpha','flat','AlphaData',rescaledTreeScores);
title(["Fitted Labels for Unlabeled Data","(Self-Training: Tree)"])

[Figure: scatter plot titled "Fitted Labels for Unlabeled Data (Self-Training: Tree)".]

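This method, with a tree learner, mislabels many of the points in the top moon. When you use a semi-supervised self-training method, make sure to use an underlying learner that is appropriate for the structure of your data.

This code creates the function twomoons.

function [X,label] = twomoons(n)
% Generate two moons, with n points in each moon.

% Specify the radius and relevant angles for the two moons.
noise = (1/6).*randn(n,1);
radius = 1 + noise;
angle1 = pi + pi/10;
angle2 = pi/10;

% Create the bottom moon with a center at (1,0).
bottomTheta = linspace(-angle1,angle2,n)';
bottomX1 = radius.*cos(bottomTheta) + 1;
bottomX2 = radius.*sin(bottomTheta);

% Create the top moon with a center at (0,0).
topTheta = linspace(angle1,-angle2,n)';
topX1 = radius.*cos(topTheta);
topX2 = radius.*sin(topTheta);

% Return the moon points and their labels (1 = bottom moon, 2 = top moon).
X = [bottomX1 bottomX2; topX1 topX2];
label = [ones(n,1); 2*ones(n,1)];
end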
