Label Data Using Semi-Supervised Learning Techniques
This example shows how to use graph-based and self-training semi-supervised learning techniques to label data.
Semi-supervised learning combines aspects of supervised learning, where all of the training data is labeled, and unsupervised learning, where true labels are unknown. That is, some training observations are labeled, but the vast majority are unlabeled. Semi-supervised learning methods try to leverage the underlying structure of the data to fit labels to the unlabeled data.
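Statistics and Machine Learning Toolbox™ provides these semi-supervised learning functions for classification:
fitsemigraph constructs a similarity graph with labeled and unlabeled observations as nodes, and distributes label information from labeled observations to unlabeled observations.
fitsemiself iteratively trains a classifier on the data. First, the function trains a classifier on the labeled data alone, and then uses that classifier to make label predictions for the unlabeled data. fitsemiself provides scores for the predictions, and then treats the predictions as true labels for the next training cycle of the classifier if the scores are above a certain threshold. This process repeats until the label predictions converge.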
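The self-training loop described above can be sketched in base MATLAB. This is only an illustration under stated assumptions, not the fitsemiself implementation: it swaps in a nearest-centroid classifier with softmax-style scores in place of a real learner, and the names selfTrainSketch, scoreThreshold, and maxIter are hypothetical.

```matlab
function fitted = selfTrainSketch(X,y,newX,scoreThreshold,maxIter)
% Illustrative self-training loop (NOT the fitsemiself implementation).
% Stand-in learner: nearest class centroid, with scores from a softmax
% over negative squared distances. All names here are hypothetical.
y = y(:);
classes = unique(y);
K = numel(classes);
trainX = X;      % the training set starts as the labeled data only
trainY = y;
fitted = [];     % no label predictions yet
for iter = 1:maxIter
    % "Train": compute one centroid per class.
    centroids = zeros(K,size(X,2));
    for k = 1:K
        centroids(k,:) = mean(trainX(trainY == classes(k),:),1);
    end
    % "Predict": score each unlabeled point against every centroid.
    d2 = zeros(size(newX,1),K);
    for k = 1:K
        d2(:,k) = sum((newX - centroids(k,:)).^2,2);
    end
    scores = exp(-(d2 - min(d2,[],2)));   % shift for numerical stability
    scores = scores ./ sum(scores,2);
    [maxScore,idx] = max(scores,[],2);
    newFitted = classes(idx);
    % Stop when the label predictions no longer change.
    if isequal(newFitted,fitted)
        break
    end
    fitted = newFitted;
    % Treat confident predictions as true labels in the next cycle.
    confident = maxScore >= scoreThreshold;
    trainX = [X; newX(confident,:)];
    trainY = [y; fitted(confident)];
end
end
```

For example, selfTrainSketch(X,y,newX,0.9,10) runs at most 10 training cycles and reuses only predictions whose top score is at least 0.9. A real learner, such as an SVM, would replace the centroid steps.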
Generate Data
Generate data from two half-moon shapes. Determine which moon new points belong to by using graph-based and self-training semi-supervised techniques.
Create the custom function twomoons (shown at the end of this example). This function takes an input argument n and creates n points in each of two interlaced half-moons: a top moon that is concave down and a bottom moon that is concave up.
Generate a set of 40 labeled data points by using the twomoons function. Each point in X is in one of the two moons, with the corresponding moon label stored in the vector label.
rng('default') % For reproducibility
[X,label] = twomoons(20);
Visualize the points by using a scatter plot. Points in the same moon have the same color.
scatter(X(:,1),X(:,2),[],label,'filled')
title("Labeled Data")
Generate a set of 400 unlabeled data points by using the twomoons function. Each point in newX belongs to one of the two moons, but the corresponding moon label is unknown.
newX = twomoons(200);
Label Data Using Graph-Based Method
Label the unlabeled data in newX by using a semi-supervised graph-based method. By default, fitsemigraph constructs a similarity graph from the data in X and newX, and uses a label propagation technique to fit labels to newX.
graphMdl = fitsemigraph(X,label,newX)
graphMdl = 
  SemiSupervisedGraphModel with properties:

             FittedLabels: [400x1 double]
              LabelScores: [400x2 double]
               ClassNames: [1 2]
             ResponseName: 'Y'
    CategoricalPredictors: []
                   Method: 'labelpropagation'
The function returns a SemiSupervisedGraphModel object whose FittedLabels property contains the fitted labels for the unlabeled data and whose LabelScores property contains the associated label scores.
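Visualize the fitted label results by using a scatter plot. Use the fitted labels to set the color of the points, and use the maximum label scores to set the transparency of the points. Points with less transparency are labeled with greater confidence.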
maxGraphScores = max(graphMdl.LabelScores,[],2);
rescaledGraphScores = rescale(maxGraphScores,0.05,0.95);
scatter(newX(:,1),newX(:,2),[],graphMdl.FittedLabels,'filled', ...
    'MarkerFaceAlpha','flat','AlphaData',rescaledGraphScores);
title(["Fitted Labels for Unlabeled Data","(Graph-Based)"])
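To make the transparency mapping concrete, here is a small sketch with made-up scores. The matrix labelScores and the 0.6 cutoff are hypothetical values chosen for illustration, not output from this example.

```matlab
% Hypothetical label scores for five points and two classes.
labelScores = [0.98 0.02; 0.55 0.45; 0.10 0.90; 0.51 0.49; 0.30 0.70];

% The largest score in each row is the score of the winning class.
maxScores = max(labelScores,[],2);

% Map the winning scores linearly onto [0.05,0.95] so that even the
% least confident point stays faintly visible when used as AlphaData.
alphaValues = rescale(maxScores,0.05,0.95);

% Flag the least confident points (the 0.6 cutoff is an arbitrary choice).
uncertainIdx = find(maxScores < 0.6);
```

Here rows 2 and 4 have nearly tied scores, so they receive the lowest alpha values and would appear most transparent in the plot.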
This method seems to label the newX points accurately. The two moons are visually distinct, and the points that are labeled with the most uncertainty lie on the boundary between the two shapes.
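For intuition about how labels spread over a similarity graph, the following base MATLAB sketch runs a simple iterative label propagation. It is a sketch under stated assumptions, not the fitsemigraph algorithm: it assumes a fully connected graph with Gaussian similarities, and the names labelPropSketch, sigma, and numIter are hypothetical. The graph construction and update rule that fitsemigraph uses may differ.

```matlab
function [fitted,scores] = labelPropSketch(X,y,newX,sigma,numIter)
% Illustrative iterative label propagation (NOT the fitsemigraph code).
% Assumes a fully connected graph with Gaussian similarities; sigma and
% numIter are hypothetical tuning parameters.
y = y(:);
classes = unique(y);
nL = size(X,1);
Z = [X; newX];                     % all nodes: labeled first, then unlabeled

% Pairwise squared Euclidean distances between all nodes.
sq = sum(Z.^2,2);
D2 = max(sq + sq' - 2*(Z*Z'),0);

W = exp(-D2./(2*sigma^2));         % similarity (edge weight) matrix
P = W ./ sum(W,2);                 % row-normalized transition matrix

% One-hot label matrix for labeled nodes; zeros for unlabeled nodes.
Y0 = double(y == classes');
F = [Y0; zeros(size(newX,1),numel(classes))];

for iter = 1:numIter
    F = P*F;                       % spread label mass along the edges
    F(1:nL,:) = Y0;                % clamp the known labels
end

scores = F(nL+1:end,:);
scores = scores ./ sum(scores,2);  % normalize the scores for each point
[~,idx] = max(scores,[],2);
fitted = classes(idx);
end
```

The smaller sigma is, the more each point is influenced only by its nearest neighbors, which is what lets labels follow a curved shape such as a half-moon.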
Label Data Using Self-Training Method
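Label the unlabeled data in newX by using a semi-supervised self-training method. By default, fitsemiself uses a support vector machine (SVM) model with a Gaussian kernel to label the data iteratively.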
selfSVMMdl = fitsemiself(X,label,newX)
selfSVMMdl = 
  SemiSupervisedSelfTrainingModel with properties:

             FittedLabels: [400x1 double]
              LabelScores: [400x2 double]
               ClassNames: [1 2]
             ResponseName: 'Y'
    CategoricalPredictors: []
                  Learner: [1x1 classreg.learning.classif.CompactClassificationSVM]
The function returns a SemiSupervisedSelfTrainingModel object whose FittedLabels property contains the fitted labels for the unlabeled data and whose LabelScores property contains the associated label scores.
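Visualize the fitted label results by using a scatter plot. As before, use the fitted labels to set the color of the points, and use the maximum label scores to set the transparency of the points.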
maxSVMScores = max(selfSVMMdl.LabelScores,[],2);
rescaledSVMScores = rescale(maxSVMScores,0.05,0.95);
scatter(newX(:,1),newX(:,2),[],selfSVMMdl.FittedLabels,'filled', ...
    'MarkerFaceAlpha','flat','AlphaData',rescaledSVMScores);
title(["Fitted Labels for Unlabeled Data","(Self-Training: SVM)"])
This method, with an SVM learner, also seems to label the newX points accurately. The two moons are visually distinct, and the points that are labeled with the most uncertainty lie on the boundary between the two shapes.
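The score threshold and the number of training cycles mentioned in the overview can be adjusted. The sketch below assumes that fitsemiself accepts the 'ScoreThreshold' and 'IterationLimit' name-value arguments and that twomoons is on the path. The values 0.9 and 20 are arbitrary illustrations, and how scores are scaled can depend on the learner, so check the function documentation before relying on a particular threshold.

```matlab
% Assumes twomoons (defined at the end of this example) is on the path.
rng('default') % For reproducibility
[X,label] = twomoons(20);
newX = twomoons(200);

% Model with default settings, for comparison.
defaultMdl = fitsemiself(X,label,newX);

% Stricter pseudo-labeling; the values are chosen only for illustration.
strictMdl = fitsemiself(X,label,newX, ...
    'ScoreThreshold',0.9, ...   % accept only high-scoring predictions
    'IterationLimit',20);       % cap the number of training cycles

% Count the points whose fitted label differs between the two models.
numChanged = sum(strictMdl.FittedLabels ~= defaultMdl.FittedLabels)
```

A higher threshold means fewer predictions are reused as training labels in each cycle, which can slow convergence but reduces the risk of reinforcing early mistakes.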
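However, some learners might not label the unlabeled data as effectively. For example, use a tree model instead of the default SVM model to label the data in newX.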
selfTreeMdl = fitsemiself(X,label,newX,'Learner','tree');
Visualize the fitted label results.
maxTreeScores = max(selfTreeMdl.LabelScores,[],2);
rescaledTreeScores = rescale(maxTreeScores,0.05,0.95);
scatter(newX(:,1),newX(:,2),[],selfTreeMdl.FittedLabels,'filled', ...
    'MarkerFaceAlpha','flat','AlphaData',rescaledTreeScores);
title(["Fitted Labels for Unlabeled Data","(Self-Training: Tree)"])
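This method, with a tree learner, mislabels many of the points in the top moon. When you use a semi-supervised self-training method, make sure to use an underlying learner that is appropriate for the structure of your data.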
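Because twomoons also returns the moon label for each point, one way to quantify the difference between learners is to keep the true labels of the "unlabeled" points aside and compare them with the fitted labels. The following is a sketch that assumes twomoons is on the path. The variable names trueLabel, learners, and accuracy are introduced here for illustration, and the resulting values depend on the random data.

```matlab
% Assumes twomoons (defined at the end of this example) is on the path.
rng('default') % For reproducibility
[X,label] = twomoons(20);
[newX,trueLabel] = twomoons(200);   % true labels are kept for checking only

learners = {'svm','tree'};
accuracy = zeros(size(learners));
for i = 1:numel(learners)
    % Fit labels to newX without showing the model the true labels.
    mdl = fitsemiself(X,label,newX,'Learner',learners{i});
    % Fraction of unlabeled points whose fitted label matches the truth.
    accuracy(i) = mean(mdl.FittedLabels == trueLabel);
end
accuracy
```

In practice the true labels of unlabeled data are not available, so a check like this is only possible on synthetic or held-out data.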
This code creates the function twomoons.
function [X,label] = twomoons(n)
% Generate two moons, with n points in each moon.

% Specify the radius and relevant angles for the two moons.
noise = (1/6).*randn(n,1);
radius = 1 + noise;
angle1 = pi + pi/10;
angle2 = pi/10;

% Create the bottom moon with a center at (1,0).
bottomTheta = linspace(-angle1,angle2,n)';
bottomX1 = radius.*cos(bottomTheta) + 1;
bottomX2 = radius.*sin(bottomTheta);

% Create the top moon with a center at (0,0).
topTheta = linspace(angle1,-angle2,n)';
topX1 = radius.*cos(topTheta);
topX2 = radius.*sin(topTheta);

% Return the moon points and their labels.
X = [bottomX1 bottomX2; topX1 topX2];
label = [ones(n,1); 2*ones(n,1)];
end