Cluster Gaussian Mixture Data Using Soft Clustering

This example shows how to implement soft clustering on simulated data from a mixture of Gaussian distributions.

The cluster function estimates cluster membership posterior probabilities, and then assigns each point to the cluster corresponding to the maximum posterior probability. Soft clustering is an alternative clustering method that allows some data points to belong to multiple clusters. To implement soft clustering:

  1. Assign a cluster membership score to each data point that describes how similar each point is to each cluster's archetype. For a mixture of Gaussian distributions, the cluster archetype is the corresponding component mean, and the score can be the estimated cluster membership posterior probability.

  2. Rank the points by their cluster membership score.

  3. Inspect the scores and determine cluster memberships.

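For algorithms that use posterior probabilities as scores, a data point is a member of the cluster corresponding to the maximum posterior probability. However, if other clusters have posterior probabilities close to the maximum, then the data point can also be a member of those clusters. It is good practice to determine the score threshold that yields multiple cluster memberships before clustering.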

This example follows from Cluster Gaussian Mixture Data Using Hard Clustering.

Simulate data from a mixture of two bivariate Gaussian distributions.

rng(0,'twister')  % For reproducibility
mu1 = [1 2];
sigma1 = [3 .2; .2 2];
mu2 = [-1 -2];
sigma2 = [2 0; 0 1];
X = [mvnrnd(mu1,sigma1,200); mvnrnd(mu2,sigma2,100)];

Fit a two-component Gaussian mixture model (GMM). Because there are two components, suppose that any data point with cluster membership posterior probabilities in the interval [0.4,0.6] can be a member of both clusters.

gm = fitgmdist(X,2);
threshold = [0.4 0.6];

Estimate component-member posterior probabilities for all data points using the fitted GMM gm. These represent cluster membership scores.

P = posterior(gm,X);
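
For reference, the posterior probability that a GMM assigns to component j for a point x is usually written as below. The symbols are notation assumed here rather than taken from this example: \pi_j is the mixing proportion, \mu_j the mean, and \Sigma_j the covariance of component j, and k is the number of components.

```latex
P(j \mid x) = \frac{\pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
                   {\sum_{l=1}^{k} \pi_l \, \mathcal{N}(x \mid \mu_l, \Sigma_l)},
\qquad
\sum_{j=1}^{k} P(j \mid x) = 1
```

With k = 2, each row of P sums to 1, so P(:,2) = 1 - P(:,1). A first-column score inside [0.4,0.6] therefore implies that the second-column score lies in the same interval, which is why testing a single column against the threshold is enough in the two-component case.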

For each cluster, rank the membership scores for all data points. Then plot each data point's membership score against its ranking relative to all other data points.

n = size(X,1);
[~,order] = sort(P(:,1));

figure
plot(1:n,P(order,1),'r-',1:n,P(order,2),'b-')
legend({'Cluster 1','Cluster 2'})
ylabel('Cluster Membership Score')
xlabel('Point Ranking')
title('GMM with Full Unshared Covariances')

Figure contains an axes object. The axes object with title GMM with Full Unshared Covariances contains 2 objects of type line. These objects represent Cluster 1, Cluster 2.

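Although a clear separation is hard to see in a scatter plot of the data, plotting the membership scores indicates that the fitted distribution does a good job of separating the data into groups.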

Plot the data and assign clusters by maximum posterior probability. Identify points that could be in either cluster.

idx = cluster(gm,X);
idxBoth = find(P(:,1)>=threshold(1) & P(:,1)<=threshold(2));
numInBoth = numel(idxBoth)
numInBoth = 7

figure
gscatter(X(:,1),X(:,2),idx,'rb','+o',5)
hold on
plot(X(idxBoth,1),X(idxBoth,2),'ko','MarkerSize',10)
legend({'Cluster 1','Cluster 2','Both Clusters'},'Location','SouthEast')
title('Scatter Plot - GMM with Full Unshared Covariances')
hold off

Figure contains an axes object. The axes object with title Scatter Plot - GMM with Full Unshared Covariances contains 3 objects of type line. These objects represent Cluster 1, Cluster 2, Both Clusters.

Using the score threshold interval, seven data points can be in either cluster.

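Soft clustering using a GMM is similar to fuzzy k-means clustering, which also assigns each point to each cluster with a membership score. The fuzzy k-means algorithm assumes that clusters are roughly spherical in shape and roughly equal in size. This is comparable to a Gaussian mixture distribution with a single covariance matrix that is shared across all components and is a multiple of the identity matrix. In contrast, gmdistribution allows you to specify different covariance structures. The default is to estimate a separate, unconstrained covariance matrix for each component. A more restricted choice, closer to k-means, is to estimate a shared, diagonal covariance matrix.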

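Fit a GMM to the data, but specify that the components share the same diagonal covariance matrix. This specification is similar to implementing fuzzy k-means clustering, but provides more flexibility by allowing unequal variances for different variables.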

gmSharedDiag = fitgmdist(X,2,'CovType','Diagonal',...
    'SharedCovariance',true);
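
One way to see what the shared, diagonal specification changes is to compare the size of the Sigma property of the two kinds of fitted model. The following is a minimal sketch, not part of the original example: the variable names gmFull and gmShared are hypothetical, the data is regenerated so the block runs on its own, and the covariance option is spelled with the documented name 'CovarianceType'.

```matlab
% Sketch only: gmFull and gmShared are illustrative names, not from the example.
rng(0,'twister')  % For reproducibility
X = [mvnrnd([1 2],[3 .2; .2 2],200); mvnrnd([-1 -2],[2 0; 0 1],100)];

% Default fit: a separate, full covariance matrix for each component.
gmFull = fitgmdist(X,2);

% Restricted fit: one diagonal covariance shared by all components.
gmShared = fitgmdist(X,2,'CovarianceType','diagonal', ...
    'SharedCovariance',true);

% Sigma is d-by-d-by-k for full, unshared covariances (d variables, k components).
sizeFull = size(gmFull.Sigma)

% Sigma is a 1-by-d vector of variances for a shared, diagonal covariance.
sizeShared = size(gmShared.Sigma)
```

The shared, diagonal model estimates only d variances in total, whereas the default model estimates a full symmetric matrix per component, which is why the former behaves more like k-means.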

Estimate component-member posterior probabilities for all data points using the fitted GMM gmSharedDiag. Estimate soft cluster assignments.

[idxSharedDiag,~,PSharedDiag] = cluster(gmSharedDiag,X);
idxBothSharedDiag = find(PSharedDiag(:,1)>=threshold(1) & ...
    PSharedDiag(:,1)<=threshold(2));
numInBoth = numel(idxBothSharedDiag)
numInBoth = 5

Assuming shared, diagonal covariances among components, five data points could be in either cluster.

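For each cluster:

  1. Rank the membership scores for all data points.

  2. Plot each data point's membership score against its ranking relative to all other data points.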

[~,orderSharedDiag] = sort(PSharedDiag(:,1));

figure
plot(1:n,PSharedDiag(orderSharedDiag,1),'r-',...
    1:n,PSharedDiag(orderSharedDiag,2),'b-')
legend({'Cluster 1','Cluster 2'},'Location','NorthEast')
ylabel('Cluster Membership Score')
xlabel('Point Ranking')
title('GMM with Shared Diagonal Component Covariances')

Figure contains an axes object. The axes object with title GMM with Shared Diagonal Component Covariances contains 2 objects of type line. These objects represent Cluster 1, Cluster 2.

Plot the data and identify the hard clustering assignments from the GMM analysis, assuming shared, diagonal covariances among components. Also, identify the data points that could be in either cluster.

figure
gscatter(X(:,1),X(:,2),idxSharedDiag,'rb','+o',5)
hold on
plot(X(idxBothSharedDiag,1),X(idxBothSharedDiag,2),'ko','MarkerSize',10)
legend({'Cluster 1','Cluster 2','Both Clusters'},'Location','SouthEast')
title('Scatter Plot - GMM with Shared Diagonal Component Covariances')
hold off

Figure contains an axes object. The axes object with title Scatter Plot - GMM with Shared Diagonal Component Covariances contains 3 objects of type line. These objects represent Cluster 1, Cluster 2, Both Clusters.
