
Cluster Using Gaussian Mixture Models

This topic provides an introduction to clustering with a Gaussian mixture model (GMM) using the Statistics and Machine Learning Toolbox™ function cluster, and an example that shows the effects of specifying optional parameters when fitting the GMM model using fitgmdist.

How Gaussian Mixture Models Cluster Data

Gaussian mixture models (GMMs) are often used for data clustering. You can use GMMs to perform either hard clustering or soft clustering on query data.

To perform hard clustering, the GMM assigns query data points to the multivariate normal components that maximize the component posterior probability, given the data. That is, given a fitted GMM, cluster assigns query data to the component yielding the highest posterior probability. Hard clustering assigns a data point to exactly one cluster. For an example showing how to fit a GMM to data, cluster using the fitted model, and estimate component posterior probabilities, see Cluster Gaussian Mixture Data Using Hard Clustering.
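For illustration, here is a minimal sketch of hard clustering with a fitted GMM. The data set Xq is synthetic and purely illustrative:

rng(1); % For reproducibility
Xq = [randn(100,2); randn(100,2)+3]; % Synthetic two-cluster query data
gm = fitgmdist(Xq,2); % Fit a GMM with two components
idx = cluster(gm,Xq); % Hard assignment: one component label per point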

Additionally, you can use a GMM to perform a more flexible clustering on data, referred to as soft (or fuzzy) clustering. Soft clustering methods assign a score to a data point for each cluster. The value of the score indicates the association strength of the data point to the cluster. As opposed to hard clustering methods, soft clustering methods are flexible because they can assign a data point to more than one cluster. When you perform GMM clustering, the score is the posterior probability. For an example of soft clustering with a GMM, see Cluster Gaussian Mixture Data Using Soft Clustering.
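As a sketch of how these scores arise, the posterior function returns the component posterior probabilities for a fitted model (continuing the illustrative gm and Xq from the previous sketch):

P = posterior(gm,Xq); % n-by-k matrix; P(i,j) is the posterior probability that point i belongs to component j
[~,hardIdx] = max(P,[],2); % Taking the row-wise maximum recovers the hard assignment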

GMM clustering can accommodate clusters that have different sizes and correlation structures within them. Therefore, in certain applications, GMM clustering can be more appropriate than methods such as k-means clustering. Like many clustering methods, GMM clustering requires you to specify the number of clusters before fitting the model. The number of clusters specifies the number of components in the GMM.

When fitting a GMM, follow these best practices (a combined example appears after this list):

  • Consider the component covariance structure. You can specify diagonal or full covariance matrices, and whether all components have the same covariance matrix.

  • Specify initial conditions. The expectation-maximization (EM) algorithm fits the GMM. As in the k-means clustering algorithm, EM is sensitive to initial conditions and might converge to a local optimum. You can specify your own starting values for the parameters, specify initial cluster assignments for data points or let them be selected randomly, or specify use of the k-means++ algorithm.

  • Implement regularization. For example, if you have more predictors than data points, then you can regularize for estimation stability.
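The following sketch combines all three practices in a single fitgmdist call, assuming X is your n-by-p data matrix; the option values are illustrative, not recommendations:

gm = fitgmdist(X,3, ...
    'CovarianceType','diagonal', ... % Component covariance structure
    'SharedCovariance',true, ...     % One covariance matrix for all components
    'Start','plus', ...              % k-means++ initialization
    'RegularizationValue',0.01);     % Small positive number for estimation stability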

Fit GMM with Different Covariance Options and Initial Conditions

This example explores the effects of specifying different options for covariance structure and initial conditions when you perform GMM clustering.

Load Fisher's iris data set. Consider clustering the sepal measurements, and visualize the data in 2-D using the sepal measurements.

load fisheriris;
X = meas(:,1:2);
[n,p] = size(X);
plot(X(:,1),X(:,2),'.','MarkerSize',15);
title('Fisher''s Iris Data Set');
xlabel('Sepal length (cm)');
ylabel('Sepal width (cm)');

Figure contains an axes object. The axes object with title Fisher's Iris Data Set contains an object of type line.

The number of components k in a GMM determines the number of subpopulations, or clusters. In this figure, it is difficult to determine if two, three, or perhaps more Gaussian components are appropriate. A GMM increases in complexity as k increases.

Specify Different Covariance Structure Options

Each Gaussian component has a covariance matrix. Geometrically, the covariance structure determines the shape of a confidence ellipsoid drawn over a cluster. You can specify whether the covariance matrices for all components are diagonal or full, and whether all components have the same covariance matrix. Each combination of specifications determines the shape and orientation of the ellipsoids.

Specify three GMM components and 1000 maximum iterations for the EM algorithm. For reproducibility, set the random seed.

rng(3);
k = 3; % Number of GMM components
options = statset('MaxIter',1000);

Specify covariance structure options.

Sigma = {'diagonal','full'}; % Options for covariance matrix type
nSigma = numel(Sigma);
SharedCovariance = {true,false}; % Indicator for identical or nonidentical covariance matrices
SCtext = {'true','false'};
nSC = numel(SharedCovariance);

Create a 2-D grid covering the plane composed of extremes of the measurements. You will use this grid later to draw confidence ellipsoids over the clusters.

d = 500; % Grid length
x1 = linspace(min(X(:,1))-2, max(X(:,1))+2, d);
x2 = linspace(min(X(:,2))-2, max(X(:,2))+2, d);
[x1grid,x2grid] = meshgrid(x1,x2);
X0 = [x1grid(:) x2grid(:)];

Specify the following:

  • For all combinations of the covariance structure options, fit a GMM with three components.

  • Use the fitted GMM to cluster the 2-D grid.

  • Obtain the score that specifies a 99% probability threshold for each confidence region. This specification determines the length of the major and minor axes of the ellipsoids.

  • Color each ellipsoid using a similar color as its cluster.

threshold = sqrt(chi2inv(0.99,2));
count = 1;
for i = 1:nSigma
    for j = 1:nSC
        gmfit = fitgmdist(X,k,'CovarianceType',Sigma{i}, ...
            'SharedCovariance',SharedCovariance{j},'Options',options); % Fitted GMM
        clusterX = cluster(gmfit,X); % Cluster index
        mahalDist = mahal(gmfit,X0); % Distance from each grid point to each GMM component
        % Draw ellipsoids over each GMM component and show clustering result.
        subplot(2,2,count);
        h1 = gscatter(X(:,1),X(:,2),clusterX);
        hold on
        for m = 1:k
            idx = mahalDist(:,m)<=threshold;
            Color = h1(m).Color*0.75 - 0.5*(h1(m).Color - 1);
            h2 = plot(X0(idx,1),X0(idx,2),'.','Color',Color,'MarkerSize',1);
            uistack(h2,'bottom');
        end
        plot(gmfit.mu(:,1),gmfit.mu(:,2),'kx','LineWidth',2,'MarkerSize',10)
        title(sprintf('Sigma is %s\nSharedCovariance = %s',Sigma{i},SCtext{j}),'FontSize',8)
        legend(h1,{'1','2','3'})
        hold off
        count = count + 1;
    end
end

Figure contains 4 axes objects. Axes object 1 with title Sigma is diagonal, SharedCovariance = true contains 7 objects of type line. These objects represent 1, 2, 3. Axes object 2 with title Sigma is diagonal, SharedCovariance = false contains 7 objects of type line. These objects represent 1, 2, 3. Axes object 3 with title Sigma is full, SharedCovariance = true contains 7 objects of type line. These objects represent 1, 2, 3. Axes object 4 with title Sigma is full, SharedCovariance = false contains 7 objects of type line. These objects represent 1, 2, 3.

The probability threshold for the confidence region determines the length of the major and minor axes, and the covariance type determines the orientation of the axes. Note the following about options for the covariance matrices:

  • Diagonal covariance matrices indicate that the predictors are uncorrelated. The major and minor axes of the ellipses are parallel or perpendicular to the x and y axes. This specification increases the total number of parameters by p, the number of predictors, for each component, but is more parsimonious than the full covariance specification.

  • Full covariance matrices allow for correlated predictors with no restriction to the orientation of the ellipses relative to the x and y axes. Each component increases the total number of parameters by p(p + 1)/2, but captures the correlation structure among the predictors. This specification can cause overfitting.

  • Shared covariance matrices indicate that all components have the same covariance matrix. All ellipses are the same size and have the same orientation. This specification is more parsimonious than the unshared specification because the total number of parameters increases by the number of covariance parameters for one component only.

  • Unshared covariance matrices indicate that each component has its own covariance matrix. The size and orientation of all ellipses might differ. This specification increases the number of parameters by k times the number of covariance parameters for a component, but can capture covariance differences among components.
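To make these parameter counts concrete, this sketch tabulates the number of covariance parameters for each combination of options, using the p = 2 predictors and k = 3 components of this example:

p = 2; k = 3;
covParams = table( ...
    ["diagonal, shared"; "diagonal, unshared"; "full, shared"; "full, unshared"], ...
    [p; k*p; p*(p+1)/2; k*p*(p+1)/2], ...
    'VariableNames',{'Specification','CovarianceParameters'})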

The figure also shows that cluster does not always preserve cluster order. If you cluster several fitted gmdistribution models, cluster can assign different cluster labels for similar components.

Specify Different Initial Conditions

The algorithm that fits a GMM to the data can be sensitive to initial conditions. To illustrate this sensitivity, fit four different GMMs as follows:

  1. For the first GMM, assign most data points to the first cluster.

  2. For the second GMM, randomly assign data points to clusters.

  3. For the third GMM, make another random assignment of data points to clusters.

  4. For the fourth GMM, use k-means++ to obtain initial cluster centers.

initialCond1 = [ones(n-8,1); [2; 2; 2; 2]; [3; 3; 3; 3]]; % For the first GMM
initialCond2 = randsample(1:k,n,true); % For the second GMM
initialCond3 = randsample(1:k,n,true); % For the third GMM
initialCond4 = 'plus'; % For the fourth GMM
cluster0 = {initialCond1; initialCond2; initialCond3; initialCond4};

For all instances, use k = 3 components, unshared and full covariance matrices, the same initial mixture proportions, and the same initial covariance matrices. For stability when you try different sets of initial values, increase the number of EM algorithm iterations. Also, draw confidence ellipsoids over the clusters.

converged = nan(4,1);
for j = 1:4
    gmfit = fitgmdist(X,k,'CovarianceType','full', ...
        'SharedCovariance',false,'Start',cluster0{j}, ...
        'Options',options);
    clusterX = cluster(gmfit,X); % Cluster index
    mahalDist = mahal(gmfit,X0); % Distance from each grid point to each GMM component
    % Draw ellipsoids over each GMM component and show clustering result.
    subplot(2,2,j);
    h1 = gscatter(X(:,1),X(:,2),clusterX);
    hold on;
    nK = numel(unique(clusterX));
    for m = 1:nK
        idx = mahalDist(:,m)<=threshold;
        Color = h1(m).Color*0.75 - 0.5*(h1(m).Color - 1);
        h2 = plot(X0(idx,1),X0(idx,2),'.','Color',Color,'MarkerSize',1);
        uistack(h2,'bottom');
    end
    plot(gmfit.mu(:,1),gmfit.mu(:,2),'kx','LineWidth',2,'MarkerSize',10)
    legend(h1,{'1','2','3'});
    hold off
    converged(j) = gmfit.Converged; % Indicator for convergence
end

Figure contains 4 axes objects. Axes object 1 contains 7 objects of type line. These objects represent 1, 2, 3. Axes object 2 contains 7 objects of type line. These objects represent 1, 2, 3. Axes object 3 contains 7 objects of type line. These objects represent 1, 2, 3. Axes object 4 contains 7 objects of type line. These objects represent 1, 2, 3.

sum(converged)
ans = 4

All algorithms converged. Each starting cluster assignment for the data points leads to a different fitted cluster assignment. You can specify a positive integer for the name-value pair argument 'Replicates', which runs the algorithm the specified number of times. Subsequently, fitgmdist chooses the fit with the largest likelihood.
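For example, here is a sketch of refitting with multiple replicates (the value 10 is illustrative):

gmfit = fitgmdist(X,k,'CovarianceType','full', ...
    'SharedCovariance',false,'Replicates',10, ...
    'Options',options); % Runs EM 10 times; keeps the largest-likelihood fit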

When to Regularize

Sometimes, during an iteration of the EM algorithm, a fitted covariance matrix can become ill conditioned, which means the likelihood is escaping to infinity. This problem can happen if one or more of the following conditions exist:

  • You have more predictors than data points.

  • You specify fitting with too many components.

  • The predictors are highly correlated.

To overcome this problem, you can specify a small, positive number using the 'RegularizationValue' name-value pair argument. fitgmdist adds this number to the diagonal elements of all covariance matrices, which ensures that all matrices are positive definite. Regularizing can reduce the maximal likelihood value.
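As a sketch, assuming the value 0.01 is appropriate for the scale of your data:

gmRegularized = fitgmdist(X,k,'RegularizationValue',0.01, ...
    'Options',options); % Adds 0.01 to the diagonal of every covariance matrix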

Model Fit Statistics

In most applications, the number of components k and appropriate covariance structure Σ are unknown. One way you can tune a GMM is by comparing information criteria. Two popular information criteria are the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC).

Both the AIC and BIC take the optimized, negative loglikelihood and then penalize it with the number of parameters in the model (the model complexity). However, the BIC penalizes for complexity more severely than the AIC. Therefore, the AIC tends to choose more complex models that might overfit, and the BIC tends to choose simpler models that might underfit. A good practice is to look at both criteria when evaluating a model. Lower AIC or BIC values indicate better fitting models. Also, ensure that your choices for k and the covariance matrix structure are appropriate for your application. fitgmdist stores the AIC and BIC of fitted gmdistribution model objects in the properties AIC and BIC. You can access these properties by using dot notation. For an example showing how to choose the appropriate parameters, see Tune Gaussian Mixture Models.
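For instance, a minimal sketch that compares the criteria across a range of candidate component counts (the range 1:4 is illustrative):

numK = 4;
AIC = zeros(1,numK);
BIC = zeros(1,numK);
for kTry = 1:numK
    gmTry = fitgmdist(X,kTry,'Options',options);
    AIC(kTry) = gmTry.AIC; % Access the criteria with dot notation
    BIC(kTry) = gmTry.BIC;
end
[~,bestK] = min(BIC) % Candidate with the lowest BIC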

See Also


Related Topics