clusterdata

Agglomerative clusters from data

Syntax

T = clusterdata(X,cutoff)
T = clusterdata(X,Name,Value)

Description

T = clusterdata(X,cutoff) returns the cluster indices (T) for each observation (row) of the data (X), using cutoff as the threshold for cutting the hierarchical tree.

T = clusterdata(X,Name,Value) clusters with additional options specified by one or more Name,Value pair arguments.

Input Arguments

X

Matrix with two or more rows. Rows represent observations; columns represent categories or dimensions.

cutoff

When 0 < cutoff < 2, clusterdata forms clusters when inconsistent values are greater than cutoff (see inconsistent). When cutoff is an integer ≥ 2, clusterdata interprets cutoff as the maximum number of clusters to keep in the hierarchical tree generated by linkage.
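
As a rough illustration of the two interpretations of cutoff, the sketch below calls clusterdata once with a fractional threshold and once with an integer. The data matrix X is made up for the example and is not part of the original page.

```matlab
rng default;                          % for reproducibility
X = [randn(10,2); randn(10,2)+5];     % hypothetical data: two separated groups

T1 = clusterdata(X,1.0);   % 0 < cutoff < 2: threshold on inconsistent values
T2 = clusterdata(X,2);     % integer >= 2: keep at most 2 clusters

n1 = numel(unique(T1))     % number of clusters produced by the threshold
n2 = numel(unique(T2))     % no more than 2
```

With the threshold form the number of clusters depends on the data, whereas the integer form caps it directly.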

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

'criterion'

Either 'inconsistent' or 'distance'.

'cutoff'

Cutoff for inconsistent or distance measure, a positive scalar. When 0 < cutoff < 2, clusterdata forms clusters when inconsistent values are greater than cutoff (see inconsistent). When cutoff is an integer ≥ 2, clusterdata interprets cutoff as the maximum number of clusters to keep in the hierarchical tree generated by linkage.

'depth'

Depth for computing inconsistent values, a positive integer.
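
The 'criterion', 'cutoff', and 'depth' arguments can be combined as in the following sketch. The data and the specific values (1.5, 1.0, 3) are assumptions chosen for illustration. The two groups lie in the squares [0,1]^2 and [3,4]^2, so a distance cutoff of 1.5 should be larger than any within-group distance and smaller than any between-group distance.

```matlab
rng default;                         % for reproducibility
X = [rand(15,2); rand(15,2)+3];      % hypothetical data: two groups

% Cut the tree where the linkage distance reaches 1.5.
Td = clusterdata(X,'criterion','distance','cutoff',1.5);

% Cut on the inconsistency coefficient, computed over 3 levels of the tree.
Ti = clusterdata(X,'criterion','inconsistent','cutoff',1.0,'depth',3);

nd = numel(unique(Td))               % clusters found by the distance criterion
ni = numel(unique(Ti))               % clusters found by the inconsistent criterion
```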

'distance'

Any of the distance metric names allowed by pdist (follow the 'minkowski' option by the value of the exponent p):

Metric and description

'euclidean'

Euclidean distance (default).

'squaredeuclidean'

Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.)

'seuclidean'

Standardized Euclidean distance. Each coordinate difference between rows in X is scaled by dividing by the corresponding element of the standard deviation S = nanstd(X). To specify another value for S, use D = pdist(X,'seuclidean',S).

'cityblock'

City block metric.

'minkowski'

Minkowski distance. The default exponent is 2. To specify a different exponent, use D = pdist(X,'minkowski',P), where P is a scalar positive value of the exponent.

'chebychev'

Chebychev distance (maximum coordinate difference).

'mahalanobis'

Mahalanobis distance, using the sample covariance of X as computed by nancov. To compute the distance with a different covariance, use D = pdist(X,'mahalanobis',C), where the matrix C is symmetric and positive definite.

'cosine'

One minus the cosine of the included angle between points (treated as vectors).

'correlation'

One minus the sample correlation between points (treated as sequences of values).

'spearman'

One minus the sample Spearman's rank correlation between observations (treated as sequences of values).

'hamming'

Hamming distance, which is the percentage of coordinates that differ.

'jaccard'

One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ.

custom distance function

A distance function specified using @:
D = pdist(X,@distfun)

A distance function must be of form

d2 = distfun(XI,XJ)
taking as arguments a 1-by-n vector XI, corresponding to a single row of X, and an m2-by-n matrix XJ, corresponding to multiple rows of X. distfun must accept a matrix XJ with an arbitrary number of rows. distfun must return an m2-by-1 vector of distances d2, whose kth element is the distance between XI and XJ(k,:).
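
A minimal sketch of a custom distance function follows. The handle distfun simply reimplements the Euclidean distance so that its output can be compared against the built-in metric; the data is hypothetical. Passing the handle to clusterdata through 'distance' assumes that clusterdata forwards it to pdist, as the description of this argument suggests.

```matlab
rng default;                 % for reproducibility
X = rand(12,3);              % hypothetical data

% XI is 1-by-n, XJ is m2-by-n; the result is an m2-by-1 vector of distances.
distfun = @(XI,XJ) sqrt(sum(bsxfun(@minus,XJ,XI).^2,2));

D = pdist(X,distfun);
maxDiff = max(abs(D - pdist(X,'euclidean')))   % should be near zero

T = clusterdata(X,'distance',distfun,'maxclust',3);
```

bsxfun is used instead of implicit expansion so that the handle also works in releases that do not support subtracting a row vector from a matrix directly.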

'linkage'

Any of the linkage methods allowed by the linkage function:

  • 'average'

  • 'centroid'

  • 'complete'

  • 'median'

  • 'single'

  • 'ward'

  • 'weighted'

For details, see the definitions in the linkage function reference page.
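
The 'distance' and 'linkage' arguments can be set together. The sketch below uses made-up data with three groups placed so that every within-group city block distance is below 2 and every between-group distance is above 2, which should let average linkage recover the groups.

```matlab
rng default;                                       % for reproducibility
X = [rand(10,2); rand(10,2)+2; rand(10,2)+4];      % hypothetical data

T = clusterdata(X,'distance','cityblock', ...
    'linkage','average','maxclust',3);

sizes = accumarray(T,1)      % number of observations in each cluster
```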

'maxclust'

Maximum number of clusters to form, a positive integer.

'savememory'

Either 'on' or 'off'. When applicable, the 'on' setting causes clusterdata to construct clusters without computing the distance matrix. savememory is applicable when:

  • linkage is 'centroid', 'median', or 'ward'

  • distance is 'euclidean' (default)

When savememory is 'on', linkage run time is proportional to the number of dimensions (number of columns of X). When savememory is 'off', linkage memory requirement is proportional to N^2, where N is the number of observations. So choosing the best (least-time) setting for savememory depends on the problem dimensions, number of observations, and available memory. The default savememory setting is a rough approximation of an optimal setting.

Default: 'on' when X has 20 columns or fewer, or the computer does not have enough memory to store the distance matrix; otherwise 'off'
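
To get a feel for the trade-off on a particular machine, one option is to time both settings, as in this sketch. The problem size (2000-by-3) is an arbitrary assumption, and the timings vary with hardware and release, so no particular outcome is implied.

```matlab
rng default;                 % for reproducibility
X = rand(2000,3);            % hypothetical data: few columns, many rows

tic;
Ton = clusterdata(X,'linkage','ward','savememory','on','maxclust',4);
tOn = toc;

tic;
Toff = clusterdata(X,'linkage','ward','savememory','off','maxclust',4);
tOff = toc;

fprintf('savememory on: %.3f s, off: %.3f s\n', tOn, tOff);
```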

Output Arguments

T

T is a vector of size m containing a cluster number for each observation.

  • When 0 < cutoff < 2, T = clusterdata(X,cutoff) is equivalent to:

    Y = pdist(X,'euclidean');
    Z = linkage(Y,'single');
    T = cluster(Z,'cutoff',cutoff);

  • When cutoff is an integer ≥ 2, T = clusterdata(X,cutoff) is equivalent to:

    Y = pdist(X,'euclidean');
    Z = linkage(Y,'single');
    T = cluster(Z,'maxclust',cutoff);
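
The stated equivalences can be checked directly. The sketch below uses hypothetical random data and compares clusterdata against the explicit pdist, linkage, cluster sequence for one value of cutoff in each range.

```matlab
rng default;                 % for reproducibility
X = rand(25,3);              % hypothetical data

Y = pdist(X,'euclidean');
Z = linkage(Y,'single');

% Threshold form (0 < cutoff < 2)
sameThreshold = isequal(clusterdata(X,1.1), cluster(Z,'cutoff',1.1))

% Maximum-number-of-clusters form (integer cutoff >= 2)
sameMaxclust = isequal(clusterdata(X,3), cluster(Z,'maxclust',3))
```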

Examples

This example shows how to create a hierarchical cluster tree from sample data, and visualize the clusters using a 3-dimensional scatter plot.

Generate sample data matrices containing random numbers from the standard uniform distribution.

rng default; % For reproducibility
X = [gallery('uniformdata',[10 3],12);...
    gallery('uniformdata',[10 3],13)+1.2;...
    gallery('uniformdata',[10 3],14)+2.5];

Compute the distances between items and create a hierarchical cluster tree from the sample data. List all of the items in cluster 2.

T = clusterdata(X,'maxclust',3);
find(T==2)
ans = 11 12 13 14 15 16 17 18 19 20

Plot the data with each cluster shown in a different color.

scatter3(X(:,1),X(:,2),X(:,3),100,T,'filled')

This example shows how to create a hierarchical cluster tree using Ward's linkage, and visualize the clusters using a 3-dimensional scatter plot.

Create a 20,000-by-3 matrix of sample data generated from the standard uniform distribution.

rng default; % For reproducibility
X = rand(20000,3);

Create a hierarchical cluster tree from the sample data using Ward's linkage. Set 'savememory' to 'on' to construct clusters without computing the distance matrix.

c = clusterdata(X,'linkage','ward','savememory','on','maxclust',4);

Plot the data with each cluster shown in a different color.

scatter3(X(:,1),X(:,2),X(:,3),10,c)

Tips

  • The centroid and median methods can produce a cluster tree that is not monotonic. This occurs when the distance from the union of two clusters, r and s, to a third cluster is less than the distance between r and s. In this case, in a dendrogram drawn with the default orientation, the path from a leaf to the root node takes some downward steps. To avoid this, use another method. The following image shows a nonmonotonic cluster tree.

    In this case, cluster 1 and cluster 3 are joined into a new cluster, while the distance between this new cluster and cluster 2 is less than the distance between cluster 1 and cluster 3. This leads to a nonmonotonic tree.

  • You can provide the output T to other functions including dendrogram to display the tree, cluster to assign points to clusters, inconsistent to compute inconsistent measures, and cophenet to compute the cophenetic correlation coefficient.
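
One way to test a tree for monotonicity is to inspect the merge heights in the third column of the linkage output, as sketched below on hypothetical data. Whether the centroid tree turns out nonmonotonic depends on the data, so only the single-linkage result, which is always monotonic, can be predicted here.

```matlab
rng default;                 % for reproducibility
X = rand(30,2);              % hypothetical data

Zc = linkage(X,'centroid');
Zs = linkage(X,'single');

% A tree is monotonic when merge heights never decrease from one step to the next.
centroidMonotonic = all(diff(Zc(:,3)) >= 0)
singleMonotonic   = all(diff(Zs(:,3)) >= 0)
```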

Introduced before R2006a