Perform Factor Analysis on Exam Grades

This example shows how to perform factor analysis using Statistics and Machine Learning Toolbox™.

Multivariate data often include a large number of measured variables, and sometimes those variables "overlap" in the sense that groups of them may be dependent. For example, in a decathlon, each athlete competes in 10 events, but several of them can be thought of as "speed" events, while others can be thought of as "strength" events, etc. Thus, a competitor's 10 event scores might be thought of as largely dependent on a smaller set of 3 or 4 types of athletic ability.

Factor analysis is a way to fit a model to multivariate data to estimate just this sort of interdependence.

The Factor Analysis Model

In the factor analysis model, the measured variables depend on a smaller number of unobserved (latent) factors. Because each factor may affect several variables in common, they are known as "common factors". Each variable is assumed to depend on a linear combination of the common factors, and the coefficients are known as loadings. Each measured variable also includes a component due to independent random variability, known as "specific variance" because it is specific to one variable.

Specifically, factor analysis assumes that the covariance matrix of your data is of the form

SigmaX = Lambda*Lambda' + Psi

where Lambda is the matrix of loadings, and the elements of the diagonal matrix Psi are the specific variances. The function factoran fits the factor analysis model using maximum likelihood.
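To make the covariance structure concrete, the following is a minimal sketch that builds SigmaX from a small loadings matrix. The numbers in Lambda are made up for illustration (they are not estimates from any data), and the specific variances are chosen on the assumption that every variable is standardized to unit variance.

```matlab
% Hypothetical loadings for 4 standardized variables on 2 common factors.
% These numbers are invented for illustration; they are not estimates.
Lambda = [0.9 0.0
          0.8 0.1
          0.1 0.7
          0.0 0.8];

% Specific variances chosen so that every variable has unit total variance.
psi = 1 - sum(Lambda.^2, 2);
Psi = diag(psi);

% Covariance matrix implied by the factor model.
SigmaX = Lambda*Lambda' + Psi;
disp(SigmaX)
```

In this sketch the off-diagonal entries of SigmaX come only from shared factors: variables 1 and 2 load on the same factor and are strongly related, while variables 1 and 4 share no factor and have zero implied covariance.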

Example: Finding Common Factors Affecting Exam Grades

120 students have each taken five exams, the first two covering mathematics, the next two on literature, and a comprehensive fifth exam. It seems reasonable that the five grades for a given student ought to be related. Some students are good at both subjects, some are good at only one, etc. The goal of this analysis is to determine if there is quantitative evidence that the students' grades on the five different exams are largely determined by only two types of ability.

First load the data, then call factoran and request a model fit with a single common factor.

load examgrades
[Loadings1,specVar1,T,stats] = factoran(grades,1);

The first two return arguments of factoran are the estimated loadings and the estimated specific variances. From the estimated loadings, you can see that the one common factor in this model puts large positive weight on all five variables, but most weight on the fifth, comprehensive exam.

Loadings1

Loadings1 =

    0.6021
    0.6686
    0.7704
    0.7204
    0.9153

One interpretation of this fit is that a student might be thought of in terms of their "overall ability", for which the comprehensive exam would be the best available measurement. A student's grade on a more subject-specific test would depend on their overall ability, but also on whether or not the student was strong in that area. This would explain the lower loadings for the first four exams.

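From the estimated specific variances, you can see that the model indicates that a particular student's grade on a particular test varies quite a lot beyond the variation due to the common factor.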

specVar1

specVar1 =

    0.6375
    0.5530
    0.4065
    0.4810
    0.1623

A specific variance of 1 would indicate that there is no common factor component in that variable, while a specific variance of 0 would indicate that the variable is entirely determined by common factors. These exam grades seem to fall somewhere in between, although there is the least amount of specific variation for the comprehensive exam. This is consistent with the interpretation given above of the single common factor in this model.
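The 0-to-1 scale for specific variances makes sense if the variables are treated as standardized, so that each variable's squared loading (its communality) and its specific variance add up to about 1. The sketch below checks this using the one-factor numbers printed above; the values are copied by hand from that output, so small rounding differences are expected.

```matlab
% Values copied from the one-factor output printed above.
Loadings1 = [0.6021; 0.6686; 0.7704; 0.7204; 0.9153];
specVar1  = [0.6375; 0.5530; 0.4065; 0.4810; 0.1623];

% Communality: share of each variable's variance explained by the common factor.
communality = Loadings1.^2;

% For standardized variables, communality + specific variance should be about 1.
total = communality + specVar1;
disp([communality specVar1 total])
```

Read this way, the common factor accounts for most of the variance of the comprehensive exam but for well under half of the variance of the first mathematics exam.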

The p-value returned in the stats structure rejects the null hypothesis of a single common factor, so we refit the model.

stats.p

ans = 0.0332

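Next, use two common factors to try to better explain the exam scores. With more than one factor, you could rotate the estimated loadings to try to make their interpretation simpler, but for the moment, ask for an unrotated solution.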

[Loadings2,specVar2,T,stats] = factoran(grades,2,'rotate','none');

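From the estimated loadings, you can see that the first unrotated factor puts approximately equal weight on all five variables, while the second factor contrasts the first two variables with the second two.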

Loadings2

Loadings2 =

    0.6289    0.3485
    0.6992    0.3287
    0.7785   -0.2069
    0.7246   -0.2070
    0.8963   -0.0473

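You could interpret these factors as "overall ability" and "quantitative vs. qualitative ability", extending the interpretation of the one-factor fit made earlier.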

A plot of the variables, where each loading is a coordinate along the corresponding factor's axis, illustrates this interpretation graphically. The first two exams have a positive loading on the second factor, suggesting that they depend on "quantitative" ability, while the second two exams apparently depend on the opposite. The fifth exam has only a small loading on this second factor.

biplot(Loadings2,'varlabels',num2str((1:5)'));
title('Unrotated Solution');
xlabel('Latent Factor 1'); ylabel('Latent Factor 2');

From the estimated specific variances, you can see that this two-factor model indicates somewhat less variation beyond that due to the common factors than the one-factor model did. Again, the least amount of specific variance occurs for the fifth exam.

specVar2

specVar2 =

    0.4829
    0.4031
    0.3512
    0.4321
    0.1944

The stats structure shows that there is only a single degree of freedom in this two-factor model.

stats.dfe

ans = 1

With only five measured variables, you cannot fit a model with more than two factors.
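The limit of two factors follows from counting degrees of freedom. The sketch below uses the standard count for a maximum likelihood factor model with d variables and m factors, ((d-m)^2 - (d+m))/2. It is offered as an illustration of why a third factor is not possible here, on the assumption that stats.dfe follows this same count; the helper name dfe is introduced only for this example.

```matlab
% Degrees of freedom for a maximum likelihood fit with d variables and m factors.
% The anonymous function name "dfe" is chosen here for illustration.
dfe = @(d, m) ((d - m).^2 - (d + m)) / 2;

d = 5;  % five exams
for m = 1:3
    fprintf('m = %d factors: dfe = %d\n', m, dfe(d, m));
end
```

For five variables the count is 5 with one factor, 1 with two factors (matching the value shown above), and negative with three factors, meaning the model would have more free parameters than the covariance matrix can determine.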

Factor Analysis from a Covariance/Correlation Matrix

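You made the fits above using the raw test scores, but sometimes you might only have a sample covariance matrix that summarizes your data. factoran accepts either a covariance or a correlation matrix, specified with the 'Xtype' parameter, and gives the same result as from the raw data.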

Sigma = cov(grades);
[LoadingsCov,specVarCov] = factoran(Sigma,2,'Xtype','cov','rotate','none');
LoadingsCov

LoadingsCov =

    0.6289    0.3485
    0.6992    0.3287
    0.7785   -0.2069
    0.7246   -0.2070
    0.8963   -0.0473

Factor Rotation

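Sometimes, the estimated loadings from a factor analysis model give a large weight on several factors for some of the measured variables, making it difficult to interpret what those factors represent. The goal of factor rotation is to find a solution for which each variable has only a small number of large loadings, that is, is affected by a small number of factors, preferably only one.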

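If you think of each row of the loadings matrix as the coordinates of a point in M-dimensional space, then each factor corresponds to a coordinate axis. Factor rotation is equivalent to rotating those axes and computing new loadings in the rotated coordinate system. There are various ways to do this. Some methods leave the axes orthogonal, while others are oblique methods that change the angles between them.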
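As a small illustration of rotation as a change of coordinates, the sketch below applies an arbitrary orthogonal rotation to the unrotated two-factor loadings printed earlier. The 30 degree angle is an arbitrary choice for the example, not a rotation produced by any criterion such as varimax.

```matlab
% Unrotated two-factor loadings copied from the output printed earlier.
Loadings2 = [0.6289  0.3485
             0.6992  0.3287
             0.7785 -0.2069
             0.7246 -0.2070
             0.8963 -0.0473];

% An arbitrary orthogonal rotation of the factor axes (angle chosen for illustration).
theta = 30;
R = [cosd(theta) -sind(theta)
     sind(theta)  cosd(theta)];

% New loadings in the rotated coordinate system.
LoadingsRot = Loadings2 * R;

% Communalities (row sums of squared loadings) are unchanged by the rotation.
commBefore = sum(Loadings2.^2, 2);
commAfter  = sum(LoadingsRot.^2, 2);
disp(max(abs(commBefore - commAfter)))
```

Because R is orthogonal, LoadingsRot*LoadingsRot' equals Loadings2*Loadings2', so an orthogonal rotation changes how the loadings are distributed across factors without changing the fitted covariance structure.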

Varimax is one common criterion for orthogonal rotation. factoran performs varimax rotation by default, so you do not need to ask for it explicitly.

[LoadingsVM,specVarVM,rotationVM] = factoran(grades,2);

A quick check of the varimax rotation matrix returned by factoran confirms that it is orthogonal. Varimax, in effect, rotates the factor axes in the figure above, but keeps them at right angles.

rotationVM'*rotationVM

ans =

     1     0
     0     1

A biplot of the five variables on the rotated factors shows the effect of varimax rotation.

biplot(LoadingsVM,'varlabels',num2str((1:5)'));
title('Varimax Solution');
xlabel('Latent Factor 1'); ylabel('Latent Factor 2');

Varimax has rigidly rotated the axes in an attempt to make all of the loadings close to zero or one. The first two exams are closest to the second factor axis, while the third and fourth are closest to the first axis and the fifth exam is at an intermediate position. These two rotated factors can probably be best interpreted as "quantitative ability" and "qualitative ability". However, because none of the variables are near a factor axis, the biplot shows that orthogonal rotation has not succeeded in providing a simple set of factors.

Because the orthogonal rotation was not entirely satisfactory, you can try using promax, a common oblique rotation criterion.

[LoadingsPM,specVarPM,rotationPM] = factoran(grades,2,'rotate','promax');

A check on the promax rotation matrix returned by factoran shows that it is not orthogonal. Promax, in effect, rotates the factor axes in the first figure separately, allowing them to have an oblique angle between them.

rotationPM'*rotationPM

ans =

    1.9405   -1.3509
   -1.3509    1.9405

A biplot of the variables on the new rotated factors shows the effect of promax rotation.

biplot(LoadingsPM,'varlabels',num2str((1:5)'));
title('Promax Solution');
xlabel('Latent Factor 1'); ylabel('Latent Factor 2');

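Promax has performed a non-rigid rotation of the axes, and has done a much better job than varimax at creating "simple structure". The first two exams are close to the second factor axis, while the third and fourth are close to the first axis, and the fifth exam is in an intermediate position. This makes an interpretation of these rotated factors as "quantitative ability" and "qualitative ability" more precise.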

Instead of plotting the variables on the different sets of rotated axes, it's possible to overlay the rotated axes on an unrotated biplot to get a better idea of how the rotated and unrotated solutions are related.

h1 = biplot(Loadings2,'varlabels',num2str((1:5)'));
xlabel('Latent Factor 1'); ylabel('Latent Factor 2');
hold on
% Overlay the varimax axes (red) on the unrotated biplot
invRotVM = inv(rotationVM);
h2 = line([-invRotVM(1,1) invRotVM(1,1) NaN -invRotVM(2,1) invRotVM(2,1)], ...
          [-invRotVM(1,2) invRotVM(1,2) NaN -invRotVM(2,2) invRotVM(2,2)],'Color',[1 0 0]);
% Overlay the promax axes (green)
invRotPM = inv(rotationPM);
h3 = line([-invRotPM(1,1) invRotPM(1,1) NaN -invRotPM(2,1) invRotPM(2,1)], ...
          [-invRotPM(1,2) invRotPM(1,2) NaN -invRotPM(2,2) invRotPM(2,2)],'Color',[0 1 0]);
hold off
axis square
lgndHandles = [h1(1) h1(end) h2 h3];
lgndLabels = {'Variables','Unrotated Axes','Varimax Rotated Axes','Promax Rotated Axes'};
legend(lgndHandles,lgndLabels,'location','northeast','fontname','arial narrow');

Predicting Factor Scores

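Sometimes, it is useful to be able to classify an observation based on its factor scores. For example, if you accepted the two-factor model and the interpretation of the promax rotated factors, you might want to predict how well a student would do on a mathematics exam in the future.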

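Since the data are the raw exam grades, and not just their covariance matrix, we can have factoran return predictions of the value of each of the two rotated common factors for each student.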

[Loadings,specVar,rotation,stats,preds] = factoran(grades,2,'rotate','promax','maxit',200);
biplot(Loadings,'varlabels',num2str((1:5)'),'Scores',preds);
title('Predicted Factor Scores for Promax Solution');
xlabel('Ability In Literature'); ylabel('Ability In Mathematics');

This plot shows the model fit in terms of both the original variables (vectors) and the predicted scores for each observation (points). The fit suggests that, while some students do well in one subject but not the other (second and fourth quadrants), most students do either well or poorly in both mathematics and literature (first and third quadrants). You can confirm this by looking at the estimated correlation matrix of the two factors.

inv(rotation'*rotation)

ans =

    1.0000    0.6962
    0.6962    1.0000
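The same correlation can be recovered by hand from the rotationPM'*rotationPM product printed earlier for the promax fit. The sketch below does that with the rounded printed values, so it only approximately reproduces the result above. It also converts the correlation to an angle, on the usual assumption that the correlation between two oblique factors equals the cosine of the angle between their axes.

```matlab
% Product rotationPM'*rotationPM as printed earlier for the promax fit (rounded values).
TtT = [ 1.9405 -1.3509
       -1.3509  1.9405];

% Implied correlation matrix of the two rotated factors.
factorCorr = inv(TtT);

% Angle between the oblique factor axes, in degrees,
% assuming correlation = cosine of the angle between the axes.
axisAngle = acosd(factorCorr(1,2));
fprintf('correlation = %.4f, angle = %.1f degrees\n', factorCorr(1,2), axisAngle);
```

An angle well below 90 degrees is another way of seeing that the two abilities are positively related rather than independent.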

Comparison of Factor Analysis and Principal Components Analysis

There is a good deal of overlap in terminology and goals between Principal Components Analysis (PCA) and Factor Analysis (FA). Much of the literature on the two methods does not distinguish between them, and some algorithms for fitting the FA model involve PCA. Both are dimension-reduction techniques, in the sense that they can be used to replace a large set of observed variables with a smaller set of new variables. They also often give similar results. However, the two methods are different in their goals and in their underlying models. Roughly speaking, you should use PCA when you simply need to summarize or approximate your data using fewer dimensions (to visualize it, for example), and you should use FA when you need an explanatory model for the correlations among your data.