
Fitting an Orthogonal Regression Using Principal Components Analysis

This example shows how to use Principal Components Analysis (PCA) to fit a linear regression. PCA minimizes the perpendicular distances from the data to the fitted model. This is the linear case of what is known as Orthogonal Regression or Total Least Squares, and is appropriate when there is no natural distinction between predictor and response variables, or when all variables are measured with error. This is in contrast to the usual regression assumption that predictor variables are measured exactly, and only the response variable has an error component.

For example, given two data vectors x and y, you can fit a line that minimizes the perpendicular distances from each of the points (x(i), y(i)) to the line. More generally, with p observed variables, you can fit an r-dimensional hyperplane in p-dimensional space (r < p).
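The example itself is written in MATLAB, but the 2-D case described above can be sketched in NumPy as well. The following is a hypothetical illustration, not part of the original example: the singular vectors of the centered data are the principal component directions, and the component with the smallest singular value is the normal to the orthogonal-regression line.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated data vectors x and y (synthetic, for illustration only).
x = rng.normal(size=100)
y = 0.8 * x + 0.3 * rng.normal(size=100)
X = np.column_stack([x, y])

# The orthogonal-regression line passes through the mean of the data.
mu = X.mean(axis=0)
Xc = X - mu

# PCA via SVD: rows of Vt are the principal component directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
dirvec = Vt[0]   # direction of the fitted line (first PC)
normal = Vt[1]   # normal to the line (last PC)

# Signed perpendicular distance of each point from the line; the fitted
# line minimizes the sum of these squared distances.
dist = Xc @ normal
sse_orth = np.sum(dist**2)
```

The sum of squared perpendicular distances equals the squared smallest singular value, which is why discarding the last component gives the best-fitting line.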

In this example, we fit a plane and a line through some data on three observed variables. It's easy to do the same thing for any number of variables, and for any dimension of model, although visualizing a fit in higher dimensions would obviously not be straightforward.

Fitting a Plane to 3-D Data

First, we generate some trivariate normal data for the example. Two of the variables are fairly strongly correlated.

rng(5,'twister');
X = mvnrnd([0 0 0], [1 .2 .7; .2 1 0; .7 0 1], 50);
plot3(X(:,1),X(:,2),X(:,3),'bo');
grid on;
maxlim = max(abs(X(:)))*1.1;
axis([-maxlim maxlim -maxlim maxlim -maxlim maxlim]);
axis square
view(-9,12);

Figure contains an axes object. The axes object contains an object of type line.

Next, we fit a plane to the data using PCA. The coefficients for the first two principal components define vectors that form a basis for the plane. The third PC is orthogonal to the first two, and its coefficients define the normal vector of the plane.

[coeff,score,roots] = pca(X);
basis = coeff(:,1:2)

basis = 3×2

    0.6774   -0.0790
    0.2193    0.9707
    0.7022   -0.2269

normal = coeff(:,3)

normal = 3×1

    0.7314
   -0.0982
   -0.6749

That's all there is to the fit. But let's look closer at the results, and plot the fit along with the data.

Because the first two components explain as much of the variance in the data as is possible with two dimensions, the plane is the best 2-D linear approximation to the data. Equivalently, the third component explains the least amount of variation in the data, and it is the error term in the regression. The latent roots (or eigenvalues) from the PCA define the amount of explained variance for each component.

pctExplained = roots'./sum(roots)

pctExplained = 1×3

    0.6226    0.2976    0.0798

The first two coordinates of the principal component scores give the projection of each point onto the plane, in the coordinate system of the plane. To get the coordinates of the fitted points in terms of the original coordinate system, we multiply each PC coefficient vector by the corresponding score, and add back in the mean of the data. The residuals are simply the original data minus the fitted points.

[n,p] = size(X);
meanX = mean(X,1);
Xfit = repmat(meanX,n,1) + score(:,1:2)*coeff(:,1:2)';
residuals = X - Xfit;

The equation of the fitted plane, satisfied by each of the fitted points in Xfit, is ([x1 x2 x3] - meanX)*normal = 0. The plane passes through the point meanX, and its perpendicular distance to the origin is meanX*normal. The perpendicular distance from each point in X to the plane, i.e., the norm of the residuals, is the dot product of each centered point with the normal to the plane. The fitted plane minimizes the sum of squared errors.

error = abs((X - repmat(meanX,n,1))*normal);
sse = sum(error.^2)

sse = 15.5142

To visualize the fit, we can plot the plane, the original data, and their projection to the plane.

[xgrid,ygrid] = meshgrid(linspace(min(X(:,1)),max(X(:,1)),5), ...
                         linspace(min(X(:,2)),max(X(:,2)),5));
zgrid = (1/normal(3)) .* (meanX*normal - (xgrid.*normal(1) + ygrid.*normal(2)));
h = mesh(xgrid,ygrid,zgrid,'EdgeColor',[0 0 0],'FaceAlpha',0);

hold on
above = (X-repmat(meanX,n,1))*normal < 0;
below = ~above;
nabove = sum(above);
X1 = [X(above,1) Xfit(above,1) nan*ones(nabove,1)];
X2 = [X(above,2) Xfit(above,2) nan*ones(nabove,1)];
X3 = [X(above,3) Xfit(above,3) nan*ones(nabove,1)];
plot3(X1',X2',X3','-', X(above,1),X(above,2),X(above,3),'o', 'Color',[0 .7 0]);
nbelow = sum(below);
X1 = [X(below,1) Xfit(below,1) nan*ones(nbelow,1)];
X2 = [X(below,2) Xfit(below,2) nan*ones(nbelow,1)];
X3 = [X(below,3) Xfit(below,3) nan*ones(nbelow,1)];
plot3(X1',X2',X3','-', X(below,1),X(below,2),X(below,3),'o', 'Color',[1 0 0]);
hold off

maxlim = max(abs(X(:)))*1.1;
axis([-maxlim maxlim -maxlim maxlim -maxlim maxlim]);
axis square
view(-9,12);

Figure contains an axes object. The axes object contains 53 objects of type surface, line.

The green points lie above the plane; the red points lie below it.

Fitting a Line to 3-D Data

Fitting a straight line to the data is even simpler, and because of the nesting property of PCA, we can use the components already computed. The direction vector that defines the line is given by the coefficients of the first principal component. The second and third PCs are orthogonal to the first, and their coefficients define directions that are perpendicular to the line. The simplest equation to describe the line is meanX + t*dirVect, where t parameterizes the position along the line.

dirVect = coeff(:,1)

dirVect = 3×1

    0.6774
    0.2193
    0.7022

The first coordinate of the principal component scores gives the projection of each point onto the line. As with the plane fit, the PC coefficient vector multiplied by the scores gives the fitted points in the original coordinate system.

Xfit1 = repmat(meanX,n,1) + score(:,1)*coeff(:,1)';

Plot the line, the original data, and their projection to the line.

t = [min(score(:,1))-.2, max(score(:,1))+.2];
endpts = [meanX + t(1)*dirVect'; meanX + t(2)*dirVect'];
plot3(endpts(:,1),endpts(:,2),endpts(:,3),'k-');

X1 = [X(:,1) Xfit1(:,1) nan*ones(n,1)];
X2 = [X(:,2) Xfit1(:,2) nan*ones(n,1)];
X3 = [X(:,3) Xfit1(:,3) nan*ones(n,1)];
hold on
plot3(X1',X2',X3','b-', X(:,1),X(:,2),X(:,3),'bo');
hold off

maxlim = max(abs(X(:)))*1.1;
axis([-maxlim maxlim -maxlim maxlim -maxlim maxlim]);
axis square
view(-9,12);
grid on

Figure contains an axes object. The axes object contains 52 objects of type line.

While it appears that many of the projections in this plot are not perpendicular to the line, that's just because we're plotting 3-D data in two dimensions. In a live MATLAB® figure window, you could interactively rotate the plot to different perspectives to verify that the projections are indeed perpendicular, and to get a better feel for how the line fits the data.
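As a numerical alternative to rotating the plot, you can confirm perpendicularity directly: each residual vector should have zero dot product with the line's direction vector. The following NumPy sketch of that check is a hypothetical cross-language illustration, not part of the original MATLAB example (the random data only mimics the mvnrnd call, so the numbers differ):

```python
import numpy as np

rng = np.random.default_rng(5)

# Trivariate normal data, analogous to the MATLAB mvnrnd call above.
cov = np.array([[1, .2, .7],
                [.2, 1, 0],
                [.7, 0, 1]])
X = rng.multivariate_normal(np.zeros(3), cov, size=50)

meanX = X.mean(axis=0)
Xc = X - meanX

# PCA via SVD; rows of Vt are the principal component directions.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
dirvect = Vt[0]                      # direction of the fitted line

# Project each point onto the line and form the residuals.
t = Xc @ dirvect                     # scores on the first component
Xfit1 = meanX + np.outer(t, dirvect)
residuals = X - Xfit1

# Every residual is orthogonal to the line's direction, up to
# floating-point error.
dots = residuals @ dirvect
```

Because the projection removes exactly the component along dirvect, the dot products in dots are zero to machine precision regardless of the data.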