
Detect Outliers Using Quantile Regression

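This example shows how to detect outliers using quantile random forest. Quantile random forest can detect outliers with respect to the conditional distribution of Y given X. However, this method cannot detect outliers in the predictor data. For outlier detection in the predictor data using a bag of decision trees, see the OutlierMeasure property of a TreeBagger model.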

An outlier is an observation that is located far enough from most of the other observations in a data set to be considered anomalous. Causes of outlying observations include inherent variability or measurement error. Outliers significantly affect estimates and inference, so it is important to detect them and decide whether to remove them or consider a robust analysis.

Statistics and Machine Learning Toolbox™ provides several functions to detect outliers, including:

  • zscore - Compute z-scores of observations.

  • trimmean - Estimate the mean of data, excluding outliers.

  • boxplot - Draw box plot of data.

  • probplot - Draw probability plot.

  • robustcov - Estimate robust covariance of multivariate data.

  • fitcsvm - Fit a one-class support vector machine (SVM) to determine which observations are located far from the decision boundary.

  • dbscan - Partition observations into clusters and identify outliers using the density-based spatial clustering of applications with noise (DBSCAN) algorithm.

Also, MATLAB® provides the isoutlier function, which finds outliers in data.
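As a quick illustration of isoutlier, the following minimal sketch applies it to a small made-up vector. The values in x are hypothetical and are not part of this example's data. By default, isoutlier flags values that are more than three scaled median absolute deviations (MAD) from the median; the 'quartiles' method instead flags values more than 1.5 interquartile ranges beyond the quartiles.

```matlab
% Hypothetical data: five typical values and one extreme value.
x = [1 2 3 4 5 100];

% Default method: more than three scaled MAD from the median.
tfMedian = isoutlier(x);

% Quartile-based method: more than 1.5 IQR beyond the quartiles.
tfQuartiles = isoutlier(x,'quartiles');

% Show the values flagged by the default method.
disp(x(tfMedian))
```

Both methods treat the data unconditionally, which is the main difference from the conditional approach used in the rest of this example.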

To demonstrate outlier detection, this example:

  1. Generates data from a nonlinear model with heteroscedasticity and simulates a few outliers.

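  2. Grows a quantile random forest of regression trees.

  3. Estimates conditional quartiles ($Q_1$, $Q_2$, and $Q_3$) and the interquartile range ($IQR$) within the ranges of the predictor variables.

  4. Compares the observations to the fences, which are the quantities $F_1 = Q_1 - 1.5\,IQR$ and $F_2 = Q_3 + 1.5\,IQR$. Any observation that is less than $F_1$ or greater than $F_2$ is an outlier.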
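To make the fence rule in step 4 concrete, here is a minimal sketch that applies it to a small unconditional sample. The vector x and the variable names are assumptions made for illustration; the example itself uses conditional quartiles estimated by a quantile random forest rather than sample quartiles.

```matlab
% Hypothetical sample with one unusually large value.
x = [2 4 4 5 6 7 8 9 30]';

% Sample quartiles (quantile is in Statistics and Machine Learning Toolbox).
q = quantile(x,[0.25 0.5 0.75]);
iqrX = q(3) - q(1);              % interquartile range

% Lower and upper fences.
F1 = q(1) - 1.5*iqrX;
F2 = q(3) + 1.5*iqrX;

% Observations outside [F1, F2] are flagged as outliers.
isOut = x < F1 | x > F2;
disp(x(isOut))
```

In the conditional setting, the same comparison is made, but the quartiles (and therefore the fences) change with the predictor value.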

Generate Data

Generate 500 observations from the model

$$y_t = 10 + 3t + t\sin(2t) + \varepsilon_t.$$

$t$ is uniformly distributed between 0 and $4\pi$, and $\varepsilon_t \sim N(0, t + 0.01)$. Store the data in a table.

    n = 500;
    rng('default'); % For reproducibility
    t = randsample(linspace(0,4*pi,1e6),n,true)';
    epsilon = randn(n,1).*sqrt((t+0.01));
    y = 10 + 3*t + t.*sin(2*t) + epsilon;
    Tbl = table(t,y);

Move five observations in a random vertical direction by 90% of the value of the response.

    numOut = 5;
    [~,idx] = datasample(Tbl,numOut);
    Tbl.y(idx) = Tbl.y(idx) + randsample([-1 1],numOut,true)'.*(0.9*Tbl.y(idx));
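The effect of this shift on a single response can be sketched as follows. The value y0 is hypothetical and chosen only for illustration: a shifted observation ends up at either 10% or 190% of its original response, depending on the randomly drawn direction.

```matlab
% Hypothetical response value, chosen only for illustration.
y0 = 20;

% The two possible directions: down (-1) or up (+1).
direction = [-1 1];

% Shift by 90% of the response, as in the simulation above.
shifted = y0 + direction.*(0.9*y0);

% One value is 0.1*y0 and the other is 1.9*y0.
disp(shifted)
```

Because the shift is proportional to the response, the simulated outliers at larger t tend to move farther from the bulk of the data.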

Draw a scatter plot of the data and identify the outliers.

    figure;
    plot(Tbl.t,Tbl.y,'.');
    hold on
    plot(Tbl.t(idx),Tbl.y(idx),'*');
    axis tight;
    ylabel('y');
    xlabel('t');
    title('Scatter Plot of Data');
    legend('Data','Simulated outliers','Location','NorthWest');

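Figure contains an axes object. The axes object with title Scatter Plot of Data contains 2 objects of type line. These objects represent Data, Simulated outliers.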

Grow Quantile Random Forest

Grow a bag of 200 regression trees using TreeBagger.

    Mdl = TreeBagger(200,Tbl,'y','Method','regression');

Mdl is a TreeBagger ensemble.

Predict Conditional Quartiles and Interquartile Ranges

Using quantile regression, estimate the conditional quartiles of 50 equally spaced values within the range of t.

    tau = [0.25 0.5 0.75];
    predT = linspace(0,4*pi,50)';
    quartiles = quantilePredict(Mdl,predT,'Quantile',tau);

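quartiles is a 50-by-3 matrix of conditional quartiles. Rows correspond to the values in predT, and columns correspond to the probabilities in tau.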
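The shape and ordering of the quantilePredict output can be checked with a short, self-contained sketch. To keep it quick to run, it makes some assumptions that differ from the example: a smaller sample (n = 200), fewer trees (50), and t drawn with rand instead of randsample. The variable isOrdered is introduced here for illustration only.

```matlab
rng('default');                       % for reproducibility
n = 200;                              % assumption: smaller than the example
t = 4*pi*rand(n,1);                   % assumption: rand instead of randsample
y = 10 + 3*t + t.*sin(2*t) + randn(n,1).*sqrt(t + 0.01);
Tbl = table(t,y);

% Assumption: 50 trees instead of 200, to reduce run time.
Mdl = TreeBagger(50,Tbl,'y','Method','regression');

tau = [0.25 0.5 0.75];
predT = linspace(0,4*pi,50)';
quartiles = quantilePredict(Mdl,predT,'Quantile',tau);

% One row per value in predT, one column per probability in tau.
disp(size(quartiles))

% Within each row, the quartiles should be nondecreasing: Q1 <= Q2 <= Q3.
isOrdered = all(all(diff(quartiles,1,2) >= 0));
disp(isOrdered)
```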

On the scatter plot of the data, plot the conditional mean and median responses.

    meanY = predict(Mdl,predT);
    plot(predT,[quartiles(:,2) meanY],'LineWidth',2);
    legend('Data','Simulated outliers','Median response','Mean response',...
        'Location','NorthWest');
    hold off;

Figure contains an axes object. The axes object with title Scatter Plot of Data contains 4 objects of type line. These objects represent Data, Simulated outliers, Median response, Mean response.

Although the conditional mean and median curves are close, the simulated outliers can affect the mean curve.

Compute the conditional $IQR$, $F_1$, and $F_2$.

    iqr = quartiles(:,3) - quartiles(:,1);
    k = 1.5;
    f1 = quartiles(:,1) - k*iqr;
    f2 = quartiles(:,3) + k*iqr;

k = 1.5 means that all observations less than f1 or greater than f2 are considered outliers, but this threshold does not distinguish ordinary outliers from extreme outliers. A k of 3 identifies extreme outliers.
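The difference between the two thresholds can be sketched on a small unconditional sample. The vector x and the names isOutlier and isExtreme are assumptions made for illustration; in this example the fences are computed from conditional quartiles instead.

```matlab
% Hypothetical sample with one moderate and one extreme value.
x = [2 4 4 5 6 7 8 9 20 30]';

% Sample quartiles and interquartile range.
q = quantile(x,[0.25 0.75]);
iqrX = q(2) - q(1);

% Observations outside the k = 1.5 fences.
isOutlier = x < q(1) - 1.5*iqrX | x > q(2) + 1.5*iqrX;

% Observations outside the k = 3 fences.
isExtreme = x < q(1) - 3*iqrX | x > q(2) + 3*iqrX;

disp(x(isOutlier & ~isExtreme))   % outliers that are not extreme
disp(x(isExtreme))                % extreme outliers
```

Every extreme outlier is also an outlier under k = 1.5, so the second threshold only subdivides the flagged set.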

Compare Observations to Fences

Plot the observations and the fences.

    figure;
    plot(Tbl.t,Tbl.y,'.');
    hold on
    plot(Tbl.t(idx),Tbl.y(idx),'*');
    plot(predT,[f1 f2]);
    legend('Data','Simulated outliers','F_1','F_2','Location','NorthWest');
    axis tight
    title('Outlier Detection Using Quantile Regression')
    hold off

Figure contains an axes object. The axes object with title Outlier Detection Using Quantile Regression contains 4 objects of type line. These objects represent Data, Simulated outliers, F_1, F_2.

All simulated outliers fall outside $[F_1, F_2]$, and some other observations fall outside this interval as well.
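The plot compares observations to the fences visually. One possible way to obtain a logical flag for each observation is to estimate the conditional quartiles at the observed values of t and apply the same rule. The following self-contained sketch does this under stated assumptions: t is drawn with rand, the perturbed rows are chosen with randperm, and the names q, iqrObs, isOut, and flaggedSimulated are introduced here for illustration. Because the quartiles are predicted in-sample, the flags may differ from what the plotted fences suggest; oobQuantilePredict on a model trained with 'OOBPrediction','on' is an alternative worth considering.

```matlab
rng('default');                       % for reproducibility
n = 500;
t = 4*pi*rand(n,1);                   % assumption: simplified sampling of t
y = 10 + 3*t + t.*sin(2*t) + randn(n,1).*sqrt(t + 0.01);

% Assumption: pick five rows with randperm and shift them by 90%.
idx = randperm(n,5)';
y(idx) = y(idx) + randsample([-1 1],5,true)'.*(0.9*y(idx));
Tbl = table(t,y);

Mdl = TreeBagger(200,Tbl,'y','Method','regression');

% Conditional first and third quartiles at each observed t.
q = quantilePredict(Mdl,Tbl.t,'Quantile',[0.25 0.75]);
iqrObs = q(:,2) - q(:,1);

% Flag observations outside the conditional fences.
k = 1.5;
isOut = Tbl.y < q(:,1) - k*iqrObs | Tbl.y > q(:,2) + k*iqrObs;

fprintf('%d of %d observations flagged\n',nnz(isOut),n);

% Flags for the simulated outliers only.
flaggedSimulated = isOut(idx);
disp(flaggedSimulated')
```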
