
Detect Outliers Using Quantile Regression

This example shows how to detect outliers using quantile random forest. Quantile random forest can detect outliers with respect to the conditional distribution of $Y$ given $X$. However, this method cannot detect outliers in the predictor data. For outlier detection in the predictor data using a bag of decision trees, see the OutlierMeasure property of a TreeBagger model.

An outlier is an observation that is located far enough from most of the other observations in a data set that it can be considered anomalous. Causes of outlying observations include inherent variability or measurement error. Outliers can significantly affect estimates and inference, so it is important to detect them and decide whether to remove them or consider a robust analysis.

Statistics and Machine Learning Toolbox™ provides several functions to detect outliers, including:

  • zscore - Compute z-scores of observations.

  • trimmean - Estimate mean of data, excluding outliers.

  • boxplot - Draw box plot of data.

  • probplot - Draw probability plot.

  • robustcov - Estimate robust covariance of multivariate data.

  • fitcsvm - Fit a one-class support vector machine (SVM) to determine which observations are located far from the decision boundary.

  • dbscan - Partition observations into clusters and identify outliers using the density-based spatial clustering of applications with noise (DBSCAN) algorithm.

Also, MATLAB® provides the isoutlier function, which finds outliers in data.
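As a quick illustration of isoutlier, the following sketch applies its 'quartiles' method to a small made-up vector. The values are hypothetical and chosen only for illustration. This method flags elements that lie more than 1.5 interquartile ranges above the upper quartile or below the lower quartile, which is the unconditional counterpart of the fence rule used in this example.

```matlab
% Hypothetical sample with one obviously extreme value.
x = [2 3 3 4 5 5 6 7 8 40];

% Flag elements more than 1.5*IQR outside the quartiles.
tf = isoutlier(x,'quartiles');

% Values flagged as outliers.
outlierValues = x(tf);
```

Because this rule ignores any predictor, it would not account for a response whose spread changes with t, which is the situation the quantile random forest is meant to handle.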

To demonstrate outlier detection, this example:

  1. Generates data from a nonlinear model with heteroscedasticity and simulates a few outliers.

  2. Grows a quantile random forest of regression trees.

  3. Estimates conditional quartiles ($Q_1$, $Q_2$, and $Q_3$) and the interquartile range ($IQR$) within the ranges of the predictor variables.

  4. Compares the observations to the fences, which are the quantities $F_1 = Q_1 - 1.5\,IQR$ and $F_2 = Q_3 + 1.5\,IQR$. Any observation that is less than $F_1$ or greater than $F_2$ is an outlier.
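The fence rule in the last step can be sketched on a single, unconditional sample before it is applied conditionally. The vector below is hypothetical and is not the data used in this example. In the example itself, the quartiles depend on t, so the fences become curves rather than two numbers.

```matlab
% Hypothetical univariate sample (not the data used in this example).
x = [1 2 3 4 5 6 7 8 9 50];

% Quartiles Q1 and Q3, and the interquartile range.
q = quantile(x,[0.25 0.75]);
iqrX = q(2) - q(1);

% Fences F1 and F2 with the conventional multiplier 1.5.
F1 = q(1) - 1.5*iqrX;
F2 = q(2) + 1.5*iqrX;

% Observations outside [F1, F2] are flagged as outliers.
isOut = x < F1 | x > F2;
flagged = x(isOut);   % only the value 50 lies outside the fences
```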

Generate Data

Generate 500 observations from the model

$y_t = 10 + 3t + t\sin(2t) + \varepsilon_t.$

$t$ is uniformly distributed between 0 and $4\pi$, and $\varepsilon_t \sim N(0, t + 0.01)$. Store the data in a table.

n = 500;
rng('default'); % For reproducibility
t = randsample(linspace(0,4*pi,1e6),n,true)';
epsilon = randn(n,1).*sqrt((t+0.01));
y = 10 + 3*t + t.*sin(2*t) + epsilon;
Tbl = table(t,y);
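The heteroscedasticity in this model comes from the noise variance $t + 0.01$, which grows with $t$. The following sketch checks this by comparing the sample standard deviation of the noise at small and large $t$. As an assumption made for brevity, it draws $t$ directly with rand instead of resampling a fine grid, so its random numbers differ from those in the example.

```matlab
rng('default');                        % for reproducibility
n = 500;
t = 4*pi*rand(n,1);                    % assumption: direct uniform draw on [0, 4*pi]
epsilon = randn(n,1).*sqrt(t + 0.01);  % noise with variance t + 0.01

% Compare the noise spread in the lowest and highest quarter of the range.
sdLow  = std(epsilon(t < pi));
sdHigh = std(epsilon(t > 3*pi));
```

A fixed, global threshold on the response would therefore be too loose for small $t$ and too strict for large $t$, which motivates fences that vary with $t$.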

Move five observations in a random vertical direction by 90% of the value of the response.

numOut = 5;
[~,idx] = datasample(Tbl,numOut);
Tbl.y(idx) = Tbl.y(idx) + randsample([-1 1],numOut,true)'.*(0.9*Tbl.y(idx));
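To make the size of this perturbation concrete, the sketch below applies it to a few hypothetical response values. Moving a response down by 90% leaves 10% of its value, and moving it up gives 190%. Note also that datasample samples with replacement by default, so idx could in principle contain a repeated row. The randperm call shows one way to draw distinct indices. It is an assumption about what might be wanted, not what the example does.

```matlab
% Hypothetical response values, for illustration only.
y = [20; 35; 50];

% Shift each value down or up by 90% of itself.
yDown = y - 0.9*y;   % 10% of the original values
yUp   = y + 0.9*y;   % 190% of the original values

% Assumption: draw 5 distinct row indices out of 500 without replacement.
rng('default');
idxDistinct = randperm(500,5);
```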

Draw a scatter plot of the data and identify the outliers.

figure;
plot(Tbl.t,Tbl.y,'.');
hold on
plot(Tbl.t(idx),Tbl.y(idx),'*');
axis tight;
ylabel('y');
xlabel('t');
title('Scatter Plot of Data');
legend('Data','Simulated outliers','Location','NorthWest');

[Figure: axes titled "Scatter Plot of Data" containing 2 line objects, representing Data and Simulated outliers.]

Grow Quantile Random Forest

Grow a bag of 200 regression trees using TreeBagger.

Mdl = TreeBagger(200,Tbl,'y','Method','regression');

Mdl is a TreeBagger ensemble.
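One way to judge whether the forest has enough trees is to look at the out-of-bag error as trees are added. The sketch below is not part of the original workflow. It assumes a simplified version of the data (t drawn with rand) and a smaller forest of 50 trees to keep it quick, and it turns on the 'OOBPrediction' option so that oobError can be used.

```matlab
rng('default');                        % for reproducibility
n = 500;
t = 4*pi*rand(n,1);                    % assumption: simplified version of the data
y = 10 + 3*t + t.*sin(2*t) + randn(n,1).*sqrt(t + 0.01);
Tbl = table(t,y);

% Store out-of-bag information while growing a smaller forest.
Mdl = TreeBagger(50,Tbl,'y','Method','regression','OOBPrediction','on');

% Cumulative out-of-bag mean squared error after 1, 2, ..., 50 trees.
oobMSE = oobError(Mdl);

figure;
plot(oobMSE);
xlabel('Number of grown trees');
ylabel('Out-of-bag mean squared error');
```

If the curve has flattened by the last tree, adding more trees is unlikely to change the predictions much.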

Predict Conditional Quartiles and Interquartile Ranges

Using quantile regression, estimate the conditional quartiles at 50 equally spaced values within the range of t.

tau = [0.25 0.5 0.75];
predT = linspace(0,4*pi,50)';
quartiles = quantilePredict(Mdl,predT,'Quantile',tau);

quartiles is a 50-by-3 matrix of conditional quartiles. Rows correspond to the values in predT, and columns correspond to the probabilities in tau.
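quantilePredict evaluates quantiles at whatever query points it is given. If quartiles at the training observations themselves are of interest, one option is oobQuantilePredict, which for each observation uses only the trees that did not see it during training. The sketch below is an assumption-laden illustration rather than part of this example: it uses simplified data, 50 trees, and requires the forest to be grown with 'OOBPrediction','on'.

```matlab
rng('default');                        % for reproducibility
n = 500;
t = 4*pi*rand(n,1);                    % assumption: simplified version of the data
y = 10 + 3*t + t.*sin(2*t) + randn(n,1).*sqrt(t + 0.01);
Tbl = table(t,y);

% Out-of-bag predictions need the out-of-bag information to be stored.
Mdl = TreeBagger(50,Tbl,'y','Method','regression','OOBPrediction','on');

tau = [0.25 0.5 0.75];

% One row per training observation, one column per probability in tau.
oobQuartiles = oobQuantilePredict(Mdl,'Quantile',tau);
```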

On the scatter plot of the data, plot the conditional mean and median responses.

meanY = predict(Mdl,predT);
plot(predT,[quartiles(:,2) meanY],'LineWidth',2);
legend('Data','Simulated outliers','Median response','Mean response',...
    'Location','NorthWest');
hold off;

[Figure: axes titled "Scatter Plot of Data" containing 4 line objects, representing Data, Simulated outliers, Median response, and Mean response.]

Although the conditional mean and median curves are close, the simulated outliers can affect the mean curve.
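The difference in sensitivity between the mean and the median can be seen on a tiny hypothetical sample, independent of the forest: replacing one value with an extreme one moves the mean substantially but leaves the median unchanged.

```matlab
% Hypothetical sample without outliers.
x = [1 2 3 4 5];
meanBefore   = mean(x);     % 3
medianBefore = median(x);   % 3

% Replace the largest value with an extreme one.
x(end) = 100;
meanAfter   = mean(x);      % 22
medianAfter = median(x);    % 3
```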

Compute the conditional $IQR$, $F_1$, and $F_2$.

iqr = quartiles(:,3) - quartiles(:,1);
k = 1.5;
f1 = quartiles(:,1) - k*iqr;
f2 = quartiles(:,3) + k*iqr;

Setting k = 1.5 means that all observations less than f1 or greater than f2 are considered outliers, but this threshold does not distinguish ordinary outliers from extreme outliers. A k of 3 identifies extreme outliers.
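The two choices of k can be combined to separate ordinary outliers from extreme ones. The following sketch uses hypothetical quartile values at a single value of t, not values estimated by the forest, and classifies three made-up responses against both sets of fences.

```matlab
% Hypothetical conditional quartiles at a single value of t.
q1 = 10;
q3 = 14;
iqrVal = q3 - q1;                            % 4

% Fences for k = 1.5 and for k = 3.
inner = [q1 - 1.5*iqrVal, q3 + 1.5*iqrVal];  % [4 20]
outer = [q1 - 3*iqrVal,   q3 + 3*iqrVal];    % [-2 26]

% Hypothetical responses observed at that t.
y = [12 21 30];

isOutlier = y < inner(1) | y > inner(2);     % [false true true]
isExtreme = y < outer(1) | y > outer(2);     % [false false true]
```

Here 21 is an outlier under k = 1.5 but not under k = 3, while 30 falls outside both sets of fences.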

Compare Observations to Fences

Plot the observations and the fences.

figure;
plot(Tbl.t,Tbl.y,'.');
hold on
plot(Tbl.t(idx),Tbl.y(idx),'*');
plot(predT,[f1 f2]);
legend('Data','Simulated outliers','F_1','F_2','Location','NorthWest');
axis tight
title('Outlier Detection Using Quantile Regression')
hold off

[Figure: axes titled "Outlier Detection Using Quantile Regression" containing 4 line objects, representing Data, Simulated outliers, F_1, and F_2.]

All simulated outliers fall outside $[F_1, F_2]$, and some other observations fall outside this interval as well.
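The comparison can also be done programmatically rather than by eye. Because the fences are computed on a grid of t values, one possible approach is to interpolate them at each observed t and compare. The sketch below uses hypothetical stand-ins for the grid, the fence curves, and the observations, so that it runs on its own. The straight-line fences are an assumption for illustration only.

```matlab
% Hypothetical stand-ins for the prediction grid and the fence curves.
predT = linspace(0,4*pi,50)';
f1 = 3*predT;           % assumed lower fence
f2 = 3*predT + 20;      % assumed upper fence

% Hypothetical observations (t, y).
t = [1; 5; 9; 12];
y = [10; 50; 20; 45];

% Interpolate the fences at each observed t, then compare.
lowerFence = interp1(predT,f1,t);
upperFence = interp1(predT,f2,t);
isOut = y < lowerFence | y > upperFence;

% Row indices of the flagged observations.
flaggedIdx = find(isOut);
```

With the variables from this example, the same comparison would use Tbl.t and Tbl.y in place of the hypothetical t and y, together with the computed predT, f1, and f2.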
