isoutlier
Find outliers in data
Syntax
Description
TF = isoutlier(A) returns a logical array whose elements are true when an outlier is detected in the corresponding element of A. By default, an outlier is a value that is more than three scaled median absolute deviations (MAD) away from the median. If A is a matrix or table, then isoutlier operates on each column separately. If A is a multidimensional array, then isoutlier operates along the first dimension whose size does not equal 1.
TF = isoutlier(A,movmethod,window) specifies a moving method for detecting local outliers according to a window length defined by window. For example, isoutlier(A,'movmedian',5) returns true for all elements more than three local scaled MAD from the local median within a sliding window containing five elements.
TF = isoutlier(___,Name,Value) specifies additional parameters for detecting outliers using one or more name-value pair arguments. For example, isoutlier(A,'SamplePoints',t) detects outliers in A relative to the corresponding elements of a time vector t.
Examples
Detect Outliers in Vector
Find the outliers in a vector of data. A logical 1 in the output indicates the location of an outlier.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
TF = isoutlier(A)
TF = 1x15 logical array
   0   0   0   1   0   0   0   0   1   0   0   0   0   0   0
Detect Outliers using Mean
Define outliers as points more than three standard deviations from the mean, and find the locations of outliers in a vector.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
TF = isoutlier(A,'mean')
TF = 1x15 logical array
   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0
Detect Outliers with Sliding Window
Create a vector of data containing a local outlier.
x = -2*pi:0.1:2*pi; A = sin(x); A(47) = 0;
Create a time vector that corresponds to the data in A.
t = datetime(2017,1,1,0,0,0) + hours(0:length(x)-1);
Define outliers as points more than three local scaled MAD away from the local median within a sliding window. Find the locations of the outliers in A relative to the points in t with a window size of 5 hours. Plot the data and detected outliers.
TF = isoutlier(A,'movmedian',hours(5),'SamplePoints',t);
plot(t,A,t(TF),A(TF),'x')
legend('Data','Outlier')
Matrix of Data
Find outliers for each row of a matrix.
Create a matrix of data containing outliers along the diagonal.
A = magic(5) + diag(200*ones(1,5))
A = 5×5
   217    24     1     8    15
    23   205     7    14    16
     4     6   213    20    22
    10    12    19   221     3
    11    18    25     2   209
Find the locations of outliers based on the data in each row.
TF = isoutlier(A,2)
TF = 5x5 logical array
   1   0   0   0   0
   0   1   0   0   0
   0   0   1   0   0
   0   0   0   1   0
   0   0   0   0   1
Compute Outlier Thresholds
Create a vector of data containing an outlier. Find and plot the location of the outlier, and the thresholds and center value determined by the outlier method. The center value is the median of the data, and the upper and lower thresholds are three scaled MAD above and below the median.
x = 1:10;
A = [60 59 49 49 58 100 61 57 48 58];
[TF,L,U,C] = isoutlier(A);
plot(x,A,x(TF),A(TF),'x',x,L*ones(1,10),x,U*ones(1,10),x,C*ones(1,10))
legend('Original Data','Outlier','Lower Threshold','Upper Threshold','Center Value')
Input Arguments
A
— Input data
vector | matrix | multidimensional array | table | timetable
Input data, specified as a vector, matrix, multidimensional array, table, or timetable.
If A is a table, then its variables must be of type double or single, or you can use the 'DataVariables' name-value pair to list double or single variables explicitly. Specifying variables is useful when you are working with a table that contains variables with data types other than double or single.
If A is a timetable, then isoutlier operates only on the table elements. Row times must be unique and listed in ascending order.
Data Types: double | single | table | timetable
method
— Method for detecting outliers
'median' (default) | 'mean' | 'quartiles' | 'grubbs' | 'gesd'
Method for detecting outliers, specified as one of the following:
Method | Description |
---|---|
'median' | Returns true for elements more than three scaled MAD from the median. The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)). |
'mean' | Returns true for elements more than three standard deviations from the mean. This method is faster but less robust than 'median'. |
'quartiles' | Returns true for elements more than 1.5 interquartile ranges above the upper quartile or below the lower quartile. This method is useful when the data in A is not normally distributed. |
'grubbs' | Applies Grubbs's test for outliers, which removes one outlier per iteration based on hypothesis testing. This method assumes that the data in A is normally distributed. |
'gesd' | Applies the generalized extreme Studentized deviate test for outliers. This iterative method is similar to 'grubbs', but can perform better when there are multiple outliers masking each other. |
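As a rough comparison of the methods in the table, the following sketch applies each one to the vector used in the examples above. The variable names are illustrative only. The indices for 'grubbs' and 'gesd' depend on their default threshold factor, so they are displayed rather than stated.

```matlab
% Same data as in the vector examples.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

tfMedian    = isoutlier(A);               % default method, same as 'median'
tfMean      = isoutlier(A,'mean');        % three standard deviations from the mean
tfQuartiles = isoutlier(A,'quartiles');   % 1.5 interquartile ranges beyond the quartiles
tfGrubbs    = isoutlier(A,'grubbs');      % Grubbs's test
tfGesd      = isoutlier(A,'gesd');        % generalized ESD test

% Indices flagged by each method.
idxMedian    = find(tfMedian)
idxMean      = find(tfMean)
idxQuartiles = find(tfQuartiles)
idxGrubbs    = find(tfGrubbs)
idxGesd      = find(tfGesd)
```

On this data the 'mean' method misses the value 100 because the extreme value 300 inflates the standard deviation, which illustrates why 'median' is described as the more robust choice.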
threshold
— Percentile thresholds
two-element row vector
Percentile thresholds, specified as a two-element row vector whose elements are in the interval [0,100]. The first element indicates the lower percentile threshold and the second element indicates the upper percentile threshold. For example, a threshold of [10 90] defines outliers as points below the 10th percentile and above the 90th percentile. The first element of threshold must be less than the second element.
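A minimal sketch of the percentile form, assuming the 'percentiles' method name that this page refers to elsewhere and reusing the example vector. The second and third outputs are requested only to inspect the computed thresholds.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Flag points below the 10th percentile or above the 90th percentile.
[TF,L,U] = isoutlier(A,'percentiles',[10 90]);

% L and U hold the lower and upper thresholds derived from the percentiles.
disp([L U])
find(TF)
```

Unlike the other methods, this one always uses the thresholds you give it, so the number of flagged points is driven by the percentiles rather than by how extreme the values are.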
movmethod
— Moving method
'movmedian' | 'movmean'
Moving method for detecting outliers, specified as one of the following:
Method | Description |
---|---|
'movmedian' | Returns true for elements more than three local scaled MAD from the local median over a window length specified by window. This method is also known as a Hampel filter. |
'movmean' | Returns true for elements more than three local standard deviations from the local mean over a window length specified by window. |
window
—Window length
positive integer scalar | two-element vector of positive integers | positive duration scalar | two-element vector of positive durations
Window length, specified as a positive integer scalar, a two-element vector of positive integers, a positive duration scalar, or a two-element vector of positive durations.
When window is a positive integer scalar, the window is centered about the current element and contains window-1 neighboring elements. If window is even, then the window is centered about the current and previous elements.
When window is a two-element vector of positive integers [b f], the window contains the current element, b elements backward, and f elements forward.
When A is a timetable or 'SamplePoints' is specified as a datetime or duration vector, then window must be of type duration, and the windows are computed relative to the sample points.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | duration
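The scalar and two-element forms of window can be compared with a short sketch. It reuses the sine data from the sliding-window example; the variable names are illustrative. A centered window of length 5 should cover the same elements as [2 2] (two backward, the current element, two forward), while [3 1] shifts the window toward earlier samples.

```matlab
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;                                      % inject a local outlier

tfCentered = isoutlier(A,'movmedian',5);        % centered, five elements
tfBackFwd  = isoutlier(A,'movmedian',[2 2]);    % 2 backward, current, 2 forward
tfShifted  = isoutlier(A,'movmedian',[3 1]);    % 3 backward, current, 1 forward

sameResult = isequal(tfCentered, tfBackFwd)
find(tfCentered)
find(tfShifted)
```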
dim
—Dimension to operate along
positive integer scalar
Dimension to operate along, specified as a positive integer scalar. If no value is specified, then the default is the first array dimension whose size does not equal 1.
Consider a matrix A.
isoutlier(A,1) detects outliers based on the data in each column of A.
isoutlier(A,2) detects outliers based on the data in each row of A.
When A is a table or timetable, dim is not supported. isoutlier operates along each table or timetable variable separately.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
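To make the role of dim concrete, this sketch checks that operating along rows is the same as transposing, operating along columns, and transposing back. It uses the matrix from the Matrix of Data example; the variable names are illustrative.

```matlab
A = magic(5) + diag(200*ones(1,5));

tfRows         = isoutlier(A,2);        % each row treated as one data set
tfViaTranspose = isoutlier(A.',1).';    % same computation through the columns of A.'

sameResult = isequal(tfRows, tfViaTranspose)
```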
Name-Value Arguments
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: isoutlier(A,'mean','ThresholdFactor',4)
SamplePoints
— Sample points
vector | table variable name | scalar | function handle | table vartype subscript
Sample points, specified as the comma-separated pair consisting of 'SamplePoints' and either a vector of sample point values or one of the options in the following table when the input data is a table. The sample points represent the x-axis locations of the data, and must be sorted and contain unique elements. Sample points do not need to be uniformly sampled. The vector [1 2 3 ...] is the default.
When the input data is a table, you can specify the sample points as a table variable using one of the following options.
Option for Table Input | Description |
---|---|
Variable name | A character vector or scalar string specifying a single table variable name |
Scalar variable index | A scalar table variable index |
Logical vector | A logical vector whose elements each correspond to a table variable |
Function handle | A function handle that takes a table variable as input and returns a logical scalar |
vartype subscript | A table subscript generated by the vartype function |
Note
This name-value pair is not supported when the input data is a timetable. Timetables always use the vector of row times as the sample points. To use different sample points, you must edit the timetable so that the row times contain the desired sample points.
Moving windows are defined relative to the sample points. For example, if t is a vector of times corresponding to the input data, then isoutlier(rand(1,10),'movmean',3,'SamplePoints',t) has a window that represents the time interval between t(i)-1.5 and t(i)+1.5.
When the sample points vector has data type datetime or duration, then the moving window length must have type duration.
Example: isoutlier(A,'SamplePoints',0:0.1:10)
Example: isoutlier(T,'SamplePoints',"Var1")
Data Types: single | double | datetime | duration
DataVariables
— Table variables to operate on
table variable name | scalar | vector | cell array | function handle | table vartype subscript
Table variables to operate on, specified as the comma-separated pair consisting of 'DataVariables' and one of the options in this table. The 'DataVariables' value indicates which variables of the input table to examine for outliers. The data type associated with the indicated variables must be double or single. Other variables in the table not specified by 'DataVariables' are not operated on, so the output contains false values for those variables.
Option | Description |
---|---|
Variable name | A character vector or scalar string specifying a single table variable name |
Vector of variable names | A cell array of character vectors or string array where each element is a table variable name |
Scalar or vector of variable indices | A scalar or vector of table variable indices |
Logical vector | A logical vector whose elements each correspond to a table variable |
Function handle | A function handle that takes a table variable as input and returns a logical scalar |
vartype subscript | A table subscript generated by the vartype function |
Example: isoutlier(T,'DataVariables',["Var1" "Var2" "Var4"])
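The following sketch shows 'DataVariables' on a small table that mixes numeric and text variables. The table, its variable names (Temp, Press, Site), and the values are hypothetical and chosen only for illustration.

```matlab
% Hypothetical measurements: two numeric variables and one text variable.
Temp  = [20.1; 20.3; 19.9; 45.0; 20.2; 20.0];
Press = [1012; 1013; 1011; 1012; 1014; 1013];
Site  = ["a"; "b"; "c"; "d"; "e"; "f"];
T = table(Temp, Press, Site);

% Examine only the numeric variables; Site is not operated on.
TF = isoutlier(T,'DataVariables',["Temp" "Press"]);

% One column of TF per table variable; the Site column stays false.
disp(TF)
```

Listing the numeric variables explicitly is what lets the call accept a table that also holds a non-numeric variable.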
ThresholdFactor
— Detection threshold factor
nonnegative scalar
Detection threshold factor, specified as the comma-separated pair consisting of 'ThresholdFactor' and a nonnegative scalar.
For methods 'median' and 'movmedian', the detection threshold factor replaces the number of scaled MAD, which is 3 by default.
For methods 'mean' and 'movmean', the detection threshold factor replaces the number of standard deviations from the mean, which is 3 by default.
For methods 'grubbs' and 'gesd', the detection threshold factor is a scalar ranging from 0 to 1. Values close to 0 result in a smaller number of outliers and values close to 1 result in a larger number of outliers. The default detection threshold factor is 0.05.
For the 'quartiles' method, the detection threshold factor replaces the number of interquartile ranges, which is 1.5 by default.
This name-value pair is not supported when the specified method is 'percentiles'.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
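A short sketch of how 'ThresholdFactor' changes the result, reusing the example vector. The factor values 20 and 4 are arbitrary choices for illustration.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

tfDefault = isoutlier(A);                             % 3 scaled MAD from the median
tfLoose   = isoutlier(A,'ThresholdFactor',20);        % 20 scaled MAD from the median
tfMean4   = isoutlier(A,'mean','ThresholdFactor',4);  % 4 standard deviations from the mean

find(tfDefault)
find(tfLoose)
find(tfMean4)
```

Raising the factor widens the band around the center value, so fewer points are flagged. With the 'mean' method and a factor of 4, even the value 300 falls inside the band because it inflates the standard deviation itself.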
MaxNumOutliers
— Maximum outlier count
positive integer
Maximum outlier count, for the 'gesd' method only, specified as the comma-separated pair consisting of 'MaxNumOutliers' and a positive integer. The 'MaxNumOutliers' value specifies the maximum number of outliers returned by the 'gesd' method. For example, isoutlier(A,'gesd','MaxNumOutliers',5) returns no more than five outliers.
The default value for 'MaxNumOutliers' is the integer nearest to 10 percent of the number of elements in A. Setting a larger value for the maximum number of outliers can ensure that all outliers are detected, but at the cost of reduced computational efficiency.
The 'gesd' method assumes the non-outlier input data is sampled from an approximate normal distribution. When the data is not sampled in this way, the number of returned outliers might exceed the 'MaxNumOutliers' value.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
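As an illustration only, the sketch below runs the 'gesd' method on the example vector with two different caps. The exact set of flagged elements depends on the test's default threshold factor, so the indices are displayed rather than stated.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

tfOne  = isoutlier(A,'gesd','MaxNumOutliers',1);   % test for at most one outlier
tfFive = isoutlier(A,'gesd','MaxNumOutliers',5);   % test for up to five outliers

find(tfOne)
find(tfFive)
```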
Output Arguments
TF
— Outlier indicator
vector | matrix | multidimensional array
Outlier indicator, returned as a vector, matrix, or multidimensional array. An element of TF is true when the corresponding element of A is an outlier and false otherwise. TF is the same size as A.
Data Types: logical
L
— Lower threshold
scalar | vector | matrix | multidimensional array | table | timetable
Lower threshold used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the lower value of the default outlier detection method is three scaled MAD below the median of the input data. L has the same size as A in all dimensions except for the operating dimension, where the length is 1.
Data Types: double | single | table | timetable
U
— Upper threshold
scalar | vector | matrix | multidimensional array | table | timetable
Upper threshold used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the upper value of the default outlier detection method is three scaled MAD above the median of the input data. U has the same size as A in all dimensions except for the operating dimension, where the length is 1.
Data Types: double | single | table | timetable
C
— Center value
scalar | vector | matrix | multidimensional array | table | timetable
Center value used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the center value of the default outlier detection method is the median of the input data. C has the same size as A in all dimensions except for the operating dimension, where the length is 1.
Data Types: double | single | table | timetable
More About
Median Absolute Deviation
For a random variable vector A made up of N scalar observations, the median absolute deviation (MAD) is defined as
$$\mathrm{MAD} = \operatorname{median}\left(\left|A_i - \operatorname{median}(A)\right|\right)$$
for i = 1,2,...,N.
The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)).
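The definition can be checked numerically. The sketch below computes the scaled MAD by hand for the vector from the Compute Outlier Thresholds example, builds the thresholds for the default method (median plus or minus three scaled MAD), and compares them with the outputs of isoutlier. The names ending in Manual are illustrative.

```matlab
A = [60 59 49 49 58 100 61 57 48 58];

% Scaled MAD from the formula above; c is approximately 1.4826.
c = -1/(sqrt(2)*erfcinv(3/2));
scaledMAD = c*median(abs(A - median(A)));

% Thresholds for the default method: median +/- three scaled MAD.
centerManual = median(A);
lowerManual  = centerManual - 3*scaledMAD;
upperManual  = centerManual + 3*scaledMAD;
tfManual     = A < lowerManual | A > upperManual;

% Compare with the thresholds and center value returned by isoutlier.
[TF,L,U,C] = isoutlier(A);
disp([lowerManual L; upperManual U; centerManual C])
sameResult = isequal(TF, tfManual)
```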
Extended Capabilities
Tall Arrays
Calculate with arrays that have more rows than fit in memory.
Usage notes and limitations:
- The 'percentiles', 'grubbs', and 'gesd' methods are not supported.
- The 'movmedian' and 'movmean' methods do not support tall timetables.
- The 'SamplePoints' and 'MaxNumOutliers' name-value pairs are not supported.
- The value of 'DataVariables' cannot be a function handle.
- Computation of isoutlier(A), isoutlier(A,'median',...), or isoutlier(A,'quartiles',...) along the first dimension is only supported for tall column vectors A.
For more information, see Tall Arrays.
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
Usage notes and limitations:
- The 'movmean' and 'movmedian' methods for detecting outliers do not support timetable input data, datetime 'SamplePoints' values, or duration 'SamplePoints' values.
- String and character array inputs must be constant.
Thread-Based Environment
Run code in the background using MATLAB® backgroundPool or accelerate code with Parallel Computing Toolbox™ ThreadPool.
This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.
GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
Usage notes and limitations:
- The 'movmedian' moving method is not supported.
- The 'SamplePoints' and 'DataVariables' name-value pairs are not supported.
For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).