Histograms of Tall Arrays
This example shows how to use histogram and histogram2 to analyze and visualize data contained in a tall array.
Create Tall Table
Create a datastore using the airlinesmall.csv data set. Treat 'NA' values as missing data so that they are replaced with NaN values. Select a subset of the variables to work with. Convert the datastore into a tall table.
varnames = {'ArrDelay','DepDelay','Year','Month'};
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA',...
    'SelectedVariableNames',varnames);
T = tall(ds)
T =

  Mx4 tall table

    ArrDelay    DepDelay    Year    Month
    ________    ________    ____    _____

        8          12       1987     10
        8           1       1987     10
       21          20       1987     10
       13          12       1987     10
        4          -1       1987     10
       59          63       1987     10
        3          -2       1987     10
       11          -1       1987     10
       :           :         :        :
       :           :         :        :
Plot Histogram of Arrival Delays
Plot a histogram of the ArrDelay variable to examine the frequency distribution of arrival delays.
h = histogram(T.ArrDelay);
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.87 sec
- Pass 2 of 2: Completed in 0.71 sec
Evaluation completed in 2.2 sec
title('Flight arrival delays, 1987 - 2008')
xlabel('Arrival Delay (minutes)')
ylabel('Frequency')
The arrival delay is most frequently a small number near 0, so these values dominate the plot and make it difficult to see other details.
Adjust Bin Limits of Histogram
Restrict the histogram bin limits to plot only arrival delays between -50 and 150 minutes. After you create a histogram object from a tall array, you cannot change any properties that would require recomputing the bins, including BinWidth and BinLimits. Also, you cannot use morebins or fewerbins to adjust the number of bins. In these cases, use histogram to reconstruct the histogram from the raw data in the tall array.
figure
histogram(T.ArrDelay,'BinLimits',[-50,150])
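Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.51 sec
- Pass 2 of 2: Completed in 0.37 sec
Evaluation completed in 1.3 sec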
title('Flight arrival delays between -50 and 150 minutes, 1987 - 2008')
xlabel('Arrival Delay (minutes)')
ylabel('Frequency')
From this plot, it appears that long delays might be more common than initially expected. To investigate further, find the probability of an arrival delay that is one hour or greater.
Probability of Delays One Hour or Greater
The original histogram returned an object h that contains the bin values in the Values property and the bin edges in the BinEdges property. You can use these properties to perform in-memory calculations.
Determine which bins contain arrival delays of one hour (60 minutes) or more. Remove the last bin edge from the logical index vector so that it is the same length as the vector of bin values.
idx = h.BinEdges >= 60;
idx(end) = [];
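The reason for dropping the last element is that a histogram with n bins has n+1 edges, so a logical vector built from the edges is one element too long to index the bin values. The following is a minimal sketch with hypothetical edges and counts (the variables edges, values, selected, and total are illustrative and do not come from the airline data):

```matlab
% Hypothetical bin edges and counts, chosen only for illustration.
edges  = [0 30 60 90 120];   % 5 edges define 4 bins
values = [5 3 2 1];          % one count per bin

idx = edges >= 60;           % logical index over the edges: [0 0 1 1 1]
idx(end) = [];               % drop the last edge:           [0 0 1 1]

selected = values(idx);      % counts of the bins whose left edge is >= 60
total = sum(selected);       % number of samples in those bins
```

After the last edge is removed, each remaining element of idx corresponds to the left edge of one bin, so it can index the vector of bin values directly.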
Use idx to retrieve the value associated with each selected bin. Add the bin values together, divide by the total number of samples, and multiply by 100 to determine the overall probability of a delay greater than or equal to one hour. Since the total number of samples is computed from the original data set, use gather to explicitly evaluate the calculation and return an in-memory scalar.
N = numel(T.ArrDelay); P = gather(sum(h.Values(idx))*100/N)
P = 4.4809
Overall, the chance of an arrival delay of one hour or longer is about 4.5%.
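As a rough cross-check of this approach, the histogram-based percentage can be compared with a direct count on a small in-memory vector. This sketch uses hypothetical delay values and hand-picked bin edges rather than the tall array. It assumes that 60 falls exactly on a bin edge, which is what makes the two results agree, and it counts missing values in the denominator, as numel does above:

```matlab
% Hypothetical delays in minutes, including one missing value.
delays = [-5 0 12 61 75 NaN 130 45 60 3];

% Histogram-based estimate, with 60 chosen as an explicit bin edge.
[counts, edges] = histcounts(delays, -30:30:150);
idx = edges >= 60;
idx(end) = [];                               % match the length of counts
pHist = sum(counts(idx))*100/numel(delays);

% Direct estimate: NaN >= 60 is false, so the missing value is
% counted in the denominator but never as a long delay.
pDirect = mean(delays >= 60)*100;
```

If 60 were not a bin edge, the bin that straddles 60 would be excluded from the histogram-based sum and the two estimates could differ.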
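Plot Bivariate Histogram of Delays by Month
Plot a bivariate histogram of the arrival delays that are 60 minutes or longer by month. This plot examines how seasonality affects arrival delay.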
figure
h2 = histogram2(T.Month,T.ArrDelay,[12 50],'YBinLimits',[60 1100],...
    'Normalization','probability','FaceColor','flat');
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0.71 sec
Evaluation completed in 0.87 sec
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0.79 sec
Evaluation completed in 0.86 sec
title('Probability of arrival delays 1 hour or greater (by month)')
xlabel('Month (1-12)')
ylabel('Arrival Delay (minutes)')
zlabel('Probability')
xticks(1:12)
view(-126,23)
Delay Statistics by Month
Use the bivariate histogram object to calculate the probability of having an arrival delay one hour or greater in each month, and the mean arrival delay for each month. Put the results in a table with the variable P containing the probability information and the variable MeanByMonth containing the mean arrival delay.
monthNames = {'Jan','Feb','Mar','Apr','May','Jun',...
    'Jul','Aug','Sep','Oct','Nov','Dec'}';
G = findgroups(T.Month);
M = splitapply(@(x) mean(x,'omitnan'),T.ArrDelay,G);
delayByMonth = table(monthNames, sum(h2.Values,2)*100, gather(M), ...
    'VariableNames',{'Month','P','MeanByMonth'})
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.41 sec
- Pass 2 of 2: Completed in 0.99 sec
Evaluation completed in 2 sec
delayByMonth=12×3 table
     Month       P       MeanByMonth
    _______    ______    ___________

    {'Jan'}    9.6497      8.5954
    {'Feb'}    7.7058      7.3275
    {'Mar'}    9.0543      7.5536
    {'Apr'}    7.2504      6.0081
    {'May'}    7.4256      5.2949
    {'Jun'}     10.35      10.264
    {'Jul'}    10.228      8.7797
    {'Aug'}    8.5989      7.4522
    {'Sep'}    5.4116      3.6308
    {'Oct'}     6.042      4.6059
    {'Nov'}    6.9002      5.2835
    {'Dec'}    11.384      10.571
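The results indicate that flights in the holiday month of December have an 11.4% chance of being delayed longer than an hour, but are delayed by 10.5 minutes on average. June and July follow, with about a 10% chance of a delay of one hour or more and an average delay of roughly 9 to 10 minutes.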
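The grouping step behind the MeanByMonth column can be tried on a small in-memory array before running it on tall data. This is a minimal sketch with hypothetical month and delay values (the variables month and delay are illustrative only). It shows how findgroups assigns a group number to each distinct month and how splitapply computes a per-group mean that ignores missing values:

```matlab
% Hypothetical data: two flights in month 1, three in month 2.
month = [1 1 2 2 2]';
delay = [10 NaN 5 7 9]';     % one missing arrival delay

G = findgroups(month);       % group numbers: [1 1 2 2 2]'

% Mean delay per group, skipping NaN values within each group.
M = splitapply(@(x) mean(x,'omitnan'), delay, G);
```

Without the 'omitnan' flag, the mean for any month that contains a missing value would be NaN. With tall inputs the same calls are evaluated lazily, which is why the result is wrapped in gather in the table construction above.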