Read and Analyze Large Tabular Text File
This example shows how to create a datastore for a large text file containing tabular data, and then read and process the data one block at a time or one file at a time.
Create a Datastore
Create a datastore from the sample file airlinesmall.csv using the tabularTextDatastore function. When you create the datastore, you can specify that the text NA in the data is treated as missing data.
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
After you create the datastore, you can modify its properties. Set the MissingValue property to specify that missing values are treated as 0.
ds.MissingValue = 0;
In this example, select the variable for the arrival delay, ArrDelay, as the variable of interest.
ds.SelectedVariableNames = 'ArrDelay';
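Before selecting a variable, it can help to check which names the datastore detected in the file. The following is a minimal sketch that only reads properties of ds, so it should leave the selection and the read position unchanged; allVars is a name introduced here for illustration.

```matlab
% List every variable name the datastore detected in the file
allVars = ds.VariableNames;
disp(allVars(:))

% Show the subset of variables that read and preview will return
disp(ds.SelectedVariableNames)
```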
Preview the data using the preview function. This function does not affect the state of the datastore.
data = preview(ds)
data=8×1 table
    ArrDelay
    ________

        8
        8
       21
       13
        4
       59
        3
       11
Read Subsets of Data
By default, read reads from a TabularTextDatastore 20000 rows at a time. To read a different number of rows in each call to read, modify the ReadSize property of ds.
ds.ReadSize = 15000;
Read subsets of the data from ds using the read function in a while loop. The loop executes until hasdata(ds) returns false.
sums = [];
counts = [];
while hasdata(ds)
    T = read(ds);
    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end
Compute the average arrival delay.
avgArrivalDelay = sum(sums)/sum(counts)
avgArrivalDelay = 6.9670
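Because MissingValue is set to 0, the average above counts each missing arrival delay as a delay of zero. If the mean over recorded delays only is wanted, one possible variation is sketched below. It uses a second datastore, ds2 (a name introduced here for illustration), and assumes that MissingValue keeps its default of NaN when it is not set explicitly. It also keeps running totals instead of growing arrays.

```matlab
% Sketch: average arrival delay over non-missing entries only.
% Assumption: MissingValue is left at its default (NaN), so entries
% that were 'NA' in the file come back as NaN.
ds2 = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds2.SelectedVariableNames = 'ArrDelay';

total = 0;   % running sum of recorded delays
n = 0;       % running count of recorded delays
while hasdata(ds2)
    T = read(ds2);
    valid = ~isnan(T.ArrDelay);            % true where a delay was recorded
    total = total + sum(T.ArrDelay(valid));
    n = n + nnz(valid);
end
avgNonMissing = total/n
```

This value generally differs from avgArrivalDelay, because the rows with missing delays no longer contribute to the count.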
Reset the datastore to allow rereading of the data.
reset(ds)
Read One File at a Time
A datastore can contain multiple files, each with a different number of rows. You can read from the datastore one complete file at a time by setting the ReadSize property to 'file'.
ds.ReadSize = 'file';
When you change the value of ReadSize from a number to 'file' or vice versa, MATLAB® resets the datastore.
Read from ds using the read function in a while loop, as before, and compute the average arrival delay.
sums = [];
counts = [];
while hasdata(ds)
    T = read(ds);
    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end
avgArrivalDelay = sum(sums)/sum(counts)
avgArrivalDelay = 6.9670
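The explicit loop can also be replaced by a tall array built on the same datastore, which leaves the block-by-block reading to MATLAB. The following is a minimal sketch rather than part of the original example; tt and avgTall are names introduced here, and if Parallel Computing Toolbox is installed, gather may start a parallel pool.

```matlab
% Sketch: compute the same average with a tall array.
tt = tall(ds);                 % tall table backed by the datastore
avgTall = mean(tt.ArrDelay);   % deferred; nothing is read yet
avgTall = gather(avgTall)      % triggers the pass over the data
```

With MissingValue still set to 0, this should agree with the loop-based average, since both treat missing delays as zero.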
See Also
tabularTextDatastore | tall | mapreduce