Read and Analyze Large Tabular Text File

This example shows how to create a datastore for a large text file containing tabular data, and then read and process the data one block at a time or one file at a time.

Create a Datastore

Create a datastore from the sample file airlinesmall.csv using the tabularTextDatastore function. When you create the datastore, you can specify that the text NA in the data is treated as missing data.

ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');

After you create the datastore, you can change how it reads data by setting its properties. Set the MissingValue property to specify that missing values are treated as 0.

ds.MissingValue = 0;
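Replacing missing values with 0 means those rows still count toward any later average, which pulls the result toward zero. The sketch below illustrates the effect on a small hypothetical vector (the values are made up, not taken from airlinesmall.csv) and contrasts it with leaving the values as NaN and skipping them.

```matlab
% Hypothetical delays; NaN marks a missing entry (illustrative values only).
delays = [8; NaN; 21; 13];

% Approach used in this example: missing entries become 0 and are counted.
zeroFilled = delays;
zeroFilled(isnan(zeroFilled)) = 0;
meanZeroFill = sum(zeroFilled)/numel(zeroFilled);

% Alternative: keep NaN and exclude missing entries from both sum and count.
meanOmitMissing = sum(delays,'omitnan')/nnz(~isnan(delays));
```

Which behavior is appropriate depends on whether a missing delay should be interpreted as "no delay" or as "unknown".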

In this example, select the variable for the arrival delay, ArrDelay, as the variable of interest.

ds.SelectedVariableNames = 'ArrDelay';

Preview the data using the preview function. This function does not affect the state of the datastore.

data = preview(ds)

data=8×1 table
    ArrDelay
    ________

        8
        8
       21
       13
        4
       59
        3
       11

Read Subsets of Data

By default, read reads from a TabularTextDatastore 20000 rows at a time. To read a different number of rows in each call to read, modify the ReadSize property of ds.

ds.ReadSize = 15000;

Read subsets of the data from ds using the read function in a while loop. The loop executes until hasdata(ds) returns false.

sums = [];
counts = [];
while hasdata(ds)
    T = read(ds);
    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end

Compute the average arrival delay.

avgArrivalDelay = sum(sums)/sum(counts)
avgArrivalDelay = 6.9670
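The average is computed from the pooled sums and counts rather than from per-block means because blocks can differ in length (the last block is typically shorter than ReadSize). A minimal sketch with two hypothetical blocks, using made-up values rather than data from airlinesmall.csv, shows the difference:

```matlab
% Two hypothetical blocks of unequal length (illustrative values only).
block1 = [10; 20; 30; 40];   % 4 rows
block2 = [100; 200];         % 2 rows, like a short final block

sums   = [sum(block1), sum(block2)];
counts = [numel(block1), numel(block2)];

pooledMean  = sum(sums)/sum(counts);    % weights every row equally
meanOfMeans = mean(sums ./ counts);     % weights every block equally
overallMean = mean([block1; block2]);   % reference: mean over all rows at once
```

Here pooledMean agrees with overallMean, while meanOfMeans gives the short block as much weight as the long one.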

Reset the datastore to allow rereading of the data.

reset(ds)

Read One File at a Time

A datastore can contain multiple files, each with a different number of rows. You can read from the datastore one complete file at a time by setting the ReadSize property to 'file'.

ds.ReadSize = 'file';

When you change the value of ReadSize from a number to 'file' or vice versa, MATLAB® resets the datastore.
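Because airlinesmall.csv is a single file, the effect of 'file' is easier to see on a datastore that spans several files. The following sketch is an illustration under assumptions: the folder, the file names part1.csv and part2.csv, and their contents are all hypothetical and are created in a temporary location only for this demonstration.

```matlab
% Write two small hypothetical CSV files of different lengths to a temporary folder.
folder = tempname;
mkdir(folder);
writetable(table([1; 2; 3],'VariableNames',{'ArrDelay'}),fullfile(folder,'part1.csv'));
writetable(table([4; 5],'VariableNames',{'ArrDelay'}),fullfile(folder,'part2.csv'));

% Create one datastore over every file in the folder and read one whole file per call.
dsDemo = tabularTextDatastore(folder);
dsDemo.ReadSize = 'file';

rowsPerRead = [];
while hasdata(dsDemo)
    T = read(dsDemo);
    rowsPerRead(end+1) = height(T);   % number of rows returned by this call
end
```

After the loop, rowsPerRead holds one entry per file, so each call to read returned a table whose height matches that file's row count.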

Read from ds using the read function in a while loop, as before, and compute the average arrival delay.

sums = [];
counts = [];
while hasdata(ds)
    T = read(ds);
    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end

avgArrivalDelay = sum(sums)/sum(counts)
avgArrivalDelay = 6.9670
