
Getting Started with Datastore

What Is a Datastore?

A datastore is an object for reading a single file or a collection of files or data. The datastore acts as a repository for data that has the same structure and formatting. For example, each file in a datastore must contain data of the same type (such as numeric or text) appearing in the same order, and separated by the same delimiter.

A datastore is useful when:

  • Each file in the collection might be too large to fit in memory. A datastore allows you to read and analyze data from each file in smaller portions that do fit in memory.

  • Files in the collection have arbitrary names. A datastore acts as a repository for files in one or more folders. The files are not required to have sequential names.
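As a rough illustration of the second point, the sketch below builds a datastore over a folder whose files have unrelated names. The temporary folder, the file names, the table contents, and the variable name dsFolder are all hypothetical and exist only for this example.

```matlab
% Create a temporary folder holding two CSV files with unrelated names.
% Folder, file names, and values are hypothetical illustration data.
folder = tempname;
mkdir(folder);
writetable(table([1;2], [10;20], 'VariableNames', {'Id','Value'}), ...
    fullfile(folder, 'alpha.csv'));
writetable(table([3;4], [30;40], 'VariableNames', {'Id','Value'}), ...
    fullfile(folder, 'results_final.csv'));

% One datastore covers every supported file in the folder.
dsFolder = tabularTextDatastore(folder);
numFiles = numel(dsFolder.Files);

% Read the files one at a time instead of loading the whole collection.
dsFolder.ReadSize = 'file';
rowCount = 0;
while hasdata(dsFolder)
    T = read(dsFolder);
    rowCount = rowCount + height(T);
end

% Remove the temporary files.
rmdir(folder, 's');
```

Both files share the same variables in the same order, which is what lets a single datastore treat them as one collection.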

You can create a datastore based on the type of data or application. The different types of datastores contain properties pertinent to the type of data that they support. For example, see the following table for a list of MATLAB® datastores. For a complete list of datastores, see Select Datastore for File Format or Application.

Type of File or Data | Datastore Type
Text files containing column-oriented data, including CSV files. | TabularTextDatastore
Image files, including formats that are supported by imread such as JPEG and PNG. | ImageDatastore
Spreadsheet files with a supported Excel® format such as .xlsx. | SpreadsheetDatastore
Key-value pair data that are inputs to or outputs of mapreduce. | KeyValueDatastore
Parquet files containing column-oriented data. | ParquetDatastore
Custom file formats. Requires a provided function for reading data. | FileDatastore
Datastore for checkpointing tall arrays. | TallDatastore
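For custom file formats, the read function is supplied by the caller. The following is a minimal sketch, assuming MAT-files that each store a matrix named A and using load as the read function; the temporary folder, the file names, and the variable names fds and totals are hypothetical.

```matlab
% Write two small MAT-files into a temporary folder (illustration data only).
folder = tempname;
mkdir(folder);
A = magic(3);
save(fullfile(folder, 'first.mat'), 'A');
A = magic(4);
save(fullfile(folder, 'second.mat'), 'A');

% fileDatastore needs a function handle that knows how to read one file.
fds = fileDatastore(folder, 'ReadFcn', @load, 'FileExtensions', '.mat');

% Each call to read returns whatever the read function returns for one
% file, here the struct produced by load.
totals = [];
while hasdata(fds)
    S = read(fds);
    totals(end+1) = sum(S.A(:));
end

% Remove the temporary files.
rmdir(folder, 's');
```

Any function that accepts a file name and returns the file's data could stand in for load here.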

Create and Read from a Datastore

Use the tabularTextDatastore function to create a datastore from the sample file airlinesmall.csv, which contains departure and arrival information about individual airline flights. The result is a TabularTextDatastore object.

ds = tabularTextDatastore('airlinesmall.csv')
ds = 
  TabularTextDatastore with properties:

                      Files: {
                             ' ...\matlab\toolbox\matlab\demos\airlinesmall.csv'
                             }
                    Folders: {
                             ' ...\matlab\toolbox\matlab\demos'
                             }
               FileEncoding: 'UTF-8'
   AlternateFileSystemRoots: {}
      PreserveVariableNames: false
          ReadVariableNames: true
              VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
             DatetimeLocale: en_US

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: ''
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%f', '%f', '%f' ... and 26 more}
                   TextType: 'char'
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
            SelectedFormats: {'%f', '%f', '%f' ... and 26 more}
                   ReadSize: 20000 rows
                 OutputType: 'table'
                   RowTimes: []

  Write-specific Properties:
     SupportedOutputFormats: ["txt"    "csv"    "xlsx"    "xls"    "parquet"    "parq"]
        DefaultOutputFormat: "txt"

After creating the datastore, you can preview the data without having to load it all into memory. You can specify variables (columns) of interest using the SelectedVariableNames property to preview or read only those variables.

ds.SelectedVariableNames = {'DepTime','DepDelay'};
preview(ds)

ans = 8×2 table

    DepTime    DepDelay
    _______    ________

      642         12
     1021          1
     2055         20
     1332         12
      629         -1
     1446         63
      928         -2
      859         -1

You can specify the values in your data that represent missing values. In airlinesmall.csv, missing values are represented by NA.

ds.TreatAsMissing = 'NA';

If all of the data in the datastore for the variables of interest fits in memory, you can read it using the readall function.

T = readall(ds);

Otherwise, use the read function to read the data in smaller subsets that do fit in memory. By default, the read function reads from a TabularTextDatastore 20,000 rows at a time. However, you can change this value by assigning a new value to the ReadSize property.

ds.ReadSize = 15000;
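The effect of ReadSize can be checked directly. This short sketch uses a separate datastore variable, dsDemo (a name chosen only for this example), so that the state of ds is left untouched, and assumes the sample file has more rows than the requested chunk size.

```matlab
% Separate datastore so the read position of ds is not affected.
dsDemo = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
dsDemo.SelectedVariableNames = {'DepTime','DepDelay'};
dsDemo.ReadSize = 15000;

% A single call to read returns one chunk of at most ReadSize rows.
chunk = read(dsDemo);
chunkRows = height(chunk);

% More data remains after the first chunk has been consumed.
moreData = hasdata(dsDemo);
```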

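Before re-reading the data, reset the datastore to its initial state using the reset function. By calling the read function within a while loop, you can perform intermediate calculations on each subset of data, and then aggregate the intermediate results at the end. This code calculates the maximum value of the DepDelay variable.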

reset(ds)
X = [];
while hasdata(ds)
    T = read(ds);
    X(end+1) = max(T.DepDelay);
end
maxDelay = max(X)

maxDelay = 1438
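Not every statistic can be aggregated by applying the same function to the per-chunk results. A mean, for instance, needs a running sum and a running count, because chunks can contain different numbers of valid values. The sketch below shows one way this might look; the variable names dsMean, total, and count are chosen only for this example.

```matlab
% Separate datastore so the state of ds is not affected.
dsMean = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
dsMean.SelectedVariableNames = {'DepDelay'};
dsMean.ReadSize = 15000;

% Intermediate results: running sum and running count of valid values.
total = 0;
count = 0;
while hasdata(dsMean)
    T = read(dsMean);
    valid = ~isnan(T.DepDelay);          % skip missing values
    total = total + sum(T.DepDelay(valid));
    count = count + nnz(valid);
end

% Aggregate the intermediate results once, at the end.
meanDelay = total / count;
```

Averaging the per-chunk means instead would weight a short final chunk the same as a full one.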

If the data in each individual file fits in memory, you can specify that each call to read should read one complete file rather than a specific number of rows.

reset(ds)
ds.ReadSize = 'file';
X = [];
while hasdata(ds)
    T = read(ds);
    X(end+1) = max(T.DepDelay);
end
maxDelay = max(X);

In addition to reading subsets of data in a datastore, you can apply map and reduce functions to the datastore using mapreduce or create a tall array using tall. For more information, see Getting Started with MapReduce and Tall Arrays for Out-of-Memory Data.
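As a minimal sketch of the tall array route, the maximum departure delay can be computed without writing the read loop by hand. The variable names dsTall and tt are chosen only for this example.

```matlab
% Separate datastore so the state of ds is not affected.
dsTall = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
dsTall.SelectedVariableNames = {'DepDelay'};

% A tall table is backed by the datastore; operations on it are deferred.
tt = tall(dsTall);

% gather triggers the evaluation and brings the result into memory.
maxDelay = gather(max(tt.DepDelay));
```

The calculation is deferred until gather is called, at which point the data is processed in chunks, much like the explicit while loop.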
