Getting Started with Datastore

What Is a Datastore?

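A datastore is an object for reading a single file or a collection of files. The datastore acts as a repository for data that has the same structure and formatting. For example, each file in a datastore must contain data of the same type (such as numeric or text), appearing in the same order, and separated by the same delimiter.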

A datastore is useful when:

  • Each file in the collection might be too large to fit in memory. A datastore allows you to read and analyze data from each file in smaller portions that do fit in memory.

  • Files in the collection have arbitrary names. A datastore acts as a repository for files in one or more folders. The files are not required to have sequential names.

You can create a datastore based on the type of data or application. The different types of datastores contain properties pertinent to the type of data that they support. For example, see the following table for a list of MATLAB® datastores. For a complete list of datastores, see Select Datastore for File Format or Application.

Type of File or Data | Datastore Type
Text files containing column-oriented data, including CSV files. | TabularTextDatastore
Image files, including formats that are supported by imread, such as JPEG and PNG. | ImageDatastore
Spreadsheet files with a supported Excel® format such as .xlsx. | SpreadsheetDatastore
Key-value pair data that are inputs to or outputs of mapreduce. | KeyValueDatastore
Parquet files containing column-oriented data. | ParquetDatastore
Custom file formats. Requires a provided function for reading data. | FileDatastore
Datastore for checkpointing tall arrays. | TallDatastore

Create and Read from a Datastore

Use the tabularTextDatastore function to create a datastore from the sample file airlinesmall.csv, which contains departure and arrival information about individual airline flights. The result is a TabularTextDatastore object.

ds = tabularTextDatastore('airlinesmall.csv')
ds =
  TabularTextDatastore with properties:

                      Files: {
                             ' ...\matlab\toolbox\matlab\demos\airlinesmall.csv'
                             }
                    Folders: {
                             ' ...\matlab\toolbox\matlab\demos'
                             }
               FileEncoding: 'UTF-8'
   AlternateFileSystemRoots: {}
      PreserveVariableNames: false
          ReadVariableNames: true
              VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
             DatetimeLocale: en_US

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: ''
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%f', '%f', '%f' ... and 26 more}
                   TextType: 'char'
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
            SelectedFormats: {'%f', '%f', '%f' ... and 26 more}
                   ReadSize: 20000 rows
                 OutputType: 'table'
                   RowTimes: []

  Write-specific Properties:
     SupportedOutputFormats: ["txt"    "csv"    "xlsx"    "xls"    "parquet"    "parq"]
        DefaultOutputFormat: "txt"

After creating the datastore, you can preview the data without having to load it all into memory. You can specify variables (columns) of interest using the SelectedVariableNames property to preview or read only those variables.

ds.SelectedVariableNames = {'DepTime','DepDelay'};
preview(ds)

ans = 8×2 table
    DepTime    DepDelay
    _______    ________
      642         12
     1021          1
     2055         20
     1332         12
      629         -1
     1446         63
      928         -2
      859         -1

You can specify the values in your data that represent missing values. In airlinesmall.csv, missing values are represented by NA.

ds.TreatAsMissing = 'NA';
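As a minimal sketch of the effect of TreatAsMissing, the following code builds a small, hypothetical CSV file in a temporary folder (the file contents and the names dsDemo and Tdemo are assumptions for illustration, not part of the airline sample). The column formats are set explicitly through TextscanFormats so that the result does not depend on automatic format detection.

```matlab
% Hypothetical three-row sample file, written to a temporary location.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'DepTime,DepDelay\n642,12\n1021,NA\n2055,20\n');
fclose(fid);

% Treat the text NA as missing and read both columns as numeric.
dsDemo = tabularTextDatastore(fname, ...
    'TreatAsMissing', 'NA', ...
    'TextscanFormats', {'%f', '%f'});
Tdemo = readall(dsDemo);

% Entries recorded as NA are returned as NaN (the default MissingValue).
numMissing = sum(isnan(Tdemo.DepDelay));

delete(fname);   % clean up the temporary file
```

Because missing entries come back as NaN, later reductions over such a column generally need to account for them, for example with the 'omitnan' option of functions such as sum or mean.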

If all of the data in the datastore for the variables of interest fits in memory, you can read it using the readall function.

T = readall(ds);

Otherwise, read the data in smaller subsets that do fit in memory, using the read function. By default, the read function reads from a TabularTextDatastore 20,000 rows at a time. However, you can change this value by assigning a new value to the ReadSize property.

ds.ReadSize = 15000;

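Before re-reading, reset the datastore to its initial state using the reset function. By calling the read function within a while loop, you can perform intermediate calculations on each subset of data, and then aggregate the intermediate results at the end. This code calculates the maximum value of the DepDelay variable.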

reset(ds)
X = [];
while hasdata(ds)
    T = read(ds);
    X(end+1) = max(T.DepDelay);
end
maxDelay = max(X)

maxDelay = 1438
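The same read-in-a-loop pattern extends to statistics that cannot be combined by simply reapplying the function to the per-chunk results. A mean, for instance, needs a running sum and a running count. The following is a sketch under stated assumptions: the data is a hypothetical ten-row file with no missing values, and the names dsDemo, total, and count are illustrative only.

```matlab
% Hypothetical data: the values 1 through 10 in a temporary CSV file.
fname = [tempname '.csv'];
writetable(table((1:10)', 'VariableNames', {'DepDelay'}), fname);

dsDemo = tabularTextDatastore(fname);
dsDemo.ReadSize = 4;              % read at most 4 rows per call

total = 0;                        % running sum across chunks
count = 0;                        % running number of rows across chunks
while hasdata(dsDemo)
    T = read(dsDemo);
    total = total + sum(T.DepDelay);
    count = count + height(T);
end
meanDelay = total / count;        % combine intermediate results at the end

delete(fname);                    % clean up the temporary file
```

Averaging the per-chunk means instead would weight a short final chunk the same as a full one, which is why the sketch carries the sum and the count separately.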

If the data in each individual file fits in memory, you can specify that each call to read should read one complete file rather than a specific number of rows.

reset(ds)
ds.ReadSize = 'file';
X = [];
while hasdata(ds)
    T = read(ds);
    X(end+1) = max(T.DepDelay);
end
maxDelay = max(X);
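Reading one file per call is easiest to see with a datastore that spans several files. This sketch assumes a hypothetical temporary folder containing two small CSV files with the same layout (the file names a.csv and b.csv and the values are made up for illustration).

```matlab
% Hypothetical folder with two CSV files that share one column layout.
folder = tempname;
mkdir(folder);
writetable(table([5; 30], 'VariableNames', {'DepDelay'}), ...
    fullfile(folder, 'a.csv'));
writetable(table([12; 7; 45], 'VariableNames', {'DepDelay'}), ...
    fullfile(folder, 'b.csv'));

% A single datastore covers every supported text file in the folder.
dsDemo = tabularTextDatastore(folder);
dsDemo.ReadSize = 'file';         % each read returns one complete file

perFileMax = [];
while hasdata(dsDemo)
    T = read(dsDemo);             % T holds all rows of the current file
    perFileMax(end+1) = max(T.DepDelay);
end
overallMax = max(perFileMax);

rmdir(folder, 's');               % remove the temporary folder and its files
```

With this setting the loop runs once per file, so perFileMax has one entry for each file in the datastore.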

In addition to reading subsets of data in a datastore, you can apply map and reduce functions to the datastore using mapreduce or create a tall array using tall. For more information, see Getting Started with MapReduce and Tall Arrays for Out-of-Memory Data.
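As a brief sketch of the tall array route, the following code computes a maximum without an explicit read loop. It assumes a hypothetical five-row temporary file, and it calls mapreducer(0) so that the evaluation runs in the local MATLAB session rather than on a parallel pool; both choices are illustrative assumptions.

```matlab
% Hypothetical data in a temporary CSV file.
fname = [tempname '.csv'];
writetable(table([5; 30; 12; 7; 45], 'VariableNames', {'DepDelay'}), fname);

mapreducer(0);                         % evaluate tall arrays in the local session
dsDemo = tabularTextDatastore(fname);
tt = tall(dsDemo);                     % tall table backed by the datastore

% Operations on tall arrays are deferred; gather triggers the evaluation.
maxDelay = gather(max(tt.DepDelay));

delete(fname);                         % clean up the temporary file
```

The tall array handles the chunked reading internally, so the code reads like an in-memory computation even when the underlying data does not fit in memory.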
