Loren on the Art of MATLAB

Turn ideas into MATLAB

Note

Loren on the Art of MATLAB has been archived and will not be updated.

Using memmapfile to Navigate through “Big Data” Binary Files

This week, Ken Atwell from MATLAB product management weighs in on using memmapfile as a way to navigate through binary files of "big data".

memmapfile (for "memory-mapped file") is used to access binary files without needing to resort to low-level file I/O functions like fread. It includes an ability to declare the structure of your binary data, freely mixing data types and sizes. Originally targeted at easing the reading of lists of records, memmapfile also has application in big data. Today's post will examine column-wise access of big binary files, and how to navigate through the metadata that sometimes sits at the beginning of binary files.

Experiment Parameters

To get started, create a potentially large 2D matrix that is stored on disk. numRows and numColumns can be changed to experiment with different sizes. To keep things simple and snappy here, the matrix is under a gigabyte in size. This is hardly "big data", and you can adjust the parameters here to create a larger problem. Do note that the disk space required to run this code will grow with the matrix size you create.

scratchFolder = tempdir;
numRows = 1e5;
numColumns = 1e3;

Create Test File

Create the scratch file. This can take from a moment to many minutes to run, depending on the sizes declared above. Because data of type double is being created, the file will consume 8*numRows*numColumns bytes of free disk space.

The value of [r,c] in the matrix is set to be c*1,000,000 + r. This will make it easy to glance at our output and recognize that we are getting the values that are expected.

filename = ['mmf' int2str(numRows) 'x' int2str(numColumns) '.dat'];
filename = fullfile(scratchFolder, filename);
f = fopen(filename, 'w');
for colNum = 1:numColumns
    column = (1:numRows)' + colNum*1000000;
    fwrite(f, column, 'double');
end
fclose(f);
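As a quick sanity check on the 8*numRows*numColumns figure, the size on disk can be compared against the expected byte count. The following is a minimal sketch under stated assumptions: the small sizes and the file name mmf_size_check.dat are illustrative choices, not values from the post.

```matlab
% Small, illustrative sizes (assumptions, not the post's values)
numRows = 1000;
numColumns = 20;
filename = fullfile(tempdir, 'mmf_size_check.dat');  % hypothetical file name

% Write the matrix one column at a time, in the same way as the post
f = fopen(filename, 'w');
for colNum = 1:numColumns
    column = (1:numRows)' + colNum*1000000;
    fwrite(f, column, 'double');
end
fclose(f);

% Each double occupies 8 bytes, so compare against 8*numRows*numColumns
info = dir(filename);
expectedBytes = 8*numRows*numColumns;
fprintf('Expected %d bytes, found %d bytes\n', expectedBytes, info.bytes);

% Remove the scratch file again
delete(filename);
```

If the two numbers differ, the file was most likely written with a different precision than 'double'.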

memmapfile for Entire Data Set

To create a memory-mapped file, we call memmapfile with these two arguments:

  1. The filename containing the data
  2. The 'Format' of the data, which is a cell array with three components: a. the data type (double in this example), b. the size of the data (a matrix of size numRows by numColumns in this example), and c. a name to assign to this data (m for "matrix" in this example)

This is basic usage of memmapfile, and it encapsulates the entire data set in a single access. When working with "big data", you will want to avoid singular accesses like this. If the size of the data is large enough, your computer may become unresponsive ("thrash") as it busily creates swap space in an effort to read in the entire matrix. The if statement is here to prevent you from doing this accidentally. If you are experimenting with data sizes larger than the physical memory available in your computer, you will want to skip this step.

% Prevent a memory-busting matrix from being created.
if numRows*numColumns*8 > 1e9
    error('Size possibly too big; are you sure you want to do this?')
end
mm = memmapfile(filename, 'Format', {'double', [numRows numColumns], 'm'});
m = mm.Data.m; %#ok

Regardless, clear m to free up whatever memory was used.

clear('m');

memmapfile with Columnwise Access

Here is a smarter way to access the big data a column at a time. Instead of creating a single variable that is numRows-by-numColumns large, we create a numRows-by-1 vector, which is repeated numColumns times (note this code is now using the optional 'Repeat' argument to memmapfile). This subtle difference allows the big matrix to be read in one column at a time, presumably staying within available memory. The variable is named mj to indicate the j-th column of data.

mm = memmapfile(filename, 'Format', {'double', [numRows 1], 'mj'}, ...
    'Repeat', numColumns);

The code spot-checks the 17th column.

if ~isequal(mm.Data(17).mj, (1:numRows)' + 17*1000000)
    error('The data was not read back in correctly!');
end

memmapfile allows for creative uses of 'Repeat' if your application needs it. For example, rather than a vector of an entire column, you can read in blocks of half a column:

memmapfile(filename, 'Format', {'double', [numRows/2 1], 'mj'}, 'Repeat', numColumns*2);
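Beyond spot-checking one column, the same mapping lends itself to a loop that touches one column at a time. Below is a minimal sketch that computes per-column means this way. It assumes small illustrative sizes and a hypothetical file name (mmf_column_loop.dat) so that it runs quickly on its own.

```matlab
% Small, illustrative sizes (assumptions, not the post's values)
numRows = 1000;
numColumns = 20;
filename = fullfile(tempdir, 'mmf_column_loop.dat');  % hypothetical file name

% Write a test file with the same value pattern as the post: c*1,000,000 + r
f = fopen(filename, 'w');
for colNum = 1:numColumns
    fwrite(f, (1:numRows)' + colNum*1000000, 'double');
end
fclose(f);

% Map the file as numColumns records, each a numRows-by-1 vector
mm = memmapfile(filename, 'Format', {'double', [numRows 1], 'mj'}, ...
    'Repeat', numColumns);

% Only one column is copied into the workspace per iteration
colMeans = zeros(1, numColumns);
for j = 1:numColumns
    colMeans(j) = mean(mm.Data(j).mj);
end

% Column j holds j*1,000,000 + (1:numRows), so its mean is known in advance
expectedMeans = (1:numColumns)*1000000 + (numRows + 1)/2;
maxError = max(abs(colMeans - expectedMeans));
fprintf('Largest deviation from the expected means: %g\n', maxError);

% Release the mapping before deleting the scratch file
clear mm
delete(filename);
```

Clearing mm before calling delete matters mostly on Windows, where a file that is still mapped can refuse to be deleted.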

or blocks containing multiple columns:

memmapfile(filename, 'Format', {'double', [numRows*10 1], 'mj'}, 'Repeat', numColumns/10);

Of course, first ensure that your data's size is evenly divisible by these multiples, or you will create a memmapfile that does not accurately reflect the actual file that underlies it.

A note about memory-mapped files and virtual memory: If your application loops over many columns of memory-mapped data, you may find that memory usage as reported by the Windows Task Manager or the OS X Activity Monitor will begin to climb. This can be a little misleading. While memmapfile will consume sections of your computer's virtual memory space (only of practical consequence if you are still using a 32-bit version of MATLAB), physical memory (RAM) will not be used. The assignment of m above has the potential to fail only because that operation is pulling the contents of the entire memmapfile into a workspace variable, and workspace variables (including ans) reside in RAM. A comprehensive discussion of virtual memory is beyond the scope of this blog; the Wikipedia article on virtual memory is a starting point if you want to learn more.
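One way to guard against a mismatched block size is to derive the repeat count from the file's actual size and refuse to continue if it does not divide evenly. The sketch below assumes small illustrative sizes, a hypothetical file name (mmf_block_check.dat), and a helper variable blockRows that is not part of the post.

```matlab
% Small, illustrative sizes (assumptions, not the post's values)
numRows = 1000;
numColumns = 20;
filename = fullfile(tempdir, 'mmf_block_check.dat');  % hypothetical file name

% Write numRows*numColumns doubles so there is a file to inspect
f = fopen(filename, 'w');
fwrite(f, zeros(numRows*numColumns, 1), 'double');
fclose(f);

% Candidate block length: half a column (blockRows is a hypothetical name)
blockRows = numRows/2;
bytesPerBlock = 8*blockRows;   % 8 bytes per double

% Refuse to map the file if it is not a whole number of blocks
info = dir(filename);
if mod(info.bytes, bytesPerBlock) ~= 0
    error('File size is not a whole number of %d-element blocks.', blockRows);
end
numBlocks = info.bytes/bytesPerBlock;

% Map the file using the repeat count derived from its size
mm = memmapfile(filename, 'Format', {'double', [blockRows 1], 'mj'}, ...
    'Repeat', numBlocks);
numMapped = numel(mm.Data);
fprintf('Mapped %d blocks of %d elements each\n', numMapped, blockRows);

% Release the mapping before deleting the scratch file
clear mm
delete(filename);
```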

Data File with XML Header

The above code assumes that the matrix appears at the very beginning of the data file. However, a number of data files begin with some form of metadata, followed by the "payload", the data itself.

For this blog, a file with some metadata followed by the "real" data will be created. The metadata is expressed using XML-style formatting. This particular format was created for this post, but it is representative of actual metadata. Typically, the metadata indicates an offset into the file where the actual data begins, which is expressed here in the headerLength attribute in the first line of the header. What follows next is a var element to declare the name, type, and size of the variable contained in the file. This file will contain only one variable, but conceptually the file could contain multiple variables.

strNumC = int2str(numColumns);
strNumR = int2str(numRows);
header = [...
    '<datFile headerLength=00000000>' char(10) ...
    '  <var name="mj" type="double" size="' strNumR ',' strNumC '"/>' char(10) ...
    '</datFile>' char(10) ...
    ];
% Insert header length
header = strrep(header, '00000000', sprintf('%08.0f', length(header)));
disp(header)

filename = ['mmf' int2str(numRows) 'x' int2str(numColumns) '_header.dat'];
filename = fullfile(scratchFolder, filename);
f = fopen(filename, 'w');
fwrite(f, header, 'char');
for colNum = 1:numColumns
    column = (1:numRows)' + colNum*1000000;
    fwrite(f, column, 'double');
end
fclose(f);

Read XML Header

The header will now be read back in and parsed. While xmlread could be used to get a DOM node to traverse the XML data structure, regular expressions can often be used as a quick and dirty way to scrape information from XML. If you are unfamiliar with regular expressions, it is sufficient for this example just to understand that:

  • (\d+) extracts a string of digits
  • (\w+) extracts a word (an alphanumeric string)
  • \s+ skips over whitespace
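The three patterns above can be tried out on a small sample string before pointing them at a real file. In this sketch the sample line is an assumption, written to match the name/type/size attributes that the post's regular expression looks for.

```matlab
% Sample metadata line (an assumption, modeled on the attributes used in the post)
sample = '<var name="mj" type="double" size="100000,1000"/>';

% (\d+) captures a run of digits: here, the two comma-separated sizes
sizeTok = regexp(sample, 'size="(\d+),(\d+)"', 'tokens');

% (\w+) captures a word, and \s+ skips the whitespace between attributes
pairTok = regexp(sample, 'name="(\w+)"\s+type="(\w+)"', 'tokens');

% With 'tokens', regexp returns a cell array of matches,
% and each match is itself a cell array of captured strings
numRowsRead = str2double(sizeTok{1}{1});
numColumnsRead = str2double(sizeTok{1}{2});
varName = pairTok{1}{1};
varType = pairTok{1}{2};
fprintf('%s is a %s matrix of size %d by %d\n', ...
    varName, varType, numRowsRead, numColumnsRead);
```

The nested indexing ({1}{1}, {1}{2}, and so on) is the same pattern used below when the tokens are pulled out of the real header.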

The first line of the file is read to determine the length of the header (extracted by a regular expression), and then the full header is read using this information. Finally, a second, more complex regular expression is used to extract the name, type, and size information for the variable contained in the binary data "blob" that follows the header.

f = fopen(filename, 'r');
firstLine = fgetl(f);
fclose(f);
firstLine %#ok

firstLine =
<datFile headerLength=00000095>

% Get the length and convert the string to a double
headerLength = regexp(firstLine, 'headerLength=(\d+)', 'tokens');
headerLength = (str2double(headerLength{1}{1})) %#ok

headerLength =
    95

f = fopen(filename, 'r');
header = fread(f, headerLength, 'char=>char')';
fclose(f);
% Scan the metadata for type, size, and name
vars = regexp(header, 'name="(\w+)"\s+type="(\w+)"\s+size="(\d+),(\d+)"', ...
    'tokens');

Create the Memory-mapped File

Lastly, create a memmapfile for the variable. The cell array returned by regexp is transformed into a new cell array that matches the expected input arguments to the memmapfile function.

% Reorganize the data from XML into the form expected by memmapfile
mmfFormater = { ...
    'Format', ...
    {vars{1}{2}, ...
     [str2double(vars{1}{3}), 1], ...
     vars{1}{1}}, ...
    'Repeat', str2double(vars{1}{4})};
mm = memmapfile(filename, 'Offset', headerLength, mmfFormater{:});
mj = mm.Data(17).mj;
% Check the 17th column
if ~isequal(mj, (1:numRows)' + 17*1000000)
    error('The matrix ''mj'' was not read in correctly!');
end

Conclusion

I hope this blog will be useful to those readers struggling to import big blocks of binary data into MATLAB. Though not covered in this post, memmapfile can also be used to load row-major data, and 2D "tiles" of data.
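As a rough illustration of the row-major case, one possible approach is to treat each row as a record and repeat it numRows times. This is only a sketch under stated assumptions: the file name mmf_row_major.dat, the field name mi, and the small sizes are all hypothetical, and this is not necessarily the approach the author had in mind.

```matlab
% Small, illustrative sizes (assumptions, not the post's values)
numRows = 50;
numColumns = 8;
filename = fullfile(tempdir, 'mmf_row_major.dat');  % hypothetical file name

% Write the matrix one ROW at a time, so the file is row-major.
% Element [r,c] is still c*1,000,000 + r, as in the post.
f = fopen(filename, 'w');
for rowNum = 1:numRows
    fwrite(f, (1:numColumns)*1000000 + rowNum, 'double');
end
fclose(f);

% Each record is one row, stored as a numColumns-by-1 vector
% ('mi' is a hypothetical field name for "row i")
mm = memmapfile(filename, 'Format', {'double', [numColumns 1], 'mi'}, ...
    'Repeat', numRows);

% Read the 17th row and transpose it back into a row vector
row17 = mm.Data(17).mi';
rowIsCorrect = isequal(row17, (1:numColumns)*1000000 + 17);
fprintf('Row 17 read back correctly: %d\n', rowIsCorrect);

% Release the mapping before deleting the scratch file
clear mm
delete(filename);
```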

When you are done experimenting, remember to delete the scratch files you have been creating.
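A cleanup step might look like the following minimal sketch. It assumes the default numRows and numColumns from this post and rebuilds the two scratch file names from them; adjust the values if you experimented with other sizes.

```matlab
% Sizes used in this post (change these if you experimented with others)
numRows = 1e5;
numColumns = 1e3;

% Rebuild the names of the two scratch files created above
base = ['mmf' int2str(numRows) 'x' int2str(numColumns)];
scratchFiles = { ...
    fullfile(tempdir, [base '.dat']), ...
    fullfile(tempdir, [base '_header.dat'])};

% Release any mapping first; this has no effect if mm does not exist
clear mm

% Delete each scratch file that is still present
for k = 1:numel(scratchFiles)
    if exist(scratchFiles{k}, 'file') == 2
        delete(scratchFiles{k});
    end
end
```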

Have you used memmapfile or some other technique to incrementally read from large binary files? Share your tips here!




Published with MATLAB® R2013a

