Deep Learning with Big Data

Typically, training deep neural networks requires large amounts of data that often do not fit in memory. You do not need multiple computers to solve problems using data sets too large to fit in memory. Instead, you can divide your training data into mini-batches that contain a portion of the data set. By iterating over the mini-batches, networks can learn from large data sets without needing to load all data into memory at once.

If your data is too large to fit in memory, use a datastore to work with mini-batches of data for training and inference. MATLAB® provides many different types of datastores tailored for different applications. For more information, see Datastores for Deep Learning.
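
For example, a minimal sketch of reading a large image collection in mini-batches (the folder path is hypothetical):

    % Create a datastore that points to image files on disk (path is hypothetical).
    imds = imageDatastore("data/trainingImages", ...
        "IncludeSubfolders",true, ...
        "LabelSource","foldernames");

    % Read 32 images per call instead of loading the whole data set.
    imds.ReadSize = 32;
    while hasdata(imds)
        batch = read(imds);   % cell array of 32 images (fewer on the last call)
        % ... preprocess or train on this mini-batch ...
    end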

augmentedImageDatastore is specifically designed to preprocess and augment batches of image data for machine learning and computer vision applications. For an example showing how to use augmentedImageDatastore to manage image data during training, see Train Network with Augmented Images.
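
For instance, a sketch of on-the-fly resizing and augmentation (the 224-by-224 output size and augmentation ranges are illustrative, and imds is the datastore from the previous example):

    % Define random augmentations to apply during training (ranges are illustrative).
    augmenter = imageDataAugmenter( ...
        "RandRotation",[-20 20], ...
        "RandXReflection",true);

    % Resize each mini-batch to the network input size and apply the augmentations.
    augimds = augmentedImageDatastore([224 224], imds, ...
        "DataAugmentation",augmenter);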

Work with Big Data in Parallel

If you want to use large amounts of data to train a network, it can be helpful to train in parallel. Doing so can reduce the time it takes to train a network, because you can train using multiple mini-batches at the same time.

Training using a GPU or multiple GPUs is recommended. Use a single CPU or multiple CPUs only if you do not have a GPU. CPUs are normally much slower than GPUs for both training and inference, and running on a single GPU typically offers much better performance than running on multiple CPU cores.
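
For example, a sketch of selecting the hardware through the trainingOptions function (the "sgdm" solver is an arbitrary choice):

    % Train on all available local GPUs. Other values include "gpu", "cpu",
    % and "parallel".
    options = trainingOptions("sgdm", ...
        "ExecutionEnvironment","multi-gpu");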

For more information about training in parallel, see Scale Up Deep Learning in Parallel, on GPUs, and in the Cloud.

Preprocess Data in Background

When you train in parallel, you can fetch and preprocess your data in the background. This can be particularly useful if you want to preprocess your mini-batches during training, such as when using the transform function to apply a mini-batch preprocessing function to your datastore.
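
For example, a sketch of attaching a preprocessing function to a datastore (preprocessMiniBatch is a hypothetical helper you supply):

    % Apply a custom function to each batch read from the underlying datastore.
    % preprocessMiniBatch is a hypothetical user-defined function.
    tds = transform(imds, @preprocessMiniBatch);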

When you train a network using the trainNetwork function, you can fetch and preprocess data in the background by enabling background dispatch.
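
A sketch, assuming augimds is a datastore that supports background dispatch (such as an augmentedImageDatastore) and layers is a previously defined layer array:

    % Enable background dispatch on the datastore and in the training options.
    augimds.DispatchInBackground = true;

    options = trainingOptions("sgdm", ...
        "ExecutionEnvironment","parallel", ...
        "DispatchInBackground",true);

    net = trainNetwork(augimds, layers, options);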

During training, some workers are used for preprocessing data instead of network training computations. You can fine-tune the training computation and data dispatch loads between workers by specifying the 'WorkerLoad' name-value argument of the trainingOptions function. For advanced options, you can try modifying the number of workers in the parallel pool.
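
For instance, a sketch for a hypothetical pool of four workers in which the fourth worker does no training computations:

    % Relative training loads for four workers. With background dispatch
    % enabled, a worker with load 0 is free to fetch and preprocess data
    % instead of training (assumption about the exact scheduling behavior).
    options = trainingOptions("sgdm", ...
        "ExecutionEnvironment","parallel", ...
        "DispatchInBackground",true, ...
        "WorkerLoad",[1 1 1 0]);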

You can use a built-in mini-batch datastore, such as augmentedImageDatastore, denoisingImageDatastore (Image Processing Toolbox), or pixelLabelImageDatastore (Computer Vision Toolbox). You can also use a custom mini-batch datastore with background dispatch enabled. For more information on creating custom mini-batch datastores, see Develop Custom Mini-Batch Datastore.

For more information about datastore requirements for background dispatching, see Use Datastore for Parallel Training and Background Dispatching.

Work with Big Data in the Cloud

Storing data in the cloud can make it easier to access from cloud applications without needing to upload or download large amounts of data each time you create cloud resources. Both AWS® and Azure® offer data storage services, such as Amazon S3 and Azure Blob Storage, respectively.

To avoid the time and cost associated with transferring large quantities of data, it is recommended that you set up cloud resources for your deep learning applications using the same cloud provider and region that you use to store your data in the cloud.

To access data stored in the cloud from MATLAB, you must configure your machine with your access credentials. You can configure access from inside MATLAB using environment variables. For more information on how to set environment variables to access cloud data from your client MATLAB, see Work with Remote Data. For more information on how to set environment variables on parallel workers in a remote cluster, see Set Environment Variables on Workers (Parallel Computing Toolbox).
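
For example, a sketch of configuring Amazon S3 access from the client MATLAB (the credential values and bucket name are placeholders):

    % Set AWS credentials as environment variables (placeholder values).
    setenv("AWS_ACCESS_KEY_ID","YOUR_ACCESS_KEY_ID");
    setenv("AWS_SECRET_ACCESS_KEY","YOUR_SECRET_ACCESS_KEY");
    setenv("AWS_DEFAULT_REGION","us-east-1");

    % Datastores accept s3:// URLs directly (bucket name is hypothetical).
    imds = imageDatastore("s3://my-training-data/images", ...
        "IncludeSubfolders",true, ...
        "LabelSource","foldernames");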

For an example showing how to upload data to the cloud, see Upload Deep Learning Data to the Cloud.

For more information about deep learning in the cloud, see Deep Learning in the Cloud.

Preprocess Data for Custom Training Loops

When you train a network using a custom training loop, you can process your data in the background by using minibatchqueue and enabling background dispatch. A minibatchqueue object iterates over a datastore to prepare mini-batches for custom training loops. Enable background dispatch when your mini-batches require heavy preprocessing.

To enable background dispatch, you must:

  • Set the DispatchInBackground property of the datastore to true.

  • Set the DispatchInBackground property of the minibatchqueue to true.

When you use this option, MATLAB opens a local parallel pool to use for preprocessing your data. Data preprocessing for custom training loops is supported only when training using local resources. For example, use this option when training using a single GPU on your local machine.
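
A sketch combining both settings (the mini-batch size, the number of outputs, and the preprocessMiniBatch helper are illustrative):

    % Enable background dispatch on the datastore...
    augimds.DispatchInBackground = true;

    % ...and on the minibatchqueue that feeds the custom training loop.
    % preprocessMiniBatch is a hypothetical function returning predictors
    % and targets, hence the two outputs.
    mbq = minibatchqueue(augimds, 2, ...
        "MiniBatchSize",128, ...
        "DispatchInBackground",true, ...
        "MiniBatchFcn",@preprocessMiniBatch);

    % Inside the training loop, draw mini-batches as usual:
    % while hasdata(mbq)
    %     [X,T] = next(mbq);
    %     ...
    % end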

For more information about datastore requirements for background dispatching, see Use Datastore for Parallel Training and Background Dispatching.
