Deep Learning with Big Data

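Typically, training deep neural networks requires large amounts of data that often do not fit in memory. You do not need multiple computers to solve problems using data sets that are too large to fit in memory. Instead, you can divide your training data into mini-batches, each containing a portion of the data set. By iterating over the mini-batches, a network can learn from a large data set without loading all of the data into memory at once.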

If your data is too large to fit in memory, use a datastore to work with mini-batches of data for training and inference. MATLAB® provides many different types of datastore tailored for different applications. For more information about datastores for different applications, see Datastores for Deep Learning.

augmentedImageDatastore is specifically designed to preprocess and augment batches of image data for machine learning and computer vision applications. For an example showing how to use augmentedImageDatastore to manage image data during training, see Train Network with Augmented Images.
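As a minimal sketch of iterating over mini-batches without loading the whole data set, the following code builds an augmentedImageDatastore and reads a single batch. It assumes that the sample digit images shipped with Deep Learning Toolbox are installed at the path shown; the folder, the augmentation settings, and the mini-batch size of 64 are illustrative choices, not requirements.

```matlab
% Assumption: the Deep Learning Toolbox sample digit images are installed here.
dataFolder = fullfile(matlabroot, 'toolbox', 'nnet', 'nndemos', ...
    'nndatasets', 'DigitDataset');

% The image datastore records file locations only; images are read on demand.
imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

% Illustrative augmentation: random horizontal flips and small rotations.
augmenter = imageDataAugmenter( ...
    'RandXReflection', true, ...
    'RandRotation', [-10 10]);

augimds = augmentedImageDatastore([28 28], imds, ...
    'DataAugmentation', augmenter);
augimds.MiniBatchSize = 64;

% Each read returns one mini-batch as a table of images and labels.
batch = read(augimds);
disp(size(batch))
```

Only the images in the current mini-batch are held in memory at any time, which is what allows training on data sets larger than the available RAM.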

Work with Big Data in Parallel

If you want to use large amounts of data to train a network, it can be helpful to train in parallel. Doing so can reduce the time it takes to train a network, because you can train using multiple mini-batches at the same time.

It is recommended to train using a GPU or multiple GPUs. Use a single CPU or multiple CPUs only if you do not have a GPU. CPUs are normally much slower than GPUs for both training and inference. Running on a single GPU typically offers much better performance than running on multiple CPU cores.
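One way to follow this recommendation in code is to select the execution environment explicitly, as in the sketch below. The solver ('sgdm') is an arbitrary choice for illustration. The default 'auto' setting of the ExecutionEnvironment training option makes a similar choice on its own, so this is only needed when you want the decision to be visible in your script.

```matlab
% Prefer a GPU when one is usable; fall back to the CPU otherwise.
if canUseGPU
    env = 'gpu';   % use 'multi-gpu' instead to train on several local GPUs
else
    env = 'cpu';
end

options = trainingOptions('sgdm', 'ExecutionEnvironment', env);
disp(options.ExecutionEnvironment)
```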

For more information about training in parallel, see Scale Up Deep Learning in Parallel, on GPUs, and in the Cloud.

Preprocess Data in Background

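When you train in parallel, you can fetch and preprocess data in the background. This can be particularly useful if you want to preprocess your mini-batches during training, such as when you use the transform function to apply a mini-batch preprocessing function to your datastore.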
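The following sketch shows the general shape of a transform call. The preprocessing step (converting to single precision and scaling to the range [0, 1]) and the sample digit data folder are assumptions made for illustration; a real training pipeline would typically also return the responses alongside the predictors.

```matlab
% Assumption: the Deep Learning Toolbox sample digit images are installed here.
dataFolder = fullfile(matlabroot, 'toolbox', 'nnet', 'nndemos', ...
    'nndatasets', 'DigitDataset');
imds = imageDatastore(dataFolder, 'IncludeSubfolders', true);

% The function handle runs each time data is read, not up front.
tds = transform(imds, @(img) single(img) / 255);

img = preview(tds);   % read one transformed image without advancing the datastore
disp(class(img))
```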

When you train a network using the trainNetwork function, you can fetch and preprocess data in the background by enabling background dispatch:

Set the DispatchInBackground property of the datastore to true.
Set the DispatchInBackground training option to true using the trainingOptions function.
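The two settings above can be sketched as follows. This assumes Parallel Computing Toolbox is available (background dispatch uses parallel workers) and reuses the sample digit data set purely for illustration. The solver and mini-batch size are arbitrary, and layers in the final comment is a hypothetical variable standing in for your own network definition.

```matlab
% Assumption: the Deep Learning Toolbox sample digit images are installed here.
dataFolder = fullfile(matlabroot, 'toolbox', 'nnet', 'nndemos', ...
    'nndatasets', 'DigitDataset');
imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');
augimds = augmentedImageDatastore([28 28], imds);

% Setting 1: enable background dispatch on the datastore.
augimds.DispatchInBackground = true;

% Setting 2: enable background dispatch in the training options.
options = trainingOptions('sgdm', ...
    'MiniBatchSize', 64, ...
    'DispatchInBackground', true);

% Training would then look like this (layers is hypothetical):
% net = trainNetwork(augimds, layers, options);
```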

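During training, some workers are used for preprocessing data instead of network training computations. You can fine-tune the split between training computation and data dispatch across workers by specifying the WorkerLoad training option using the trainingOptions function. For advanced options, you can try modifying the number of workers of the parallel pool.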
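As a rough sketch, a scalar WorkerLoad between 0 and 1 specifies the fraction of workers on each machine that perform training computation, leaving the rest to fetch and preprocess data when background dispatch is enabled. The value 0.75 and the pool size mentioned in the comment are illustrative assumptions, and the best split depends on how expensive your preprocessing is.

```matlab
% Optionally start a pool of a chosen size first, for example: parpool(4)

options = trainingOptions('sgdm', ...
    'ExecutionEnvironment', 'parallel', ...
    'DispatchInBackground', true, ...
    'WorkerLoad', 0.75);   % about three quarters of the workers train

disp(options.WorkerLoad)
```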

You can use a built-in mini-batch datastore, such as augmentedImageDatastore, denoisingImageDatastore (Image Processing Toolbox), or pixelLabelImageDatastore (Computer Vision Toolbox). You can also use a custom mini-batch datastore with background dispatch enabled. For more information on creating custom mini-batch datastores, see Develop Custom Mini-Batch Datastore.

For more information about datastore requirements for background dispatching, see Use Datastore for Parallel Training and Background Dispatching.

Work with Big Data in the Cloud

Storing data in the cloud can make it easier to access from cloud applications, because you do not need to upload or download large amounts of data each time you create cloud resources. Both AWS® and Azure® offer data storage services, such as AWS S3 and Azure Blob Storage, respectively.

To avoid the time and cost associated with transferring large quantities of data, it is recommended that you set up cloud resources for your deep learning applications using the same cloud provider and region that you use to store your data in the cloud.

To access data stored in the cloud from MATLAB, you must configure your machine with your access credentials. You can configure access from inside MATLAB using environment variables. For more information on how to set environment variables to access cloud data from your client MATLAB, see Work with Remote Data. For more information on how to set environment variables on parallel workers in a remote cluster, see Set Environment Variables on Workers (Parallel Computing Toolbox).
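For AWS S3, configuring credentials from inside MATLAB might look like the following sketch. The credential strings, the region, and the bucket path are placeholders rather than real values, and the datastore line is left as a comment because it only works once valid credentials and an existing bucket are in place.

```matlab
% Placeholder credentials: replace these with your own values.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');
setenv('AWS_DEFAULT_REGION', 'us-east-1');   % assumed region for illustration

% With valid credentials, a datastore can point at a bucket location
% (the bucket name and folder here are hypothetical):
% imds = imageDatastore('s3://bucketname/image_datastore/', ...
%     'IncludeSubfolders', true, 'LabelSource', 'foldernames');

disp(getenv('AWS_DEFAULT_REGION'))
```

Avoid hard-coding real credentials in scripts that you share or commit to version control.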

For an example showing how to upload data to the cloud, see Upload Deep Learning Data to the Cloud.

For more information about deep learning in the cloud, see Deep Learning in the Cloud.

Preprocess Data for Custom Training Loops

When you train a network using a custom training loop, you can process your data in the background by using minibatchqueue and enabling background dispatch. A minibatchqueue object iterates over a datastore to prepare mini-batches for custom training loops. Enable background dispatch when your mini-batches require heavy preprocessing.

To enable background dispatch, you must:

Set the DispatchInBackground property of the datastore to true.
Set the DispatchInBackground property of the minibatchqueue to true.
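The two settings above can be sketched for a custom training loop as follows. This assumes Parallel Computing Toolbox is available and uses the sample digit images shipped with Deep Learning Toolbox. The helper preprocessMiniBatch is a hypothetical name introduced here for illustration, and the mini-batch size and data format are illustrative choices.

```matlab
% Assumption: the Deep Learning Toolbox sample digit images are installed here.
dataFolder = fullfile(matlabroot, 'toolbox', 'nnet', 'nndemos', ...
    'nndatasets', 'DigitDataset');
imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');
augimds = augmentedImageDatastore([28 28], imds);

% Setting 1: enable background dispatch on the datastore.
augimds.DispatchInBackground = true;

% Setting 2: enable background dispatch on the minibatchqueue.
mbq = minibatchqueue(augimds, ...
    'MiniBatchSize', 64, ...
    'MiniBatchFcn', @preprocessMiniBatch, ...
    'MiniBatchFormat', {'SSCB', ''}, ...
    'DispatchInBackground', true);

% Inside a training loop, fetch prepared mini-batches one at a time.
if hasdata(mbq)
    [X, T] = next(mbq);
    disp(size(X))
end

function [X, T] = preprocessMiniBatch(dataX, dataT)
    % Hypothetical helper: stack images along the 4th (batch) dimension
    % and one-hot encode the categorical labels along the 1st dimension.
    X = cat(4, dataX{:});
    T = onehotencode(cat(2, dataT{:}), 1);
end
```

In a script file, the local function must appear at the end of the file, as shown.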

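When you use this option, MATLAB opens a local parallel pool to use for preprocessing your data. Data preprocessing for custom training loops is supported only when training using local resources. For example, use this option when training with a single GPU in your local machine.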

For more information about datastore requirements for background dispatching, see Use Datastore for Parallel Training and Background Dispatching.