Train Agents Using Parallel Computing and GPUs

If you have Parallel Computing Toolbox™ software, you can run parallel simulations on multicore processors or GPUs. If you additionally have MATLAB® Parallel Server™ software, you can run parallel simulations on computer clusters or cloud resources.

Regardless of which devices you use to simulate or train the agent, once the agent has been trained you can generate code to deploy the optimal policy on a CPU or GPU. For more information, see Deploy Trained Reinforcement Learning Policies.

Using Multiple Processes

When you train agents using parallel computing, the parallel pool client (the MATLAB process that starts the training) sends copies of both its agent and environment to each parallel worker. Each worker simulates the agent within the environment and sends its simulation data back to the client. The client agent learns from the data sent by the workers and sends the updated policy parameters back to the workers.

To create a parallel pool of N workers, use the following syntax.

pool = parpool(N);

If you do not create a parallel pool using parpool (Parallel Computing Toolbox), the train function automatically creates one using your default parallel pool preferences. For more information on specifying these preferences, see Specify Your Parallel Preferences (Parallel Computing Toolbox). Note that using a parallel pool of thread workers, such as pool = parpool("threads"), is not supported.

To train an agent using multiple processes, you must pass to the train function an rlTrainingOptions object in which UseParallel is set to true.

For more information on configuring your training to use parallel computing, see the UseParallel and ParallelizationOptions options in rlTrainingOptions.
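For example, the following sketch enables parallel training with otherwise default options, assuming agent and env are an agent and environment that you have already created.

trainOpts = rlTrainingOptions('UseParallel',true);
trainingStats = train(agent,env,trainOpts);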

Note that parallel simulation and training are not supported for environments containing multiple agents.

For an example that trains an agent using parallel computing in MATLAB, see Train AC Agent to Balance Cart-Pole System Using Parallel Computing. For examples that train agents using parallel computing in Simulink®, see Train DQN Agent for Lane Keeping Assist Using Parallel Computing and Train Biped Robot to Walk Using Reinforcement Learning Agents.

Agent-Specific Parallel Training Considerations

For off-policy agents, such as DDPG and DQN agents, do not use all of your cores for parallel training. For example, if your CPU has six cores, train with four workers. Doing so provides more resources for the parallel pool client to compute gradients based on the experiences sent back from the workers. Limiting the number of workers is not necessary for on-policy agents, such as AC and PG agents, when the gradients are computed on the workers.
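For example, on a six-core CPU, you can create the pool explicitly with four workers before calling train.

pool = parpool(4);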

Gradient-Based Parallelization (AC and PG Agents)

To train AC and PG agents in parallel, the DataToSendFromWorkers property of the ParallelizationOptions object (contained in the training options object) must be set to "gradients".

This configures the training so that both the environment simulation and the gradient computations are done by the workers. Specifically, workers simulate the agent against the environment, compute the gradients from experiences, and send the gradients to the client. The client averages the gradients, updates the network parameters, and sends the updated parameters back to the workers so that they can continue simulating the agent with the new parameters.

With gradient-based parallelization, you can in principle achieve a speed improvement that is nearly linear in the number of workers. However, this option requires synchronous training (that is, the Mode property of the rlTrainingOptions object that you pass to the train function must be set to "sync"). This means that workers must pause execution until all workers are finished, and as a result the training only advances as fast as the slowest worker allows.
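For example, the following sketch configures gradient-based synchronous parallel training. It assumes that Mode and DataToSendFromWorkers are accessed through the ParallelizationOptions property of the training options object, as described above.

trainOpts = rlTrainingOptions('UseParallel',true);
trainOpts.ParallelizationOptions.Mode = "sync";
trainOpts.ParallelizationOptions.DataToSendFromWorkers = "gradients";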

When AC agents are trained in parallel, a warning is generated if the NumStepsToLookAhead property of the AC agent options object and the StepsUntilDataIsSent property of the ParallelizationOptions object are set to different values.

Experience-Based Parallelization (DQN, DDPG, PPO, TD3, and SAC Agents)

To train DQN, DDPG, PPO, TD3, and SAC agents in parallel, the DataToSendFromWorkers property of the ParallelizationOptions object (contained in the training options object) must be set to "experiences". This option does not require synchronous training (that is, the Mode property of the rlTrainingOptions object that you pass to the train function can be set to "async").
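For example, the following sketch configures experience-based asynchronous parallel training, again assuming the ParallelizationOptions properties described above.

trainOpts = rlTrainingOptions('UseParallel',true);
trainOpts.ParallelizationOptions.Mode = "async";
trainOpts.ParallelizationOptions.DataToSendFromWorkers = "experiences";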

This configures the training so that the environment simulation is done by the workers and the learning is done by the client. Specifically, the workers simulate the agent against the environment, and send experience data (observation, action, reward, next observation, and a termination signal) to the client. The client then computes the gradients from the experiences, updates the network parameters, and sends the updated parameters back to the workers, which continue to simulate the agent with the new parameters.

Experience-based parallelization can reduce training time only when the computational cost of simulating the environment is high compared to the cost of optimizing network parameters. Otherwise, when the environment simulation is fast enough, the workers lie idle waiting for the client to learn and send back the updated parameters.

In other words, experience-based parallelization can improve sample efficiency (that is, the number of samples an agent can process within a given time) only when the ratio R between the environment step complexity and the learning complexity is large. If both environment simulation and learning are similarly computationally expensive, experience-based parallelization is unlikely to improve sample efficiency. However, in this case, for off-policy agents, you can reduce the mini-batch size to make R larger, thereby improving sample efficiency.
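For example, for a DDPG agent you can reduce the MiniBatchSize option when creating the agent options (the value shown here is only illustrative).

agentOpts = rlDDPGAgentOptions('MiniBatchSize',64);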

To enforce contiguity in the experience buffer when training DQN, DDPG, TD3, or SAC agents in parallel, set the NumStepsToLookAhead property of the corresponding agent options object to 1. A different value causes an error when parallel training is attempted.
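For example, for a DDPG agent, a sketch of this setting is as follows.

agentOpts = rlDDPGAgentOptions('NumStepsToLookAhead',1);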

Using GPUs

When using deep neural network function approximators for your actor or critic representation, you can speed up training by performing representation operations (such as gradient computation and prediction) on a local GPU rather than a CPU. To do so, when creating a critic or actor representation, use an rlRepresentationOptions object in which the UseDevice option is set to "gpu" instead of "cpu".

opt = rlRepresentationOptions('UseDevice',"gpu");

The"gpu"option requires both Parallel Computing Toolbox software and a CUDA®enabled NVIDIA®GPU. For more information on supported GPUs seeGPU Support by Release(Parallel Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be used with MATLAB.
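For example, assuming at least one supported GPU is available:

gpuDevice % display the currently selected GPU device
gpuDevice(1); % select the GPU device with index 1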

Using GPUs is likely to be beneficial when the deep neural network in the actor or critic representation uses operations such as multiple convolutional layers on input images or has large batch sizes.

For an example that shows how to train an agent using the GPU, see Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation.

Using both Multiple Processes and GPUs

You can also train agents using both multiple processes and a local GPU (previously selected using gpuDevice (Parallel Computing Toolbox)) at the same time. Specifically, you can create a critic or actor using an rlRepresentationOptions object in which the UseDevice option is set to "gpu". You can then use the critic and actor to create an agent, and then train the agent using multiple processes. This is done by creating an rlTrainingOptions object in which UseParallel is set to true and passing it to the train function.
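The following sketch shows the two option settings involved. Building the networks, the critic and actor representations, the agent, and the environment (referred to below as agent and env) is assumed to have been done separately.

gpuDevice(1); % select the local GPU (assumes a supported GPU is present)
repOpts = rlRepresentationOptions('UseDevice',"gpu"); % use when creating the critic and actor
trainOpts = rlTrainingOptions('UseParallel',true); % distribute simulations over parallel workers
trainingStats = train(agent,env,trainOpts);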

For gradient-based parallelization (which must run in synchronous mode), the environment simulation is done by the workers, which use their local GPU to calculate the gradients and perform a prediction step. The gradients are then sent back to the parallel pool client process, which averages them, updates the network parameters, and sends them back to the workers so that they can continue to simulate the agent, with the new parameters, against the environment.

For experience-based parallelization (which can run in asynchronous mode), the workers simulate the agent against the environment and send experience data back to the parallel pool client. The client then uses its local GPU to compute the gradients from the experiences, updates the network parameters, and sends the updated parameters back to the workers, which continue to simulate the agent, with the new parameters, against the environment.

Note that when using both parallel processing and a GPU to train PPO agents, the workers use their local GPU to compute the advantages, and then send processed experience trajectories (which include advantages, targets, and action probabilities) back to the client.
