
Train Multiple Agents to Perform Collaborative Task

This example shows how to set up a multi-agent training session on a Simulink® environment. In the example, you train two agents to collaboratively perform the task of moving an object.

The environment in this example is a frictionless two-dimensional surface containing elements represented by circles. A target object C is represented by the blue circle with a radius of 2 m, and robots A (red) and B (green) are represented by smaller circles with radii of 1 m each. The robots attempt to move object C outside a circular ring of radius 8 m by applying forces through collision. All elements within the environment have mass and obey Newton's laws of motion. In addition, contact forces between the elements and the environment boundaries are modeled as spring-damper systems. The elements can move on the surface through externally applied forces in the X and Y directions. There is no motion in the third dimension, and the total energy of the system is conserved.
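For illustration only, the following sketch shows how a spring-damper contact force of this kind can be computed for a one-dimensional contact. The variable names are hypothetical, and the model's actual implementation may differ.

% Minimal sketch of a spring-damper (Kelvin-Voigt) contact force, 1-D case.
% k - contact stiffness (N/m), c - contact damping (N/(m/s))
% x - penetration depth (m),   v - rate of penetration (m/s)
k = 100;  c = 0.1;
x = 0.05; v = 0.2;
Fcontact = k*x + c*v;   % force magnitude pushing the colliding elements apart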

Set the random seed and create the set of parameters required for this example.

rng(10)
rlCollaborativeTaskParams

Open the Simulink model.

mdl = "rlCollaborativeTask";
open_system(mdl)

For this environment:

  • The 2-dimensional space is bounded from –12 m to 12 m in both the X and Y directions.

  • The contact spring stiffness and damping values are 100 N/m and 0.1 N/(m/s), respectively.

  • The agents share the same observations: the positions and velocities of A, B, and C, and the action values from the last time step.

  • The simulation terminates when object C moves outside the circular ring.

  • At each time step, the agents receive the following reward:

$$
\begin{aligned}
r_A &= r_{\mathrm{global}} + r_{\mathrm{local},A} \\
r_B &= r_{\mathrm{global}} + r_{\mathrm{local},B} \\
r_{\mathrm{global}} &= 0.001\, d_C \\
r_{\mathrm{local},A} &= -0.005\, d_{AC} - 0.008\, u_A^2 \\
r_{\mathrm{local},B} &= -0.005\, d_{BC} - 0.008\, u_B^2
\end{aligned}
$$

Here:

  • $r_A$ and $r_B$ are the rewards received by agents A and B, respectively.

  • $r_{\mathrm{global}}$ is a team reward that is received by both agents as object C moves closer to the boundary of the ring.

  • $r_{\mathrm{local},A}$ and $r_{\mathrm{local},B}$ are local penalties received by agents A and B based on their distances from object C and the magnitudes of their actions from the last time step.

  • $d_C$ is the distance of object C from the center of the ring.

  • $d_{AC}$ and $d_{BC}$ are the distances between agent A and object C and between agent B and object C, respectively.

  • The agents apply external forces on the robots that result in motion. $u_A$ and $u_B$ are the action values of agents A and B from the last time step. The range of action values is between -1 and 1.
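For illustration, the following sketch evaluates the reward terms above for one time step using made-up values. The variable names are hypothetical, and the action penalty is assumed to use the squared magnitude of each agent's two-element action vector.

% Hypothetical values for one time step
dC  = 5.0;              % distance of object C from the ring center (m)
dAC = 2.5;  dBC = 3.0;  % distances of robots A and B from object C (m)
uA  = [ 0.4; -0.2];     % last action of agent A (normalized forces)
uB  = [-0.6;  0.1];     % last action of agent B (normalized forces)

rGlobal = 0.001*dC;                        % shared team reward
rLocalA = -0.005*dAC - 0.008*sum(uA.^2);   % local penalty for agent A
rLocalB = -0.005*dBC - 0.008*sum(uB.^2);   % local penalty for agent B
rA = rGlobal + rLocalA;                    % total reward for agent A
rB = rGlobal + rLocalB;                    % total reward for agent B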

Environment

To create a multi-agent environment, specify the block paths of the agents using a string array. Also, specify the observation and action specification objects using cell arrays. The order of the specification objects in the cell array must match the order specified in the block path array. When agents are available in the MATLAB workspace at the time of environment creation, the observation and action specification arrays are optional. For more information on creating multi-agent environments, see rlSimulinkEnv.

Create the I/O specifications for the environment. In this example, the agents are homogeneous and have the same I/O specifications. Each agent observes 16 values (the X-Y positions and velocities of A, B, and C, plus the previous actions of both agents) and outputs 2 actions (the X and Y forces).

% Number of observations
numObs = 16;
% Number of actions
numAct = 2;
% Maximum value of externally applied force (N)
maxF = 1.0;

% I/O specifications for each agent
oinfo = rlNumericSpec([numObs,1]);
ainfo = rlNumericSpec([numAct,1],...
    UpperLimit= maxF,...
    LowerLimit= -maxF);
oinfo.Name = "observations";
ainfo.Name = "forces";

Create the Simulink environment interface.

blks = ["rlCollaborativeTask/Agent A","rlCollaborativeTask/Agent B"];
obsInfos = {oinfo,oinfo};
actInfos = {ainfo,ainfo};
env = rlSimulinkEnv(mdl,blks,obsInfos,actInfos);

Specify a reset function for the environment. The reset function resetRobots ensures that the robots start from random initial positions at the beginning of each episode.

env.ResetFcn = @(in) resetRobots(in,RA,RB,RC,boundaryR);
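The resetRobots function is provided with the example and is not listed here. As a rough sketch only, a reset function for a Simulink environment typically modifies the Simulink.SimulationInput object to randomize workspace variables; the variable names below are hypothetical, and the shipped resetRobots may differ.

% Hypothetical sketch of a reset function (not the shipped resetRobots).
function in = exampleResetFcn(in,RA,RB,RC,boundaryR)
    % Place robots A and B at random positions inside the ring, clear of
    % the boundary and of object C (assumed to start at the center).
    thA = 2*pi*rand;  dA = (RC + RA) + (boundaryR - RC - 2*RA)*rand;
    thB = 2*pi*rand;  dB = (RC + RB) + (boundaryR - RC - 2*RB)*rand;
    in = setVariable(in,"xA0",dA*cos(thA));
    in = setVariable(in,"yA0",dA*sin(thA));
    in = setVariable(in,"xB0",dB*cos(thB));
    in = setVariable(in,"yB0",dB*sin(thB));
end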

Agents

This example uses two Proximal Policy Optimization (PPO) agents with continuous action spaces. The agents apply external forces on the robots that result in motion. To learn more about PPO agents, see Proximal Policy Optimization (PPO) Agents.

The agents collect experiences until the experience horizon (600 steps) is reached. After trajectory completion, the agents learn from mini-batches of 300 experiences. An objective function clip factor of 0.2 is used to improve training stability and a discount factor of 0.99 is used to encourage long-term rewards.

Specify the agent options for this example.

agentOptions = rlPPOAgentOptions(...
    ExperienceHorizon=600,...
    ClipFactor=0.2,...
    EntropyLossWeight=0.01,...
    MiniBatchSize=300,...
    NumEpoch=4,...
    AdvantageEstimateMethod="gae",...
    GAEFactor=0.95,...
    SampleTime=Ts,...
    DiscountFactor=0.99);
agentOptions.ActorOptimizerOptions.LearnRate = 1e-4;
agentOptions.CriticOptimizerOptions.LearnRate = 1e-4;

Create the agents using the default agent creation syntax. For more information, see rlPPOAgent.

agentA = rlPPOAgent(oinfo, ainfo,...
    rlAgentInitializationOptions(NumHiddenUnit=200), agentOptions);
agentB = rlPPOAgent(oinfo, ainfo,...
    rlAgentInitializationOptions(NumHiddenUnit=200), agentOptions);

Training

To train multiple agents, you can pass an array of agents to the train function. The order of agents in the array must match the order of agent block paths specified during environment creation. Doing so ensures that the agent objects are linked to their appropriate I/O interfaces in the environment.

You can train multiple agents in a decentralized or centralized manner. In decentralized training, agents collect their own set of experiences during the episodes and learn independently from those experiences. In centralized training, the agents share the collected experiences and learn from them together. The actor and critic functions are synchronized between the agents after trajectory completion.

To configure multi-agent training, you can create agent groups and specify a learning strategy for each group through the rlMultiAgentTrainingOptions object. Each agent group may contain unique agent indices, and the learning strategy can be "centralized" or "decentralized". For example, you can use the following command to configure training for three agent groups with different learning strategies. The agents with indices [1,2] and [3,5] learn in a centralized manner, while agent 4 learns in a decentralized manner.

opts = rlMultiAgentTrainingOptions(...
    AgentGroups={[1,2], 4, [3,5]},...
    LearningStrategy=["centralized","decentralized","centralized"])

For more information on multi-agent training, type help rlMultiAgentTrainingOptions in MATLAB.

You can perform decentralized or centralized training by running one of the following sections using the Run Section button.

1. Decentralized Training

To configure decentralized multi-agent training for this example:

  • Automatically assign agent groups using the AgentGroups="auto" option. This allocates each agent to a separate group.

  • Specify the "decentralized" learning strategy.

  • Run the training for at most 1000 episodes, with each episode lasting at most 600 time steps.

  • Stop the training of an agent when its average reward over 30 consecutive episodes is –10 or more.

trainOpts = rlMultiAgentTrainingOptions(...
    AgentGroups="auto",...
    LearningStrategy="decentralized",...
    MaxEpisodes=1000,...
    MaxStepsPerEpisode=600,...
    ScoreAveragingWindowLength=30,...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=-10);

Train the agents using the train function. Training can take several hours to complete depending on the available computational power. To save time, load the MAT file decentralizedAgents.mat, which contains a set of pretrained agents. To train the agents yourself, set doTraining to true.

doTraining = false;
if doTraining
    decentralizedTrainResults = train([agentA,agentB],env,trainOpts);
else
    load("decentralizedAgents.mat");
end

The following figure shows a snapshot of decentralized training progress. You can expect different results due to randomness in the training process.

2. Centralized Training

To configure centralized multi-agent training for this example:

  • Allocate both agents (with indices 1 and 2) to a single group. You can do this by specifying the agent indices in the AgentGroups option.

  • Specify the "centralized" learning strategy.

  • Run the training for at most 1000 episodes, with each episode lasting at most 600 time steps.

  • Stop the training of an agent when its average reward over 30 consecutive episodes is –10 or more.

trainOpts = rlMultiAgentTrainingOptions(...
    AgentGroups={[1,2]},...
    LearningStrategy="centralized",...
    MaxEpisodes=1000,...
    MaxStepsPerEpisode=600,...
    ScoreAveragingWindowLength=30,...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=-10);

Train the agents using the train function. Training can take several hours to complete depending on the available computational power. To save time, load the MAT file centralizedAgents.mat, which contains a set of pretrained agents. To train the agents yourself, set doTraining to true.

doTraining = false;
if doTraining
    centralizedTrainResults = train([agentA,agentB],env,trainOpts);
else
    load("centralizedAgents.mat");
end

The following figure shows a snapshot of centralized training progress. You can expect different results due to randomness in the training process.

Simulation

Once the training is finished, simulate the trained agents with the environment.

simOptions = rlSimulationOptions(MaxSteps=300);
exp = sim(env,[agentA agentB],simOptions);

Figure: Multi Agent Collaborative Task, a plot of the environment elements on X (m) and Y (m) axes.
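The sim function returns an array of experience structures, one per agent. For example, assuming the reward signal is logged as a timeseries in the Reward field (as is typical for Simulink environments), you can compute the total reward that each agent collected:

% Total reward accumulated by each agent during the simulation
totalRewardA = sum(exp(1).Reward.Data);
totalRewardB = sum(exp(2).Reward.Data);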

For more information on agent simulation, see rlSimulationOptions and sim.
