
Train DQN Agent to Balance Cart-Pole System

This example shows how to train a deep Q-learning network (DQN) agent to balance a cart-pole system modeled in MATLAB®.

For more information on DQN agents, see Deep Q-Network Agents. For an example that trains a DQN agent in Simulink®, see Train DQN Agent to Swing Up and Balance Pendulum.

Cart-Pole MATLAB Environment

The reinforcement learning environment for this example is a pole attached to an unactuated joint on a cart, which moves along a frictionless track. The training goal is to make the pole stand upright without falling over.

For this environment:

  • The upward balanced pole position is 0 radians, and the downward hanging position is pi radians.

  • The pole starts upright with an initial angle between –0.05 and 0.05 radians.

  • The force action signal from the agent to the environment is from –10 to 10 N.

  • The observations from the environment are the position and velocity of the cart, the pole angle, and the pole angle derivative.

  • The episode terminates if the pole is more than 12 degrees from vertical or if the cart moves more than 2.4 m from the original position.

  • A reward of +1 is provided for every time step that the pole remains upright. A penalty of –5 is applied when the pole falls.

For more information on this model, see Load Predefined Control System Environments.
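To see these signals directly, you can step the environment manually before training. The following is a minimal sketch that assumes the standard reset and step interface of MATLAB reinforcement learning environments; the constant 10 N action is used only for illustration.

% Reset the environment to a new initial state (pole near upright).
initialObs = reset(env);
% Apply a constant 10 N force for one step and inspect the observation,
% reward, and termination flag returned by the environment.
[nextObs,reward,isDone] = step(env,10);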

Create Environment Interface

Create a predefined environment interface for the system.

env = rlPredefinedEnv("CartPole-Discrete")
env = 
  CartPoleDiscreteAction with properties:

                  Gravity: 9.8000
                 MassCart: 1
                 MassPole: 0.1000
                   Length: 0.5000
                 MaxForce: 10
                       Ts: 0.0200
    ThetaThresholdRadians: 0.2094
               XThreshold: 2.4000
      RewardForNotFalling: 1
        PenaltyForFalling: -5
                    State: [4x1 double]

The interface has a discrete action space where the agent can apply one of two possible force values to the cart, –10 or 10 N.

Get the observation and action specification information.

obsInfo = getObservationInfo(env)
obsInfo = 
  rlNumericSpec with properties:

     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "CartPole States"
    Description: "x, dx, theta, dtheta"
      Dimension: [4 1]
       DataType: "double"
actInfo = getActionInfo(env)
actInfo = 
  rlFiniteSetSpec with properties:

       Elements: [-10 10]
           Name: "CartPole Action"
    Description: [0x0 string]
      Dimension: [1 1]
       DataType: "double"

Fix the random generator seed for reproducibility.

rng(0)

Create DQN Agent

A DQN agent approximates the long-term reward, given observations and actions, using a value-function critic.

DQN agents can use multi-output Q-value critic approximators, which are generally more efficient. A multi-output approximator has observations as inputs and state-action values as outputs. Each output element represents the expected cumulative long-term reward for taking the corresponding discrete action from the state indicated by the observation inputs.

To create the critic, first create a deep neural network with one input (the 4-dimensional observed state) and one output vector with two elements (one for the 10 N action and another for the –10 N action). For more information on creating neural network-based value function representations, see Create Policies and Value Functions.

dnn = [
    featureInputLayer(obsInfo.Dimension(1),'Normalization','none','Name','state')
    fullyConnectedLayer(24,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(24,'Name','CriticStateFC2')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(length(actInfo.Elements),'Name','output')];
dnn = dlnetwork(dnn);

View the network configuration.

figure
plot(layerGraph(dnn))

Figure contains an axes object. The axes object contains an object of type graphplot.

Specify training options for the critic optimizer using rlOptimizerOptions.

criticOpts = rlOptimizerOptions('LearnRate',0.001,'GradientThreshold',1);

Create the critic representation using the specified neural network and options. For more information, see rlVectorQValueFunction.

critic = rlVectorQValueFunction(dnn,obsInfo,actInfo);
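As a quick check, you can evaluate the untrained critic on a random observation; it should return one Q-value per discrete action. This is only an illustrative sketch, and the values are meaningless before training.

% Evaluate the critic for a random 4-element observation.
% The result is a 2-element vector: one Q-value for the -10 N action
% and one for the 10 N action.
qValues = getValue(critic,{rand(obsInfo.Dimension(1),1)});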

To create the DQN agent, first specify the DQN agent options using rlDQNAgentOptions.

agentOpts = rlDQNAgentOptions( ...
    'UseDoubleDQN',false, ...
    'TargetSmoothFactor',1, ...
    'TargetUpdateFrequency',4, ...
    'ExperienceBufferLength',100000, ...
    'CriticOptimizerOptions',criticOpts, ...
    'MiniBatchSize',256);

Then, create the DQN agent using the specified critic representation and agent options. For more information, see rlDQNAgent.

agent = rlDQNAgent(critic,agentOpts);
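Before training, you can verify that the agent produces a valid action for an arbitrary observation. This sketch uses getAction with a random observation purely as a sanity check; it is not part of the training workflow.

% The returned value is a cell array containing either -10 or 10.
act = getAction(agent,{rand(obsInfo.Dimension(1),1)});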

Train Agent

To train the agent, first specify the training options. For this example, use the following options:

  • Run one training session containing at most 1000 episodes, with each episode lasting at most 500 time steps.

  • Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).

  • Stop training when the agent receives a moving average cumulative reward greater than 480. At that point, the agent can balance the cart-pole system in the upright position.

For more information, see rlTrainingOptions.

trainOpts = rlTrainingOptions( ...
    'MaxEpisodes',1000, ...
    'MaxStepsPerEpisode',500, ...
    'Verbose',false, ...
    'Plots','training-progress', ...
    'StopTrainingCriteria','AverageReward', ...
    'StopTrainingValue',480);

You can visualize the cart-pole system by using the plot function during training or simulation.

plot(env)

Figure Cart Pole Visualizer contains an axes object. The axes object contains 6 objects of type line, polygon.

Train the agent using the train function. Training this agent is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load('MATLABCartpoleDQNMulti.mat','agent')
end
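If you train the agent yourself, the structure returned by train records per-episode statistics, and you can save the trained agent for later reuse. The following sketch assumes the commonly available EpisodeReward and AverageReward fields of the training statistics and uses a file name chosen for illustration; it is not part of the original example.

if doTraining
    % Plot the per-episode and moving-average rewards collected during training.
    figure
    plot(trainingStats.EpisodeReward)
    hold on
    plot(trainingStats.AverageReward)
    legend('Episode reward','Average reward')

    % Save the trained agent for later simulation.
    save('trainedCartPoleDQN.mat','agent')
end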

Simulate DQN Agent

To validate the performance of the trained agent, simulate it within the cart-pole environment. For more information on agent simulation, see rlSimulationOptions and sim. The agent can balance the cart-pole even when the simulation time increases to 500 steps.

simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);

Figure Cart Pole Visualizer contains an axes object. The axes object contains 6 objects of type line, polygon.

totalReward = sum(experience.Reward)
totalReward = 500
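You can also inspect the trajectory logged during the simulation, for example the pole angle over time. The observation field name below (CartPoleStates, derived from the "CartPole States" specification name) and the ordering of the state elements follow the environment description above; treat them as assumptions to verify against your experience structure.

% Extract the logged observations (4 x 1 x T array) and plot the pole angle,
% which is the third state element (x, dx, theta, dtheta).
obsLog = squeeze(experience.Observation.CartPoleStates.Data);
figure
plot(obsLog(3,:))
xlabel('Time step')
ylabel('Pole angle (rad)')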

See Also

Related Topics