Train DQN Agent to Balance Cart-Pole System
This example shows how to train a deep Q-learning network (DQN) agent to balance a cart-pole system modeled in MATLAB®.
For more information on DQN agents, see Deep Q-Network Agents. For an example that trains a DQN agent in Simulink®, see Train DQN Agent to Swing Up and Balance Pendulum.
Cart-Pole MATLAB Environment
The reinforcement learning environment for this example is a pole attached to an unactuated joint on a cart, which moves along a frictionless track. The training goal is to make the pole stand upright without falling over.
For this environment:
The upward balanced pole position is 0 radians, and the downward hanging position is pi radians. The pole starts upright with an initial angle between –0.05 and 0.05 radians.
The force action signal from the agent to the environment is from –10 to 10 N.
The observations from the environment are the position and velocity of the cart, the pole angle, and the pole angle derivative.
The episode terminates if the pole is more than 12 degrees from vertical or if the cart moves more than 2.4 m from the original position.
A reward of +1 is provided for every time step that the pole remains upright. A penalty of –5 is applied when the pole falls (see the sketch below).
For more information on this model, see Load Predefined Control System Environments.
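As a minimal sketch, the termination and reward rules above amount to the following logic. This is for illustration only; the predefined environment implements these rules internally, and the variables theta and x are hypothetical pole-angle and cart-position values.
theta = 0.05;  % hypothetical pole angle from vertical, in radians
x = 0;         % hypothetical cart position, in meters
% Episode terminates when the pole is more than 12 degrees from vertical
% or the cart moves more than 2.4 m from the original position.
isDone = abs(theta) > 12*pi/180 || abs(x) > 2.4;
if isDone
    reward = -5;   % penalty when the pole falls
else
    reward = 1;    % +1 for each time step the pole remains upright
end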
Create Environment Interface
Create a predefined environment interface for the system.
env = rlPredefinedEnv("CartPole-Discrete")
env = 
  CartPoleDiscreteAction with properties:

                  Gravity: 9.8000
                 MassCart: 1
                 MassPole: 0.1000
                   Length: 0.5000
                 MaxForce: 10
                       Ts: 0.0200
    ThetaThresholdRadians: 0.2094
               XThreshold: 2.4000
      RewardForNotFalling: 1
        PenaltyForFalling: -5
                    State: [4x1 double]
The interface has a discrete action space where the agent can apply one of two possible force values to the cart, –10 or 10 N.
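Optionally (this step is not part of the original example), you can reset the environment to draw a new random initial state; the returned observation vector follows the order x, dx, theta, dtheta.
initialObs = reset(env)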
Get the observation and action specification information.
obsInfo = getObservationInfo(env)
obsInfo = 
  rlNumericSpec with properties:

     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "CartPole States"
    Description: "x, dx, theta, dtheta"
      Dimension: [4 1]
       DataType: "double"
actInfo = getActionInfo(env)
actInfo = 
  rlFiniteSetSpec with properties:

       Elements: [-10 10]
           Name: "CartPole Action"
    Description: [0x0 string]
      Dimension: [1 1]
       DataType: "double"
Fix the random generator seed for reproducibility.
rng(0)
Create DQN Agent
A DQN agent approximates the long-term reward, given observations and actions, using a value-function critic.
DQN agents can use multi-output Q-value critic approximators, which are generally more efficient. A multi-output approximator has observations as inputs and state-action values as outputs. Each output element represents the expected cumulative long-term reward for taking the corresponding discrete action from the state indicated by the observation inputs.
To create the critic, first create a deep neural network with one input (the 4-dimensional observed state) and one output vector with two elements (one for the 10 N action and another for the –10 N action). For more information on creating neural network-based value function representations, see Create Policies and Value Functions.
dnn = [
    featureInputLayer(obsInfo.Dimension(1),'Normalization','none','Name','state')
    fullyConnectedLayer(24,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(24,'Name','CriticStateFC2')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(length(actInfo.Elements),'Name','output')];
dnn = dlnetwork(dnn);
View the network configuration.
figure
plot(layerGraph(dnn))
Specify some training options for the critic optimizer using rlOptimizerOptions.
criticOpts = rlOptimizerOptions('LearnRate',0.001,'GradientThreshold',1);
Create the critic representation using the specified neural network and options. For more information, see rlVectorQValueFunction.
critic = rlVectorQValueFunction(dnn,obsInfo,actInfo);
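As an optional sanity check (an addition, not part of the original example), you can evaluate the critic for a random observation. For a vector Q-value function, getValue returns one Q-value per discrete action, so the result should be a 2-element vector.
% Evaluate the critic for a random 4x1 observation.
getValue(critic,{rand(obsInfo.Dimension)})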
To create the DQN agent, first specify the DQN agent options using rlDQNAgentOptions.
agentOpts = rlDQNAgentOptions(...
    'UseDoubleDQN',false, ...
    'TargetSmoothFactor',1, ...
    'TargetUpdateFrequency',4, ...
    'ExperienceBufferLength',100000, ...
    'CriticOptimizerOptions',criticOpts, ...
    'MiniBatchSize',256);
Then, create the DQN agent using the specified critic representation and agent options. For more information, see rlDQNAgent.
agent = rlDQNAgent(critic,agentOpts);
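As another optional check (an assumption, not in the original example), you can query the as-yet-untrained agent for a random observation; the returned action should be one of the two allowed forces, –10 or 10 N.
% Get the agent's action for a random observation.
getAction(agent,{rand(obsInfo.Dimension)})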
Train Agent
To train the agent, first specify the training options. For this example, use the following options:
Run one training session containing at most 1000 episodes, with each episode lasting at most 500 time steps.
Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).
Stop training when the agent receives a moving average cumulative reward greater than 480. At this point, the agent can balance the cart-pole system in the upright position.
For more information, see rlTrainingOptions.
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',1000, ...
    'MaxStepsPerEpisode',500, ...
    'Verbose',false, ...
    'Plots','training-progress', ...
    'StopTrainingCriteria','AverageReward', ...
    'StopTrainingValue',480);
You can visualize the cart-pole system by using the plot function during training or simulation.
plot(env)
Train the agent using the train function. Training this agent is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load('MATLABCartpoleDQNMulti.mat','agent')
end
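If you train the agent yourself (doTraining = true), you can plot the learning curve from the returned statistics. This sketch is an addition, not part of the original example, and assumes the standard EpisodeIndex, EpisodeReward, and AverageReward fields of the training result.
if doTraining
    % Plot per-episode reward and its moving average over training.
    figure
    plot(trainingStats.EpisodeIndex,trainingStats.EpisodeReward)
    hold on
    plot(trainingStats.EpisodeIndex,trainingStats.AverageReward)
    xlabel('Episode')
    ylabel('Reward')
    legend('Episode reward','Moving average')
end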
Simulate DQN Agent
To validate the performance of the trained agent, simulate it within the cart-pole environment. For more information on agent simulation, see rlSimulationOptions and sim. The agent can balance the cart-pole even when the simulation time increases to 500 steps.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)
totalReward = 500
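As a final optional sketch (not in the original example), you can also plot the per-step reward logged in the experience output; for a successful episode this signal stays at +1 for all 500 steps.
% Plot the logged reward signal over the simulated episode.
figure
plot(experience.Reward)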