
Train DQN Agent to Swing Up and Balance Pendulum

This example shows how to train a deep Q-learning network (DQN) agent to swing up and balance a pendulum modeled in Simulink®.

For more information on DQN agents, see Deep Q-Network (DQN) Agents. For an example that trains a DQN agent in MATLAB®, see Train DQN Agent to Balance Cart-Pole System.

Pendulum Swing-up Model

The reinforcement learning environment for this example is a simple frictionless pendulum that initially hangs in a downward position. The training goal is to make the pendulum stand upright without falling over, using minimal control effort.

Open the model.

mdl = 'rlSimplePendulumModel';
open_system(mdl)

For this model:

  • The upward balanced pendulum position is 0 radians, and the downward hanging position is pi radians.

  • The torque action signal from the agent to the environment is from –2 to 2 N·m.

  • The observations from the environment are the sine of the pendulum angle, the cosine of the pendulum angle, and the pendulum angle derivative.

  • The reward $r_t$, provided at every time step, is

$$r_t = -\left(\theta_t^{2} + 0.1\,\dot{\theta}_t^{2} + 0.001\,u_{t-1}^{2}\right)$$

Here:

  • $\theta_t$ is the angle of displacement from the upright position.

  • $\dot{\theta}_t$ is the derivative of the displacement angle.

  • $u_{t-1}$ is the control effort from the previous time step.
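
As a quick sanity check of this formula, the reward for a single time step can be computed directly in MATLAB. The sketch below uses illustrative values for the angle, angular velocity, and previous torque; in the example itself the Simulink model computes the reward internally.

% Illustrative reward computation for one time step (values are arbitrary).
theta    = pi/4;   % displacement from upright, rad
thetaDot = 0.5;    % angular velocity, rad/s
uPrev    = 2;      % control effort from previous step, N·m
reward   = -(theta^2 + 0.1*thetaDot^2 + 0.001*uPrev^2)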

For more information on this model, see Load Predefined Simulink Environments.

Create Environment Interface

Create a predefined environment interface for the pendulum.

env = rlPredefinedEnv('SimplePendulumModel-Discrete')
env = 
SimulinkEnvWithAgent with properties:

             Model : rlSimplePendulumModel
        AgentBlock : rlSimplePendulumModel/RL Agent
          ResetFcn : []
    UseFastRestart : on

The interface has a discrete action space where the agent can apply one of three possible torque values to the pendulum: –2, 0, or 2 N·m.

To define the initial condition of the pendulum as hanging downward, specify an environment reset function using an anonymous function handle. This reset function sets the model workspace variable theta0 to pi.

env.ResetFcn = @(in)setVariable(in,'theta0',pi,'Workspace',mdl);
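
If a more elaborate initialization is ever needed, the anonymous handle can be replaced with a named reset function. The sketch below is hypothetical and not part of this example; it randomizes the initial angle around the downward position at the start of each episode.

function in = localResetFcn(in,mdl)
% Hypothetical reset function: start each episode from a random angle
% near the downward (pi rad) position.
theta0 = pi + 0.1*(2*rand - 1);
in = setVariable(in,'theta0',theta0,'Workspace',mdl);
end

You would then assign it with env.ResetFcn = @(in)localResetFcn(in,mdl);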

Get the observation and action specification information from the environment.

obsInfo = getObservationInfo(env)
obsInfo = 
  rlNumericSpec with properties:

     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "observations"
    Description: [0x0 string]
      Dimension: [3 1]
       DataType: "double"

actInfo = getActionInfo(env)
actInfo = 
  rlFiniteSetSpec with properties:

       Elements: [3x1 double]
           Name: "torque"
    Description: [0x0 string]
      Dimension: [1 1]
       DataType: "double"
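
As a quick check, you can list the discrete torque values stored in the action specification; for this environment they should match the –2, 0, and 2 N·m actions described above.

actInfo.Elements   % the three allowed torque values, in N·m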

Specify the simulation time TF and the agent sample time TS in seconds.

TS = 0.05;
TF = 20;

Fix the random generator seed for reproducibility.

rng(0)

Create DQN Agent

A DQN agent approximates the long-term reward, given observations and actions, using a value-function critic.

Since the DQN agent has a discrete action space, it can rely on a multi-output critic approximator, which is generally more efficient than a comparable single-output approximator. A multi-output approximator takes only the observation as input and returns an output vector with as many elements as there are possible discrete actions. Each output element represents the expected cumulative long-term reward when the corresponding discrete action is taken from the observation given as input.

To create the critic, first create a deep neural network with an input vector of three elements (for the sine, cosine, and derivative of the pendulum angle) and one output vector with three elements (for the –2, 0, and 2 N·m actions). For more information on creating a deep neural network value function representation, see Create Policies and Value Functions.

dnn = [
    featureInputLayer(3,'Normalization','none','Name','state')
    fullyConnectedLayer(24,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(48,'Name','CriticStateFC2')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(3,'Name','output')];
dnn = dlnetwork(dnn);

View the critic network configuration.

figure
plot(layerGraph(dnn))

Figure contains an axes object. The axes object contains an object of type graphplot.

Specify options for the critic optimizer using rlOptimizerOptions.

criticOpts = rlOptimizerOptions('LearnRate',0.001,'GradientThreshold',1);

Create the critic representation using the specified deep neural network and options. You must also specify the observation and action information for the critic. For more information, see rlVectorQValueFunction.

critic = rlVectorQValueFunction(dnn,obsInfo,actInfo);
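
To see the multi-output structure described above, you can query the critic with a sample observation; getValue returns one Q-value per discrete action. The observation values below are illustrative, and because the critic is untrained at this point the returned values are arbitrary.

% Query the untrained critic with a sample observation
% [sin(theta); cos(theta); thetaDot] for the pendulum hanging down at rest.
sampleObs = {[0; -1; 0]};
qValues = getValue(critic,sampleObs)   % one element per discrete torque action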

To create the DQN agent, first specify the DQN agent options using rlDQNAgentOptions.

agentOptions = rlDQNAgentOptions(...
    'SampleTime',TS,...
    'CriticOptimizerOptions',criticOpts,...
    'ExperienceBufferLength',3000,...
    'UseDoubleDQN',false);

Then, create the DQN agent using the specified critic representation and agent options. For more information, see rlDQNAgent.

agent = rlDQNAgent(critic,agentOptions);
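
Before training, you can check that the agent returns a valid torque for a given observation. The observation below is illustrative; because the critic is untrained, whichever of the three actions is returned is essentially arbitrary.

% Ask the untrained agent for an action given a sample observation.
sampleObs = {[0; -1; 0]};
action = getAction(agent,sampleObs);
action{1}   % one of the allowed torques: -2, 0, or 2 N·m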

Train Agent

To train the agent, first specify the training options. For this example, use the following options.

  • Run each training for at most 1000 episodes, with each episode lasting at most 500 time steps.

  • Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).

  • Stop training when the agent receives an average cumulative reward greater than –1100 over five consecutive episodes. At this point, the agent can quickly balance the pendulum in the upright position using minimal control effort.

  • Save a copy of the agent for each episode with a cumulative reward greater than –1100.

For more information, see rlTrainingOptions.

trainingOptions = rlTrainingOptions(...
    'MaxEpisodes',1000,...
    'MaxStepsPerEpisode',500,...
    'ScoreAveragingWindowLength',5,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',-1100,...
    'SaveAgentCriteria','EpisodeReward',...
    'SaveAgentValue',-1100);

Train the agent using the train function. Training this agent is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainingOptions);
else
    % Load the pretrained agent for the example.
    load('SimulinkPendulumDQNMulti.mat','agent');
end
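
If you do train the agent yourself, the returned trainingStats object contains per-episode statistics. For instance, assuming the standard EpisodeIndex and EpisodeReward fields returned by train, you could plot the learning curve as follows.

% Plot the per-episode reward after a training run (doTraining = true).
figure
plot(trainingStats.EpisodeIndex,trainingStats.EpisodeReward)
xlabel('Episode')
ylabel('Episode reward')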

Simulate DQN Agent

To validate the performance of the trained agent, simulate it within the pendulum environment. For more information on agent simulation, see rlSimulationOptions and sim.

simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);

Figure Simple Pendulum Visualizer contains an axes object. The axes object contains 2 objects of type line, rectangle.
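
The experience output contains the logged simulation signals. As one example of using it, the total reward accumulated over the episode can be computed from the logged reward timeseries (the field access shown below is an assumption based on the sim documentation).

% Total reward accumulated over the simulated episode.
totalReward = sum(experience.Reward.Data)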

See Also

Related Topics