
Create Agent Using Deep Network Designer and Train Using Image Observations

This example shows how to create a deep Q-learning network (DQN) agent that can swing up and balance a pendulum modeled in MATLAB®. In this example, you create the DQN agent using Deep Network Designer. For more information on DQN agents, see Deep Q-Network Agents (Reinforcement Learning Toolbox).

Pendulum Swing-Up with Image MATLAB Environment

The reinforcement learning environment for this example is a simple frictionless pendulum that initially hangs in a downward position. The training goal is to make the pendulum stand upright without falling over using minimal control effort.

For this environment:

  • The upward balanced pendulum position is 0 radians, and the downward hanging position is pi radians.

  • The torque action signal from the agent to the environment is from –2 to 2 N·m.

  • The observations from the environment are the simplified grayscale image of the pendulum and the pendulum angle derivative.

  • The reward $r_t$, provided at every time step, is

$$r_t = -\left(\theta_t^2 + 0.1\,\dot{\theta}_t^2 + 0.001\,u_{t-1}^2\right)$$

Here:

  • $\theta_t$ is the angle of displacement from the upright position.

  • $\dot{\theta}_t$ is the derivative of the displacement angle.

  • $u_{t-1}$ is the control effort from the previous time step. (A short numerical sketch of this reward follows the list.)
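As a quick numerical illustration of this reward, the following sketch evaluates it for one arbitrary state. The variable names theta, thetaDot, and uPrev are placeholders for illustration only and are not part of the environment interface.

% Evaluate the reward for one hypothetical time step (illustrative values).
theta    = pi/4;   % displacement from the upright position (rad)
thetaDot = 0.5;    % derivative of the displacement angle (rad/s)
uPrev    = 1;      % control effort from the previous time step (N*m)
r = -(theta^2 + 0.1*thetaDot^2 + 0.001*uPrev^2)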

For more information on this model, see Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation (Reinforcement Learning Toolbox).

Create Environment Interface

Create a predefined environment interface for the pendulum.

env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

The interface has two observations. The first observation, named "pendImage", is a 50-by-50 grayscale image.

obsInfo = getObservationInfo(env);
obsInfo(1)
ans = rlNumericSpec with properties: LowerLimit: 0 UpperLimit: 1 Name: "pendImage" Description: [0x0 string] Dimension: [50 50] DataType: "double"

The second observation, named "angularRate", is the angular velocity of the pendulum.

obsInfo(2)
ans = rlNumericSpec with properties: LowerLimit: -Inf UpperLimit: Inf Name: "angularRate" Description: [0x0 string] Dimension: [1 1] DataType: "double"

The interface has a discrete action space where the agent can apply one of five possible torque values to the pendulum: –2, –1, 0, 1, or 2 N·m.

actInfo = getActionInfo(env)
actInfo = rlFiniteSetSpec with properties: Elements: [-2 -1 0 1 2] Name: "torque" Description: [0x0 string] Dimension: [1 1] DataType: "double"
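The specification objects also carry the sizes you need when configuring the network inputs in the next section. A minimal sketch, assuming the obsInfo and actInfo variables created above (the variable names below are illustrative):

% Read input sizes and action count from the specifications (illustrative names).
imgSize    = obsInfo(1).Dimension;     % [50 50] grayscale image
rateSize   = obsInfo(2).Dimension;     % [1 1] angular velocity
numActions = numel(actInfo.Elements)   % 5 discrete torque values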

Fix the random generator seed for reproducibility.

rng(0)

Construct Critic Network Using Deep Network Designer

A DQN agent approximates the long-term reward, given observations and actions, using a critic value function representation. For this environment, the critic is a deep neural network with three inputs (the two observations and one action) and one output. For more information on creating a deep neural network value function representation, see Create Policy and Value Function Representations (Reinforcement Learning Toolbox).

You can construct the critic network interactively by using the Deep Network Designer app. To do so, you first create separate input paths for each observation and action. These paths learn lower-level features from their respective inputs. You then create a common output path that combines the outputs from the input paths.

Create Image Observation Path

To create the image observation path, first drag an imageInputLayer from the Layer Library pane to the canvas. Set the layer InputSize to 50,50,1 for the image observation, and set Normalization to none.

Second, drag a convolution2dLayer onto the canvas and connect the input of this layer to the output of the imageInputLayer. Create a convolution layer with 2 filters (NumFilters property) that have a height and width of 10 (FilterSize property), and use a stride of 5 in the horizontal and vertical directions (Stride property).

Finally, complete the image path network with two sets of reluLayer and fullyConnectedLayer layers. The output sizes of the first and second fullyConnectedLayer layers are 400 and 300, respectively.
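If you prefer to define this path programmatically rather than in the app, the image observation path described above corresponds to a layer array like the following sketch. The layer names are illustrative; the full network generated by Deep Network Designer appears later in this example.

% Programmatic sketch of the image observation path.
imgPath = [
    imageInputLayer([50 50 1],"Name","pendImage","Normalization","none")
    convolution2dLayer([10 10],2,"Name","img_conv1","Padding","same","Stride",[5 5])
    reluLayer("Name","relu_1")
    fullyConnectedLayer(400,"Name","critic_theta_fc1")
    reluLayer("Name","theta_relu1")
    fullyConnectedLayer(300,"Name","critic_theta_fc2")];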

Create All Input Paths and Output Path

Construct the other input paths and the output path in a similar manner. For this example, use the following options.

Angular velocity path (scalar input):

  • imageInputLayer — Set InputSize to 1,1 and Normalization to none.

  • fullyConnectedLayer — Set OutputSize to 400.

  • reluLayer

  • fullyConnectedLayer — Set OutputSize to 300.

Action path (scalar input):

  • imageInputLayer — Set InputSize to 1,1 and Normalization to none.

  • fullyConnectedLayer — Set OutputSize to 300.

Output path:

  • additionLayer — Connect the outputs of all input paths to the input of this layer.

  • reluLayer

  • fullyConnectedLayer — Set OutputSize to 1 for the scalar value function.

Export Network from Deep Network Designer

To export the network to the MATLAB workspace, in Deep Network Designer, click Export. Deep Network Designer exports the network as a new variable containing the network layers. You can create the critic representation using this layer network variable.

Alternatively, to generate equivalent MATLAB code for the network, click Export > Generate Code.

The generated code is as follows.

lgraph = layerGraph();

tempLayers = [
    imageInputLayer([1 1 1],"Name","angularRate","Normalization","none")
    fullyConnectedLayer(400,"Name","dtheta_fc1")
    reluLayer("Name","dtheta_relu1")
    fullyConnectedLayer(300,"Name","dtheta_fc2")];
lgraph = addLayers(lgraph,tempLayers);

tempLayers = [
    imageInputLayer([1 1 1],"Name","torque","Normalization","none")
    fullyConnectedLayer(300,"Name","torque_fc1")];
lgraph = addLayers(lgraph,tempLayers);

tempLayers = [
    imageInputLayer([50 50 1],"Name","pendImage","Normalization","none")
    convolution2dLayer([10 10],2,"Name","img_conv1","Padding","same","Stride",[5 5])
    reluLayer("Name","relu_1")
    fullyConnectedLayer(400,"Name","critic_theta_fc1")
    reluLayer("Name","theta_relu1")
    fullyConnectedLayer(300,"Name","critic_theta_fc2")];
lgraph = addLayers(lgraph,tempLayers);

tempLayers = [
    additionLayer(3,"Name","addition")
    reluLayer("Name","relu_2")
    fullyConnectedLayer(1,"Name","stateValue")];
lgraph = addLayers(lgraph,tempLayers);

lgraph = connectLayers(lgraph,"torque_fc1","addition/in3");
lgraph = connectLayers(lgraph,"critic_theta_fc2","addition/in1");
lgraph = connectLayers(lgraph,"dtheta_fc2","addition/in2");

View the critic network configuration.

figure
plot(lgraph)

Figure contains an axes object. The axes object contains an object of type graphplot.

Specify options for the critic representation using rlRepresentationOptions (Reinforcement Learning Toolbox).

criticOpts = rlRepresentationOptions('LearnRate',1E-03,'GradientThreshold',1);

Create the critic representation using the specified deep neural network lgraph and options. You must also specify the action and observation information for the critic, which you obtain from the environment interface. For more information, see rlQValueRepresentation (Reinforcement Learning Toolbox).

critic = rlQValueRepresentation(lgraph,obsInfo,actInfo,...
    'Observation',{'pendImage','angularRate'},'Action',{'torque'},criticOpts);
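As an optional sanity check, not part of the original example, you can evaluate the untrained critic for a sample observation and action with getValue. The sample values below are arbitrary, and depending on your release you may need to adjust the observation dimensions.

% Evaluate the untrained critic for an arbitrary observation/action pair (sketch).
sampleObs = {rand(50,50), rand(1,1)};   % {pendImage, angularRate}
sampleAct = {0};                        % one of the five discrete torque values
q0 = getValue(critic,sampleObs,sampleAct)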

To create the DQN agent, first specify the DQN agent options using rlDQNAgentOptions (Reinforcement Learning Toolbox).

agentOpts = rlDQNAgentOptions(...
    'UseDoubleDQN',false,...
    'TargetUpdateMethod',"smoothing",...
    'TargetSmoothFactor',1e-3,...
    'ExperienceBufferLength',1e6,...
    'DiscountFactor',0.99,...
    'SampleTime',env.Ts,...
    'MiniBatchSize',64);
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-5;
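Reducing EpsilonDecay to 1e-5 keeps the agent exploring for much longer. The sketch below approximates the resulting exploration schedule, assuming the multiplicative update Epsilon = Epsilon*(1 - EpsilonDecay), applied once per step until EpsilonMin is reached; it is an illustration, not part of the original example.

% Approximate epsilon-greedy schedule under the assumed update rule (sketch).
eg = agentOpts.EpsilonGreedyExploration;
steps = 0:1e4:1e6;
epsilon = max(eg.EpsilonMin, eg.Epsilon*(1 - eg.EpsilonDecay).^steps);
figure
plot(steps,epsilon)
xlabel("Training steps"); ylabel("Epsilon")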

Then, create the DQN agent using the specified critic representation and agent options. For more information, see rlDQNAgent (Reinforcement Learning Toolbox).

agent = rlDQNAgent(critic,agentOpts);

Train Agent

To train the agent, first specify the training options. For this example, use the following options.

  • Run each training for at most 5000 episodes, with each episode lasting at most 500 time steps.

  • Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command line display (set the Verbose option to false).

  • Stop training when the agent receives an average cumulative reward greater than –1000 over the default window length of five consecutive episodes. At this point, the agent can quickly balance the pendulum in the upright position using minimal control effort.

For more information, see rlTrainingOptions (Reinforcement Learning Toolbox).

trainOpts = rlTrainingOptions(...
    'MaxEpisodes',5000,...
    'MaxStepsPerEpisode',500,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',-1000);

You can visualize the pendulum system during training or simulation by using the plot function.

plot(env)

Figure Simple Pendulum Visualizer contains 2 axes objects. Axes object 1 contains 2 objects of type line, rectangle. Axes object 2 contains an object of type image.

Train the agent using the train (Reinforcement Learning Toolbox) function. Training is a computationally intensive process that takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load('MATLABPendImageDQN.mat','agent');
end

Simulate DQN Agent

To validate the performance of the trained agent, simulate it within the pendulum environment. For more information on agent simulation, see rlSimulationOptions (Reinforcement Learning Toolbox) and sim (Reinforcement Learning Toolbox).

simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);

Figure Simple Pendulum Visualizer contains 2 axes objects. Axes object 1 contains 2 objects of type line, rectangle. Axes object 2 contains an object of type image.

totalReward = sum(experience.Reward)
totalReward = -888.9802
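If you want to inspect the simulated trajectory in more detail, the experience output also contains the logged observation and action signals. The sketch below assumes the observation fields are named after the observation channels defined earlier; check the fields of experience.Observation in your release before using it.

% Plot the angular-rate trajectory from the simulation (sketch; assumes the
% observation field is named after the "angularRate" channel).
rate = squeeze(experience.Observation.angularRate.Data);
figure
plot(experience.Observation.angularRate.Time,rate)
xlabel("Time (s)"); ylabel("Angular rate (rad/s)")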
