This example shows how to create a deep Q-learning network (DQN) agent that can swing up and balance a pendulum modeled in MATLAB®. In this example, you create the DQN agent using Deep Network Designer. For more information on DQN agents, see Deep Q-Network Agents (Reinforcement Learning Toolbox).
The reinforcement learning environment for this example is a simple frictionless pendulum that initially hangs in a downward position. The training goal is to make the pendulum stand upright without falling over using minimal control effort.
For this environment:
The upward balanced pendulum position is 0 radians, and the downward hanging position is pi radians.
The torque action signal from the agent to the environment is from –2 to 2 N·m.
The observations from the environment are the simplified grayscale image of the pendulum and the pendulum angle derivative.
The reward, provided at every time step, is

r_t = -(θ_t^2 + 0.1*θ̇_t^2 + 0.001*u_{t-1}^2)

Here:
θ_t is the angle of displacement from the upright position.
θ̇_t is the derivative of the displacement angle.
u_{t-1} is the control effort from the previous time step.
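For reference, this quadratic reward can be sketched as a small MATLAB function. This is a hypothetical helper for illustration only; the name pendulumReward and the coefficients 0.1 and 0.001, which follow the related pendulum swing-up examples, are assumptions, not part of the toolbox.

```matlab
% pendulumReward: hypothetical helper illustrating the per-step reward.
% It penalizes angular displacement from upright (theta), angular
% velocity (thetaDot), and the previous control effort (uPrev).
function r = pendulumReward(theta, thetaDot, uPrev)
    r = -(theta^2 + 0.1*thetaDot^2 + 0.001*uPrev^2);
end
```

At the upright equilibrium with no applied torque, pendulumReward(0,0,0) returns 0, the maximum attainable reward; any deviation or control effort makes the reward negative.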
For more information on this model, see Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation (Reinforcement Learning Toolbox).
Create a predefined environment interface for the pendulum.
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
The interface has two observations. The first observation, named "pendImage", is a 50-by-50 grayscale image.
obsInfo = getObservationInfo(env);
obsInfo(1)
ans = 
  rlNumericSpec with properties:
     LowerLimit: 0
     UpperLimit: 1
           Name: "pendImage"
    Description: [0x0 string]
      Dimension: [50 50]
       DataType: "double"
The second observation, named "angularRate", is the angular velocity of the pendulum.
obsInfo(2)
ans = 
  rlNumericSpec with properties:
     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "angularRate"
    Description: [0x0 string]
      Dimension: [1 1]
       DataType: "double"
The interface has a discrete action space where the agent can apply one of five possible torque values to the pendulum: –2, –1, 0, 1, or 2 N·m.
actInfo = getActionInfo(env)
actInfo = 
  rlFiniteSetSpec with properties:
       Elements: [-2 -1 0 1 2]
           Name: "torque"
    Description: [0x0 string]
      Dimension: [1 1]
       DataType: "double"
Fix the random generator seed for reproducibility.

rng(0)
A DQN agent approximates the long-term reward, given observations and actions, using a critic value function representation. For this environment, the critic is a deep neural network with three inputs (the two observations and the action) and one output. For more information on creating a deep neural network value function representation, see Create Policy and Value Function Representations (Reinforcement Learning Toolbox).
You can construct the critic network interactively by using the Deep Network Designer app. To do so, you first create separate input paths for each observation and action. These paths learn lower-level features from their respective inputs. You then create a common output path that combines the outputs from the input paths.
Create Image Observation Path
To create the image observation path, first drag an imageInputLayer from the Layer Library pane to the canvas. Set the layer InputSize to 50,50,1 for the image observation, and set Normalization to none.
Next, drag a convolution2dLayer to the canvas and connect the input of this layer to the output of the imageInputLayer. Create a convolution layer with 2 filters (NumFilters property) that have a height and width of 10 (FilterSize property), and use a stride of 5 in the horizontal and vertical directions (Stride property).
Finally, complete the image path network with two sets of reluLayer and fullyConnectedLayer layers. The output sizes of the first and second fullyConnectedLayer layers are 400 and 300, respectively.
Create All Input and Output Paths

Construct the other input paths and the output path in a similar manner. For this example, use the following options.
Angular velocity path (scalar input):
imageInputLayer — Set InputSize to 1,1 and Normalization to none.
fullyConnectedLayer — Set OutputSize to 400.
reluLayer
fullyConnectedLayer — Set OutputSize to 300.
Action path (scalar input):
imageInputLayer — Set InputSize to 1,1 and Normalization to none.
fullyConnectedLayer — Set OutputSize to 300.
Output path:
additionLayer — Connect the outputs of all input paths to the input of this layer.
reluLayer
fullyConnectedLayer — Set OutputSize to 1 for the scalar value function.
To export the network to the MATLAB workspace, in Deep Network Designer, click Export. Deep Network Designer exports the network as a new variable containing the network layers. You can create the critic representation using this layer network variable.
Alternatively, to generate equivalent MATLAB code for the network, click Export > Generate Code.
The generated code is as follows.
lgraph = layerGraph();

tempLayers = [
    imageInputLayer([1 1 1],"Name","angularRate","Normalization","none")
    fullyConnectedLayer(400,"Name","dtheta_fc1")
    reluLayer("Name","dtheta_relu1")
    fullyConnectedLayer(300,"Name","dtheta_fc2")];
lgraph = addLayers(lgraph,tempLayers);

tempLayers = [
    imageInputLayer([1 1 1],"Name","torque","Normalization","none")
    fullyConnectedLayer(300,"Name","torque_fc1")];
lgraph = addLayers(lgraph,tempLayers);

tempLayers = [
    imageInputLayer([50 50 1],"Name","pendImage","Normalization","none")
    convolution2dLayer([10 10],2,"Name","img_conv1","Padding","same","Stride",[5 5])
    reluLayer("Name","relu_1")
    fullyConnectedLayer(400,"Name","critic_theta_fc1")
    reluLayer("Name","theta_relu1")
    fullyConnectedLayer(300,"Name","critic_theta_fc2")];
lgraph = addLayers(lgraph,tempLayers);

tempLayers = [
    additionLayer(3,"Name","addition")
    reluLayer("Name","relu_2")
    fullyConnectedLayer(1,"Name","stateValue")];
lgraph = addLayers(lgraph,tempLayers);

lgraph = connectLayers(lgraph,"torque_fc1","addition/in3");
lgraph = connectLayers(lgraph,"critic_theta_fc2","addition/in1");
lgraph = connectLayers(lgraph,"dtheta_fc2","addition/in2");
View the critic network configuration.
figure
plot(lgraph)
Specify options for the critic representation using rlRepresentationOptions (Reinforcement Learning Toolbox).
criticOpts = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);
Create the critic representation using the specified deep neural network lgraph and options. You must also specify the action and observation info for the critic, which you obtain from the environment interface. For more information, see rlQValueRepresentation (Reinforcement Learning Toolbox).

critic = rlQValueRepresentation(lgraph,obsInfo,actInfo,...
    'Observation',{'pendImage','angularRate'},'Action',{'torque'},criticOpts);
To create the DQN agent, first specify the DQN agent options using rlDQNAgentOptions (Reinforcement Learning Toolbox).

agentOpts = rlDQNAgentOptions(...
    'UseDoubleDQN',false,...
    'TargetUpdateMethod',"smoothing",...
    'TargetSmoothFactor',1e-3,...
    'ExperienceBufferLength',1e6,...
    'DiscountFactor',0.99,...
    'SampleTime',env.Ts,...
    'MiniBatchSize',64);
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-5;
Then, create the DQN agent using the specified critic representation and agent options. For more information, see rlDQNAgent (Reinforcement Learning Toolbox).
agent = rlDQNAgent(critic,agentOpts);
To train the agent, first specify the training options. For this example, use the following options.
Run each training for at most 5000 episodes, with each episode lasting at most 500 time steps.
Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).
Stop training when the agent receives an average cumulative reward greater than –1000 over the default window length of five consecutive episodes. At this point, the agent can quickly balance the pendulum in the upright position using minimal control effort.
For more information, see rlTrainingOptions (Reinforcement Learning Toolbox).

trainOpts = rlTrainingOptions(...
    'MaxEpisodes',5000,...
    'MaxStepsPerEpisode',500,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',-1000);
You can visualize the pendulum system during training or simulation using the plot function.
plot(env)
Train the agent using the train (Reinforcement Learning Toolbox) function. Training is a computationally intensive process that takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load('MATLABPendImageDQN.mat','agent');
end
To validate the performance of the trained agent, simulate it within the pendulum environment. For more information on agent simulation, see rlSimulationOptions (Reinforcement Learning Toolbox) and sim (Reinforcement Learning Toolbox).

simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)

totalReward = -888.9802
See Also
Deep Network Designer | rlDQNAgent (Reinforcement Learning Toolbox)