This example shows how to train a deep deterministic policy gradient (DDPG) reinforcement learning agent on the watertank Simulink® model. For an example that trains a DDPG agent in MATLAB®, see Train DDPG Agent to Balance Double Integrator System.
The original model for this example is the water tank model. The goal is to control the level of the water in the tank. For more information about the water tank model, see watertank Simulink Model (Simulink Control Design).
Modify the original model by making the following changes:
- Delete the PID controller.
- Insert the RL Agent block.
- Connect the observation vector $[\int e\,\mathrm{d}t,\ e,\ h]^T$, where $h$ is the height of the water in the tank, $e = r - h$, and $r$ is the reference height.
- Set up the reward $\mathrm{reward} = 10\,(|e| < 0.1) - 1\,(|e| \ge 0.1) - 100\,(h \le 0 \;\lor\; h \ge 20)$ (a MATLAB sketch of this logic follows the list).
- Configure the termination signal so that the simulation stops if $h \le 0$ or $h \ge 20$.

The resulting model is rlwatertank.slx. For more information on this model and the changes, see Create Simulink Environments for Reinforcement Learning.
open_system('rlwatertank')

Create the Environment Interface

Creating an environment model includes defining the following:

- The action and observation signals that the agent uses to interact with the environment. For more information, see rlNumericSpec and rlFiniteSetSpec.
- The reward signal that the agent uses to measure its success. For more information, see Define Reward Signals.

Define the observation specification obsInfo and the action specification actInfo.
obsInfo = rlNumericSpec([3 1],...
    'LowerLimit',[-inf -inf 0  ]',...
    'UpperLimit',[ inf  inf inf]');
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, and measured height';
numObservations = obsInfo.Dimension(1);

actInfo = rlNumericSpec([1 1]);
actInfo.Name = 'flow';
numActions = actInfo.Dimension(1);
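Both channels here are continuous, which is why rlNumericSpec is used. The rlFiniteSetSpec object mentioned above applies to discrete channels instead; as a hypothetical illustration (not used in this example), a pump limited to three fixed flow rates could be specified as:

% Hypothetical discrete action specification, for illustration only.
discreteActInfo = rlFiniteSetSpec([-1 0 1]);
discreteActInfo.Name = 'discrete flow';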
Build the environment interface object.

env = rlSimulinkEnv('rlwatertank','rlwatertank/RL Agent',...
    obsInfo,actInfo);
Set a custom reset function that randomizes the reference values for the model.

env.ResetFcn = @(in)localResetFcn(in);
Specify the simulation time Tf and the agent sample time Ts in seconds.

Ts = 1.0;
Tf = 200;
Fix the random generator seed for reproducibility.

rng(0)
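Optionally, you can sanity-check the interface before training. This step is not in the original example; validateEnvironment (Reinforcement Learning Toolbox) briefly simulates the model and errors out if the observation and action channels do not match their specifications.

% Optional check: runs a short simulation and reports mismatches between
% the model signals and the obsInfo/actInfo specifications.
validateEnvironment(env)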
Create DDPG Agent

Given observations and actions, a DDPG agent approximates the long-term reward using a critic value function representation. To create the critic, first create a deep neural network with two inputs, the observation and the action, and one output. For more information on creating a deep neural network value function representation, see Create Policy and Value Function Representations.
statePath = [
    featureInputLayer(numObservations,'Normalization','none','Name','State')
    fullyConnectedLayer(50,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(25,'Name','CriticStateFC2')];
actionPath = [
    featureInputLayer(numActions,'Normalization','none','Name','Action')
    fullyConnectedLayer(25,'Name','CriticActionFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','CriticOutput')];

criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
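Optionally, if you have Deep Learning Toolbox, analyzeNetwork gives a more detailed check of the assembled layer graph than the plot below, flagging disconnected or size-mismatched layers (an extra step, not in the original example).

% Optional: open the Network Analyzer on the critic layer graph.
analyzeNetwork(criticNetwork)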
View the critic network configuration.

figure
plot(criticNetwork)

Specify options for the critic representation using rlRepresentationOptions.
criticOpts = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);

Create the critic representation using the specified deep neural network and options. You must also specify the action and observation specifications for the critic, which you obtain from the environment interface. For more information, see rlQValueRepresentation.
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation',{'State'},'Action',{'Action'},criticOpts);

Given observations, a DDPG agent decides which action to take using an actor representation. To create the actor, first create a deep neural network with one input, the observation, and one output, the action. Construct the actor in a manner similar to the critic. For more information, see rlDeterministicActorRepresentation.
actorNetwork = [
    featureInputLayer(numObservations,'Normalization','none','Name','State')
    fullyConnectedLayer(3,'Name','actorFC')
    tanhLayer('Name','actorTanh')
    fullyConnectedLayer(numActions,'Name','Action')];

actorOptions = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);

actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'State'},'Action',{'Action'},actorOptions);

To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.
agentOpts = rlDDPGAgentOptions(...
    'SampleTime',Ts,...
    'TargetSmoothFactor',1e-3,...
    'DiscountFactor',1.0,...
    'MiniBatchSize',64,...
    'ExperienceBufferLength',1e6);
agentOpts.NoiseOptions.Variance = 0.3;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;

Then, create the DDPG agent using the specified actor representation, critic representation, and agent options. For more information, see rlDDPGAgent.
agent = rlDDPGAgent(actor,critic,agentOpts);
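As a quick sanity check (not part of the original example), you can query the untrained agent with a random observation that matches obsInfo and confirm that it returns a single scalar flow action.

% Draw one action from the untrained agent for a random 3-by-1 observation.
% The result should be a scalar consistent with actInfo.
getAction(agent,{rand(obsInfo.Dimension)})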
Train Agent

To train the agent, first specify the training options. For this example, use the following options:

- Run each training for at most 5000 episodes, with each episode lasting at most ceil(Tf/Ts) (that is, 200) time steps.
- Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).
- Stop training when the agent receives an average cumulative reward greater than 800 over 20 consecutive episodes. At that point, the agent can control the level of water in the tank.

For more information, see rlTrainingOptions.
maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'ScoreAveragingWindowLength',20,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',800);
Train the agent using the train function. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;

if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load('WaterTankDDPG.mat','agent')
end
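If you do run training, the statistics returned by train can be plotted to inspect the learning curve. The following sketch assumes the default field names of the train output (EpisodeIndex, EpisodeReward, AverageReward):

% Sketch: visualize per-episode and window-averaged reward after training.
if doTraining
    figure
    plot(trainingStats.EpisodeIndex,trainingStats.EpisodeReward)
    hold on
    plot(trainingStats.EpisodeIndex,trainingStats.AverageReward)
    xlabel('Episode')
    ylabel('Reward')
    legend('Episode reward','Average reward')
end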
Validate Trained Agent

Validate the learned agent against the model by simulation.
simOpts = rlSimulationOptions('MaxSteps',maxsteps,'StopOnError','on');
experiences = sim(env,agent,simOpts);
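The sim output bundles the logged episode data. For example, assuming the default output layout in which experiences.Reward is a timeseries, you can inspect the reward collected over the validation run:

% Sketch: plot the instantaneous reward logged during the validation episode.
figure
plot(experiences.Reward)
title('Reward During Validation Episode')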
Local Function
function in = localResetFcn(in)

% Randomize the reference signal.
blk = sprintf('rlwatertank/Desired \nWater Level');
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
in = setBlockParameter(in,blk,'Value',num2str(h));

% Randomize the initial height.
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
blk = 'rlwatertank/Water-Tank System/H';
in = setBlockParameter(in,blk,'InitialCondition',num2str(h));

end