This example shows how to train a deep deterministic policy gradient (DDPG) agent to control the level of water in a tank, using the watertank Simulink® model. For an example that trains a DDPG agent in MATLAB®, see <a href="//www.tatmou.com/au/help/reinforcement-learning/ug/train-ddpg-agent-to-balance-double-integrator-system.html" class="a">Train DDPG Agent to Balance Double Integrator System</a>.
The starting point for this example is the water tank model, and the goal is to control the level of the water in the tank. For more information about this model, see <a href="//www.tatmou.com/au/help/slcontrol/gs/watertank-simulink-model.html" class="a">watertank Simulink Model</a> (Simulink Control Design).
Modify the original model as follows:
1. Delete the PID Controller.
2. Insert the RL Agent block.
3. Connect the observation vector [∫e dt, e, h]ᵀ, where h is the height of the water in the tank, e = r - h, and r is the reference height.
4. Set up the reward signal reward = 10 (|e| < 0.1) - 1 (|e| ≥ 0.1) - 100 (h ≤ 0 || h ≥ 20).
5. Configure the termination signal so that the simulation stops if h ≤ 0 or h ≥ 20.

The resulting model is rlwatertank.slx. For more information on this model and the changes made to it, see <a href="//www.tatmou.com/au/help/reinforcement-learning/ug/create-simulink-environments-for-reinforcement-learning.html" class="a">Create Simulink Environments for Reinforcement Learning</a>.
Creating an environment model includes defining the following:
- The action and observation signals that the agent uses to interact with the environment. For more information, see <a href="//www.tatmou.com/au/help/reinforcement-learning/ref/rl.util.rlnumericspec.html" class="a">rlNumericSpec</a> and <a href="//www.tatmou.com/au/help/reinforcement-learning/ref/rl.util.rlfinitesetspec.html" class="a">rlFiniteSetSpec</a>.
- The reward signal that the agent uses to measure its success. For more information, see <a href="//www.tatmou.com/au/help/reinforcement-learning/ug/define-reward-signals.html" class="a">Define Reward Signals</a>.
Build the environment interface object.
Set a custom reset function that randomizes the reference values for the model.
Specify the simulation time Tf and the agent sample time Ts in seconds.
Fix the random generator seed for reproducibility.
A DDPG agent approximates the long-term reward, given observations and actions, using a critic value function representation. To create the critic, first create a deep neural network with two inputs, the observation and the action, and one output. For more information on creating a deep neural network value function representation, see <a href="//www.tatmou.com/au/help/reinforcement-learning/ug/create-policy-and-value-function-representations.html" class="a">Create Policy and Value Function Representations</a>.
View the critic network configuration.
Specify options for the critic representation using <a href="//www.tatmou.com/au/help/reinforcement-learning/ref/rlrepresentationoptions.html" class="a">rlRepresentationOptions</a>.

Create the critic representation using the specified deep neural network and options. You must also specify the action and observation specifications for the critic, which you obtain from the environment interface. For more information, see <a href="//www.tatmou.com/au/help/reinforcement-learning/ref/rlqvaluerepresentation.html" class="a">rlQValueRepresentation</a>.

Given observations, a DDPG agent decides which action to take using an actor representation. To create the actor, first create a deep neural network with one input, the observation, and one output, the action.
Construct the actor in a similar manner to the critic. For more information, see <a href="//www.tatmou.com/au/help/reinforcement-learning/ref/rldeterministicactorrepresentation.html" class="a">rlDeterministicActorRepresentation</a>.

To create the DDPG agent, first specify the agent options using <a href="//www.tatmou.com/au/help/reinforcement-learning/ref/rlddpgagentoptions.html" class="a">rlDDPGAgentOptions</a>.

Then, create the DDPG agent using the specified actor representation, critic representation, and agent options. For more information, see <a href="//www.tatmou.com/au/help/reinforcement-learning/ref/rlddpgagent.html" class="a">rlDDPGAgent</a>.

To train the agent, first specify the training options. For this example, use the following options:
- Run each training session for at most 5000 episodes. Specify that each episode lasts for at most ceil(Tf/Ts) (that is, 200) time steps.
- Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).
- Stop training when the agent receives an average cumulative reward greater than 800 over 20 consecutive episodes. At that point, the agent can control the level of water in the tank.

For more information, see <a href="//www.tatmou.com/au/help/reinforcement-learning/ref/rltrainingoptions.html" class="a">rlTrainingOptions</a>.

Train the agent using the <a href="//www.tatmou.com/au/help/reinforcement-learning/ref/rl.agent.rlqagent.train.html" class="a">train</a> function.

Validate the learned agent against the model by simulation.
open_system('rlwatertank')
Create Environment Interface
Define the observation specification obsInfo and the action specification actInfo.
obsInfo = rlNumericSpec([3 1],...
    'LowerLimit',[-inf -inf 0  ]',...
    'UpperLimit',[ inf  inf inf]');
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, and measured height';
numObservations = obsInfo.Dimension(1);

actInfo = rlNumericSpec([1 1]);
actInfo.Name = 'flow';
numActions = actInfo.Dimension(1);
env = rlSimulinkEnv('rlwatertank','rlwatertank/RL Agent',...
    obsInfo,actInfo);
env.ResetFcn = @(in)localResetFcn(in);
Ts = 1.0;
Tf = 200;
rng(0)
Create DDPG Agent
statePath = [
    featureInputLayer(numObservations,'Normalization','none','Name','State')
    fullyConnectedLayer(50,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(25,'Name','CriticStateFC2')];
actionPath = [
    featureInputLayer(numActions,'Normalization','none','Name','Action')
    fullyConnectedLayer(25,'Name','CriticActionFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','CriticOutput')];

criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
figure
plot(criticNetwork)
Specify the critic representation options.
criticOpts = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);
Create the critic representation.
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation',{'State'},'Action',{'Action'},criticOpts);
Create the actor representation.
actorNetwork = [
    featureInputLayer(numObservations,'Normalization','none','Name','State')
    fullyConnectedLayer(3,'Name','actorFC')
    tanhLayer('Name','actorTanh')
    fullyConnectedLayer(numActions,'Name','Action')];

actorOptions = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);

actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'State'},'Action',{'Action'},actorOptions);
Specify the DDPG agent options.
agentOpts = rlDDPGAgentOptions(...
    'SampleTime',Ts,...
    'TargetSmoothFactor',1e-3,...
    'DiscountFactor',1.0,...
    'MiniBatchSize',64,...
    'ExperienceBufferLength',1e6);
agentOpts.NoiseOptions.Variance = 0.3;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;
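With these options the exploration noise variance shrinks gradually during training. Assuming the documented update Variance(k+1) = (1 - VarianceDecayRate)*Variance(k) at each sample time, the decay can be sketched as follows (illustration only, not part of the example):

```matlab
% Sketch: how the exploration noise variance decays over training steps,
% assuming the multiplicative update (1 - VarianceDecayRate) per step.
variance  = 0.3;     % initial NoiseOptions.Variance
decayRate = 1e-5;    % NoiseOptions.VarianceDecayRate
steps = 1000*200;    % e.g. 1000 episodes of 200 steps each
finalVariance = variance*(1 - decayRate)^steps;  % closed-form value after 'steps' updates
```

A small decay rate such as 1e-5 keeps meaningful exploration noise for many thousands of steps rather than extinguishing it early in training.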
Create the DDPG agent.
agent = rlDDPGAgent(actor,critic,agentOpts);
Train Agent
Specify the training options.
maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'ScoreAveragingWindowLength',20,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',800);
Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;

if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load('WaterTankDDPG.mat','agent')
end
Validate Trained Agent
simOpts = rlSimulationOptions('MaxSteps',maxsteps,'StopOnError','on');
experiences = sim(env,agent,simOpts);
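The output of sim contains the logged signals from the validation run. As a quick sanity check, the cumulative reward can be inspected along these lines (a sketch; the assumption here is that the returned structure exposes the reward as a timeseries in a Reward field, per the sim documentation):

```matlab
% Sketch: inspect the logged validation run returned by sim.
% Assumes experiences.Reward is a timeseries of per-step rewards.
totalReward = sum(experiences.Reward.Data);
fprintf('Cumulative reward over the validation run: %g\n',totalReward);
```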
Local Function
function in = localResetFcn(in)

% Randomize reference signal
blk = sprintf('rlwatertank/Desired \nWater Level');
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
in = setBlockParameter(in,blk,'Value',num2str(h));

% Randomize initial height
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
blk = 'rlwatertank/Water-Tank System/H';
in = setBlockParameter(in,blk,'InitialCondition',num2str(h));

end