Generate Reward Function from a Model Verification Block for a Water Tank System

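This example shows how to automatically generate a reward function from performance requirements defined in a Simulink® Design Optimization™ model verification block. You then use the generated reward function to train a reinforcement learning agent.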

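Introduction

You can use generateRewardFunction to generate a reward function for reinforcement learning, starting from performance constraints specified in a Simulink Design Optimization model verification block. The resulting reward signal is a sum of weighted penalties on constraint violations by the current state of the environment.

In this example, you convert the cost and constraint specifications defined in a Check Step Response Characteristics block for a water tank system into a reward function. You then use the reward function to train an agent to control the water tank.

Specify parameters for this example.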

% Watertank parameters
a = 2;
b = 5;
A = 20;

% Initial and final height
h0 = 1;
hf = 2;

% Simulation and sample times
Tf = 10;
Ts = 0.1;

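The original model for this example is the watertank Simulink model (Simulink Control Design).

Open the model.

open_system('rlWatertankStepInput')

The model in this example has been modified for reinforcement learning. The goal is to control the water level in the tank using a reinforcement learning agent while satisfying the response characteristics defined in the Check Step Response Characteristics block. Open the block to view the desired step response specifications.

blk = 'rlWatertankStepInput/WaterLevelStepResponse';
open_system(blk)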

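Generate a Reward Function

Generate the reward function code from the specifications in the WaterLevelStepResponse block using generateRewardFunction. The code is displayed in the MATLAB Editor.

generateRewardFunction(blk)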

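The generated reward function is a starting point for reward design. You can modify the function by choosing different penalty functions and tuning the penalty weights. For this example, make the following changes to the generated code:

The default penalty weight is 1. Set the weight to 10.
The default exterior penalty function method is step. Change the method to quadratic.

After you make these changes, the weight and penalty specifications should be as follows:

Weight = 10;
Penalty = sum(exteriorPenalty(x,Block1_xmin,Block1_xmax,'quadratic'));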
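
To see how these two changes shape the reward, the following minimal sketch compares a step penalty with a quadratic penalty in plain MATLAB. The anonymous functions violation, stepPenalty, and quadPenalty are hypothetical stand-ins written for illustration. They are not the toolbox implementation of exteriorPenalty, and the sketch assumes that the step method returns 1 for any violation and that the quadratic method returns the squared distance outside the bounds. The bound values are also assumed for the illustration.

```matlab
% Hypothetical stand-ins for illustration only (not the toolbox code).
% Distance by which x lies outside the interval [xmin, xmax]; 0 when inside.
violation   = @(x,xmin,xmax) max(max(xmin - x, x - xmax), 0);
% Assumed step behavior: constant penalty of 1 for any violation.
stepPenalty = @(x,xmin,xmax) double(violation(x,xmin,xmax) > 0);
% Assumed quadratic behavior: penalty grows with the squared distance.
quadPenalty = @(x,xmin,xmax) violation(x,xmin,xmax).^2;

xmin = 1.98;            % assumed lower bound on the water level
xmax = 2.02;            % assumed upper bound on the water level
x = [1.5 2.0 2.1];      % sample water levels: below, inside, above
Weight = 10;

rewardStep = -Weight * stepPenalty(x,xmin,xmax);
rewardQuad = -Weight * quadPenalty(x,xmin,xmax);
disp([x; rewardStep; rewardQuad])
```

Under these assumptions, the step penalty gives the same reward for a small and a large violation, while the quadratic penalty gives a reward that improves smoothly as the level approaches the bounds, which gives the agent a direction to improve in.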

For this example, the modified code has been saved in the MATLAB function file rewardFunctionVfb.m. Display the generated reward function.

type rewardFunctionVfb.m

function reward = rewardFunctionVfb(x,t)
% REWARDFUNCTION generates rewards from Simulink block specifications.
%
% x : Input of watertank_stepinput_rl/WaterLevelStepResponse
% t : Simulation time (s)

% Reinforcement Learning Toolbox
% 26-Apr-2021 13:05:16

%#codegen

%% Specifications from watertank_stepinput_rl/WaterLevelStepResponse
Block1_InitialValue = 1;
Block1_FinalValue = 2;
Block1_StepTime = 0;
Block1_StepRange = Block1_FinalValue - Block1_InitialValue;
Block1_MinRise = Block1_InitialValue + Block1_StepRange * 80/100;
Block1_MaxSettling = Block1_InitialValue + Block1_StepRange * (1+2/100);
Block1_MinSettling = Block1_InitialValue + Block1_StepRange * (1-2/100);
Block1_MaxOvershoot = Block1_InitialValue + Block1_StepRange * (1+10/100);
Block1_MinUndershoot = Block1_InitialValue - Block1_StepRange * 5/100;

if t >= Block1_StepTime
    if Block1_InitialValue <= Block1_FinalValue
        Block1_UpperBoundTimes = [0,5; 5,max(5+1,t+1)];
        Block1_UpperBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettling,Block1_MaxSettling];
        Block1_LowerBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
        Block1_LowerBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,Block1_MinRise; Block1_MinSettling,Block1_MinSettling];
    else
        Block1_UpperBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
        Block1_UpperBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,Block1_MinRise; Block1_MinSettling,Block1_MinSettling];
        Block1_LowerBoundTimes = [0,5; 5,max(5+1,t+1)];
        Block1_LowerBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettling,Block1_MaxSettling];
    end

    Block1_xmax = zeros(1,size(Block1_UpperBoundTimes,1));
    for idx = 1:numel(Block1_xmax)
        tseg = Block1_UpperBoundTimes(idx,:);
        xseg = Block1_UpperBoundAmplitudes(idx,:);
        Block1_xmax(idx) = interp1(tseg,xseg,t,'linear',NaN);
    end
    if all(isnan(Block1_xmax))
        Block1_xmax = Inf;
    else
        Block1_xmax = max(Block1_xmax,[],'omitnan');
    end

    Block1_xmin = zeros(1,size(Block1_LowerBoundTimes,1));
    for idx = 1:numel(Block1_xmin)
        tseg = Block1_LowerBoundTimes(idx,:);
        xseg = Block1_LowerBoundAmplitudes(idx,:);
        Block1_xmin(idx) = interp1(tseg,xseg,t,'linear',NaN);
    end
    if all(isnan(Block1_xmin))
        Block1_xmin = -Inf;
    else
        Block1_xmin = max(Block1_xmin,[],'omitnan');
    end
else
    Block1_xmin = -Inf;
    Block1_xmax = Inf;
end

%% Penalty function weight (specify nonnegative)
Weight = 10;

%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
Penalty = sum(exteriorPenalty(x,Block1_xmin,Block1_xmax,'quadratic'));

%% Compute reward
reward = -Weight * Penalty;
end
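
The generated function finds the bounds that are active at time t by interpolating over piecewise-constant segments. The sketch below repeats that lookup in plain MATLAB for three query times. The numeric bounds are worked out from the block specifications listed in the function (initial value 1, final value 2). The fixed segment end time of 11 is a simplifying assumption that stands in for max(5+1,t+1), and the variable names are illustrative.

```matlab
% Piecewise-constant bound segments for the upward step from 1 to 2.
% Each row is one segment: [tStart tEnd] with matching amplitudes.
lowerTimes = [0 2; 2 5; 5 11];                   % 11 is an assumed end time
lowerAmps  = [0.95 0.95; 1.80 1.80; 1.98 1.98];  % undershoot, rise, settling
upperTimes = [0 5; 5 11];
upperAmps  = [2.10 2.10; 2.02 2.02];             % overshoot, settling

tQuery = [1 3 7];                                % one time inside each phase
xminAt = zeros(size(tQuery));
xmaxAt = zeros(size(tQuery));
for k = 1:numel(tQuery)
    t = tQuery(k);
    % Interpolate every segment; segments that do not contain t return NaN.
    lo = nan(1,size(lowerTimes,1));
    for idx = 1:numel(lo)
        lo(idx) = interp1(lowerTimes(idx,:),lowerAmps(idx,:),t,'linear',NaN);
    end
    hi = nan(1,size(upperTimes,1));
    for idx = 1:numel(hi)
        hi(idx) = interp1(upperTimes(idx,:),upperAmps(idx,:),t,'linear',NaN);
    end
    % Keep the value from the segment that is active at time t.
    xminAt(k) = max(lo,[],'omitnan');
    xmaxAt(k) = max(hi,[],'omitnan');
end
disp([tQuery; xminAt; xmaxAt])
```

With these values, the lower bound rises from 0.95 to 1.8 at t = 2 and to 1.98 at t = 5, while the upper bound tightens from 2.1 to 2.02 at t = 5.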

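To integrate this reward function into the water tank model, open the MATLAB Function block under the Reward subsystem.

open_system('rlWatertankStepInput/Reward/Reward Function')

Append the following line of code to the function and save the model.

r = rewardFunctionVfb(x,t);

The MATLAB Function block now executes rewardFunctionVfb.m to compute rewards.

For this example, the MATLAB Function block has already been modified and saved.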

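Create a Reinforcement Learning Environment

The environment dynamics are modeled in the Water Tank subsystem. For this environment:

The observations are the reference height ref from the last 5 time steps and the height error err = ref - H, for a total of 6 observations.
The action is the voltage V applied to the pump.
The sample time Ts is 0.1 s.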

Create observation and action specifications for the environment.

numObs = 6;
numAct = 1;
oinfo = rlNumericSpec([numObs 1]);
ainfo = rlNumericSpec([numAct 1]);

Create the reinforcement learning environment using the rlSimulinkEnv function.

env = rlSimulinkEnv('rlWatertankStepInput','rlWatertankStepInput/RL Agent',oinfo,ainfo);

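Create a Reinforcement Learning Agent

Fix the random seed for reproducibility.

rng(100)

The agent in this example is a twin-delayed deep deterministic policy gradient (TD3) agent.

Create two critic representations.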

% Critic
cnet = [
    featureInputLayer(numObs,'Normalization','none','Name','State')
    fullyConnectedLayer(128,'Name','fc1')
    concatenationLayer(1,2,'Name','concat')
    reluLayer('Name','relu1')
    fullyConnectedLayer(128,'Name','fc3')
    reluLayer('Name','relu2')
    fullyConnectedLayer(1,'Name','CriticOutput')];
actionPath = [
    featureInputLayer(numAct,'Normalization','none','Name','Action')
    fullyConnectedLayer(8,'Name','fc2')];
criticNetwork = layerGraph(cnet);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = connectLayers(criticNetwork,'fc2','concat/in2');
criticOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
critic1 = rlQValueRepresentation(criticNetwork,oinfo,ainfo,...
    'Observation',{'State'},'Action',{'Action'},criticOptions);
critic2 = rlQValueRepresentation(criticNetwork,oinfo,ainfo,...
    'Observation',{'State'},'Action',{'Action'},criticOptions);

Create an actor representation.

actorNetwork = [
    featureInputLayer(numObs,'Normalization','none','Name','State')
    fullyConnectedLayer(128,'Name','actorFC1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(128,'Name','actorFC2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(numAct,'Name','Action')];
actorOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,oinfo,ainfo,...
    'Observation',{'State'},'Action',{'Action'},actorOptions);

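Specify the agent options using rlTD3AgentOptions. The agent trains from an experience buffer with a maximum capacity of 1e6 by randomly selecting mini-batches of size 256. The discount factor of 0.99 favors long-term rewards.

agentOpts = rlTD3AgentOptions('SampleTime',Ts,...
    'DiscountFactor',0.99,...
    'ExperienceBufferLength',1e6,...
    'MiniBatchSize',256);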

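The exploration model in this TD3 agent is Gaussian. During training, the noise model adds a random value drawn from a Gaussian distribution to the action. Set the standard deviation of the noise to 0.5. The standard deviation decays at a rate of 1e-5 at every agent step until it reaches the minimum value of 0.

agentOpts.ExplorationModel.StandardDeviation = 0.5;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-5;
agentOpts.ExplorationModel.StandardDeviationMin = 0;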
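
To get a feel for these settings, the sketch below estimates how far the noise standard deviation can decay over a full training run. It assumes that the decay is multiplicative, so that each agent step scales the standard deviation by one minus the decay rate, bounded below by the minimum. It also assumes the upper limit of 100 episodes with ceil(Tf/Ts) steps each that this example uses for training. The variable names are illustrative.

```matlab
sigma0    = 0.5;     % initial standard deviation of the exploration noise
decayRate = 1e-5;    % decay rate per agent step
sigmaMin  = 0;       % lower bound on the standard deviation
Tf = 10;             % total simulation time (s)
Ts = 0.1;            % sample time (s)
maxSteps = 100 * ceil(Tf/Ts);   % at most 100 episodes x 100 steps each

% Assumed update rule per step: sigma <- max(sigma*(1 - decayRate), sigmaMin)
sigmaEnd = max(sigma0 * (1 - decayRate)^maxSteps, sigmaMin);
fprintf('Standard deviation after %d steps: %.4f\n',maxSteps,sigmaEnd);
```

Under that assumption, the standard deviation is still about 0.45 after 10,000 steps, so the exploration noise stays close to its initial level for the whole run.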

Create the TD3 agent using the actor and critic representations. For more information on TD3 agents, see rlTD3Agent.

agent = rlTD3Agent(actor,[critic1,critic2],agentOpts);

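Train the Agent

To train the agent, first specify the training options using rlTrainingOptions. For this example, use the following options:

Run the training for at most 100 episodes, with each episode lasting at most ceil(Tf/Ts) time steps, where the total simulation time Tf is 10 s.
Stop the training when the agent receives an average cumulative reward greater than -5 over 20 consecutive episodes. The reward is a negated penalty, so it is never positive. At this point, the agent can track the reference height.

trainOpts = rlTrainingOptions(...
    'MaxEpisodes',100,...
    'MaxStepsPerEpisode',ceil(Tf/Ts),...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',-5,...
    'ScoreAveragingWindowLength',20);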

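Train the agent using the train function. Training this agent is a computationally intensive process that may take several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;
if doTraining
    trainingStats = train(agent,env,trainOpts);
else
    load('rlWatertankTD3Agent.mat')
end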

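The following figure shows a snapshot of the training progress. You can expect different results because of the randomness inherent in the training process.

Validate Closed-Loop Response

Simulate the model to view the closed-loop step response. The reinforcement learning agent is able to track the reference height while satisfying the step response constraints.

sim('rlWatertankStepInput');

Close the model.

close_system('rlWatertankStepInput')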