Reinforcement learning agent keeps using boundary values and does not learn

Hello,
I am using the Reinforcement Learning Toolbox to train an algorithm that controls a vehicle suspension system. For this I use a Simulink model as the environment and a TD3 agent. My setup looks very similar to the examples from the Reinforcement Learning Toolbox: I use 4 observations and 1 action, which is a desired link angle used to control the suspension. The code I am using is copied at the bottom of this question.
The problem I am facing is that the agent keeps using only the lower and upper saturation limits of the action. The scope trace below shows what I mean: the link angle (converted to degrees) should generally oscillate around 90 degrees (equilibrium). The algorithm first uses an angle close to 90 degrees, but then starts jumping between 20 degrees (the lower limit) and 160 degrees (the upper limit), even though the reward function is structured so that sitting at the boundary values costs far more than simply holding a static 90-degree angle (the reward heavily penalizes the square of the vertical displacement, which is quite large at boundary values such as 20 degrees).
This behavior, with the output constantly saturated, means the algorithm learns nothing over the entire run. No matter how I change the algorithm, it sooner or later saturates the action at the boundaries and makes very little progress. Any help would be greatly appreciated; I am not sure what is causing this, since the structure is very similar to the examples, which seem to work very well.
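For reference, the dominant term of my reward has roughly the following shape (a simplified sketch only: the weight k, the signal name zs, and the value used here are placeholders, not my exact implementation):

k = 100;           % hypothetical penalty weight
zs = 0.05;         % vertical displacement of the sprung mass [m], example value
reward = -k*zs^2;  % |zs| is large at the 20/160 degree bounds, so this penalty should dominate there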
Code:
% Run the model and parameter files
run('AVGS_Equivalent_Modelling')
run('parameters_series.m')
% Link to the Simulink model file
mdl = 'NEW_NN_v2';
open_system(mdl);
%% Define observations, actions, and the environment
Ts = 0.05;
Tf = 60;
numObs = 4;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = 'observations';
numAct = 1;
actInfo = rlNumericSpec([numAct 1],'LowerLimit',20*(pi/180),'UpperLimit',160*(pi/180));
actInfo.Name = 'desired link angle';
blk = [mdl,'/RL Agent'];
env = rlSimulinkEnv(mdl,blk,obsInfo,actInfo);
%% Generate the RL agent
agent = TD3create(numObs,obsInfo,numAct,actInfo,Ts);
maxEpisodes = 500;
maxSteps = floor(Tf/Ts);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',maxEpisodes,...
    'MaxStepsPerEpisode',maxSteps,...
    'ScoreAveragingWindowLength',50,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeCount',...
    'StopTrainingValue',maxEpisodes,...
    'SaveAgentCriteria','EpisodeCount',...
    'SaveAgentValue',maxEpisodes);
trainOpts.UseParallel = false;
trainOpts.ParallelizationOptions.Mode = 'async';
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 32;
trainOpts.ParallelizationOptions.DataToSendFromWorkers = 'Experiences';
trainingStats = train(agent,env,trainOpts);
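(For completeness: a minimal way to run the trained agent against the environment and look at the commanded link angle is the standard sim call; this snippet is a sketch and not part of my training script.)

% Simulate the trained agent to inspect the commanded link angle
simOpts = rlSimulationOptions('MaxSteps',maxSteps);
experience = sim(env,agent,simOpts);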
%% Helper functions
function agent = TD3create(numObs,obsInfo,numAct,actInfo,Ts)
% Walking Robot -- TD3 agent setup script
% Copyright 2020 The MathWorks, Inc.
%% Create the actor and critic networks using the createNetworks helper function
[criticNetwork1,criticNetwork2,actorNetwork] = createNetworks(numObs,numAct); % 2 critic networks are used
%% Specify options for the critic and actor representations using rlRepresentationOptions
criticOptions = rlRepresentationOptions('Optimizer','adam','LearnRate',1e-1,...
    'GradientThreshold',1,'L2RegularizationFactor',2e-4);
actorOptions = rlRepresentationOptions('Optimizer','adam','LearnRate',1e-1,...
    'GradientThreshold',1,'L2RegularizationFactor',1e-5);
%% Create the critic and actor representations using the specified networks
% and options
critic1 = rlQValueRepresentation(criticNetwork1,obsInfo,actInfo,'Observation',{'observation'},'Action',{'action'},criticOptions);
critic2 = rlQValueRepresentation(criticNetwork2,obsInfo,actInfo,'Observation',{'observation'},'Action',{'action'},criticOptions);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'observation'},'Action',{'ActorScaling'},actorOptions);
%% Specify the TD3 agent options
agentOptions = rlTD3AgentOptions;
agentOptions.SampleTime = Ts;
agentOptions.DiscountFactor = 0.99;
agentOptions.MiniBatchSize = 64;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.TargetSmoothFactor = 5e-3;
agentOptions.TargetPolicySmoothModel.Variance = 0.2; % target policy noise
agentOptions.TargetPolicySmoothModel.LowerLimit = -0.5;
agentOptions.TargetPolicySmoothModel.UpperLimit = 0.5;
agentOptions.ExplorationModel = rl.option.OrnsteinUhlenbeckActionNoise; % use OU noise for exploration (the rlTD3AgentOptions default is Gaussian)
agentOptions.ExplorationModel.MeanAttractionConstant = 1;
agentOptions.ExplorationModel.Variance = 0.1;
%% Create the agent using the specified actor representation, critic representations, and agent options
agent = rlTD3Agent(actor,[critic1,critic2],agentOptions);
end
function [criticNetwork1,criticNetwork2,actorNetwork] = createNetworks(numObs,numAct)
%% CRITICS
% Create the critic network layers
criticLayerSizes = [400 300];
%% First critic network
statePath1 = [
    featureInputLayer(numObs,'Normalization','none','Name','observation')
    fullyConnectedLayer(criticLayerSizes(1),'Name','CriticStateFC1',...
        'Weights',2/sqrt(numObs)*(rand(criticLayerSizes(1),numObs)-0.5),...
        'Bias',2/sqrt(numObs)*(rand(criticLayerSizes(1),1)-0.5))
    reluLayer('Name','CriticStateRelu1')
    fullyConnectedLayer(criticLayerSizes(2),'Name','CriticStateFC2',...
        'Weights',2/sqrt(criticLayerSizes(1))*(rand(criticLayerSizes(2),criticLayerSizes(1))-0.5),...
        'Bias',2/sqrt(criticLayerSizes(1))*(rand(criticLayerSizes(2),1)-0.5))
    ];
actionPath1 = [
    featureInputLayer(numAct,'Normalization','none','Name','action')
    fullyConnectedLayer(criticLayerSizes(2),'Name','CriticActionFC1',...
        'Weights',2/sqrt(numAct)*(rand(criticLayerSizes(2),numAct)-0.5),...
        'Bias',2/sqrt(numAct)*(rand(criticLayerSizes(2),1)-0.5))
    ];
commonPath1 = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu1')
    fullyConnectedLayer(1,'Name','CriticOutput',...
        'Weights',2*5e-3*(rand(1,criticLayerSizes(2))-0.5),...
        'Bias',2*5e-3*(rand(1,1)-0.5))
    ];
% Connect the layer graph
criticNetwork1 = layerGraph(statePath1);
criticNetwork1 = addLayers(criticNetwork1,actionPath1);
criticNetwork1 = addLayers(criticNetwork1,commonPath1);
criticNetwork1 = connectLayers(criticNetwork1,'CriticStateFC2','add/in1');
criticNetwork1 = connectLayers(criticNetwork1,'CriticActionFC1','add/in2');
%% Second critic network
statePath2 = [
    featureInputLayer(numObs,'Normalization','none','Name','observation')
    fullyConnectedLayer(criticLayerSizes(1),'Name','CriticStateFC1',...
        'Weights',2/sqrt(numObs)*(rand(criticLayerSizes(1),numObs)-0.5),...
        'Bias',2/sqrt(numObs)*(rand(criticLayerSizes(1),1)-0.5))
    reluLayer('Name','CriticStateRelu1')
    fullyConnectedLayer(criticLayerSizes(2),'Name','CriticStateFC2',...
        'Weights',2/sqrt(criticLayerSizes(1))*(rand(criticLayerSizes(2),criticLayerSizes(1))-0.5),...
        'Bias',2/sqrt(criticLayerSizes(1))*(rand(criticLayerSizes(2),1)-0.5))
    ];
actionPath2 = [
    featureInputLayer(numAct,'Normalization','none','Name','action')
    fullyConnectedLayer(criticLayerSizes(2),'Name','CriticActionFC1',...
        'Weights',2/sqrt(numAct)*(rand(criticLayerSizes(2),numAct)-0.5),...
        'Bias',2/sqrt(numAct)*(rand(criticLayerSizes(2),1)-0.5))
    ];
commonPath2 = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu1')
    fullyConnectedLayer(1,'Name','CriticOutput',...
        'Weights',2*5e-3*(rand(1,criticLayerSizes(2))-0.5),...
        'Bias',2*5e-3*(rand(1,1)-0.5))
    ];
% Connect the layer graph
criticNetwork2 = layerGraph(statePath2);
criticNetwork2 = addLayers(criticNetwork2,actionPath2);
criticNetwork2 = addLayers(criticNetwork2,commonPath2);
criticNetwork2 = connectLayers(criticNetwork2,'CriticStateFC2','add/in1');
criticNetwork2 = connectLayers(criticNetwork2,'CriticActionFC1','add/in2');
%% ACTOR
% Create the actor network layers
actorLayerSizes = [400 300];
actorNetwork = [
    featureInputLayer(numObs,'Normalization','none','Name','observation')
    fullyConnectedLayer(actorLayerSizes(1),'Name','ActorFC1',...
        'Weights',2/sqrt(numObs)*(rand(actorLayerSizes(1),numObs)-0.5),...
        'Bias',2/sqrt(numObs)*(rand(actorLayerSizes(1),1)-0.5))
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(actorLayerSizes(2),'Name','ActorFC2',...
        'Weights',2/sqrt(actorLayerSizes(1))*(rand(actorLayerSizes(2),actorLayerSizes(1))-0.5),...
        'Bias',2/sqrt(actorLayerSizes(1))*(rand(actorLayerSizes(2),1)-0.5))
    reluLayer('Name','ActorRelu2')
    fullyConnectedLayer(numAct,'Name','ActorFC3',...
        'Weights',2*5e-3*(rand(numAct,actorLayerSizes(2))-0.5),...
        'Bias',2*5e-3*(rand(numAct,1)-0.5))
    tanhLayer('Name','ActorTanh1')
    scalingLayer('Name','ActorScaling','Scale',(7*pi)/18,'Bias',pi/2) % maps tanh output [-1,1] onto [20,160] degrees (in radians)
    ];
end
1 Comment
Mirjan Heubaum on 19 Nov 2021
Did you ever find a solution to this? Could the tanh or scaling layer be causing such a problem?
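For context, the tanhLayer/scalingLayer pair in the posted actor only maps the pre-activation onto [20°, 160°]; once the ActorFC3 output grows large in magnitude, tanh saturates at ±1 and the commanded angle pins at a bound. A quick numeric check, assuming the Scale and Bias values from the posted code:

u = [-10 -1 0 1 10];                 % hypothetical ActorFC3 outputs
angleRad = (7*pi)/18*tanh(u) + pi/2; % Scale = (7*pi)/18, Bias = pi/2
disp(rad2deg(angleRad))              % ~20.0  36.7  90.0  143.3  160.0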


Answers (0)
