Reinforcement learning: sudden, very high reward during training of an RL model

Sir, during training I am getting a sudden high reward on the order of 10e16 (as shown in the attached image), and I am unable to figure out what is causing this. This is the code I am using, and I have also attached the Simulink model.
Tf = 10;
Ts = 0.1;
mdl = 'rl_exam2';
obsInfo = rlNumericSpec([3 1]);
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, and response';
numObservations = obsInfo.Dimension(1);
actInfo = rlNumericSpec([1 1],'LowerLimit',0,'UpperLimit',1);
actInfo.Name = 'control input';
numActions = actInfo.Dimension(1);
%% Create the environment
env = rlSimulinkEnv(mdl,[mdl '/RL Agent'],obsInfo,actInfo);
%%
rng(0)
%%
%% Create the critic network
statePath = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','State')
    fullyConnectedLayer(50,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(40,'Name','CriticStateFC2')];
actionPath = [
    imageInputLayer([numActions 1 1],'Normalization','none','Name','Action')
    fullyConnectedLayer(40,'Name','CriticActionFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','CriticOutput')];
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
criticOpts = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'State'},'Action',{'Action'},criticOpts);
actorNetwork = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','State')
    fullyConnectedLayer(40,'Name','actorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(numActions,'Name','actorFC2')
    tanhLayer('Name','actorTanh')
    scalingLayer('Name','Action','Scale',0.5,'Bias',0.5)
    ];
actorOptions = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'State'},'Action',{'Action'},actorOptions);
%% Create the agent
agentOpts = rlDDPGAgentOptions(...
    'SampleTime',0.1,...
    'TargetSmoothFactor',1e-3,...
    'DiscountFactor',1,...
    'MiniBatchSize',64,...
    'ExperienceBufferLength',1e6);
agentOpts.NoiseOptions.Variance = 0.08;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;
agent = rlDDPGAgent(actor,critic,agentOpts);
%% Training options
maxepisodes = 3000;
maxsteps = ceil(Tf/Ts);
trainingOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'ScoreAveragingWindowLength',20,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeCount',...
    'StopTrainingValue',1500);
%% Train
doTraining = true;
if doTraining
    trainingStats = train(agent,env,trainingOpts);
    % save('agent_new.mat','agent_ready')  %%% save the agent
else
    % Load a pretrained agent for the example.
    load('agent_old.mat','agent')
end
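As a small diagnostic sketch (not part of the script above), the per-episode rewards returned by train can be inspected to locate where the spike happens; the 1e6 cutoff below is an arbitrary assumption.

% Plot the per-episode reward and list the episodes whose magnitude is unusually large
figure
plot(trainingStats.EpisodeIndex,trainingStats.EpisodeReward)
xlabel('Episode'); ylabel('Episode reward')
badEpisodes = trainingStats.EpisodeIndex(abs(trainingStats.EpisodeReward) > 1e6)  % assumed threshold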

Accepted Answer

Emmanouil Tzorakoleftherakis on 25 May 2023
You should first check the error signal you are feeding into the reward for those episodes. It could be that the error is too large/the system becomes unstable, leading to those large negative values.
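As an illustration of that check, here is a minimal sketch of a reward computation that saturates the error before it enters the reward; the signal name e, the limit eMax, and the quadratic penalty are assumptions rather than details of the posted model.

% Sketch: clip the tracking error so an unstable response cannot drive the reward to ~1e16
function r = boundedReward(e)
    eMax = 10;                        % assumed saturation limit on the error
    eSat = max(min(e,eMax),-eMax);    % saturate the error to the range [-eMax, eMax]
    r = -eSat^2;                      % quadratic penalty, bounded below by -eMax^2
end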
3 Comments
Sourabh on 28 May 2023
I have tried a few reward functions, but with most of them my response settles at 0.3 and I am not sure why.
Please have a look.

