Reinforcement learning: sudden, very high reward during training of an RL model

Sir, during training I am getting a sudden high reward on the order of 10e16 (as shown in the attached image), and I am unable to figure out what is causing this. This is the code I am using, and I have also attached the Simulink model.
Tf = 10;
Ts = 0.1;
mdl = 'rl_exam2';
obsInfo = rlNumericSpec([3 1]);
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, and response';
numObservations = obsInfo.Dimension(1);
actInfo = rlNumericSpec([1 1],'LowerLimit',0,'UpperLimit',1);
actInfo.Name = 'control input';
numActions = actInfo.Dimension(1);
%% Create the environment
env = rlSimulinkEnv(mdl,[mdl '/RL Agent'],obsInfo,actInfo);
%%
rng(0)
%%
%% Create the critic network
statePath = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','State')
    fullyConnectedLayer(50,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(40,'Name','CriticStateFC2')];
actionPath = [
    imageInputLayer([numActions 1 1],'Normalization','none','Name','Action')
    fullyConnectedLayer(40,'Name','CriticActionFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','CriticOutput')];
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
criticOpts = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'State'},'Action',{'Action'},criticOpts);
actorNetwork = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','State')
    fullyConnectedLayer(40,'Name','actorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(numActions,'Name','actorFC2')
    tanhLayer('Name','actorTanh')
    scalingLayer('Name','Action','Scale',0.5,'Bias',0.5)
    ];
actorOptions = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'State'},'Action',{'Action'},actorOptions);
%% Create the agent
agentOpts = rlDDPGAgentOptions(...
    'SampleTime',0.1,...
    'TargetSmoothFactor',1e-3,...
    'DiscountFactor',1,...
    'MiniBatchSize',64,...
    'ExperienceBufferLength',1e6);
agentOpts.NoiseOptions.Variance = 0.08;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;
agent = rlDDPGAgent(actor,critic,agentOpts);
%% Training options
maxepisodes = 3000;
maxsteps = ceil(Tf/Ts);
trainingOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'ScoreAveragingWindowLength',20,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeCount',...
    'StopTrainingValue',1500);
%% Train
doTraining = true;
if doTraining
    trainingStats = train(agent,env,trainingOpts);
    % save('agent_new.mat','agent_ready')  %%% save the agent
else
    % Load a pretrained agent for the example.
    load('agent_old.mat','agent')
end
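As a small diagnostic sketch (not part of the script above), the per-episode rewards returned by train can be inspected to locate where the spike happens; the 1e6 cutoff below is an arbitrary assumption.

% Plot the per-episode reward and list the episodes whose magnitude is unusually large
figure
plot(trainingStats.EpisodeIndex,trainingStats.EpisodeReward)
xlabel('Episode'); ylabel('Episode reward')
badEpisodes = trainingStats.EpisodeIndex(abs(trainingStats.EpisodeReward) > 1e6)  % assumed threshold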

Accepted Answer

Emmanouil Tzorakoleftherakis on 25 May 2023
You should first check the error signal you are feeding into the reward for those episodes. It could be that the error is too large/the system becomes unstable, leading to those large negative values.
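As an illustration of that check, here is a minimal sketch of a reward computation that saturates the error before it enters the reward; the signal name e, the limit eMax, and the quadratic penalty are assumptions rather than details of the posted model.

% Sketch: clip the tracking error so an unstable response cannot drive the reward to ~1e16
function r = boundedReward(e)
    eMax = 10;                        % assumed saturation limit on the error
    eSat = max(min(e,eMax),-eMax);    % saturate the error to the range [-eMax, eMax]
    r = -eSat^2;                      % quadratic penalty, bounded below by -eMax^2
end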
3 Comments
Sourabh on 28 May 2023
I have tried a few reward functions, but with most of them my response settles at 0.3 and I am not sure why.
Please have a look.

