
Train DDPG Agent for Adaptive Cruise Control

This example shows how to train a deep deterministic policy gradient (DDPG) agent for adaptive cruise control (ACC) in Simulink®. For more information on DDPG agents, see Deep Deterministic Policy Gradient (DDPG) Agents.

Simulink Model

The reinforcement learning environment for this example is the simple longitudinal dynamics of an ego car and a lead car. The training goal is to make the ego car travel at a set velocity while maintaining a safe distance from the lead car by controlling longitudinal acceleration and braking. This example uses the same vehicle model as the Adaptive Cruise Control System Using Model Predictive Control (Model Predictive Control Toolbox) example.

Specify the initial position and velocity for the two vehicles.

x0_lead = 50;   % initial position for lead car (m)
v0_lead = 25;   % initial velocity for lead car (m/s)
x0_ego = 10;    % initial position for ego car (m)
v0_ego = 20;    % initial velocity for ego car (m/s)

Specify the standstill default spacing (m), time gap (s), and driver-set velocity (m/s).

D_default = 10;
t_gap = 1.4;
v_set = 30;

To simulate the physical limitations of the vehicle dynamics, constrain the acceleration to the range [-3,2] m/s^2.

amin_ego = -3;
amax_ego = 2;

Define the sample time Ts and simulation duration Tf in seconds.

Ts = 0.1;
Tf = 60;

Open the model.

mdl = 'rlACCMdl';
open_system(mdl)
agentblk = [mdl '/RL Agent'];

For this model:

  • The acceleration action signal from the agent to the environment is from –3 to 2 m/s^2.

  • The reference velocity for the ego car, V_ref, is defined as follows. If the relative distance is less than the safe distance, the ego car tracks the minimum of the lead car velocity and the driver-set velocity. In this manner, the ego car maintains some distance from the lead car. If the relative distance is greater than the safe distance, the ego car tracks the driver-set velocity. In this example, the safe distance is defined as a linear function of the ego car longitudinal velocity V; that is, t_gap*V + D_default. The safe distance determines the reference tracking velocity for the ego car.

  • The observations from the environment are the velocity error e = V_ref - V_ego, its integral ∫e, and the ego car longitudinal velocity V.

  • The simulation terminates when the longitudinal velocity of the ego car is less than 0, or when the relative distance between the lead car and the ego car becomes less than 0.

  • The reward r_t, provided at every time step t, is

r_t = -(0.1*e_t^2 + u_(t-1)^2) + M_t

where u_(t-1) is the control input from the previous time step. The logical value M_t = 1 if the squared velocity error satisfies e_t^2 <= 0.25; otherwise, M_t = 0. A minimal MATLAB sketch of the reference velocity and reward logic appears after this list.
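The reference velocity and reward are computed inside the Simulink model. The following is a minimal MATLAB sketch of that logic for a single time step, assuming the relative distance, lead car velocity, ego velocity, and previous control input are available as scalars; the function name and signature are illustrative and are not part of the model.

function [v_ref,r] = refVelocityAndReward(d_rel,v_lead,v_ego,u_prev,t_gap,D_default,v_set)
% Illustrative sketch of the reference velocity and reward described above (not code from the model).
d_safe = t_gap*v_ego + D_default;  % safe distance is a linear function of the ego velocity
if d_rel < d_safe
    v_ref = min(v_lead,v_set);     % track the slower of the lead velocity and set velocity
else
    v_ref = v_set;                 % otherwise track the driver-set velocity
end
e = v_ref - v_ego;                 % velocity error
M = double(e^2 <= 0.25);           % bonus when the squared velocity error is small
r = -(0.1*e^2 + u_prev^2) + M;     % penalize velocity error and control effort
end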

Create Environment Interface

Create a reinforcement learning environment interface for the model.

Create the observation specification.

observationInfo = rlNumericSpec([3 1],'LowerLimit',-inf*ones(3,1),'UpperLimit',inf*ones(3,1));
observationInfo.Name = 'observations';
observationInfo.Description = 'information on velocity error and ego velocity';

Create the action specification.

actionInfo = rlNumericSpec([1 1],'LowerLimit',-3,'UpperLimit',2);
actionInfo.Name = 'acceleration';

Create the environment interface.

env = rlSimulinkEnv(mdl,agentblk,observationInfo,actionInfo);

To define the initial condition for the position of the lead car, specify an environment reset function using an anonymous function handle. The reset function localResetFcn, which is defined at the end of this example, randomizes the initial position of the lead car.

env.ResetFcn = @(in)localResetFcn(in);

Fix the random generator seed for reproducibility.

rng('default')

Create DDPG Agent

A DDPG agent approximates the long-term reward given observations and actions using a critic value function representation. To create the critic, first create a deep neural network with two inputs, the state and action, and one output. For more information on creating a neural network value function representation, see Create Policy and Value Function Representations.

L = 48; % number of neurons

statePath = [
    featureInputLayer(3,'Normalization','none','Name','observation')
    fullyConnectedLayer(L,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(L,'Name','fc2')
    additionLayer(2,'Name','add')
    reluLayer('Name','relu2')
    fullyConnectedLayer(L,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(1,'Name','fc4')];

actionPath = [
    featureInputLayer(1,'Normalization','none','Name','action')
    fullyConnectedLayer(L,'Name','fc5')];

criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = connectLayers(criticNetwork,'fc5','add/in2');

View the critic network configuration.

plot(criticNetwork)

Specify options for the critic representation using rlRepresentationOptions.

criticOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1,'L2RegularizationFactor',1e-4);

Create the critic representation using the specified neural network and options. You must also specify the action and observation info for the critic, which you obtain from the environment interface. For more information, see rlQValueRepresentation.

critic = rlQValueRepresentation(criticNetwork,observationInfo,actionInfo,...
    'Observation',{'observation'},'Action',{'action'},criticOptions);

A DDPG agent decides which action to take given observations by using an actor representation. To create the actor, first create a deep neural network with one input, the observation, and one output, the action.

Construct the actor similarly to the critic. For more information, see rlDeterministicActorRepresentation.

actorNetwork = [
    featureInputLayer(3,'Normalization','none','Name','observation')
    fullyConnectedLayer(L,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(L,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(L,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(1,'Name','fc4')
    tanhLayer('Name','tanh1')
    scalingLayer('Name','ActorScaling1','Scale',2.5,'Bias',-0.5)];

actorOptions = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1,'L2RegularizationFactor',1e-4);

actor = rlDeterministicActorRepresentation(actorNetwork,observationInfo,actionInfo,...
    'Observation',{'observation'},'Action',{'ActorScaling1'},actorOptions);
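The final tanh and scaling layers map the actor output to the acceleration range. As a quick check of the bounds, assuming the scaling layer computes Scale.*x + Bias on the tanh output in [-1,1]:

% Quick check of the actor output range (assumes the scaling layer computes Scale.*x + Bias).
scale = 2.5; bias = -0.5;
aMin = scale*(-1) + bias   % -3 m/s^2, the minimum acceleration
aMax = scale*( 1) + bias   %  2 m/s^2, the maximum acceleration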

To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.

agentOptions = rlDDPGAgentOptions(...
    'SampleTime',Ts,...
    'TargetSmoothFactor',1e-3,...
    'ExperienceBufferLength',1e6,...
    'DiscountFactor',0.99,...
    'MiniBatchSize',64);
agentOptions.NoiseOptions.Variance = 0.6;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;

Then, create the DDPG agent using the specified actor representation, critic representation, and agent options. For more information, see rlDDPGAgent.

agent = rlDDPGAgent(actor,critic,agentOptions);

Train Agent

To train the agent, first specify the training options. For this example, use the following options:

  • Run the training for at most 5000 episodes, with each episode lasting at most 600 time steps.

  • Display the training progress in the Episode Manager dialog box.

  • Stop training when the agent receives an episode reward greater than 260.

For more information, see rlTrainingOptions.

maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainingOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeReward',...
    'StopTrainingValue',260);

Train the agent using the train function. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;

if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainingOpts);
else
    % Load a pretrained agent for the example.
    load('SimulinkACCDDPG.mat','agent')
end

Simulate DDPG Agent

To validate the performance of the trained agent, simulate the agent within the Simulink environment by uncommenting the following commands. For more information on agent simulation, see rlSimulationOptions and sim.

% simOptions = rlSimulationOptions('MaxSteps',maxsteps);
% experience = sim(env,agent,simOptions);

To demonstrate the trained agent using deterministic initial conditions, simulate the model in Simulink.

x0_lead = 80;
sim(mdl)

The following plots show the simulation results when the lead car starts 70 m ahead of the ego car.

  • In the first 28 seconds, the relative distance is greater than the safe distance (bottom plot), so the ego car tracks the set velocity (middle plot). To speed up and reach the set velocity, the acceleration is positive (top plot).

  • From 28 to 60 seconds, the relative distance is less than the safe distance (bottom plot), so the ego car tracks the minimum of the lead velocity and the set velocity. From 28 to 36 seconds, the lead velocity is less than the set velocity (middle plot). To slow down and track the lead car velocity, the acceleration is negative (top plot). From 36 to 60 seconds, the ego car adjusts its acceleration to track the reference velocity (middle plot). Within this time interval, the ego car tracks the set velocity from 43 to 52 seconds and tracks the lead velocity from 36 to 43 seconds and from 52 to 60 seconds.

Close the Simulink model.

bdclose(mdl)

Reset Function

function in = localResetFcn(in)
% Reset the initial position of the lead car.
in = setVariable(in,'x0_lead',40+randi(60,1,1));
end

See Also

Related Topics