
Avoid Obstacles Using Reinforcement Learning for Mobile Robots

This example uses Deep Deterministic Policy Gradient (DDPG) based reinforcement learning to develop a strategy for a mobile robot to avoid obstacles. For a brief summary of the DDPG algorithm, see Deep Deterministic Policy Gradient Agents (Reinforcement Learning Toolbox).

This example trains a mobile robot to avoid obstacles given range sensor readings that detect obstacles in the map. The objective of the reinforcement learning algorithm is to learn what controls (linear and angular velocity) the robot should use to avoid colliding with obstacles. This example uses an occupancy map of a known environment to generate range sensor readings, detect obstacles, and check for collisions. The range sensor readings are the observations for the DDPG agent, and the linear and angular velocity controls are the action.

Load Map

Load a map matrix, simpleMap, that represents the environment for the robot.

load exampleMaps simpleMap
load exampleHelperOfficeAreaMap office_area_map
mapMatrix = simpleMap;
mapScale = 1;

Range Sensor Parameters

Next, set up a rangeSensor object, which simulates a noisy range sensor. The range sensor readings are considered observations by the agent. Define the angular positions of the range readings, the maximum range, and the noise parameters.

scanAngles = -3*pi/8:pi/8:3*pi/8;
maxRange = 12;
lidarNoiseVariance = 0.1^2;
lidarNoiseSeeds = randi(intmax,size(scanAngles));
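The sensor itself is constructed inside the example helper model. As a hedged sketch only, a rangeSensor object with these parameters might be created and queried as follows; the property names and two-element angle-limit form are assumptions about the Navigation Toolbox rangeSensor System object, and the pose is an arbitrary example.

```matlab
% Sketch (assumed API): build a range sensor matching the parameters above.
rangefinder = rangeSensor("HorizontalAngle",[scanAngles(1) scanAngles(end)], ...
    "HorizontalAngleResolution",pi/8, ...      % 7 beams, matching numel(scanAngles)
    "Range",[0 maxRange], ...
    "RangeNoise",sqrt(lidarNoiseVariance));    % noise given as a standard deviation

% Query ranges for an example pose [x y theta] against the occupancy map.
[ranges, angles] = rangefinder([17 15 pi/2], binaryOccupancyMap(mapMatrix));
```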

Robot Parameters

The action of the agent is a two-dimensional vector a = [v, ω], where v and ω are the linear and angular velocities of the robot. The DDPG agent uses normalized inputs for both the angular and linear velocities, meaning each element of the action is a scalar between -1 and 1, which is multiplied by the maxLinSpeed or maxAngSpeed parameter to get the actual control. Specify the maximum linear and angular velocities.

Also, specify the initial position of the robot as [x y theta].

% Max speed parameters
maxLinSpeed = 0.3;
maxAngSpeed = 0.3;
% Initial pose of the robot
initX = 17;
initY = 15;
initTheta = pi/2;
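To make the normalization concrete, here is a minimal sketch of how a normalized action would be scaled back to physical velocity commands; the action values are arbitrary examples, not output from the trained agent.

```matlab
% Sketch: scale a normalized action a in [-1, 1] to actual controls.
maxLinSpeed = 0.3;
maxAngSpeed = 0.3;
a = [0.5; -0.2];          % example normalized action [v; w]
v = maxLinSpeed*a(1);     % linear velocity command  -> 0.15 m/s
w = maxAngSpeed*a(2);     % angular velocity command -> -0.06 rad/s
```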

Display Map and Robot Position

To visualize the actions of the robot, create a figure. Start by showing the occupancy map and plot the initial position of the robot.

fig = figure("Name","simpleMap");
set(fig,"Visible","on");
ax = axes(fig);
show(binaryOccupancyMap(mapMatrix),"Parent",ax);
hold on
plotTransforms([initX, initY, 0], eul2quat([initTheta, 0, 0]),"MeshFilePath","groundvehicle.stl","View","2D");
light;
hold off

Figure simpleMap contains an axes. The axes with title Binary Occupancy Grid contains 5 objects of type patch, line, image.

Environment Interface

Create an environment model that takes the action and returns the observation and reward signals. Specify the provided example model name, exampleHelperAvoidObstaclesMobileRobot, the simulation time parameters, and the agent block name.

mdl = "exampleHelperAvoidObstaclesMobileRobot";
Tfinal = 100;
sampleTime = 0.1;
agentBlk = mdl + "/Agent";

Open the model.

open_system(mdl)

The model contains the Environment and Agent blocks. The Agent block is not defined yet.

Inside the Environment subsystem block, you should see a model for simulating the robot and sensor data. The subsystem takes in the action, generates the observation signal based on the range sensor readings, and calculates the reward based on the distance from obstacles and the effort of the action commands.

open_system(mdl + "/Environment")

Define the observation parameters, obsInfo, using the rlNumericSpec object, giving the lower and upper limits for the range readings with enough elements for each angular position in the range sensor.

obsInfo = rlNumericSpec([numel(scanAngles) 1],...
    "LowerLimit",zeros(numel(scanAngles),1),...
    "UpperLimit",ones(numel(scanAngles),1)*maxRange);
numObservations = obsInfo.Dimension(1);

Define the action parameters, actInfo. The action is the control command vector, a = [v, ω], normalized to [-1, 1].

numActions = 2;
actInfo = rlNumericSpec([numActions 1],...
    "LowerLimit",-1,...
    "UpperLimit",1);

Build the environment interface object using rlSimulinkEnv (Reinforcement Learning Toolbox). Specify the model, agent block name, observation parameters, and action parameters. Set the reset function for the simulation using exampleHelperRLAvoidObstaclesResetFcn. This function restarts the simulation by placing the robot in a new random location to begin avoiding obstacles.

env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
env.ResetFcn = @(in)exampleHelperRLAvoidObstaclesResetFcn(in,scanAngles,maxRange,mapMatrix);
env.UseFastRestart = "off";

For another example that sets up a Simulink® environment for training, see Create Simulink Environment and Train Agent (Reinforcement Learning Toolbox).

DDPG Agent

A DDPG agent approximates the long-term reward given observations and actions using a critic value function representation. To create the critic, first create a deep neural network with two inputs, the observation and action, and one output. For more information on creating a deep neural network value function representation, see Create Policy and Value Function Representations (Reinforcement Learning Toolbox).

statePath = [
    featureInputLayer(numObservations,"Normalization","none","Name","State")
    fullyConnectedLayer(50,"Name","CriticStateFC1")
    reluLayer("Name","CriticRelu1")
    fullyConnectedLayer(25,"Name","CriticStateFC2")];
actionPath = [
    featureInputLayer(numActions,"Normalization","none","Name","Action")
    fullyConnectedLayer(25,"Name","CriticActionFC1")];
commonPath = [
    additionLayer(2,"Name","add")
    reluLayer("Name","CriticCommonRelu")
    fullyConnectedLayer(1,"Name","CriticOutput")];
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,"CriticStateFC2","add/in1");
criticNetwork = connectLayers(criticNetwork,"CriticActionFC1","add/in2");

Next, specify options for the critic representation using rlRepresentationOptions (Reinforcement Learning Toolbox).

Finally, create the critic representation using the specified deep neural network and options. You must also specify the action and observation specifications for the critic, which you obtain from the environment interface. For more information, see rlQValueRepresentation (Reinforcement Learning Toolbox).

criticOpts = rlRepresentationOptions("LearnRate",1e-3,"L2RegularizationFactor",1e-4,"GradientThreshold",1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,"Observation",{'State'},"Action",{'Action'},criticOpts);

A DDPG agent decides which action to take given observations using an actor representation. To create the actor, first create a deep neural network with one input, the observation, and one output, the action.

Finally, construct the actor in a similar manner as the critic. For more information, see rlDeterministicActorRepresentation (Reinforcement Learning Toolbox).

actorNetwork = [
    featureInputLayer(numObservations,"Normalization","none","Name","State")
    fullyConnectedLayer(50,"Name","actorFC1")
    reluLayer("Name","actorReLU1")
    fullyConnectedLayer(50,"Name","actorFC2")
    reluLayer("Name","actorReLU2")
    fullyConnectedLayer(2,"Name","actorFC3")
    tanhLayer("Name","Action")];
actorOptions = rlRepresentationOptions("LearnRate",1e-4,"L2RegularizationFactor",1e-4,"GradientThreshold",1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,"Observation",{'State'},"Action",{'Action'},actorOptions);

Create DDPG Agent Object

Specify the agent options.

agentOpts = rlDDPGAgentOptions(...
    "SampleTime",sampleTime,...
    "TargetSmoothFactor",1e-3,...
    "DiscountFactor",0.995,...
    "MiniBatchSize",128,...
    "ExperienceBufferLength",1e6);
agentOpts.NoiseOptions.Variance = 0.1;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;

Create the rlDDPGAgent object. The obstacleAvoidanceAgent variable is used in the model for the Agent block.

obstacleAvoidanceAgent = rlDDPGAgent(actor,critic,agentOpts);
open_system(mdl + "/Agent")

Reward

The reward function for the agent is modeled in the Environment subsystem.

The agent is rewarded for avoiding the nearest obstacle, which minimizes the worst-case scenario. Additionally, the agent receives a positive reward for higher linear speeds and a negative reward for higher angular speeds. This strategy discourages the agent from simply driving in circles. Tuning the rewards is key to properly training an agent, so the appropriate rewards vary depending on your application.
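As a minimal sketch of a reward with this structure (not the exact expression used in the Environment subsystem; the weights and variable values are illustrative assumptions):

```matlab
% Hypothetical reward sketch -- the exact form and weights in the
% Environment subsystem may differ; values here are for illustration.
ranges = [4.2 3.1 5.0];         % example range readings (m)
v = 0.25;                       % example linear velocity command
w = 0.05;                       % example angular velocity command
minRange = min(ranges);         % distance to the nearest obstacle
reward = minRange + v - abs(w); % near obstacles and turning lower the reward
```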

Train Agent

To train the agent, first specify the training options. For this example, use the following options:

  • Train for at most 10000 episodes, with each episode lasting at most maxSteps time steps.

  • Display the training progress in the Episode Manager dialog box (set the Plots option) and enable the command-line display (set the Verbose option to true).

  • Stop training when the agent receives an average cumulative reward greater than 400 over 50 consecutive episodes.

For more information, see rlTrainingOptions (Reinforcement Learning Toolbox).

maxEpisodes = 10000;
maxSteps = ceil(Tfinal/sampleTime);
trainOpts = rlTrainingOptions(...
    "MaxEpisodes",maxEpisodes,...
    "MaxStepsPerEpisode",maxSteps,...
    "ScoreAveragingWindowLength",50,...
    "StopTrainingCriteria","AverageReward",...
    "StopTrainingValue",400,...
    "Verbose",true,...
    "Plots","training-progress");

Train the agent using the train (Reinforcement Learning Toolbox) function. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false; % Toggle this to true for training.
if doTraining
    % Train the agent.
    trainingStats = train(obstacleAvoidanceAgent,env,trainOpts);
else
    % Load pretrained agent for the example.
    load exampleHelperAvoidObstaclesAgent obstacleAvoidanceAgent
end

The Reinforcement Learning Episode Manager can be used to track episode-wise training progress. As the episode number increases, you should see an increase in the reward value.

Simulate

Use the trained agent to simulate the robot driving in the map and avoiding obstacles.

out = sim("exampleHelperAvoidObstaclesMobileRobot.slx");

Visualize

To visualize the simulation of the robot driving around the environment with range sensor readings, use the helper, exampleHelperAvoidObstaclesPosePlot.

for i = 1:5:size(out.range, 3)
    u = out.pose(i, :);
    r = out.range(:, :, i);
    exampleHelperAvoidObstaclesPosePlot(u, mapMatrix, mapScale, r, scanAngles, ax);
end

Figure simpleMap contains an axes. The axes with title Binary Occupancy Grid contains 6 objects of type patch, line, image.

Extensibility

You can now use this agent to simulate driving in a different map. Another map, generated from lidar scans of an office environment, is used with the same trained agent. This map represents a more realistic scenario in which to apply the trained agent.

Change the Map

mapMatrix = office_area_map.occupancyMatrix > 0.5;
mapScale = 10;
initX = 20;
initY = 30;
initTheta = 0;
fig = figure("Name","office_area_map");
set(fig,"Visible","on");
ax = axes(fig);
show(binaryOccupancyMap(mapMatrix, mapScale),"Parent",ax);
hold on
plotTransforms([initX, initY, 0], eul2quat([initTheta, 0, 0]),"MeshFilePath","groundvehicle.stl","View","2D");
light;
hold off

Figure office_area_map contains an axes. The axes with title Binary Occupancy Grid contains 5 objects of type patch, line, image.

Simulate

out = sim("exampleHelperAvoidObstaclesMobileRobot.slx");

Visualize

for i = 1:5:size(out.range, 3)
    u = out.pose(i, :);
    r = out.range(:, :, i);
    exampleHelperAvoidObstaclesPosePlot(u, mapMatrix, mapScale, r, scanAngles, ax);
end

Figure office_area_map contains an axes. The axes with title Binary Occupancy Grid contains 6 objects of type patch, line, image.