
What Is Reinforcement Learning?

Reinforcement learning is a goal-directed computational approach where a computer learns to perform a task by interacting with an unknown dynamic environment. This learning approach enables a computer to make a series of decisions to maximize the cumulative reward for the task without human intervention and without being explicitly programmed to achieve the task. The following diagram shows a general representation of a reinforcement learning scenario.

Diagram showing an agent that interacts with its environment. The observation signal goes from the environment to the agent, and the action signal goes from the agent to the environment. The reward signal goes from the environment to the reinforcement learning algorithm inside the agent. The reinforcement learning algorithm uses the available information to update a policy. The agent uses a policy to map an observation to an action. This is similar to a control diagram, shown below, in which a controller senses an error between a desired reference and a plant output and uses the error to act on a plant input.

The goal of reinforcement learning is to train an agent to complete a task within an unknown environment. The agent receives observations and a reward from the environment and sends actions to the environment. The reward is a measure of how successful an action is with respect to completing the task goal.

The agent contains two components: a policy and a learning algorithm.

  • The policy is a mapping that selects actions based on the observations from the environment. Typically, the policy is a function approximator with tunable parameters, such as a deep neural network.

  • The learning algorithm continuously updates the policy parameters based on the actions, observations, and reward. The goal of the learning algorithm is to find an optimal policy that maximizes the cumulative reward received during the task.

In other words, reinforcement learning involves an agent learning the optimal behavior through repeated trial-and-error interactions with the environment without human involvement.
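
To make these pieces concrete, here is a minimal self-contained sketch in plain MATLAB (no toolbox required) for a hypothetical five-cell corridor task. The Q-table stands in for the policy's tunable parameters, and the update rule inside the loop stands in for the learning algorithm; every name and value is illustrative, not part of the original text.

    % Trial-and-error learning on a 5-cell corridor: the agent starts at
    % cell 1 and receives reward 1 only when it reaches cell 5.
    numStates  = 5;                     % cells 1..5; cell 5 ends an episode
    numActions = 2;                     % 1 = move left, 2 = move right
    Q = zeros(numStates,numActions);    % action-value estimates (the "policy")
    alpha = 0.1; gamma = 0.9; epsilon = 0.1;

    for episode = 1:500
        s = 1;                                   % start at the leftmost cell
        while s < numStates
            if rand < epsilon                    % explore occasionally
                a = randi(numActions);
            else                                 % otherwise act greedily
                [~,a] = max(Q(s,:));
            end
            sNext = max(1, min(numStates, s + (2*a - 3)));  % take a left/right step
            r = double(sNext == numStates);      % reward 1 only at the goal
            % Learning algorithm: nudge the estimate toward the reward plus
            % the discounted value of the best next action (Q-learning).
            Q(s,a) = Q(s,a) + alpha*(r + gamma*max(Q(sNext,:)) - Q(s,a));
            s = sNext;
        end
    end
    [~,greedy] = max(Q,[],2);          % greedy action in each cell
    disp(greedy(1:end-1).')            % typically prints 2 2 2 2 (always move right)

After enough episodes, the greedy policy derived from the Q-table moves right in every nonterminal cell, which is the optimal behavior for this toy task.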

As an example, consider the task of parking a vehicle using an automated driving system. The goal of this task is for the vehicle computer (agent) to park the vehicle in the correct position and orientation. To do so, the controller uses readings from cameras, accelerometers, gyroscopes, a GPS receiver, and lidar (observations) to generate steering, braking, and acceleration commands (actions). The action commands are sent to the actuators that control the vehicle. The resulting observations depend on the actuators, sensors, vehicle dynamics, road surface, wind, and many other less-important factors. All these factors, that is, everything that is not the agent, make up the environment in reinforcement learning.

To learn how to generate the correct actions from the observations, the computer repeatedly tries to park the vehicle using a trial-and-error process. To guide the learning process, you provide a signal that is one when the car successfully reaches the desired position and orientation and zero otherwise (reward). During each trial, the computer selects actions using a mapping (policy) initialized with some default values. After each trial, the computer updates the mapping to maximize the reward (learning algorithm). This process continues until the computer learns an optimal mapping that successfully parks the car.
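
As a rough illustration only, the sparse reward described above might be computed by a function like the following. The pose format, tolerance values, and function name are assumptions made for this sketch, not part of the automated driving example itself.

    % Hypothetical sparse reward for the parking task: 1 when the vehicle is
    % close enough to the target pose, 0 otherwise. Poses are [x y theta] in
    % meters and radians; the tolerances are arbitrary.
    function r = parkingReward(pose, targetPose)
        posErr = norm(pose(1:2) - targetPose(1:2));            % position error
        angErr = abs(atan2(sin(pose(3) - targetPose(3)), ...   % wrapped heading error
                           cos(pose(3) - targetPose(3))));
        r = double(posErr < 0.1 && angErr < deg2rad(5));       % 1 only near the goal pose
    end

For example, parkingReward([0.05 -0.02 pi/2+0.01],[0 0 pi/2]) returns 1, while any pose outside the tolerances returns 0. In practice, a shaped reward (for example, penalizing the remaining distance at every step) often speeds up learning; see Define Reward Signals.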

Reinforcement Learning Workflow

The general workflow for training an agent using reinforcement learning includes the following steps. An end-to-end MATLAB sketch after the list illustrates steps 2 through 7.

Figure showing the seven stages of a typical reinforcement learning workflow.

  1. Formulate problem— Define the task for the agent to learn, including how the agent interacts with the environment and any primary and secondary goals the agent must achieve.

  2. Create environment— Define the environment within which the agent operates, including the interface between agent and environment and the environment dynamic model. For more information, see Create MATLAB Reinforcement Learning Environments and Create Simulink Reinforcement Learning Environments.

  3. Define reward— Specify the reward signal that the agent uses to measure its performance against the task goals and how to calculate this signal from the environment. For more information, see Define Reward Signals.

  4. Create agent— Create the agent, which includes defining a policy approximator (actor) and a value function approximator (critic) and configuring the agent learning algorithm. For more information, see Create Policies and Value Functions and Reinforcement Learning Agents.

  5. Train agent— Train the agent approximators using the defined environment, reward, and agent learning algorithm. For more information, see Train Reinforcement Learning Agents.

  6. Validate agent— Evaluate the performance of the trained agent by simulating the agent and environment together. For more information, see Train Reinforcement Learning Agents.

  7. Deploy policy— Deploy the trained policy approximator using, for example, generated GPU code. For more information, see Deploy Trained Reinforcement Learning Policies.
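
The following sketch runs through steps 2 through 7 in miniature, assuming Reinforcement Learning Toolbox is installed. The predefined cart-pole environment stands in for your own environment (its reward is built into the environment dynamics, covering steps 2 and 3), and the agent type and option values are arbitrary choices for illustration.

    % Steps 2 and 3: create an environment; here the reward is part of the
    % predefined environment's step dynamics.
    env = rlPredefinedEnv("CartPole-Discrete");
    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);

    % Step 4: create an agent; a DQN agent with a default critic is one choice.
    agent = rlDQNAgent(obsInfo,actInfo);

    % Step 5: train the agent against the environment (option values are arbitrary).
    trainOpts = rlTrainingOptions( ...
        "MaxEpisodes",500, ...
        "MaxStepsPerEpisode",500, ...
        "StopTrainingCriteria","AverageReward", ...
        "StopTrainingValue",480);
    trainingStats = train(agent,env,trainOpts);

    % Step 6: validate by simulating the trained agent in the environment.
    simOpts = rlSimulationOptions("MaxSteps",500);
    experience = sim(env,agent,simOpts);

    % Step 7: generate a standalone policy evaluation function for deployment.
    generatePolicyFunction(agent);

Each call corresponds to one workflow stage; the topics linked in the list above cover the many variations, such as custom environments, other agent types, and code generation for the deployed policy.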

Training an agent using reinforcement learning is an iterative process. Decisions and results in later stages can require you to return to an earlier stage in the learning workflow. For example, if the training process does not converge to an optimal policy within a reasonable amount of time, you might have to update some of the following before retraining the agent (a brief retraining sketch follows the list):

  • Training settings

  • Learning algorithm configuration

  • Policy and value function (actor and critic) approximators

  • Reward signal definition

  • Action and observation signals

  • Environment dynamics
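
For instance, one common iteration is to adjust the training settings and continue training the same agent. The following sketch reuses the variables from the earlier example; the new values are arbitrary.

    % Hypothetical iteration: allow more episodes, relax the stopping
    % criterion, and resume training the same agent from its current state.
    trainOpts.MaxEpisodes = 1000;
    trainOpts.StopTrainingValue = 450;
    trainingStats = train(agent,env,trainOpts);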
