Deep Q-Network Agents
The deep Q-network (DQN) algorithm is a model-free, online, off-policy reinforcement learning method. A DQN agent is a value-based reinforcement learning agent that trains a critic to estimate the return or future rewards. DQN is a variant of Q-learning. For more information on Q-learning, see Q-Learning Agents.
For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
DQN agents can be trained in environments with the following observation and action spaces.
Observation Space | Action Space
---|---
Continuous or discrete | Discrete
DQN agents use the following critic.
Critic | Actor
---|---
Q-value function critic Q(S,A), which you create using rlQValueFunction | DQN agents do not use an actor.
During training, the agent:
Updates the critic properties at each time step during learning.
Explores the action space using epsilon-greedy exploration. During each control interval, the agent either selects a random action with probability ε or selects an action greedily with respect to the value function with probability 1−ε. This greedy action is the action for which the value function is greatest.
Stores past experiences using a circular experience buffer. The agent updates the critic based on a mini-batch of experiences randomly sampled from the buffer.
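The epsilon-greedy rule above can be sketched in a few lines of MATLAB. This is illustrative only; the agent performs this selection internally. Here `qValues` (a vector of critic values over the discrete actions) and `epsilon` are hypothetical variables.

```matlab
% Minimal sketch of epsilon-greedy action selection over a discrete
% action set; qValues and epsilon are hypothetical variables.
if rand < epsilon
    action = randi(numel(qValues));  % explore with probability epsilon
else
    [~,action] = max(qValues);       % greedy action with probability 1-epsilon
end
```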
Critic Function Approximators
To estimate the value function, a DQN agent maintains two function approximators:
Critic Q(S,A;φ) — The critic, with parameters φ, takes observation S and action A as inputs and returns the corresponding expectation of the long-term reward.
Target critic Qt(S,A;φt) — To improve the stability of the optimization, the agent periodically updates the target critic parameters φt using the latest critic parameter values.
Both Q(S,A;φ) and Qt(S,A;φt) have the same structure and parameterization.
For more information on creating critic function approximators, see Create Policies and Value Functions.
During training, the agent tunes the parameter values in φ. After training, the parameters remain at their tuned values and the trained value function approximator is stored in critic Q(S,A).
Agent Creation
You can create and train DQN agents at the MATLAB® command line or using the Reinforcement Learning Designer app. For more information on creating agents using Reinforcement Learning Designer, see Create Agents Using Reinforcement Learning Designer.
At the command line, you can create a DQN agent based on the observation and action specifications of the environment. To do so, perform the following steps.
1. Create observation specifications for your environment. If you already have an environment interface object, you can obtain these specifications using getObservationInfo.
2. Create action specifications for your environment. If you already have an environment interface object, you can obtain these specifications using getActionInfo.
3. If needed, specify the number of neurons in each learnable layer or whether to use an LSTM layer. To do so, create an agent initialization option object using rlAgentInitializationOptions.
4. If needed, specify agent options using an rlDQNAgentOptions object.
5. Create the agent using an rlDQNAgent object.
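The steps above can be sketched as follows. This assumes `env` is an existing environment interface object (for example, one returned by rlPredefinedEnv); the option values are arbitrary illustrations, not recommendations.

```matlab
% Sketch of default DQN agent creation from environment specifications,
% assuming env is an existing environment interface object.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Optional: control the size of the default network layers.
initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);

% Optional: set agent options.
agentOpts = rlDQNAgentOptions(MiniBatchSize=128);

% Create the agent with a default critic matching the specifications.
agent = rlDQNAgent(obsInfo,actInfo,initOpts,agentOpts);
```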
Alternatively, you can create a critic and use it to create your agent. In this case, ensure that the input and output dimensions of the critic match the corresponding observation and action specifications of the environment.
1. Create a critic using an rlQValueFunction object.
2. Specify agent options using an rlDQNAgentOptions object.
3. Create the agent using an rlDQNAgent object.
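As a rough sketch of this workflow, the following builds a single-output Q-value critic with separate observation and action input paths. It assumes `env` is an existing environment with one observation channel and a discrete action channel; the layer sizes are arbitrary.

```matlab
% Illustrative sketch: create a DQN agent from a custom critic,
% assuming env is an existing environment interface object.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Observation and action input paths, joined into a scalar Q(S,A) output.
obsPath = featureInputLayer(prod(obsInfo.Dimension),Name="obsIn");
actPath = featureInputLayer(prod(actInfo.Dimension),Name="actIn");
commonPath = [
    concatenationLayer(1,2,Name="cat")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(1)       % scalar Q-value output
    ];

net = layerGraph();
net = addLayers(net,obsPath);
net = addLayers(net,actPath);
net = addLayers(net,commonPath);
net = connectLayers(net,"obsIn","cat/in1");
net = connectLayers(net,"actIn","cat/in2");

critic = rlQValueFunction(dlnetwork(net),obsInfo,actInfo, ...
    ObservationInputNames="obsIn",ActionInputNames="actIn");

agentOpts = rlDQNAgentOptions(UseDoubleDQN=true);
agent = rlDQNAgent(critic,agentOpts);
```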
DQN agents support critics that use recurrent deep neural networks as function approximators.
For more information on creating actors and critics for function approximation, see Create Policies and Value Functions.
Training Algorithm
DQN agents use the following training algorithm, in which they update their critic model at each time step. To configure the training algorithm, specify options using an rlDQNAgentOptions object.
Initialize the critic Q(S,A;φ) with random parameter values φ, and initialize the target critic parameters φt with the same values: φt = φ.
For each training time step:
1. For the current observation S, select a random action A with probability ε. Otherwise, select the action for which the critic value function is greatest. To specify ε and its decay rate, use the EpsilonGreedyExploration option.
2. Execute action A. Observe the reward R and next observation S'.
3. Store the experience (S,A,R,S') in the experience buffer.
4. Sample a random mini-batch of M experiences (Si,Ai,Ri,S'i) from the experience buffer. To specify M, use the MiniBatchSize option.
5. If S'i is a terminal state, set the value function target yi to Ri. Otherwise, set it to

   yi = Ri + γ·maxA' Qt(S'i,A';φt)

   To set the discount factor γ, use the DiscountFactor option. To use double DQN, set the UseDoubleDQN option to true, in which case the action is selected by the critic and evaluated by the target critic:

   yi = Ri + γ·Qt(S'i, argmaxA' Q(S'i,A';φ); φt)

6. Update the critic parameters by one-step minimization of the loss L across all sampled experiences:

   L = (1/M) Σi (yi − Q(Si,Ai;φ))²

7. Update the target critic parameters depending on the target update method. For more information, see Target Update Methods.
8. Update the probability threshold ε for selecting a random action based on the decay rate you specified in the EpsilonGreedyExploration option.
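The target computation for one mini-batch can be sketched as follows. This is illustrative only; the agent performs these updates internally. `Qcritic` and `Qtarget` are hypothetical functions returning a vector of Q-values over the discrete actions for a given observation, and `M`, `R`, `Sprime`, `isDone`, `gamma`, and `useDoubleDQN` are hypothetical variables.

```matlab
% Illustrative computation of DQN value function targets for a mini-batch.
y = zeros(M,1);
for i = 1:M
    if isDone(i)                                 % S'_i is terminal
        y(i) = R(i);
    elseif useDoubleDQN
        [~,aStar] = max(Qcritic(Sprime(:,i)));   % action chosen by critic
        qNext = Qtarget(Sprime(:,i));
        y(i) = R(i) + gamma*qNext(aStar);        % evaluated by target critic
    else
        y(i) = R(i) + gamma*max(Qtarget(Sprime(:,i)));
    end
end
```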
Target Update Methods
DQN agents update their target critic parameters using one of the following target update methods.
Smoothing — Update the target parameters at every time step using smoothing factor τ:

φt = τφ + (1−τ)φt

To specify the smoothing factor, use the TargetSmoothFactor option.
Periodic — Update the target parameters periodically without smoothing (TargetSmoothFactor = 1). To specify the update period, use the TargetUpdateFrequency parameter.
Periodic smoothing — Update the target parameters periodically with smoothing.
To configure the target update method, create an rlDQNAgentOptions object, and set the TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table.
Update Method | TargetUpdateFrequency | TargetSmoothFactor
---|---|---
Smoothing (default) | 1 | Less than 1
Periodic | Greater than 1 | 1
Periodic smoothing | Greater than 1 | Less than 1
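The three configurations in the table map onto rlDQNAgentOptions as sketched below; the specific numeric values are arbitrary illustrations.

```matlab
% Smoothing (default): update the target every step with factor tau = 1e-3.
optsSmooth = rlDQNAgentOptions( ...
    TargetUpdateFrequency=1,TargetSmoothFactor=1e-3);

% Periodic: copy the critic parameters to the target every 4 steps.
optsPeriodic = rlDQNAgentOptions( ...
    TargetUpdateFrequency=4,TargetSmoothFactor=1);

% Periodic smoothing: smoothed update every 4 steps.
optsBoth = rlDQNAgentOptions( ...
    TargetUpdateFrequency=4,TargetSmoothFactor=1e-3);
```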
References
[1] Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with Deep Reinforcement Learning." ArXiv:1312.5602 [cs], December 19, 2013. https://arxiv.org/abs/1312.5602.