Deep Q-Network Agents
The deep Q-network (DQN) algorithm is a model-free, online, off-policy reinforcement learning method. A DQN agent is a value-based reinforcement learning agent that trains a critic to estimate the return or future rewards. DQN is a variant of Q-learning. For more information on Q-learning, see Q-Learning Agents.
For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
DQN agents can be trained in environments with the following observation and action spaces.
Observation Space | Action Space
---|---
Continuous or discrete | Discrete
DQN agents use the following critic.
Critic | Actor
---|---
Q-value function critic Q(S,A), which you create using rlQValueFunction | DQN agents do not use an actor.
During training, the agent:
Updates the critic properties at each time step during learning.
Explores the action space using epsilon-greedy exploration. During each control interval, the agent either selects a random action with probability ε or selects an action greedily with respect to the value function with probability 1−ε. This greedy action is the action for which the value function is greatest.
Stores past experiences using a circular experience buffer. The agent updates the critic based on a mini-batch of experiences randomly sampled from the buffer.
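The epsilon-greedy rule above can be sketched in a few lines of MATLAB. This is illustrative only; the agent performs this selection internally. Here `qValues` (a vector of critic values over the discrete actions) and `epsilon` are hypothetical variables.

```matlab
% Minimal sketch of epsilon-greedy action selection over a discrete
% action set; qValues and epsilon are hypothetical variables.
if rand < epsilon
    action = randi(numel(qValues));  % explore with probability epsilon
else
    [~,action] = max(qValues);       % greedy action with probability 1-epsilon
end
```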
Critic Function Approximators
To estimate the value function, a DQN agent maintains two function approximators:
Critic Q(S,A;φ) — The critic, with parameters φ, takes observation S and action A as inputs and returns the corresponding expectation of the long-term reward.
Target critic Qt(S,A;φt) — To improve the stability of the optimization, the agent periodically updates the target critic parameters φt using the latest critic parameter values.
Both Q(S,A;φ) and Qt(S,A;φt) have the same structure and parameterization.
For more information on creating critic function approximators, see Create Policies and Value Functions.
During training, the agent tunes the parameter values in φ. After training, the parameters remain at their tuned values and the trained value function approximator is stored in critic Q(S,A).
Agent Creation
You can create and train DQN agents at the MATLAB® command line or using the Reinforcement Learning Designer app. For more information on creating agents using Reinforcement Learning Designer, see Create Agents Using Reinforcement Learning Designer.
At the command line, you can create a DQN agent based on the observation and action specifications of the environment. To do so, perform the following steps.
1. Create observation specifications for your environment. If you already have an environment interface object, you can obtain these specifications using getObservationInfo.
2. Create action specifications for your environment. If you already have an environment interface object, you can obtain these specifications using getActionInfo.
3. If needed, specify the number of neurons in each learnable layer or whether to use an LSTM layer. To do so, create an agent initialization option object using rlAgentInitializationOptions.
4. If needed, specify agent options using an rlDQNAgentOptions object.
5. Create the agent using an rlDQNAgent object.
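The steps above can be sketched as follows. This assumes `env` is an existing environment interface object (for example, one returned by rlPredefinedEnv); the option values are arbitrary illustrations, not recommendations.

```matlab
% Sketch of default DQN agent creation from environment specifications,
% assuming env is an existing environment interface object.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Optional: control the size of the default network layers.
initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);

% Optional: set agent options.
agentOpts = rlDQNAgentOptions(MiniBatchSize=128);

% Create the agent with a default critic matching the specifications.
agent = rlDQNAgent(obsInfo,actInfo,initOpts,agentOpts);
```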
Alternatively, you can create a critic and use it to create your agent. In this case, ensure that the input and output dimensions of the critic match the corresponding observation and action specifications of the environment.
1. Create a critic using an rlQValueFunction object.
2. Specify agent options using an rlDQNAgentOptions object.
3. Create the agent using an rlDQNAgent object.
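As a rough sketch of this workflow, the following builds a single-output Q-value critic with separate observation and action input paths. It assumes `env` is an existing environment with one observation channel and a discrete action channel; the layer sizes are arbitrary.

```matlab
% Illustrative sketch: create a DQN agent from a custom critic,
% assuming env is an existing environment interface object.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Observation and action input paths, joined into a scalar Q(S,A) output.
obsPath = featureInputLayer(prod(obsInfo.Dimension),Name="obsIn");
actPath = featureInputLayer(prod(actInfo.Dimension),Name="actIn");
commonPath = [
    concatenationLayer(1,2,Name="cat")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(1)       % scalar Q-value output
    ];

net = layerGraph();
net = addLayers(net,obsPath);
net = addLayers(net,actPath);
net = addLayers(net,commonPath);
net = connectLayers(net,"obsIn","cat/in1");
net = connectLayers(net,"actIn","cat/in2");

critic = rlQValueFunction(dlnetwork(net),obsInfo,actInfo, ...
    ObservationInputNames="obsIn",ActionInputNames="actIn");

agentOpts = rlDQNAgentOptions(UseDoubleDQN=true);
agent = rlDQNAgent(critic,agentOpts);
```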
DQN agents support critics that use recurrent deep neural networks as function approximators.
For more information on creating actors and critics for function approximation, see Create Policies and Value Functions.
Training Algorithm
DQN agents use the following training algorithm, in which they update their critic model at each time step. To configure the training algorithm, specify options using an rlDQNAgentOptions object.
Initialize the critic Q(S,A;φ) with random parameter values φ, and initialize the target critic parameters φt with the same values: φt = φ.
For each training time step:
1. For the current observation S, select a random action A with probability ε. Otherwise, select the action for which the critic value function is greatest. To specify ε and its decay rate, use the EpsilonGreedyExploration option.
2. Execute action A. Observe the reward R and next observation S'.
3. Store the experience (S,A,R,S') in the experience buffer.
4. Sample a random mini-batch of M experiences (Si,Ai,Ri,S'i) from the experience buffer. To specify M, use the MiniBatchSize option.
5. If S'i is a terminal state, set the value function target yi to Ri. Otherwise, set it to

   yi = Ri + γ·maxA' Qt(S'i,A';φt)

   To set the discount factor γ, use the DiscountFactor option. To use double DQN, set the UseDoubleDQN option to true, in which case the action is selected by the critic and evaluated by the target critic:

   yi = Ri + γ·Qt(S'i, argmaxA' Q(S'i,A';φ); φt)

6. Update the critic parameters by one-step minimization of the loss L across all sampled experiences:

   L = (1/M) Σi (yi − Q(Si,Ai;φ))²

7. Update the target critic parameters depending on the target update method. For more information, see Target Update Methods.
8. Update the probability threshold ε for selecting a random action based on the decay rate you specified in the EpsilonGreedyExploration option.
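The target computation for one mini-batch can be sketched as follows. This is illustrative only; the agent performs these updates internally. `Qcritic` and `Qtarget` are hypothetical functions returning a vector of Q-values over the discrete actions for a given observation, and `M`, `R`, `Sprime`, `isDone`, `gamma`, and `useDoubleDQN` are hypothetical variables.

```matlab
% Illustrative computation of DQN value function targets for a mini-batch.
y = zeros(M,1);
for i = 1:M
    if isDone(i)                                 % S'_i is terminal
        y(i) = R(i);
    elseif useDoubleDQN
        [~,aStar] = max(Qcritic(Sprime(:,i)));   % action chosen by critic
        qNext = Qtarget(Sprime(:,i));
        y(i) = R(i) + gamma*qNext(aStar);        % evaluated by target critic
    else
        y(i) = R(i) + gamma*max(Qtarget(Sprime(:,i)));
    end
end
```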
Target Update Methods
DQN agents update their target critic parameters using one of the following target update methods.
Smoothing — Update the target parameters at every time step using smoothing factor τ:

φt = τφ + (1−τ)φt

To specify the smoothing factor, use the TargetSmoothFactor option.
Periodic — Update the target parameters periodically without smoothing (TargetSmoothFactor = 1). To specify the update period, use the TargetUpdateFrequency parameter.
Periodic smoothing — Update the target parameters periodically with smoothing.
To configure the target update method, create an rlDQNAgentOptions object, and set the TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table.
Update Method | TargetUpdateFrequency | TargetSmoothFactor
---|---|---
Smoothing (default) | 1 | Less than 1
Periodic | Greater than 1 | 1
Periodic smoothing | Greater than 1 | Less than 1
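The three configurations in the table map onto rlDQNAgentOptions as sketched below; the specific numeric values are arbitrary illustrations.

```matlab
% Smoothing (default): update the target every step with factor tau = 1e-3.
optsSmooth = rlDQNAgentOptions( ...
    TargetUpdateFrequency=1,TargetSmoothFactor=1e-3);

% Periodic: copy the critic parameters to the target every 4 steps.
optsPeriodic = rlDQNAgentOptions( ...
    TargetUpdateFrequency=4,TargetSmoothFactor=1);

% Periodic smoothing: smoothed update every 4 steps.
optsBoth = rlDQNAgentOptions( ...
    TargetUpdateFrequency=4,TargetSmoothFactor=1e-3);
```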
References
[1] Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with Deep Reinforcement Learning." ArXiv:1312.5602 [cs], December 19, 2013. https://arxiv.org/abs/1312.5602.