
rlValueRepresentation

Value function critic representation for reinforcement learning agents

Description

This object implements a value function approximator to be used as a critic within a reinforcement learning agent. A value function is a function that maps an observation to a scalar value. The output represents the expected total long-term reward when the agent starts from the given observation and takes the best possible action. Value function critics therefore only need observations (but not actions) as inputs. After you create an rlValueRepresentation critic, use it to create an agent relying on a value function critic, such as an rlACAgent, rlPGAgent, or rlPPOAgent. For an example of this workflow, see Create Actor and Critic Representations. For more information on creating representations, see Create Policy and Value Function Representations.
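As an orientation, the following minimal sketch shows this workflow end to end (the environment, layer names, and option values are illustrative; the Examples section below walks through each variant in detail):

% Sketch: build a value function critic for an example environment.
env = rlPredefinedEnv('CartPole-Discrete');                 % example environment
obsInfo = getObservationInfo(env);
net = [featureInputLayer(obsInfo.Dimension(1),'Normalization','none','Name','obs')
       fullyConnectedLayer(1,'Name','value')];              % scalar value output
critic = rlValueRepresentation(net,obsInfo,'Observation',{'obs'});
% Pair the critic with an actor representation to create, for example, an rlACAgent.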

Creation

Description

example

critic = rlValueRepresentation(net,observationInfo,'Observation',obsName) creates the value function based critic from the deep neural network net. This syntax sets the ObservationInfo property of critic to the input observationInfo. obsName must contain the names of the input layers of net.

example

critic = rlValueRepresentation(tab,observationInfo) creates the value function based critic with a discrete observation space, from the value table tab, which is an rlTable object containing a column array with as many elements as the possible observations. This syntax sets the ObservationInfo property of critic to the input observationInfo.

example

critic = rlValueRepresentation({basisFcn,W0},observationInfo) creates the value function based critic using a custom basis function as the underlying approximator. The first input argument is a two-element cell array in which the first element contains the handle basisFcn to a custom basis function, and the second element contains the initial weight vector W0. This syntax sets the ObservationInfo property of critic to the input observationInfo.

critic = rlValueRepresentation(___,options) creates the value function based critic using the additional option set options, which is an rlRepresentationOptions object. This syntax sets the Options property of critic to the options input argument. You can use this syntax with any of the previous input-argument combinations.
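For instance, a minimal sketch of this last syntax, reusing net, observationInfo, and obsName from the first syntax (the option values are illustrative):

% Sketch: append an rlRepresentationOptions object to any of the previous syntaxes.
opts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
critic = rlValueRepresentation(net,observationInfo,'Observation',obsName,opts);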

Input Arguments


Deep neural network used as the underlying approximator within the critic, specified as one of the following:

The network input layers must be in the same order and with the same data type and dimensions as the signals defined in ObservationInfo. Also, the names of these input layers must match the observation names listed in obsName.

rlValueRepresentation objects support recurrent deep neural networks.

For a list of deep neural network layers, see List of Deep Learning Layers. For more information on creating deep neural networks for reinforcement learning, see Create Policy and Value Function Representations.

Observation names, specified as a cell array of strings or character vectors. The observation names must be the names of the input layers of net. These network layers must be in the same order and with the same data type and dimensions as the signals defined in ObservationInfo.

Example: {'my_obs'}

Value table, specified as an rlTable object containing a column vector with length equal to the number of observations. Each element i is the expected cumulative long-term reward when the agent starts from the given observation s and takes the best possible action. The elements of this vector are the learnable parameters of the representation.

Custom basis function, specified as a function handle to a user-defined function. The user-defined function can either be an anonymous function or a function on the MATLAB path. The output of the critic is c = W'*B, where W is a weight vector and B is the column vector returned by the custom basis function. c is the expected cumulative long-term reward when the agent starts from the given observation and takes the best possible action. The learnable parameters of this representation are the elements of W.

When you create a value function critic representation, your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here, obs1 to obsN are observations in the same order and with the same data type and dimensions as the signals defined in ObservationInfo.

Example: @(obs1,obs2,obs3) [obs3(1)*obs1(1)^2; abs(obs2(5)+obs1(2))]

Initial value of the basis function weights,W, specified as a column vector having the same length as the vector returned by the basis function.

Properties


Representation options, specified as an rlRepresentationOptions object. Available options include the optimizer used for training and the learning rate.

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array containing a mix of such objects. These objects define properties such as the dimensions, data types, and names of the observation signals.

rlValueRepresentation sets the ObservationInfo property of critic to the input observationInfo.

You can extractObservationInfofrom an existing environment or agent usinggetObservationInfo。You can also construct the specifications manually.
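For example (a sketch; the environment name and numeric dimensions are illustrative):

% Extract the specification from an environment ...
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);

% ... or construct it manually.
obsInfo = rlNumericSpec([4 1]);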

Object Functions

rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
getValue Obtain estimated value function representation

Examples


Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing 4 doubles.

obsInfo = rlNumericSpec([4 1]);

Create a deep neural network to approximate the value function within the critic. The input of the network (here called myobs) must accept a four-element vector (the observation vector defined by obsInfo), and the output must be a scalar (the value, representing the expected cumulative long-term reward when the agent starts from the given observation).

net = [featureInputLayer(4,'Normalization','none','Name','myobs')
    fullyConnectedLayer(1,'Name','value')];

Create the critic using the network, observation specification object, and name of the network input layer.

critic = rlValueRepresentation(net,obsInfo,'Observation',{'myobs'})
critic = rlValueRepresentation with properties: ObservationInfo: [1x1 rl.util.rlNumericSpec] Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the value of a random observation, using the current network weights.

V = getValue(critic,{rand(4,1)})
V = single
    0.7904

You can now use the critic (along with an actor) to create an agent relying on a value function critic (such as rlACAgent or rlPGAgent).

Create an actor representation and a critic representation that you can use to define a reinforcement learning agent such as an Actor Critic (AC) agent.

For this example, create actor and critic representations for an agent that can be trained against the cart-pole environment described in Train AC Agent to Balance Cart-Pole System. First, create the environment. Then, extract the observation and action specifications from the environment. You need these specifications to define the agent and critic representations.

env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

For a state-value-function critic such as those used for AC or PG agents, the inputs are the observations and the output should be a scalar value, the state value. For this example, create the critic representation using a deep neural network with one output, and with observation signals corresponding to x, xdot, theta, and thetadot as described in Train AC Agent to Balance Cart-Pole System. You can obtain the number of observations from the obsInfo specification. Name the network layer input 'observation'.

numObservation = obsInfo.Dimension(1);
criticNetwork = [
    featureInputLayer(numObservation,'Normalization','none','Name','observation')
    fullyConnectedLayer(1,'Name','CriticFC')];

Specify options for the critic representation using rlRepresentationOptions. These options control the learning of the critic network parameters. For this example, set the learning rate to 0.05 and the gradient threshold to 1.

repOpts = rlRepresentationOptions('LearnRate',5e-2,'GradientThreshold',1);

Create the critic representation using the specified neural network and options. Also, specify the observation information for the critic. Set the observation name to 'observation', which is the name of the criticNetwork input layer.

critic = rlValueRepresentation(criticNetwork,obsInfo,'Observation',{'observation'},repOpts)
critic = rlValueRepresentation with properties: ObservationInfo: [1x1 rl.util.rlNumericSpec] Options: [1x1 rl.option.rlRepresentationOptions]

Similarly, create a network for the actor. An AC agent decides which action to take given observations by using an actor representation. For an actor, the inputs are the observations, and the output depends on whether the action space is discrete or continuous. For the actor of this example, there are two possible discrete actions, –10 or 10. To create the actor, use a deep neural network with the same observation input as the critic, that can output these two values. You can obtain the number of actions from the actInfo specification. Name the output layer 'action'.

numAction = numel(actInfo.Elements);
actorNetwork = [
    featureInputLayer(numObservation,'Normalization','none','Name','observation')
    fullyConnectedLayer(numAction,'Name','action')];

Create the actor representation using the observation name and specification and the same representation options.

actor = rlStochasticActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},repOpts)
actor = rlStochasticActorRepresentation with properties: ActionInfo: [1x1 rl.util.rlFiniteSetSpec] ObservationInfo: [1x1 rl.util.rlNumericSpec] Options: [1x1 rl.option.rlRepresentationOptions]

Create the AC agent using the actor and critic representations.

agentOpts = rlACAgentOptions(...
    'NumStepsToLookAhead',32,...
    'DiscountFactor',0.99);
agent = rlACAgent(actor,critic,agentOpts)
agent = rlACAgent with properties: AgentOptions: [1x1 rl.option.rlACAgentOptions]

For additional examples showing how to create actor and critic representations for different agent types, see:

Create a finite set observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment with a discrete observation space). For this example, define the observation space as a finite set consisting of 4 possible values.

obsInfo = rlFiniteSetSpec([1 3 5 7]);

Create a table to approximate the value function within the critic.

vTable = rlTable(obsInfo);

The table is a column vector in which each entry stores the expected cumulative long-term reward for each possible observation as defined by obsInfo. You can access the table using the Table property of the vTable object. The initial value of each element is zero.

vTable.Table
ans = 4×1

     0
     0
     0
     0

You can also initialize the table to any value, in this case, an array containing all the integers from 1 to 4.

vTable.Table = reshape(1:4,4,1)
vTable = rlTable with properties: Table: [4x1 double]

Create the critic using the table and the observation specification object.

critic = rlValueRepresentation(vTable,obsInfo)
critic = rlValueRepresentation with properties: ObservationInfo: [1x1 rl.util.rlFiniteSetSpec] Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the value of a given observation, using the current table entries.

V = getValue(critic,{7})
V = 4
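This result matches the table initialized above: the observation 7 is the fourth element of the finite set [1 3 5 7], so its stored value is 4. As a quick hedged check, you can look up the entry directly:

% Manual check: index the table entry that corresponds to observation 7.
vTable.Table(obsInfo.Elements == 7)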

You can now use the critic (along with an actor) to create an agent relying on a value function critic (such as rlACAgent or rlPGAgent).

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing 4 doubles.

obsInfo = rlNumericSpec([4 1]);

Create a custom basis function to approximate the value function within the critic. The custom basis function must return a column vector. Each vector element must be a function of the observations defined by obsInfo.

myBasisFcn = @(myobs) [myobs(2)^2; myobs(3)+exp(myobs(1)); abs(myobs(4))]
myBasisFcn = function_handle with value: @(myobs)[myobs(2)^2;myobs(3)+exp(myobs(1));abs(myobs(4))]

The output of the critic is the scalar W'*myBasisFcn(myobs), where W is a weight column vector that must have the same length as the vector returned by the custom basis function. This output is the expected cumulative long-term reward when the agent starts from the given observation and takes the best possible action. The elements of W are the learnable parameters.

Define an initial parameter vector.

W0 = [3;5;2];

Create the critic. The first argument is a two-element cell containing both the handle to the custom function and the initial weight vector. The second argument is the observation specification object.

critic = rlValueRepresentation({myBasisFcn,W0},obsInfo)
critic = rlValueRepresentation with properties: ObservationInfo: [1x1 rl.util.rlNumericSpec] Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the value of a given observation, using the current parameter vector.

V = getValue(critic,{[2 4 6 8]'})
V = 1x1 dlarray
    130.9453
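As a quick sanity check (not part of the shipped example), you can reproduce this number by evaluating the basis function yourself:

% Manual check: V = W0'*myBasisFcn(obs) for obs = [2 4 6 8]'.
obs = [2 4 6 8]';
B = myBasisFcn(obs);     % [4^2; 6+exp(2); abs(8)]
vManual = W0'*B          % 3*16 + 5*(6+exp(2)) + 2*8, approximately 130.9453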

You can now use the critic (along with an actor) to create an agent relying on a value function critic (such as rlACAgent or rlPGAgent).

Create an environment and obtain the observation and action information.

env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numDiscreteAct = numel(actInfo.Elements);

Create a recurrent deep neural network for the critic. To create a recurrent neural network, use a sequenceInputLayer as the input layer and include at least one lstmLayer.

criticNetwork = [
    sequenceInputLayer(numObs,'Normalization','none','Name','state')
    fullyConnectedLayer(8,'Name','fc')
    reluLayer('Name','relu')
    lstmLayer(8,'OutputMode','sequence','Name','lstm')
    fullyConnectedLayer(1,'Name','output')];

Create a value function representation object for the critic.

criticOptions = rlRepresentationOptions('LearnRate',1e-2,'GradientThreshold',1);
critic = rlValueRepresentation(criticNetwork,obsInfo,...
    'Observation','state',criticOptions);
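As with the earlier critics, you could sanity-check this representation with getValue (a sketch following the pattern of the previous examples):

% Evaluate the recurrent critic for a random single-step observation (sketch).
V = getValue(critic,{rand(numObs,1)})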
Introduced in R2020a