4

One way of understanding the difference between value function approaches, policy approaches and actor-critic approaches in reinforcement learning is the following:

  • A critic explicitly models a value function for a policy.
  • An actor explicitly models a policy.

Value function approaches, such as Q-learning, only keep track of a value function, and the policy is directly derived from it (e.g. greedily or epsilon-greedily). Therefore, these approaches can be classified as "critic-only" approaches.
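For concreteness, here is a minimal sketch (a toy NumPy Q-table with made-up sizes; the helper names are hypothetical) of how a critic-only method reads its policy directly off the value estimates:

```python
import numpy as np

# Hypothetical tabular action-value function: Q[state, action]
Q = np.zeros((5, 2))  # e.g. 5 states, 2 actions

def greedy_action(Q, state):
    # The "policy" is nothing more than an argmax over the value estimates
    return int(np.argmax(Q[state]))

def epsilon_greedy_action(Q, state, epsilon=0.1, rng=None):
    # With probability epsilon pick a random action, otherwise act greedily
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return greedy_action(Q, state)
```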

Some policy search/gradient approaches, such as REINFORCE, only use a policy representation; therefore, I would argue that these approaches can be classified as "actor-only" approaches.

Of course, many policy search/gradient approaches also use value models in addition to a policy model. These algorithms are commonly referred to as "actor-critic" approaches (well-known ones are A2C / A3C).
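For concreteness, here is a rough sketch of the distinction (a toy tabular softmax policy with illustrative numbers; nothing here is taken from a specific implementation): REINFORCE updates only the policy parameters from a sampled return, while an actor-critic additionally maintains a value estimate and uses it as a baseline:

```python
import numpy as np

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # actor: tabular softmax policy parameters
V = np.zeros(n_states)                   # critic: state-value estimates

def softmax_policy(theta, s):
    prefs = theta[s] - theta[s].max()    # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    # For a tabular softmax: d log pi(a|s) / d theta[s, :] = one_hot(a) - pi(.|s)
    g = -softmax_policy(theta, s)
    g[a] += 1.0
    return g

# One illustrative update from a single (state, action, return) sample
s, a, G = 1, 0, 2.5
alpha, beta = 0.1, 0.1

# Actor-only (REINFORCE): scale the log-probability gradient by the sampled return
theta[s] += alpha * G * grad_log_pi(theta, s, a)

# Actor-critic: use the critic as a baseline and update the critic as well
advantage = G - V[s]
theta[s] += alpha * advantage * grad_log_pi(theta, s, a)
V[s] += beta * advantage
```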

Keeping this taxonomy intact for model-based dynamic programming algorithms, I would argue that value iteration is an actor-only approach, and policy iteration is an actor-critic approach. However, not many people discuss the term actor-critic when referring to policy iteration. How come?

Also, I am not familiar with any model-based/dynamic-programming-like actor-only approaches. Do these exist? If not, what prevents this from happening?


2 Answers

3

Keeping this taxonomy intact for model-based dynamic programming algorithms, I would argue that value iteration is an actor-only approach, and policy iteration is an actor-critic approach. However, not many people discuss the term actor-critic when referring to policy iteration. How come?

Both policy iteration and value iteration are value-based approaches. The policy in policy iteration is either arbitrary or derived from a value table. It is not modelled separately.

To count as an Actor, the policy function needs to be modelled directly as a parametric function of the state, not indirectly via a value assessment. You cannot use policy gradient methods to adjust an Actor's policy function unless it is possible to derive the gradient of the policy function with respect to parameters that control the relationship between state and action. An Actor policy might be written as $\pi(a|s,\theta)$, and the parameters $\theta$ are what make it possible to learn improvements.
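As an illustration only (a linear-softmax actor over hand-made state features; `phi` and all numbers are hypothetical), the key property is that $\pi(a|s,\theta)$ is an explicit, differentiable function of $\theta$, so $\nabla_\theta \log \pi(a|s,\theta)$ exists and can drive learning:

```python
import numpy as np

def phi(s):
    # Hypothetical state features for a scalar state s
    return np.array([1.0, s, s ** 2])

def pi(theta, s):
    # Softmax over linear action preferences: an explicit function of theta
    prefs = theta @ phi(s)
    prefs -= prefs.max()
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    # The gradient of log pi(a|s,theta) w.r.t. theta exists because the policy
    # is parameterised directly, not derived from value estimates
    probs = pi(theta, s)
    g = -np.outer(probs, phi(s))
    g[a] += phi(s)
    return g

theta = np.zeros((2, 3))  # 2 actions, 3 features
print(pi(theta, s=1.0), grad_log_pi(theta, s=1.0, a=0))
```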

Policy iteration often generates an explicit policy from the current value estimates. This is not a representation that can be directly manipulated; instead, it is a consequence of measuring values, and there are no parameters that can be learned. Therefore the policy seen in policy iteration cannot be used as an actor in Actor-Critic or related methods.
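By contrast, here is a sketch of how policy iteration "produces" its policy, using a tiny made-up model with transition probabilities `P[s, a, s']` and expected rewards `R[s, a]`: the policy is just an argmax over a one-step lookahead, and there is no parameter vector to take gradients with respect to:

```python
import numpy as np

# Tiny made-up model: transitions P[s, a, s'] and expected rewards R[s, a]
P = np.full((3, 2, 3), 1.0 / 3.0)
R = np.array([[0.0, 1.0],
              [0.5, 0.0],
              [1.0, 1.0]])
V = np.zeros(3)   # current value estimates
gamma = 0.9

def greedy_policy_from_values(P, R, V, gamma):
    # One-step lookahead under the model...
    q = R + gamma * np.einsum("sat,t->sa", P, V)
    # ...and the "policy" is simply the argmax: there is no theta to adjust
    return q.argmax(axis=1)

print(greedy_policy_from_values(P, R, V, gamma))
```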

Another way to state this is that the policy and value functions in DP are not separate enough to be considered an actor/critic pair. Instead they are both views of the same measurement, with the value function being closer to the raw measurements and the policy being a mapping of the value function into policy space.

Also, I am not familiar with any model-based/dynamic-programming-like actor-only approaches. Do these exist? If not, what prevents this from happening?

The main difference between model-based dynamic programming and model-free methods like Q-learning, or SARSA, is that the dynamic programming methods directly use the full distribution model (which can be expressed as $p(r, s'|s,a)$) to calculate expected bootstrapped returns.
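A minimal sketch of that difference, with the model factored into expected rewards `R[s, a]` and transition probabilities `P[s, a, s']` (placeholder numbers only): the DP backup is a full expectation under the model, whereas a model-free method such as Q-learning bootstraps from a single sampled transition:

```python
import numpy as np

n_s, n_a, gamma = 3, 2, 0.9
P = np.full((n_s, n_a, n_s), 1.0 / n_s)  # transition probabilities p(s'|s,a)
R = np.ones((n_s, n_a))                  # expected rewards r(s,a)
V = np.zeros(n_s)                        # current state values

# Dynamic programming: a full expectation under the model, no sampling needed
Q_expected = R + gamma * P @ V           # shape (n_s, n_a)

# Model-free TD (e.g. Q-learning): bootstrap from one sampled transition (s, a, r, s')
Q = np.zeros((n_s, n_a))
s, a, r, s_next, alpha = 0, 1, 1.0, 2, 0.1
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```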

There is nothing in principle stopping you from substituting expected returns calculated in this way into REINFORCE or Actor-Critic methods. However, it may be computationally hard to do so - these methods are often chosen when the action space is large, for instance.

Basic REINFORCE using model-based expectations would be especially hard, as you would need an expected value calculated over all possible trajectories from each starting state. If you are going to expand the tree of all possible outcomes to that degree, then a simple tree search algorithm would perform better, and the algorithm effectively reduces to a one-off exhaustive tree-search planning step.

Actor-Critic using dynamic programming methods for the Critic should be viable, and I expect you could find examples of it being done in some situations. It may work well for some card or board games, if the combined action space and state space is not too large. It would behave a little like using Expected SARSA for the Critic component, except that it would also run expectations over the state-transition dynamics (whilst Expected SARSA only runs expectations over the policy). You could vary the depth of this too, getting theoretically better estimates at the expense of extra computation (potentially a lot of extra computation if there is a large branching factor).
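As a sketch of what such a model-based critic target might look like (everything here is hypothetical: a small `(P, R)` model, a tabular critic `V`, and an actor distribution `pi_probs[s, a]`), the target is an expectation over both the actor's policy and the transition dynamics rather than a sampled TD target:

```python
import numpy as np

n_s, n_a, gamma = 3, 2, 0.9
P = np.full((n_s, n_a, n_s), 1.0 / n_s)  # transition model p(s'|s,a)
R = np.ones((n_s, n_a))                  # expected rewards r(s,a)
V = np.zeros(n_s)                        # critic: state values
pi_probs = np.full((n_s, n_a), 0.5)      # actor: pi(a|s) as a probability table

def expected_critic_target(s):
    # E_{a ~ pi(.|s)} E_{s' ~ p(.|s,a)} [ r(s,a) + gamma * V(s') ]
    q = R[s] + gamma * P[s] @ V          # one-step expected return per action
    return pi_probs[s] @ q               # averaged over the actor's policy

# Critic update towards the model-based expected target (no sampled a or s')
s, beta = 0, 0.1
V[s] += beta * (expected_critic_target(s) - V[s])
```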

7
  • I think the confusion about whether policy iteration is an actor-critic method lies in the fact that in actor-critic methods you use the value function to guide the search for the policy. In policy iteration, you actually use the value function to derive the policy too. I don't think it's fully clear from your answer why policy iteration couldn't be considered an actor-critic method.
    – nbro
    Commented May 13, 2020 at 11:49
  • Thanks for your elaborate answer. In policy iteration, however, the policy function is usually directly represented, right? Sure, it is quite a trivial step to do so (policy improvement), but it is done nonetheless. For example, see the algorithm shown in Sutton and Barto: incompleteideas.net/sutton/book/first/4/…
    – dan888
    Commented May 13, 2020 at 12:14
  • About how REINFORCE would translate into a DP method: wouldn't this resolve to some form of policy iteration? 1. You generate a "trajectory" and calculate its return; in a model-based sense, however, this trajectory updates the value based on all possible successor states until a terminal state, somewhat emulating a step of policy evaluation. 2. You improve the policy based on the gradient signal from this evaluation -> policy improvement. Sure, doing it this way doesn't intuitively sound very smart. Why not update all states in a normal policy evaluation manner?
    – dan888
    Commented May 13, 2020 at 12:18
  • @NeilSlater Okay, so the key difference between PI and Actor-Critic would be that PI generates a policy in every step, whereas AC modifies/adapts an older policy in every step, right? Do you happen to know of any sources that would argue similarly?
    – dan888
    Commented May 13, 2020 at 12:23
  • 1
    @nbro: I have tried to make that clearer, the difference being that the policy derivation method in DP is not a direct parametric function over state that can be adjusted by changing those parameters. I guess another way to state this is that the Actor and Critic are not separate enough; they are just transformations of the same measurement. Commented May 13, 2020 at 13:42
0

policy iteration is an actor-critic approach.

This is a very insightful observation and I would agree with it. You can find similar statements in this lecture and this NeurIPS paper.

The actor's role is to perform policy improvement, and the critic's role is to perform policy evaluation. This also helps explain policy iteration's advantage over value-only approaches: PI typically converges in fewer iterations than VI, although each iteration is more expensive.
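To make the correspondence concrete, here is a small policy-iteration sketch (a random made-up model `(P, R)` and tabular values) with the two roles labelled: the evaluation sweep plays the critic and the greedy improvement step plays the actor:

```python
import numpy as np

n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = np.full((n_s, n_a, n_s), 1.0 / n_s)  # made-up transition model p(s'|s,a)
R = rng.random((n_s, n_a))               # made-up expected rewards r(s,a)

policy = np.zeros(n_s, dtype=int)
V = np.zeros(n_s)

for _ in range(100):
    # "Critic": policy evaluation (a few iterative sweeps rather than an exact solve)
    for _ in range(25):
        V = R[np.arange(n_s), policy] + gamma * P[np.arange(n_s), policy] @ V
    # "Actor": policy improvement, greedy with respect to the evaluated values
    new_policy = (R + gamma * P @ V).argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("policy:", policy, "values:", V)
```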

