
Q-Learning

Q-Learning is an off-policy algorithm for Temporal Difference learning.
It can be proven that, given sufficient training under any ε-soft
policy, the algorithm converges with probability 1 to a close approximation of
the action-value function for an arbitrary target policy.
Q-Learning learns the optimal policy even when actions are selected
according to a more exploratory or even random policy. The procedural form
of the algorithm is:
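The update rule at the heart of that procedure (reconstructed here in its standard form, since the original figure is not preserved in this text) is:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```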

The parameters used in the Q-value update process are:

- **α**, the learning rate, set between 0 and 1. Setting it to 0 means that the Q-values are never updated, hence nothing is learned. Setting a high value such as 0.9 means that learning can occur quickly.

- **γ**, the discount factor, also set between 0 and 1. This models the fact that future rewards are worth less than immediate rewards. Mathematically, the discount factor needs to be set less than 1 for the algorithm to converge.

- **max Q(s', a')**, the maximum reward that is attainable in the state following the current one, i.e. the reward for taking the optimal action thereafter.

This procedural approach can be translated into plain English steps as follows:

- Initialize the Q-values table, **Q(s, a)**.
- Observe the current state, **s**.
- Choose an action, **a**, for that state based on one of the action selection policies explained on the previous page (ε-soft, ε-greedy or softmax).
- Take the action, and observe the reward, **r**, as well as the new state, **s'**.
- Update the Q-value for the state using the observed reward and the maximum reward possible for the next state. The updating is done according to the formula and parameters described above.
- Set the state to the new state, and repeat the process until a terminal state is reached.
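The steps above can be sketched in Python for a small tabular problem. This is a minimal sketch, not the applet's implementation: the `env` interface (`reset()` returning a state, `step(a)` returning `(s', r, done)`) and the ε-greedy selection are assumptions made for illustration.

```python
import random

def q_learning(env, n_states, n_actions, alpha=0.5, gamma=0.9,
               epsilon=0.1, episodes=500):
    """Tabular Q-Learning with epsilon-greedy action selection.

    The env interface (reset/step) is a hypothetical one assumed
    for this sketch.
    """
    # Initialize the Q-values table Q(s, a)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = env.reset()                 # observe the current state s
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2, r, done = env.step(a)   # take a, observe r and s'
            # off-policy update: uses the maximum Q-value in the next state
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2                      # move to the new state
    return Q
```

Note that the update uses `max(Q[s2])` regardless of which action will actually be taken next; that is what makes Q-Learning off-policy.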

Sarsa

The Sarsa algorithm is an on-policy algorithm for TD-Learning. The major
difference between it and Q-Learning is that the maximum reward for the
next state is not necessarily used for updating the Q-values. Instead,
a new action, and therefore reward, is selected using the same policy
that determined the original action. The name Sarsa actually comes from
the fact that the updates are done using the quintuple
**(s, a, r, s', a')**,
where **s, a** are the original state and
action, **r** is the reward observed in the following state,
and **s', a'** are the new state-action pair.
The procedural form of the Sarsa algorithm is comparable to that of
Q-Learning:
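The Sarsa update rule (again reconstructed in its standard form, as the original figure is not preserved in this text) replaces the maximum with the Q-value of the action actually selected next:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma\, Q(s', a') - Q(s, a) \right]
```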

As you can see, two action selection steps are needed: one for determining the first action and another for the next state-action pair. The parameters **α** and **γ** have the same meaning as they do in Q-Learning.
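The on-policy variant can be sketched in the same style; as before, the `env` interface and the ε-greedy helper are assumptions for illustration, not the applet's code. The only substantive change from the Q-Learning sketch is that the update uses `Q[s2][a2]` for the action the policy actually selects next.

```python
import random

def epsilon_greedy(Q, s, n_actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda i: Q[s][i])

def sarsa(env, n_states, n_actions, alpha=0.5, gamma=0.9,
          epsilon=0.1, episodes=500):
    """Tabular Sarsa: the next action a' is chosen by the same policy
    that chose a, and its Q-value is used in the update (on-policy).

    The env interface (reset/step) is a hypothetical one assumed
    for this sketch.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, epsilon)        # first selection step
        done = False
        while not done:
            s2, r, done = env.step(a)                       # observe r and s'
            a2 = epsilon_greedy(Q, s2, n_actions, epsilon)  # second selection step
            # on-policy update: uses Q(s', a'), not the max over actions
            Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])
            s, a = s2, a2
    return Q
```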

Example

To highlight the difference between Q-Learning and Sarsa, an example from [1] will be used: the cliff world shown below.

Next...

Test drive the algorithms and Action Selection policies using our applet.
