Q-Learning is an Off-Policy algorithm for Temporal Difference learning.
It can be proven that, given sufficient training during which every
state-action pair continues to be visited, the algorithm converges with
probability 1 to a close approximation of the optimal action-value function.
Q-Learning learns the optimal policy even when actions are selected
according to a more exploratory or even random policy. In procedural form,
the algorithm repeatedly updates the current estimate Q(s, a) towards the
observed reward plus the discounted maximum Q-value of the next state:

Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') - Q(s, a)]
The parameters used in the Q-value update process are:
- α, the learning rate, set between 0 and 1. Setting it to 0 means that the Q-values are never updated, so nothing is learned; a high value such as 0.9 means that learning can occur quickly.
- γ, the discount factor, also set between 0 and 1. This models the fact that future rewards are worth less than immediate rewards. Mathematically, the discount factor must be set strictly less than 1 for the algorithm to converge.
- max Q(s', a'), the maximum reward that is attainable in the state following the current one, i.e. the estimated reward for taking the optimal action from that state onwards.
This procedural approach can be translated into plain English steps as follows:
1. Initialize the Q-values table, Q(s, a).
2. Observe the current state, s.
3. Choose an action, a, for that state based on the current action selection policy.
4. Take the action, and observe the reward, r, as well as the new state, s'.
5. Update the Q-value for the state using the observed reward and the maximum reward attainable in the next state.
6. Set the state to the new state, and repeat the process until a terminal state is reached.
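The loop above can be sketched in Python. The tiny chain-world environment, the function names, and the ε-greedy selection used here are illustrative assumptions, not part of the original text:

```python
import random
from collections import defaultdict

# A tiny chain world used purely for illustration (an assumption, not part
# of the original text): states 0..4 in a line, action 0 moves left,
# action 1 moves right, and reaching state 4 gives reward 1 and ends
# the episode.
N_STATES = 5
ACTIONS = (0, 1)

def step(state, action):
    """Return (next_state, reward, done) for the toy chain world."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)   # step 1: Q[(state, action)], initially all zero
    for _ in range(episodes):
        s, done = 0, False   # step 2: observe the current state
        while not done:
            # step 3: epsilon-greedy action selection (ties broken at random)
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[(s, act)] for act in ACTIONS)
                a = rng.choice([act for act in ACTIONS if Q[(s, act)] == best])
            # step 4: take the action, observe reward and new state
            s2, r, done = step(s, a)
            # step 5: update towards the maximum Q-value of the next state
            best_next = max(Q[(s2, act)] for act in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2           # step 6: repeat until a terminal state
    return Q
```

Note that the update in step 5 always uses the maximum Q-value of the next state, regardless of which action the ε-greedy policy actually takes next; this is what makes Q-Learning off-policy.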
The Sarsa algorithm is an On-Policy algorithm for TD-Learning. The major
difference between it and Q-Learning is that the maximum reward for the
next state is not necessarily used for updating the Q-values. Instead,
a new action, and therefore reward, is selected using the same policy
that determined the original action. The name Sarsa actually comes from
the fact that the updates are done using the quintuple
Q(s, a, r, s', a'), where s and a are the original state and action,
r is the reward observed in the following state, and s' and a' are the
new state-action pair.
The procedural form of the Sarsa algorithm is comparable to that of
Q-Learning, except that two action selection steps are needed: one for the
first state-action pair, and one for determining the next state-action pair
used in the update. The parameters α and γ have the same meaning as they do
in Q-Learning.
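In code, the Sarsa update can be sketched as below; the helper names and the ε-greedy selection function are illustrative assumptions rather than part of the original text:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Select an action: explore with probability epsilon, otherwise act
    greedily with respect to Q (ties broken at random)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q[(state, a)] for a in actions)
    return rng.choice([a for a in actions if Q[(state, a)] == best])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=0.9):
    """One Sarsa update, built from the quintuple (s, a, r, s', a').
    Unlike Q-Learning, it uses Q(s', a') for the action the policy actually
    selected in s', not the maximum Q-value over all actions in s'."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
```

Because both a and a' come from the same ε-greedy policy, an exploratory a' lowers the update target, so the learned values reflect the exploration actually performed; this is what makes Sarsa on-policy.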
To highlight the difference between Q-Learning and Sarsa, an example from the literature will be used: the cliff world shown below.
Test drive the algorithms and Action Selection policies using our applet.