δ(S1, a1) = S1, R(S1, a1) = +1
δ(S1, a2) = S2, R(S1, a2) = -2
δ(S2, a1) = S1, R(S2, a1) = +7
δ(S2, a2) = S2, R(S2, a2) = +3
Write the Q values in a matrix like this:
Q | a1 | a2 |
---|---|---|
S1 | | |
S2 | | |
Trace through the first few steps of the Q-learning algorithm, with a learning rate of 1 and with all Q values initially set to zero. Explain why it is necessary to force exploration through probabilistic choice of actions, in order to ensure convergence to the true Q values.
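As a way to check a hand trace, the update rule can be sketched in a few lines of Python. This is only an illustrative sketch, not a model answer. The discount factor is not stated in this part of the question, so `GAMMA = 0.9` is an assumption and should be replaced by the value the question specifies. The helper names (`q_update`, `greedy_trace`, `sweep_all_pairs`) and the tie-breaking rule (ties go to a1) are also hypothetical choices made for the example.

```python
# Deterministic world from the exercise: (state, action) -> (next state, reward).
STATES = ("S1", "S2")
ACTIONS = ("a1", "a2")
DELTA = {
    ("S1", "a1"): ("S1", 1),
    ("S1", "a2"): ("S2", -2),
    ("S2", "a1"): ("S1", 7),
    ("S2", "a2"): ("S2", 3),
}

GAMMA = 0.9  # ASSUMPTION: the discount factor is not given in this excerpt.


def new_q():
    """Return a Q table with every entry initialised to zero."""
    return {(s, a): 0.0 for s in STATES for a in ACTIONS}


def q_update(Q, s, a, gamma=GAMMA):
    """Apply Q(s,a) <- r + gamma * max_b Q(s',b), i.e. learning rate 1."""
    s_next, r = DELTA[(s, a)]
    Q[(s, a)] = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    return s_next


def greedy_trace(steps, start="S1", gamma=GAMMA):
    """Run Q-learning with a purely greedy policy (no exploration).

    Ties are broken in favour of a1, because max() returns the first
    maximal item. This tie-breaking rule is an assumption of the sketch.
    """
    Q = new_q()
    s = start
    history = []
    for _ in range(steps):
        a = max(ACTIONS, key=lambda b: Q[(s, b)])
        s_next = q_update(Q, s, a, gamma)
        history.append((s, a, Q[(s, a)]))
        s = s_next
    return Q, history


def sweep_all_pairs(sweeps, gamma=GAMMA):
    """Update every (state, action) pair repeatedly, as full exploration would."""
    Q = new_q()
    for _ in range(sweeps):
        for s in STATES:
            for a in ACTIONS:
                q_update(Q, s, a, gamma)
    return Q


if __name__ == "__main__":
    Q, history = greedy_trace(3)
    for s, a, value in history:
        print(f"in {s} take {a}: Q({s},{a}) = {value:.2f}")
    print(sweep_all_pairs(500))
```

Under these assumptions the greedy agent picks a1 in S1 on the first step, receives +1, and from then on always prefers a1 because its Q value is the only positive entry. It never leaves S1, so Q(S1, a2) and the whole S2 row stay at zero, and Q(S1, a1) settles at a value below what `sweep_all_pairs` reports for the same entry. Comparing the two tables is one way to see why every state-action pair has to be tried repeatedly for the estimates to converge to the true Q values.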