COMP9444 Neural Networks and Deep Learning
Term 3, 2019

Exercises 6: Reinforcement Learning


Consider an environment with two states S = {S1, S2} and two actions A = {a1, a2}, where the (deterministic) transitions δ and reward R for each state and action are as follows:
δ(S1, a1) = S1,   R(S1, a1) = +1
δ(S1, a2) = S2,   R(S1, a2) = -2
δ(S2, a1) = S1,   R(S2, a1) = +7
δ(S2, a2) = S2,   R(S2, a2) = +3
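If you want to check your hand-worked answers mechanically, the deterministic transition and reward functions can be written down directly in code. Here is a minimal sketch in Python (the names delta and reward are just illustrative, not part of the exercise):

    # Deterministic environment: delta maps (state, action) to the next state,
    # and reward maps (state, action) to the immediate reward.
    delta  = {('S1', 'a1'): 'S1', ('S1', 'a2'): 'S2',
              ('S2', 'a1'): 'S1', ('S2', 'a2'): 'S2'}
    reward = {('S1', 'a1'):  1,   ('S1', 'a2'): -2,
              ('S2', 'a1'):  7,   ('S2', 'a2'):  3}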
  1. Draw a picture of this environment, using circles for the states and arrows for the transitions.

  2. Assuming a discount factor of γ = 0.7, determine:
    1. the optimal policy π* : S → A
    2. the value function V* : S → R
    3. the "Q" function Q* : S × A → R

    Write the Q values in a matrix like this:

    Q    a1   a2
    S1
    S2

    Trace through the first few steps of the Q-learning algorithm, with a learning rate of 1 and with all Q values initially set to zero. Explain why it is necessary to force exploration through probabilistic choice of actions in order to ensure convergence to the true Q values. (A code sketch of this update rule appears after the exercises below.)

  3. Now let's consider how the value function changes as the discount factor γ varies between 0 and 1.
    There are four deterministic policies for this environment, which can be written as π11, π12, π21 and π22, where πij(S1) = ai and πij(S2) = aj.

    1. Calculate the value function Vπ : S → R for each of these four policies (keeping γ as a variable)
    2. Determine the range of values of γ for which each of the policies π11, π12, π21, π22 is optimal. (A numerical check for these value functions is sketched after the exercises below.)
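For tracing the Q-learning updates in Question 2, recall that in a deterministic environment with learning rate 1 the update is simply Q(s, a) ← r + γ max_a' Q(s', a'), where s' = δ(s, a) and r = R(s, a). The following sketch uses the delta and reward tables above, with a purely random behaviour policy (just one possible way of forcing exploration):

    import random

    gamma = 0.7
    states, actions = ('S1', 'S2'), ('a1', 'a2')
    Q = {(s, a): 0.0 for s in states for a in actions}

    s = 'S1'
    for step in range(20):
        a = random.choice(actions)            # forced exploration: actions chosen probabilistically
        s_next, r = delta[(s, a)], reward[(s, a)]
        # deterministic Q-learning update with learning rate 1
        Q[(s, a)] = r + gamma * max(Q[(s_next, b)] for b in actions)
        s = s_next

For the explanation part, consider what would happen if actions were instead always chosen greedily from the initial all-zero table: some state-action pairs might never be visited, so their Q values would never be updated.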


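For Question 3, one way to check hand-derived expressions for Vπ is to evaluate each fixed policy numerically at a few values of γ, by repeatedly applying the Bellman equation Vπ(s) = R(s, π(s)) + γ Vπ(δ(s, π(s))). A minimal sketch, again assuming the delta and reward tables above:

    def evaluate_policy(policy, gamma, sweeps=1000):
        # policy maps each state to an action, e.g. {'S1': 'a1', 'S2': 'a2'}
        V = {'S1': 0.0, 'S2': 0.0}
        for _ in range(sweeps):                  # iterative policy evaluation
            for s in ('S1', 'S2'):
                a = policy[s]
                V[s] = reward[(s, a)] + gamma * V[delta[(s, a)]]
        return V

    # evaluate all four deterministic policies at a chosen discount factor
    for i in ('a1', 'a2'):
        for j in ('a1', 'a2'):
            print({'S1': i, 'S2': j}, evaluate_policy({'S1': i, 'S2': j}, gamma=0.7))

Comparing the printed values for several settings of gamma against your closed-form expressions is a quick way to see where the optimal policy switches.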
Make sure you try answering the Exercises yourself before checking the Sample Solutions.