COMP9444 Neural Networks and Deep Learning
Term 3, 2019

Exercises 6: Reinforcement Learning


Consider an environment with two states S = {S1, S2} and two actions A = {a1, a2}, where the (deterministic) transitions δ and reward R for each state and action are as follows:
δ(S1, a1) = S1,   R(S1, a1) = +1
δ(S1, a2) = S2,   R(S1, a2) = -2
δ(S2, a1) = S1,   R(S2, a1) = +7
δ(S2, a2) = S2,   R(S2, a2) = +3
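If you want to check your hand-worked answers mechanically, the deterministic transition and reward functions can be written down directly in code. Here is a minimal sketch in Python (the names delta and reward are just illustrative, not part of the exercise):

    # Deterministic environment: delta maps (state, action) to the next state,
    # and reward maps (state, action) to the immediate reward.
    delta  = {('S1', 'a1'): 'S1', ('S1', 'a2'): 'S2',
              ('S2', 'a1'): 'S1', ('S2', 'a2'): 'S2'}
    reward = {('S1', 'a1'):  1,   ('S1', 'a2'): -2,
              ('S2', 'a1'):  7,   ('S2', 'a2'):  3}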
  1. Draw a picture of this environment, using circles for the states and arrows for the transitions.

  2. Assuming a discount factor of γ = 0.7, determine:
    1. the optimal policy π* : S → A
    2. the value function V* : S → R
    3. the "Q" function Q* : S × A → R

    Write the Q values in a matrix like this:

    Q    a1   a2
    S1
    S2

    Trace through the first few steps of the Q-learning algorithm, with a learning rate of 1 and with all Q values initially set to zero. Explain why it is necessary to force exploration through probabilistic choice of actions in order to ensure convergence to the true Q values. (A code sketch of this update rule appears after the exercises below.)

  3. Now let's consider how the value function changes as the discount factor γ varies between 0 and 1.
    There are four deterministic policies for this environment, which can be written as π11, π12, π21 and π22, where πij(S1) = ai and πij(S2) = aj.

    1. Calculate the value function Vπ : S → R for each of these four policies (keeping γ as a variable)
    2. Determine the range of values of γ for which each of the policies π11, π12, π21, π22 is optimal. (A numerical check for these value functions is sketched after the exercises below.)
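For tracing the Q-learning updates in Question 2, recall that in a deterministic environment with learning rate 1 the update is simply Q(s, a) ← r + γ max_a' Q(s', a'), where s' = δ(s, a) and r = R(s, a). The following sketch uses the delta and reward tables above, with a purely random behaviour policy (just one possible way of forcing exploration):

    import random

    gamma = 0.7
    states, actions = ('S1', 'S2'), ('a1', 'a2')
    Q = {(s, a): 0.0 for s in states for a in actions}

    s = 'S1'
    for step in range(20):
        a = random.choice(actions)            # forced exploration: actions chosen probabilistically
        s_next, r = delta[(s, a)], reward[(s, a)]
        # deterministic Q-learning update with learning rate 1
        Q[(s, a)] = r + gamma * max(Q[(s_next, b)] for b in actions)
        s = s_next

For the explanation part, consider what would happen if actions were instead always chosen greedily from the initial all-zero table: some state-action pairs might never be visited, so their Q values would never be updated.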


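For Question 3, one way to check hand-derived expressions for Vπ is to evaluate each fixed policy numerically at a few values of γ, by repeatedly applying the Bellman equation Vπ(s) = R(s, π(s)) + γ Vπ(δ(s, π(s))). A minimal sketch, again assuming the delta and reward tables above:

    def evaluate_policy(policy, gamma, sweeps=1000):
        # policy maps each state to an action, e.g. {'S1': 'a1', 'S2': 'a2'}
        V = {'S1': 0.0, 'S2': 0.0}
        for _ in range(sweeps):                  # iterative policy evaluation
            for s in ('S1', 'S2'):
                a = policy[s]
                V[s] = reward[(s, a)] + gamma * V[delta[(s, a)]]
        return V

    # evaluate all four deterministic policies at a chosen discount factor
    for i in ('a1', 'a2'):
        for j in ('a1', 'a2'):
            print({'S1': i, 'S2': j}, evaluate_policy({'S1': i, 'S2': j}, gamma=0.7))

Comparing the printed values for several settings of gamma against your closed-form expressions is a quick way to see where the optimal policy switches.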
Make sure you try answering the Exercises yourself before checking the Sample Solutions.