Description
Dynamic Programming
1 Question
1.1
Finding the shortest path to state 14 corresponds to a deterministic policy. The step reward $r_s$ has to be negative to encode the shortest-path objective, while not being so low that the agent prefers to end the episode early in the red terminal state 1. Precisely, $r_s$ must satisfy $-10 < r_s < 0$.
Let us set $r_s = -1$. Since the transitions are deterministic, the optimal policy is indeed the shortest path to state 14. For this reward, the value function of the optimal policy at each state is the following.
 5  -10    7    6
 6    7    8    5
 7    8    9    4
 8    9   10    3
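These values can be recovered numerically with value iteration. The sketch below assumes an open 4×4 grid in row-major order (state 0 top-left), terminal rewards of $+10$ at state 14 and $-10$ at state 1, $r_s = -1$, and no discounting; the wall layout of the actual maze is an assumption not reproduced here, so some entries (e.g. the rightmost column) may differ from the table above.

```python
import numpy as np

# Value iteration on a 4x4 deterministic grid (states 0..15, row-major).
# Assumptions (the exact maze of the question may differ): no walls,
# state 14 is the goal terminal (value +10), state 1 the red terminal
# (value -10), each non-terminal move costs r_s = -1, gamma = 1.
R_STEP = -1.0
GAMMA = 1.0
TERMINAL = {14: 10.0, 1: -10.0}

def neighbors(s):
    """Yield the states reachable from s by one up/down/left/right move."""
    r, c = divmod(s, 4)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 4 and 0 <= nc < 4:
            yield nr * 4 + nc

def value_iteration(tol=1e-9):
    V = np.zeros(16)
    for s, r in TERMINAL.items():
        V[s] = r  # terminal values stay fixed
    while True:
        delta = 0.0
        for s in range(16):
            if s in TERMINAL:
                continue
            best = max(R_STEP + GAMMA * V[n] for n in neighbors(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration()
print(V.reshape(4, 4))
```

On this open grid the agent routes around state 1 (entering it costs $-1 - 10 = -11$), and each value equals $10$ minus the shortest-path length to state 14.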
1.2
In a general MDP, a policy $\pi$ induces a value function $V_1^\pi$ for a reward signal $r_1(s, a)$. Let us apply an affine transformation to the reward $r_1$ of the form $r_2(s, a) = \alpha r_1(s, a) + \beta$ with $(\alpha, \beta) \in \mathbb{R}^2$. For the same policy $\pi$, the new value function $V_2^\pi$ is then:
\begin{align*}
V_2^\pi(s) &= \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t r_2(s_t, d_t(h_t)) \,\middle|\, s_0 = s; \pi\right] \\
&= \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t \big(\alpha r_1(s_t, d_t(h_t)) + \beta\big) \,\middle|\, s_0 = s; \pi\right] \\
&= \alpha\, \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t r_1(s_t, d_t(h_t)) \,\middle|\, s_0 = s; \pi\right] + \sum_{t=0}^{+\infty} \gamma^t \beta
\end{align*}
Therefore, after an affine transformation of the reward signal, the new value function becomes:
$$V_2^\pi(s) = \alpha V_1^\pi(s) + \frac{\beta}{1 - \gamma}$$
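This identity is easy to verify numerically. The sketch below evaluates a fixed policy on a random 5-state MDP by solving the Bellman linear system $V = r + \gamma P_\pi V$ for both reward signals; the MDP, policy, and constants are illustrative assumptions, not taken from the exercise.

```python
import numpy as np

# Numerical check of V_2 = alpha * V_1 + beta / (1 - gamma) for a fixed
# policy pi on a randomly generated 5-state MDP (illustrative values).
rng = np.random.default_rng(0)
n, gamma = 5, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)   # row-stochastic P_pi under policy pi
r1 = rng.normal(size=n)             # reward r_1 collected under pi
alpha, beta = 2.5, -0.7
r2 = alpha * r1 + beta              # affine-transformed reward

def policy_value(r):
    # Solve V = r + gamma P V  <=>  (I - gamma P) V = r
    return np.linalg.solve(np.eye(n) - gamma * P, r)

V1, V2 = policy_value(r1), policy_value(r2)
print(np.max(np.abs(V2 - (alpha * V1 + beta / (1 - gamma)))))  # ~0 (float error)
```

The constant offset $\beta/(1-\gamma)$ is the discounted sum of the added $\beta$ received at every step, which is why it does not depend on the policy.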
Let $\pi_1^*$ be the optimal policy corresponding to reward $r_1$ and $\pi_2^*$ the optimal policy after the affine