We begin with the common state value function.
Now the optimal policy one obtains depends some very low level details. See the
discussion here
Here we will show a variety of the optimal policies we can obtain using the earlier code.
If we follow the algorithm where we loop from the first action "1" to the last action and
change our policy as we go only if the next policy is better than the previous by
more than 10^(-8) we can obtain the result found in the book.
If we decrese this threshold so that we take any policy that is greater
than the previous by any amount we obtain.
If we reverse the order of the actions when we consider them one at a time and
still take the next policy even if it is only numerically greater
than previous ones (i.e. eps_pol=0) we obtain
If we reverse the order of the policies but now only take policies that are
greater by a threshold amount 10^(-8) we obtain a trivial policy
Various experiments can be performed in this way. If we take a different
number of non-terminal states say an odd number say 1021 we find much more
regular patterns. In that case we have
When the number of non-terminal states is 1022 we find our optimal policy given by.
John Weatherwax