Here you will find experiments and results obtained by running the SARSA and Q-learning
algorithms on the cliff-walking problem from the book. The next plot shows the policy
learned by the SARSA (on-policy) algorithm on this problem after ten thousand episodes. Here
the arrows represent the greedy action to take at each state.
Note that this algorithm finds the "safe" path or the one that walks as far from the cliff as possible.
The corresponding state value function is shown in the next plot.
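For readers who want to experiment themselves, below is a minimal sketch of tabular SARSA on this problem. It assumes the standard 4 x 12 grid from the book, and its parameter values and helper names (step, eps_greedy) are illustrative choices rather than the exact code used to produce the plots described here.

    # A minimal sketch of tabular SARSA on the cliff-walk grid (assumed
    # 4 x 12 layout from the book; parameters are illustrative, not the
    # exact values used for the plots above).
    import numpy as np

    ROWS, COLS = 4, 12
    START, GOAL = (3, 0), (3, 11)
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def step(state, action):
        # Apply an action; stepping into the cliff gives -100 and resets to START.
        r = min(max(state[0] + ACTIONS[action][0], 0), ROWS - 1)
        c = min(max(state[1] + ACTIONS[action][1], 0), COLS - 1)
        if r == 3 and 0 < c < 11:  # the cliff cells
            return START, -100.0, False
        return (r, c), -1.0, (r, c) == GOAL

    def eps_greedy(Q, state, eps, rng):
        # Epsilon-greedy action selection from the tabular action values.
        if rng.random() < eps:
            return int(rng.integers(len(ACTIONS)))
        return int(np.argmax(Q[state]))

    def sarsa(episodes=10_000, alpha=0.5, eps=0.1, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        Q = np.zeros((ROWS, COLS, len(ACTIONS)))
        for _ in range(episodes):
            s = START
            a = eps_greedy(Q, s, eps, rng)
            done = False
            while not done:
                s2, r, done = step(s, a)
                a2 = eps_greedy(Q, s2, eps, rng)
                # On-policy update: bootstrap from the action actually taken next.
                Q[s][a] += alpha * (r + gamma * Q[s2][a2] * (not done) - Q[s][a])
                s, a = s2, a2
        return Q

    if __name__ == "__main__":
        Q = sarsa()
        arrows = np.array(["^", "v", "<", ">"])
        print(arrows[np.argmax(Q, axis=-1)])  # greedy-policy arrows, one per cell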
The next plot shows the policy learned by the Q-learning (off-policy)
algorithm on this problem after ten million
episodes. Again the arrows represent the greedy action to take
at each state.
Note that this algorithm finds the quickest path, i.e. the one that walks as close to the cliff as possible.
The corresponding state value function is shown in the next plot.
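The only change needed to turn the SARSA sketch above into Q-learning is the bootstrap term, as the sketch below illustrates. It reuses the environment helpers and constants (step, eps_greedy, START, ROWS, COLS, ACTIONS) from the SARSA sketch, and its parameters are again illustrative.

    # Q-learning differs from SARSA only in its bootstrap term: it backs up
    # the value of the greedy next action rather than the action the agent
    # actually takes.  Reuses the helpers from the SARSA sketch above.
    import numpy as np

    def q_learning(episodes=10_000, alpha=0.5, eps=0.1, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        Q = np.zeros((ROWS, COLS, len(ACTIONS)))
        for _ in range(episodes):
            s = START
            done = False
            while not done:
                a = eps_greedy(Q, s, eps, rng)
                s2, r, done = step(s, a)
                # Off-policy update: bootstrap from the best next action,
                # regardless of what the behavior policy will do next.
                Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s][a])
                s = s2
        return Q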
These match the corresponding results presented in the book quite well. It is interesting
to note that the Q-learning policy can look somewhat strange in regions of the grid that are
rarely visited. There the statistics are quite poor, which can produce correspondingly poor
action-value estimates. Since these states are rarely visited during our episodes, however,
the poor estimates there are of little consequence.
If we compute the average reward per episode under each algorithm over many episodes, we obtain the following plots.
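One way such a comparison can be generated is sketched below: run both update rules using the environment helpers defined earlier, record each episode's total reward, and plot smoothed curves. The episode count and smoothing window are arbitrary choices for illustration.

    # Run both update rules with the environment helpers defined above,
    # record each episode's total reward, and plot smoothed curves.
    import numpy as np
    import matplotlib.pyplot as plt

    def run(update, episodes=500, alpha=0.5, eps=0.1, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        Q = np.zeros((ROWS, COLS, len(ACTIONS)))
        returns = np.zeros(episodes)
        for ep in range(episodes):
            s, done, total = START, False, 0.0
            a = eps_greedy(Q, s, eps, rng)
            while not done:
                s2, r, done = step(s, a)
                a2 = eps_greedy(Q, s2, eps, rng)
                target = update(Q, s2, a2) * (not done)
                Q[s][a] += alpha * (r + gamma * target - Q[s][a])
                s, a, total = s2, a2, total + r
            returns[ep] = total
        return returns

    def smooth(x, w=10):
        return np.convolve(x, np.ones(w) / w, mode="valid")

    sarsa_rewards = run(lambda Q, s2, a2: Q[s2][a2])   # on-policy target
    q_rewards = run(lambda Q, s2, a2: np.max(Q[s2]))   # off-policy target
    plt.plot(smooth(sarsa_rewards), label="SARSA")
    plt.plot(smooth(q_rewards), label="Q-learning")
    plt.xlabel("episode")
    plt.ylabel("reward per episode (smoothed)")
    plt.legend()
    plt.show()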
John Weatherwax
Last modified: Sun May 15 08:46:34 EDT 2005