Here you will find experiments and results obtained from running the SARSA and Q-learning algorithms on the cliff walk problem from the book. The next plot shows the policy learned by the on-policy SARSA algorithm on this problem after ten thousand episodes. The arrows represent the greedy action to take at each point.

Note that this algorithm finds the "safe" path, i.e. the one that stays as far from the cliff as possible. The corresponding state value function is shown in the next plot.

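For reference, SARSA uses the on-policy tabular update Q(s,a) <- Q(s,a) + alpha*( r + gamma*Q(s',a') - Q(s,a) ), where a' is the action the epsilon-greedy policy actually selects in the next state s'. Below is a minimal Python sketch of this experiment. The grid layout follows the book, but the constants (alpha = 0.1, gamma = 1, epsilon = 0.1) and all helper names are my own assumptions, not taken from the code that produced the plots above.

import numpy as np

# Cliff-walk gridworld (4 rows x 12 columns) as in the book.
# Start is bottom-left, goal is bottom-right, the cliff lies between them.
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Take one action; return (next_state, reward)."""
    r = min(max(state[0] + action[0], 0), ROWS - 1)
    c = min(max(state[1] + action[1], 0), COLS - 1)
    if r == 3 and 1 <= c <= 10:                 # fell off the cliff: back to start
        return START, -100.0
    return (r, c), -1.0

def eps_greedy(Q, state, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(Q[state]))

def sarsa(episodes=10_000, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((ROWS, COLS, len(ACTIONS)))
    for _ in range(episodes):
        s = START
        a = eps_greedy(Q, s, eps, rng)
        while s != GOAL:
            s2, r = step(s, ACTIONS[a])
            a2 = eps_greedy(Q, s2, eps, rng)    # on-policy: next action from the same policy
            Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])
            s, a = s2, a2
    return Q

Q = sarsa()
# Greedy policy arrows, one per grid cell, analogous to the plot above.
arrows = np.array(['^', 'v', '<', '>'])[np.argmax(Q, axis=2)]
print('\n'.join(' '.join(row) for row in arrows))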
The next plot shows the policy learned by the off-policy Q-learning algorithm on this problem after ten million episodes. Again the arrows represent the greedy action to take at each point.

Note that this algorithm finds the quickest path, i.e. the one that runs as close to the cliff as possible. The corresponding state value function is shown in the next plot.

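Q-learning differs from SARSA only in its backup target: it bootstraps from max_a Q(s',a), the greedy value of the next state, rather than from the action the behavior policy actually takes next, and this is what pushes its greedy policy toward the cliff edge. A self-contained sketch with the Q-learning update is below; as before, the constants and helper names are assumptions, and the default episode count is kept far smaller than the ten million used above simply so the example runs quickly.

import numpy as np

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    r = min(max(state[0] + action[0], 0), ROWS - 1)
    c = min(max(state[1] + action[1], 0), COLS - 1)
    if r == 3 and 1 <= c <= 10:                 # cliff: back to start, big penalty
        return START, -100.0
    return (r, c), -1.0

def q_learning(episodes=10_000, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((ROWS, COLS, len(ACTIONS)))
    for _ in range(episodes):
        s = START
        while s != GOAL:
            if rng.random() < eps:              # epsilon-greedy behavior policy
                a = int(rng.integers(len(ACTIONS)))
            else:
                a = int(np.argmax(Q[s]))
            s2, r = step(s, ACTIONS[a])
            # Off-policy target: bootstrap from the best next action,
            # not the action the behavior policy will actually take.
            Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s][a])
            s = s2
    return Q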
These results match the corresponding ones presented in the book quite well. It is interesting to note that the Q-learning policy can look somewhat strange in regions that are rarely visited. There the statistics are quite poor, which results in correspondingly poor action-value estimates. Since our episodes rarely visit these states, these poor estimates are of no consequence. If we compute the average reward per episode under each algorithm for many episodes we get the following plots.


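The raw total reward per episode is quite noisy, so curves like these are usually smoothed before plotting. Below is a small hypothetical helper that does this with a trailing moving average; the window length of 100 episodes is an arbitrary choice, and the real reward traces would come from summing the rewards returned by step() within each training episode.

import numpy as np

def running_average(episode_rewards, window=100):
    # Trailing moving average: entry i averages the last `window` episodes up to i.
    rewards = np.asarray(episode_rewards, dtype=float)
    smoothed = np.empty_like(rewards)
    for i in range(len(rewards)):
        smoothed[i] = rewards[max(0, i - window + 1): i + 1].mean()
    return smoothed

# Synthetic trace purely to demonstrate the call; these are not real results.
episode_rewards = -50.0 + 10.0 * np.random.default_rng(0).standard_normal(500)
print(running_average(episode_rewards)[:5])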
John Weatherwax
Last modified: Sun May 15 08:46:34 EDT 2005