Here you will find experiments comparing TD(lambda) methods on the Markov chain example with recurring states. We compare two versions of the TD algorithm: the first uses accumulating eligibility traces and the second uses replacing eligibility traces. I believe that when the probability of transitioning from the current state to the state on its right is 0.5, the true solution to this problem (for five states) is V = [0.6, 0.7, 0.8, 0.9, 1.0]. Using this as the target outcome and comparing the two methods as we have done previously, we see that with accumulating eligibility traces the algorithm fails to converge for a wider range of step sizes alpha. The replacing-traces algorithm seems to converge for all lambda in the range [0,1], while the accumulating-traces algorithm converges over a much smaller range. In fact, for larger values of alpha the error with replacing eligibility traces is smaller than the error with accumulating eligibility traces.
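
For reference, here is a minimal sketch (in Python) of the tabular TD(lambda) update with the two trace types discussed above. The five-state chain used below (move right with probability 0.5, otherwise recur in the same state, reward 1 on exiting the last state) is only a hypothetical stand-in for the actual experiment, and the names td_lambda, trace_type, and p_right are illustrative rather than taken from the original code.

    import numpy as np

    def td_lambda(episodes, alpha, lam, trace_type="accumulating",
                  n_states=5, p_right=0.5, seed=0):
        """Tabular TD(lambda) with accumulating or replacing traces.

        The environment is a hypothetical stand-in: from state s the agent
        moves to s+1 with probability p_right and otherwise stays in s
        (a recurring state); reward 1 is given on stepping past the last
        state, which ends the episode.  gamma is taken to be 1.
        """
        rng = np.random.default_rng(seed)
        V = np.zeros(n_states)
        for _ in range(episodes):
            e = np.zeros(n_states)          # eligibility traces, reset each episode
            s = 0
            while True:
                if rng.random() < p_right:  # move right
                    s_next = s + 1
                    done = (s_next == n_states)
                    r = 1.0 if done else 0.0
                else:                       # recur in the same state
                    s_next, done, r = s, False, 0.0

                # trace update for the current state
                if trace_type == "accumulating":
                    e[s] += 1.0
                else:                       # replacing traces
                    e[s] = 1.0

                v_next = 0.0 if done else V[s_next]
                delta = r + v_next - V[s]   # TD error (gamma = 1)
                V += alpha * delta * e      # update all states through their traces
                e *= lam                    # decay traces (gamma * lambda, gamma = 1)

                if done:
                    break
                s = s_next
        return V

    if __name__ == "__main__":
        for trace in ("accumulating", "replacing"):
            V = td_lambda(episodes=2000, alpha=0.05, lam=0.9, trace_type=trace)
            print(trace, np.round(V, 3))

The only difference between the two methods is the single line updating e[s]: accumulating traces add 1 each visit (so a state revisited before its trace decays can build up a trace larger than 1), while replacing traces reset it to 1, which is what makes them better behaved on chains with recurring states.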


John Weatherwax
Last modified: Sun May 15 08:46:34 EDT 2005