Here we present some very simple results for the racetrack example. We first constructed a sample racetrack similar to the ones presented in the book. The allowable spots where our "car" can be found are shown in the figure below.
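
Below is a minimal sketch, in Python, of how the allowable spots on a small track like this might be encoded as a grid; the layout, symbols, and variable names here are purely illustrative and are not the track actually used to produce these results.

    import numpy as np

    # '#' = off track, '.' = allowable spot, 'S' = start line, 'F' = finish line
    # (this tiny layout is illustrative only)
    rows = [
        "###.FF",
        "##..FF",
        "#....#",
        "#SS..#",
    ]

    # Convert the character map to a numeric grid:
    # 0 = off track, 1 = on track, 2 = start cell, 3 = finish cell.
    codes = {'#': 0, '.': 1, 'S': 2, 'F': 3}
    grid = np.array([[codes[ch] for ch in row] for row in rows])

    # Coordinates of every position the car is allowed to occupy.
    allowed = list(zip(*np.where(grid > 0)))
    print(grid)
    print(len(allowed), "allowable spots")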

We next implement a Monte Carlo learning algorithm to estimate the action value function of a vehicle driving on this track. From that we can extract the greedy policy that maximizes this action value function. As a means of visualizing the computed solution, we average over the possible actions and velocities to obtain a state value function that depends on position only. When we plot this function we obtain the result shown below.
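
To make the procedure concrete, here is a rough, self-contained Python sketch of this type of Monte Carlo control loop, run on a toy track with simplified dynamics (a car that drives off the track is simply returned to the start line). The track, the velocity cap, and the function names (reset, step, run_episode) are my own illustrative choices; this is not the actual code used to produce the results discussed here.

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)

    # Toy track grid: 0 = off track, 1 = on track, 2 = start cell, 3 = finish cell.
    track = np.array([
        [0, 0, 1, 1, 3],
        [0, 1, 1, 1, 3],
        [1, 1, 1, 0, 0],
        [2, 2, 1, 0, 0],
    ])
    start_cells = list(zip(*np.where(track == 2)))
    actions = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]  # accelerations
    MAX_V = 2  # cap on each velocity component, for this toy example

    def reset():
        # Start at a random start cell with zero velocity.
        r, c = start_cells[rng.integers(len(start_cells))]
        return (r, c, 0, 0)  # state = (row, col, row velocity, col velocity)

    def step(state, a):
        r, c, vr, vc = state
        vr = int(np.clip(vr + a[0], -MAX_V, MAX_V))
        vc = int(np.clip(vc + a[1], -MAX_V, MAX_V))
        nr, nc = r + vr, c + vc
        # Leaving the grid or hitting an off-track cell sends the car back to the start.
        if not (0 <= nr < track.shape[0] and 0 <= nc < track.shape[1]) or track[nr, nc] == 0:
            return reset(), -1.0, False
        done = track[nr, nc] == 3  # reached the finish line
        return (nr, nc, vr, vc), -1.0, done

    def run_episode(Q, eps=0.1, max_steps=500):
        # Generate one episode following an epsilon-greedy policy w.r.t. Q.
        traj, s = [], reset()
        for _ in range(max_steps):
            if rng.random() < eps:
                a_idx = int(rng.integers(len(actions)))
            else:
                a_idx = int(np.argmax([Q[(s, i)] for i in range(len(actions))]))
            s2, rwd, done = step(s, actions[a_idx])
            traj.append((s, a_idx, rwd))
            s = s2
            if done:
                break
        return traj

    # Every-visit Monte Carlo control with sample-average updates of Q(s, a).
    Q = defaultdict(float)
    N = defaultdict(int)
    for _ in range(5000):
        G = 0.0
        for s, a_idx, rwd in reversed(run_episode(Q)):
            G += rwd                                   # undiscounted return-to-go
            N[(s, a_idx)] += 1
            Q[(s, a_idx)] += (G - Q[(s, a_idx)]) / N[(s, a_idx)]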

Interpreting the values on this graph as the cost-to-go from the given state to the exit of the racetrack, we see the intuitive fact that states closer to the exit are cheaper while those further from the exit are more expensive. These experiments are not complete, because some of the states may have had very few samples pass through them, and consequently the state value function at those states might not be estimated very accurately. Obviously, additional analysis could be done to assess the quality of our solution.
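
One simple check along these lines, continuing the sketch above (and reusing its Q, N, and track objects), is to collapse the estimated action values over velocities and actions into a position-only value estimate and, at the same time, count how many Monte Carlo samples fell on each position; positions with very small counts are exactly those whose value estimates we should not trust.

    # Visit-weighted average of Q over velocities and actions at each position.
    sums = np.zeros(track.shape)
    counts = np.zeros(track.shape, dtype=int)
    for (s, a_idx), n in N.items():
        r, c = s[0], s[1]
        sums[r, c] += Q[(s, a_idx)] * n
        counts[r, c] += n

    # NaN marks positions that were never sampled at all.
    V_pos = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)

    print("position-only value estimate (NaN = never sampled):")
    print(np.round(V_pos, 1))
    print("sample counts per position (small counts => unreliable estimates):")
    print(counts)
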
John Weatherwax