Mountain Car

Introduction

The mountain car is not powerful enough to climb the hill directly; instead, the agent must learn to build momentum by driving back and forth between the surrounding slopes. You can read more about this environment in the Gymnasium documentation. Below is an example of running a random (untrained) agent in this environment. The episode takes quite a long time to terminate.
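As a point of reference, the following is a minimal sketch of running a random agent with the Gymnasium API directly. It is not the RLAI agent or CLI; it only illustrates the untrained baseline. Note that the stock MountainCar-v0 registration truncates episodes after 200 steps, whereas the RLAI training described below lets episodes run to completion.

# Minimal sketch of the untrained baseline: a random policy in MountainCar-v0,
# using the Gymnasium API directly (not the RLAI agent).  Opens a render window.
import gymnasium as gym

env = gym.make("MountainCar-v0", render_mode="human")
observation, info = env.reset(seed=0)

terminated = truncated = False
total_reward = 0.0

while not (terminated or truncated):
    action = env.action_space.sample()  # random action: push left, no push, or push right
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

print(f"episode return: {total_reward}")
env.close()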

Training

Train a control agent for the mountain car environment with the following command.

rlai train \
    --agent "rlai.gpi.state_action_value.ActionValueMdpAgent" \
    --continuous-state-discretization-resolution 0.005 \
    --gamma 0.95 \
    --environment "rlai.core.environments.gymnasium.Gym" \
    --gym-id "MountainCar-v0" \
    --render-every-nth-episode 1000 \
    --video-directory "~/Desktop/mountaincar_videos" \
    --train-function "rlai.gpi.temporal_difference.iteration.iterate_value_q_pi" \
    --mode "Q_LEARNING" \
    --num-improvements 10000 \
    --num-episodes-per-improvement 1 \
    --epsilon 0.01 \
    --make-final-policy-greedy True \
    --num-improvements-per-plot 100 \
    --num-improvements-per-checkpoint 100 \
    --checkpoint-path "~/Desktop/mountaincar_checkpoint.pickle" \
    --save-agent-path "~/Desktop/mountaincar_agent.pickle"
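To make the configuration concrete, here is a minimal sketch of the core idea behind that command: epsilon-greedy tabular Q-learning over a discretized version of the continuous state space, with the same discretization resolution (0.005), discount (0.95), and exploration rate (0.01). It uses the Gymnasium API directly and makes no assumptions about RLAI's internals; the step size ALPHA is an assumption for illustration and does not correspond to an argument above, and the stock 200-step time limit still applies here.

# Hedged illustration of the training setup above: epsilon-greedy tabular
# Q-learning over a discretized continuous state space.  This mirrors the
# resolution (0.005), gamma (0.95), and epsilon (0.01) passed to rlai, but it
# is only a sketch of the idea, not RLAI's implementation.
from collections import defaultdict
import numpy as np
import gymnasium as gym

RESOLUTION = 0.005   # --continuous-state-discretization-resolution
GAMMA = 0.95         # --gamma
EPSILON = 0.01       # --epsilon
ALPHA = 0.1          # step size (assumed for this sketch; not an rlai argument above)

env = gym.make("MountainCar-v0")
n_actions = env.action_space.n
q = defaultdict(lambda: np.zeros(n_actions))  # discretized state -> action values

def discretize(observation):
    # Map each continuous state dimension (position, velocity) to a bucket index.
    return tuple((observation / RESOLUTION).astype(int))

for episode in range(10_000):  # --num-improvements 10000, one episode per improvement
    state, _ = env.reset()
    s = discretize(state)
    terminated = truncated = False
    while not (terminated or truncated):
        # Epsilon-greedy action selection.
        if np.random.random() < EPSILON:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(q[s]))
        next_state, reward, terminated, truncated, _ = env.step(a)
        s_next = discretize(next_state)
        # Q-learning target: bootstrap from the greedy action in the next state.
        target = reward + (0.0 if terminated else GAMMA * np.max(q[s_next]))
        q[s][a] += ALPHA * (target - q[s][a])
        s = s_next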

A few of these arguments are worth noting.

Note that, unlike other tasks such as the inverted pendulum, no value is passed for --T (the maximum number of time steps per episode). This is because there is no way to predict how long an episode will last, particularly early in training, and each episode must be permitted to run until the goal is reached in order to learn a useful policy. The training progression is shown below.

[Figure: training progression for the mountain car agent]

In the left sub-figure above, the left y-axis shows the negation of the time taken to reach the goal, the right y-axis shows the size of the discretized state space, and the x-axis shows improvement iterations for the agent's policy. The right sub-figure shows the same reward values plotted against wallclock training time. Based on the learning trajectory, it appears that continued training would yield little further improvement; as shown below, the results are quite satisfactory after 30 minutes of wallclock training time.

Results

The video below shows the trained agent controlling the car. Note how the agent rocks the car back and forth, building momentum until it can climb the hill and reach the goal.
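For intuition about why this oscillation works, the hand-coded heuristic below (always accelerate in the direction the car is already moving) also solves the task by pumping energy into the car on each swing. It is not the learned policy and is not part of RLAI; it only illustrates the strategy the agent appears to discover.

# Hand-coded energy-pumping heuristic for MountainCar-v0: always push in the
# direction the car is currently moving.  This is NOT the trained rlai agent's
# policy; it only illustrates the oscillating strategy described above.
import gymnasium as gym

env = gym.make("MountainCar-v0", render_mode="human")
observation, _ = env.reset(seed=0)

terminated = truncated = False
while not (terminated or truncated):
    position, velocity = observation
    action = 2 if velocity > 0 else 0  # 2 = push right, 0 = push left
    observation, reward, terminated, truncated, _ = env.step(action)

env.close()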