
Acrobot

Introduction

The acrobot is a two-joint, two-link mechanism, and the goal is to swing the free end of the mechanism up to the horizontal height marker. You can read more about this environment in the Gymnasium documentation. Below is an example of running a random (untrained) agent in this environment; the episode takes quite a long time to terminate.
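
A minimal sketch of such a random rollout, written directly against the Gymnasium API rather than through rlai, is shown below. The environment id Acrobot-v1 matches the one used by the training command in the next section; the seed and the step/return bookkeeping are illustrative choices.

# Random (untrained) agent on Acrobot-v1 via the Gymnasium API.
import gymnasium as gym

env = gym.make("Acrobot-v1", render_mode="human")
observation, info = env.reset(seed=0)

terminated = truncated = False
total_reward = 0.0
steps = 0
while not (terminated or truncated):
    action = env.action_space.sample()  # random torque: -1, 0, or +1
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    steps += 1

print(f"Episode finished after {steps} steps with return {total_reward}.")
env.close()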

Training

Train a control agent for the acrobot environment with the following command.

rlai train \
    --agent "rlai.gpi.state_action_value.ActionValueMdpAgent" \
    --continuous-state-discretization-resolution 0.5 \
    --gamma 0.9 \
    --environment "rlai.core.environments.gymnasium.Gym" \
    --gym-id "Acrobot-v1" \
    --render-every-nth-episode 1000 \
    --video-directory "~/Desktop/acrobat_videos" \
    --train-function "rlai.gpi.temporal_difference.iteration.iterate_value_q_pi" \
    --mode "Q_LEARNING" \
    --num-improvements 10000 \
    --num-episodes-per-improvement 10 \
    --epsilon 0.05 \
    --make-final-policy-greedy True \
    --num-improvements-per-plot 100 \
    --num-improvements-per-checkpoint 1000 \
    --checkpoint-path "~/Desktop/acrobat_checkpoint.pickle" \
    --save-agent-path "~/Desktop/acrobat_agent.pickle"
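
Conceptually, this command combines tabular Q-learning with ε-greedy exploration (--epsilon 0.05), a discount of γ = 0.9 (--gamma), and a uniform discretization of the continuous state space into cells of width 0.5 (--continuous-state-discretization-resolution). The sketch below illustrates that idea only; it is not rlai's implementation, and the discretize helper, the step size ALPHA, and the episode count are assumptions made for illustration.

# Illustrative epsilon-greedy Q-learning with uniform state discretization
# on Acrobot-v1. Not rlai's implementation; it only shows the idea behind
# the flags used above.
from collections import defaultdict

import gymnasium as gym
import numpy as np

RESOLUTION = 0.5  # --continuous-state-discretization-resolution
GAMMA = 0.9       # --gamma
EPSILON = 0.05    # --epsilon
ALPHA = 0.1       # step size (assumed here; rlai manages this internally)

env = gym.make("Acrobot-v1")
n_actions = env.action_space.n
Q = defaultdict(lambda: np.zeros(n_actions))  # state cell -> action values

def discretize(observation):
    # Map each continuous state dimension onto a grid cell of width RESOLUTION.
    return tuple(np.floor(observation / RESOLUTION).astype(int))

for episode in range(1000):  # far fewer episodes than the command above runs
    observation, _ = env.reset()
    s = discretize(observation)
    terminated = truncated = False
    while not (terminated or truncated):
        # Epsilon-greedy action selection.
        if np.random.random() < EPSILON:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        observation, reward, terminated, truncated, _ = env.step(a)
        s_next = discretize(observation)
        # Q-learning update: bootstrap from the greedy value of the next state.
        target = reward + (0.0 if terminated else GAMMA * np.max(Q[s_next]))
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s_next

env.close()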

Note that, unlike other tasks such as the inverted pendulum, no value is passed for --T (the maximum number of time steps per episode). This is because there is no way to predict how long an episode will last, particularly for episodes early in training, and all episodes must be permitted to run until success in order to learn a useful policy. The training progression is shown below.

Figure: Acrobot training progression.

In the left sub-figure above, the left y-axis shows the negation of the time taken to reach the goal (Acrobot-v1 issues a reward of -1 per step until the goal is reached, so an episode's return is the negated number of steps), the right y-axis shows the size of the discretized state space, and the x-axis shows policy improvement iterations. The right sub-figure shows the same reward on its y-axis, plotted against wallclock training time on the x-axis. Based on the learning trajectory, it appears that little additional improvement would be gained if the agent continued to improve its policy; as shown below, the results are quite satisfactory after 1.25 hours of wallclock training time.

Results

The video below shows the trained agent controlling the acrobot. Note how the agent develops an oscillating movement to swing the free-moving joint progressively higher. After the free-moving joint reaches sufficient height, the agent appears to wait for the random dynamics to swing the end of the mechanism up to the goal line. Having watched many such attempts, I find it difficult to discern a systematic pattern of behavior in the critical final moments of the episode.