
Lunar Lander with Continuous Control

Introduction

You can read more about this environment here. Many of the issues involved in solving this environment are addressed in the continuous mountain car case study, so we will focus here on details specific to the continuous lunar lander environment.

Development

A few key points of development are worth mentioning.

Fuel Level

As with the continuous mountain car environment, the continuous lunar lander does not include the concept of fuel. Within RLAI, an initial fuel level of 1.0 is set at the start of each episode, and throttle use depletes the fuel level accordingly. See here for details.
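
A minimal sketch of this bookkeeping is shown below. The class, attribute, and method names are illustrative rather than RLAI's actual implementation, and the burn rate is an assumed constant.

import numpy as np

class FuelTracker:
    """Illustrative fuel bookkeeping for the continuous lunar lander (not RLAI's actual code)."""

    def __init__(self, burn_rate: float = 0.01):
        self.burn_rate = burn_rate  # assumed depletion per unit of throttle
        self.fuel_level = 1.0

    def reset(self):
        # an initial fuel level of 1.0 is set at the start of each episode
        self.fuel_level = 1.0

    def step(self, action: np.ndarray):
        # the continuous lunar lander action holds the main- and side-engine throttles;
        # deplete fuel in proportion to total throttle magnitude (a simplification)
        throttle_use = float(np.abs(action).sum())
        self.fuel_level = max(0.0, self.fuel_level - self.burn_rate * throttle_use)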

Reward

Looking at the Gym reward calculation code for the lunar lander, one sees a complicated arrangement of scaling factors and transformations. A goal in this case study was to simplify this reward function. The ideal terminal state is easy to describe: zeros across the position and movement state variables. This portion of the reward function is calculated as follows:

# observation[0:6] holds x position, y position, x and y velocity, angle, and angular velocity
state_reward = -np.abs(observation[0:6]).sum()

In addition to rewarding the state variables as above, the remaining fuel is rewarded if the terminal state is good. Rewarding remaining fuel unconditionally can cause the agent to veer out of bounds immediately, sacrificing state reward for fuel reward. The terminal state is considered good if the lander is within the goal posts (located at x = +/-0.2) and the other state variables (y position, x and y velocity, angle, and angular velocity) are near zero.
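
Putting the two pieces together, the sketch below shows the shape of this reward. The function signature and the near-zero threshold of 0.1 are assumptions for illustration; the linked RLAI source has the actual values.

import numpy as np

def sketch_reward(observation: np.ndarray, fuel_level: float, terminal: bool) -> float:
    # penalize distance from the ideal terminal state: zeros across the position and movement variables
    state_reward = -np.abs(observation[0:6]).sum()

    # reward remaining fuel only when the episode terminates in a good state: lander within
    # the goal posts (x = +/-0.2) and the other variables near zero (threshold assumed here)
    fuel_reward = 0.0
    if terminal and abs(observation[0]) <= 0.2 and np.all(np.abs(observation[1:6]) < 0.1):
        fuel_reward = fuel_level

    return state_reward + fuel_reward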

The full code for the continuous lunar lander reward can be found here.

Training

The following command trains an agent for the continuous lunar lander environment using policy gradient optimization with a baseline state-value estimator:

rlai train \
    --random-seed 12345 \
    --agent rlai.policy_gradient.ParameterizedMdpAgent \
    --gamma 1.0 \
    --environment rlai.core.environments.gymnasium.Gym \
    --gym-id LunarLanderContinuous-v2 \
    --render-every-nth-episode 100 \
    --video-directory ~/Desktop/lunarlander_continuous_videos \
    --plot-environment \
    --T 500 \
    --train-function rlai.policy_gradient.monte_carlo.reinforce.improve \
    --num-episodes 50000 \
    --plot-state-value True \
    --v-S rlai.state_value.function_approximation.ApproximateStateValueEstimator \
    --feature-extractor rlai.core.environments.gymnasium.ContinuousLunarLanderFeatureExtractor \
    --function-approximation-model rlai.models.sklearn.SKLearnSGD \
    --loss squared_error \
    --sgd-alpha 0.0 \
    --learning-rate constant \
    --eta0 0.0001 \
    --policy rlai.policy_gradient.policies.continuous_action.ContinuousActionBetaDistributionPolicy \
    --policy-feature-extractor rlai.core.environments.gymnasium.ContinuousLunarLanderFeatureExtractor \
    --plot-policy \
    --alpha 0.0001 \
    --update-upon-every-visit True \
    --save-agent-path ~/Desktop/continuous_lunarlander_agent.pickle

The arguments are explained below.

RLAI

Agent

Environment

Training Function and Episodes

Baseline State-Value Estimator

Policy
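
The training command above uses ContinuousActionBetaDistributionPolicy. As a rough conceptual illustration of how such a policy produces bounded continuous actions (this sketch uses SciPy directly and is not RLAI's implementation), the policy's model outputs Beta shape parameters per action dimension, samples from the resulting distributions, and rescales the samples from [0, 1] to the environment's action bounds:

import numpy as np
from scipy.stats import beta

def sample_bounded_action(a, b, low, high, random_state):
    # sample each action dimension from Beta(a, b), which lies in [0, 1],
    # then rescale to the action bounds (e.g., [-1, 1] for the lander throttles)
    u = beta.rvs(a, b, random_state=random_state)
    return low + u * (high - low)

rng = np.random.default_rng(12345)
action = sample_bounded_action(
    a=np.array([2.0, 2.0]),    # shape parameters produced by the policy model (values illustrative)
    b=np.array([2.0, 2.0]),
    low=np.array([-1.0, -1.0]),
    high=np.array([1.0, 1.0]),
    random_state=rng
)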

Output

Results

The following video shows the final agent after 50000 training episodes:

Most of the landings are quite good despite challenging initial conditions. In a couple of episodes, the lander descends too quickly and its body strikes the ground. Another shortcoming is that the lander has not learned to fully shut down its engines after landing. Both issues might be addressed with additional training episodes or with changes to the policy and baseline models.