
Introduction

Robocode is a programming game in which robot agents are programmed to operate in a simulated battle environment. Full documentation about the game can be found here. Robocode has several features that make it an appealing testbed for reinforcement learning:

The purpose of this case study is to explore the use of RLAI for learning robot policies.

Installing Robocode

  1. Download the Robocode installer here. This is a customized build of Robocode that has been modified to make it compatible with RLAI. This build provides robots with elevated permissions, particularly (1) permission to use TCP (socket) communication with the localhost (127.0.0.1) and (2) permission to perform reflection-based inspection of objects. These permissions are needed to communicate with the RLAI server, and in general they do not pose much of a security risk; however, it is probably a good idea to avoid importing other robots into this installation of Robocode.
  2. Run the Robocode installer. Install to a directory such as robocode_rlai.

Running Robocode with RLAI

  1. Start the Robocode RLAI environment. This is most easily done using the JupyterLab notebook. There is already a configuration saved in the notebook that should suffice as a demonstration of reinforcement learning with Robocode. Load the configuration and start it.
  2. Start Robocode from the directory into which you installed it above. Add a few robots as well as the RlaiRobot, then begin the battle. If this is successful, then you will see the RlaiRobot moving, firing, etc. This is the start of training, so the agent will likely appear random until its policy develops.
  3. In order to restart training with different parameters, you will need to first restart the Robocode application and then restart the RLAI environment by killing the command (if using the CLI) or by restarting the JupyterLab kernel. This is a bit tedious but is required to reset the state of each.

Development of an RL Robot for Robocode

The following sections explore the development of an RL robot for Robocode. The approach taken here is incremental and focuses on developing particular skills, rather than addressing the entire behavior policy at once. Each section links to a GitHub commit marking the code and configurations that were used. The layout of each section is as follows:

Radar-Driven Aiming Against a Stationary Opponent (GitHub)

Introduction

This section develops radar-driven aiming against a stationary opponent. The RL robot is also stationary but is allowed to rotate both its radar (to obtain bearings on the opponent) and its gun (to orient the gun with respect to the opponent’s bearing).

Reward Signal

The reward signal at each time step is defined as the sum of bullet power that has hit the opponent, minus the sum of bullet power that has missed the opponent.
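As a concrete illustration, this reward can be computed as in the following sketch; the function and its inputs are hypothetical and are not taken from the RLAI codebase:

def aiming_reward(hit_bullet_powers, missed_bullet_powers):
    """
    Illustrative per-step reward:  total power of bullets that hit the
    opponent during the step, minus total power of bullets that missed.
    """
    return sum(hit_bullet_powers) - sum(missed_bullet_powers)

# Example:  one power-3.0 bullet hit and two power-1.0 bullets missed.
assert aiming_reward([3.0], [1.0, 1.0]) == 1.0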

State Features

Learning Model and Training

Key model parameters are listed below:

The full training command with all parameters is listed below:

rlai train \
    --random-seed 12345 \
    --agent rlai.gpi.state_action_value.ActionValueMdpAgent \
    --gamma 0.99 \
    --environment rlai.core.environments.robocode.RobocodeEnvironment \
    --port 54321 \
    --train-function rlai.gpi.temporal_difference.iteration.iterate_value_q_pi \
    --mode SARSA \
    --n-steps 100 \
    --num-improvements 1000 \
    --num-episodes-per-improvement 1 \
    --num-updates-per-improvement 1 \
    --make-final-policy-greedy True \
    --num-improvements-per-plot 5 \
    --save-agent-path ~/Desktop/robot_agent.pickle \
    --q-S-A rlai.gpi.state_action_value.function_approximation.ApproximateStateActionValueEstimator \
    --epsilon 0.2 \
    --plot-model \
    --plot-model-per-improvements 5 \
    --function-approximation-model rlai.gpi.state_action_value.function_approximation.models.sklearn.SKLearnSGD \
    --loss squared_error \
    --sgd-alpha 0.0 \
    --learning-rate constant \
    --eta0 0.001 \
    --feature-extractor rlai.core.environments.robocode.RobocodeFeatureExtractor

Running this training command in the JupyterLab notebook is shown below:

Results

The policy learned above produces the following Robocode battle:

In the video, the game rate is being manually adjusted to highlight early segments of the first few rounds and then speed through the rest of the battle (see the slider at the bottom). It is clear that the agent has developed a tactic of holding fire at the start of each round while rotating the radar to obtain a bearing on the opponent. Once a bearing is obtained, the gun is rotated accordingly and firing begins when the gun’s aim is sufficiently close. All ELOs seem to be satisfied by this approach.

Radar-Driven Aiming Against a Mobile Opponent (GitHub)

Introduction

This section develops radar-driven aiming against a mobile opponent. The RL robot is stationary but is allowed to rotate both its radar (to obtain bearings on the opponent) and its gun (to orient the gun with respect to the opponent’s bearing). A mobile opponent introduces the following challenges:

Reward Signal

The reward signal at each time step is defined as the sum of bullet power that hit the opponent, minus the sum of bullet power that missed the opponent. One difficulty that arose in the current development iteration pertains to reward delay and discounting. Many time steps elapse from the time a bullet is fired until it registers as a hit or miss. It is likely that the robot will take many actions during these intervening time steps, but none of these actions can possibly influence the outcome of the previously fired bullet. If we take the usual approach of discounting the bullet’s reward starting at the time when it hits or misses, then it becomes difficult to construct an appropriate discount. On one hand, we want the time step associated with the bullet firing to receive full credit for the subsequent hit or miss outcome. This would require us to eliminate discounting (i.e., gamma=1.0). On the other hand, we know that actions intervening between the firing of the bullet and the bullet’s hit or miss outcome should receive no credit. This would require us to discount entirely (i.e., gamma=0.0). In the usual discounting approach, we cannot simultaneously assign full credit to the bullet firing action while also assigning zero credit to actions that intervene between firing and the bullet’s outcome. A solution is to move away from the usual discounting approach. The alternative taken here is shown below:

[Figure: reward-shift — a bullet's hit/miss outcome is shifted back to the time step at which the bullet was fired]

In this alternative, the hit or miss outcome is shifted without discounting to the time at which the bullet was fired. The usual discounting is applied starting at this time step and working backward through actions taken prior to firing.
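The sketch below illustrates the idea; the data structures and function names are hypothetical rather than taken from the RLAI codebase. Each bullet's outcome is added to the reward of the step at which it was fired, and the standard backward pass over the shifted rewards then gives discounted credit to the steps preceding the firing while giving none to the steps between firing and resolution.

def shift_bullet_outcomes(num_steps, bullet_events):
    """
    Build per-step rewards in which each bullet's hit/miss outcome is credited
    to the step at which the bullet was fired rather than the step at which it
    resolved.  bullet_events is a list of (fired_step, resolved_step, outcome)
    tuples, where outcome is +power for a hit and -power for a miss.
    """
    rewards = [0.0] * num_steps
    for fired_step, _, outcome in bullet_events:
        rewards[fired_step] += outcome

    return rewards

def discounted_returns(rewards, gamma):
    """Standard backward pass from per-step rewards to discounted returns."""
    g = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g

    return returns

# Example:  a power-3.0 bullet is fired at step 2 and hits at step 20.  The +3.0
# is credited to step 2; steps 3 through 19 receive no credit for this bullet,
# while steps 0 and 1 receive gamma-discounted credit.
returns = discounted_returns(shift_bullet_outcomes(25, [(2, 20, 3.0)]), gamma=0.9)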

State Features

The state features described in this section make heavy use of two general concepts that are described below.

A useful consequence of transforming lateral distances as described above is that the extra feature-standardization step used in the previous iteration is unnecessary here.

Learning Model and Training

Key model parameters are listed below:

The full training command with all parameters is listed below:

rlai train \
    --save-agent-path ~/Desktop/robot_agent.pickle \
    --random-seed 12345 \
    --agent rlai.gpi.state_action_value.ActionValueMdpAgent \
    --gamma 0.9 \
    --environment rlai.core.environments.robocode.RobocodeEnvironment \
    --port 54321 \
    --train-function rlai.gpi.temporal_difference.iteration.iterate_value_q_pi \
    --mode SARSA \
    --n-steps 500 \
    --num-improvements 100 \
    --num-episodes-per-improvement 1 \
    --num-updates-per-improvement 1 \
    --make-final-policy-greedy True \
    --num-improvements-per-plot 5 \
    --q-S-A rlai.gpi.state_action_value.function_approximation.ApproximateStateActionValueEstimator \
    --epsilon 0.15 \
    --plot-model \
    --plot-model-per-improvements 5 \
    --function-approximation-model rlai.gpi.state_action_value.function_approximation.models.sklearn.SKLearnSGD \
    --loss squared_error \
    --sgd-alpha 0.0 \
    --learning-rate constant \
    --eta0 0.0005 \
    --feature-extractor rlai.core.environments.robocode.RobocodeFeatureExtractor

Running this training command in the JupyterLab notebook is shown below:

Boxplots of the model coefficients for each action are shown below:

[Figure: aiming-boxplots — boxplots of model coefficients for each action]

In the boxplots above, the y-axes show feature weights, and the x-axes bin the training iterations (battle rounds) into groups of 10. The first 10 rounds are in bin 0, and the final 10 rounds are in bin 9. A few observations about these boxplots:

In general, these feature weights are consistent with the ELOs and intuition.

Results

The policy learned above produces the following Robocode battle:

Having experimented with the features, model parameters, and learned policies obtained in this iteration, it appears that most of the ELOs are satisfied. However, the radar was not expected to exhibit the behavior shown above, where it rotates continuously. The radar was expected to stay fixated on the opponent, as this would provide the most up-to-date bearings to the opponent. Training was performed several times, and this was always the successful policy that emerged. What’s going on? The answer likely lies in the specification of the reward signal, which instructs the agent to maximize bullet hits and minimize bullet misses. This objective does not differentiate the policy shown in the preceding video from an alternative in which the radar remains fixated continuously on the opponent. If the reward were redesigned to emphasize bullet hits per turn, then the alternative (radar fixation) would likely be favored because it would enable accurate, high-rate firing.
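For example, a rate-sensitive reward might charge a small cost for every elapsed turn, so that equal hit power earned over fewer turns yields a higher return. The sketch below is purely illustrative and is not part of this case study's configuration:

PER_TURN_COST = 0.1  # hypothetical constant; its value would need tuning

def hits_per_turn_reward(hit_bullet_powers, missed_bullet_powers):
    """
    Illustrative per-step reward that penalizes elapsed time, favoring policies
    (such as keeping the radar fixated on the opponent) that support accurate,
    high-rate firing.
    """
    return sum(hit_bullet_powers) - sum(missed_bullet_powers) - PER_TURN_COST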

Primary points of future development:

Evasive Movement

Introduction

This section develops evasive movement against a stationary opponent who is actively tracking and firing. The RL robot is mobile but is not allowed to rotate its gun or fire. Moving unarmed against a firing opponent introduces the following challenges:

Reward Signal

The reward signal at each time step is 1.0 if no bullet hit the RL robot, and it is -10.0 if a bullet did hit the RL robot. Why the asymmetry? The enemy's rate of fire is limited by Robocode physics: each bullet firing event heats the gun, and the gun cannot fire again until it has cooled according to the game's cooling rate. Because the enemy's rate of fire is limited in this way, only a small fraction of turns can possibly involve a bullet hitting the RL robot. Thus, if the reward signal is symmetric, with 1.0 for each turn without a hit and -1.0 for each turn with a hit, then it is easy for the returns to be dominated by positive values even when the enemy is aiming, firing, and hitting the RL robot very effectively. Using the asymmetric reward compensates for the enemy's limited rate of fire, making hits much more prominent within the returns. I didn't explore alternatives to -10.0, as it seemed to work without much experimentation.
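A rough worked example, assuming Robocode's standard gun-heat rule (heat generated = 1 + firepower / 5) and the default cooling rate of 0.1 heat per turn (both should be checked against the installed build):

# Rough arithmetic for one firing cycle of the enemy's gun.  The gun-heat rule
# and cooling rate are assumptions based on standard Robocode physics.
firepower = 3.0
heat_generated = 1 + firepower / 5                     # 1.6 heat per shot
cooling_rate = 0.1                                     # heat dissipated per turn
turns_between_shots = heat_generated / cooling_rate    # 16 turns

# Return contribution of a 16-turn window in which the enemy fires once and hits.
symmetric_window = 15 * 1.0 + 1 * -1.0     # +14.0:  the hit barely registers
asymmetric_window = 15 * 1.0 + 1 * -10.0   # +5.0:   the hit weighs far more heavily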

State Features

Learning Model and Training

The full training command with all parameters is listed below:

rlai train \
    --random-seed 12345 \
    --agent rlai.core.environments.robocode.RobocodeAgent \
    --gamma 0.9 \
    --environment rlai.core.environments.robocode.RobocodeEnvironment \
    --port 54321 \
    --train-function rlai.gpi.temporal_difference.iteration.iterate_value_q_pi \
    --mode SARSA \
    --n-steps 100 \
    --num-improvements 10000 \
    --num-episodes-per-improvement 1 \
    --num-updates-per-improvement 1 \
    --epsilon 0.15 \
    --q-S-A rlai.gpi.state_action_value.function_approximation.ApproximateStateActionValueEstimator \
    --plot-model \
    --plot-model-bins 10 \
    --plot-model-per-improvements 100 \
    --function-approximation-model rlai.gpi.state_action_value.function_approximation.models.sklearn.SKLearnSGD \
    --loss squared_error \
    --sgd-alpha 0.0 \
    --learning-rate constant \
    --eta0 0.001 \
    --feature-extractor rlai.core.environments.robocode.RobocodeFeatureExtractor \
    --make-final-policy-greedy True \
    --num-improvements-per-plot 50 \
    --save-agent-path ~/Desktop/robot_agent_move.pickle

Results

The policy learned above produces the following Robocode battle:

A few things are worth noting about the RL robot’s policy:

Integrated Aiming, Firing, and Movement (TBD)