Chapter 13: Policy Gradient Methods
rlai.policy_gradient.policies.ParameterizedPolicy
Policy for use with policy gradient methods.
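For orientation, the gradient that these parameterized policies are adjusted along is the one given by the policy gradient theorem (episodic case, standard textbook form):

```latex
% Policy gradient theorem (episodic case): the gradient of performance J(\theta) is
% proportional to an expectation over the on-policy state distribution \mu and the
% policy's action probabilities; it does not require the gradient of \mu itself.
\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a)\, \nabla_{\theta} \pi(a \mid s, \theta)
```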
rlai.policy_gradient.policies.discrete_action.SoftMaxInActionPreferencesJaxPolicy
Parameterized policy that implements a soft-max over action preferences. The policy gradient calculation is
performed using the JAX library. This is only compatible with feature extractors derived from
`rlai.gpi.state_action_value.function_approximation.models.feature_extraction.StateActionFeatureExtractor`, which return
state-action feature vectors.
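As a minimal sketch of the idea (not the rlai implementation; the names and array shapes below are illustrative), the gradient of the log-policy for a soft-max over linear action preferences can be obtained from JAX automatic differentiation, with `x_sa` standing in for the per-action feature vectors returned by a state-action feature extractor:

```python
import jax
import jax.numpy as jnp


def log_pi(theta, x_sa, a):
    """Log-probability of action a under a soft-max over linear action preferences theta . x(s, b)."""
    preferences = x_sa @ theta                  # one scalar preference per action
    return jax.nn.log_softmax(preferences)[a]


# Gradient of ln pi(a|s, theta) with respect to theta, via automatic differentiation.
grad_log_pi = jax.grad(log_pi)

# Illustrative values: 3 actions, 4 state-action features per action.
theta = jnp.zeros(4)
x_sa = jnp.array([
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 0.5],
    [1.0, 1.0, 0.0, 0.0],
])
print(grad_log_pi(theta, x_sa, 1))
```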
rlai.policy_gradient.policies.discrete_action.SoftMaxInActionPreferencesPolicy
Parameterized policy that implements a soft-max over action preferences. The policy gradient calculation is
hand-coded; see `SoftMaxInActionPreferencesJaxPolicy` for a similar policy in which the gradient is calculated
using the JAX library. This is only compatible with feature extractors derived from
`rlai.gpi.state_action_value.function_approximation.models.feature_extraction.StateActionFeatureExtractor`, which return
state-action feature vectors.
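For the hand-coded case, the gradient has the standard closed form: x(s, a) minus the policy-weighted average of the feature vectors of all actions. A minimal NumPy sketch (again illustrative, not the rlai code):

```python
import numpy as np


def grad_log_pi(theta, x_sa, a):
    """Hand-coded gradient of ln pi(a|s, theta) for a soft-max over linear action
    preferences:  x(s, a) - sum_b pi(b|s, theta) * x(s, b).
    """
    preferences = x_sa @ theta
    preferences = preferences - preferences.max()         # for numerical stability
    pi = np.exp(preferences) / np.exp(preferences).sum()  # soft-max action probabilities
    return x_sa[a] - pi @ x_sa                            # subtract the expected feature vector


# For the same theta and x_sa, this matches the JAX-computed gradient sketched above.
```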
rlai.policy_gradient.monte_carlo.reinforce.improve
Perform Monte Carlo improvement of an agent's policy within an environment via the REINFORCE policy gradient method.
This improvement function operates on returns computed at the end of each episode, so it is only appropriate for
episodic tasks; the textbook form of the update is sketched after the parameter list below.
:param agent: Agent containing a parameterized policy to be optimized.
:param environment: Environment.
:param num_episodes: Number of episodes to execute.
:param update_upon_every_visit: True to update each state-action pair upon each visit within an episode, or False to
update each state-action pair upon the first visit within an episode.
:param alpha: Policy gradient step size.
:param thread_manager: Thread manager. The current function (and the thread running it) will wait on this manager
before starting each iteration. This provides a mechanism for pausing, resuming, and aborting training. Omit for no
waiting.
:param plot_state_value: Whether to plot the state-value.
:param num_warmup_episodes: Number of warmup episodes to run before updating the policy. Warmup episodes allow
estimates (e.g., means and variances of feature scalers, baseline state-value estimators, etc.) to settle before
updating the policy.
:param num_episodes_per_policy_update_plot: Number of episodes between successive policy-update plots.
:param policy_update_plot_pdf_directory: Directory in which to store plot PDFs, or None to display them directly.
:param num_episodes_per_checkpoint: Number of episodes per checkpoint save.
:param checkpoint_path: Checkpoint path. Must be provided if `num_episodes_per_checkpoint` is provided.
:param training_pool_directory: Path to directory in which to store pooled training runs.
:param training_pool_count: Number of runners in the training pool.
:param training_pool_iterate_episodes: Number of episodes per training pool iteration.
:param training_pool_evaluate_episodes: Number of episodes over which to evaluate the agent when iterating the training pool.
:param training_pool_max_iterations_without_improvement: Maximum number of training pool iterations without
improvement to allow before reverting to the best prior agent, or None to never revert.
:return: Final checkpoint path, or None if checkpoints were not saved.
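For reference, the textbook REINFORCE update for each time step t of a completed episode with return G_t and step size alpha is sketched below; the exact update in rlai may differ in detail (e.g., how discounting enters, and the return is baseline-adjusted when a baseline state-value estimator is configured, as noted under `num_warmup_episodes`).

```latex
% Textbook REINFORCE update, applied for each time step t of a completed episode:
% the policy parameters move in the direction of the log-policy gradient, scaled by
% the step size \alpha and the Monte Carlo return G_t (discounted by \gamma^{t}).
\theta \leftarrow \theta + \alpha\, \gamma^{t}\, G_t\, \nabla_{\theta} \ln \pi(A_t \mid S_t, \theta)
```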
rlai.core.ContinuousMultiDimensionalAction
Continuous-valued multidimensional action.
rlai.policy_gradient.policies.continuous_action.ContinuousActionBetaDistributionPolicy
Parameterized policy that produces continuous, multidimensional actions by modeling multiple independent beta
distributions in terms of state features. This is appropriate for action spaces that are bounded in [min, max],
where the values of min and max can be different along each action dimension. The state features must be extracted
by an extractor derived from
`rlai.state_value.function_approximation.models.feature_extraction.StateFeatureExtractor`.
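One common way to realize such a parameterization is sketched below (illustrative link functions and names, not the rlai implementation): per-dimension shape parameters are computed from the state features and kept positive, a sample is drawn from each independent beta distribution, and the resulting [0, 1] values are rescaled to the per-dimension [min, max] bounds.

```python
import numpy as np


def sample_beta_action(theta_a, theta_b, x_s, action_min, action_max, rng):
    """Sample a continuous, multidimensional action from independent beta distributions
    whose shape parameters are functions of the state-feature vector x_s.
    """
    a = np.exp(theta_a @ x_s)                # shape parameters kept positive via exp (one per action dimension)
    b = np.exp(theta_b @ x_s)
    u = rng.beta(a, b)                       # samples in [0, 1], one per action dimension
    return action_min + u * (action_max - action_min)  # rescale to per-dimension [min, max] bounds


rng = np.random.default_rng(0)
x_s = np.array([1.0, 0.5, -0.2])             # hypothetical state features
theta_a = np.zeros((2, 3))                   # 2 action dimensions, 3 state features
theta_b = np.zeros((2, 3))
print(sample_beta_action(theta_a, theta_b, x_s, np.array([-1.0, 0.0]), np.array([1.0, 10.0]), rng))
```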
rlai.policy_gradient.policies.continuous_action.ContinuousActionNormalDistributionPolicy
Parameterized policy that produces continuous, multidimensional actions by modeling the multidimensional mean and
covariance matrix of the multivariate normal distribution in terms of state features. This is appropriate for action
spaces that are unbounded in (-infinity, infinity). The state features must be extracted by an extractor derived
from `rlai.state_value.function_approximation.models.feature_extraction.StateFeatureExtractor`.
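A corresponding sketch for the normal-distribution case (again illustrative; a diagonal covariance is used here for brevity, whereas the class above models the covariance matrix): the mean is linear in the state features, and the variances are kept positive with an exponential.

```python
import numpy as np


def sample_normal_action(theta_mean, theta_log_var, x_s, rng):
    """Sample an unbounded, multidimensional action from a multivariate normal whose
    mean and (diagonal) covariance are functions of the state-feature vector x_s.
    """
    mean = theta_mean @ x_s                        # one mean per action dimension
    cov = np.diag(np.exp(theta_log_var @ x_s))     # positive, diagonal covariance
    return rng.multivariate_normal(mean, cov)


rng = np.random.default_rng(0)
x_s = np.array([1.0, 0.5, -0.2])                   # hypothetical state features
theta_mean = np.zeros((2, 3))                      # 2 action dimensions, 3 state features
theta_log_var = np.zeros((2, 3))
print(sample_normal_action(theta_mean, theta_log_var, x_s, rng))
```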
rlai.policy_gradient.policies.continuous_action.ContinuousActionPolicy
Parameterized policy that produces continuous, multidimensional actions.