Chapter 13: Policy Gradient Methods
rlai.core.policies.parameterized.ParameterizedPolicy
Policy for use with policy gradient methods.
rlai.policy_gradient.policies.discrete_action.SoftMaxInActionPreferencesJaxPolicy
Parameterized policy that implements a soft-max over action preferences. The policy gradient calculation is
performed using the JAX library. This is only compatible with feature extractors derived from
`rlai.q_S_A.function_approximation.models.feature_extraction.StateActionFeatureExtractor`, which return state-action feature
vectors.
rlai.policy_gradient.policies.discrete_action.SoftMaxInActionPreferencesPolicy
Parameterized policy that implements a soft-max over action preferences. The policy gradient calculation is
implemented manually. See `SoftMaxInActionPreferencesJaxPolicy` for a similar policy in which the gradient is calculated
using the JAX library. This is only compatible with feature extractors derived from
`rlai.q_S_A.function_approximation.models.feature_extraction.StateActionFeatureExtractor`, which return state-action feature
vectors.
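For intuition, the soft-max-over-preferences gradient that both policies compute can be sketched in plain NumPy. This is a conceptual illustration, not rlai's implementation; the feature matrix and parameter values below are made-up stand-ins for what a `StateActionFeatureExtractor` would produce:

```python
import numpy as np

# Hypothetical state-action feature vectors x(s, a), one row per action.
x = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.5, 1.0],
])
theta = np.array([0.2, -0.1, 0.3])  # policy parameters (made up)

# Soft-max over action preferences h(s, a) = theta . x(s, a).
prefs = x @ theta
exp_prefs = np.exp(prefs - prefs.max())  # subtract max for numerical stability
pi = exp_prefs / exp_prefs.sum()

# Analytic gradient of the log-policy:
#   grad log pi(a|s) = x(s, a) - sum_b pi(b|s) x(s, b)
a = 1
grad_log_pi = x[a] - pi @ x
```

The JAX-based policy obtains the same gradient by differentiating the log-policy automatically rather than coding the expression above by hand.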
rlai.policy_gradient.monte_carlo.reinforce.improve
Perform Monte Carlo improvement of an agent's policy within an environment via the REINFORCE policy gradient method.
This improvement function operates on the full returns observed at the ends of episodes, so it is only appropriate for
episodic tasks.
:param agent: Agent containing a parameterized policy to be optimized.
:param environment: Environment.
:param num_episodes: Number of episodes to execute.
:param update_upon_every_visit: True to update each state-action pair upon each visit within an episode, or False to
update each state-action pair upon the first visit within an episode.
:param alpha: Policy gradient step size.
:param thread_manager: Thread manager. The current function (and the thread running it) will wait on this manager
before starting each iteration. This provides a mechanism for pausing, resuming, and aborting training. Omit for no
waiting.
:param plot_state_value: Whether to plot the state-value.
:param num_episodes_per_checkpoint: Number of episodes per checkpoint save.
:param checkpoint_path: Checkpoint path. Must be provided if `num_episodes_per_checkpoint` is provided.
:param training_pool_directory: Path to directory in which to store pooled training runs.
:param training_pool_count: Number of runners in the training pool.
:param training_pool_iterate_episodes: Number of episodes per training pool iteration.
:param training_pool_evaluate_episodes: Number of episodes to evaluate the agent when iterating the training pool.
:param training_pool_max_iterations_without_improvement: Maximum number of training pool iterations to allow
before reverting to the best prior agent, or None to never revert.
:return: Final checkpoint path, or None if checkpoints were not saved.
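The core update that REINFORCE performs, theta += alpha * G * grad log pi(a|s), can be sketched on a toy one-step episodic task. This is a minimal illustration with made-up features and rewards, not rlai's implementation, which additionally handles multi-step episodes, first-visit versus every-visit updates, checkpointing, and training pools:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step episodic task: action 0 yields reward 1, action 1 yields 0.
x = np.eye(2)        # hypothetical state-action features, x[a] = feature vector
theta = np.zeros(2)  # policy parameters
alpha = 0.1          # policy gradient step size

def softmax_policy(theta):
    prefs = x @ theta
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

for _ in range(500):                     # num_episodes
    p = softmax_policy(theta)
    action = rng.choice(2, p=p)          # sample an action from the current policy
    g = 1.0 if action == 0 else 0.0      # return G for this one-step episode
    grad_log_pi = x[action] - p @ x      # grad of log pi(action | s)
    theta += alpha * g * grad_log_pi     # REINFORCE update
```

After training, the policy places most of its probability on the better action: actions with positive returns have their log-probability pushed up in proportion to the return, which is the essence of the Monte Carlo policy gradient.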
rlai.actions.ContinuousMultiDimensionalAction
Continuous-valued multi-dimensional action.
rlai.policy_gradient.policies.continuous_action.ContinuousActionBetaDistributionPolicy
Parameterized policy that produces continuous, multi-dimensional actions by modeling multiple independent beta
distributions in terms of state features. This is appropriate for action spaces that are bounded in [min, max],
where the values of min and max can be different along each action dimension. The state features must be extracted
by an extractor derived from `rlai.state_value.function_approximation.models.feature_extraction.StateFeatureExtractor`.
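The rescaling idea behind the beta-distribution policy can be sketched as follows. All feature and parameter values here are hypothetical stand-ins for what a `StateFeatureExtractor` and a fitted policy would provide; the point is that each action dimension gets its own beta distribution on [0, 1], which is then rescaled to that dimension's [min, max]:

```python
import numpy as np

rng = np.random.default_rng(1)

state_features = np.array([0.4, -0.2, 1.0])  # hypothetical state feature vector

# One (alpha, beta) shape-parameter pair per action dimension, each modeled
# from state features via a linear form passed through softplus to stay positive.
theta_a = np.array([[0.5, 0.1, 0.3], [0.2, -0.4, 0.6]])
theta_b = np.array([[0.1, 0.5, 0.2], [0.3, 0.2, 0.1]])

def softplus(z):
    return np.log1p(np.exp(z))

shape_a = softplus(theta_a @ state_features)  # alpha > 0, one per dimension
shape_b = softplus(theta_b @ state_features)  # beta > 0, one per dimension

# Per-dimension action bounds [min, max], which may differ across dimensions.
low = np.array([-1.0, 0.0])
high = np.array([1.0, 5.0])

# Sample in [0, 1] per dimension, then rescale into the bounded action space.
unit_sample = rng.beta(shape_a, shape_b)
action = low + unit_sample * (high - low)
```

Because the beta distribution has bounded support, no clipping is needed to keep sampled actions inside the action space.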
rlai.policy_gradient.policies.continuous_action.ContinuousActionNormalDistributionPolicy
Parameterized policy that produces continuous, multi-dimensional actions by modeling the multidimensional mean and
covariance matrix of the multivariate normal distribution in terms of state features. This is appropriate for action
spaces that are unbounded in (-infinity, infinity). The state features must be extracted by an extractor derived
from `rlai.state_value.function_approximation.models.feature_extraction.StateFeatureExtractor`.
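The normal-distribution counterpart can be sketched similarly. For simplicity this sketch uses a diagonal covariance, whereas the policy described above models the full covariance matrix; all feature and parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

state_features = np.array([0.4, -0.2, 1.0])  # hypothetical state feature vector

# Mean vector modeled linearly from state features (2 action dimensions).
theta_mean = np.array([[0.5, 0.1, 0.3], [0.2, -0.4, 0.6]])
mean = theta_mean @ state_features

# Diagonal covariance with variances kept positive via exp of a linear form.
theta_logvar = np.array([[0.1, 0.0, -0.5], [0.0, 0.2, -0.5]])
cov = np.diag(np.exp(theta_logvar @ state_features))

# Sample a continuous, multi-dimensional action from the multivariate normal.
action = rng.multivariate_normal(mean, cov)
```

Unlike the beta-distribution policy, the normal distribution has unbounded support, which is why it suits action spaces that are themselves unbounded.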
rlai.policy_gradient.policies.continuous_action.ContinuousActionPolicy
Parameterized policy that produces continuous, multi-dimensional actions.