Chapter 13: Policy Gradient Methods
rlai.core.policies.parameterized.ParameterizedPolicy
Policy for use with policy gradient methods.
rlai.policy_gradient.policies.discrete_action.SoftMaxInActionPreferencesJaxPolicy
Parameterized policy that implements a soft-max over action preferences. The policy gradient calculation is
performed using the JAX library. This is only compatible with feature extractors derived from
`rlai.q_S_A.function_approximation.models.feature_extraction.StateActionFeatureExtractor`, which return state-action feature
vectors.
rlai.policy_gradient.policies.discrete_action.SoftMaxInActionPreferencesPolicy
Parameterized policy that implements a soft-max over action preferences. The policy gradient calculation is
implemented manually. See `SoftMaxInActionPreferencesJaxPolicy` for a similar policy in which the gradient is calculated
using the JAX library. This is only compatible with feature extractors derived from
`rlai.q_S_A.function_approximation.models.feature_extraction.StateActionFeatureExtractor`, which return state-action feature
vectors.
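For intuition, the soft-max-over-preferences gradient that both policies compute can be sketched in plain NumPy. This is a conceptual illustration, not rlai's implementation; the feature matrix and parameter values below are made-up stand-ins for what a `StateActionFeatureExtractor` would produce:

```python
import numpy as np

# Hypothetical state-action feature vectors x(s, a), one row per action.
x = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.5, 1.0],
])
theta = np.array([0.2, -0.1, 0.3])  # policy parameters (made up)

# Soft-max over action preferences h(s, a) = theta . x(s, a).
prefs = x @ theta
exp_prefs = np.exp(prefs - prefs.max())  # subtract max for numerical stability
pi = exp_prefs / exp_prefs.sum()

# Analytic gradient of the log-policy:
#   grad log pi(a|s) = x(s, a) - sum_b pi(b|s) x(s, b)
a = 1
grad_log_pi = x[a] - pi @ x
```

The JAX-based policy obtains the same gradient by differentiating the log-policy automatically rather than coding the expression above by hand.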
rlai.policy_gradient.monte_carlo.reinforce.improve
Perform Monte Carlo improvement of an agent's policy within an environment via the REINFORCE policy gradient method.
This improvement function operates on the full returns observed at the ends of episodes, so it is only appropriate for
episodic tasks.
:param agent: Agent containing a parameterized policy to be optimized.
:param environment: Environment.
:param num_episodes: Number of episodes to execute.
:param update_upon_every_visit: True to update each state-action pair upon each visit within an episode, or False to
update each state-action pair upon the first visit within an episode.
:param alpha: Policy gradient step size.
:param thread_manager: Thread manager. The current function (and the thread running it) will wait on this manager
before starting each iteration. This provides a mechanism for pausing, resuming, and aborting training. Omit for no
waiting.
:param plot_state_value: Whether to plot the state-value.
:param num_episodes_per_checkpoint: Number of episodes per checkpoint save.
:param checkpoint_path: Checkpoint path. Must be provided if `num_episodes_per_checkpoint` is provided.
:param training_pool_directory: Path to directory in which to store pooled training runs.
:param training_pool_count: Number of runners in the training pool.
:param training_pool_iterate_episodes: Number of episodes per training pool iteration.
:param training_pool_evaluate_episodes: Number of episodes to evaluate the agent when iterating the training pool.
:param training_pool_max_iterations_without_improvement: Maximum number of training pool iterations to allow
before reverting to the best prior agent, or None to never revert.
:return: Final checkpoint path, or None if checkpoints were not saved.
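The core update that REINFORCE performs, theta += alpha * G * grad log pi(a|s), can be sketched on a toy one-step episodic task. This is a minimal illustration with made-up features and rewards, not rlai's implementation, which additionally handles multi-step episodes, first-visit versus every-visit updates, checkpointing, and training pools:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step episodic task: action 0 yields reward 1, action 1 yields 0.
x = np.eye(2)        # hypothetical state-action features, x[a] = feature vector
theta = np.zeros(2)  # policy parameters
alpha = 0.1          # policy gradient step size

def softmax_policy(theta):
    prefs = x @ theta
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

for _ in range(500):                     # num_episodes
    p = softmax_policy(theta)
    action = rng.choice(2, p=p)          # sample an action from the current policy
    g = 1.0 if action == 0 else 0.0      # return G for this one-step episode
    grad_log_pi = x[action] - p @ x      # grad of log pi(action | s)
    theta += alpha * g * grad_log_pi     # REINFORCE update
```

After training, the policy places most of its probability on the better action: actions with positive returns have their log-probability pushed up in proportion to the return, which is the essence of the Monte Carlo policy gradient.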
rlai.actions.ContinuousMultiDimensionalAction
Continuous-valued multi-dimensional action.
rlai.policy_gradient.policies.continuous_action.ContinuousActionBetaDistributionPolicy
Parameterized policy that produces continuous, multi-dimensional actions by modeling multiple independent beta
distributions in terms of state features. This is appropriate for action spaces that are bounded in [min, max],
where the values of min and max can be different along each action dimension. The state features must be extracted
by an extractor derived from `rlai.state_value.function_approximation.models.feature_extraction.StateFeatureExtractor`.
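The rescaling idea behind the beta-distribution policy can be sketched as follows. All feature and parameter values here are hypothetical stand-ins for what a `StateFeatureExtractor` and a fitted policy would provide; the point is that each action dimension gets its own beta distribution on [0, 1], which is then rescaled to that dimension's [min, max]:

```python
import numpy as np

rng = np.random.default_rng(1)

state_features = np.array([0.4, -0.2, 1.0])  # hypothetical state feature vector

# One (alpha, beta) shape-parameter pair per action dimension, each modeled
# from state features via a linear form passed through softplus to stay positive.
theta_a = np.array([[0.5, 0.1, 0.3], [0.2, -0.4, 0.6]])
theta_b = np.array([[0.1, 0.5, 0.2], [0.3, 0.2, 0.1]])

def softplus(z):
    return np.log1p(np.exp(z))

shape_a = softplus(theta_a @ state_features)  # alpha > 0, one per dimension
shape_b = softplus(theta_b @ state_features)  # beta > 0, one per dimension

# Per-dimension action bounds [min, max], which may differ across dimensions.
low = np.array([-1.0, 0.0])
high = np.array([1.0, 5.0])

# Sample in [0, 1] per dimension, then rescale into the bounded action space.
unit_sample = rng.beta(shape_a, shape_b)
action = low + unit_sample * (high - low)
```

Because the beta distribution has bounded support, no clipping is needed to keep sampled actions inside the action space.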
rlai.policy_gradient.policies.continuous_action.ContinuousActionNormalDistributionPolicy
Parameterized policy that produces continuous, multi-dimensional actions by modeling the multidimensional mean and
covariance matrix of the multivariate normal distribution in terms of state features. This is appropriate for action
spaces that are unbounded in (-infinity, infinity). The state features must be extracted by an extractor derived
from `rlai.state_value.function_approximation.models.feature_extraction.StateFeatureExtractor`.
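The normal-distribution counterpart can be sketched similarly. For simplicity this sketch uses a diagonal covariance, whereas the policy described above models the full covariance matrix; all feature and parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

state_features = np.array([0.4, -0.2, 1.0])  # hypothetical state feature vector

# Mean vector modeled linearly from state features (2 action dimensions).
theta_mean = np.array([[0.5, 0.1, 0.3], [0.2, -0.4, 0.6]])
mean = theta_mean @ state_features

# Diagonal covariance with variances kept positive via exp of a linear form.
theta_logvar = np.array([[0.1, 0.0, -0.5], [0.0, 0.2, -0.5]])
cov = np.diag(np.exp(theta_logvar @ state_features))

# Sample a continuous, multi-dimensional action from the multivariate normal.
action = rng.multivariate_normal(mean, cov)
```

Unlike the beta-distribution policy, the normal distribution has unbounded support, which is why it suits action spaces that are themselves unbounded.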
rlai.policy_gradient.policies.continuous_action.ContinuousActionPolicy
Parameterized policy that produces continuous, multi-dimensional actions.