
Chapter 4: Dynamic Programming

rlai.gpi.dynamic_programming.evaluation.evaluate_v_pi

Perform iterative policy evaluation of an agent's policy within an environment, returning state values.

    :param agent: MDP agent. Must contain a policy `pi` that has been fully initialized with instances of
    `rlai.core.ModelBasedMdpState`.
    :param environment: Model-based MDP environment to evaluate.
    :param theta: Minimum tolerated change in state-value estimates, below which evaluation terminates. Either `theta`
    or `num_iterations` (or both) can be specified, but passing neither will raise an exception.
    :param num_iterations: Number of evaluation iterations to execute.  Either `theta` or `num_iterations` (or both)
    can be specified, but passing neither will raise an exception.
    :param update_in_place: Whether to update value estimates in place.
    :param initial_v_S: Initial guess at state-value, or None for no guess.
    :return: 2-tuple of (1) dictionary of MDP states and their estimated values under the agent's policy, and (2) final
    value of delta.
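
A minimal usage sketch, assuming a model-based MDP `agent` and `environment` have already been constructed elsewhere (the wrapper function name and the `theta` value below are illustrative, not part of the library):

```python
from rlai.gpi.dynamic_programming.evaluation import evaluate_v_pi

def evaluate_state_values(agent, environment):
    # agent: MDP agent whose policy `pi` holds ModelBasedMdpState instances
    # environment: model-based MDP environment
    v_pi, delta = evaluate_v_pi(
        agent=agent,
        environment=environment,
        theta=0.001,          # terminate once the largest state-value change falls below this
        num_iterations=None,  # could instead (or also) cap the number of sweeps
        update_in_place=True,
        initial_v_S=None      # no initial guess at state values
    )
    return v_pi, delta
```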

rlai.gpi.dynamic_programming.evaluation.evaluate_q_pi

Perform iterative policy evaluation of an agent's policy within an environment, returning state-action values.

    :param agent: MDP agent.
    :param environment: Model-based MDP environment to evaluate.
    :param theta: Minimum tolerated change in state-action value estimates, below which evaluation terminates. Either `theta`
    or `num_iterations` (or both) can be specified, but passing neither will raise an exception.
    :param num_iterations: Number of evaluation iterations to execute.  Either `theta` or `num_iterations` (or both)
    can be specified, but passing neither will raise an exception.
    :param update_in_place: Whether to update value estimates in place.
    :param initial_q_S_A: Initial guess at state-action value, or None for no guess.
    :return: 2-tuple of (1) dictionary of MDP states, actions, and their estimated values under the agent's policy, and
    (2) final value of delta.
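
The call mirrors `evaluate_v_pi`. A hedged sketch, again assuming a preconstructed `agent` and `environment`, this time terminating on an iteration cap rather than `theta`:

```python
from rlai.gpi.dynamic_programming.evaluation import evaluate_q_pi

def evaluate_state_action_values(agent, environment):
    q_pi, delta = evaluate_q_pi(
        agent=agent,
        environment=environment,
        theta=None,
        num_iterations=50,   # arbitrary cap on evaluation sweeps
        update_in_place=True,
        initial_q_S_A=None   # no initial guess at state-action values
    )
    return q_pi, delta
```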

rlai.gpi.dynamic_programming.improvement.improve_policy_with_v_pi

Improve an agent's policy according to its state-value estimates. This makes the policy greedy with respect to the
    state-value estimates. In cases where multiple such greedy actions exist for a state, each of the greedy actions
    will be assigned equal probability.

    Note that the present function resides within `rlai.gpi.dynamic_programming.improvement` and requires state-value
    estimates of model-based states. This is the case because policy improvement from state values is only
    possible if we have a model of the environment. Compare with `rlai.gpi.improvement.improve_policy_with_q_pi`, which
    accepts model-free states since state-action values are estimated directly.

    :param agent: Agent.
    :param environment: Model-based environment.
    :param v_pi: State-value estimates for the agent's policy.
    :return: Number of states in which the policy was improved.
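
To show how evaluation and improvement fit together, here is a hand-rolled policy-iteration loop. The loop itself is an illustrative assumption (the library's own routine is `iterate_policy_v_pi`, below); it relies only on the documented parameters and return values:

```python
from rlai.gpi.dynamic_programming.evaluation import evaluate_v_pi
from rlai.gpi.dynamic_programming.improvement import improve_policy_with_v_pi

def hand_rolled_policy_iteration(agent, environment):
    # alternate evaluation and greedy improvement until the policy stops changing
    while True:
        v_pi, _ = evaluate_v_pi(
            agent=agent,
            environment=environment,
            theta=0.001,
            num_iterations=None,
            update_in_place=True,
            initial_v_S=None
        )
        num_improved = improve_policy_with_v_pi(
            agent=agent,
            environment=environment,
            v_pi=v_pi
        )
        if num_improved == 0:  # policy is already greedy with respect to its own values
            return v_pi
```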

rlai.gpi.improvement.improve_policy_with_q_pi

Improve an agent's policy according to its state-action value estimates. This makes the policy greedy with respect
    to the state-action value estimates. In cases where multiple such greedy actions exist for a state, each of the
    greedy actions will be assigned equal probability.

    :param agent: Agent.
    :param q_pi: State-action value estimates for the agent's policy.
    :param epsilon: Total probability mass to divide across all actions for a state, resulting in an epsilon-greedy
    policy. Must be >= 0.0 if given. Pass None to generate a purely greedy policy.
    :return: Number of states in which the policy was improved.
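
A short sketch, assuming `agent` and a `q_pi` dictionary returned by `evaluate_q_pi`; the epsilon value is arbitrary:

```python
from rlai.gpi.improvement import improve_policy_with_q_pi

def make_epsilon_greedy(agent, q_pi):
    # spread 0.1 of probability mass uniformly across each state's actions;
    # pass epsilon=None for a purely greedy policy instead
    return improve_policy_with_q_pi(
        agent=agent,
        q_pi=q_pi,
        epsilon=0.1
    )
```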

rlai.gpi.dynamic_programming.iteration.iterate_policy_q_pi

Run policy iteration on an agent using state-action value estimates.

    :param agent: MDP agent. Must contain a policy `pi` that has been fully initialized with instances of
    `rlai.core.ModelBasedMdpState`.
    :param environment: Model-based MDP environment to evaluate.
    :param theta: See `evaluate_q_pi`.
    :param update_in_place: See `evaluate_q_pi`.
    :return: Final state-action value estimates.
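
A hedged call sketch under the same preconstructed-`agent`/`environment` assumption:

```python
from rlai.gpi.dynamic_programming.iteration import iterate_policy_q_pi

def run_policy_iteration_q(agent, environment):
    # returns the final state-action value estimates once the policy is stable
    return iterate_policy_q_pi(
        agent=agent,
        environment=environment,
        theta=0.001,         # illustrative tolerance; see evaluate_q_pi
        update_in_place=True
    )
```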

rlai.gpi.dynamic_programming.iteration.iterate_policy_v_pi

Run policy iteration on an agent using state-value estimates.

    :param agent: MDP agent. Must contain a policy `pi` that has been fully initialized with instances of
    `rlai.core.ModelBasedMdpState`.
    :param environment: Model-based MDP environment to evaluate.
    :param theta: See `evaluate_v_pi`.
    :param update_in_place: See `evaluate_v_pi`.
    :return: Final state-value estimates.
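
The state-value counterpart, sketched under the same assumptions:

```python
from rlai.gpi.dynamic_programming.iteration import iterate_policy_v_pi

def run_policy_iteration_v(agent, environment):
    # returns the final state-value estimates once the policy is stable
    return iterate_policy_v_pi(
        agent=agent,
        environment=environment,
        theta=0.001,         # illustrative tolerance; see evaluate_v_pi
        update_in_place=True
    )
```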

rlai.gpi.dynamic_programming.iteration.iterate_value_v_pi

Run dynamic programming value iteration on an agent using state-value estimates.

    :param agent: MDP agent. Must contain a policy `pi` that has been fully initialized with instances of
    `rlai.core.ModelBasedMdpState`.
    :param environment: Model-based MDP environment to evaluate.
    :param theta: See `evaluate_v_pi`.
    :param evaluation_iterations_per_improvement: Number of policy evaluation iterations to execute for each iteration
    of improvement (e.g., passing 1 results in Equation 4.10).
    :param update_in_place: See `evaluate_v_pi`.
    :return: Final state-value estimates.
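
A sketch of classic value iteration, assuming a preconstructed `agent` and `environment`; passing a single evaluation iteration per improvement corresponds to Equation 4.10 as noted above:

```python
from rlai.gpi.dynamic_programming.iteration import iterate_value_v_pi

def run_value_iteration_v(agent, environment):
    # one evaluation sweep per improvement step gives classic value iteration
    return iterate_value_v_pi(
        agent=agent,
        environment=environment,
        theta=0.001,
        evaluation_iterations_per_improvement=1,
        update_in_place=True
    )
```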

rlai.core.environments.gamblers_problem.GamblersProblem

Gambler's problem MDP environment.

rlai.gpi.dynamic_programming.iteration.iterate_value_q_pi

Run value iteration on an agent using state-action value estimates.

    :param agent: MDP agent. Must contain a policy `pi` that has been fully initialized with instances of
    `rlai.core.ModelBasedMdpState`.
    :param environment: Model-based MDP environment to evaluate.
    :param theta: See `evaluate_q_pi`.
    :param evaluation_iterations_per_improvement: Number of policy evaluation iterations to execute for each iteration
    of improvement.
    :param update_in_place: See `evaluate_q_pi`.
    :return: Final state-action value estimates.
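
And the state-action value form, sketched under the same assumptions:

```python
from rlai.gpi.dynamic_programming.iteration import iterate_value_q_pi

def run_value_iteration_q(agent, environment):
    # returns the final state-action value estimates
    return iterate_value_q_pi(
        agent=agent,
        environment=environment,
        theta=0.001,
        evaluation_iterations_per_improvement=1,
        update_in_place=True
    )
```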