Implements: MDP formalism, Bellman expectation & optimality equations, a 4×4 GridWorld environment, greedy / ε-greedy policies, and value iteration.
Includes numeric examples (5.23 and 4.58), demos, visualizations, and tests aligned with the textbook.
- Python ≥ 3.10
- `pip install -r requirements.txt` (use the repo-root `requirements.txt`)
Tip: Create and activate a virtual environment before installing.
```bash
# Numeric checks for examples 5.23 and 4.58
python -m ch2_rl_formulation.examples.numeric_checks

# GridWorld demo: evaluate a policy, compute Q, and act greedily
python -m ch2_rl_formulation.examples.gridworld_demo

# Plot values and a derived greedy policy (matplotlib, no explicit colors)
python -m ch2_rl_formulation.examples.plot_value_and_policy
```

```text
ch2_rl_formulation/
├─ __init__.py
├─ gridworld.py           # 4×4 deterministic GridWorld (tabular P, R)
├─ evaluation.py          # policy_evaluation(), q_from_v(), greedy_from_q()
├─ policies.py            # deterministic & ε-greedy policies
├─ value_iteration.py     # value_iteration(), extract greedy policy
├─ visualize.py           # minimal matplotlib plots (no fixed color maps)
├─ examples/
│  ├─ numeric_checks.py
│  ├─ gridworld_demo.py
│  └─ plot_value_and_policy.py
└─ tests/
   ├─ test_gridworld.py
   ├─ test_evaluation.py
   ├─ test_policies.py
   └─ test_value_iteration.py
```
`GridWorld4x4(step_reward=-1.0, goal=(0, 3))`
- Attributes: `S` (states), `A` (actions), `P` (S×A×S′), `R` (S×A), and helpers for indexing.
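The tabular layout of `P` and `R` can be sketched as follows. This is an illustrative reconstruction of a deterministic 4×4 grid with an absorbing goal; the actual `gridworld.py` may differ in details such as action ordering.

```python
import numpy as np

# Illustrative sketch of tabular dynamics for a 4x4 deterministic grid.
# States are indexed row-major; actions are up / right / down / left.
N, ACTIONS = 4, [(-1, 0), (0, 1), (1, 0), (0, -1)]
S, A = N * N, len(ACTIONS)
goal = 0 * N + 3          # state index of cell (0, 3)
step_reward = -1.0

P = np.zeros((S, A, S))   # P[s, a, s'] = transition probability
R = np.zeros((S, A))      # R[s, a]    = immediate reward

for s in range(S):
    r, c = divmod(s, N)
    for a, (dr, dc) in enumerate(ACTIONS):
        if s == goal:                     # goal is absorbing with zero reward
            P[s, a, s] = 1.0
            continue
        nr, nc = r + dr, c + dc
        if not (0 <= nr < N and 0 <= nc < N):
            nr, nc = r, c                 # bumping a wall keeps the agent in place
        P[s, a, nr * N + nc] = 1.0
        R[s, a] = step_reward
```

Every row of `P[s, a, :]` sums to 1, so the arrays are valid inputs for the evaluation and value-iteration routines below.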
- `deterministic_policy(mapping_or_array)`
- `epsilon_greedy_policy(Q, epsilon=0.1)`
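An ε-greedy policy puts probability `1 - ε` on the greedy action and spreads `ε` uniformly across all actions. A minimal sketch matching the signature above (the module's own implementation may differ):

```python
import numpy as np

def epsilon_greedy_policy(Q, epsilon=0.1):
    """Return a stochastic policy pi[s, a]: greedy w.r.t. Q with
    probability 1 - epsilon, uniform-random with probability epsilon.
    Illustrative sketch, not the module's exact code."""
    S, A = Q.shape
    pi = np.full((S, A), epsilon / A)              # exploration mass, spread uniformly
    pi[np.arange(S), Q.argmax(axis=1)] += 1.0 - epsilon
    return pi
```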
- `policy_evaluation(P, R, policy, gamma=0.99, tol=1e-8, max_iters=10_000)`
- `q_from_v(P, R, V, gamma=0.99)`
- `greedy_from_q(Q)`
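Policy evaluation repeatedly applies the Bellman expectation backup V(s) ← Σ_a π(a|s) [R(s,a) + γ Σ_s′ P(s,a,s′) V(s′)] until the update is smaller than `tol`. A vectorised sketch with the README's signature (internals are assumed, not taken from `evaluation.py`):

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Iterative policy evaluation via the Bellman expectation backup.
    P: (S, A, S') transition tensor, R: (S, A) rewards, policy: (S, A).
    Illustrative sketch; the module's internals may differ."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        # Expected one-step return under the policy, vectorised over states.
        V_new = (policy * (R + gamma * P @ V)).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```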
- `value_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000)`
- Returns `(V*, π*)`, where `π*` is greedy w.r.t. `V*`.
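Value iteration uses the Bellman optimality backup V(s) ← max_a [R(s,a) + γ Σ_s′ P(s,a,s′) V(s′)] and, once converged, reads the greedy policy off the final Q-values. A sketch under the same signature (internals assumed):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Value iteration via the Bellman optimality backup.
    Returns (V*, pi*) with pi* greedy w.r.t. V*.
    Illustrative sketch; the module's internals may differ."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        Q = R + gamma * P @ V          # Q[s, a]: one-step lookahead values
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)     # greedy policy extracted from Q
```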
- `plot_values(V, shape=(4, 4))`
- `plot_policy(pi, shape=(4, 4))`
- Uses matplotlib with default styles (no explicit colors set).
`pytest -q ch2_rl_formulation/tests`
- Covers grid dynamics, policy evaluation convergence, greedy extraction, and value iteration optimality.
- Matrices `P` and `R` are tabular numpy arrays (no randomness in the dynamics).
- Examples 5.23 and 4.58 match the book's numerics (tolerances set in tests).
- Plots avoid explicit color selection to keep CI/headless rendering consistent.
- Chapter 3 (Multi-Armed Bandits): action–selection strategies under uncertainty
- Chapter 4 (Dynamic Programming): exact solutions with full model knowledge