
Commit 28ebe78

Fix README encoding issues and update badges
1 parent 2da6a8b commit 28ebe78

File tree

4 files changed: +308 −48 lines changed

ch2_rl_formulation/README.md

Lines changed: 93 additions & 12 deletions
# Chapter 2 — The RL Problem Formulation

Implements: **MDP** formalism, **Bellman expectation & optimality** equations, a 4×4 **GridWorld** environment, **greedy / ε-greedy** policies, and **value iteration**.
Includes numeric examples (5.23 and 4.58), demos, visualizations, and tests aligned with the textbook.

---

## ✅ Requirements

- Python ≥ 3.10
- `pip install -r requirements.txt` (use the repo-root `requirements.txt`)

> Tip: Create and activate a virtual environment before installing.

---

## 🚀 Quickstart

```bash
# Numeric checks for examples 5.23 and 4.58
python -m ch2_rl_formulation.examples.numeric_checks

# GridWorld demo: evaluate a policy, compute Q, and act greedily
python -m ch2_rl_formulation.examples.gridworld_demo

# Plot values and a derived greedy policy (matplotlib, no explicit colors)
python -m ch2_rl_formulation.examples.plot_value_and_policy
```

---

## 📂 Layout

```
ch2_rl_formulation/
├─ __init__.py
├─ gridworld.py          # 4×4 deterministic GridWorld (tabular P, R)
├─ evaluation.py         # policy_evaluation(), q_from_v(), greedy_from_q()
├─ policies.py           # deterministic & ε-greedy policies
├─ value_iteration.py    # value_iteration(), extract greedy policy
├─ visualize.py          # minimal matplotlib plots (no fixed color maps)
├─ examples/
│  ├─ numeric_checks.py
│  ├─ gridworld_demo.py
│  └─ plot_value_and_policy.py
└─ tests/
   ├─ test_gridworld.py
   ├─ test_evaluation.py
   ├─ test_policies.py
   └─ test_value_iteration.py
```

---

## 🧠 What’s Inside (Brief API)

### `gridworld.py`
- `GridWorld4x4(step_reward=-1.0, goal=(0, 3))`
- Attributes: `S` (states), `A` (actions), `P` (S×A×S′), `R` (S×A), helpers for indexing.
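The tabular `P` and `R` attributes described above can be built in a few lines. The following is an illustrative sketch, not the module's actual code: the helper name `build_p_r` and the action ordering are assumptions.

```python
import numpy as np

def build_p_r(step_reward=-1.0, goal=(0, 3), n=4):
    """Sketch of tabular dynamics for a deterministic n x n GridWorld.

    P[s, a, s'] is the transition probability, R[s, a] the reward.
    Action order (hypothetical): 0=up, 1=right, 2=down, 3=left.
    """
    S, A = n * n, 4
    moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
    goal_s = goal[0] * n + goal[1]
    P = np.zeros((S, A, S))
    R = np.zeros((S, A))
    for s in range(S):
        r, c = divmod(s, n)
        for a, (dr, dc) in enumerate(moves):
            if s == goal_s:                # absorbing terminal state, zero reward
                P[s, a, s] = 1.0
                continue
            nr, nc = r + dr, c + dc
            if not (0 <= nr < n and 0 <= nc < n):
                nr, nc = r, c              # bump into a wall: stay put
            P[s, a, nr * n + nc] = 1.0
            R[s, a] = step_reward
    return P, R
```

Every row of `P[s, a]` sums to 1, which the tests in `tests/` presumably rely on.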
### `policies.py`
- `deterministic_policy(mapping_or_array)`
- `epsilon_greedy_policy(Q, epsilon=0.1)`

### `evaluation.py`
- `policy_evaluation(P, R, policy, gamma=0.99, tol=1e-8, max_iters=10_000)`
- `q_from_v(P, R, V, gamma=0.99)`
- `greedy_from_q(Q)`
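A minimal sketch of what `policy_evaluation` and `q_from_v` compute, assuming `policy` is an S×A array of action probabilities (this is illustrative, not the repo's exact implementation):

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Iterative policy evaluation: fixed point of the Bellman expectation equation."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        q = R + gamma * P @ V               # (S, A): expected reward + discounted next value
        V_new = (policy * q).sum(axis=1)    # average over the policy's action probabilities
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

def q_from_v(P, R, V, gamma=0.99):
    """Action values implied by a state-value function."""
    return R + gamma * P @ V                # (S, A)
```

`P @ V` contracts the S′ axis of the (S, A, S′) tensor, so both helpers stay fully vectorized.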
### `value_iteration.py`
- `value_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000)`
- Returns `(V*, π*)` where `π*` is greedy w.r.t. `V*`.
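The Bellman optimality backup behind this signature can be sketched as follows (illustrative only; the module's internals may differ):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Repeat the Bellman optimality backup until V stops changing."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        Q = R + gamma * P @ V      # (S, A) action values under the current V
        V_new = Q.max(axis=1)      # greedy backup
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi = Q.argmax(axis=1)          # greedy policy w.r.t. the converged V*
    return V_new, pi
```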
### `visualize.py`
- `plot_values(V, shape=(4, 4))`
- `plot_policy(pi, shape=(4, 4))`
- Uses matplotlib with default styles (no explicit colors set).

---

## 🧪 Tests

```bash
pytest -q ch2_rl_formulation/tests
```

- Covers grid dynamics, policy evaluation convergence, greedy extraction, and value iteration optimality.

---

## 📊 Reproducibility Notes

- Matrices `P` and `R` are tabular NumPy arrays (no randomness in dynamics).
- Examples 5.23 and 4.58 match the book’s numerics (tolerances set in tests).
- Plots avoid explicit color selection to keep CI/headless rendering consistent.

---

## 🔗 Related

- Chapter 3 (Multi-Armed Bandits): action-selection strategies under uncertainty
- Chapter 4 (Dynamic Programming): exact solutions with full model knowledge

ch3_multi_armed_bandits/README.md

Lines changed: 67 additions & 10 deletions
# Chapter 3 — Multi-Armed Bandits

Implements **ε-Greedy**, **UCB1**, and **Thompson Sampling** strategies on Bernoulli bandits.
Includes worked examples, experiments, and pytest-based validation.

---

## 🚀 Run Experiments

```bash
python -m ch3_multi_armed_bandits.experiments --K 10 --T 5000 --trials 50 --eps 0.1 --c 1.0
```

Arguments:
- `K` — number of arms
- `T` — time horizon (steps)
- `trials` — number of independent runs
- `eps` — exploration rate (for ε-greedy)
- `c` — exploration coefficient scaling the UCB1 confidence bonus
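How `eps` and `c` enter action selection can be sketched as below. This is a hedged illustration: the function name `select_action` and the `method` switch are hypothetical, not the CLI's actual internals.

```python
import numpy as np

def select_action(Q, counts, t, eps=0.1, c=1.0, rng=None, method="ucb1"):
    """Illustrative action selection for the `eps` and `c` parameters.

    Q: value estimates per arm; counts: pulls per arm; t: 1-based step index.
    """
    rng = rng or np.random.default_rng()
    if method == "eps":
        if rng.random() < eps:           # with probability eps, explore uniformly
            return int(rng.integers(len(Q)))
        return int(np.argmax(Q))         # otherwise exploit the current best arm
    # UCB1: optimism bonus that shrinks as an arm is pulled more often
    if np.any(counts == 0):              # pull every arm once before scoring
        return int(np.argmin(counts))
    ucb = Q + c * np.sqrt(2 * np.log(t) / counts)
    return int(np.argmax(ucb))
```

A larger `c` widens the confidence bonus, so rarely pulled arms are revisited more often.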
---

## 📘 Worked Examples

Scripts reproducing the numerical examples from the chapter:

- **Example 3.1:** `ex1_regret_basic.py` — cumulative regret calculation
- **Example 3.2:** `ex2_epsilon_update.py` — incremental update rule in ε-greedy
- **Example 3.3:** `ex3_ucb_score.py` — UCB1 confidence bound score computation
- **Example 3.4:** `ex4_thompson_update.py` — Bayesian update for Thompson Sampling

Run them directly, e.g.:

```bash
python -m ch3_multi_armed_bandits.examples.ex1_regret_basic
```

---

## 🧪 Tests

```bash
pytest -q ch3_multi_armed_bandits/tests
```

Covers:
- Regret monotonicity
- ε-greedy incremental update
- UCB1 bound computation
- Thompson Sampling posterior update
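The Beta–Bernoulli update that the Thompson Sampling test exercises can be sketched in a few lines (illustrative; the helper name `thompson_step` is an assumption, not the repo's API):

```python
import numpy as np

def thompson_step(alpha, beta, pull, rng=None):
    """One Thompson Sampling round over Bernoulli arms with Beta(alpha, beta) posteriors.

    `pull(arm)` returns a 0/1 reward; the chosen arm's posterior updates conjugately.
    """
    rng = rng or np.random.default_rng()
    samples = rng.beta(alpha, beta)    # one draw per arm from its posterior
    arm = int(np.argmax(samples))      # act greedily w.r.t. the sampled values
    r = pull(arm)
    alpha[arm] += r                    # success count
    beta[arm] += 1 - r                 # failure count
    return arm, r
```

Because conjugacy keeps the posterior in the Beta family, each round costs O(K) with no numerical integration.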
---

## 📂 Layout

```
ch3_multi_armed_bandits/
├─ __init__.py
├─ bandits.py       # Bernoulli bandit environment
├─ strategies.py    # ε-greedy, UCB1, Thompson Sampling
├─ experiments.py   # CLI for running large-scale experiments
├─ examples/
│  ├─ ex1_regret_basic.py
│  ├─ ex2_epsilon_update.py
│  ├─ ex3_ucb_score.py
│  └─ ex4_thompson_update.py
└─ tests/
   ├─ test_bandits.py
   ├─ test_strategies.py
   └─ test_regret.py
```

---

## 🔗 Related

- **Chapter 2 — The RL Problem Formulation**: foundational MDP setup and Bellman equations
- **Chapter 4 — Dynamic Programming**: full MDP solution methods (policy/value iteration)

ch4_dynamic_programming/README.md

Lines changed: 67 additions & 26 deletions
# Chapter 4 — Dynamic Programming

Implements **policy evaluation, policy improvement, policy iteration, and value iteration** for Markov Decision Processes with known dynamics.
Includes convergence checks, numeric examples, and GridWorld demos.

---

## ✅ Requirements

- Python ≥ 3.10
- `pip install -r requirements.txt` (use the repo-root `requirements.txt`)

---

## 🚀 Quickstart

```bash
# Run policy iteration demo
python -m ch4_dynamic_programming.examples.policy_iteration_demo

# Run value iteration demo
python -m ch4_dynamic_programming.examples.value_iteration_demo
```

---

## 📂 Layout

```
ch4_dynamic_programming/
├─ __init__.py
├─ dp.py           # core DP algorithms: policy evaluation, improvement, iteration, value iteration
├─ gridworld.py    # GridWorld environment adapted for DP
├─ examples/
│  ├─ policy_iteration_demo.py
│  └─ value_iteration_demo.py
└─ tests/
   ├─ test_policy_evaluation.py
   ├─ test_policy_iteration.py
   └─ test_value_iteration.py
```

---

## 🧠 What’s Inside (Brief API)

### `dp.py`
- `policy_evaluation(P, R, policy, gamma=0.99, tol=1e-8, max_iters=10_000)`
- `policy_improvement(Q)`
- `policy_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000)`
- `value_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000)`
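Policy iteration alternates the evaluation and improvement steps listed above until the greedy policy stops changing. A minimal sketch matching the documented signature (illustrative, not `dp.py` itself; `pi` is assumed to be a deterministic policy stored as an integer array):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    S = R.shape[0]
    pi = np.zeros(S, dtype=int)                # start from an arbitrary policy
    for _ in range(max_iters):
        # Policy evaluation: iterate the Bellman expectation backup to tolerance tol
        V = np.zeros(S)
        for _ in range(max_iters):
            q = R + gamma * P @ V              # (S, A) action values
            V_new = q[np.arange(S), pi]        # value of the action pi prescribes
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Greedy improvement w.r.t. the evaluated V
        pi_new = (R + gamma * P @ V_new).argmax(axis=1)
        if np.array_equal(pi_new, pi):         # stable policy => optimal
            return V_new, pi
        pi = pi_new
    return V_new, pi
```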
### `gridworld.py`
- `DPGridWorld` — tabular environment with full transition & reward matrices.

---

## 🧪 Tests

```bash
pytest -q ch4_dynamic_programming/tests
```

Covers:
- Convergence of policy evaluation
- Correctness of policy iteration
- Optimality of value iteration

---

## 📊 Notes

- GridWorld is small enough for exact DP solutions.
- Demonstrates how the Bellman equations can be solved exactly by iterative backups when the full model is available.

---

## 🔗 Related

- Chapter 2 (RL Problem Formulation): MDPs, Bellman equations
- Chapter 3 (Multi-Armed Bandits): exploration strategies without state transitions

ch5_monte_carlo/README.md

Lines changed: 81 additions & 0 deletions
# Chapter 5 — Monte Carlo Methods

Implements **Monte Carlo prediction** and **Monte Carlo control** (on-policy and off-policy) for learning value functions and policies from sampled episodes.
Demonstrates every-visit MC, exploring starts (ES), ε-soft on-policy control, and importance sampling for off-policy learning.

---

## ✅ Requirements

- Python ≥ 3.10
- `pip install -r requirements.txt` (use the repo-root `requirements.txt`)

---

## 🚀 Quickstart

```bash
# Monte Carlo prediction demo
python -m ch5_monte_carlo.examples.mc_prediction_demo

# On-policy MC control in GridWorld
python -m ch5_monte_carlo.examples.mc_control_onpolicy_gridworld

# Exploring starts (ES) control in GridWorld
python -m ch5_monte_carlo.examples.mc_control_es_gridworld

# Off-policy MC with importance sampling demo
python -m ch5_monte_carlo.examples.mc_offpolicy_is_demo
```

---

## 📂 Layout

```
ch5_monte_carlo/
├─ __init__.py
├─ examples/
│  ├─ mc_control_es_gridworld.py
│  ├─ mc_control_onpolicy_gridworld.py
│  ├─ mc_offpolicy_is_demo.py
│  └─ mc_prediction_demo.py
└─ tests/
   ├─ __init__.py
   ├─ test_mc_control.py
   └─ test_offpolicy_is.py
```

---

## 🧠 What’s Inside (Brief API)

### Monte Carlo Prediction
- Estimates state-value and action-value functions by averaging returns from sampled episodes.
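The averaging-of-returns idea can be sketched as every-visit MC prediction over recorded trajectories. This is a hedged sketch: the function name `mc_prediction` and the episode format are assumptions, not the demo's actual interface.

```python
import numpy as np

def mc_prediction(episodes, n_states, gamma=0.99):
    """Every-visit Monte Carlo prediction: V(s) = mean of returns observed from s.

    `episodes` is a list of trajectories, each a list of (state, reward) pairs.
    """
    returns_sum = np.zeros(n_states)
    returns_cnt = np.zeros(n_states)
    for episode in episodes:
        G = 0.0
        # Walk backwards so each step's discounted return accumulates in O(1)
        for s, r in reversed(episode):
            G = r + gamma * G
            returns_sum[s] += G
            returns_cnt[s] += 1
    # Average returns; unvisited states keep a value of 0
    return np.divide(returns_sum, returns_cnt,
                     out=np.zeros(n_states), where=returns_cnt > 0)
```

Unlike DP, no transition model `P` is needed; only sampled episodes drive the estimate.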
### Monte Carlo Control
- **Exploring Starts (ES):** ensures sufficient exploration by starting episodes from all state–action pairs.
- **On-Policy Control:** ε-soft policies improve gradually from data generated by the same policy.
- **Off-Policy Control:** learns an optimal policy from data generated by a different behavior policy, using importance sampling.
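The off-policy case can be sketched with the weighted importance-sampling update. This is an illustration under assumed interfaces (`episodes` as `(s, a, r)` triples, callable policy probabilities), not the repo's actual code:

```python
import numpy as np

def offpolicy_mc_q(episodes, behavior_prob, target_prob,
                   n_states, n_actions, gamma=1.0):
    """Off-policy every-visit MC with weighted importance sampling.

    behavior_prob(s, a) and target_prob(s, a) give each policy's action probability.
    """
    Q = np.zeros((n_states, n_actions))
    C = np.zeros((n_states, n_actions))    # cumulative importance weights
    for episode in episodes:
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):  # process the tail of the episode first
            G = r + gamma * G
            C[s, a] += W
            Q[s, a] += (W / C[s, a]) * (G - Q[s, a])   # weighted-IS incremental mean
            W *= target_prob(s, a) / behavior_prob(s, a)
            if W == 0.0:                   # target policy never takes `a`: stop early
                break
    return Q
```

Weighted (rather than ordinary) importance sampling keeps the estimate bounded, at the cost of a small bias that vanishes with more episodes.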
---

## 🧪 Tests

```bash
pytest -q ch5_monte_carlo/tests
```

Covers:
- Convergence of MC prediction
- Correctness of ES and on-policy MC control
- Stability of off-policy MC with importance sampling

---

## 🔗 Related

- Chapter 2 (RL Problem Formulation): foundation in MDPs and Bellman equations
- Chapter 3 (Multi-Armed Bandits): exploration strategies
- Chapter 4 (Dynamic Programming): exact solutions with known models
- Chapter 6 (Temporal-Difference Learning): bootstrapping methods bridging DP and MC
