
Commit 28ebe78

Fix README encoding issues and update badges
1 parent 2da6a8b commit 28ebe78

File tree

4 files changed: +308 −48 lines changed

ch2_rl_formulation/README.md

Lines changed: 93 additions & 12 deletions
# Chapter 2 — The RL Problem Formulation

Implements: **MDP** formalism, **Bellman expectation & optimality** equations, a 4×4 **GridWorld** environment, **greedy / ε-greedy** policies, and **value iteration**.
Includes numeric examples (5.23 and 4.58), demos, visualizations, and tests aligned with the textbook.

---

## ✅ Requirements

- Python ≥ 3.10
- `pip install -r requirements.txt` (use the repo-root `requirements.txt`)

> Tip: Create and activate a virtual environment before installing.

---

## 🚀 Quickstart

```bash
# Numeric checks for examples 5.23 and 4.58
python -m ch2_rl_formulation.examples.numeric_checks

# GridWorld demo: evaluate a policy, compute Q, and act greedily
python -m ch2_rl_formulation.examples.gridworld_demo

# Plot values and a derived greedy policy (matplotlib, no explicit colors)
python -m ch2_rl_formulation.examples.plot_value_and_policy
```

---

## 📂 Layout

```
ch2_rl_formulation/
├─ __init__.py
├─ gridworld.py          # 4×4 deterministic GridWorld (tabular P, R)
├─ evaluation.py         # policy_evaluation(), q_from_v(), greedy_from_q()
├─ policies.py           # deterministic & ε-greedy policies
├─ value_iteration.py    # value_iteration(), extract greedy policy
├─ visualize.py          # minimal matplotlib plots (no fixed color maps)
├─ examples/
│  ├─ numeric_checks.py
│  ├─ gridworld_demo.py
│  └─ plot_value_and_policy.py
└─ tests/
   ├─ test_gridworld.py
   ├─ test_evaluation.py
   ├─ test_policies.py
   └─ test_value_iteration.py
```

---

## 🧠 What’s Inside (Brief API)

### `gridworld.py`
- `GridWorld4x4(step_reward=-1.0, goal=(0, 3))`
- Attributes: `S` (states), `A` (actions), `P` (S×A×S′), `R` (S×A), helpers for indexing.
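The tabular `P` and `R` attributes described above can be built in a few lines. The following is an illustrative sketch, not the module's actual code: the helper name `build_p_r` and the action ordering are assumptions.

```python
import numpy as np

def build_p_r(step_reward=-1.0, goal=(0, 3), n=4):
    """Sketch of tabular dynamics for a deterministic n x n GridWorld.

    P[s, a, s'] is the transition probability, R[s, a] the reward.
    Action order (hypothetical): 0=up, 1=right, 2=down, 3=left.
    """
    S, A = n * n, 4
    moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
    goal_s = goal[0] * n + goal[1]
    P = np.zeros((S, A, S))
    R = np.zeros((S, A))
    for s in range(S):
        r, c = divmod(s, n)
        for a, (dr, dc) in enumerate(moves):
            if s == goal_s:                # absorbing terminal state, zero reward
                P[s, a, s] = 1.0
                continue
            nr, nc = r + dr, c + dc
            if not (0 <= nr < n and 0 <= nc < n):
                nr, nc = r, c              # bump into a wall: stay put
            P[s, a, nr * n + nc] = 1.0
            R[s, a] = step_reward
    return P, R
```

Every row of `P[s, a]` sums to 1, which the tests in `tests/` presumably rely on.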
### `policies.py`
- `deterministic_policy(mapping_or_array)`
- `epsilon_greedy_policy(Q, epsilon=0.1)`

### `evaluation.py`
- `policy_evaluation(P, R, policy, gamma=0.99, tol=1e-8, max_iters=10_000)`
- `q_from_v(P, R, V, gamma=0.99)`
- `greedy_from_q(Q)`
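A minimal sketch of what `policy_evaluation` and `q_from_v` compute, assuming `policy` is an S×A array of action probabilities (this is illustrative, not the repo's exact implementation):

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Iterative policy evaluation: fixed point of the Bellman expectation equation."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        q = R + gamma * P @ V               # (S, A): expected reward + discounted next value
        V_new = (policy * q).sum(axis=1)    # average over the policy's action probabilities
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

def q_from_v(P, R, V, gamma=0.99):
    """Action values implied by a state-value function."""
    return R + gamma * P @ V                # (S, A)
```

`P @ V` contracts the S′ axis of the (S, A, S′) tensor, so both helpers stay fully vectorized.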
### `value_iteration.py`
- `value_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000)`
- Returns `(V*, π*)` where `π*` is greedy w.r.t. `V*`.
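The Bellman optimality backup behind this signature can be sketched as follows (illustrative only; the module's internals may differ):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Repeat the Bellman optimality backup until V stops changing."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        Q = R + gamma * P @ V      # (S, A) action values under the current V
        V_new = Q.max(axis=1)      # greedy backup
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi = Q.argmax(axis=1)          # greedy policy w.r.t. the converged V*
    return V_new, pi
```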
### `visualize.py`
- `plot_values(V, shape=(4, 4))`
- `plot_policy(pi, shape=(4, 4))`
- Uses matplotlib with default styles (no explicit colors set).

---

## 🧪 Tests

```bash
pytest -q ch2_rl_formulation/tests
```

- Covers grid dynamics, policy evaluation convergence, greedy extraction, and value iteration optimality.

---

## 📊 Reproducibility Notes

- Matrices `P` and `R` are tabular NumPy arrays (no randomness in dynamics).
- Examples 5.23 and 4.58 match the book’s numerics (tolerances set in tests).
- Plots avoid explicit color selection to keep CI/headless rendering consistent.

---

## 🔗 Related

- Chapter 3 (Multi-Armed Bandits): action-selection strategies under uncertainty
- Chapter 4 (Dynamic Programming): exact solutions with full model knowledge

ch3_multi_armed_bandits/README.md

Lines changed: 67 additions & 10 deletions
# Chapter 3 — Multi-Armed Bandits

Implements **ε-Greedy**, **UCB1**, and **Thompson Sampling** strategies on Bernoulli bandits.
Includes worked examples, experiments, and pytest-based validation.

---

## 🚀 Run Experiments

```bash
python -m ch3_multi_armed_bandits.experiments --K 10 --T 5000 --trials 50 --eps 0.1 --c 1.0
```

Arguments:
- `K` — number of arms
- `T` — time horizon (steps)
- `trials` — number of independent runs
- `eps` — exploration rate (for ε-greedy)
- `c` — exploration coefficient scaling the UCB1 confidence bonus
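How `eps` and `c` enter action selection can be sketched as below. This is a hedged illustration: the function name `select_action` and the `method` switch are hypothetical, not the CLI's actual internals.

```python
import numpy as np

def select_action(Q, counts, t, eps=0.1, c=1.0, rng=None, method="ucb1"):
    """Illustrative action selection for the `eps` and `c` parameters.

    Q: value estimates per arm; counts: pulls per arm; t: 1-based step index.
    """
    rng = rng or np.random.default_rng()
    if method == "eps":
        if rng.random() < eps:           # with probability eps, explore uniformly
            return int(rng.integers(len(Q)))
        return int(np.argmax(Q))         # otherwise exploit the current best arm
    # UCB1: optimism bonus that shrinks as an arm is pulled more often
    if np.any(counts == 0):              # pull every arm once before scoring
        return int(np.argmin(counts))
    ucb = Q + c * np.sqrt(2 * np.log(t) / counts)
    return int(np.argmax(ucb))
```

A larger `c` widens the confidence bonus, so rarely pulled arms are revisited more often.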
---

## 📘 Worked Examples

Scripts reproducing the numerical examples from the chapter:

- **Example 3.1:** `ex1_regret_basic.py` — cumulative regret calculation
- **Example 3.2:** `ex2_epsilon_update.py` — incremental update rule in ε-greedy
- **Example 3.3:** `ex3_ucb_score.py` — UCB1 confidence bound score computation
- **Example 3.4:** `ex4_thompson_update.py` — Bayesian update for Thompson Sampling

Run them directly, e.g.:

```bash
python -m ch3_multi_armed_bandits.examples.ex1_regret_basic
```

---

## 🧪 Tests

```bash
pytest -q ch3_multi_armed_bandits/tests
```

Covers:
- Regret monotonicity
- ε-greedy incremental update
- UCB1 bound computation
- Thompson Sampling posterior update
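The Beta–Bernoulli update that the Thompson Sampling test exercises can be sketched in a few lines (illustrative; the helper name `thompson_step` is an assumption, not the repo's API):

```python
import numpy as np

def thompson_step(alpha, beta, pull, rng=None):
    """One Thompson Sampling round over Bernoulli arms with Beta(alpha, beta) posteriors.

    `pull(arm)` returns a 0/1 reward; the chosen arm's posterior updates conjugately.
    """
    rng = rng or np.random.default_rng()
    samples = rng.beta(alpha, beta)    # one draw per arm from its posterior
    arm = int(np.argmax(samples))      # act greedily w.r.t. the sampled values
    r = pull(arm)
    alpha[arm] += r                    # success count
    beta[arm] += 1 - r                 # failure count
    return arm, r
```

Because conjugacy keeps the posterior in the Beta family, each round costs O(K) with no numerical integration.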
---

## 📂 Layout

```
ch3_multi_armed_bandits/
├─ __init__.py
├─ bandits.py       # Bernoulli bandit environment
├─ strategies.py    # ε-greedy, UCB1, Thompson Sampling
├─ experiments.py   # CLI for running large-scale experiments
├─ examples/
│  ├─ ex1_regret_basic.py
│  ├─ ex2_epsilon_update.py
│  ├─ ex3_ucb_score.py
│  └─ ex4_thompson_update.py
└─ tests/
   ├─ test_bandits.py
   ├─ test_strategies.py
   └─ test_regret.py
```

---

## 🔗 Related

- **Chapter 2 — The RL Problem Formulation**: foundational MDP setup and Bellman equations
- **Chapter 4 — Dynamic Programming**: full MDP solution methods (policy/value iteration)

ch4_dynamic_programming/README.md

Lines changed: 67 additions & 26 deletions
# Chapter 4 — Dynamic Programming

Implements **policy evaluation, policy improvement, policy iteration, and value iteration** for Markov Decision Processes with known dynamics.
Includes convergence checks, numeric examples, and GridWorld demos.

---

## ✅ Requirements

- Python ≥ 3.10
- `pip install -r requirements.txt` (use the repo-root `requirements.txt`)

---

## 🚀 Quickstart

```bash
# Run policy iteration demo
python -m ch4_dynamic_programming.examples.policy_iteration_demo

# Run value iteration demo
python -m ch4_dynamic_programming.examples.value_iteration_demo
```

---

## 📂 Layout

```
ch4_dynamic_programming/
├─ __init__.py
├─ dp.py           # core DP algorithms: policy evaluation, improvement, iteration, value iteration
├─ gridworld.py    # GridWorld environment adapted for DP
├─ examples/
│  ├─ policy_iteration_demo.py
│  └─ value_iteration_demo.py
└─ tests/
   ├─ test_policy_evaluation.py
   ├─ test_policy_iteration.py
   └─ test_value_iteration.py
```

---

## 🧠 What’s Inside (Brief API)

### `dp.py`
- `policy_evaluation(P, R, policy, gamma=0.99, tol=1e-8, max_iters=10_000)`
- `policy_improvement(Q)`
- `policy_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000)`
- `value_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000)`
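Policy iteration alternates the evaluation and improvement steps listed above until the greedy policy stops changing. A minimal sketch matching the documented signature (illustrative, not `dp.py` itself; `pi` is assumed to be a deterministic policy stored as an integer array):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    S = R.shape[0]
    pi = np.zeros(S, dtype=int)                # start from an arbitrary policy
    for _ in range(max_iters):
        # Policy evaluation: iterate the Bellman expectation backup to tolerance tol
        V = np.zeros(S)
        for _ in range(max_iters):
            q = R + gamma * P @ V              # (S, A) action values
            V_new = q[np.arange(S), pi]        # value of the action pi prescribes
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Greedy improvement w.r.t. the evaluated V
        pi_new = (R + gamma * P @ V_new).argmax(axis=1)
        if np.array_equal(pi_new, pi):         # stable policy => optimal
            return V_new, pi
        pi = pi_new
    return V_new, pi
```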
### `gridworld.py`
- `DPGridWorld` — tabular environment with full transition & reward matrices.

---

## 🧪 Tests

```bash
pytest -q ch4_dynamic_programming/tests
```

Covers:
- Convergence of policy evaluation
- Correctness of policy iteration
- Optimality of value iteration

---

## 📊 Notes

- GridWorld is small enough for exact DP solutions.
- Demonstrates how the Bellman equations can be solved exactly by iterative backups when the full model is available.

---

## 🔗 Related

- Chapter 2 (RL Problem Formulation): MDPs, Bellman equations
- Chapter 3 (Multi-Armed Bandits): exploration strategies without state transitions

ch5_monte_carlo/README.md

Lines changed: 81 additions & 0 deletions
# Chapter 5 — Monte Carlo Methods

Implements **Monte Carlo prediction** and **Monte Carlo control** (on-policy and off-policy) for learning value functions and policies from sampled episodes.
Demonstrates every-visit MC, exploring starts (ES), ε-soft on-policy control, and importance sampling for off-policy learning.

---

## ✅ Requirements

- Python ≥ 3.10
- `pip install -r requirements.txt` (use the repo-root `requirements.txt`)

---

## 🚀 Quickstart

```bash
# Monte Carlo prediction demo
python -m ch5_monte_carlo.examples.mc_prediction_demo

# On-policy MC control in GridWorld
python -m ch5_monte_carlo.examples.mc_control_onpolicy_gridworld

# Exploring starts (ES) control in GridWorld
python -m ch5_monte_carlo.examples.mc_control_es_gridworld

# Off-policy MC with importance sampling demo
python -m ch5_monte_carlo.examples.mc_offpolicy_is_demo
```

---

## 📂 Layout

```
ch5_monte_carlo/
├─ __init__.py
├─ examples/
│  ├─ mc_control_es_gridworld.py
│  ├─ mc_control_onpolicy_gridworld.py
│  ├─ mc_offpolicy_is_demo.py
│  └─ mc_prediction_demo.py
└─ tests/
   ├─ __init__.py
   ├─ test_mc_control.py
   └─ test_offpolicy_is.py
```

---

## 🧠 What’s Inside (Brief API)

### Monte Carlo Prediction
- Estimates state-value and action-value functions by averaging returns from sampled episodes.
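The averaging-of-returns idea can be sketched as every-visit MC prediction over recorded trajectories. This is a hedged sketch: the function name `mc_prediction` and the episode format are assumptions, not the demo's actual interface.

```python
import numpy as np

def mc_prediction(episodes, n_states, gamma=0.99):
    """Every-visit Monte Carlo prediction: V(s) = mean of returns observed from s.

    `episodes` is a list of trajectories, each a list of (state, reward) pairs.
    """
    returns_sum = np.zeros(n_states)
    returns_cnt = np.zeros(n_states)
    for episode in episodes:
        G = 0.0
        # Walk backwards so each step's discounted return accumulates in O(1)
        for s, r in reversed(episode):
            G = r + gamma * G
            returns_sum[s] += G
            returns_cnt[s] += 1
    # Average returns; unvisited states keep a value of 0
    return np.divide(returns_sum, returns_cnt,
                     out=np.zeros(n_states), where=returns_cnt > 0)
```

Unlike DP, no transition model `P` is needed; only sampled episodes drive the estimate.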
### Monte Carlo Control
- **Exploring Starts (ES):** ensures sufficient exploration by starting episodes from all state–action pairs.
- **On-Policy Control:** ε-soft policies improve gradually from data generated by the same policy.
- **Off-Policy Control:** learns an optimal policy from data generated by a different behavior policy, using importance sampling.
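The off-policy case can be sketched with the weighted importance-sampling update. This is an illustration under assumed interfaces (`episodes` as `(s, a, r)` triples, callable policy probabilities), not the repo's actual code:

```python
import numpy as np

def offpolicy_mc_q(episodes, behavior_prob, target_prob,
                   n_states, n_actions, gamma=1.0):
    """Off-policy every-visit MC with weighted importance sampling.

    behavior_prob(s, a) and target_prob(s, a) give each policy's action probability.
    """
    Q = np.zeros((n_states, n_actions))
    C = np.zeros((n_states, n_actions))    # cumulative importance weights
    for episode in episodes:
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):  # process the tail of the episode first
            G = r + gamma * G
            C[s, a] += W
            Q[s, a] += (W / C[s, a]) * (G - Q[s, a])   # weighted-IS incremental mean
            W *= target_prob(s, a) / behavior_prob(s, a)
            if W == 0.0:                   # target policy never takes `a`: stop early
                break
    return Q
```

Weighted (rather than ordinary) importance sampling keeps the estimate bounded, at the cost of a small bias that vanishes with more episodes.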
---

## 🧪 Tests

```bash
pytest -q ch5_monte_carlo/tests
```

Covers:
- Convergence of MC prediction
- Correctness of ES and on-policy MC control
- Stability of off-policy MC with importance sampling

---

## 🔗 Related

- Chapter 2 (RL Problem Formulation): foundation in MDPs and Bellman equations
- Chapter 3 (Multi-Armed Bandits): exploration strategies
- Chapter 4 (Dynamic Programming): exact solutions with known models
- Chapter 6 (Temporal-Difference Learning): bootstrapping methods bridging DP and MC
