Merged
10 changes: 10 additions & 0 deletions README.md
@@ -239,6 +239,16 @@ To run the SQL tests:
make test
```

### Running Benchmarks

Performance benchmarks detect regressions. Run the full suite:

```bash
make release && uv run python bench/bench.py
```

See [bench/README.md](bench/README.md) and [bench/PROFILING.md](bench/PROFILING.md) for filtering, profiling, and baseline management.

## Contributing

Contributions are welcome! Please:
201 changes: 201 additions & 0 deletions bench/PROFILING.md
@@ -0,0 +1,201 @@
# Profiling Notes (Samply + DuckDB)

This doc captures what we learned while profiling `json_extract_columns` so we can repeat it without surprises.

## Build (symbols)

Use a symbol-rich build for useful stacks:

```bash
make reldebug
```

`make release` works, but stacks may show raw addresses if symbols are missing.

## CPU sampling with Samply (no local server)

To avoid launching Samply's local UI server, use `--save-only`.

```bash
mkdir -p bench/results/samply
samply record --save-only --output bench/results/samply/<case>.json.gz -- \
uv run python bench/run_benchmarks.py --filter <case>
```

Example:

```bash
samply record --save-only \
--output bench/results/samply/json_extract_columns-100k-many_patterns.json.gz -- \
uv run python bench/run_benchmarks.py --filter json_extract_columns/100k/many_patterns
```

Notes:
- `--save-only` prevents starting the local web server.
- `--no-open` only avoids opening the UI; it can still start the server.

## Offline symbolization (optional)

If you want symbols available later (even without the original binaries), add:

```bash
samply record --save-only --unstable-presymbolicate \
--output bench/results/samply/<case>.json.gz -- \
uv run python bench/run_benchmarks.py --filter <case>
```

This emits a sidecar file next to the profile:

```
bench/results/samply/<case>.json.syms.json
```

`--unstable-presymbolicate` is marked unstable by Samply, but it is useful when
you need symbols after moving the profile.

## Viewing a saved profile

Start the server without auto-opening a browser:

```bash
samply load --no-open bench/results/samply/<case>.json.gz
```

Then open `http://127.0.0.1:3000` manually (or the Firefox Profiler URL printed
by samply).

If a `.syms.json` sidecar exists in the same directory, Samply uses it for
symbolization.

## Analyzing profiles programmatically

Use `bench/analyze_profile.py` to extract function timings from a saved profile.
The profile must have been recorded with `--unstable-presymbolicate` so that the
`.syms.json` sidecar exists.

### Basic usage

```bash
python3 bench/analyze_profile.py bench/results/samply/<case>.json.gz
```

### Options

| Option | Description |
|--------|-------------|
| `--top N` | Show top N functions (default: 30) |
| `--filter STRING` | Filter functions containing STRING (case-insensitive) |
| `--thread NAME` | Analyze specific thread (default: thread with most samples) |

### Examples

```bash
# Basic analysis - shows all threads, then top functions by self/inclusive time
python3 bench/analyze_profile.py bench/results/samply/json_group_merge.json.gz

# Filter for json-related functions only
python3 bench/analyze_profile.py <profile> --filter json --top 20

# Analyze a specific thread (useful when multiple workers)
python3 bench/analyze_profile.py <profile> --thread python3

# Show more results
python3 bench/analyze_profile.py <profile> --top 50
```

### Output format

The script outputs two sections:

**Self time**: Time spent directly in each function (excluding callees).
Useful for finding CPU-intensive functions.

```
=== Self time (top 30) ===
30.3% 2335 duckdb::JsonGroupMergeApplyPatchInternal
26.2% 2015 duckdb::yyjson_mut_obj_iter_next
13.6% 1046 _platform_memcmp
```

**Inclusive time**: Time spent in each function including all callees.
Useful for finding hot call paths.

```
=== Inclusive time (top 30) ===
79.0% 6078 duckdb::AggregateFunction::UnaryScatterUpdate
40.8% 3143 duckdb::JsonGroupMergeApplyPatchInternal
```
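The split between self and inclusive time can be reproduced from raw stack samples. The sketch below is illustrative only (it is not the actual `analyze_profile.py` logic): each sample's leaf frame accrues self time, while every distinct frame in the stack accrues inclusive time.

```python
from collections import Counter

def self_and_inclusive(samples):
    """Tally self and inclusive sample counts from a list of stacks.

    Each sample is one stack: a list of function names ordered
    caller first, leaf (currently executing function) last.
    """
    self_counts = Counter()
    inclusive_counts = Counter()
    for stack in samples:
        if not stack:
            continue
        self_counts[stack[-1]] += 1   # only the leaf accrues self time
        for fn in set(stack):         # dedupe so recursion counts once per sample
            inclusive_counts[fn] += 1
    return self_counts, inclusive_counts

samples = [
    ["main", "update", "merge"],
    ["main", "update", "merge"],
    ["main", "update"],
    ["main", "scan"],
]
self_c, incl_c = self_and_inclusive(samples)
print(self_c["merge"], incl_c["update"])  # → 2 3
```

Dividing each count by the total number of samples gives the percentages shown in the report.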

### Symbol file structure

The `.syms.json` sidecar (generated by `--unstable-presymbolicate`):

```json
{
"string_table": ["symbol1", "symbol2", ...],
"data": [
{
"debug_name": "duckdb",
"symbol_table": [
{"rva": 8960, "size": 624, "symbol": 2}
]
}
]
}
```

- `string_table`: function names indexed by symbol_table entries
- `data[].debug_name`: library name (e.g., "duckdb", "libc")
- `data[].symbol_table`: maps RVA ranges to symbol indices
- Profile's `frameTable.address` contains RVAs to look up
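Given that structure, resolving a profile address amounts to finding the symbol-table entry whose `[rva, rva + size)` range contains it. A minimal lookup sketch follows; the sample values are invented for illustration:

```python
import bisect

def lookup_symbol(syms, debug_name, rva):
    """Resolve an RVA to a function name using the .syms.json layout
    described above. Returns None if no entry covers the address."""
    strings = syms["string_table"]
    for lib in syms["data"]:
        if lib["debug_name"] != debug_name:
            continue
        table = sorted(lib["symbol_table"], key=lambda e: e["rva"])
        starts = [e["rva"] for e in table]
        i = bisect.bisect_right(starts, rva) - 1  # last entry starting at or before rva
        if i >= 0 and rva < table[i]["rva"] + table[i]["size"]:
            return strings[table[i]["symbol"]]
    return None

# Invented example data in the documented shape:
syms = {
    "string_table": ["memcpy", "ApplyPatch", "IterNext"],
    "data": [{
        "debug_name": "duckdb",
        "symbol_table": [
            {"rva": 8960, "size": 624, "symbol": 1},
            {"rva": 9600, "size": 128, "symbol": 2},
        ],
    }],
}
print(lookup_symbol(syms, "duckdb", 9000))  # → ApplyPatch
```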

### Troubleshooting

**"Error: syms file not found"**
Re-record with `--unstable-presymbolicate`:
```bash
samply record --save-only --unstable-presymbolicate --output <file>.json.gz -- <cmd>
```

**Functions showing as `<frame:N>` or `fun_XXXXXX`**
Symbols not found. Possible causes:
- Build without debug symbols (use `make reldebug`)
- System libraries without debug packages
- Binary stripped after recording

## Attaching to an existing process

On Linux, you can attach by PID:

```bash
samply record -p <pid>
```

On macOS, attaching to a running process requires:

```bash
samply setup
```

(This codesigns the samply binary so it is allowed to attach to running processes.)

## DuckDB query profiles (not CPU sampling)

To collect DuckDB's JSON query profile:

```bash
uv run python bench/run_benchmarks.py --profile --filter <case>
```

This writes:

```
bench/results/profiles/<case>/query_profile.json
```

## Benchmark outputs

`run_benchmarks.py` always writes timing results to:

```
bench/results/latest.json
```
88 changes: 87 additions & 1 deletion bench/README.md
@@ -15,6 +15,31 @@ uv run python bench/bench.py
uv run python bench/compare_results.py --save-baseline
```

## Architecture

### Script Relationships

```
bench.py (orchestrator)
├─ ensure_data_exists() → generate_data.py
├─ run_sanity_checks() → sanity_checks.py
├─ run_benchmarks() → run_benchmarks.py
└─ run_comparison() → compare_results.py
```

| Script | Purpose |
|--------|---------|
| `bench.py` | One-command pipeline: generates data, validates, benchmarks, compares |
| `run_benchmarks.py` | Runs benchmarks with filtering/profiling options |
| `compare_results.py` | Compares latest vs baseline, detects regressions |
| `generate_data.py` | Creates deterministic synthetic datasets |
| `sanity_checks.py` | Validates data row counts and schema |
| `config.py` | Centralized configuration (sizes, scenarios, thresholds) |

**When to use which:**
- `bench.py` — Full pipeline, no options. Use for CI and general validation.
- `run_benchmarks.py` — Targeted runs with `--filter` and `--profile`. Use for investigation.
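The four-step order above can be sketched as a plain pipeline. This is a hypothetical rendering; the real `bench.py` may wire the steps together differently (e.g. as imported functions rather than subprocesses):

```python
import subprocess
import sys

# Same order as the tree above: generate, validate, benchmark, compare.
STEPS = [
    ["uv", "run", "python", "bench/generate_data.py"],
    ["uv", "run", "python", "bench/sanity_checks.py"],
    ["uv", "run", "python", "bench/run_benchmarks.py"],
    ["uv", "run", "python", "bench/compare_results.py"],
]

def main():
    for cmd in STEPS:
        # Stop the pipeline at the first failing step.
        if subprocess.run(cmd).returncode != 0:
            sys.exit(f"step failed: {' '.join(cmd)}")
```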

## Filtering Benchmarks

Use `--filter` with substring matching to run specific benchmarks:
@@ -46,6 +71,11 @@ All artifacts are in `bench/results/`:
| `diff.json` | Comparison between latest and baseline |
| `profiles/<case>/` | DuckDB query profiles (when collected) |

## Profiling

DuckDB query profiles are collected via `--profile`. For CPU sampling with Samply
(and `--save-only` to avoid a local server), see `bench/PROFILING.md`.

## Interpreting Results

### Statuses
@@ -63,7 +93,11 @@ All artifacts are in `bench/results/`:
- **tolerance_pct** (default: 5%): Minimum percentage change to be considered significant
- **min_effect_ms** (default: 5ms): Minimum absolute change to be considered significant

A change is classified as `UNCHANGED` if **either**:
- Absolute change < min_effect_ms, OR
- Percentage change ≤ tolerance_pct

Both thresholds guard against noise: `min_effect_ms` filters tiny absolute swings in fast queries, and `tolerance_pct` filters small relative swings in slow queries.
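As a sketch, the classification just described can be written as follows. This is illustrative only; the authoritative logic and default thresholds live in `compare_results.py` and `config.py`:

```python
def classify(baseline_ms, latest_ms, tolerance_pct=5.0, min_effect_ms=5.0):
    """Classify a timing change per the rules above (illustrative sketch)."""
    delta = latest_ms - baseline_ms
    pct = abs(delta) / baseline_ms * 100 if baseline_ms else 0.0
    # UNCHANGED if either threshold says the change is insignificant.
    if abs(delta) < min_effect_ms or pct <= tolerance_pct:
        return "UNCHANGED"
    return "REGRESSION" if delta > 0 else "IMPROVEMENT"

print(classify(100.0, 104.0))  # → UNCHANGED  (4% is within tolerance)
print(classify(100.0, 120.0))  # → REGRESSION
print(classify(2.0, 3.0))      # → UNCHANGED  (+50%, but only 1 ms)
```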

## Baseline Rules

@@ -100,6 +134,19 @@ uv run python bench/run_benchmarks.py --profile

Profiles are saved to `bench/results/profiles/<case>/query_profile.json`.

## Sanity Checks

Before running benchmarks, `bench.py` validates data integrity:

1. **Row count** — Each file has exactly the expected number of rows (1k, 10k, 100k)
2. **Schema** — Required columns exist: `json_nested`, `json_flat`, `g1e1`, `g1e3`, `g1e4`

If checks fail, regenerate data:

```bash
uv run python bench/generate_data.py
```
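Both checks reduce to simple comparisons. A hypothetical sketch of the validation logic (the real implementation lives in `sanity_checks.py`):

```python
REQUIRED_COLUMNS = {"json_nested", "json_flat", "g1e1", "g1e3", "g1e4"}
EXPECTED_ROWS = {"1k": 1_000, "10k": 10_000, "100k": 100_000}

def check_dataset(size_label, row_count, columns):
    """Return a list of problems; an empty list means the dataset passes."""
    problems = []
    expected = EXPECTED_ROWS[size_label]
    if row_count != expected:
        problems.append(f"{size_label}: expected {expected} rows, found {row_count}")
    missing = REQUIRED_COLUMNS - set(columns)
    if missing:
        problems.append(f"{size_label}: missing columns {sorted(missing)}")
    return problems

print(check_dataset("1k", 1_000, REQUIRED_COLUMNS))  # → []
print(check_dataset("10k", 9_999, {"json_nested"}))  # reports both failures
```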

## Data Generation

Data is auto-generated on first run. To regenerate manually:
@@ -108,6 +155,45 @@ uv run python bench/generate_data.py
uv run python bench/generate_data.py
```

### Dataset Structure

Each parquet file contains:

| Column | Description |
|--------|-------------|
| `json_nested` | Hierarchical JSON with 1-5 levels of nesting |
| `json_flat` | Flattened dot-notation version |
| `g1e1` | Group key with ~10 unique values |
| `g1e3` | Group key with ~1,000 unique values |
| `g1e4` | Group key with ~10,000 unique values |

Data is deterministic (seed=42) and reproducible across runs.

Dataset sizes are defined in `bench/config.py`:
- `1k`: 1,000 rows
- `10k`: 10,000 rows
- `100k`: 100,000 rows
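Determinism simply means every random draw comes from an RNG seeded with 42. A hypothetical sketch of the idea (the column contents here are invented; see `generate_data.py` for the real schema):

```python
import json
import random

def make_rows(n, seed=42):
    """Generate n deterministic rows from a seeded RNG (illustrative sketch)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        rows.append({
            "json_nested": json.dumps({"a": {"b": rng.randint(0, 9)}}),
            "g1e1": f"k{rng.randrange(10)}",     # ~10 unique values
            "g1e3": f"k{rng.randrange(1_000)}",  # ~1,000 unique values
        })
    return rows

# Same seed, same data: reruns produce byte-identical datasets.
assert make_rows(5) == make_rows(5)
```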

## Adding New Benchmarks

1. **Define scenario in `config.py`:**
```python
SCENARIOS = [
# ...existing scenarios...
{"function": "json_new_fn", "scenario": "basic"},
]
```

2. **Add query builder in `run_benchmarks.py`:**
```python
case "json_new_fn":
return f"SELECT sum(length(CAST(json_new_fn(json_nested) AS VARCHAR))) FROM {table}"
```

3. **Run and save baseline:**
```bash
uv run python bench/run_benchmarks.py --filter json_new_fn
uv run python bench/compare_results.py --save-baseline
```

Cases are auto-discovered from `SIZES × SCENARIOS` (currently 27 cases).