Merged
10 changes: 10 additions & 0 deletions README.md
@@ -239,6 +239,16 @@ To run the SQL tests:
make test
```

### Running Benchmarks

Performance benchmarks detect regressions. Run the full suite:

```bash
make release && uv run python bench/bench.py
```

See [bench/README.md](bench/README.md) and [bench/PROFILING.md](bench/PROFILING.md) for filtering, profiling, and baseline management.

## Contributing

Contributions are welcome! Please:
201 changes: 201 additions & 0 deletions bench/PROFILING.md
@@ -0,0 +1,201 @@
# Profiling Notes (Samply + DuckDB)

This doc captures what we learned while profiling `json_extract_columns` so we can repeat it without surprises.

## Build (symbols)

Use a symbol-rich build for useful stacks:

```bash
make reldebug
```

`make release` works, but stacks may show raw addresses if symbols are missing.

## CPU sampling with Samply (no local server)

To avoid launching Samply's local UI server, use `--save-only`.

```bash
mkdir -p bench/results/samply
samply record --save-only --output bench/results/samply/<case>.json.gz -- \
uv run python bench/run_benchmarks.py --filter <case>
```

Example:

```bash
samply record --save-only \
--output bench/results/samply/json_extract_columns-100k-many_patterns.json.gz -- \
uv run python bench/run_benchmarks.py --filter json_extract_columns/100k/many_patterns
```

Notes:
- `--save-only` prevents starting the local web server.
- `--no-open` only avoids opening the UI; it can still start the server.

## Offline symbolization (optional)

If you want symbols available later (even without the original binaries), add:

```bash
samply record --save-only --unstable-presymbolicate \
--output bench/results/samply/<case>.json.gz -- \
uv run python bench/run_benchmarks.py --filter <case>
```

This emits a sidecar file next to the profile:

```
bench/results/samply/<case>.json.syms.json
```

`--unstable-presymbolicate` is marked unstable by Samply, but it is useful when
you need symbols after moving the profile.

## Viewing a saved profile

Start the server without auto-opening a browser:

```bash
samply load --no-open bench/results/samply/<case>.json.gz
```

Then open `http://127.0.0.1:3000` manually (or the Firefox Profiler URL printed
by samply).

If a `.syms.json` sidecar exists in the same directory, Samply uses it for
symbolization.

## Analyzing profiles programmatically

Use `bench/analyze_profile.py` to extract function timings from a saved profile.
The profile must have been recorded with `--unstable-presymbolicate` so that the
`.syms.json` sidecar exists.

### Basic usage

```bash
python3 bench/analyze_profile.py bench/results/samply/<case>.json.gz
```

### Options

| Option | Description |
|--------|-------------|
| `--top N` | Show top N functions (default: 30) |
| `--filter STRING` | Filter functions containing STRING (case-insensitive) |
| `--thread NAME` | Analyze specific thread (default: thread with most samples) |

### Examples

```bash
# Basic analysis - shows all threads, then top functions by self/inclusive time
python3 bench/analyze_profile.py bench/results/samply/json_group_merge.json.gz

# Filter for json-related functions only
python3 bench/analyze_profile.py <profile> --filter json --top 20

# Analyze a specific thread (useful when multiple workers)
python3 bench/analyze_profile.py <profile> --thread python3

# Show more results
python3 bench/analyze_profile.py <profile> --top 50
```

### Output format

The script outputs two sections:

**Self time**: Time spent directly in each function (excluding callees).
Useful for finding CPU-intensive functions.

```
=== Self time (top 30) ===
30.3% 2335 duckdb::JsonGroupMergeApplyPatchInternal
26.2% 2015 duckdb::yyjson_mut_obj_iter_next
13.6% 1046 _platform_memcmp
```

**Inclusive time**: Time spent in each function including all callees.
Useful for finding hot call paths.

```
=== Inclusive time (top 30) ===
79.0% 6078 duckdb::AggregateFunction::UnaryScatterUpdate
40.8% 3143 duckdb::JsonGroupMergeApplyPatchInternal
```
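The split between self and inclusive time can be reproduced from raw stack samples. The sketch below is illustrative only (it is not the actual `analyze_profile.py` logic): each sample's leaf frame accrues self time, while every distinct frame in the stack accrues inclusive time.

```python
from collections import Counter

def self_and_inclusive(samples):
    """Tally self and inclusive sample counts from a list of stacks.

    Each sample is one stack: a list of function names ordered
    caller first, leaf (currently executing function) last.
    """
    self_counts = Counter()
    inclusive_counts = Counter()
    for stack in samples:
        if not stack:
            continue
        self_counts[stack[-1]] += 1   # only the leaf accrues self time
        for fn in set(stack):         # dedupe so recursion counts once per sample
            inclusive_counts[fn] += 1
    return self_counts, inclusive_counts

samples = [
    ["main", "update", "merge"],
    ["main", "update", "merge"],
    ["main", "update"],
    ["main", "scan"],
]
self_c, incl_c = self_and_inclusive(samples)
print(self_c["merge"], incl_c["update"])  # → 2 3
```

Dividing each count by the total number of samples gives the percentages shown in the report.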

### Symbol file structure

The `.syms.json` sidecar (generated by `--unstable-presymbolicate`):

```json
{
"string_table": ["symbol1", "symbol2", ...],
"data": [
{
"debug_name": "duckdb",
"symbol_table": [
{"rva": 8960, "size": 624, "symbol": 2}
]
}
]
}
```

- `string_table`: function names indexed by symbol_table entries
- `data[].debug_name`: library name (e.g., "duckdb", "libc")
- `data[].symbol_table`: maps RVA ranges to symbol indices
- Profile's `frameTable.address` contains RVAs to look up
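Given that structure, resolving a profile address amounts to finding the symbol-table entry whose `[rva, rva + size)` range contains it. A minimal lookup sketch follows; the sample values are invented for illustration:

```python
import bisect

def lookup_symbol(syms, debug_name, rva):
    """Resolve an RVA to a function name using the .syms.json layout
    described above. Returns None if no entry covers the address."""
    strings = syms["string_table"]
    for lib in syms["data"]:
        if lib["debug_name"] != debug_name:
            continue
        table = sorted(lib["symbol_table"], key=lambda e: e["rva"])
        starts = [e["rva"] for e in table]
        i = bisect.bisect_right(starts, rva) - 1  # last entry starting at or before rva
        if i >= 0 and rva < table[i]["rva"] + table[i]["size"]:
            return strings[table[i]["symbol"]]
    return None

# Invented example data in the documented shape:
syms = {
    "string_table": ["memcpy", "ApplyPatch", "IterNext"],
    "data": [{
        "debug_name": "duckdb",
        "symbol_table": [
            {"rva": 8960, "size": 624, "symbol": 1},
            {"rva": 9600, "size": 128, "symbol": 2},
        ],
    }],
}
print(lookup_symbol(syms, "duckdb", 9000))  # → ApplyPatch
```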

### Troubleshooting

**"Error: syms file not found"**
Re-record with `--unstable-presymbolicate`:
```bash
samply record --save-only --unstable-presymbolicate --output <file>.json.gz -- <cmd>
```

**Functions showing as `<frame:N>` or `fun_XXXXXX`**
Symbols not found. Possible causes:
- Build without debug symbols (use `make reldebug`)
- System libraries without debug packages
- Binary stripped after recording

## Attaching to an existing process

On Linux, you can attach by PID:

```bash
samply record -p <pid>
```

On macOS, attaching to a running process requires:

```bash
samply setup
```

(This codesigns the samply binary so it is allowed to attach to running processes.)

## DuckDB query profiles (not CPU sampling)

To collect DuckDB's JSON query profile:

```bash
uv run python bench/run_benchmarks.py --profile --filter <case>
```

This writes:

```
bench/results/profiles/<case>/query_profile.json
```

## Benchmark outputs

`run_benchmarks.py` always writes timing results to:

```
bench/results/latest.json
```
88 changes: 87 additions & 1 deletion bench/README.md
@@ -15,6 +15,31 @@ uv run python bench/bench.py
uv run python bench/compare_results.py --save-baseline
```

## Architecture

### Script Relationships

```
bench.py (orchestrator)
├─ ensure_data_exists() → generate_data.py
├─ run_sanity_checks() → sanity_checks.py
├─ run_benchmarks() → run_benchmarks.py
└─ run_comparison() → compare_results.py
```

| Script | Purpose |
|--------|---------|
| `bench.py` | One-command pipeline: generates data, validates, benchmarks, compares |
| `run_benchmarks.py` | Runs benchmarks with filtering/profiling options |
| `compare_results.py` | Compares latest vs baseline, detects regressions |
| `generate_data.py` | Creates deterministic synthetic datasets |
| `sanity_checks.py` | Validates data row counts and schema |
| `config.py` | Centralized configuration (sizes, scenarios, thresholds) |

**When to use which:**
- `bench.py` — Full pipeline, no options. Use for CI and general validation.
- `run_benchmarks.py` — Targeted runs with `--filter` and `--profile`. Use for investigation.
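The four-step order above can be sketched as a plain pipeline. This is a hypothetical rendering; the real `bench.py` may wire the steps together differently (e.g. as imported functions rather than subprocesses):

```python
import subprocess
import sys

# Same order as the tree above: generate, validate, benchmark, compare.
STEPS = [
    ["uv", "run", "python", "bench/generate_data.py"],
    ["uv", "run", "python", "bench/sanity_checks.py"],
    ["uv", "run", "python", "bench/run_benchmarks.py"],
    ["uv", "run", "python", "bench/compare_results.py"],
]

def main():
    for cmd in STEPS:
        # Stop the pipeline at the first failing step.
        if subprocess.run(cmd).returncode != 0:
            sys.exit(f"step failed: {' '.join(cmd)}")
```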

## Filtering Benchmarks

Use `--filter` with substring matching to run specific benchmarks:
@@ -46,6 +71,11 @@ All artifacts are in `bench/results/`:
| `diff.json` | Comparison between latest and baseline |
| `profiles/<case>/` | DuckDB query profiles (when collected) |

## Profiling

DuckDB query profiles are collected via `--profile`. For CPU sampling with Samply
(and `--save-only` to avoid a local server), see `bench/PROFILING.md`.

## Interpreting Results

### Statuses
@@ -63,7 +93,11 @@ All artifacts are in `bench/results/`:
- **tolerance_pct** (default: 5%): Minimum percentage change to be considered significant
- **min_effect_ms** (default: 5ms): Minimum absolute change to be considered significant

A change is classified as `UNCHANGED` if **either**:
- Absolute change < min_effect_ms, OR
- Percentage change ≤ tolerance_pct

Both thresholds guard against noise: `min_effect_ms` filters tiny absolute swings in fast queries, and `tolerance_pct` filters small relative swings in slow queries.
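As a sketch, the classification just described can be written as follows. This is illustrative only; the authoritative logic and default thresholds live in `compare_results.py` and `config.py`:

```python
def classify(baseline_ms, latest_ms, tolerance_pct=5.0, min_effect_ms=5.0):
    """Classify a timing change per the rules above (illustrative sketch)."""
    delta = latest_ms - baseline_ms
    pct = abs(delta) / baseline_ms * 100 if baseline_ms else 0.0
    # UNCHANGED if either threshold says the change is insignificant.
    if abs(delta) < min_effect_ms or pct <= tolerance_pct:
        return "UNCHANGED"
    return "REGRESSION" if delta > 0 else "IMPROVEMENT"

print(classify(100.0, 104.0))  # → UNCHANGED  (4% is within tolerance)
print(classify(100.0, 120.0))  # → REGRESSION
print(classify(2.0, 3.0))      # → UNCHANGED  (+50%, but only 1 ms)
```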

## Baseline Rules

@@ -100,6 +134,19 @@ uv run python bench/run_benchmarks.py --profile

Profiles are saved to `bench/results/profiles/<case>/query_profile.json`.

## Sanity Checks

Before running benchmarks, `bench.py` validates data integrity:

1. **Row count** — Each file has exactly the expected number of rows (1k, 10k, 100k)
2. **Schema** — Required columns exist: `json_nested`, `json_flat`, `g1e1`, `g1e3`, `g1e4`

If checks fail, regenerate data:

```bash
uv run python bench/generate_data.py
```
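Both checks reduce to simple comparisons. A hypothetical sketch of the validation logic (the real implementation lives in `sanity_checks.py`):

```python
REQUIRED_COLUMNS = {"json_nested", "json_flat", "g1e1", "g1e3", "g1e4"}
EXPECTED_ROWS = {"1k": 1_000, "10k": 10_000, "100k": 100_000}

def check_dataset(size_label, row_count, columns):
    """Return a list of problems; an empty list means the dataset passes."""
    problems = []
    expected = EXPECTED_ROWS[size_label]
    if row_count != expected:
        problems.append(f"{size_label}: expected {expected} rows, found {row_count}")
    missing = REQUIRED_COLUMNS - set(columns)
    if missing:
        problems.append(f"{size_label}: missing columns {sorted(missing)}")
    return problems

print(check_dataset("1k", 1_000, REQUIRED_COLUMNS))  # → []
print(check_dataset("10k", 9_999, {"json_nested"}))  # reports both failures
```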

## Data Generation

Data is auto-generated on first run. To regenerate manually:
@@ -108,6 +155,45 @@ uv run python bench/generate_data.py
uv run python bench/generate_data.py
```

### Dataset Structure

Each parquet file contains:

| Column | Description |
|--------|-------------|
| `json_nested` | Hierarchical JSON with 1-5 levels of nesting |
| `json_flat` | Flattened dot-notation version |
| `g1e1` | Group key with ~10 unique values |
| `g1e3` | Group key with ~1,000 unique values |
| `g1e4` | Group key with ~10,000 unique values |

Data is deterministic (seed=42) and reproducible across runs.

Dataset sizes are defined in `bench/config.py`:
- `1k`: 1,000 rows
- `10k`: 10,000 rows
- `100k`: 100,000 rows
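Determinism simply means every random draw comes from an RNG seeded with 42. A hypothetical sketch of the idea (the column contents here are invented; see `generate_data.py` for the real schema):

```python
import json
import random

def make_rows(n, seed=42):
    """Generate n deterministic rows from a seeded RNG (illustrative sketch)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        rows.append({
            "json_nested": json.dumps({"a": {"b": rng.randint(0, 9)}}),
            "g1e1": f"k{rng.randrange(10)}",     # ~10 unique values
            "g1e3": f"k{rng.randrange(1_000)}",  # ~1,000 unique values
        })
    return rows

# Same seed, same data: reruns produce byte-identical datasets.
assert make_rows(5) == make_rows(5)
```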

## Adding New Benchmarks

1. **Define scenario in `config.py`:**
```python
SCENARIOS = [
# ...existing scenarios...
{"function": "json_new_fn", "scenario": "basic"},
]
```

2. **Add query builder in `run_benchmarks.py`:**
```python
case "json_new_fn":
return f"SELECT sum(length(CAST(json_new_fn(json_nested) AS VARCHAR))) FROM {table}"
```

3. **Run and save baseline:**
```bash
uv run python bench/run_benchmarks.py --filter json_new_fn
uv run python bench/compare_results.py --save-baseline
```

Cases are auto-discovered from `SIZES × SCENARIOS` (currently 27 cases).