SpatialBench for Presto Geospatial Benchmarking #289

@patdevinwilson

Proposal: SpatialBench for Presto Geospatial Benchmarking

Overview

This proposal introduces Apache Sedona SpatialBench as a standardized geospatial benchmark for Presto Native (CPU and GPU), following the same integration pattern that velox-testing uses for TPC-H and TPC-DS.

SpatialBench fills a critical gap: while TPC-H and TPC-DS thoroughly exercise relational operators (joins, aggregations, sorting), they contain zero geospatial operations. As NVIDIA invests in GPU-accelerating Velox's spatial functions (ST_Distance, ST_Contains, ST_Intersects, etc.), we need a reproducible benchmark to measure progress and prevent regressions.

Why SpatialBench

The Problem

Standard database benchmarks (TPC-H, TPC-DS) do not test:

  • Spatial predicate evaluation (point-in-polygon, intersection)
  • Distance-based filtering and joins
  • Geometry construction and serialization (WKB/WKT)
  • Spatial aggregations (convex hull, union)
  • Mixed spatial + relational workloads (spatial joins with GROUP BY)

Without a spatial benchmark, we cannot quantify the impact of GPU acceleration on geospatial queries, nor compare Presto Native GPU against CPU or Java workers for spatial workloads.
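To make the gap concrete, here is a sketch of the kind of mixed spatial + relational query that TPC-H and TPC-DS never exercise, written against the SpatialBench schema described below (the `z_name` column is an assumed name for illustration):

```sql
-- Hypothetical query: count trips ending in each zone -- a spatial join
-- (point-in-polygon via ST_Contains) combined with a relational GROUP BY.
SELECT z.z_name,
       count(*) AS trips
FROM trip t
JOIN zone z
  ON ST_Contains(ST_GeomFromBinary(z.z_boundary),
                 ST_GeomFromBinary(t.t_dropoffloc))
GROUP BY z.z_name
ORDER BY trips DESC;
```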

Why SpatialBench Specifically

| Criteria | SpatialBench | Alternative (ad-hoc queries) |
|---|---|---|
| Reproducible | Deterministic data generator with scale factors | No guarantee |
| Standardized | 12 queries covering all major spatial operations | Cherry-picked |
| Scalable | SF1 (~500 MB) to SF1000+ (~500 GB+) | Fixed datasets |
| Community | Apache-licensed, multi-engine (DuckDB, Spark, Sedona) | Single-use |
| Realistic | Transportation/urban mobility star schema | Synthetic points |
| Unbiased | No engine-specific optimizations baked in | Tuned to one engine |

Integration with velox-testing

Architecture

SpatialBench integrates with velox-testing following the established TPC-H pattern:

```
velox-testing/
├── presto/
│   ├── pbench/benchmarks/
│   │   ├── tpch/                      # Existing
│   │   │   ├── queries/
│   │   │   ├── duckdb_queries/
│   │   │   └── sf100.json
│   │   └── spatialbench/              # New
│   │       ├── queries/
│   │       │   └── q01.sql ... q12.sql
│   │       └── sf1.json, sf10.json
│   └── scripts/
│       ├── setup_benchmark_tables.sh  # Extended for spatialbench
│       └── run_benchmark.sh           # Already supports -b flag
├── benchmarks/
│   └── spatialbench/
│       ├── scripts/
│       │   ├── generate_data.sh       # Wraps spatialbench-cli
│       │   ├── setup_tables.sh        # Hive external tables
│       │   └── run_benchmark.sh       # Standalone runner
│       ├── queries/
│       │   └── q01.sql ... q12.sql    # Presto-syntax queries
│       └── README.md
└── benchmark_data_tools/
    └── generate_table_schemas.py      # Extended for spatialbench
```

Data Pipeline

```
┌──────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│ spatialbench-cli │────▶│  Parquet files   │────▶│  Hive external      │
│ (Rust, SF1-1000) │     │  /datasets/      │     │  tables in Presto   │
│                  │     │  spatialbench/   │     │  (file metastore)   │
└──────────────────┘     │  sf{N}/          │     └─────────────────────┘
                         │   ├── trip/      │              │
                         │   ├── building/  │              ▼
                         │   ├── zone/      │     ┌─────────────────────┐
                         │   ├── customer/  │     │  Benchmark runner   │
                         │   ├── driver/    │     │  (Q1-Q12, timing,   │
                         │   └── vehicle/   │     │   CSV results)      │
                         └──────────────────┘     └─────────────────────┘
```

Workflow (mirrors TPC-H)

| Step | TPC-H | SpatialBench |
|---|---|---|
| 1. Generate data | `generate_data_files.py --benchmark-type tpch --scale-factor 100` | `generate_data.sh -s "1 10 100"` |
| 2. Start Presto | `start_native_gpu_presto.sh` | Same |
| 3. Create tables | `setup_benchmark_tables.sh -b tpch -s tpch_sf100 -d sf100` | `setup_benchmark_tables.sh -b spatialbench -s spatialbench_sf1 -d sf1` |
| 4. Run benchmark | `run_benchmark.sh -b tpch -s tpch_sf100` | `run_benchmark.sh -b spatialbench -s spatialbench_sf1` |
| 5. Collect results | CSV + profiling | Same |

Dataset

SpatialBench uses a transportation/urban mobility star schema:

| Table | Description | Key Spatial Columns | SF1 Rows | SF1 Size |
|---|---|---|---|---|
| trip | Taxi/rideshare trips (fact) | `t_pickuploc` (WKB Point), `t_dropoffloc` (WKB Point) | ~5M | ~350 MB |
| building | Building footprints (dimension) | `b_boundary` (WKB Polygon) | ~20K | ~2 MB |
| zone | Administrative zones (dimension) | `z_boundary` (WKB Polygon) | ~160K | ~80 MB |
| customer | Customer dimension | — | ~150K | ~5 MB |
| driver | Driver dimension | — | ~10K | ~400 KB |
| vehicle | Vehicle dimension | — | ~10K | ~400 KB |

Geometry columns are stored as WKB (Well-Known Binary) in VARBINARY Parquet columns, matching Presto's ST_GeomFromBinary() input format.
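In practice this means every spatial predicate first decodes the VARBINARY column per row. A minimal example using Presto's built-in geospatial functions:

```sql
-- Decode a WKB-encoded VARBINARY column into a Presto Geometry value,
-- then render it as WKT for inspection with ST_AsText.
SELECT ST_AsText(ST_GeomFromBinary(t_pickuploc)) AS pickup_wkt
FROM trip
LIMIT 5;
```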

Queries

The 12 queries are designed to test distinct spatial operations:

Tier 1: GPU-Acceleratable (Priority)

| Query | Operation | Spatial Functions | GPU Path |
|---|---|---|---|
| Q1 | Distance filter + sort | `ST_Distance`, `ST_X`, `ST_Y` | `great_circle_distance` on GPU; coordinate extraction planned |
| Q3 | Distance filter + aggregation | `ST_Distance` + GROUP BY | Same as Q1 |
| Q7 | Detour detection | `ST_Distance` point-to-point | Direct GPU acceleration |
| Q8 | Building proximity join | `ST_Distance` join | GPU distance computation |
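The authoritative query text lives in `benchmarks/spatialbench/queries/`; as a flavor of what Tier 1 exercises, a Q1-style distance filter + sort might look like the following sketch (the `t_tripkey` column, reference point, and 2 km radius are illustrative, not taken from the benchmark):

```sql
-- Q1-style sketch: filter trips by pickup distance from a fixed point,
-- then sort by that distance. great_circle_distance, ST_X, and ST_Y are
-- exactly the functions Tier 1 targets for GPU execution.
SELECT t_tripkey,
       great_circle_distance(
           ST_Y(ST_GeomFromBinary(t_pickuploc)),
           ST_X(ST_GeomFromBinary(t_pickuploc)),
           40.7580, -73.9855) AS dist_km
FROM trip
WHERE great_circle_distance(
           ST_Y(ST_GeomFromBinary(t_pickuploc)),
           ST_X(ST_GeomFromBinary(t_pickuploc)),
           40.7580, -73.9855) < 2.0
ORDER BY dist_km
LIMIT 100;
```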

Tier 2: Requires Point-in-Polygon GPU Support

| Query | Operation | Spatial Functions | GPU Path |
|---|---|---|---|
| Q2 | Point-in-polygon filter | `ST_Intersects` | cuSpatial or custom kernel |
| Q4 | Spatial join + aggregation | `ST_Within` | Same |
| Q6 | Zone statistics | `ST_Intersects`, `ST_Within` | Same |
| Q10 | Zone LEFT JOIN | `ST_Within` | Same |
| Q11 | Cross-zone count | `ST_Within` (double join) | Same |
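To illustrate the "double join" shape in Q11, a sketch of a cross-zone count over the SpatialBench schema (the `z_zonekey` column name is assumed; the real query text is in the benchmark's `queries/` directory):

```sql
-- Q11-style sketch: count trips whose pickup and dropoff land in different
-- zones, i.e. ST_Within evaluated twice per trip against the zone table.
SELECT count(*) AS cross_zone_trips
FROM trip t
JOIN zone zp ON ST_Within(ST_GeomFromBinary(t.t_pickuploc),
                          ST_GeomFromBinary(zp.z_boundary))
JOIN zone zd ON ST_Within(ST_GeomFromBinary(t.t_dropoffloc),
                          ST_GeomFromBinary(zd.z_boundary))
WHERE zp.z_zonekey <> zd.z_zonekey;
```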

Tier 3: Complex Geometry Operations (CPU Fallback)

| Query | Operation | Spatial Functions | GPU Path |
|---|---|---|---|
| Q5 | Convex hull area | `ST_ConvexHull`, `ST_Collect`, `ST_Area` | GEOS CPU fallback |
| Q9 | Building IoU | `ST_Intersection`, `ST_Area` | GEOS CPU fallback |
| Q12 | KNN join | `ST_Distance` ranking | Distance on GPU, ranking on CPU |
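For the Tier 3 flavor, a Q9-style intersection-over-union sketch over overlapping buildings and zones (`b_buildingkey` and `z_zonekey` are assumed key columns; the real Q9 is in the benchmark's `queries/` directory):

```sql
-- Q9-style sketch: IoU of each building footprint with each zone it
-- intersects. ST_Intersection, ST_Union, and ST_Area currently run on the
-- GEOS CPU fallback rather than the GPU.
SELECT b.b_buildingkey,
       z.z_zonekey,
       ST_Area(ST_Intersection(ST_GeomFromBinary(b.b_boundary),
                               ST_GeomFromBinary(z.z_boundary)))
       / ST_Area(ST_Union(ST_GeomFromBinary(b.b_boundary),
                          ST_GeomFromBinary(z.z_boundary))) AS iou
FROM building b
JOIN zone z
  ON ST_Intersects(ST_GeomFromBinary(b.b_boundary),
                   ST_GeomFromBinary(z.z_boundary));
```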

Scale Factors

| SF | Trip Rows | Total Data Size | Target Environment |
|---|---|---|---|
| 1 | ~5M | ~500 MB | Development, CI |
| 10 | ~50M | ~5 GB | Single GPU testing |
| 100 | ~500M | ~50 GB | Multi-GPU benchmarking |
| 1000 | ~5B | ~500 GB | Large-scale evaluation |

GPU Acceleration Roadmap

Phase 1: Foundation (Current)

  • great_circle_distance via cuDF AST (fused single-kernel)
  • ST_X, ST_Y coordinate extraction
  • ST_Point construction
  • ST_Distance for Point-Point (Euclidean on GPU)

Phase 2: Point-in-Polygon

  • ST_Contains / ST_Within / ST_Intersects for Point-in-Polygon
  • Spatial join operator using GPU-accelerated point-in-polygon
  • Enables Q2, Q4, Q6, Q10, Q11

Phase 3: Complex Operations

  • ST_Area, ST_Length for simple geometries
  • ST_Envelope, ST_Buffer
  • CPU fallback for ST_Union, ST_Intersection, ST_ConvexHull

Measurement

For each phase, SpatialBench provides a clear metric:

Speedup = T(Java Presto, SF100) / T(Native GPU Presto, SF100)

Per-query breakdown shows exactly which spatial operations benefit from GPU acceleration and which remain CPU-bound, guiding investment in the next phase.

Implementation Status

| Component | Status | Location |
|---|---|---|
| Data generator script | Done | `benchmarks/spatialbench/scripts/generate_data.sh` |
| SQL queries (Q1-Q12) | Done | `benchmarks/spatialbench/queries/` |
| Table setup script | Done | `benchmarks/spatialbench/scripts/setup_tables.sh` |
| Benchmark runner | Done | `benchmarks/spatialbench/scripts/run_benchmark.sh` |
| pbench integration | Planned | `presto/pbench/benchmarks/spatialbench/` |
| `setup_benchmark_tables.sh` extension | Planned | `-b spatialbench` support |
| CI integration | Planned | GitHub Actions with SF1 |
