SpatialBench for Presto Geospatial Benchmarking #289

@patdevinwilson

Proposal: SpatialBench for Presto Geospatial Benchmarking

Overview

This proposal introduces Apache Sedona SpatialBench as a standardized geospatial benchmark for Presto Native (CPU and GPU), following the same integration pattern that velox-testing uses for TPC-H and TPC-DS.

SpatialBench fills a critical gap: while TPC-H and TPC-DS thoroughly exercise relational operators (joins, aggregations, sorting), they contain zero geospatial operations. As NVIDIA invests in GPU-accelerating Velox's spatial functions (ST_Distance, ST_Contains, ST_Intersects, etc.), we need a reproducible benchmark to measure progress and prevent regressions.

Why SpatialBench

The Problem

Standard database benchmarks (TPC-H, TPC-DS) do not test:

  • Spatial predicate evaluation (point-in-polygon, intersection)
  • Distance-based filtering and joins
  • Geometry construction and serialization (WKB/WKT)
  • Spatial aggregations (convex hull, union)
  • Mixed spatial + relational workloads (spatial joins with GROUP BY)

Without a spatial benchmark, we cannot quantify the impact of GPU acceleration on geospatial queries, nor compare Presto Native GPU against CPU or Java workers for spatial workloads.
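To make the gap concrete, here is a sketch of the kind of mixed spatial + relational query that TPC-H and TPC-DS never exercise, written against the SpatialBench schema described below (the `z_name` column is an assumed name for illustration):

```sql
-- Hypothetical query: count trips ending in each zone -- a spatial join
-- (point-in-polygon via ST_Contains) combined with a relational GROUP BY.
SELECT z.z_name,
       count(*) AS trips
FROM trip t
JOIN zone z
  ON ST_Contains(ST_GeomFromBinary(z.z_boundary),
                 ST_GeomFromBinary(t.t_dropoffloc))
GROUP BY z.z_name
ORDER BY trips DESC;
```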

Why SpatialBench Specifically

| Criteria | SpatialBench | Alternative (ad-hoc queries) |
|---|---|---|
| Reproducible | Deterministic data generator with scale factors | No guarantee |
| Standardized | 12 queries covering all major spatial operations | Cherry-picked |
| Scalable | SF1 (~500 MB) to SF1000+ (~500 GB+) | Fixed datasets |
| Community | Apache-licensed, multi-engine (DuckDB, Spark, Sedona) | Single-use |
| Realistic | Transportation/urban mobility star schema | Synthetic points |
| Unbiased | No engine-specific optimizations baked in | Tuned to one engine |

Integration with velox-testing

Architecture

SpatialBench integrates with velox-testing following the established TPC-H pattern:

```
velox-testing/
├── presto/
│   ├── pbench/benchmarks/
│   │   ├── tpch/                      # Existing
│   │   │   ├── queries/
│   │   │   ├── duckdb_queries/
│   │   │   └── sf100.json
│   │   └── spatialbench/              # New
│   │       ├── queries/
│   │       │   └── q01.sql ... q12.sql
│   │       └── sf1.json, sf10.json
│   └── scripts/
│       ├── setup_benchmark_tables.sh  # Extended for spatialbench
│       └── run_benchmark.sh           # Already supports -b flag
├── benchmarks/
│   └── spatialbench/
│       ├── scripts/
│       │   ├── generate_data.sh       # Wraps spatialbench-cli
│       │   ├── setup_tables.sh        # Hive external tables
│       │   └── run_benchmark.sh       # Standalone runner
│       ├── queries/
│       │   └── q01.sql ... q12.sql    # Presto-syntax queries
│       └── README.md
└── benchmark_data_tools/
    └── generate_table_schemas.py      # Extended for spatialbench
```

Data Pipeline

```
┌──────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│ spatialbench-cli │────▶│  Parquet files   │────▶│  Hive external      │
│ (Rust, SF1-1000) │     │  /datasets/      │     │  tables in Presto   │
│                  │     │  spatialbench/   │     │  (file metastore)   │
└──────────────────┘     │  sf{N}/          │     └─────────────────────┘
                         │   ├── trip/      │              │
                         │   ├── building/  │              ▼
                         │   ├── zone/      │     ┌─────────────────────┐
                         │   ├── customer/  │     │  Benchmark runner   │
                         │   ├── driver/    │     │  (Q1-Q12, timing,   │
                         │   └── vehicle/   │     │   CSV results)      │
                         └──────────────────┘     └─────────────────────┘
```

Workflow (mirrors TPC-H)

| Step | TPC-H | SpatialBench |
|---|---|---|
| 1. Generate data | `generate_data_files.py --benchmark-type tpch --scale-factor 100` | `generate_data.sh -s "1 10 100"` |
| 2. Start Presto | `start_native_gpu_presto.sh` | Same |
| 3. Create tables | `setup_benchmark_tables.sh -b tpch -s tpch_sf100 -d sf100` | `setup_benchmark_tables.sh -b spatialbench -s spatialbench_sf1 -d sf1` |
| 4. Run benchmark | `run_benchmark.sh -b tpch -s tpch_sf100` | `run_benchmark.sh -b spatialbench -s spatialbench_sf1` |
| 5. Collect results | CSV + profiling | Same |

Dataset

SpatialBench uses a transportation/urban mobility star schema:

| Table | Description | Key Spatial Columns | SF1 Rows | SF1 Size |
|---|---|---|---|---|
| trip | Taxi/rideshare trips (fact) | `t_pickuploc` (WKB Point), `t_dropoffloc` (WKB Point) | ~5M | ~350 MB |
| building | Building footprints (dimension) | `b_boundary` (WKB Polygon) | ~20K | ~2 MB |
| zone | Administrative zones (dimension) | `z_boundary` (WKB Polygon) | ~160K | ~80 MB |
| customer | Customer dimension | — | ~150K | ~5 MB |
| driver | Driver dimension | — | ~10K | ~400 KB |
| vehicle | Vehicle dimension | — | ~10K | ~400 KB |

Geometry columns are stored as WKB (Well-Known Binary) in VARBINARY Parquet columns, matching Presto's ST_GeomFromBinary() input format.
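In practice this means every spatial predicate first decodes the VARBINARY column per row. A minimal example using Presto's built-in geospatial functions:

```sql
-- Decode a WKB-encoded VARBINARY column into a Presto Geometry value,
-- then render it as WKT for inspection with ST_AsText.
SELECT ST_AsText(ST_GeomFromBinary(t_pickuploc)) AS pickup_wkt
FROM trip
LIMIT 5;
```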

Queries

The 12 queries are designed to test distinct spatial operations:

Tier 1: GPU-Acceleratable (Priority)

| Query | Operation | Spatial Functions | GPU Path |
|---|---|---|---|
| Q1 | Distance filter + sort | `ST_Distance`, `ST_X`, `ST_Y` | `great_circle_distance` on GPU; coordinate extraction planned |
| Q3 | Distance filter + aggregation | `ST_Distance` + GROUP BY | Same as Q1 |
| Q7 | Detour detection | `ST_Distance` point-to-point | Direct GPU acceleration |
| Q8 | Building proximity join | `ST_Distance` join | GPU distance computation |
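The authoritative query text lives in `benchmarks/spatialbench/queries/`; as a flavor of what Tier 1 exercises, a Q1-style distance filter + sort might look like the following sketch (the `t_tripkey` column, reference point, and 2 km radius are illustrative, not taken from the benchmark):

```sql
-- Q1-style sketch: filter trips by pickup distance from a fixed point,
-- then sort by that distance. great_circle_distance, ST_X, and ST_Y are
-- exactly the functions Tier 1 targets for GPU execution.
SELECT t_tripkey,
       great_circle_distance(
           ST_Y(ST_GeomFromBinary(t_pickuploc)),
           ST_X(ST_GeomFromBinary(t_pickuploc)),
           40.7580, -73.9855) AS dist_km
FROM trip
WHERE great_circle_distance(
           ST_Y(ST_GeomFromBinary(t_pickuploc)),
           ST_X(ST_GeomFromBinary(t_pickuploc)),
           40.7580, -73.9855) < 2.0
ORDER BY dist_km
LIMIT 100;
```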

Tier 2: Requires Point-in-Polygon GPU Support

| Query | Operation | Spatial Functions | GPU Path |
|---|---|---|---|
| Q2 | Point-in-polygon filter | `ST_Intersects` | cuSpatial or custom kernel |
| Q4 | Spatial join + aggregation | `ST_Within` | Same |
| Q6 | Zone statistics | `ST_Intersects`, `ST_Within` | Same |
| Q10 | Zone LEFT JOIN | `ST_Within` | Same |
| Q11 | Cross-zone count | `ST_Within` (double join) | Same |
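To illustrate the "double join" shape in Q11, a sketch of a cross-zone count over the SpatialBench schema (the `z_zonekey` column name is assumed; the real query text is in the benchmark's `queries/` directory):

```sql
-- Q11-style sketch: count trips whose pickup and dropoff land in different
-- zones, i.e. ST_Within evaluated twice per trip against the zone table.
SELECT count(*) AS cross_zone_trips
FROM trip t
JOIN zone zp ON ST_Within(ST_GeomFromBinary(t.t_pickuploc),
                          ST_GeomFromBinary(zp.z_boundary))
JOIN zone zd ON ST_Within(ST_GeomFromBinary(t.t_dropoffloc),
                          ST_GeomFromBinary(zd.z_boundary))
WHERE zp.z_zonekey <> zd.z_zonekey;
```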

Tier 3: Complex Geometry Operations (CPU Fallback)

| Query | Operation | Spatial Functions | GPU Path |
|---|---|---|---|
| Q5 | Convex hull area | `ST_ConvexHull`, `ST_Collect`, `ST_Area` | GEOS CPU fallback |
| Q9 | Building IoU | `ST_Intersection`, `ST_Area` | GEOS CPU fallback |
| Q12 | KNN join | `ST_Distance` ranking | Distance on GPU, ranking on CPU |
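For the Tier 3 flavor, a Q9-style intersection-over-union sketch over overlapping buildings and zones (`b_buildingkey` and `z_zonekey` are assumed key columns; the real Q9 is in the benchmark's `queries/` directory):

```sql
-- Q9-style sketch: IoU of each building footprint with each zone it
-- intersects. ST_Intersection, ST_Union, and ST_Area currently run on the
-- GEOS CPU fallback rather than the GPU.
SELECT b.b_buildingkey,
       z.z_zonekey,
       ST_Area(ST_Intersection(ST_GeomFromBinary(b.b_boundary),
                               ST_GeomFromBinary(z.z_boundary)))
       / ST_Area(ST_Union(ST_GeomFromBinary(b.b_boundary),
                          ST_GeomFromBinary(z.z_boundary))) AS iou
FROM building b
JOIN zone z
  ON ST_Intersects(ST_GeomFromBinary(b.b_boundary),
                   ST_GeomFromBinary(z.z_boundary));
```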

Scale Factors

| SF | Trip Rows | Total Data Size | Target Environment |
|---|---|---|---|
| 1 | ~5M | ~500 MB | Development, CI |
| 10 | ~50M | ~5 GB | Single GPU testing |
| 100 | ~500M | ~50 GB | Multi-GPU benchmarking |
| 1000 | ~5B | ~500 GB | Large-scale evaluation |

GPU Acceleration Roadmap

Phase 1: Foundation (Current)

  • great_circle_distance via cuDF AST (fused single-kernel)
  • ST_X, ST_Y coordinate extraction
  • ST_Point construction
  • ST_Distance for Point-Point (Euclidean on GPU)

Phase 2: Point-in-Polygon

  • ST_Contains / ST_Within / ST_Intersects for Point-in-Polygon
  • Spatial join operator using GPU-accelerated point-in-polygon
  • Enables Q2, Q4, Q6, Q10, Q11

Phase 3: Complex Operations

  • ST_Area, ST_Length for simple geometries
  • ST_Envelope, ST_Buffer
  • CPU fallback for ST_Union, ST_Intersection, ST_ConvexHull

Measurement

For each phase, SpatialBench provides a clear metric:

Speedup = T(Java Presto, SF100) / T(Native GPU Presto, SF100)

Per-query breakdown shows exactly which spatial operations benefit from GPU acceleration and which remain CPU-bound, guiding investment in the next phase.

Implementation Status

| Component | Status | Location |
|---|---|---|
| Data generator script | Done | `benchmarks/spatialbench/scripts/generate_data.sh` |
| SQL queries (Q1-Q12) | Done | `benchmarks/spatialbench/queries/` |
| Table setup script | Done | `benchmarks/spatialbench/scripts/setup_tables.sh` |
| Benchmark runner | Done | `benchmarks/spatialbench/scripts/run_benchmark.sh` |
| pbench integration | Planned | `presto/pbench/benchmarks/spatialbench/` |
| `setup_benchmark_tables.sh` extension | Planned | `-b spatialbench` support |
| CI integration | Planned | GitHub Actions with SF1 |
