feat: Add comprehensive KNN join integration tests and benchmarks #65

Merged
jiayuasu merged 2 commits into apache:main from zhangfengcdt:feature/knn_join.integration.test on Sep 12, 2025

@zhangfengcdt
Member

Summary

  • Add integration tests for KNN join functionality with synthetic data
  • Include cross-verification against PostGIS for correctness validation
  • Add comprehensive benchmarking comparing SedonaDB, PostGIS, and DuckDB
  • Test various scenarios: basic joins, polygon joins, edge cases, and attribute preservation
  • Performance results show SedonaDB is 8-655× faster than competitors

Integration Tests Added

  • Basic KNN Join: Point-to-point KNN queries with configurable k values
  • Mixed Geometry Types: Point-to-polygon KNN operations
  • Edge Cases: Handle scenarios where k > available targets
  • Attribute Preservation: Verify additional columns are maintained in results
  • Correctness Validation: Cross-verify results against PostGIS using equivalent queries
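
The PR cross-verifies results against PostGIS. For small point-to-point inputs, the same idea can be sketched with a brute-force reference in plain Python. This is only an illustration and not the PR's test code: the function name `brute_force_knn`, the `(id, x, y)` tuple layout, the tie-breaking by id, and the choice to return every available target when k exceeds the target count are all assumptions made for this sketch.

```python
import math


def brute_force_knn(queries, targets, k):
    """For each query point, return the ids of up to k nearest targets.

    queries and targets are lists of (id, x, y) tuples (hypothetical layout).
    Ties are broken by target id so the result is deterministic.
    """
    result = {}
    for qid, qx, qy in queries:
        ranked = sorted(
            targets,
            key=lambda t: (math.hypot(t[1] - qx, t[2] - qy), t[0]),
        )
        # Slicing past the end is safe, so k > len(targets) returns them all
        # (assumed edge-case behaviour for this sketch).
        result[qid] = [tid for tid, _, _ in ranked[:k]]
    return result


queries = [(1, 0.0, 0.0), (2, 10.0, 10.0)]
targets = [(100, 1.0, 0.0), (101, 0.0, 2.0), (102, 9.0, 9.0)]

print(brute_force_knn(queries, targets, k=2))
print(brute_force_knn(queries, targets, k=5))  # k > available targets
```

A reference like this is quadratic in the input size, so it is only suitable for checking small synthetic datasets against the result of a real KNN join.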

Benchmark Results
Performance comparison across three engines using a small dataset (100 trips × 1000 buildings) and a large dataset (1000 trips × 2000 buildings):

Large Dataset Results

| Engine | k=1 | k=5 | k=10 |
| --- | --- | --- | --- |
| SedonaDB | 3.32ms | 6.50ms | 9.38ms |
| DuckDB | 211.19ms | 209.19ms | 202.81ms |
| PostGIS | 2171.37ms | 2237.47ms | 2221.19ms |

Small Dataset Results

| Engine | k=1 | k=5 | k=10 |
| --- | --- | --- | --- |
| SedonaDB | 0.93ms | 1.16ms | 1.49ms |
| DuckDB | 11.52ms | 11.71ms | 12.21ms |
| PostGIS | 161.93ms | 162.62ms | 165.21ms |

SedonaDB demonstrates 8-655× faster performance than competitors across all scenarios.
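
The speedup range can be recomputed from the reported timings. The sketch below copies the rounded millisecond values from the two result tables and divides each competitor's time by SedonaDB's time for the same dataset and k. The dictionary layout and the helper name `speedups` are illustrative choices, not part of the benchmark suite.

```python
# Timings in milliseconds, copied from the benchmark results above.
timings_ms = {
    "large": {
        "SedonaDB": {1: 3.32, 5: 6.50, 10: 9.38},
        "DuckDB": {1: 211.19, 5: 209.19, 10: 202.81},
        "PostGIS": {1: 2171.37, 5: 2237.47, 10: 2221.19},
    },
    "small": {
        "SedonaDB": {1: 0.93, 5: 1.16, 10: 1.49},
        "DuckDB": {1: 11.52, 5: 11.71, 10: 12.21},
        "PostGIS": {1: 161.93, 5: 162.62, 10: 165.21},
    },
}


def speedups(timings, baseline="SedonaDB"):
    """Return {(dataset, engine, k): engine_time / baseline_time}."""
    ratios = {}
    for dataset, engines in timings.items():
        base = engines[baseline]
        for engine, by_k in engines.items():
            if engine == baseline:
                continue
            for k, ms in by_k.items():
                ratios[(dataset, engine, k)] = ms / base[k]
    return ratios


ratios = speedups(timings_ms)
for key, ratio in sorted(ratios.items()):
    print(key, f"{ratio:.1f}x")
print(f"min {min(ratios.values()):.1f}x, max {max(ratios.values()):.1f}x")
```

On the rounded values, the smallest ratio is about 8.2× (DuckDB, small dataset, k=10) and the largest is about 654× (PostGIS, large dataset, k=1), in line with the headline range up to rounding.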

@zhangfengcdt zhangfengcdt marked this pull request as ready for review September 11, 2025 20:49
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces comprehensive KNN join integration tests and benchmarks to validate functionality and measure performance across multiple database engines. The implementation adds thorough test coverage for SedonaDB's KNN join capabilities and provides comparative benchmarking against PostGIS and DuckDB.

Key changes:

  • Add extensive integration tests covering basic joins, mixed geometry types, edge cases, and attribute preservation
  • Implement cross-verification against PostGIS to ensure correctness
  • Add comprehensive benchmarking framework comparing SedonaDB, PostGIS, and DuckDB performance

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| python/sedonadb/tests/test_knnjoin.py | New comprehensive test suite for KNN join functionality with various scenarios and PostGIS validation |
| benchmarks/test_knn.py | Enhanced benchmark suite with multi-engine comparison and improved test coverage |

Member

@paleolimbot paleolimbot left a comment

Thank you!

@jiayuasu jiayuasu merged commit 79cea80 into apache:main Sep 12, 2025
6 checks passed
@2010YOUY01
Contributor

It seems that even for the largest dataset (2000 buildings × 1000 trips), SedonaDB uses only 1 CPU core for the SpatialJoinExec phase. I think that with a larger dataset on the join's probe side (trips), the speedup relative to DuckDB/PostGIS could be even higher. (This is really impressive work!)

I tried two data sizes:
2000 buildings x 1000 trips: 25ms
2000 buildings x 20000 trips: 50ms

Reproducer in sedona-cli:

-- Setup
CREATE OR REPLACE VIEW knn_buildings AS
SELECT
    geometry AS geom,
    ROUND(random() * 1000) AS building_id,
    'Building_' || CAST(ROUND(random() * 1000) AS VARCHAR) AS name
FROM sd_random_geometry('{
    "geom_type": "Polygon",
    "target_rows": 2000,
    "vertices_per_linestring_range": [4, 8],
    "size_range": [0.001, 0.01],
    "seed": 42
}');

CREATE OR REPLACE VIEW knn_trips AS
SELECT
    geometry AS geom,
    ROUND(random() * 100000) AS trip_id
FROM sd_random_geometry('{
    "geom_type": "Point",
    "target_rows": 100000,
    "seed": 43
}');



-- Benchmark Query
WITH trip_sample AS (
    SELECT trip_id, geom AS trip_geom
    FROM knn_trips
    LIMIT 20000                               -- <== This line is changed
),
building_with_geom AS (
    SELECT building_id, name, geom AS building_geom
    FROM knn_buildings
)
SELECT
    t.trip_id,
    b.building_id,
    b.name,
    ST_Distance(t.trip_geom, b.building_geom) AS distance
FROM trip_sample t
JOIN building_with_geom b ON ST_KNN(t.trip_geom, b.building_geom, 5, FALSE)
ORDER BY t.trip_id, distance;

Reasons

The data source sd_random_geometry seems to produce batches of about 1k rows, and the execution layer uses the number of CPU cores as the default degree of parallelism. If there is only one batch, only one partition does any work. My machine has 14 cores, so with 20000 rows on the probe side, all partitions are kept busy, each with about as much work as the single partition had before.

This can be verified by running EXPLAIN ANALYZE VERBOSE on the query and checking the output_rows field inside SpatialJoinExec.
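
The batching argument above can be sketched as simple arithmetic. This is a deliberately simplified model, not a description of the actual scheduler: it assumes each batch lands on a distinct partition until partitions run out, and it takes the batch size (1000 rows) and partition count (14 cores) from the comment above. The function name `busy_partitions` is hypothetical.

```python
import math


def busy_partitions(probe_rows, batch_rows=1000, partitions=14):
    """Estimate how many partitions receive at least one batch.

    Simplified model: batches are spread one per partition, so the number of
    busy partitions is capped by both the batch count and the partition count.
    """
    batches = math.ceil(probe_rows / batch_rows)
    return min(batches, partitions)


for rows in (1000, 20000):
    print(rows, "probe rows ->", busy_partitions(rows), "busy partition(s)")
```

Under these assumptions, 1000 probe rows form a single batch and keep one partition busy, while 20000 probe rows form 20 batches and can occupy all 14 partitions. That is consistent with the observation that 20× more trips took only about twice as long (25ms versus 50ms).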
