feat: Add comprehensive KNN join integration tests and benchmarks by zhangfengcdt · Pull Request #65 · apache/sedona-db

zhangfengcdt · 2025-09-11T18:08:11Z

Summary

Add integration tests for KNN join functionality with synthetic data
Include cross-verification against PostGIS for correctness validation
Add comprehensive benchmarking comparing SedonaDB, PostGIS, and DuckDB
Test various scenarios: basic joins, polygon joins, edge cases, and attribute preservation
Performance results show SedonaDB is 8-655× faster than competitors

Integration Tests Added

Basic KNN Join: Point-to-point KNN queries with configurable k values
Mixed Geometry Types: Point-to-polygon KNN operations
Edge Cases: Handle scenarios where k > available targets
Attribute Preservation: Verify additional columns are maintained in results
Correctness Validation: Cross-verify results against PostGIS using equivalent queries
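
To make the scenarios above concrete, here is a minimal brute-force sketch of what a KNN join computes. It is a reference illustration only, not SedonaDB's implementation or the code in `test_knnjoin.py`. The function name `knn_join_bruteforce` and the sample data are hypothetical, and the handling of k > available targets (return every available target) is an assumption about the expected behavior.

```python
import math


def knn_join_bruteforce(queries, targets, k):
    """Return (query_id, target_id, distance) rows for the k nearest targets of each query.

    Assumption: when k exceeds the number of targets, every target is returned.
    """
    rows = []
    for q_id, (qx, qy) in queries.items():
        # Rank all targets by Euclidean distance to the query point.
        ranked = sorted(
            (math.hypot(qx - tx, qy - ty), t_id)
            for t_id, (tx, ty) in targets.items()
        )
        # Slicing past the end of the list simply yields all available targets.
        for dist, t_id in ranked[:k]:
            rows.append((q_id, t_id, dist))
    return rows


# Hypothetical sample data: two query points and three target points.
queries = {1: (0.0, 0.0), 2: (10.0, 10.0)}
targets = {"a": (1.0, 0.0), "b": (0.0, 2.0), "c": (9.0, 10.0)}

k1 = knn_join_bruteforce(queries, targets, k=1)  # one row per query
k5 = knn_join_bruteforce(queries, targets, k=5)  # k > 3 targets: 3 rows per query
```

A brute-force result like this is also the kind of baseline a cross-engine check can be compared against, since it does not depend on any index.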

Benchmark Results
Performance comparison across three engines using both small (100 trips × 1000 buildings) and large (1000 trips × 2000 buildings) datasets:

Large Dataset Results

Engine	k=1	k=5	k=10
SedonaDB	3.32ms	6.50ms	9.38ms
DuckDB	211.19ms	209.19ms	202.81ms
PostGIS	2171.37ms	2237.47ms	2221.19ms

Small Dataset Results

Engine	k=1	k=5	k=10
SedonaDB	0.93ms	1.16ms	1.49ms
DuckDB	11.52ms	11.71ms	12.21ms
PostGIS	161.93ms	162.62ms	165.21ms

SedonaDB is 8-655× faster than DuckDB and PostGIS across all measured scenarios.
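
The speedup range can be cross-checked against the two tables. The sketch below copies the reported timings and divides each competitor's time by SedonaDB's time for the same dataset and k. The names `timings_ms` and `speedups` are illustrative, and the ratios are only as precise as the rounded timings in the tables.

```python
# Timings in milliseconds, copied from the tables above (k = 1, 5, 10).
K_VALUES = (1, 5, 10)
timings_ms = {
    "large": {
        "SedonaDB": [3.32, 6.50, 9.38],
        "DuckDB": [211.19, 209.19, 202.81],
        "PostGIS": [2171.37, 2237.47, 2221.19],
    },
    "small": {
        "SedonaDB": [0.93, 1.16, 1.49],
        "DuckDB": [11.52, 11.71, 12.21],
        "PostGIS": [161.93, 162.62, 165.21],
    },
}


def speedups(timings):
    """Map (dataset, engine, k) to competitor_time / sedonadb_time."""
    ratios = {}
    for dataset, engines in timings.items():
        baseline = engines["SedonaDB"]
        for engine, values in engines.items():
            if engine == "SedonaDB":
                continue
            for k, base, value in zip(K_VALUES, baseline, values):
                ratios[(dataset, engine, k)] = value / base
    return ratios


ratios = speedups(timings_ms)
lowest = min(ratios, key=ratios.get)    # smallest speedup
highest = max(ratios, key=ratios.get)   # largest speedup
print(lowest, round(ratios[lowest], 1))
print(highest, round(ratios[highest], 1))
```

With these numbers the smallest ratio comes from DuckDB on the small dataset at k=10 and the largest from PostGIS on the large dataset at k=1.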

Copilot

Pull Request Overview

This PR introduces comprehensive KNN join integration tests and benchmarks to validate functionality and measure performance across multiple database engines. The implementation adds thorough test coverage for SedonaDB's KNN join capabilities and provides comparative benchmarking against PostGIS and DuckDB.

Key changes:

Add extensive integration tests covering basic joins, mixed geometry types, edge cases, and attribute preservation
Implement cross-verification against PostGIS to ensure correctness
Add comprehensive benchmarking framework comparing SedonaDB, PostGIS, and DuckDB performance

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
python/sedonadb/tests/test_knnjoin.py	New comprehensive test suite for KNN join functionality with various scenarios and PostGIS validation
benchmarks/test_knn.py	Enhanced benchmark suite with multi-engine comparison and improved test coverage

paleolimbot

Thank you!

2010YOUY01 · 2025-09-22T10:18:57Z

It seems that even for the largest dataset (2000 buildings x 1000 trips), SedonaDB uses only 1 CPU core for the SpatialJoinExec phase. I think that with a larger dataset on the join probe (trips) side, the speedup relative to DuckDB/PostGIS would be even higher. (This is really impressive work!)

I tried two data sizes:
2000 buildings x 1000 trips: 25ms
2000 buildings x 20000 trips: 50ms

reproducer in `sedona-cli`:

-- Setup
CREATE OR REPLACE VIEW knn_buildings AS
SELECT
    geometry AS geom,
    ROUND(random() * 1000) AS building_id,
    'Building_' || CAST(ROUND(random() * 1000) AS VARCHAR) AS name
FROM sd_random_geometry('{
    "geom_type": "Polygon",
    "target_rows": 2000,
    "vertices_per_linestring_range": [4, 8],
    "size_range": [0.001, 0.01],
    "seed": 42
}');

CREATE OR REPLACE VIEW knn_trips AS
SELECT
    geometry AS geom,
    ROUND(random() * 100000) AS trip_id
FROM sd_random_geometry('{
    "geom_type": "Point",
    "target_rows": 100000,
    "seed": 43
}');



-- Benchmark Query
WITH trip_sample AS (
    SELECT trip_id, geom AS trip_geom
    FROM knn_trips
    LIMIT 20000                               -- <== This line is changed
),
building_with_geom AS (
    SELECT building_id, name, geom AS building_geom
    FROM knn_buildings
)
SELECT
    t.trip_id,
    b.building_id,
    b.name,
    ST_Distance(t.trip_geom, b.building_geom) AS distance
FROM trip_sample t
JOIN building_with_geom b ON ST_KNN(t.trip_geom, b.building_geom, 5, FALSE)
ORDER BY t.trip_id, distance;

Reasons

The data source `sd_random_geometry` seems to produce batches of 1,000 rows, and the execution layer defaults to using the number of CPU cores as its degree of parallelism. With only one batch, only one partition does any work. My machine has 14 cores, so with 20000 rows on the probe side all partitions stay busy, each with roughly the same amount of work as the single partition had before.

This can be verified by running `EXPLAIN ANALYZE VERBOSE` on the query and checking the `output_rows` field of SpatialJoinExec.
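
The arithmetic behind this explanation can be sketched as follows. This is a simplified model, not the engine's actual scheduling logic: it assumes each batch of 1,000 rows goes to at most one partition and that the number of partitions equals the number of cores. The function name `busy_partitions` and its defaults are hypothetical.

```python
import math


def busy_partitions(probe_rows, batch_size=1000, cores=14):
    """Estimate how many partitions receive work under the simplified model.

    Assumptions: the source emits fixed-size batches, and a partition is
    busy only if at least one batch is assigned to it.
    """
    batches = math.ceil(probe_rows / batch_size)
    return min(batches, cores)


for rows in (1000, 20000):
    busy = busy_partitions(rows)
    print(f"{rows} probe rows -> {busy} busy partition(s), "
          f"about {rows / busy:.0f} rows each")
```

Under this model 1000 probe rows keep 1 partition busy, while 20000 probe rows keep all 14 busy with a comparable per-partition load, which is consistent with the run time only doubling (25ms to 50ms) for 20 times as many trips.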

zhangfengcdt marked this pull request as ready for review September 11, 2025 20:49

jiayuasu requested review from Kontinuation, Copilot and paleolimbot September 11, 2025 20:51

Copilot AI reviewed Sep 11, 2025

address copilot comments

84d7b84

paleolimbot approved these changes Sep 12, 2025

jiayuasu merged commit 79cea80 into apache:main Sep 12, 2025
6 checks passed

jiayuasu merged 2 commits into apache:main from zhangfengcdt:feature/knn_join.integration.test

4 participants