feat: Pytest benchmark for comparing against other engines locally #10

Merged
jiayuasu merged 9 commits into apache:main from petern48:pytest-benchmark on Sep 3, 2025

Conversation

@petern48 (Contributor) commented Sep 2, 2025

See the new benchmarks/README.md for how to run the benchmarks and what the output looks like.

@petern48 (Contributor, Author) commented Sep 2, 2025

Just pushing this somewhere for now. There are other benchmark library options too, so we don't need to commit to this one. I just found this one very easy to set up and use.

@paleolimbot (Member) left a comment

Thanks!

Just a few suggestions to get started. The real work here is writing the actual queries, and I'm happy to run them however works for you!


# Set up tables
num_rows = 10000
create_points_query = (
    f"CREATE TABLE points AS "
    f"SELECT ST_GeomFromText('POINT(0 0)') AS geom FROM range({num_rows})"
)
@paleolimbot (Member) commented:

The DBEngine subclass has this abstracted already, so you can create a table from a GeoParquet file or a GeoPandas data frame. You can use the geoarrow_data fixture to write benchmarks against actual data, or you can use the sd_random_geometry() table function to generate it (Kristin's join integration tests are a great example).

Probably synthetic data makes sense here: points, segments (linestrings with a vertex count of 2), polygon, complex_linestring, complex_polygon. The number of batches could be configurable so that you can run tiny benchmarks or big benchmarks (this is what we do in Rust, too).
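For concreteness, a rough sketch of what that setup could look like (everything below is hypothetical: the create_table_from_query() method name, the sd_random_geometry() arguments, and the batch-count knob are assumptions, not the real API):

# Hypothetical sketch -- the create_table_from_query() method and the
# sd_random_geometry() signature are assumptions, not the real API.
NUM_BATCHES = 10  # could be made configurable for tiny vs. big benchmark runs

def setup_synthetic_tables(engine):
    # One table per geometry flavor suggested above
    for table in ["points", "segments", "polygons",
                  "complex_linestrings", "complex_polygons"]:
        engine.create_table_from_query(
            name=table,
            query=f"SELECT geometry FROM sd_random_geometry('{table}', {NUM_BATCHES})",
        )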

@petern48 (Contributor, Author) replied:

How exactly would you want to quantify complex vs. non-complex?

@paleolimbot (Member) replied:

It looks like you found vertices_per_linestring_range. I use the numbers 10 ("simple") and 500 ("complex") in the Rust benchmarks, which is sort of arbitrary but did the trick of weeding out predicate implementations that weren't using a prepared geometry (particularly when one side was a scalar). Totally optional!
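For reference, a tiny sketch of how those numbers might translate into table setup (only vertices_per_linestring_range and the 10/500 values come from this thread; the dict shape is illustrative):

# Illustrative mapping of "simple" vs. "complex" onto vertex counts;
# 10 and 500 follow the values used in the Rust benchmarks.
VERTICES_PER_LINESTRING_RANGE = {
    "simple": (10, 10),
    "complex": (500, 500),
}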

@petern48 (Contributor, Author) commented Sep 2, 2025

I've changed it so that we generate columns of random geometries: points_10_000, polygons_10_000, polygons_100_000, etc. I'm not sure 1. how we should make this configurable or 2. to what extent we should make it configurable.

If we do different geometry types, simple/complex, and number of geometries, I feel that's a lot of dimensions. How much do we care to drill down?

Looking at the current implementation of test_st_area (which is parametrized, unlike the rest), we can group by table (dataset size, etc.) and compare the engines at a more granular level.
(Notice DuckDB wins for one of the simpler datasets here, although SedonaDB is faster for the rest and overall.)
pytest --benchmark-group-by=param:table test_functions.py::TestBenchFunctions::test_st_area
[screenshot: benchmark results grouped by table]

Or we can just benchmark at the function level (e.g. st_buffer):
pytest --benchmark-group-by=func test_functions.py::TestBenchFunctions::test_st_buffer
[screenshot: benchmark results grouped by function]
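For context, a parametrized benchmark along these lines might look roughly like the sketch below. The get_engine() helper, engine.execute() method, and exact table names are assumptions inferred from the test IDs in the output; only the pytest-benchmark `benchmark` fixture is the library's real API.

import pytest

# Assumed table/engine names, inferred from test IDs like
# test_st_area[polygons_10_000-SedonaDB]
TABLES = ["points_10_000", "polygons_10_000", "polygons_100_000"]
ENGINES = ["SedonaDB", "DuckDB"]

class TestBenchFunctions:
    @pytest.mark.parametrize("engine_name", ENGINES)
    @pytest.mark.parametrize("table", TABLES)
    def test_st_area(self, benchmark, table, engine_name):
        engine = get_engine(engine_name)  # hypothetical helper
        # pytest-benchmark's `benchmark` fixture runs and times this callable
        benchmark(lambda: engine.execute(f"SELECT ST_Area(geom) FROM {table}"))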

@paleolimbot (Member) commented:

Awesome!

> How much do we care to drill down?

Not that far! The tests are all about correctness and corner cases...here we can stick to the most common cases. Most functions shouldn't need more than one or two benchmarks (one on a simple geometry, which benchmarks our per-geometry overhead, and one on a complex geometry, which is more a test of the underlying implementation).

> I'm not sure 1. how we should make this configurable or 2. to what extent we should make it configurable.

No need to make it configurable now, but maybe rename the tables to _small and _large in case we decide to change those numbers?

petern48 changed the title from "WIP: Pytest benchmark proposal" to "feat: Pytest benchmark for comparing against other engines locally" on Sep 2, 2025
@petern48 (Contributor, Author) commented Sep 2, 2025

I added the separate 'simple' and 'complex' tables and removed the small sizing (using the original large size for everything). I didn't see much value from that dimension at the moment, and this mimics the Rust tests most closely. I also don't think it makes sense to integrate with CI for now, since there might still be options other than pytest-benchmark that are better suited to our needs. I just wanted a nice quick tool for comparing against different engines locally.

Here's an example of how it can be used locally at the moment. The command below groups the simple and complex results separately.
pytest --benchmark-group-by=func,param:table test_functions.py::TestBenchFunctions::test_st_area

[screenshot: benchmark results grouped by function and table]

WDYT?

petern48 requested a review from paleolimbot on September 2, 2025 at 22:50
petern48 marked this pull request as ready for review on September 2, 2025 at 22:50
@paleolimbot (Member) left a comment

This is a great start...thank you!

Can you add benchmarks/README.md (with the license header in a comment, because Apache) with a brief description of the benchmarks and how to run them?

@paleolimbot (Member) left a comment

Thank you!

jiayuasu merged commit 8f00164 into apache:main on Sep 3, 2025 (2 checks passed)
petern48 deleted the pytest-benchmark branch on September 3, 2025 at 21:25
@Kontinuation (Member) commented Sep 5, 2025

I ran the benchmark and found that sedona-db uses all the CPU cores to run the benchmarking query, while DuckDB uses only a single core (the CPython main thread). This makes the benchmark results of sedona-db and DuckDB not directly comparable.

Have you run into this issue before? Should we configure the benchmarked engines to force single-threaded query execution?

$ pytest --benchmark-group-by=param:table test_predicates.py::TestBenchPredicates::test_st_contains
============================================================================================ test session starts =============================================================================================
platform darwin -- Python 3.13.4, pytest-8.4.1, pluggy-1.6.0
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/bopeng/workspace/wherobots/sedona-db/benchmarks
plugins: anyio-4.10.0, benchmark-5.1.0
collected 2 items                                                                                                                                                                                            

test_predicates.py ..                                                                                                                                                                                  [100%]


---------------------------------------------------------------------------------------- benchmark 'table=polygons_simple': 2 tests ---------------------------------------------------------------------------------------
Name (time in ms)                                     Min                   Max                  Mean             StdDev                Median                IQR            Outliers     OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_st_contains[polygons_simple-SedonaDB]       171.6143 (1.0)        197.6855 (1.0)        183.8239 (1.0)      11.2549 (1.13)       184.8224 (1.0)      19.9745 (1.70)          2;0  5.4400 (1.0)           5           1
test_st_contains[polygons_simple-DuckDB]       1,113.0932 (6.49)     1,140.0075 (5.77)     1,125.7878 (6.12)      9.9178 (1.0)      1,123.1109 (6.08)     11.7343 (1.0)           2;0  0.8883 (0.16)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
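For what it's worth, pinning DuckDB to one thread is straightforward via its threads setting; the equivalent knob for SedonaDB is an assumption here, since it depends on how its session configuration is exposed:

import duckdb

# DuckDB: cap query execution at a single thread (a real DuckDB setting)
con = duckdb.connect()
con.execute("SET threads TO 1;")

# SedonaDB: hypothetical -- the exact option depends on how SedonaDB exposes
# its session configuration; it might resemble something like:
# ctx.set("datafusion.execution.target_partitions", "1")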

@jiayuasu (Member) commented Sep 5, 2025

Does this mean all the performance numbers we saw yesterday are wrong?

@paleolimbot (Member) commented:

> Does this mean all the performance numbers we saw yesterday are wrong?

I don't doubt the diagnostics here, but I would be surprised if DuckDB's Python package were configured to always use one thread by default, and that this went uncaught for the entire lifecycle of the 1.3 release. There are a number of things we need to consider on top of yesterday's numbers, including this!

@petern48 (Contributor, Author) commented Sep 5, 2025

Very good catch 😬
