Conversation

@pettyjamesm
Member

Description

Follows up from #27340 with documentation in DEVELOPMENT.md for using the Vector API. These docs should make it clear that it is safe to assume the Vector API is available when developing on Trino, but they go further in prescribing general best practices we should aim to follow when using the Vector API in Trino code. This PR can serve as a discussion point on that matter.

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

equivalent scalar code on whatever hardware the engine happens to be running on.

When adding implementations that use the Vector API, be sure to:
* Provide an equivalent scalar implementation in code, if one does not already
Member

@martint martint Nov 17, 2025

We had an offline conversation about this, so let me re-hash my comments here:

  • I would not make such a blanket statement requiring a scalar implementation every time the Vector API is used. If there are cases where an implementation based on the Vector API may run slower than a scalar version (or possibly, other variants of the same algorithm using a different subset of the Vector API, or even other scalar variants), then by all means, add multiple versions and make the code able to switch dynamically. We employ similar techniques for data-dependent optimizations (block types, specialized hash tables for certain types, etc.).
  • Functional behavior tests can be implemented via unit tests, just like anything else. Using an equivalent scalar-based implementation is an option, but not necessarily the only approach.

I think these are reasonable guidelines for complex cases. But I would soften the language, as there's no one-size-fits-all (e.g., rephrase as "when adding impls that use the Vector API, consider:").
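To make the dynamic-switching suggestion concrete, here's a minimal sketch. All names are hypothetical, and a manually unrolled loop stands in for a real jdk.incubator.vector kernel so the example runs without incubator-module flags:

```java
// Hypothetical sketch of dynamically switching between a vectorized and a
// scalar implementation, in the spirit of Trino's data-dependent dispatch.
public class SumKernels
{
    // Stand-in for a runtime capability/config check (e.g., detected CPU
    // features or a config property); a system property is used here only
    // for illustration.
    private static final boolean VECTORIZED_ENABLED =
            Boolean.parseBoolean(System.getProperty("sum.vectorized", "true"));

    public static long sum(long[] values)
    {
        // Select the implementation dynamically, as with block-type or
        // specialized hash-table dispatch elsewhere in the engine
        return VECTORIZED_ENABLED ? sumUnrolled(values) : sumScalar(values);
    }

    // Scalar reference implementation
    static long sumScalar(long[] values)
    {
        long total = 0;
        for (long value : values) {
            total += value;
        }
        return total;
    }

    // Placeholder for the Vector API version: processes 4 elements per
    // step, then falls back to scalar code for the remainder
    static long sumUnrolled(long[] values)
    {
        long total = 0;
        int i = 0;
        for (; i + 4 <= values.length; i += 4) {
            total += values[i] + values[i + 1] + values[i + 2] + values[i + 3];
        }
        for (; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }
}
```

In a real implementation the dispatch decision would come from hardware detection or configuration rather than a system property, but the shape (one entry point, multiple interchangeable kernels) is the same.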

Member Author

I've softened the language to "prefer the following approach unless the specifics of the situation dictate otherwise"

Comment on lines 162 to 163
* Include micro-benchmarks that demonstrate the performance benefits of the
vectorized implementation compared to the scalar equivalent logic.
Member

Nitpicking, but this could be the first point. I don't want this to be overlooked.
I can also see how all these things are important. However, if the benchmarks suggest the Vector API-based implementation is always faster (to the best of our knowledge), would we still want to implement an equivalent scalar code path? What would be the logic for switching between them in such a case?

Member Author

@pettyjamesm pettyjamesm Nov 17, 2025

I think everything flows from the current first point (provide an equivalent scalar implementation), but agreed that this is an important one. It probably won't be overlooked, since that should be the first thing a reviewer asks for.

As for whether we would require a scalar equivalent when the vectorized approach is always faster... that's a good point that I hadn't considered because with a sample size of 2, we haven't encountered that scenario.

In both cases so far, certain hardware either doesn't support the instructions needed to benefit from vectorization and falls back to emulated support from the Vector API, or the hardware does support them but with worse performance (@raunaqmorarka might know more, but I think this involved ARM CPUs with NEON but not SVE support).

There are also AVX-512 throttling issues that existed before Intel's Ice Lake CPU generation, which could result in CPU frequency being adjusted downwards for certain instruction sequences using 512-bit registers but not others. Hopefully the JVM's SPECIES_PREFERRED selection handles that by choosing 256-bit registers where possible, but then again it might not.

So what I'm proposing is that for now, and for most cases, we have a way to fall back to scalar code because:

  1. We'll want a scalar code comparison point for benchmarks.
  2. Some amount of hardware detection is likely to be necessary for many (most?) Vector API usages, and that will need a fallback option.
  3. Having the ability to explicitly opt out of the vectorized implementation (via configuration) gives us the ability to debug issues that might arise (both for performance and correctness).

Other benefits include:

  • Scalar code equivalents are likely to be easier to interpret for someone reading the code.
  • We're more likely to catch any issues arising from changes in the Vector API's implementation details if we test against a scalar equivalent.
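As a sketch of that last point, a unit test can use the scalar implementation as an oracle and check that the two implementations agree on randomized inputs (varying lengths exercise the tail path). Names here are hypothetical, and an unrolled loop again stands in for a Vector API kernel:

```java
import java.util.Random;

// Hypothetical sketch of oracle-style testing: the scalar implementation
// defines expected behavior, and the "vectorized" one is checked against it.
public class VectorOracleCheck
{
    // Scalar reference (oracle) implementation
    static long sumScalar(long[] values)
    {
        long total = 0;
        for (long value : values) {
            total += value;
        }
        return total;
    }

    // Stand-in for a Vector API kernel: 4 elements per step plus scalar tail
    static long sumVectorizedStandIn(long[] values)
    {
        long total = 0;
        int i = 0;
        for (; i + 4 <= values.length; i += 4) {
            total += values[i] + values[i + 1] + values[i + 2] + values[i + 3];
        }
        for (; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // Randomized equivalence check over many input lengths, including ones
    // that are not a multiple of the stand-in vector width
    public static boolean equivalentOnRandomInputs()
    {
        Random random = new Random(42);
        for (int length = 0; length < 100; length++) {
            long[] values = new long[length];
            for (int i = 0; i < length; i++) {
                values[i] = random.nextLong();
            }
            if (sumScalar(values) != sumVectorizedStandIn(values)) {
                return false;
            }
        }
        return true;
    }
}
```

Long addition wraps modulo 2^64 regardless of grouping, so reordering the additions in the vectorized path cannot change the result here; implementations involving floating point would need approximate comparisons instead.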

Member

I'd add a note that people should attempt to test on a few CPU architectures: Intel (not sure which models), Graviton (a new SVE one and an old NEON one), and AMD (again, some version). We should also note that Apple Silicon is not representative of datacenter ARM chips.

Member

I think a scalar implementation will be a necessity anyway, for processing the tail when the input doesn't exactly match the preferred vector bit width.
+1 to benchmarking on an Intel, a Graviton, and an AMD.
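A minimal sketch of that tail-handling pattern: the main loop covers only full vector-width chunks, and a scalar loop finishes the remainder. LANES stands in for VectorSpecies.length() from jdk.incubator.vector, and the inner per-lane loop stands in for a fromArray/mul/reduceLanes sequence; all names are illustrative:

```java
// Hypothetical sketch of vector-width main loop plus scalar tail.
public class TailHandling
{
    private static final int LANES = 8; // stand-in for SPECIES.length()

    public static long dotProduct(long[] a, long[] b)
    {
        long total = 0;
        // Equivalent of SPECIES.loopBound(a.length): round down to a
        // multiple of the vector width
        int bound = a.length - (a.length % LANES);
        int i = 0;
        for (; i < bound; i += LANES) {
            // In real Vector API code this inner loop would be a single
            // vector load, multiply, and lane-wise reduction
            for (int lane = 0; lane < LANES; lane++) {
                total += a[i + lane] * b[i + lane];
            }
        }
        // Scalar tail for inputs whose length is not a multiple of LANES
        for (; i < a.length; i++) {
            total += a[i] * b[i];
        }
        return total;
    }
}
```

The Vector API also offers masked loads for the tail, but a scalar tail loop is the simplest form and makes the point above: some scalar code tends to exist regardless.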

Member

@dain dain left a comment

Seems reasonable to me.

I approved this, but you should check for any more feedback before merging.

@pettyjamesm
Member Author

I've added a new paragraph mentioning the need to test on AMD, ARM, and Intel CPUs, as well as noting that Apple Silicon and datacenter-class ARM CPUs can behave differently. I've also added some additional wording to the micro-benchmarking bullet point prescribing that the performance benefits should be verified on each of the hardware types that the vectorized implementation will be enabled on.

@pettyjamesm pettyjamesm merged commit 88b0d73 into trinodb:master Nov 19, 2025
99 checks passed
@pettyjamesm pettyjamesm deleted the add-vector-api-docs branch November 19, 2025 15:01
@github-actions github-actions bot added this to the 479 milestone Nov 19, 2025