Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Remote UDF, only serialize and process selected rows #12277

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

vudung45
Copy link

@vudung45 vudung45 commented Feb 6, 2025

Summary:

  • In this diff, we are adding an optimization to Remote UDF, to only serialize and process selected rows.
  • This could greatly help in cases where the vector is heavily filtered.
  • The previous approach could also potentially have bugs where function would throw exception if it doesn't expect to process un-selected rows (e.g null conditions, unexpected arguments, etc.)

Differential Revision: D69231717

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 6, 2025
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D69231717

Copy link

netlify bot commented Feb 6, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 2944381
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67a47fd93f306800089b94ad

vudung45 pushed a commit to vudung45/velox that referenced this pull request Feb 6, 2025
…ncubator#12277)

Summary:

- In this diff, we are adding an optimization to Remote UDF, to only serialize and process selected rows.
- This could greatly help in cases where the vector is heavily filtered.
- The previous approach could also potentially have bugs where function would throw exception if it doesn't expect to process un-selected rows (e.g null conditions, unexpected arguments, etc.)
- XStream is starting to adopt this model for the CodeX project. This is to unblock dogfooding, and productionizing.

Differential Revision: D69231717
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D69231717

facebook-github-bot pushed a commit that referenced this pull request Feb 6, 2025
Summary:

- In this diff, we are adding an optimization to Remote UDF, to only serialize and process selected rows.
- This could greatly help in cases where the vector is heavily filtered.
- The previous approach could also potentially have bugs where function would throw exception if it doesn't expect to process un-selected rows (e.g null conditions, unexpected arguments, etc.)
- XStream is starting to adopt this model for the CodeX project. This is to unblock dogfooding, and productionizing.

Differential Revision: D69231717
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D69231717

…ncubator#12277)

Summary:

- In this diff, we are adding an optimization to Remote UDF, to only serialize and process selected rows.
- This could greatly help in cases where the vector is heavily filtered.
- The previous approach could also potentially have bugs where function would throw exception if it doesn't expect to process un-selected rows (e.g null conditions, unexpected arguments, etc.)
- XStream is starting to adopt this model for the CodeX project. This is to unblock dogfooding, and productionizing.

Differential Revision: D69231717
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D69231717

@pedroerp
Copy link
Contributor

pedroerp commented Feb 6, 2025

Nice, thanks for looking into this!

Instead of wrapping the entire vector in a dictionary, a more efficient way would be to keep the vector as-is, but only serialize the rows active in the rowSet. That API doesn't do that today, but it would be better here to create a new version of rowVectorToIOBuf() in VectorStream.h that takes the rowSet and only serializes the rows alive. Inside that function, you will only have to convert the rowSet to a vector of IndexRange, and call the appropriate code, so it would actually be less code. There may be an utility already in the codebase to transform a rowSet into a vector of ranges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants