ANN_BENCH enhanced dataset support #624

Merged: 12 commits into rapidsai:branch-25.02 on Feb 1, 2025

Conversation

@achirkin commented Jan 29, 2025

Refactor the dataset module in the benchmark utility to add missing functionality:

  • Data in managed memory (in addition to host/device/mmap/pinned)
  • Basic filtering support: randomly generated bitset by setting 'filtering_rate' in the dataset config
  • Partial support within CUVS algorithms (bitset only)
  • Support in all algorithms
  • Using files for bitset filtering
  • Adjusting ground truth to account for the filtered data
  • Fine-grained control over where the bitset is located (like there is for the base set and query set)
  • Expose 2MB huge-pages support via config/cmd arguments
  • Add quantization as a dataset property
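
A minimal sketch of the random bitset filter and the ground-truth adjustment listed above. This is not the code from this PR: the helper names (`make_random_bitset`, `is_kept`, `filter_ground_truth`), the 32-bit word layout, and the convention that a set bit means "row is kept" are all assumptions made for illustration. It also assumes 'filtering_rate' is the fraction of rows to filter out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: build a bitset over n_rows rows.
// Assumed convention: bit = 1 means the row passes the filter (is kept).
// Each row is dropped independently with probability filtering_rate.
std::vector<std::uint32_t> make_random_bitset(std::size_t n_rows,
                                              double filtering_rate,
                                              std::uint64_t seed = 42)
{
  assert(filtering_rate >= 0.0 && filtering_rate <= 1.0);
  std::vector<std::uint32_t> bits((n_rows + 31) / 32, 0u);
  std::mt19937_64 gen(seed);
  std::bernoulli_distribution keep(1.0 - filtering_rate);
  for (std::size_t i = 0; i < n_rows; ++i) {
    if (keep(gen)) { bits[i / 32] |= (1u << (i % 32)); }
  }
  return bits;
}

// Hypothetical helper: test whether row i passes the filter.
inline bool is_kept(const std::vector<std::uint32_t>& bits, std::size_t i)
{
  return ((bits[i / 32] >> (i % 32)) & 1u) != 0u;
}

// Hypothetical helper: drop filtered-out ids from one ground-truth row,
// preserving the original order and returning at most k neighbours.
std::vector<std::uint32_t> filter_ground_truth(const std::vector<std::uint32_t>& gt_row,
                                               const std::vector<std::uint32_t>& bits,
                                               std::size_t k)
{
  std::vector<std::uint32_t> out;
  for (std::uint32_t id : gt_row) {
    if (out.size() == k) { break; }
    if (is_kept(bits, id)) { out.push_back(id); }
  }
  return out;
}
```

Note that filtering a precomputed top-k list this way can return fewer than k neighbours when many of them are filtered out, so a ground-truth file would need extra neighbours per query (or a recomputation) to stay exact at high filtering rates.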

@achirkin added the "feature request" (New feature or request) and "non-breaking" (Introduces a non-breaking change) labels Jan 29, 2025
@achirkin self-assigned this Jan 29, 2025
@achirkin requested a review from a team as a code owner January 29, 2025 12:47
@github-actions bot added the cpp label Jan 29, 2025
@achirkin commented

Please note that the feature list in the PR description is rather ambitious.
Given how close the release burn-down is, I suggest leaving the unticked items for a follow-on PR and treating this PR as an initial, experimental feature set.

@achirkin requested a review from tfeher January 29, 2025 15:13

@tfeher left a comment

Thanks, Artem, for the PR; the ability to add filtering options is very useful!

Overall the PR looks good. It would be great to add a few notes (developer docs) on how the dataset is represented as different types of blobs.

cpp/bench/ann/src/common/ann_types.hpp (resolved)
cpp/bench/ann/src/common/conf.hpp (resolved)
cpp/bench/ann/src/cuvs/cuvs_ann_bench_utils.h (resolved)
cpp/bench/ann/src/common/dataset.hpp (resolved)
cpp/bench/ann/src/common/dataset.hpp (outdated, resolved)
@achirkin requested a review from tfeher January 31, 2025 10:28

@tfeher left a comment

Thanks, Artem, for the update; the PR looks good to me!

@achirkin commented Feb 1, 2025

/merge

@rapids-bot bot merged commit 88f0dfc into rapidsai:branch-25.02 Feb 1, 2025
61 checks passed
Labels: cpp, feature request (New feature or request), non-breaking (Introduces a non-breaking change)

2 participants