Read all Parquet files in a folder #17757

Draft: ttnghia wants to merge 3 commits into branch-25.02

@ttnghia (Contributor) commented Jan 16, 2025

Do not review. This is not to be merged into cudf.


This is actually an application, implemented in place of PARQUET_TEST. Given a path to a folder containing Parquet files, the application reads each file using the Parquet chunked reader. We use it to check for memory issues while reading Parquet files under compute-sanitizer:

compute-sanitizer --tool memcheck build/gtests/PARQUET_TEST --rmm_mode=cuda

The data folder can be given either through the environment variable PARQUET_PATH or hard-coded in the source file.

The entire implementation is in cpp/tests/io/spark_test.cpp: https://github.com/rapidsai/cudf/pull/17757/files#diff-712fdde5014f59e26a43b244beb3c000ad1ca5831faeeee8b184d3f2971e5e46

copy-pr-bot bot commented Jan 16, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions bot added the following labels on Jan 16, 2025: libcudf (Affects libcudf (C++/CUDA) code.), Python (Affects Python cuDF API.), CMake (CMake build issue).
Status: In Progress