LocalFiles::FileOrFiles::uri_folder could be brittle, do we want exclude_invalid_files? #184
What do you think about a more concrete definition? For example, something like "exclude patterns" (with maybe a default set?)
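For illustration only, here is a minimal sketch of what consumer-side "exclude patterns" could look like when expanding a `uri_folder`; the function name and the default pattern set are hypothetical, not something defined by the spec or proposed in this thread:

```python
import fnmatch
import os

# Hypothetical default exclude set -- the kind of "default set" the comment
# asks about (hidden files, metadata/summary files, checksum files).
DEFAULT_EXCLUDE_PATTERNS = [".*", "_*", "*.crc"]

def expand_uri_folder(folder, exclude_patterns=DEFAULT_EXCLUDE_PATTERNS):
    """List files under a folder, dropping anything matching an exclude pattern."""
    selected = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if any(fnmatch.fnmatch(name, pattern) for pattern in exclude_patterns):
                continue  # excluded: hidden, metadata, or checksum file
            selected.append(os.path.join(root, name))
    return sorted(selected)
```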
I like
Thinking about this more, it seems pretty thorny to include any notion of validity at all. I think whether a consumer considers a directory a valid thing to run queries against will be consumer specific. Here are some scenarios where I would expect variation in consumer behavior, all the while accepting a directory as input:
I think any of these scenarios could be either valid or invalid depending on the consumer. I think the question is: should the spec dictate the validity of a directory?
Directory? No. This was more about the validity of individual files (e.g. magic numbers). Patterns are probably sufficient, I think, but
In my (admittedly limited) experience it has been pretty rare that a dataset contains only data files and nothing else (e.g. metadata files, dataset descriptions, etc.). I know we have `uri_glob` but since we aren't requiring support for `**`.

In Arrow we have an `exclude_invalid_files` option which can be specified alongside a directory (and defaults to true, so maybe the protobuf name is `assume_files_valid`). If set to `true` then we will attempt to determine if a file is a valid data file, which is a format-specific operation. For example, if we are reading Parquet we will look for the magic bytes.
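For context, a rough sketch of the kind of format-specific validity probe described above, using Parquet's 4-byte magic `PAR1` (which appears at both the start and the end of a valid file); this is an illustration of the idea, not Arrow's actual implementation:

```python
PARQUET_MAGIC = b"PAR1"

def looks_like_parquet(path):
    """Cheap probe: a Parquet file starts and ends with the magic bytes PAR1."""
    try:
        with open(path, "rb") as f:
            if f.read(4) != PARQUET_MAGIC:
                return False
            f.seek(-4, 2)  # seek to the last 4 bytes of the file
            return f.read(4) == PARQUET_MAGIC
    except OSError:
        return False
```

If memory serves, the Arrow behaviour referenced here is exposed in Python as the `exclude_invalid_files` argument to `pyarrow.dataset.dataset(...)`, e.g. `dataset("path/to/dir", format="parquet", exclude_invalid_files=True)`.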