
LocalFiles::FileOrFiles::uri_folder could be brittle, do we want exclude_invalid_files? #184

Open
westonpace opened this issue May 4, 2022 · 4 comments

Comments

@westonpace
Member

In my (admittedly limited) experience it has been pretty rare that a dataset contains only data files and nothing else (e.g. metadata files, dataset descriptions, etc.). I know we have uri_glob, but it isn't a full answer here since we aren't requiring support for **.

In Arrow we have an exclude_invalid_files option which can be specified alongside a directory (and defaults to true, so maybe the protobuf name is assume_files_valid). If set to true, we will attempt to determine whether each file is a valid data file, which is a format-specific operation. For example, if we are reading Parquet we will look for the magic bytes.
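A minimal usage sketch of the Arrow option described above, via pyarrow's dataset discovery; the directory path and format here are placeholders:

```python
import pyarrow.dataset as ds

# Probe each discovered file for format validity (for Parquet this includes
# the magic-byte check mentioned above) and drop files that fail the probe.
dataset = ds.dataset(
    "path/to/folder",      # placeholder directory
    format="parquet",
    exclude_invalid_files=True,
)
print(dataset.files)
```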

@jacques-n
Contributor

What do you think about a more concrete definition? For example, something like "exclude patterns" (maybe with a default set?).
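As a hypothetical illustration (nothing like this exists in the spec today), a consumer applying "exclude patterns" with an invented default set might look like:

```python
import fnmatch
from pathlib import Path

# Invented default set of non-data files commonly found next to data files.
DEFAULT_EXCLUDE_PATTERNS = ["_metadata", "_common_metadata", "_SUCCESS", "*.crc", ".*"]

def list_data_files(folder, exclude_patterns=DEFAULT_EXCLUDE_PATTERNS):
    """Recursively list files under `folder`, skipping names that match an exclude pattern."""
    return [
        p
        for p in sorted(Path(folder).rglob("*"))
        if p.is_file()
        and not any(fnmatch.fnmatch(p.name, pat) for pat in exclude_patterns)
    ]
```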

@cpcloud
Contributor

cpcloud commented May 13, 2022

I like exclude_patterns. exclude_invalid_files implies a notion of validity which will not be uniform across producers and consumers and probably not even consistent across different versions of producers and consumers.

@cpcloud
Contributor

cpcloud commented May 13, 2022

Thinking about this more, it seems pretty thorny to include any notion of validity at all. I think whether a consumer considers a directory a valid thing to run queries against will be consumer-specific.

Here are some scenarios where I would expect variation in consumer behavior, all the while accepting a directory as input:

  1. A directory of mixed file types with the same schema
  2. A directory of the same file type where not all files have the same schema (how would you choose which set is valid?)
  3. A directory of the same file type and the same schema, but no support/metadata files
  4. A directory of the same file type and the same schema, with support/metadata files
  5. A directory with other directories that may or may not be related to one another
  6. Some combination of the above

I think any of these scenarios could be either valid or invalid depending on the consumer.

I think the question is: should the spec dictate validity of a directory?

@westonpace
Member Author

I think the question is: should the spec dictate validity of a directory?

Directory? No. This was more about the validity of individual files (e.g. magic numbers).
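For illustration, a rough sketch of that kind of per-file check (not anything the spec defines), assuming Parquet's 4-byte "PAR1" magic:

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"

def looks_like_parquet(path: Path) -> bool:
    """Cheap magic-byte check; does not fully validate the file."""
    try:
        with path.open("rb") as f:
            if f.read(4) != PARQUET_MAGIC:
                return False
            f.seek(-4, 2)  # the last 4 bytes of the file
            return f.read(4) == PARQUET_MAGIC
    except OSError:
        return False
```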

Patterns are probably sufficient, I think, but include_patterns is likely more useful than exclude_patterns (though it may be nice to have both). Of course, include_patterns is pretty close to a glob listing anyway. This makes me think that maybe the answer is just that we should be stricter about requiring support for the recursive globstar. **/*.parquet will probably be a sufficient replacement for exclude_invalid_files in most cases.
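As a sketch of that globstar approach, a consumer could resolve the recursive pattern itself instead of relying on any validity check (the helper name here is made up):

```python
from pathlib import Path

def resolve_recursive_glob(root: str, pattern: str = "**/*.parquet"):
    """Expand `pattern` relative to `root`, keeping only regular files."""
    return sorted(p for p in Path(root).glob(pattern) if p.is_file())
```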

@westonpace added the enhancement, good first issue, and help wanted labels on Mar 1, 2023
rkondakov pushed a commit to rkondakov/substrait that referenced this issue Nov 21, 2023