
Conversation

pratiman-91
Contributor

Added a new argument to open_mfdataset to better handle invalid files.

errors : {'ignore', 'raise', 'warn'}, default 'raise'
        - If 'raise', an exception is raised for each invalid dataset.
        - If 'ignore', invalid datasets are skipped silently.
        - If 'warn', a warning is issued for each invalid dataset.
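The documented behavior can be modeled in plain Python (a minimal sketch, not the PR's actual code; `fake_open` and `open_mf` are hypothetical stand-ins for the backend open call and `open_mfdataset`):

```python
import warnings

def fake_open(path):
    # Hypothetical stand-in for a backend open; paths containing "bad"
    # simulate invalid files.
    if "bad" in path:
        raise OSError(f"{path}: not a valid dataset")
    return {"path": path}

def open_mf(paths, errors="raise"):
    # Model of the documented errors={'ignore', 'raise', 'warn'} semantics.
    datasets = []
    for p in paths:
        try:
            datasets.append(fake_open(p))
        except OSError as err:
            if errors == "raise":
                raise
            if errors == "warn":
                warnings.warn(str(err))
            # errors == "ignore": skip the file silently
    return datasets

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = open_mf(["a.nc", "bad.nc", "c.nc"], errors="warn")
print(len(kept), len(caught))  # prints: 2 1
```

With errors="warn", the invalid file is skipped and one warning is issued; with errors="ignore" it is skipped silently; with the default errors="raise" the whole call fails, as before the PR.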


welcome bot commented Jan 16, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@max-sixty
Collaborator

I'm not the expert, but this looks reasonable! Any other thoughts?

Assuming no one thinks it's a bad idea, we would need tests.

Collaborator

@headtr1ck headtr1ck left a comment


I think it is a good idea.

But the way it is implemented here seems overly complicated and repetitive.
I would suggest inverting the logic: first build up the list wrapped in a single try, then handle the three cases in the except block.

Collaborator

@headtr1ck headtr1ck left a comment


Almost there.

Also, we should add tests for this.

@pratiman-91
Contributor Author

@headtr1ck Thanks for the suggestions. I have added two tests (ignore and warn). While testing, I found that the new argument broke combine="nested" due to invalid ids. I have now modified it to produce the correct ids, and it passes the tests. Please review the tests and the latest version.

@pratiman-91
Contributor Author

Hi @headtr1ck, I have been thinking about the handling of ids. The current version looks like patchwork (I am not happy with it). I think we can create the ids after removing all the invalid datasets from path1d within the combine="nested" block. Please let me know what you think.
Thanks!
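For illustration, the proposal above could look roughly like this (a sketch only, not the PR's actual code; `paths1d` and the validity check are hypothetical stand-ins for the flattened path list and for "this file opened successfully"):

```python
# Sketch: filter out invalid entries first, then derive the ids from what
# remains, so the ids always line up one-to-one with surviving datasets.
paths1d = ["t0.nc", "broken.nc", "t2.nc"]  # hypothetical flat path list

def is_valid(path):
    # Hypothetical stand-in for a successful open of the file.
    return "broken" not in path

valid_paths = [p for p in paths1d if is_valid(p)]
ids = list(range(len(valid_paths)))  # ids rebuilt after filtering

print(valid_paths, ids)  # prints: ['t0.nc', 't2.nc'] [0, 1]
```

Building the ids only after filtering means they can never refer to a dataset that was dropped.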

@pratiman-91
Contributor Author

@max-sixty Can you please review the PR? Thanks!

@max-sixty
Collaborator

I'm admittedly much less familiar with this section of the code, but nothing seems wrong!

I think we should bias towards merging, so if no one has concerns then I'd vote to merge.

Could we fix the errors in the docs?

@pratiman-91
Contributor Author

It seems one test failed: test_sparse_dask_dataset_repr (xarray.tests.test_sparse.TestSparseDataArrayAndDataset). It is not related to this PR.

@pratiman-91
Contributor Author

@headtr1ck

Some minor changes are still required.

I have made changes based on your suggestions.

Another question: what happens now if someone passes, e.g., a 2x2 list of files where one is broken?

Because as far as I can tell, with errors="ignore" that file will be silently removed, but then later the dataset cannot be constructed and will quite likely throw an error that confuses the user.

I agree, that would be the case. An important assumption is that removing the files does not affect the overall validity of the combined dataset. I think it should be up to the user whether to use that option.

@pratiman-91
Contributor Author

@headtr1ck Can you please review this PR?
Thanks!

@headtr1ck
Collaborator

You need to merge main and resolve the conflicts.

@kmuehlbauer
Contributor

Another question: what happens now if someone passes, e.g., a 2x2 list of files where one is broken?
Because as far as I can tell, with errors="ignore" that file will be silently removed, but then later the dataset cannot be constructed and will quite likely throw an error that confuses the user.

I agree, that would be the case. An important assumption is that removing the files does not affect the overall validity of the combined dataset. I think it should be up to the user whether to use that option.

Thanks @pratiman-91 for the explanation. For cases where unrelated files sneak into the file list for some reason, the enhancements in this PR really help: the user can just get open_mfdataset to work without having to examine the file list by hand. Thanks again!

@kmuehlbauer
Contributor

I'm inclined to merge this, but unsure about the CI failures. I'll restart CI, let's see if this was just intermittent.

@kmuehlbauer kmuehlbauer reopened this Aug 11, 2025
@kmuehlbauer kmuehlbauer enabled auto-merge (squash) August 11, 2025 05:59
@kmuehlbauer kmuehlbauer merged commit 54ac2fe into pydata:main Aug 11, 2025
69 of 72 checks passed

welcome bot commented Aug 11, 2025

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again!

@kmuehlbauer
Contributor

Thanks @pratiman-91 for sticking with us, and congrats on your first contribution!

@pratiman-91
Contributor Author

@max-sixty @kmuehlbauer @headtr1ck Thank you very much for your help. It was a nice experience and I learned a lot.

Successfully merging this pull request may close these issues.

better handling of invalid files in open_mfdataset