Support pulling named subsets of data, or excluding files from pull #2825
Comments
This is related to #2095 to some extent, since one possible solution is to specify a setting per output in the DVC-file: for example, whether that output is pulled/pushed by default.
@shcheklein I don't think this is related; this is not about different remotes. @r-zip There are two tricks you can employ:
@Suor yep, I think we are on the same page. It's very different. The only relation to the ticket is that we could potentially add an option per output that specifies the default behavior for push/pull. Or it can even be a
@shcheklein I was thinking about
I think this can be closed now that there is
Hello. I'm looking for a solution to a similar problem. I don't see the
@pbailey-hf You can pull any subpath without
I see, thanks for the response! What I'm hoping to achieve is that a user can clone a git repo and
Do you want to push that test data, or do you only need it locally for your own purposes? If you don't ever need to push it, you can mark that data with
If you still need to push that test data, you could set up different remotes for yourself and the downstream users. You can set
OK, I've tried out your suggestion, putting the required files in the default remote and the test data in a separate one. It seems to work:
There seem to be a couple of drawbacks:
I appreciate your responsiveness on this issue, and I'll keep digging for solutions.
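For readers following along, the two-remote arrangement tried above can be sketched roughly as follows. This is only an illustration: the remote names `main` and `testdata`, the URLs, and the target `data/test.dvc` are hypothetical, and the function just builds the command lines rather than running them. It relies on `dvc remote add -d` setting the default remote and on the `-r` option of `dvc push` and `dvc pull` selecting a remote by name. Whether a bare `dvc pull` then skips the test data depends on settings not shown here.

```python
import shlex


def two_remote_setup(default_url, test_url, test_target):
    """Return dvc commands (as argv lists) for a default remote plus a test-data remote.

    The remote names "main" and "testdata" are hypothetical placeholders.
    """
    return [
        # Default remote: holds the files every downstream user needs.
        ["dvc", "remote", "add", "-d", "main", default_url],
        # Second, non-default remote that holds only the test data.
        ["dvc", "remote", "add", "testdata", test_url],
        # Push the test data to the second remote explicitly.
        ["dvc", "push", "-r", "testdata", test_target],
        # Pull the test data back only when it is asked for by name.
        ["dvc", "pull", "-r", "testdata", test_target],
    ]


if __name__ == "__main__":
    # Hypothetical URLs and target, printed as shell command lines for review.
    commands = two_remote_setup(
        "s3://example-bucket/main", "s3://example-bucket/test", "data/test.dvc"
    )
    for argv in commands:
        print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the sketch safe to try in any checkout; they can be pasted into a shell once the names and URLs are adjusted.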
You still need to pass
Thanks, I noticed my error after I posted that, but had similar output when I added
Sorry, I also forgot to add that you need to
Can you explain what it is doing and how it solves the problem for you?
It marks certain Outputs as explicit, so that a bare
I'm fine to consider that option if you want to try to implement it!
Hey @dberenbaum, my PR is updated with tests and docs. Feedback is welcome. Thanks for your consideration.
I've been working on a large project with multiple datasets. One of these datasets is large (>100 GB). If I simply run `dvc pull`, then it will pull the huge dataset, which takes up most of the available disk space on my machine. The only way around this appears to be providing the file name of every data file to download. This is inconvenient, however, because there are many files I do want, and only one that I don't want.
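Until something like this exists in DVC itself, the manual workaround described above (naming every wanted file) could be scripted. The following is a minimal sketch, not a DVC feature: `pull_targets` and `pull_command` are hypothetical helper names, the sketch only looks at `.dvc` files (outputs defined in `dvc.yaml` are not covered), and it relies only on `dvc pull` accepting a list of targets.

```python
from fnmatch import fnmatch
from pathlib import Path


def pull_targets(repo_root, exclude=()):
    """List every .dvc file under repo_root, minus those matching an exclude pattern."""
    root = Path(repo_root)
    targets = []
    for path in sorted(root.rglob("*.dvc")):
        if not path.is_file():
            continue  # the internal .dvc/ directory also matches the pattern
        rel = path.relative_to(root).as_posix()
        if any(fnmatch(rel, pattern) for pattern in exclude):
            continue  # excluded by the caller
        targets.append(rel)
    return targets


def pull_command(targets):
    """Build the argv for pulling exactly the given targets."""
    if not targets:
        # A bare `dvc pull` would fetch everything, which is what we want to avoid.
        raise ValueError("no targets left after exclusion")
    return ["dvc", "pull", *targets]


if __name__ == "__main__":
    # Hypothetical usage: pull everything except the large dataset.
    print(pull_command(pull_targets(".", exclude=["data/mnist.dvc"])))
```

The resulting list can be passed to `subprocess.run` from the repository root. Exclusion patterns use `fnmatch` syntax, so something like `data/*.dvc` would drop a whole directory of tracked files.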
I see two solutions to this:
1. Named subsets of data: `dvc pull mnist`. The user would also be able to exclude them: `dvc pull all --exclude mnist`.
2. Excluding files from pull: `dvc pull --exclude data/mnist.dvc`.