-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zstd / conda format not enforcing file order in inner archives? #181
Comments
Alphabetical seems reasonable. The only requirement for .tar.bz2 is that the info directory should come first. But the old sorting code was not well tested. The current maintainers won't have any attachment to confusing old regex's. See also #172 The new transmute code, if conda-build produces .tar.bz2 instead of directly producing .conda, preserves the .tar.bz2 order. |
Great! How about:
And get rid of that wonky sorting code? |
Why not all alphabetical Does regular Python string compare use locale? |
Yep, we could do that. I think before we wanted small files first to be able to access them at the beginning of the tarball (at least for the I would need to educate myself about Python string sorting & potential locale issues. I mentioned it because of this page: https://reproducible-builds.org/docs/stable-inputs/ |
It would be interesting to look into the info order, conda index cares about it here https://github.com/conda/conda-index/blob/main/conda_index/index/sqlitecache.py#L237 , https://github.com/conda/conda-index/blob/main/conda_index/index/sqlitecache.py#L28. It makes a difference when the whole info/ is first in the tarball, but the order of the individual info/'s only makes a difference if you can detect you've seen all interesting info. |
Making the archive creation more stable/deterministic and in consequence reproducible is a good thing that I support! Apart from reproducibility, there are other factors, e.g., compression efficiency, to optimize for. |
It's still reproducible even if you sort it in a weird way, as long as someone else can run the same code on their machine. |
Right. What I meant to express was: Let's order the data in a reproducible, yet (compression-wise) efficient way. |
I am not very educated on this but do you think that really makes a big difference? Isn't the compression algorithm supposed to take care of that? |
I googled and researched a bit -- there is two hashes I found that would work: simhash and minhash. There is a linux program (simhash) that can be used to order files by similarity. Some blog article: https://morimori.tokyo/2017/11/sorting-files-for-better-compression/ A paper comparing the two: http://proceedings.mlr.press/v33/shrivastava14.pdf I think in the best case it could give a 10-20% file size improvement (but that's with older compression algorithms). Not sure if it's worth the added complexity. |
zstd has a larger window size than bz2 (depending on the compression level and the details are not in the main documentation), if the window size is larger than the entire uncompressed data then order shouldn't significantly affect compression? |
https://github.com/conda/conda-package-handling/blob/main/src/conda_package_handling/tarball.py#L16-L30 is a pretty strange decision. It should be changed to info first then alphabetical/lexicographical. Looks like we don't use any of that logic when building |
Hi there, thank you for your contribution! This issue has been automatically marked as stale because it has not had recent activity. It will be closed automatically if no further activity occurs. If you would like this issue to remain open please:
NOTE: If this issue was closed prematurely, please leave a comment. Thanks! |
Just to be sure, in |
Checklist
What happened?
I think it would be good to enforce some file order in the inner archives of the
.conda
format (could be alphabetical or the same weird hash scheme of the.tar.bz2
files) in order to guarantee better reproducibility.Conda Info
No response
Conda Config
No response
Conda list
No response
Additional Context
No response
The text was updated successfully, but these errors were encountered: