I wonder if it might be possible to decouple the chunker/compressor configuration from the repository initialization.
Incidentally, this is already the case in borg, where the chunker parameters are a property of the individual backups (snapshots), not of the repo. In borg this is not particularly problematic: on restore the chunks only need to be concatenated, while on backup the worst case is a loss of deduplication if two snapshots of the same data are taken with different chunker parameters (though I must say that in borg this flexibility is not particularly useful either).
This would be a step toward making the chunker and compressor parameters a function of the individual files, guided by globs in the config file. The rationale is the following:
Concerning compression: rather than using a heuristic to determine whether files are compressible, in many cases this is easy to know in advance. For instance, one could avoid the cost of running a compressibility heuristic on the chunks of mp3, flac, ogg, jpeg, gzip files, etc.
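As a minimal sketch of what such advance knowledge could look like (the extension table and function name are hypothetical, not any backup tool's actual API):

```python
import os

# Hypothetical table: extensions whose contents are already compressed,
# so their chunks could be stored verbatim, skipping any compressibility heuristic.
ALREADY_COMPRESSED = {".mp3", ".flac", ".ogg", ".jpeg", ".jpg", ".gz", ".zip"}

def compression_for(path: str) -> str:
    """Pick a compression mode from the file name alone."""
    ext = os.path.splitext(path)[1].lower()
    return "none" if ext in ALREADY_COMPRESSED else "zstd"

print(compression_for("music/track.mp3"))  # none
print(compression_for("src/main.rs"))      # zstd
```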
Concerning chunking: sources may include files with very different characteristics. Side by side one might find:
- "regular" files, for which rolling-hash chunking enables deduplication of parts of the same file;
- disk-image files, for which fixed-size chunking matching the sector size would be best;
- files that never change, for which chunking is best avoided altogether, since file-level deduplication is the best one can get. For instance, suppose that among your sources there is a collection of mp3 music files or of movies: those will never have useful binary diffs.
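The three strategies above could be sketched roughly as follows (this is a toy illustration, not borg's or any tool's actual chunker; the rolling hash in particular is a naive rolling sum, chosen only to show the cut-point idea):

```python
from typing import Iterator

def whole_file(data: bytes) -> Iterator[bytes]:
    # No chunking: the file itself is the deduplication unit.
    yield data

def fixed_size(data: bytes, size: int = 4096) -> Iterator[bytes]:
    # Sector-aligned chunks, e.g. for disk images.
    for i in range(0, len(data), size):
        yield data[i:i + size]

def content_defined(data: bytes, mask: int = 0x3FF, window: int = 48) -> Iterator[bytes]:
    # Toy content-defined chunker: keep a rolling sum over the last `window`
    # bytes and cut when its low bits hit a boundary pattern, so cut points
    # depend on content, not position (average chunk ~ mask+1 bytes).
    start, rolling = 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i - start >= window:
            rolling -= data[i - window]
        if (rolling & mask) == mask and i + 1 - start >= window:
            yield data[start:i + 1]
            start, rolling = i + 1, 0
    if start < len(data):
        yield data[start:]
```

In each case the chunks concatenate back to the original file; only the cut points differ.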
Even among the regular files, there is always a tension between contrasting needs:
- keeping the chunks small improves deduplication, because a small change invalidates fewer bytes;
- but it also horribly increases the number of chunks, and with it the overhead of storing and managing the corresponding metadata.
Hence, one may want a smaller average chunk size for trees where files tend to change by small bits (think of a very large source-code base) and a larger one for files that change by large sections (think of statically linked executables where vendored libraries get updated, or of documents where pictures get replaced).
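The metadata side of this tension is easy to quantify with back-of-envelope arithmetic (the ~100 bytes of metadata per chunk below is an illustrative assumption, not any tool's real figure):

```python
def chunk_overhead(total_bytes: int, avg_chunk: int, meta_per_chunk: int = 100) -> tuple[int, int]:
    """Return (number of chunks, metadata bytes) for a given average chunk size."""
    n = total_bytes // avg_chunk
    return n, n * meta_per_chunk

# The same 1 TiB tree at two average chunk sizes.
for avg in (512 * 1024, 4 * 1024 * 1024):
    n, meta = chunk_overhead(1 << 40, avg)
    print(f"avg {avg >> 10:>5} KiB -> {n:>9} chunks, {meta / (1 << 20):.0f} MiB metadata")
```

An 8x larger average chunk size means 8x fewer chunks and 8x less chunk metadata, at the cost of coarser deduplication.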
Currently, the best one can do is to "split" the sources into multiple repos with different setups. For instance, when backing up a home dir that contains music folders, rather than backing it up as a whole, one might prefer to send the music files to a separate repo for maximum speed and to enable a small chunk size for the rest. It would be much nicer to just have a line in the config saying "do not chunk .mp3 files".
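That "line in the config" could amount to nothing more than a first-match-wins list of glob rules resolved per file. A sketch of the lookup (rule names and parameter values are all hypothetical):

```python
from fnmatch import fnmatch

# Hypothetical per-glob overrides; first match wins, last entry is the default.
RULES = [
    ("*.mp3", {"chunker": "none",  "compression": "none"}),
    ("*.img", {"chunker": "fixed", "compression": "zstd"}),
    ("*",     {"chunker": "cdc",   "compression": "zstd"}),
]

def params_for(path: str) -> dict:
    """Resolve chunker/compressor parameters for one file from the rule list."""
    for pattern, params in RULES:
        if fnmatch(path, pattern):
            return params
    return RULES[-1][1]

print(params_for("music/album/track.mp3"))  # {'chunker': 'none', 'compression': 'none'}
```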
What do you think?