I wonder if it might be possible to decouple the chunker/compressor configuration from the repository initialization.
Incidentally, this is already the case in borg, where the chunker parameters are a property of the individual backups (snapshots), not of the repo. In borg this is not particularly problematic: on restore the chunks only need to be concatenated, while on backup the worst case is a loss of deduplication if two snapshots of the same data are taken with different chunker parameters (though I must say that in borg this flexibility is not particularly useful either).
This would be a step toward making the chunker and compressor parameters a function of the individual files, guided by globs in the config file. The rationale is the following:
Concerning compression: rather than using a heuristic to determine whether files are compressible, in many cases this is easy to know in advance. For instance, one could avoid the cost of running a compressibility heuristic on the chunks of mp3, flac, ogg, jpeg, gzip files, etc.
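As a minimal sketch of what such advance knowledge could look like (the extension table and function name are hypothetical, not any backup tool's actual API):

```python
import os

# Hypothetical table: extensions whose contents are already compressed,
# so their chunks could be stored verbatim, skipping any compressibility heuristic.
ALREADY_COMPRESSED = {".mp3", ".flac", ".ogg", ".jpeg", ".jpg", ".gz", ".zip"}

def compression_for(path: str) -> str:
    """Pick a compression mode from the file name alone."""
    ext = os.path.splitext(path)[1].lower()
    return "none" if ext in ALREADY_COMPRESSED else "zstd"

print(compression_for("music/track.mp3"))  # none
print(compression_for("src/main.rs"))      # zstd
```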
Concerning chunking: sources may include files with very different characteristics. Side by side one might find:
- "regular" files, for which rolling-hash chunking enables deduplication of parts of the same file;
- disk-image files, for which fixed-size chunking matching the sector size would be best;
- files that never change, for which chunking is best avoided altogether, since file-level deduplication is the best one can get. For instance, suppose that among your sources there is a collection of mp3 music files or of movies: those will never have useful binary diffs.
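The three strategies above could be sketched roughly as follows (this is a toy illustration, not borg's or any tool's actual chunker; the rolling hash in particular is a naive rolling sum, chosen only to show the cut-point idea):

```python
from typing import Iterator

def whole_file(data: bytes) -> Iterator[bytes]:
    # No chunking: the file itself is the deduplication unit.
    yield data

def fixed_size(data: bytes, size: int = 4096) -> Iterator[bytes]:
    # Sector-aligned chunks, e.g. for disk images.
    for i in range(0, len(data), size):
        yield data[i:i + size]

def content_defined(data: bytes, mask: int = 0x3FF, window: int = 48) -> Iterator[bytes]:
    # Toy content-defined chunker: keep a rolling sum over the last `window`
    # bytes and cut when its low bits hit a boundary pattern, so cut points
    # depend on content, not position (average chunk ~ mask+1 bytes).
    start, rolling = 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i - start >= window:
            rolling -= data[i - window]
        if (rolling & mask) == mask and i + 1 - start >= window:
            yield data[start:i + 1]
            start, rolling = i + 1, 0
    if start < len(data):
        yield data[start:]
```

In each case the chunks concatenate back to the original file; only the cut points differ.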
Even among the regular files, there is always a tension between contrasting needs:
- keeping the chunks small improves deduplication, because a small change invalidates fewer bytes;
- but it also horribly increases the number of chunks, and with it the overhead of storing and managing the corresponding metadata.
Hence, one may want a smaller average chunk size for trees where files tend to change by small bits (think of a very large source-code base) and a larger one for files that change by large sections (think of statically linked executables where vendored libraries get updated, or of documents where pictures get replaced).
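The metadata side of this tension is easy to quantify with back-of-envelope arithmetic (the ~100 bytes of metadata per chunk below is an illustrative assumption, not any tool's real figure):

```python
def chunk_overhead(total_bytes: int, avg_chunk: int, meta_per_chunk: int = 100) -> tuple[int, int]:
    """Return (number of chunks, metadata bytes) for a given average chunk size."""
    n = total_bytes // avg_chunk
    return n, n * meta_per_chunk

# The same 1 TiB tree at two average chunk sizes.
for avg in (512 * 1024, 4 * 1024 * 1024):
    n, meta = chunk_overhead(1 << 40, avg)
    print(f"avg {avg >> 10:>5} KiB -> {n:>9} chunks, {meta / (1 << 20):.0f} MiB metadata")
```

An 8x larger average chunk size means 8x fewer chunks and 8x less chunk metadata, at the cost of coarser deduplication.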
Currently, the best one can do is to "split" the sources into multiple repos with different setups. For instance, when backing up a home dir that contains music folders, rather than backing it up as a whole, one might prefer to send the music files to a separate repo for maximum speed and to enable a small chunk size for the rest. It would be much nicer to just have a line in the config saying "do not chunk .mp3 files".
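That "line in the config" could amount to nothing more than a first-match-wins list of glob rules resolved per file. A sketch of the lookup (rule names and parameter values are all hypothetical):

```python
from fnmatch import fnmatch

# Hypothetical per-glob overrides; first match wins, last entry is the default.
RULES = [
    ("*.mp3", {"chunker": "none",  "compression": "none"}),
    ("*.img", {"chunker": "fixed", "compression": "zstd"}),
    ("*",     {"chunker": "cdc",   "compression": "zstd"}),
]

def params_for(path: str) -> dict:
    """Resolve chunker/compressor parameters for one file from the rule list."""
    for pattern, params in RULES:
        if fnmatch(path, pattern):
            return params
    return RULES[-1][1]

print(params_for("music/album/track.mp3"))  # {'chunker': 'none', 'compression': 'none'}
```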
What do you think?