External hinting system for automatic layering #113

Open
mikepurvis opened this issue Jan 30, 2024 · 2 comments

Comments

@mikepurvis
Contributor

Some limitations of the "popularity" based approach for automatic layering:

  • It can only reason about individual store paths, rather than recognizing clusters that make sense to group together.
  • It doesn't have any temporal context, for example optimizing blobs for how much their contents change over time.
  • It doesn't have any awareness of dependency chains (other than for the popularity number).
  • It doesn't account for the size of the store paths.

None of these are huge issues with "small" images, but they start to really limit the effectiveness of the layering once there are thousands of store paths going into an image.

One possible way to improve this situation would be to have some kind of external scanner tool that could examine a bunch of related images, and maybe also instances of the images/closures over time, and produce an output that could be checked into source control and used to better optimize automatic layer generation for successive builds. By checking it in, builds remain pure and the developer stays in control of how frequently to update the hint file (likely in conjunction with dependency changes or flake updates).

If there's interest in such a thing, perhaps this ticket can be a place to discuss what such a file could look like and how it would be most effective to collect the data.
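To make the proposal a bit more concrete, here is one possible shape for such a hint file, sketched in Go. Everything in it is invented for illustration (the file name, the field names, the `ChurnScore` score), not an existing nix2container format; it just shows a checked-in JSON document listing named groups of store paths plus whatever scores the scanner computed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// LayerHint describes one grouping decision emitted by the hypothetical
// external scanner and checked into source control alongside the flake.
type LayerHint struct {
	Name       string   `json:"name"`        // human-readable label, e.g. "python-runtime"
	StorePaths []string `json:"store_paths"` // paths that should land in the same layer
	ChurnScore float64  `json:"churn_score"` // how often this group changed across scanned images
}

// HintFile is the top-level document the scanner would emit.
type HintFile struct {
	Version int         `json:"version"`
	Hints   []LayerHint `json:"hints"`
}

func loadHints(path string) (*HintFile, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var h HintFile
	if err := json.Unmarshal(data, &h); err != nil {
		return nil, err
	}
	return &h, nil
}

func main() {
	h, err := loadHints("layer-hints.json")
	if err != nil {
		fmt.Println("no hint file, falling back to popularity-only layering:", err)
		return
	}
	for _, hint := range h.Hints {
		fmt.Printf("%s: %d paths, churn %.2f\n", hint.Name, len(hint.StorePaths), hint.ChurnScore)
	}
}
```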

@nlewo
Owner

nlewo commented Feb 22, 2024

That would be really fun to implement: collecting all image graphs and finding common subgraphs to isolate into layers!

> some kind of external scanner tool that could examine a bunch of related images

I don't know exactly what you mean by "related images", but I think that to generate a pertinent profile we would need the whole closure graph of all images, which is not available in the built images. This means the profile would have to be generated by consuming the images' Nix expressions.
Maybe we could generate a useful profile from the image JSON file, but it would be suboptimal, and I don't see an advantage of consuming the image JSON file instead of the Nix expression.
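For illustration only (the closures, paths, and helper names below are made up), a first pass at finding common subgraphs could simply group store paths by the exact set of images whose closure contains them; each resulting group is a candidate shared layer. In a real scanner the closures would come from evaluating the images' Nix expressions, as discussed above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sharedGroups buckets store paths by the set of images that reference them.
// Paths referenced by the same set of images are candidates for one layer.
func sharedGroups(closures map[string][]string) map[string][]string {
	membership := map[string][]string{}
	for image, paths := range closures {
		for _, p := range paths {
			membership[p] = append(membership[p], image)
		}
	}
	groups := map[string][]string{}
	for path, images := range membership {
		sort.Strings(images)
		key := strings.Join(images, ",")
		groups[key] = append(groups[key], path)
	}
	return groups
}

func main() {
	// Hard-coded toy closures; a real profile would cover every image in the repo.
	closures := map[string][]string{
		"image-a": {"/nix/store/aaa-glibc", "/nix/store/bbb-python", "/nix/store/ccc-app-a"},
		"image-b": {"/nix/store/aaa-glibc", "/nix/store/bbb-python", "/nix/store/ddd-app-b"},
	}
	for images, paths := range sharedGroups(closures) {
		fmt.Printf("shared by [%s]: %v\n", images, paths)
	}
}
```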

> It doesn't account for the size of the store paths.

I think this should be added to the current algorithm, because generating lots of tiny layers (a deep layer stack) doesn't make sense.
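As a sketch of what taking size into account could look like (the threshold, sizes, and names are invented, not part of the current algorithm), small candidate layers could be greedily merged until each merged layer reaches a minimum size:

```go
package main

import (
	"fmt"
	"sort"
)

type candidate struct {
	Paths []string
	Bytes int64 // total size of the paths, e.g. as reported by `nix path-info -S`
}

// mergeSmall greedily merges candidates below minBytes into the next group,
// trading a little cache granularity for fewer, more useful layers.
func mergeSmall(cands []candidate, minBytes int64) []candidate {
	sort.Slice(cands, func(i, j int) bool { return cands[i].Bytes < cands[j].Bytes })
	var out []candidate
	var acc candidate
	for _, c := range cands {
		acc.Paths = append(acc.Paths, c.Paths...)
		acc.Bytes += c.Bytes
		if acc.Bytes >= minBytes {
			out = append(out, acc)
			acc = candidate{}
		}
	}
	if len(acc.Paths) > 0 {
		out = append(out, acc) // leftover group that never reached the threshold
	}
	return out
}

func main() {
	layers := mergeSmall([]candidate{
		{Paths: []string{"/nix/store/aaa-tzdata"}, Bytes: 2 << 20},
		{Paths: []string{"/nix/store/bbb-iana-etc"}, Bytes: 1 << 20},
		{Paths: []string{"/nix/store/ccc-glibc"}, Bytes: 30 << 20},
	}, 10<<20)
	for _, l := range layers {
		fmt.Printf("%d MiB: %v\n", l.Bytes>>20, l.Paths)
	}
}
```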

> It doesn't have any temporal context, for example optimizing blobs for how much their contents change over time.

In practice, I'm not sure this would be convenient, since the analyzer would have to check out several commits to compute a profile.
Alternatively, we could store the graph in the image filesystem or image metadata: this would also allow fetching a bunch of images from a registry to compute a profile.
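For what it's worth, once the per-commit (or per-tag) closures are available, the churn computation itself is small. A rough illustration, with shortened, invented store paths: count how many distinct store paths each package name had across the scanned versions; a high count means the package changes often and probably shouldn't share a layer with stable paths.

```go
package main

import (
	"fmt"
	"strings"
)

// pname strips the /nix/store/<hash>- prefix so the same package can be
// matched across rebuilds that only changed its hash.
func pname(storePath string) string {
	base := storePath[strings.LastIndex(storePath, "/")+1:]
	if i := strings.Index(base, "-"); i >= 0 {
		return base[i+1:]
	}
	return base
}

// churn counts, per package name, how many distinct store paths appeared
// across the scanned versions of an image.
func churn(versions [][]string) map[string]int {
	seen := map[string]map[string]bool{}
	for _, closure := range versions {
		for _, p := range closure {
			name := pname(p)
			if seen[name] == nil {
				seen[name] = map[string]bool{}
			}
			seen[name][p] = true
		}
	}
	out := map[string]int{}
	for name, paths := range seen {
		out[name] = len(paths)
	}
	return out
}

func main() {
	result := churn([][]string{
		{"/nix/store/aaa-glibc-2.38", "/nix/store/bbb-myapp-1.0"},
		{"/nix/store/aaa-glibc-2.38", "/nix/store/ccc-myapp-1.0"},
	})
	fmt.Println(result) // map[glibc-2.38:1 myapp-1.0:2]
}
```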

@adminy

adminy commented Feb 28, 2025

tvix-store keeps packages in its cool new object store, which now has a metadata service and a chunk service that might be OCI compliant, since they are also aiming at OCI builder backends. So I'm just wondering if that could be used to pull Nix packages directly as layers. The best layering is when someone else does it for you :)
