WIP: Staged layer creation #378

Luap99 · 2025-10-08T10:49:30Z

No description provided.

podmanbot · 2025-10-08T10:50:41Z

✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6414

Luap99 · 2025-10-08T11:41:13Z

Podman PR containers/podman#27251 and the buildah test PR containers/buildah#6414 from the bot both look good, so I think we can remove the special case from ApplyDiff() in overlay; ref containers/podman#25862 (comment)

I still need to work on the actual feature here though: extracting while the store is unlocked.

mtrmac

ACK, simplifying ApplyDiff this way does look correct. (I didn’t carefully look at the tempdir addition yet.)

storage/store.go

Add a new function to stage additions. This should be used to extract the layer content into a temp directory without holding the storage lock and then under the lock just rename the directory into the final location to reduce the lock contention. Signed-off-by: Paul Holzinger <[email protected]>

Used to create temporary files that can be "committed" at a later point by renaming them. Signed-off-by: Paul Holzinger <[email protected]>

That caller in create() already had the layer created in memory so another lookup roundtrip is unnecessary here. Signed-off-by: Paul Holzinger <[email protected]>

It is not clear to me when it will hit the code path there, by normal layer creation we always pass a valid parent so this branch is never reached AFAICT. Let's remove it and see if all tests still pass in podman, buildah and others... Signed-off-by: Paul Holzinger <[email protected]>

Make it so the apply logic can be provided as an argument, which should help the future work to call this function unlocked and let it extract to a temp dir instead. Signed-off-by: Paul Holzinger <[email protected]>

mtrmac

I’m mostly looking because I was curious — feel free to disregard.

The tar-split comment might explain some of the “unexpected EOF” test failures.

mtrmac · 2025-10-13T18:10:03Z

storage/internal/tempdir/tempdir.go

+	}
+	td.counter++
+	if err := callback(tmpAddPath); err != nil {
+		return nil, err


(Non-blocking: It might be useful to delete anything created inside at this point already, it’s clearly not going to be used. .Cleanup will eventually do it, so that’s fine — doing it earlier might make more of the disk space available again immediately. But then again, any users that care about such a difference are so out of space that they have bigger problems to worry about.

Applies similarly, but even less urgently, to StageFileAddition.)

mtrmac · 2025-10-13T18:15:57Z

storage/layers.go

-			return -1, err
-		}
-		return size, err
+		return applyFunc(layer.ID, layer.Parent, options, &tsdata)


This effectively moves the write of tsdata inside this closure, and I don’t think that works: we need compressor to be closed before consuming tsdata.

Yes that is what I figured out after some debugging as well, good catch.

storage/drivers/overlay/overlay_test.go

mtrmac · 2025-10-13T18:19:07Z

storage/drivers/overlay/overlay.go

+	tempDirRoot := d.getTempDirRoot(id)
+	t, err := tempdir.NewTempDir(tempDirRoot)
+	if err != nil {
+		return nil, nil, 0, err


(Generally I’d prefer -1 or other clearly invalid “size” values on error paths, to be a tiny bit more likely to fail in hypothetical error handling mistakes… *grumble* Rust does this so much better.)

mtrmac · 2025-10-13T18:23:24Z

storage/layers.go

+	if _, idInUse := r.byid[id]; idInUse {
+		return ErrDuplicateID
+	}
+	names = dedupeStrings(names)


(Absolutely non-blocking, and pre-existing: dedupeStrings does at least a hash table lookup, and potentially an allocation and more; AFAICS it would be more efficient to just do the r.byname[…] check for all entries of names, even if they were duplicates.

… and, anyway, the two callers provide de-duplicated names already.)

mtrmac · 2025-10-13T18:35:26Z

storage/layers.go

+	applyDiffTemporaryDriver, ok := r.driver.(drivers.ApplyDiffStaging)
+	if ok && diff != nil {
+		// CRITICAL, this releases the lock so we can extract this unlocked
+		r.stopWriting()


This kind of design rather worries me; it’s not transparent to callers who just see “// Requires startWriting.” in the documentation and assume that if they obtained a startWriting lock, their state will not change by the call to this create. It’s hard to reason about.

Conceptually, I think the overlay driver doesn’t really need to know the precise layer ID for a newly-created layer in order to determine the right getTempDirRoot, if this caller assures the driver that the ID is fresh and not conflicting with anything. (For image layers, the ID is deterministic, and we check that it doesn’t exist before trying to pull; but a concurrent process might create it before we finish, so conflicts can and do occur, and need to be carefully considered.) In such a design, I think most of the code in create before this point does not strictly need to run before the applyDiffUnlocked, but also I didn’t carefully read/check everything.

I guess I might have been too focused on doing a proper, quick ID and name conflict lookup first, before doing the expensive lookup, to "fail fast" when possible.
I guess design-wise it makes sense to push this all the way up the stack. I do agree that the unlock/lock pattern is quite dangerous, and I have seen it fail too many times in podman already, so if we can avoid it then we should.

I agree that we probably want the “ID already exists” check to exist when creating image layers — so on the substance of the thing, this might be ~exactly right already.

Shaping the call stack is a maintainability concern that is really only worth worrying about after the code works.

I’m mentioning this early mostly in hope that it might avoid work on a “perfect” implementation of the current approach, some of which would need to be re-done afterwards; and because the “give me a staging directory for a future layer, I don’t know the ID yet” method would be a new concept not currently existing in the driver API.

BTW we can make the current design work — rename create to createTemporarilyUnlockingLock or something like that.

mtrmac · 2025-10-13T18:38:37Z

storage/layers.go

 	}
 	slices.Sort(layer.GIDs)

-	err = r.saveFor(layer)


If the caller is required to do this afterward, that should be documented.

mtrmac · 2025-10-13T18:41:05Z

storage/layers.go

+			return err
+		}
+
+		applyDiff = func() error {


(I know it’s way too early to have an opinion.) The nested closures-returning-closures might be somewhat difficult to track.

Maybe some of this should be a stateful object / interface (driverLayerDiffAdapter???) that ~hides the difference between drivers that do and don’t support unlocked layer staging.
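One possible shape for such an adapter, as a sketch only. Every name here (`driver`, `stagingDriver`, `layerDiffApplier`, `newApplier`, and the toy drivers) is hypothetical and not taken from the driver API. The caller always does "prepare without the lock, commit with the lock", and the adapter decides whether the expensive work happens in the first or the second step.

```go
package main

import "fmt"

// driver is a hypothetical minimal driver API: apply a diff under the lock.
type driver interface {
	ApplyDiff(id string, diff []byte) (int64, error)
}

// stagingDriver is a hypothetical optional capability for unlocked staging.
type stagingDriver interface {
	driver
	StageDiff(diff []byte) (staged string, size int64, err error)
	CommitStaged(id, staged string) error
}

// layerDiffApplier hides which kind of driver is in use.
type layerDiffApplier interface {
	Prepare(diff []byte) error       // called without the store lock
	Commit(id string) (int64, error) // called with the store lock held
}

// lockedApplier defers all work to Commit, for drivers without staging.
type lockedApplier struct {
	d    driver
	diff []byte
}

func (a *lockedApplier) Prepare(diff []byte) error { a.diff = diff; return nil }
func (a *lockedApplier) Commit(id string) (int64, error) {
	return a.d.ApplyDiff(id, a.diff)
}

// stagedApplier extracts in Prepare and only finalizes in Commit.
type stagedApplier struct {
	d      stagingDriver
	staged string
	size   int64
}

func (a *stagedApplier) Prepare(diff []byte) error {
	staged, size, err := a.d.StageDiff(diff)
	if err != nil {
		return err
	}
	a.staged, a.size = staged, size
	return nil
}

func (a *stagedApplier) Commit(id string) (int64, error) {
	if err := a.d.CommitStaged(id, a.staged); err != nil {
		return -1, err
	}
	return a.size, nil
}

// newApplier picks the implementation with a type assertion on the driver.
func newApplier(d driver) layerDiffApplier {
	if sd, ok := d.(stagingDriver); ok {
		return &stagedApplier{d: sd}
	}
	return &lockedApplier{d: d}
}

// plainDriver and fakeStagingDriver are toy drivers used only for the demo.
type plainDriver struct{}

func (plainDriver) ApplyDiff(id string, diff []byte) (int64, error) {
	return int64(len(diff)), nil
}

type fakeStagingDriver struct{ plainDriver }

func (fakeStagingDriver) StageDiff(diff []byte) (string, int64, error) {
	return "staging/0", int64(len(diff)), nil
}
func (fakeStagingDriver) CommitStaged(id, staged string) error { return nil }

func main() {
	for _, d := range []driver{plainDriver{}, fakeStagingDriver{}} {
		a := newApplier(d)
		if err := a.Prepare([]byte("diff")); err != nil {
			panic(err)
		}
		size, err := a.Commit("layer1")
		fmt.Printf("%T: size=%d err=%v\n", a, size, err)
	}
}
```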

Add a function to apply the diff into a temporary directory so we can do that unlocked and only rename under the lock. Signed-off-by: Paul Holzinger <[email protected]>

Signed-off-by: Paul Holzinger <[email protected]>

The compressor must be closed before we write the bytes. However, overall I am not sure why we wrote all bytes fully into memory first. So change it to write directly to a file, but still use a buffer for that to avoid many small writes. Signed-off-by: Paul Holzinger <[email protected]>

Luap99 · 2025-10-15T16:43:26Z

@mtrmac FYI I have not really addressed most of your comments yet, I am just trying to push things to see how much things break. Still seeing plenty of test failures.

Issue 1 I see is that I just use the 700 permission from the tmpdir due to the rename, instead of the proper diff dir creation permissions that are in the driver.create() code:

	diff := path.Join(dir, "diff")
	if err := idtools.MkdirAs(diff, forcedSt.Mode, forcedSt.IDs.UID, forcedSt.IDs.GID); err != nil {
		return err
	}

	if d.options.forceMask != nil {
		st.Mode |= os.ModeDir
		if err := idtools.SetContainersOverrideXattr(diff, st); err != nil {
			return err
		}
	}

Not sure if I should expose that to the tmpdir creation logic. I guess that makes the most sense, since only the driver should know the exact permissions that should be used?

The second problem I see is timeouts (in tests running in parallel), which I guess means I introduced a deadlock?
https://api.cirrus-ci.com/v1/artifact/task/5906611702595584/html/sys-podman-debian-13-rootless-host-sqlite.log.html

Looking at the code, I guess this unlock/lock-again thing I did is indeed completely broken and unsafe due to an ABBA deadlock: in putlayer we also hold the containerStore lock, so unlocking only the layer store makes it possible for another process to get the layer lock and then block on the still-held container store lock, leaving both processes hanging forever.
I haven't checked all the code paths but I guess with the locking order requirements what I did is basically impossible to achieve anyway and I have to indeed move this out to before we get the lock?
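The interleaving described above can be replayed deterministically in a single goroutine. This is only an illustration: two in-process `sync.Mutex` values stand in for the cross-process layer store and container store locks, the function name is hypothetical, and the assumed lock order (layer store before container store) is an assumption of this sketch. `TryLock` (Go 1.18+) is used where the real code would block, so the program reports the stuck state instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// simulateUnlockRelock replays the unlock/relock sequence and reports
// whether each side could take the lock it needs next.
func simulateUnlockRelock() (aGotLayer, bGotContainer bool) {
	var layerLock, containerLock sync.Mutex

	// Process A takes both locks in the assumed order, layer store first.
	layerLock.Lock()
	containerLock.Lock()

	// A drops only the layer lock so it can extract "unlocked".
	layerLock.Unlock()

	// Process B takes the layer lock, then needs the container lock,
	// which A still holds.
	layerLock.Lock()
	bGotContainer = containerLock.TryLock()

	// A finishes extracting and wants the layer lock back, but B holds it.
	aGotLayer = layerLock.TryLock()

	return aGotLayer, bGotContainer
}

func main() {
	aGotLayer, bGotContainer := simulateUnlockRelock()
	// With blocking Lock calls, two false values mean both sides wait forever.
	fmt.Printf("A got layer lock: %v, B got container lock: %v\n", aGotLayer, bGotContainer)
}
```

Staging before any lock is taken avoids the problem because A then never holds one lock while waiting to reacquire an earlier one.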

mtrmac · 2025-10-15T17:03:30Z

Issue 1 I see is that I just use the 700 permission from the tmpdir due to the rename, instead of the proper diff dir creation permissions that are in the driver.create() code

Not sure if I should expose that to the tmpdir creation logic. I guess that makes the most sense, since only the driver should know the exact permissions that should be used?

I think that could work.

I was thinking StageAddition does not actually need to create (os.Create/os.Mkdir) the tmpAddPath at all. All of that happens inside a lock-protected td.tempDirPath, so there is ~nothing special, that I can see, about populating tmpAddPath — the provided callback can create the staged item without any help. (That could also mean StageDirectoryAddition and StageFileAddition could be consolidated into one. And I’m not immediately sure we need a callback — StageAddition could return a newStagingPathToPopulate — but I also didn’t now carefully re-read the tempdir package.)

I haven't checked all the code paths but I guess with the locking order requirements what I did is basically impossible to achieve anyway and I have to indeed move this out to before we get the lock?

Per the locking hierarchy documented at the top of store, I think you’re right here.

Luap99 · 2025-10-15T17:15:36Z

And I’m not immediately sure we need a callback — StageAddition could return a newStagingPathToPopulate — but I also didn’t now carefully re-read the tempdir package.)

Yeah, my thinking was that the callback provides a "lifetime" during which the path is safe to use; if I return a string/struct with the path, then the caller can clean up/commit and still use the path afterwards. This is really where I start to hate Go, because in Rust it would be trivial to enforce that there can only ever be one call to commit, rendering the object useless afterwards.

But yes, usage-wise this callback is indeed getting quite ugly, to the point where just returning the path is much simpler and, well, how Go works in general. I do like the suggestion of just returning the path to consolidate both tmpdir functions into one, so I will go with that.

github-actions bot added the storage Related to "storage" package label Oct 8, 2025

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 8, 2025

dnm: Vendor changes from containers/container-libs#378

6d83f9f

podmanbot mentioned this pull request Oct 8, 2025

Sync: WIP: Staged layer creation containers/buildah#6414

Draft

mtrmac reviewed Oct 8, 2025

Luap99 added 5 commits October 13, 2025 14:17

storage/internal/tempdir: add StageFileAddition()

9a5db5e

storage: avoid layer lookup in applyDiffWithOptions()

4799dd3

storage: rework applyDiffWithOptions()

7418c95

Luap99 force-pushed the staged-layer-creation branch from 348a11e to b7780f2 Compare October 13, 2025 13:14

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 13, 2025

dnm: Vendor changes from containers/container-libs#378

ac18582

mtrmac reviewed Oct 13, 2025

Luap99 added 3 commits October 15, 2025 16:59

overlay: add StartStagingDiffToApply()

4cc2aa3

storage: when creating layer apply diff unlocked

a7a2938

Luap99 force-pushed the staged-layer-creation branch from b7780f2 to bbb2266 Compare October 15, 2025 14:59

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 15, 2025

dnm: Vendor changes from containers/container-libs#378

45eda1a

3 participants