Conversation

Luap99
Member

@Luap99 Luap99 commented Oct 8, 2025

No description provided.

@github-actions github-actions bot added the storage Related to "storage" package label Oct 8, 2025
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 8, 2025
@podmanbot

✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6414

@Luap99
Member Author

Luap99 commented Oct 8, 2025

Podman PR containers/podman#27251 and the buildah test PR containers/buildah#6414 from the bot both look good so that means we can remove the special case from ApplyDiff() in overlay I think, ref containers/podman#25862 (comment)

I still need to work on the actual feature here, though: extracting while the store is unlocked.

Contributor

@mtrmac mtrmac left a comment

ACK, simplifying ApplyDiff this way does look correct. (I didn’t carefully look at the tempdir addition yet.)

Add a new function to stage additions. This should be used to extract
the layer content into a temp directory without holding the storage
lock and then under the lock just rename the directory into the final
location to reduce the lock contention.

Signed-off-by: Paul Holzinger <[email protected]>
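
For illustration, here is a minimal Go sketch of the idea in the commit message above: populate a temporary directory without holding the lock, then hold the lock only for the rename. The `store` type and the `stage`, `commit`, and `demoStageAndCommit` names are hypothetical and are not the tempdir package API from this PR.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// store is a hypothetical stand-in for a lock-protected layer store.
type store struct {
	mu   sync.Mutex
	root string // final location of committed directories
}

// stage populates a fresh temporary directory WITHOUT holding the lock.
// The temp dir lives under root so the later rename stays on one filesystem.
func (s *store) stage(populate func(dir string) error) (string, error) {
	tmp, err := os.MkdirTemp(s.root, "staging-")
	if err != nil {
		return "", err
	}
	if err := populate(tmp); err != nil {
		os.RemoveAll(tmp) // the partial content will never be used
		return "", err
	}
	return tmp, nil
}

// commit holds the lock only for the cheap rename into the final location.
func (s *store) commit(staged, id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return os.Rename(staged, filepath.Join(s.root, id))
}

// demoStageAndCommit stages one file, commits it, and reads it back
// from the final location.
func demoStageAndCommit() (string, error) {
	root, err := os.MkdirTemp("", "store-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)

	s := &store{root: root}
	staged, err := s.stage(func(dir string) error {
		return os.WriteFile(filepath.Join(dir, "file"), []byte("layer content"), 0o644)
	})
	if err != nil {
		return "", err
	}
	if err := s.commit(staged, "layer1"); err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(root, "layer1", "file"))
	return string(data), err
}

func main() {
	content, err := demoStageAndCommit()
	fmt.Println(content, err)
}
```

The sketch ignores conflicts on the final name, which the real code would have to check under the lock before renaming.
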
Used to create temporary files that can be "committed" at a later point
by renaming them.

Signed-off-by: Paul Holzinger <[email protected]>
That caller in create() already had the layer created in memory so
another lookup roundtrip is unnecessary here.

Signed-off-by: Paul Holzinger <[email protected]>
It is not clear to me when this code path is hit. During normal
layer creation we always pass a valid parent, so this branch is never
reached AFAICT.

Let's remove it and see if all tests still pass in podman, buildah and
others...

Signed-off-by: Paul Holzinger <[email protected]>
Make it so the apply logic can be provided as an argument, which should
help the future work to call this function unlocked and let it extract
to a temp dir instead.

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99 Luap99 force-pushed the staged-layer-creation branch from 348a11e to b7780f2 Compare October 13, 2025 13:14
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 13, 2025
Contributor

@mtrmac mtrmac left a comment

I’m mostly looking because I was curious — feel free to disregard.

The tar-split comment might explain some of the “unexpected EOF” test failures.

}
td.counter++
if err := callback(tmpAddPath); err != nil {
return nil, err
Contributor

(Non-blocking: It might be useful to delete anything created inside at this point already, it’s clearly not going to be used. .Cleanup will eventually do it, so that’s fine — doing it earlier might make more of the disk space available again immediately. But then again, any users that care about such a difference are so out of space that they have bigger problems to worry about.

Applies similarly, but even less urgently, to StageFileAddition.)

return -1, err
}
return size, err
return applyFunc(layer.ID, layer.Parent, options, &tsdata)
Contributor

This effectively moves the write of tsdata inside this closure, and I don’t think that works: we need compressor to be closed before consuming tsdata.

Member Author

Yes that is what I figured out after some debugging as well, good catch.
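
To make the ordering problem concrete, here is a small Go sketch using the standard `compress/gzip` package. `roundTrip` is a hypothetical helper, and the in-memory buffer merely stands in for tsdata. If the buffer is consumed before the compressor is closed, the stream is truncated and decompression fails, which is consistent with the “unexpected EOF” test failures mentioned earlier.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// roundTrip compresses payload into a buffer and then decompresses the
// buffer. With closeFirst == false the buffer is consumed while the
// compressor still holds unflushed data and has not written its trailer.
func roundTrip(payload []byte, closeFirst bool) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		return nil, err
	}
	if closeFirst {
		// Close flushes pending data and writes the gzip trailer.
		if err := zw.Close(); err != nil {
			return nil, err
		}
	}
	zr, err := gzip.NewReader(bytes.NewReader(buf.Bytes()))
	if err != nil {
		return nil, err
	}
	return io.ReadAll(zr)
}

func main() {
	_, err := roundTrip([]byte("tar-split metadata"), false)
	fmt.Println("consumed before Close:", err)

	out, err := roundTrip([]byte("tar-split metadata"), true)
	fmt.Println("consumed after Close:", string(out), err)
}
```
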

tempDirRoot := d.getTempDirRoot(id)
t, err := tempdir.NewTempDir(tempDirRoot)
if err != nil {
return nil, nil, 0, err
Contributor

(Generally I’d prefer -1 or other clearly invalid “size” values on error paths, to be a tiny bit more likely to fail in hypothetical error handling mistakes… *grumble* Rust does this so much better.)

if _, idInUse := r.byid[id]; idInUse {
return ErrDuplicateID
}
names = dedupeStrings(names)
Contributor

(Absolutely non-blocking, and pre-existing: dedupeStrings does at least a hash table lookup, and potentially an allocation and more; AFAICS it would be more efficient to just do the r.byname[…] check for all entries of names, even if they were duplicates.

… and, anyway, the two callers provide de-duplicated names already.)
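
A rough sketch of the alternative suggested above, assuming a `byname`-style map from name to layer ID. `firstNameInUse` is a hypothetical helper, not a function in the store:

```go
package main

import "fmt"

// firstNameInUse reports the first requested name that is already present
// in byname. A duplicate entry in names only costs one more map lookup,
// so no de-duplication pass is needed just to detect conflicts.
func firstNameInUse(byname map[string]string, names []string) (string, bool) {
	for _, name := range names {
		if _, used := byname[name]; used {
			return name, true
		}
	}
	return "", false
}

func main() {
	byname := map[string]string{"base": "layer-1"}
	fmt.Println(firstNameInUse(byname, []string{"new", "new", "base"}))
	fmt.Println(firstNameInUse(byname, []string{"new", "new"}))
}
```

Whether the de-duplicated slice is still needed later, for example when the names are stored on the layer, is a separate question that this sketch does not address.
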

applyDiffTemporaryDriver, ok := r.driver.(drivers.ApplyDiffStaging)
if ok && diff != nil {
// CRITICAL, this releases the lock so we can extract this unlocked
r.stopWriting()
Contributor

This kind of design rather worries me; it’s not transparent to callers who just see “// Requires startWriting.” in the documentation and assume that if they obtained a startWriting lock, their state will not change by the call to this create. It’s hard to reason about.

Conceptually, I think the overlay driver doesn’t really need to know the precise layer ID for a newly-created layer in order to determine the right getTempDirRoot, if this caller assures the driver that the ID is fresh and not conflicting with anything. (For image layers, the ID is deterministic, and we check that it doesn’t exist before trying to pull; but a concurrent process might create it before we finish, so conflicts can and do occur, and need to be carefully considered.) In such a design, I think most of the code in create before this point does not strictly need to run before the applyDiffUnlocked, but also I didn’t carefully read/check everything.

Member Author

I guess I might have been too focused on doing a proper quick ID and name conflict lookup first, before the expensive work, to "fail fast" when possible.
I guess design-wise it makes sense to push this all the way up the stack. I do agree that the unlock/lock pattern is quite dangerous, and I have seen it fail too many times in podman already, so if we can avoid it we should.

Contributor

I agree that we probably want the “ID already exists” check to exist when creating image layers — so on the substance of the thing, this might be ~exactly right already.

Shaping the call stack is a maintainability concern that is really only worth worrying about after the code works.

I’m mentioning this early mostly in the hope that it might avoid work on a “perfect” implementation of the current approach, some of which would need to be re-done afterwards; and because the “give me a staging directory for a future layer, I don’t know the ID yet” method would be a new concept not currently existing in the driver API.

Contributor

BTW we can make the current design work — rename create to createTemporarilyUnlockingLock or something like that.

}
slices.Sort(layer.GIDs)

err = r.saveFor(layer)
Contributor

If the caller is required to do this afterward, that should be documented.

return err
}

applyDiff = func() error {
Contributor

(I know it’s way too early to have an opinion.) The nested closures-returning-closures might be somewhat difficult to track.

Maybe some of this should be a stateful object / interface (driverLayerDiffAdapter???) that ~hides the difference between drivers that do and don’t support unlocked layer staging.
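
One possible shape for such an object, as a rough Go sketch. Every name here (`driver`, `stagingDriver`, `layerDiffAdapter`, `Prepare`, `Finish`, and the fake drivers) is hypothetical and chosen only for illustration; none of it is the actual driver API.

```go
package main

import "fmt"

// driver is a hypothetical minimal driver that can only apply a diff
// in place, while the caller holds the lock.
type driver interface {
	ApplyDiff(id string, diff []byte) error
}

// stagingDriver is a hypothetical optional capability: extract into a
// staging area without the lock, then commit cheaply under the lock.
type stagingDriver interface {
	driver
	StageDiff(diff []byte) (staged string, err error)
	CommitStaged(id, staged string) error
}

// layerDiffAdapter hides whether the driver supports staging.
type layerDiffAdapter struct {
	d        driver
	diff     []byte
	staged   string
	isStaged bool
}

// Prepare runs before the lock is taken; it is a no-op for plain drivers.
func (a *layerDiffAdapter) Prepare() error {
	sd, ok := a.d.(stagingDriver)
	if !ok {
		return nil
	}
	staged, err := sd.StageDiff(a.diff)
	if err != nil {
		return err
	}
	a.staged, a.isStaged = staged, true
	return nil
}

// Finish runs under the lock: commit the staged content if there is any,
// otherwise fall back to applying the diff in place.
func (a *layerDiffAdapter) Finish(id string) error {
	if sd, ok := a.d.(stagingDriver); ok && a.isStaged {
		return sd.CommitStaged(id, a.staged)
	}
	return a.d.ApplyDiff(id, a.diff)
}

// plainFake and stagingFake record calls so the two paths are visible.
type plainFake struct{ log string }

func (p *plainFake) ApplyDiff(id string, _ []byte) error {
	p.log += "apply:" + id + ";"
	return nil
}

type stagingFake struct{ plainFake }

func (s *stagingFake) StageDiff(_ []byte) (string, error) {
	s.log += "stage;"
	return "tmp-1", nil
}

func (s *stagingFake) CommitStaged(id, staged string) error {
	s.log += "commit:" + id + ":" + staged + ";"
	return nil
}

// createLayer shows the call order; the real code would take the store
// lock between Prepare and Finish.
func createLayer(d driver, id string) error {
	a := &layerDiffAdapter{d: d, diff: []byte("diff")}
	if err := a.Prepare(); err != nil {
		return err
	}
	return a.Finish(id)
}

func main() {
	p, s := &plainFake{}, &stagingFake{}
	fmt.Println(createLayer(p, "l1"), p.log)
	fmt.Println(createLayer(s, "l1"), s.log)
}
```
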

Add a function to apply the diff into a temporary directory so we can do
that unlocked and only rename under the lock.

Signed-off-by: Paul Holzinger <[email protected]>
The compressor must be closed before we write the bytes. However, overall
I am not sure why we wrote all bytes fully into memory first. So
change it to write directly to a file, but still use a buffer for that to
avoid many small writes.

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99 Luap99 force-pushed the staged-layer-creation branch from b7780f2 to bbb2266 Compare October 15, 2025 14:59
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 15, 2025
@Luap99
Member Author

Luap99 commented Oct 15, 2025

@mtrmac FYI I have not really addressed most of your comments yet, I am just trying to push things to see how much things break. Still seeing plenty of test failures.

Issue 1 I see is that I just use the 700 permission from the tmpdir due to the rename, instead of the proper diff dir creation permissions that are in the driver.create() code:

	diff := path.Join(dir, "diff")
	if err := idtools.MkdirAs(diff, forcedSt.Mode, forcedSt.IDs.UID, forcedSt.IDs.GID); err != nil {
		return err
	}

	if d.options.forceMask != nil {
		st.Mode |= os.ModeDir
		if err := idtools.SetContainersOverrideXattr(diff, st); err != nil {
			return err
		}
	}

Not sure if I should expose that to the tmpdir creation logic; I guess that makes the most sense, since only the driver should know the exact permissions that should be used?
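
As a rough illustration of the permission issue, the following Go sketch fixes the mode of the staging directory before renaming it into place. `stageWithMode` and `demoStagedMode` are hypothetical helpers, and `wantMode` is assumed to be supplied by the driver. Ownership (UID/GID) and the override xattr from the snippet above are deliberately left out.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// stageWithMode creates a staging directory (os.MkdirTemp uses mode 0700),
// sets the mode the final directory is supposed to have, and renames it
// into place. It returns the permission bits of the final directory.
func stageWithMode(root, name string, wantMode fs.FileMode) (fs.FileMode, error) {
	tmp, err := os.MkdirTemp(root, "staging-")
	if err != nil {
		return 0, err
	}
	// Chmod is not filtered by the umask, unlike the mode passed to Mkdir.
	if err := os.Chmod(tmp, wantMode); err != nil {
		os.RemoveAll(tmp)
		return 0, err
	}
	final := filepath.Join(root, name)
	if err := os.Rename(tmp, final); err != nil {
		os.RemoveAll(tmp)
		return 0, err
	}
	st, err := os.Stat(final)
	if err != nil {
		return 0, err
	}
	return st.Mode().Perm(), nil
}

// demoStagedMode runs stageWithMode in a throwaway root directory.
func demoStagedMode() (fs.FileMode, error) {
	root, err := os.MkdirTemp("", "driver-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(root)
	return stageWithMode(root, "diff", 0o755)
}

func main() {
	mode, err := demoStagedMode()
	fmt.Println(mode, err)
}
```
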


The second problem I see is timeouts (in tests running in parallel), which I guess means I introduced a deadlock?
https://api.cirrus-ci.com/v1/artifact/task/5906611702595584/html/sys-podman-debian-13-rootless-host-sqlite.log.html

Looking at the code, I guess this unlock/lock-again thing I did is indeed completely broken and unsafe due to an ABBA deadlock: in putlayer we also hold the containerStore lock, so unlocking only the layer store makes it possible for another process to take the layer lock and then block on the still-held container store lock, leaving both processes hanging forever.
I haven't checked all the code paths but I guess with the locking order requirements what I did is basically impossible to achieve anyway and I have to indeed move this out to before we get the lock?

@mtrmac
Contributor

mtrmac commented Oct 15, 2025

Issue 1 I see is that I just use the 700 permission from the tmpdir due to the rename, instead of the proper diff dir creation permissions that are in the driver.create() code

Not sure if I should expose that to the tmpdir creation logic; I guess that makes the most sense, since only the driver should know the exact permissions that should be used?

I think that could work.

I was thinking StageAddition does not actually need to create (os.Create/os.Mkdir) the tmpAddPath at all. All of that happens inside a lock-protected td.tempDirPath, so there is ~nothing special, that I can see, about populating tmpAddPath; the provided callback can create the staged item without any help. (That could also mean StageDirectoryAddition and StageFileAddition could be consolidated into one. And I’m not immediately sure we need a callback: StageAddition could return a newStagingPathToPopulate. But I also didn’t carefully re-read the tempdir package just now.)


I haven't checked all the code paths but I guess with the locking order requirements what I did is basically impossible to achieve anyway and I have to indeed move this out to before we get the lock?

Per the locking hierarchy documented at the top of store, I think you’re right here.
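
A minimal sketch of the "move the extraction out to before we take the lock" shape, assuming two locks with a fixed acquisition order. The `outer` and `inner` names are placeholders and do not claim to match the actual hierarchy documented in store; `putLayer` and `runWriters` here are hypothetical as well.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errDuplicateID = errors.New("layer ID already in use")

// stores holds two locks that must always be taken in the same order.
type stores struct {
	outer  sync.Mutex // placeholder: always acquired first
	inner  sync.Mutex // placeholder: always acquired second
	layers map[string]string
}

// putLayer does the slow extraction before taking any lock, then takes
// both locks in the fixed order and never releases only one of them
// in the middle of the operation.
func (s *stores) putLayer(id string, extract func() string) error {
	staged := extract() // slow part, no lock held

	s.outer.Lock()
	defer s.outer.Unlock()
	s.inner.Lock()
	defer s.inner.Unlock()

	// Another writer may have created the same ID while we extracted.
	if _, exists := s.layers[id]; exists {
		return errDuplicateID
	}
	s.layers[id] = staged
	return nil
}

// runWriters starts two concurrent writers per ID and reports how many
// layers were stored and how many writers lost the race.
func runWriters() (stored, duplicates int) {
	s := &stores{layers: map[string]string{}}
	ids := []string{"a", "b", "c", "d", "a", "b", "c", "d"}
	errs := make([]error, len(ids))

	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		go func(i int, id string) {
			defer wg.Done()
			errs[i] = s.putLayer(id, func() string { return "content of " + id })
		}(i, id)
	}
	wg.Wait()

	for _, err := range errs {
		if errors.Is(err, errDuplicateID) {
			duplicates++
		}
	}
	return len(s.layers), duplicates
}

func main() {
	fmt.Println(runWriters())
}
```

The losing writer's extraction is wasted work in this shape, which is the price of not holding the lock during extraction; its staged content would need to be cleaned up.
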

@Luap99
Member Author

Luap99 commented Oct 15, 2025

And I’m not immediately sure we need a callback — StageAddition could return a newStagingPathToPopulate — but I also didn’t now carefully re-read the tempdir package.)

Yeah, my thinking was that the callback provides a "lifetime" during which the path is safe to use; if I return a string/struct with the path, then the caller can clean up/commit and still use the path afterwards. This is really where I start to hate Go, because in Rust this would be trivial to enforce, so that there could only ever be one call to commit, which would render the object useless afterwards.

But yes, usage-wise this callback is indeed getting quite ugly, to the point where just returning the path is much simpler and, well, how Go works in general. I do like the suggestion of just returning the path to consolidate both tmpdir functions into one, so I will go with that.
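
For what it is worth, the "only one commit" rule can at least be checked at runtime in Go. The sketch below is one hypothetical way to do that; `stagedAddition` and its methods are illustrative names, not the tempdir package API. Unlike the Rust approach, nothing stops a caller from keeping a copy of the path string, and the type is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

var errFinished = errors.New("staged addition was already committed or cleaned up")

// stagedAddition is a hypothetical handle for a path that is being staged.
type stagedAddition struct {
	path     string
	finished bool
}

// newStagedAddition reserves a staging path under tempRoot. For simplicity
// this sketch creates a directory there.
func newStagedAddition(tempRoot string) (*stagedAddition, error) {
	p, err := os.MkdirTemp(tempRoot, "add-")
	if err != nil {
		return nil, err
	}
	return &stagedAddition{path: p}, nil
}

// Path returns the staging path while the addition is still pending.
func (s *stagedAddition) Path() (string, error) {
	if s.finished {
		return "", errFinished
	}
	return s.path, nil
}

// Commit renames the staged path to dest; it succeeds at most once.
func (s *stagedAddition) Commit(dest string) error {
	if s.finished {
		return errFinished
	}
	if err := os.Rename(s.path, dest); err != nil {
		return err
	}
	s.finished = true
	return nil
}

// Cleanup removes the staged path unless it was already committed.
func (s *stagedAddition) Cleanup() error {
	if s.finished {
		return nil
	}
	s.finished = true
	return os.RemoveAll(s.path)
}

// demoCommitTwice commits the same addition twice and returns both results.
func demoCommitTwice() (first, second error) {
	root, err := os.MkdirTemp("", "tempdir-")
	if err != nil {
		return err, err
	}
	defer os.RemoveAll(root)

	add, err := newStagedAddition(root)
	if err != nil {
		return err, err
	}
	defer add.Cleanup()

	first = add.Commit(filepath.Join(root, "final"))
	second = add.Commit(filepath.Join(root, "final-again"))
	return first, second
}

func main() {
	fmt.Println(demoCommitTwice())
}
```
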

Labels

storage Related to "storage" package

3 participants