-
Notifications
You must be signed in to change notification settings - Fork 43
Staged layer creation #378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6414 |
|
Podman PR containers/podman#27251 and the buildah test PR containers/buildah#6414 from the bot both look good so that means we can remove the special case from ApplyDiff() in overlay I think, ref containers/podman#25862 (comment) I still need to work on the actual feature here though to extract while the store in unlocked. |
mtrmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK, simplifying ApplyDiff this way does look correct. (I didn’t carefully look at the tempdir addition yet.)
348a11e to
b7780f2
Compare
mtrmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m mostly looking because I was curious — feel free to disregard.
The tar-split comment might explain some of the “unexpected EOF” test failures.
b7780f2 to
bbb2266
Compare
|
@mtrmac FYI I have not really addressed most of your comments yet, I am just trying to push things to see how much things break. Still seeing plenty of test failures. Issue 1 I see is that I just use the 700 permission from the tmpdir due the rename instead of the proper diff dir creation permissions that are in the driver.create() code Not sure if I should expose that into the tmpdir creation logic, I guess that makes the most sense since only the dirver should now the exact permission that should be used? Second problem I see are timeouts (in parallel running tests) which I guess mean I added a deadlock situation? I guess looking at the code this unlock/lock again thing I did is indeed completely broken and unsafe due ABBA deadlock, i.e. in putlayer we also hold the containerStore lock so only unlocking the layer store makes it possible that another process can get the layer lock and then blocks on the still gold container store thus both process handing forever. |
I think that could work. I was thinking
Per the locking hierarchy documented at the top of |
Yeah my thinking was that the callback provides a "lifetime" of when the path is safe to use, if I return a string/struct with the path then the caller can cleanup/commit and then still use the path afterwards. This is really where I start to hate go because in rust this would be trivial to enforce so that there could only ever be one call to commit and then render the object useless afterwards. But yes usage wise this callback is indeed getting quite ugly to the point where just returning the path is much simpler and well how go works in general. I do like the suggestion of just returning the path to consolidate both tmpdir functions into one so I will go with that. |
bbb2266 to
74d0e97
Compare
|
@mtrmac I will push this into podman and run more tests tomorrow but I think like this it should be workable now. I fix the minor lint issues here of course on the next push. Let me know if this approach seem right to you, I guess the code could need some more better comments/function names likely. |
74d0e97 to
5cf326c
Compare
5cf326c to
e60d339
Compare
|
Ok last issue noticed in podman the idmapping logic cannot be implemented unlocked I fear. We have this code in putLayer() if options.HostUIDMapping {
options.UIDMap = nil
}
if options.HostGIDMapping {
options.GIDMap = nil
}
uidMap := options.UIDMap
gidMap := options.GIDMap
if parent != "" {
var ilayer *Layer
for _, l := range append([]roLayerStore{rlstore}, rlstores...) {
lstore := l
if lstore != rlstore {
if err := lstore.startReading(); err != nil {
return nil, -1, err
}
defer lstore.stopReading()
}
if l, err := lstore.Get(parent); err == nil && l != nil {
ilayer = l
parent = ilayer.ID
break
}
}
if ilayer == nil {
return nil, -1, ErrLayerUnknown
}
parentLayer = ilayer
if err := s.containerStore.startWriting(); err != nil {
return nil, -1, err
}
defer s.containerStore.stopWriting()
containers, err := s.containerStore.Containers()
if err != nil {
return nil, -1, err
}
for _, container := range containers {
if container.LayerID == parent {
return nil, -1, ErrParentIsContainer
}
}
if !options.HostUIDMapping && len(options.UIDMap) == 0 {
uidMap = ilayer.UIDMap
}
if !options.HostGIDMapping && len(options.GIDMap) == 0 {
gidMap = ilayer.GIDMap
}
} else {
// FIXME? It’s unclear why we are holding containerStore locked here at all
// (and because we are not modifying it, why it is a write lock, not a read lock).
if err := s.containerStore.startWriting(); err != nil {
return nil, -1, err
}
defer s.containerStore.stopWriting()
if !options.HostUIDMapping && len(options.UIDMap) == 0 {
uidMap = s.uidMap
}
if !options.HostGIDMapping && len(options.GIDMap) == 0 {
gidMap = s.gidMap
}
}
if s.canUseShifting(uidMap, gidMap) {
options.IDMappingOptions = types.IDMappingOptions{HostUIDMapping: true, HostGIDMapping: true, UIDMap: nil, GIDMap: nil}
} else {
options.IDMappingOptions = types.IDMappingOptions{
HostUIDMapping: options.HostUIDMapping,
HostGIDMapping: options.HostGIDMapping,
UIDMap: copySlicePreferringNil(uidMap),
GIDMap: copySlicePreferringNil(gidMap),
}
}However we extract the layer with the caller specified I guess the s.uidMap case could be done without a lock but not the parent lookups? So I guess the simple solution would be to only use the unlocked extract path when options.HostGIDMapping is true. |
|
(I didn’t look at the current code in this PR yet.) AFAICS IIRC the plan was to start layer creation with an ID lookup (so that we don’t start expensively staging it if it already exists), so the parent lookup could be done within the same lock scope. |
storage/layers.go
Outdated
| Mappings: idtools.NewIDMappingsFromMaps(layerOptions.UIDMap, layerOptions.GIDMap), | ||
| // FIXME: What to do here? We have no lock and assigned label yet. | ||
| // Overlayfs should not need it anyway so this seems fine for now. | ||
| MountLabel: "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is correct
mtrmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading this commit by commit, this looks really great — the comments about documenting locking semantics etc. are basically the final polish.
(Note to self: Eventually it might be worth re-reading the final state as is, to check whether there is any opportunity to simplify.)
Around #378 (comment) and more recently with the parent’s mapping there was some tentative discussion about checking whether the layer exists before deciding to stage it — that’s still to be decided, I think. (In c/image, commitLayer does do a layer presence check before deciding to create it — but in case it is reusing an existing local layer by extracting it into a temporary tarball to be applied, there is still quite a window in which the layer could be concurrently created. Of course, c/image can add one more check to its caller — but if we happened to take locks to read the parent’s state, a lookup for an ID already existing would be ~free.)
storage/store.go
Outdated
| } | ||
| }() | ||
| // FIXME: type case should be safe for now but really there should be a better way to do this | ||
| err = m.stageWithUnlockedStore(rlstore.(*layerStore), lOptions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#378 (comment) talks about ID mapping values from the parent layer — if I’m reading the code right, that does not yet happen.
e60d339 to
91fcdbb
Compare
91fcdbb to
651c8f5
Compare
#!/usr/bin/env bpftrace
kfunc:fcntl_setlk
{
$lockname = str(args->filp->f_path.dentry->d_name.name);
if ($lockname == "storage.lock" || $lockname == "layers.lock" ||
$lockname == "images.lock" || $lockname == "containers.lock" ) {
@blockedTime[tid, $lockname] = nsecs;
}
}
kretfunc:fcntl_setlk
{
$lockname = str(args->filp->f_path.dentry->d_name.name);
if ($lockname == "storage.lock" || $lockname == "layers.lock" ||
$lockname == "images.lock" || $lockname == "containers.lock" ) {
// lock duration in msec
$lock_duration = (nsecs - @blockedTime[tid, $lockname])/1000000;
if ($lock_duration) {
printf("blocked %s time: %lld msec\n", $lockname, $lock_duration);
}
delete(@blockedTime[tid, $lockname]);
@lockholdTime[pid, args->fd] = (nsecs, $lockname);
}
}
tracepoint:syscalls:sys_enter_close
{
$a = @lockholdTime[pid, args->fd];
if ($a.0) {
// lock duration in msec
$lock_duration = (nsecs - $a.0)/1000000;
if ($lock_duration) {
printf("lock %s time: %lld msec\n", $a.1, $lock_duration);
}
delete(@lockholdTime[pid, args->fd]);
}
}FYI this is the script I used to measure lock times during my demo last week, not sure if there is a decent place to store this. I guess it could be helpful in the future to find more lock contention, should I add it under hack/ maybe? |
|
Podman/buildah tests seems to be passing (minus unrelated flakes AFAICT) This should be good to review again. I think I addressed the comments. |
mtrmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Highlights:
- Locking stores that contain the parent layer while creating a child
- The pre-existing discussion about
MountLabel— I think the current feature set is fine, butstageWithUnlockedStoreshould refuse to stage if the input provides amountLabel, just to be sure we are not introducing a problem.
There are some minor pre-existing comments, and (note to self) I’d like to read the resulting code as a whole.
storage/internal/tempdir/tempdir.go
Outdated
| // struct. The returned type StagedAddition has a Commit() function to move the content from | ||
| // the temporary location to the final one. | ||
| // | ||
| // The caller MUST call Commit() before Cleanup(0 is called on the TempDir, otherwise the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // The caller MUST call Commit() before Cleanup(0 is called on the TempDir, otherwise the | |
| // The caller MUST call Commit() before Cleanup() is called on the TempDir, otherwise the |
storage/internal/tempdir/tempdir.go
Outdated
| if td.tempDirLock == nil { | ||
| return nil, fmt.Errorf("temp dir instance not initialized or already cleaned up") | ||
| } | ||
| fileName := fmt.Sprintf("%d-", td.counter) + "addition" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| fileName := fmt.Sprintf("%d-", td.counter) + "addition" | |
| fileName := fmt.Sprintf("%d-addition", td.counter) |
(or, hypothetically, have the caller provide a name, but I doubt that would be worth the effort)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switch this to strconv.FormatUint(td.counter, 10) + "-addition", while it shouldn't matter here it is more efficient as it doesn't need runtime type checking due the interface that the sprintf needs
storage/drivers/overlay/overlay.go
Outdated
|
|
||
| st := idtools.Stat{IDs: idPair, Mode: defaultPerms} | ||
|
|
||
| if parent != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part was copied into getLayerPermissions, so I think it can be deleted here.
| return t.Cleanup, nil, -1, err | ||
| } | ||
|
|
||
| _, forcedSt, st, err := d.getLayerPermissions(parent, options.Mappings.UIDs(), options.Mappings.GIDs()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This calls GetContainersOverrideXattr and stat on d.dir(parent), without holding locks.
I think that’s fine if we know that the parent layer existed = was previously fully created — otherwise we might race against Driver.create setting the fields, and see incorrect interim values. And we do know that, due to populateLayerOptions — so that’s all working fine, but it’s a very remote dependency, so worth carefully documenting in the StartStagingDiffToApply function and in callers across the call stack.
(If the layer existed and is later removed, we can fail with an error, but we won’t see invalid data. Failing with an error is semantically correct, creating a layer against a missing parent should fail.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, one thing I wonder should we turn that ENOENT into ErrLayerUnknown? Callers that currently match that would still have the possibility to retry and create the parent while an ENOENT could mean anything.
| if err != nil { | ||
| return -1, err | ||
| } | ||
| defer tarSplitFile.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m wondering whether this should have error checks… generally, the kernel can defer writes until close(2). OTOH we call .Sync(), so that should be good enough.
Just as a matter of comfort / lack of risk, I’d prefer error checks on the close operations, but I can’t think of a specific reason why that should matter.
(If this were changed, applies also to the other use of createTarSplitFile.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should good with sync(). I can certainly add error checks but also that pattern of missing error checks for close is far more than this call here where it shouldn't matter due the sync.
storage/store.go
Outdated
| defer rlstore.stopWriting() | ||
| return s.putLayer(rlstore, rlstores, id, parent, names, mountLabel, writeable, lOptions, diff, nil) | ||
|
|
||
| if parentLayer == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if options == nil, isn’t it? Some layers have no parents.
And then the comment just below can be dropped.
| if !compareIdMappings(freshLayer.UIDMap, parentLayer.UIDMap) || !compareIdMappings(freshLayer.GIDMap, parentLayer.GIDMap) { | ||
| // Fatal problem. Mappings changed so the parent must be considered different now. | ||
| // Since we consumed the diff there is no we to recover, return error to caller. The caller would need to retry. | ||
| // How likely is that and would need to return a special error so c/image could do the retries? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This should currently not happen for c/image-created layers, because those have content-based IDs. It could happen only with a non-c/image writer that does something different. But I don’t know whether we will continue to have deterministic IDs with the sha512 work, I suspect we might not.
Either way, c/image pulls are currently not resilient to concurrent layer removals, so a hard failure here is ~fine. We’d need some kind of “lease a layer” feature that allows the pull to hold a found parent layer while downloading a child, and that does not exist.)
| @@ -1,52 +0,0 @@ | |||
| package graphdriver | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Unrelated: It turns out there are ~no non-trivial implementations of CreateFromTemplate now, so that method could be deleted and the caller simplified a bit not to compute its extra parameters. But that’s way out of scope.)
| // a layer potentially before we even take a look if the driver implements the | ||
| // ApplyDiffStaging interface. | ||
| // This should be initialized with layerStore.newMaybeStagedLayerExtraction() | ||
| type maybeStagedLayerExtraction struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Absolutely non-blocking: it might make sense to inline this into stagedLayerOptions/layerCreationContents/whatever — I think it would be nice, long term, for the chunked and non-chunked staging paths to be unified, by making the chunked code also use StagedAddition and the like. [That’s 100% out of scope.]
OTOH right now the semantics of “caller must call .cleanup()” makes more sense when the two structs are distinct.)
storage/store.go
Outdated
| // How likely is that and would need to return a special error so c/image could do the retries? | ||
| return nil, -1, fmt.Errorf("error during staged layer apply, parent layer %q changed id mappings while the content was extracted, must retry layer creation", parent) | ||
| } | ||
| // FIXME: do we need to validate something else that would affect the extracted content? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but it’s a somewhat remote dependency on populateLayerOptions; I’d like some fairly explicit comments linking the ID mapping comparison condition and the code in populateLayerOptions… or maybe even wrapping this condition into populateLayerOptionsInputsUnchanged function.
struct populateLayerOptionsInput { uidMap, gidMap }
func (… populateLayerOptionsInput) equals(populateLayerOptionsInput) bool {…}
func populateLayerOptionsInputFromLayer(*Layer) populateLayerOptionsInput {…}
// only consume parentLayer via populateLayerOptionsInput(parentLayer) in populateLayerOptionsInputwould be way too overengineered and not worth the mental overhead, but I think we need something.
Add a new function to stage additions. This should be used to extract the layer content into a temp directory without holding the storage lock and then under the lock just rename the directory into the final location to reduce the lock contention. Signed-off-by: Paul Holzinger <[email protected]>
It is not clear to me when it will hit the code path there, by normal layer creation we always pass a valid parent so this branch is never reached AFAICT. Let's remove it and see if all tests still pass in podman, buildah and others... Signed-off-by: Paul Holzinger <[email protected]>
Split out the layer permission gathering from the main create() function so it can be reused elsehwere, see the next commit. Signed-off-by: Paul Holzinger <[email protected]>
Add a function to apply the diff into a tmporary directory so we can do that unlocked and only rename under the lock. Signed-off-by: Paul Holzinger <[email protected]>
I cannot see any reason why we should buffer the full tar split content in memory before writing it. That layer is still mark partial at this point and the store is locked so there is no concurrent access either thus we do not need the atomic rename here. Signed-off-by: Paul Holzinger <[email protected]>
Split it into multiple function to make it reusable without having a layer and so that it can be used unlocked see the following commits. Signed-off-by: Paul Holzinger <[email protected]>
The extracting of the tar under the store lock is a bottleneck as many concurrent processes might hold the locks for a long time on big layers. To address this move the layer extraction before we take the locks if possible. Currently this only work when using the overlay driver as the implementation requires driver specifc details in order for a rename() to work. Signed-off-by: Paul Holzinger <[email protected]>
It doesn't seem needed here so don't take it. Signed-off-by: Paul Holzinger <[email protected]>
A minor rework to enable more changes in following commits. Note the caller still must hold the layer store locks so ensure we return the layer locked and let the caller unlock instead. Signed-off-by: Paul Holzinger <[email protected]>
Make it reusable for other callers, see next commit. Also while at it remove the dedupeStrings() call, as pointed out by Miloslav the work it is doing is more expensive than just checking the same name several times as it does a O(1) map lookup. Also most callers won't pass duplicated names to begin with. Signed-off-by: Paul Holzinger <[email protected]>
Just be safe based on the review feedback from the PR. containers#378 Signed-off-by: Paul Holzinger <[email protected]>
be42227 to
9c7995a
Compare
|
✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6508 |
Just be safe based on the review feedback from the PR. containers#378 Signed-off-by: Paul Holzinger <[email protected]>
9c7995a to
bac8c67
Compare
The untar can be quite expensive so check for id, name conflicts right away. Also we must populate the idmappings so we extract with the right uids/gids. Signed-off-by: Paul Holzinger <[email protected]>
This function was added in commit c577a81 and used by older drivers we no longer suppor, such as aufs and windows. As such this is dead code and can be removed. Signed-off-by: Paul Holzinger <[email protected]>
It is unused in all drivers now, so it can be removed. Signed-off-by: Paul Holzinger <[email protected]>
We use this this typo all the time now so make the naming a bit more clear. Signed-off-by: Paul Holzinger <[email protected]>
Just be safe based on the review feedback from the PR. containers#378 Signed-off-by: Paul Holzinger <[email protected]>
bac8c67 to
ceb1809
Compare
No description provided.