Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rbd: Add timeout for cryptsetup commands #4912

Merged
merged 1 commit into from
Nov 5, 2024

Conversation

black-dragon74
Copy link
Member

@black-dragon74 black-dragon74 commented Oct 17, 2024

Describe what this PR does

This PR introduces a new wrapper around LUKS utility functions that allow us to have context aware subsequent operations, util.NewLuksWrapperWithContext. This context is shared across the functions exposed by this wrapper.

A sample use case is to have a context that times out after certain period of time.

How to Test

We can simulate a timeout by replacing the cryptsetup binary with a wrapper script that introduces some delay.

  • Exec into the pod where the cryptsetup binary is located (rbd-plugin in case of key rotation)
  • mv /usr/sbin/cryptsetup /usr/sbin/cryptsetup_orig
  • Create a new /usr/sbin/cryptsetup and paste the following:
#!/usr/bin/env bash

set -m

sleep 140

exec /usr/sbin/cryptsetup_orig "$@"
  • chmod +x /usr/sbin/cryptsetup
  • When the process is killed due to timeout you will see a log like:
timeout occurred while running cryptsetup args: [...]

@mergify mergify bot added the component/rbd Issues related to RBD label Oct 17, 2024
Copy link
Member

@nixpanic nixpanic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me. Did you manage to test this somehow, and see how everything gets corrected by a different container?

@Madhu-1
Copy link
Collaborator

Madhu-1 commented Oct 18, 2024

This looks reasonable to me. Did you manage to test this somehow, and see how everything gets corrected by a different container?

we need to ensure that the CLI command we start is killed and doesn't continue to execute, please test it the changes to ensure that it works as expected and also provide some test results on how this is tested

const cryptsetupPBKDFMemoryLimit = 32 << 10 // 32768 KiB
const (
// Maximum time to wait for cryptsetup commands to complete.
CryptSetupExecutionTimeout = 3 * time.Minute
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets keep the timeout which is almost less than the GRPC timeout between the sidecar and the cephcsi so that we try and if it didn't work as will kill the process before the next GRPC kicks in on any other nodes/same node.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea. Let's keep them a minute apart then? Accounting for any delays that might occur in sending gRPC request and also to keep some buffer period after sending SIGKILL?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minute might be too much, we can reduce it by 30 seconds to be on safer side, @nixpanic thoughts?

@black-dragon74
Copy link
Member Author

This looks reasonable to me. Did you manage to test this somehow, and see how everything gets corrected by a different container?

Trying but not able to yet. I am thinking of having a wrapper to cryptsetup that sleeps for some time and then tries to execute the intended command. The key is to kill the process in midst of operations.

I will test it and report.

@nixpanic
Copy link
Member

nixpanic commented Oct 18, 2024

Trying but not able to yet. I am thinking of having a wrapper to cryptsetup that sleeps for some time and then tries to execute the intended command. The key is to kill the process in midst of operations.

I also think the only way reproduce this, is by adding a wrapper for the command. It helps to make the command do something that can be verified after aborting, maybe something like this (untested):

#!/bin/sh
#
# Assume a 1GB block-device, write unencrypted data at 500MB offset.
#

# run normal crypsetup, it will be quick and exits cleanly
cryptsetup.real $@

# run dd in a loop, small I/O with sync to make it slower
# only runs the loop if /tmp/testing exists
while [ -e /tmp/testing ]
do
    # set correct device and verify parameters
    dd if=/dev/urandom of=/dev/rbd... bs=1 count=100m offset=500m
done

Make sure to name the script cryptsetup and place is in the location where cryptsetup normally lives. The original cryptsetup executable should be renamed to cryptsetup.real so that it still can be executed.

@nixpanic
Copy link
Member

Hey @black-dragon74 , I'll mark this as a draft for now, otherwise I keep checking it for updates. Please move it back as ready for review once you have tested it. Thanks!

@nixpanic nixpanic marked this pull request as draft October 22, 2024 14:40
@black-dragon74 black-dragon74 marked this pull request as ready for review October 28, 2024 09:30
@black-dragon74 black-dragon74 force-pushed the cryptsetup-add-timeout branch 4 times, most recently from 943698e to 911d52f Compare October 28, 2024 11:42
Madhu-1
Madhu-1 previously approved these changes Oct 30, 2024
Copy link
Collaborator

@Madhu-1 Madhu-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@nixpanic nixpanic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming the type and functions make it better, please take care of that.

The suggestion with the interface and moving to a separate package are optional and can be done in a follow-up PR if you prefer.

internal/util/cryptsetup.go Outdated Show resolved Hide resolved
internal/util/cryptsetup.go Outdated Show resolved Hide resolved
internal/util/cryptsetup.go Outdated Show resolved Hide resolved
@mergify mergify bot dismissed Madhu-1’s stale review November 4, 2024 07:22

Pull request has been modified.

@black-dragon74 black-dragon74 force-pushed the cryptsetup-add-timeout branch 2 times, most recently from f42438d to 22bcec3 Compare November 4, 2024 07:58
Copy link
Contributor

@Rakshith-R Rakshith-R left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just need to start context before obtaining locks and use the same context for get/put calls to kms.

Comment on lines 508 to 512
// Create a new luks wrapper
timeoutCtx, cancel := context.WithTimeout(ctx, cryptsetup.ExecutionTimeout)
luks := cryptsetup.NewLUKSWrapper(timeoutCtx)
defer cancel()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done before obtaining locks at line 481.
Pass the same context to get/put cryptopassphrase calls as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@nixpanic
Copy link
Member

nixpanic commented Nov 5, 2024

@Mergifyio rebase

This PR modifies the execCryptSetupCommand so that
the process is killed in an event of lock timeout.

Useful in cases where the volume lock is released but
the command is still running.

Signed-off-by: Niraj Yadav <[email protected]>
Copy link
Contributor

mergify bot commented Nov 5, 2024

rebase

✅ Branch has been successfully rebased

@nixpanic nixpanic dismissed Rakshith-R’s stale review November 5, 2024 09:49

Rakshits concern has been addressed

@nixpanic
Copy link
Member

nixpanic commented Nov 5, 2024

@Mergifyio queue

Copy link
Contributor

mergify bot commented Nov 5, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 1c02e69

@mergify mergify bot added the ok-to-test Label to trigger E2E tests label Nov 5, 2024
@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.31

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.31

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.29

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.31

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.29

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.30

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.29

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.30

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.30

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-cephfs

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-rbd

@ceph-csi-bot ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Nov 5, 2024
@mergify mergify bot merged commit 1c02e69 into ceph:devel Nov 5, 2024
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/rbd Issues related to RBD
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants