feat(smb): add volume isolation and stage/unstage support to SMB CSI driver #943
base: master
Conversation
…driver

- Enables NodeServiceCapability_RPC_STAGE_UNSTAGE_VOLUME and ControllerServiceCapability_RPC_PUBLISH_UNPUBLISH_VOLUME
- Appends a SHA-256 hash of SMB credentials to the volume ID for credential-based volume isolation
- Updates NodePublishVolume to bind-mount from the staging path to the pod volume path
- Implements NodeUnstageVolume with a reference count check using /proc/mounts
- Preserves all existing SMB CSI functionality (Kerberos, GID, subDir handling)
Welcome @MattPOlson!
Hi @MattPOlson. Thanks for your PR. I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MattPOlson. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/ok-to-test
The SMB CSI driver does not implement ControllerPublishVolume or ControllerUnpublishVolume since SMB shares are mounted directly by nodes and do not require controller-side attach/detach. Removing the ControllerServiceCapability_RPC_PUBLISH_UNPUBLISH_VOLUME capability avoids advertising unsupported functionality and allows the sanity test suite to pass. This change does not impact node-side STAGE_UNSTAGE_VOLUME support, which remains fully functional.
…ort Windows and Darwin

Refactored NodeUnstageVolume to use a platform-aware HasMountReferences() helper, moving Linux-specific /proc/mounts parsing into smb_common_linux.go and stubbing it out for Windows and Darwin to prevent test failures on non-Linux environments.

- Fixes Windows e2e failures due to /proc/mounts not being available
- Ensures future compatibility for multi-platform CSI driver builds
- Preserves original Linux mount reference tracking behavior
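A minimal single-file sketch of the platform-aware split described above. The real driver splits this across per-OS files (smb_common_linux.go plus stubs) rather than branching on runtime.GOOS, and the prefix-based scan shown here reflects the initial heuristic only; names are illustrative, not the driver's exact code.

```go
package main

import (
	"os"
	"runtime"
	"strings"
)

// hasMountReferences reports whether any other mount still references the
// share staged at stagingTargetPath. On platforms without /proc/mounts
// (Windows, Darwin) it is effectively stubbed out and reports no references.
func hasMountReferences(stagingTargetPath string) (bool, error) {
	if runtime.GOOS != "linux" {
		// /proc/mounts does not exist off Linux; skip the reference check.
		return false, nil
	}
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		return false, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		// fields[1] is the mount point; look for mounts nested under the
		// staging path (the initial, prefix-based heuristic).
		if len(fields) >= 2 && fields[1] != stagingTargetPath &&
			strings.HasPrefix(fields[1], stagingTargetPath) {
			return true, nil
		}
	}
	return false, nil
}
```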
Pull Request Test Coverage Report for Build 14562959558

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls
@MattPOlson thanks for the fix, some comments:

> Adds credential-based hashing to the volume ID so PVCs using the same share + same credentials resolve to the same global mount.

From my understanding, this behavior change does not solve the issue, since in the common scenario one file share uses one credential.

> Uses NodeStageVolume + bind mounts to avoid duplicate mounts entirely.

I don't find any code change in the NodeStageVolume function in your PR.

> Updates NodeUnstageVolume to safely unmount the global share only when it's no longer referenced by any bind mount.

I think this is the most important fix that solves this issue! Could you make only this change in the PR? We could skip the Windows-related change, since Windows does not allow multiple mounts on the same SMB share.
Thanks
Thanks for the thoughtful feedback, really appreciate the detailed review.

You're absolutely right that the NodeUnstageVolume() enhancement is the core fix. That reference-checking logic is what ensures the driver does not prematurely unmount the global SMB share while bind mounts from other pods are still active. Without it, the kubelet can misinterpret active usage and trigger unmount errors, especially when multiple pods or workloads interact with the same SMB share.

Regarding the credential-based hashing: I added that to address a secondary concern mentioned in #353, specifically the risk of a new PVC unintentionally reusing a stale global mount that was originally mounted with different credentials. While that scenario is rare (most real-world use cases use a single credential per share), I wanted to guard against accidental credential leakage. That said, I'm completely fine with removing this from the PR to keep the scope focused.

On the NodeStageVolume() question: confirmed, there were no functional changes made to that method in this PR. All staging behavior remains as it was upstream. I originally had an idea around that but later changed my mind.

The Windows-related changes were added to ensure both build compatibility and correct runtime behavior across platforms. The reference-checking logic in NodeUnstageVolume() depends on /proc/mounts, which only exists on Linux, so the Linux-specific parsing was split into its own file and stubbed on other platforms to keep the logic portable and safe.
In summary, I'm happy to:

- Keep this PR focused on just the NodeUnstageVolume() fix (with required platform support)
- Remove the credential hashing logic

Let me know if you'd prefer this rebased down to just the essential fix, I can prep that immediately. Thanks again for helping guide this improvement!
```diff
@@ -48,3 +51,23 @@ func prepareStagePath(_ string, _ *mount.SafeFormatAndMount) error {
 func Mkdir(_ *mount.SafeFormatAndMount, name string, perm os.FileMode) error {
 	return os.Mkdir(name, perm)
 }
+
+func HasMountReferences(stagingTargetPath string) (bool, error) {
```
Can you add an explanation for this func? I still doubt how you solved the problem by only searching stagingTargetPath. In my testing environment (with 2 replicas referencing one PVC), the bind mounts look like the following; you need to count the mounts of the target share during NodeUnstageVolume to make sure there are no other mounts referencing the smb share.

```
# cat /proc/mounts | grep smb
//smb-server.default.svc.cluster.local/share/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9 /var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/085ffd65e2835034cdf2a23f67a498673427c7497def5d808bfb505c0df0b1a4/globalmount cifs rw,relatime,vers=3.1.1
//smb-server.default.svc.cluster.local/share/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9 /var/lib/kubelet/pods/4aad1717-a5cb-4f93-a28b-2a8a8c36d1bd/volumes/kubernetes.io~csi/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9/mount cifs rw,relatime,vers=3.1.1
//smb-server.default.svc.cluster.local/share/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9 /var/lib/kubelet/pods/2164da99-d68f-4886-9bcb-9bb9a42c844f/volumes/kubernetes.io~csi/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9/mount cifs rw,relatime,vers=3.1.1
```
@andyzhangx
You’re right that our initial implementation of HasMountReferences() was scanning /proc/mounts for paths prefixed by stagingTargetPath, which doesn't fully reflect how bind mounts work in real-world deployments. As you pointed out, kubelet performs bind mounts from the global path into pod-specific paths like /var/lib/kubelet/pods/.../volumes/.../mount, and these paths are not subdirectories of the staging path.
The correct approach here is to:

1. Parse /proc/mounts
2. Count the number of entries where the source (the SMB share URI) matches the mounted share
3. Only proceed with unmounting if stagingTargetPath is the last remaining mount target
I’ll update the implementation of HasMountReferences() to reflect this, by comparing entries that have the same mount source as the staging path. This aligns with how GetDeviceMountRefs() works internally in kubelet, and ensures that the global mount is only unmounted once all bind mounts are gone.
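As a sketch, that source-matching rule can be expressed as pure helpers over the mount-table text. The function names below are illustrative, not the driver's actual API, and the table is passed in as a string so the logic is testable without reading /proc/mounts.

```go
package main

import (
	"bufio"
	"strings"
)

// sourceOfMount returns the source (first field) of the mount-table entry
// whose mount point (second field) equals mountPoint, or "" if absent.
func sourceOfMount(mountTable, mountPoint string) string {
	sc := bufio.NewScanner(strings.NewReader(mountTable))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[1] == mountPoint {
			return fields[0]
		}
	}
	return ""
}

// countMountRefs counts entries whose source matches the given SMB share URI,
// regardless of where they are mounted (global mount or pod bind mounts).
func countMountRefs(mountTable, source string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(mountTable))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == source {
			n++
		}
	}
	return n
}

// safeToUnstage reports whether the staging mount is the last remaining
// reference to its share, i.e. whether NodeUnstageVolume may unmount it.
func safeToUnstage(mountTable, stagingTargetPath string) bool {
	source := sourceOfMount(mountTable, stagingTargetPath)
	if source == "" {
		return true // staging path no longer mounted; nothing to protect
	}
	return countMountRefs(mountTable, source) == 1
}
```

This mirrors the kubelet behavior the comment refers to: bind mounts share the staging mount's source, so counting by source catches pod mounts that are not path-prefixed by the staging directory.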
Thanks for flagging this — I’ll update the PR accordingly.
@MattPOlson thanks for the work, pls also provide the /proc/mounts examples when there are multiple PVCs using the same file share in the PR description, I think we only need this fix right now, thanks.
@andyzhangx
Thanks for the continued feedback and review.
After extensive testing and deeper inspection of /proc/mounts and /proc/self/mountinfo, I’ve confirmed that there is no reliable or portable way to determine which global mount path a pod bind mount is referencing when multiple global mounts exist for the same SMB share.
The Problem
When two PVCs reference the same SMB share (e.g., //smb-server/share) and get different volume handles, the driver ends up creating multiple global mounts like:
```
/var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/<guid1>/globalmount
/var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/<guid2>/globalmount
```
Then, all pod mounts — regardless of which global mount they bind from — show up in /proc/mounts with the same source:
```
source: //smb-server/share
device: 0:334
```
So:

- We can detect that the share is still in use.
- But we cannot detect which global mount path is actually being used.
- There's no way to know which global mount is safe to unmount.

This leads to orphaned global mounts, errors in GetDeviceMountRefs, and premature or blocked unmounts.
Why Normalizing the Global Mount is the Fix
If the driver normalizes volume handles (e.g., by hashing the share + credentials), then:

- Kubernetes will use the same volumeHandle for identical mounts
- Only one global mount path is created per node per share
- All bind mounts share that one staging path
- Cleanup is simple: once all bind mounts are gone, the staging path is unmounted
This aligns with the behavior of block devices and is consistent with Kubernetes’ expectations for how NodeStageVolume and NodePublishVolume interact.
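A hedged sketch of that normalization step. The PR describes appending a SHA-256 hash of the SMB credentials to the volume ID; the exact handle format and function name below are illustrative, not the driver's actual encoding.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// normalizeVolumeHandle derives a volume handle from the share source plus a
// SHA-256 digest of the credentials. Identical share+credential pairs collapse
// to one handle (hence one global mount per node), while different credentials
// for the same share remain isolated from each other.
func normalizeVolumeHandle(source, username, password string) string {
	// NUL separator avoids ambiguity between e.g. ("ab","c") and ("a","bc").
	sum := sha256.Sum256([]byte(username + "\x00" + password))
	return fmt.Sprintf("%s#%s", source, hex.EncodeToString(sum[:8]))
}
```

Because the handle only carries a digest, the credentials themselves never appear in the volume ID, which also addresses the credential-leakage concern mentioned earlier in the thread.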
Summary
Without volumeHandle normalization, there's no reliable mechanism to correlate bind mounts back to the correct staging path — especially with CIFS, where device IDs and mount sources are always the same.
That’s why I strongly recommend restoring the normalization logic. It’s the only way to ensure safe, deterministic cleanup of global mounts in multi-PVC, multi-pod SMB scenarios.
@MattPOlson thanks for the detailed explanations, pls only make necessary changes to fix the original issue.
Thanks again for the review. I'd like to clarify why the volumeHandle normalization is essential to fully solving the problem we're seeing in production, even with the upstream behavior.

How This Happens with the Current Upstream Driver

Multiple PVCs, each with unique volumeHandles, stage the same SMB share. This results in multiple global mount paths, such as:

```
/plugins/.../<guid1>/globalmount
/plugins/.../<guid2>/globalmount
```

But from the Linux kernel's perspective, all those mounts reference the same CIFS device. When one PVC is deleted, the kubelet tries to unmount its corresponding globalmount path. However, GetDeviceMountRefs() detects that the same source is still mounted in other locations, and kubelet refuses to unmount, thinking it's still in use. We've confirmed this in kubelet logs.
This behavior becomes more common when Helm charts, templates, or automation generate many PVCs that point to the same SMB share.

How volumeHandle Normalization Solves This

- The same share + creds → same volumeHandle
- Only one global mount is created per node
- Additional PVCs referencing the same share use bind mounts only (via NodePublishVolume)
- NodeUnstageVolume only unmounts when no other pods are using the mount

This approach has several advantages.
TL;DR

- Linux sees them as the same device and blocks unmounts
- Our changes resolve this by ensuring the share is only mounted once per node
- The fix is safe, backward-compatible, and improves both reliability and security

Happy to reintroduce this logic in a scoped commit if you'd like to review it in isolation — thanks again for considering!
What type of PR is this?
/kind feature
/kind design
What this PR does / why we need it:
This PR enhances the SMB CSI driver to support STAGE_UNSTAGE_VOLUME, enabling a single SMB share mount per node with bind mounts for each pod. It also introduces credential-based volume ID hashing to isolate mounts for identical shares accessed with different credentials.
Which issue(s) this PR fixes:
Fixes #353
Fixes #935
Requirements:
Special notes for your reviewer:
This is a non-breaking change that enhances driver correctness and security without removing or deprecating existing functionality. Happy to add follow-up documentation or test case references.
Release note:
testresults.txt