Skip to content

feat(smb): add volume isolation and stage/unstage support to SMB CSI … #943

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

MattPOlson
Copy link

…driver

  • Enables NodeServiceCapability_RPC_STAGE_UNSTAGE_VOLUME and ControllerServiceCapability_RPC_PUBLISH_UNPUBLISH_VOLUME
  • Appends SHA-256 hash of SMB credentials to volume ID for credential-based volume isolation
  • NodePublishVolume updated to bind-mount from staging path to pod volume path
  • Implements NodeUnstageVolume with reference count check using /proc/mounts
  • Preserves all existing SMB CSI functionality (Kerberos, GID, subDir handling)

What type of PR is this?
/kind feature
/kind design

What this PR does / why we need it:
This PR enhances the SMB CSI driver to support STAGE_UNSTAGE_VOLUME, enabling a single SMB share mount per node with bind mounts for each pod. It also introduces credential-based volume ID hashing to isolate mounts for identical shares accessed with different credentials.

Which issue(s) this PR fixes:
Fixes #353
Fixes #935

Fixes #

Requirements:

Special notes for your reviewer:
This is a non-breaking change that enhances driver correctness and security without removing or deprecating existing functionality. Happy to add follow-up documentation or test case references.

Release note:

none

testresults.txt

…driver

- Enables NodeServiceCapability_RPC_STAGE_UNSTAGE_VOLUME and ControllerServiceCapability_RPC_PUBLISH_UNPUBLISH_VOLUME
- Appends SHA-256 hash of SMB credentials to volume ID for credential-based volume isolation
- NodePublishVolume updated to bind-mount from staging path to pod volume path
- Implements NodeUnstageVolume with reference count check using /proc/mounts
- Preserves all existing SMB CSI functionality (Kerberos, GID, subDir handling)
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. labels Apr 17, 2025
Copy link

linux-foundation-easycla bot commented Apr 17, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added kind/design Categorizes issue or PR as related to design. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 17, 2025
@k8s-ci-robot
Copy link
Contributor

Welcome @MattPOlson!

It looks like this is your first PR to kubernetes-csi/csi-driver-smb 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-csi/csi-driver-smb has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 17, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @MattPOlson. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MattPOlson
Once this PR has been reviewed and has the lgtm label, please assign jingxu97 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 17, 2025
@gnufied
Copy link

gnufied commented Apr 18, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 18, 2025
The SMB CSI driver does not implement ControllerPublishVolume or ControllerUnpublishVolume
since SMB shares are mounted directly by nodes and do not require controller-side attach/detach.

Removing the ControllerServiceCapability_RPC_PUBLISH_UNPUBLISH_VOLUME capability avoids
advertising unsupported functionality and allows the sanity test suite to pass.

This change does not impact node-side STAGE_UNSTAGE_VOLUME support, which remains fully functional.
…ort Windows and Darwin

Refactored NodeUnstageVolume to use a platform-aware HasMountReferences() helper,
moving Linux-specific /proc/mounts parsing into smb_common_linux.go and stubbing
it out for Windows and Darwin to prevent test failures on non-Linux environments.

- Fixes Windows e2e failures due to /proc/mounts not being available
- Ensures future compatibility for multi-platform CSI driver builds
- Preserves original Linux mount reference tracking behavior
@coveralls
Copy link

Pull Request Test Coverage Report for Build 14562959558

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 61 of 91 (67.03%) changed or added relevant lines in 4 files are covered.
  • 23 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-2.3%) to 76.174%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/smb/controllerserver.go 8 10 80.0%
pkg/smb/smb_common_linux.go 12 16 75.0%
pkg/smb/nodeserver.go 39 63 61.9%
Files with Coverage Reduction New Missed Lines %
pkg/smb/smb_common_linux.go 6 61.76%
pkg/smb/nodeserver.go 17 68.49%
Totals Coverage Status
Change from base Build 14400027271: -2.3%
Covered Lines: 1119
Relevant Lines: 1469

💛 - Coveralls

Copy link
Member

@andyzhangx andyzhangx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MattPOlson thanks for the fix, some comments:

Adds credential-based hashing to the volume ID so PVCs using the same share + same credentials resolve to the same global mount.

from my understanding, this behavior change does not solve the issue since in common scenario one file share using one credential

Uses NodeStageVolume + bind mounts to avoid duplicate mounts entirely.

I don't find any code change in NodeStageVolume function in your PR

Updates NodeUnstageVolume to safely unmount the global share only when it's no longer referenced by any bind mount.

I think this is the most important fix that solves this issue! could you only make this change in the PR and we could skip windows related change since Windows does not allow multiple mounts on the same smb share.

Thanks

@MattPOlson
Copy link
Author

MattPOlson commented Apr 21, 2025

@andyzhangx

Thanks for the thoughtful feedback, really appreciate the detailed review.

You're absolutely right that the NodeUnstageVolume() enhancement is the core fix. That reference-checking logic is what ensures the driver does not prematurely unmount the global SMB share while bind mounts from other pods are still active. Without this, the kubelet can misinterpret active usage and trigger unmount errors, especially when multiple pods or workloads interact with the same SMB share.

Regarding the credential-based hashing, I added that to address a secondary concern mentioned in #353 specifically, the risk of a new PVC unintentionally reusing a stale global mount that was originally mounted with different credentials. While that scenario is rare (most real-world use cases use a single credential per share), I wanted to guard against accidental credential leakage. That said, I’m completely fine with removing this from the PR to keep the scope focused.

On the NodeStageVolume() question confirmed, there were no functional changes made to that method in this PR. All staging behavior remains as it was upstream. I originally had an idea around that but later changed my mind.

The Windows-related changes were added to ensure both build compatibility and correct runtime behavior across platforms. The reference-checking logic in NodeUnstageVolume() depends on /proc/mounts, which only exists on Linux. To make this logic portable and safe:

  • I introduced a platform-aware HasMountReferences() helper.

  • On Linux, it performs the actual /proc/mounts inspection to prevent premature unmounting.

  • On Windows and Darwin, it’s safely stubbed to return false, which aligns with their behavior, since Windows does not support multiple mounts of the same SMB share, this logic isn’t applicable.

  • These guards are necessary to ensure the driver remains both functional and portable across supported operating systems.

In summary, I'm happy to:

Keep this PR focused on just the NodeUnstageVolume() fix (with required platform support)

Remove the credential hashing logic

Let me know if you’d prefer this rebased down to just the essential fix, I can prep that immediately. Thanks again for helping guide this improvement!

@@ -48,3 +51,23 @@ func prepareStagePath(_ string, _ *mount.SafeFormatAndMount) error {
func Mkdir(_ *mount.SafeFormatAndMount, name string, perm os.FileMode) error {
return os.Mkdir(name, perm)
}

func HasMountReferences(stagingTargetPath string) (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add explanation for this func? I still doubt how did you solve the problem by only searching stagingTargetPath since in my testing environment (with 2 replicas referencing to one PVC), the bind mount would be like following, you need to count the target mount share during NodeUnstageVolume to make sure there are no other mounts referencing the smb share.

# cat /proc/mounts | grep smb
//smb-server.default.svc.cluster.local/share/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9 /var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/085ffd65e2835034cdf2a23f67a498673427c7497def5d808bfb505c0df0b1a4/globalmount cifs rw,relatime,vers=3.1.1
//smb-server.default.svc.cluster.local/share/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9 /var/lib/kubelet/pods/4aad1717-a5cb-4f93-a28b-2a8a8c36d1bd/volumes/kubernetes.io~csi/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9/mount cifs rw,relatime,vers=3.1.1
//smb-server.default.svc.cluster.local/share/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9 /var/lib/kubelet/pods/2164da99-d68f-4886-9bcb-9bb9a42c844f/volumes/kubernetes.io~csi/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9/mount cifs rw,relatime,vers=3.1.1

Copy link
Author

@MattPOlson MattPOlson Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add explanation for this func? I still doubt how did you solve the problem by only searching stagingTargetPath since in my testing environment (with 2 replicas referencing to one PVC), the bind mount would be like following, you need to count the target mount share during NodeUnstageVolume to make sure there are no other mounts referencing the smb share.

# cat /proc/mounts | grep smb
//smb-server.default.svc.cluster.local/share/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9 /var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/085ffd65e2835034cdf2a23f67a498673427c7497def5d808bfb505c0df0b1a4/globalmount cifs rw,relatime,vers=3.1.1
//smb-server.default.svc.cluster.local/share/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9 /var/lib/kubelet/pods/4aad1717-a5cb-4f93-a28b-2a8a8c36d1bd/volumes/kubernetes.io~csi/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9/mount cifs rw,relatime,vers=3.1.1
//smb-server.default.svc.cluster.local/share/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9 /var/lib/kubelet/pods/2164da99-d68f-4886-9bcb-9bb9a42c844f/volumes/kubernetes.io~csi/pvc-00a18b4a-977d-4ce8-a911-fc661a7332f9/mount cifs rw,relatime,vers=3.1.1

@andyzhangx
You’re right that our initial implementation of HasMountReferences() was scanning /proc/mounts for paths prefixed by stagingTargetPath, which doesn't fully reflect how bind mounts work in real-world deployments. As you pointed out, kubelet performs bind mounts from the global path into pod-specific paths like /var/lib/kubelet/pods/.../volumes/.../mount, and these paths are not subdirectories of the staging path.

The correct approach here is to:

Parse /proc/mounts

Count the number of entries where the source (the SMB share URI) matches the mounted share

Only proceed with unmounting if stagingTargetPath is the last remaining mount target

I’ll update the implementation of HasMountReferences() to reflect this, by comparing entries that have the same mount source as the staging path. This aligns with how GetDeviceMountRefs() works internally in kubelet, and ensures that the global mount is only unmounted once all bind mounts are gone.

Thanks for flagging this — I’ll update the PR accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MattPOlson thanks for the work, pls also provide the /proc/mounts examples when there are multiple PVCs using the same file share in the PR description, I think we only need this fix right now, thanks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyzhangx
Thanks for the continued feedback and review.

After extensive testing and deeper inspection of /proc/mounts and /proc/self/mountinfo, I’ve confirmed that there is no reliable or portable way to determine which global mount path a pod bind mount is referencing when multiple global mounts exist for the same SMB share.

The Problem
When two PVCs reference the same SMB share (e.g., //smb-server/share) and get different volume handles, the driver ends up creating multiple global mounts like:

/var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/<guid1>/globalmount
/var/lib/kubelet/plugins/kubernetes.io/csi/smb.csi.k8s.io/<guid2>/globalmount

Then, all pod mounts — regardless of which global mount they bind from — show up in /proc/mounts with the same source:

source: //smb-server/share
device: 0:334

So:

We can detect that the share is still in use.

But we cannot detect which global mount path is actually being used.

There’s no way to know which global mount is safe to unmount.

This leads to orphaned global mounts, errors in GetDeviceMountRefs, and premature or blocked unmounts.

Why Normalizing the Global Mount is the Fix
If the driver normalizes volume handles (e.g., by hashing the share + credentials), then:

Kubernetes will use the same volumeHandle for identical mounts

Only one global mount path is created per node per share

All bind mounts share that one staging path

Cleanup is simple: once all bind mounts are gone, the staging path is unmounted

This aligns with the behavior of block devices and is consistent with Kubernetes’ expectations for how NodeStageVolume and NodePublishVolume interact.

Summary
Without volumeHandle normalization, there's no reliable mechanism to correlate bind mounts back to the correct staging path — especially with CIFS, where device IDs and mount sources are always the same.

That’s why I strongly recommend restoring the normalization logic. It’s the only way to ensure safe, deterministic cleanup of global mounts in multi-PVC, multi-pod SMB scenarios.

@andyzhangx
Copy link
Member

@MattPOlson thanks for the detailed explanations, pls only make necessary changes to fix the original issue.

@MattPOlson
Copy link
Author

@MattPOlson thanks for the detailed explanations, pls only make necessary changes to fix the original issue.

@andyzhangx

Thanks again for the review, I’d like to clarify why the volumeHandle normalization is essential to fully solving the problem we're seeing in production, even with the upstream behavior.

How This Happens with the Current Upstream Driver
Although the upstream SMB CSI driver generates a unique volumeHandle per PVC, those handles can still reference the same underlying SMB share (e.g., //smb-server/share). This leads to the following behavior:

Multiple PVCs, each with unique volumeHandles, stage the same SMB share.

This results in multiple global mount paths, such as:

/plugins/...//globalmount

/plugins/...//globalmount

But from the Linux kernel's perspective, all those mounts reference the same CIFS device.

When one PVC is deleted, the kubelet tries to unmount its corresponding globalmount path.

However, GetDeviceMountRefs() detects that the same source is still mounted in other locations, and kubelet refuses to unmount, thinking it's still in use.

We've confirmed this with logs like:

GetDeviceMountRefs check failed ... device mount path is still mounted by other references [...]

This behavior becomes more common when Helm charts, templates, or automation generate many PVCs that point to the same SMB share.

How volumeHandle Normalization Solves This
By reintroducing a deterministic volumeHandle, computed as a hash of source + credentials, we can ensure:

The same share + creds → same volumeHandle

Only one global mount is created per node

Additional PVCs referencing the same share use bind mounts only (via NodePublishVolume)

NodeUnstageVolume only unmounts when no other pods are using the mount

This approach has several advantages:

  • Prevents GetDeviceMountRefs() collisions and premature unmount failures
  • Reduces duplicate mounts and simplifies cleanup
  • Enforces credential-based isolation (if creds differ, mount paths differ)
  • Retains existing upstream logic for all other paths

TL;DR
The current upstream behavior causes multiple global mounts to the same share

Linux sees them as the same device and blocks unmounts

Our changes resolve this by ensuring the share is only mounted once per node

The fix is safe, backward-compatible, and improves both reliability and security

Happy to reintroduce this logic in a scoped commit if you'd like to review it in isolation — thanks again for considering!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
5 participants