feat: adds GPU mutation #591

Merged
faiq merged 8 commits into main from faiq/adds-gpu on May 6, 2024
@faiq faiq commented Apr 29, 2024

What problem does this PR solve?:

Which issue(s) this PR fixes:
Addresses: https://jira.nutanix.com/browse/D2IQ-100465

How Has This Been Tested?:

This hasn't been tested yet. I am planning on doing this soon.

Special notes for your reviewer:

@faiq faiq marked this pull request as ready for review May 3, 2024 21:37
faiq commented May 3, 2024

Started a cluster with this manifest:

apiVersion: v1
kind: Secret
metadata:
  labels:
    cluster.x-k8s.io/provider: nutanix
  name: ${CLUSTER_NAME}-dockerhub-credentials
stringData:
  password: ${DOCKER_HUB_PASSWORD}
  username: ${DOCKER_HUB_USERNAME}
type: Opaque
---
apiVersion: v1
kind: Secret
metadata:
  labels:
    cluster.x-k8s.io/provider: nutanix
  name: ${CLUSTER_NAME}-pc-creds-for-csi
stringData:
  key: ${NUTANIX_ENDPOINT}:${NUTANIX_PORT}:${NUTANIX_USER}:${NUTANIX_PASSWORD}
---
apiVersion: v1
kind: Secret
metadata:
  labels:
    cluster.x-k8s.io/provider: nutanix
  name: ${CLUSTER_NAME}-pc-creds
stringData:
  credentials: |
    [
      {
        "type": "basic_auth",
        "data": {
          "prismCentral":{
            "username": "${NUTANIX_USER}",
            "password": "${NUTANIX_PASSWORD}"
          }
        }
      }
    ]
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: ${CLUSTER_NAME}
    cluster.x-k8s.io/provider: nutanix
  name: ${CLUSTER_NAME}
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - ${POD_CIDR:-192.168.0.0/16}
    serviceDomain: ${SERVICE_DOMAIN:="cluster.local"}
    services:
      cidrBlocks:
      - ${SERVICE_CIDR:-10.128.0.0/12}
  topology:
    class: nutanix-quick-start
    controlPlane:
      metadata: {}
      replicas: ${CONTROL_PLANE_MACHINE_COUNT}
    variables:
    - name: clusterConfig
      value:
        addons:
          ccm:
            credentials:
              name: ${CLUSTER_NAME}-pc-creds
          clusterAutoscaler:
            strategy: HelmAddon
          cni:
            provider: Cilium
            strategy: HelmAddon
          csi:
            defaultStorage:
              providerName: nutanix
              storageClassConfigName: nutanix-volume
            providers:
            - credentials:
                name: ${CLUSTER_NAME}-pc-creds-for-csi
              name: nutanix
              storageClassConfig:
              - name: nutanix-volume
                parameters:
                  storageContainer: ${NUTANIX_STORAGE_CONTAINER_NAME}
              strategy: HelmAddon
          nfd:
            strategy: HelmAddon
        controlPlane:
          nutanix:
            machineDetails:
              bootType: legacy
              cluster:
                name: ${NUTANIX_PRISM_ELEMENT_CLUSTER_NAME}
                type: name
              image:
                name: ${NUTANIX_MACHINE_TEMPLATE_IMAGE_NAME}
                type: name
              memorySize: 4Gi
              subnets:
              - name: ${NUTANIX_SUBNET_NAME}
                type: name
              systemDiskSize: 40Gi
              vcpuSockets: 2
              vcpusPerSocket: 1
        imageRegistries:
        - credentials:
            secretRef:
              name: ${CLUSTER_NAME}-dockerhub-credentials
          url: https://docker.io
        nutanix:
          controlPlaneEndpoint:
            host: ${CONTROL_PLANE_ENDPOINT_IP}
            port: 6443
            virtualIP:
              provider: KubeVIP
          prismCentralEndpoint:
            credentials:
              name: ${CLUSTER_NAME}-pc-creds
            insecure: ${NUTANIX_INSECURE}
            url: https://${NUTANIX_ENDPOINT}:9440
    version: ${KUBERNETES_VERSION}
    workers:
      machineDeployments:
      - class: nutanix-quick-start-worker
        metadata:
          annotations:
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "${WORKER_MACHINE_COUNT}"
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "${WORKER_MACHINE_COUNT}"
        name: md-0
        variables:
          overrides:
          - name: workerConfig
            value:
              nutanix:
                machineDetails:
                  bootType: legacy
                  cluster:
                    name: ${NUTANIX_PRISM_ELEMENT_CLUSTER_NAME}
                    type: name
                  image:
                    name: ${NUTANIX_MACHINE_TEMPLATE_IMAGE_NAME}
                    type: name
                  memorySize: 4Gi
                  subnets:
                  - name: ${NUTANIX_SUBNET_NAME}
                    type: name
                  systemDiskSize: 40Gi
                  vcpuSockets: 2
                  vcpusPerSocket: 1
      - class: nutanix-quick-start-worker
        metadata:
          annotations:
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "1"
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
        name: gpu-0
        variables:
          overrides:
          - name: workerConfig
            value:
              nutanix:
                machineDetails:
                  bootType: legacy
                  cluster:
                    name: ${NUTANIX_PRISM_ELEMENT_CLUSTER_NAME}
                    type: name
                  image:
                    name: ${NUTANIX_MACHINE_TEMPLATE_IMAGE_NAME}
                    type: name
                  memorySize: 4Gi
                  subnets:
                  - name: ${NUTANIX_SUBNET_NAME}
                    type: name
                  gpus:
                  - type: name
                    name: "Ampere 40"
                  systemDiskSize: 40Gi
                  vcpuSockets: 2
                  vcpusPerSocket: 1

Confirmed that the GPU was attached to the VM correctly:

Screenshot from 2024-05-03 15-15-08

@faiq faiq force-pushed the faiq/adds-gpu branch from e11c8e6 to 2046bee Compare May 3, 2024 21:51
@faiq faiq force-pushed the faiq/adds-gpu branch from 2046bee to 3c15094 Compare May 3, 2024 22:36
@faiq faiq removed the needs-docs label May 3, 2024
@faiq faiq enabled auto-merge (squash) May 3, 2024 22:44
Co-authored-by: Daniel Lipovetsky <daniel.lipovetsky@nutanix.com>
@faiq faiq force-pushed the faiq/adds-gpu branch from 762198a to d5a279c Compare May 6, 2024 16:40
@deepakm-ntnx
@faiq could you please elaborate on how you got the details of the following GPU?

gpus:
- type: name
  name: "Ampere 40" <==

faiq commented May 6, 2024

@deepakm-ntnx I looked it up in the GPU tab in the Prism Central UI.
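
For anyone adapting the manifest above, here is a minimal sketch of just the GPU portion of the gpu-0 `workerConfig` override. It assumes the `name` value has to match a GPU as it is listed in Prism Central for the target cluster; "Ampere 40" is simply the value from this test environment and will likely differ elsewhere. The remaining `machineDetails` fields from the manifest are left out for brevity.

```yaml
# Excerpt of the gpu-0 workerConfig override from the manifest above.
# Assumption: the name must match a GPU as Prism Central lists it;
# "Ampere 40" is the value used in this test environment only.
nutanix:
  machineDetails:
    gpus:
    - type: name          # identify the GPU by its display name
      name: "Ampere 40"   # replace with the name shown in the GPU tab
```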

@supershal supershal left a comment

Thanks for the thorough research and for testing the complexity that comes with supporting GPU devices.

@faiq faiq force-pushed the faiq/adds-gpu branch from 0ef3fb6 to 0827285 Compare May 6, 2024 19:07
@faiq faiq force-pushed the faiq/adds-gpu branch from 0827285 to ce6453f Compare May 6, 2024 21:02
@faiq faiq force-pushed the faiq/adds-gpu branch from ce6453f to 436704a Compare May 6, 2024 21:07
@faiq faiq merged commit 70a8726 into main May 6, 2024
17 checks passed
@faiq faiq deleted the faiq/adds-gpu branch May 6, 2024 21:29
@github-actions github-actions bot mentioned this pull request May 6, 2024
jimmidyson added a commit that referenced this pull request May 21, 2024
🤖 I have created a release *beep* *boop*
---


## 0.9.0 (2024-05-21)

<!-- Release notes generated using configuration in .github/release.yaml
at main -->

## What's Changed
### Exciting New Features 🎉
* feat: expose GenerateNoProxy func by @mhrabovcin in
#594
* feat: Add the ServiceLoadbalancer Addon, with MetalLB as first
provider by @dlipovetsky in
#592
* feat: adds GPU mutation by @faiq in
#591
* feat: Add GenericClusterConfig and add docs on usage with own CC by
@jimmidyson in
#606
* feat: Enable unprivileged ports sysctl in containerd config by
@jimmidyson in
#645
* feat: API for encryption at-rest by @supershal in
#610
* feat: Bump sigs.k8s.io/cluster-api to v1.7.2 by @jimmidyson in
#661
* feat: Pull calico images from quay.io instead of docker hub by
@jimmidyson in
#676
* feat: update cluster autoscaler to v1.30.0 by @dkoshkin in
#681
### Fixes 🔧
* fix: Fix error messages returned by HelmChartGetter by @dlipovetsky in
#598
* fix: use a consistent MachineDeployment class name by @dkoshkin in
#612
* fix: Do not return error if serviceLoadBalancer field is not set by
@dlipovetsky in
#611
* fix: use provided options for serverside apply by @supershal in
#627
* fix: Correct the CSI handler logic by @dlipovetsky in
#603
* fix: Fix the internal ClusterConfig type used for provider-agnostic
logic by @jimmidyson in
#607
* fix: log mutation failure errors by @supershal in
#649
* fix: Always apply containerd patches by @jimmidyson in
#644
* fix: cluster-autoscaler Helm values for workload clusters by @dkoshkin
in
#658
* fix: Make Cluster the owner of image registry credential secret by
@dlipovetsky in
#648
* fix: Upgrade dynamic-credential-provider to v0.5.3 by @jimmidyson in
#677
### Other Changes
* build: Add v0.8 release metadata by @jimmidyson in
#595
* refactor: Clean up API constants, and explain usage by @dlipovetsky in
#588
* docs: Add how to deploy CAREN by @jimmidyson in
#599
* docs: Upgrade hugo to latest by @jimmidyson in
#601
* docs: Update addons docs and tweak release doc by @jimmidyson in
#596
* build: Ensure provider metadata is up to date when releasing by
@jimmidyson in
#600
* docs: Add how to create clusters by @jimmidyson in
#602
* docs: Update docsy module by @jimmidyson in
#605
* refactor: Apply kubebuilder annotations for required/optional
everywhere by @jimmidyson in
#604
* docs: Cluster Autoscaler is deployed on the management cluster by
@dkoshkin in
#608
* docs: Fix missing placeholder in "create nutanix cluster" doc by
@dlipovetsky in
#609
* refactor: Remove unused api/variables package by @dlipovetsky in
#623
* refactor: move label helper functions to utils package by @supershal
in
#626
* build: Use go1.22.3 toolchain to mitigate vulnerabilities by
@jimmidyson in
#628
* build: Temporary lint config fix until next golangci-lint release by
@jimmidyson in
#629
* build: Update license for Nutanix by @jimmidyson in
#456
* test(e2e): Consistent core/bootstrap/control-plane provider versions
by @jimmidyson in
#639
* ci: free up disk space before running tests by @dkoshkin in
#643
* test: Add more context to panic in envtest helper by @dlipovetsky in
#641
* refactor: Use colon to separate context from wrapped error by
@dlipovetsky in
#642
* refactor: Remove unused test helper function by @dlipovetsky in
#647
* test: Add even more context to panic in envtest helper by @dlipovetsky
in
#650
* build: Make module-relative "go list -m" compatible with GOWORK by
@dlipovetsky in
#651
* test: Match cluster namespace to cluster name by @dlipovetsky in
#652
* refactor: Write configuration under /etc/caren by @dlipovetsky in
#656
* build: use a shorter namespace caren-system by @dkoshkin in
#662
* refactor: Use a Credentials struct consistently by @dlipovetsky in
#663
* test: add encryptionAtRest config in capi-quick-start by @supershal in
#659
* test(e2e): Fix up secret ownership checks by @jimmidyson in
#665
* test: Remove hard-coded text focus and label for e2e tests by
@dlipovetsky in
#667
* ci: Use new dependabot multimodule capabilities by @jimmidyson in
#664
* refactor: aggregate types to be used by clients by @dkoshkin in
#672
* test: Add E2E_DRYRUN and E2E_VERBOSE make vars by @dlipovetsky in
#666
* build: Ignore all gitlint rules for dependabot commits by @jimmidyson
in
#675
* build: Update all tools by @jimmidyson in
#678
* test(e2e): Use upstream CRS helpers by @jimmidyson in
#680
* build: Correct dry-run output by @jimmidyson in
#679
* build: Use k8s v1.29.4 as default Kubernetes version by @jimmidyson in
#646

## New Contributors
* @prajnutanix made their first contribution in
#638

**Full Changelog**:
v0.8.1...v0.9.0

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>