ROCm
diff --git a/‎clang/docs/HLSL/ExpectedDifferences.rst
+110 b/‎clang/docs/HLSL/ExpectedDifferences.rst
+110
diff --git a/‎clang/docs/HLSL/HLSLDocs.rst
+1 b/‎clang/docs/HLSL/HLSLDocs.rst
+1
diff --git a/‎clang/docs/ReleaseNotes.rst
+11 b/‎clang/docs/ReleaseNotes.rst
+11
diff --git a/‎clang/lib/CodeGen/CodeGenPGO.cpp
+7-4 b/‎clang/lib/CodeGen/CodeGenPGO.cpp
+7-4
diff --git a/‎clang/lib/Driver/ToolChains/Arch/ARM.cpp
+12-12 b/‎clang/lib/Driver/ToolChains/Arch/ARM.cpp
+12-12
diff --git a/‎clang/lib/Driver/ToolChains/CommonArgs.cpp
+34-3 b/‎clang/lib/Driver/ToolChains/CommonArgs.cpp
+34-3
diff --git a/‎clang/lib/Driver/ToolChains/Cuda.cpp
+21-22 b/‎clang/lib/Driver/ToolChains/Cuda.cpp
+21-22
diff --git a/‎clang/test/Driver/arm-alignment.c
+15 b/‎clang/test/Driver/arm-alignment.c
+15
diff --git a/‎clang/test/Driver/cuda-cross-compiling.c
+8-1 b/‎clang/test/Driver/cuda-cross-compiling.c
+8-1
diff --git a/‎clang/test/Driver/openmp-offload-gpu.c
+17-3 b/‎clang/test/Driver/openmp-offload-gpu.c
+17-3
@@ -0,0 +1,110 @@
+
+Expected Differences vs DXC and FXC
+===================================
+
+.. contents::
+   :local:
+
+Introduction
+============
+
+HLSL currently has two reference compilers, the `DirectX Shader Compiler (DXC)
+<https://github.com/microsoft/DirectXShaderCompiler/>`_ and the
+`Effect-Compiler (FXC) <https://learn.microsoft.com/en-us/windows/win32/direct3dtools/fxc>`_.
+The two reference compilers do not fully agree. Some known disagreements in the
+references are tracked on
+`DXC's GitHub
+<https://github.com/microsoft/DirectXShaderCompiler/issues?q=is%3Aopen+is%3Aissue+label%3Afxc-disagrees>`_,
+but many more are known to exist.
+
+HLSL as implemented by Clang will also not fully match either of the reference
+implementations, it is instead being written to match the `draft language
+specification <https://microsoft.github.io/hlsl-specs/specs/hlsl.pdf>`_.
+
+This document is a non-exhaustive collection the known differences between
+Clang's implementation of HLSL and the existing reference compilers.
+
+General Principles
+------------------
+
+Most of the intended differences between Clang and the earlier reference
+compilers are focused on increased consistency and correctness. Both reference
+compilers do not always apply language rules the same in all contexts.
+
+Clang also deviates from the reference compilers by providing different
+diagnostics, both in terms of the textual messages and the contexts in which
+diagnostics are produced. While striving for a high level of source
+compatibility with conforming HLSL code, Clang may produce earlier and more
+robust diagnostics for incorrect code or reject code that a reference compiler
+incorrectly accepted.
+
+Language Version
+================
+
+Clang targets language compatibility for HLSL 2021 as implemented by DXC.
+Language features that were removed in earlier versions of HLSL may be added on
+a case-by-case basis, but are not planned for the initial implementation.
+
+Overload Resolution
+===================
+
+Clang's HLSL implementation adopts C++ overload resolution rules as proposed for
+HLSL 202x based on proposal
+`0007 <https://github.com/microsoft/hlsl-specs/blob/main/proposals/0007-const-instance-methods.md>`_
+and
+`0008 <https://github.com/microsoft/hlsl-specs/blob/main/proposals/0008-non-member-operator-overloading.md>`_.
+
+Clang's implementation extends standard overload resolution rules to HLSL
+library functionality. This causes subtle changes in overload resolution
+behavior between Clang and DXC. Some examples include:
+
+.. code-block:: c++
+
+  void halfOrInt16(half H);
+  void halfOrInt16(uint16_t U);
+  void halfOrInt16(int16_t I);
+
+  void takesDoubles(double, double, double);
+
+  cbuffer CB {
+    uint U;
+    int I;
+    float X, Y, Z;
+    double3 A, B;
+  }
+
+  export void call() {
+    halfOrInt16(U); // DXC: Fails with call ambiguous between int16_t and uint16_t overloads
+                    // Clang: Resolves to halfOrInt16(uint16_t).
+    halfOrInt16(I); // All: Resolves to halfOrInt16(int16_t).
+    half H;
+  #ifndef IGNORE_ERRORS
+    // asfloat16 is a builtin with overloads for half, int16_t, and uint16_t.
+    H = asfloat16(I); // DXC: Fails to resolve overload for int.
+                      // Clang: Resolves to asfloat16(int16_t).
+    H = asfloat16(U); // DXC: Fails to resolve overload for int.
+                      // Clang: Resolves to asfloat16(uint16_t).
+  #endif
+    H = asfloat16(0x01); // DXC: Resolves to asfloat16(half).
+                         // Clang: Resolves to asfloat16(uint16_t).
+
+    takesDoubles(X, Y, Z); // Works on all compilers
+  #ifndef IGNORE_ERRORS
+    fma(X, Y, Z); // DXC: Fails to resolve no known conversion from float to double.
+                  // Clang: Resolves to fma(double,double,double).
+  #endif
+    
+    double D = dot(A, B); // DXC: Resolves to dot(double3, double3), fails DXIL Validation.
+                          // FXC: Expands to compute double dot product with fmul/fadd
+                          // Clang: Resolves to dot(float3, float3), emits conversion warnings.
+
+  }
+
+.. note::
+
+  In Clang, a conscious decision was made to exclude the ``dot(vector<double,N>, vector<double,N>)`` 
+  overload and allow overload resolution to resolve the
+  ``vector<float,N>`` overload. This approach provides ``-Wconversion``
+  diagnostic notifying the user of the conversion rather than silently altering
+  precision relative to the other overloads (as FXC does) or generating code
+  that will fail validation (as DXC does).
@@ -11,6 +11,7 @@ HLSL Design and Implementation
 .. toctree::
    :maxdepth: 1
 
+   ExpectedDifferences
    HLSLIRReference
    ResourceTypes
    EntryFunctions
 
@@ -307,6 +307,17 @@ X86 Support
 Arm and AArch64 Support
 ^^^^^^^^^^^^^^^^^^^^^^^
 
+- ARMv7+ targets now default to allowing unaligned access, except Armv6-M, and
+  Armv8-M without the Main Extension. Baremetal targets should check that the
+  new default will work with their system configurations, since it requires
+  that SCTLR.A is 0, SCTLR.U is 1, and that the memory in question is
+  configured as "normal" memory. This brings Clang in-line with the default
+  settings for GCC and Arm Compiler. Aside from making Clang align with other
+  compilers, changing the default brings major performance and code size
+  improvements for most targets. We have not changed the default behavior for
+  ARMv6, but may revisit that decision in the future. Users can restore the old
+  behavior with -m[no-]unaligned-access.
+
 Android Support
 ^^^^^^^^^^^^^^^
 
 
@@ -239,9 +239,12 @@ struct MapRegionCounters : public RecursiveASTVisitor<MapRegionCounters> {
     if (MCDCMaxCond == 0)
       return true;
 
-    /// At the top of the logical operator nest, reset the number of conditions.
-    if (LogOpStack.empty())
+    /// At the top of the logical operator nest, reset the number of conditions,
+    /// also forget previously seen split nesting cases.
+    if (LogOpStack.empty()) {
       NumCond = 0;
+      SplitNestedLogicalOp = false;
+    }
 
     if (const Expr *E = dyn_cast<Expr>(S)) {
       const BinaryOperator *BinOp = dyn_cast<BinaryOperator>(E->IgnoreParens());
@@ -292,7 +295,7 @@ struct MapRegionCounters : public RecursiveASTVisitor<MapRegionCounters> {
                 "contains an operation with a nested boolean expression. "
                 "Expression will not be covered");
             Diag.Report(S->getBeginLoc(), DiagID);
-            return false;
+            return true;
           }
 
           /// Was the maximum number of conditions encountered?
@@ -303,7 +306,7 @@ struct MapRegionCounters : public RecursiveASTVisitor<MapRegionCounters> {
                 "number of conditions (%0) exceeds max (%1). "
                 "Expression will not be covered");
             Diag.Report(S->getBeginLoc(), DiagID) << NumCond << MCDCMaxCond;
-            return false;
+            return true;
           }
 
           // Otherwise, allocate the number of bytes required for the bitmap
 
@@ -890,25 +890,25 @@ llvm::ARM::FPUKind arm::getARMTargetFeatures(const Driver &D,
     // SCTLR.U bit, which is architecture-specific. We assume ARMv6
     // Darwin and NetBSD targets support unaligned accesses, and others don't.
     //
-    // ARMv7 always has SCTLR.U set to 1, but it has a new SCTLR.A bit
-    // which raises an alignment fault on unaligned accesses. Linux
-    // defaults this bit to 0 and handles it as a system-wide (not
-    // per-process) setting. It is therefore safe to assume that ARMv7+
-    // Linux targets support unaligned accesses. The same goes for NaCl
-    // and Windows.
+    // ARMv7 always has SCTLR.U set to 1, but it has a new SCTLR.A bit which
+    // raises an alignment fault on unaligned accesses. Assume ARMv7+ supports
+    // unaligned accesses, except ARMv6-M, and ARMv8-M without the Main
+    // Extension. This aligns with the default behavior of ARM's downstream
+    // versions of GCC and Clang.
     //
-    // The above behavior is consistent with GCC.
+    // Users can change the default behavior via -m[no-]unaliged-access.
     int VersionNum = getARMSubArchVersionNumber(Triple);
     if (Triple.isOSDarwin() || Triple.isOSNetBSD()) {
       if (VersionNum < 6 ||
           Triple.getSubArch() == llvm::Triple::SubArchType::ARMSubArch_v6m)
         Features.push_back("+strict-align");
-    } else if (Triple.isOSLinux() || Triple.isOSNaCl() ||
-               Triple.isOSWindows()) {
-      if (VersionNum < 7)
-        Features.push_back("+strict-align");
-    } else
+    } else if (VersionNum < 7 ||
+               Triple.getSubArch() ==
+                   llvm::Triple::SubArchType::ARMSubArch_v6m ||
+               Triple.getSubArch() ==
+                   llvm::Triple::SubArchType::ARMSubArch_v8m_baseline) {
       Features.push_back("+strict-align");
+    }
   }
 
   // llvm does not support reserving registers in general. There is support
 
@@ -1157,10 +1157,41 @@ static void addOpenMPDeviceLibC(const ToolChain &TC, const ArgList &Args,
                           "llvm-libc-decls");
   bool HasLibC = llvm::sys::fs::exists(LibCDecls) &&
                  llvm::sys::fs::is_directory(LibCDecls);
-  if (Args.hasFlag(options::OPT_gpulibc, options::OPT_nogpulibc, HasLibC)) {
-    CmdArgs.push_back("-lcgpu");
-    CmdArgs.push_back("-lmgpu");
+  if (!Args.hasFlag(options::OPT_gpulibc, options::OPT_nogpulibc, HasLibC))
+    return;
+
+  // We don't have access to the offloading toolchains here, so determine from
+  // the arguments if we have any active NVPTX or AMDGPU toolchains.
+  llvm::DenseSet<const char *> Libraries;
+  if (const Arg *Targets = Args.getLastArg(options::OPT_fopenmp_targets_EQ)) {
+    if (llvm::any_of(Targets->getValues(),
+                     [](auto S) { return llvm::Triple(S).isAMDGPU(); })) {
+      Libraries.insert("-lcgpu-amdgpu");
+      Libraries.insert("-lmgpu-amdgpu");
+    }
+    if (llvm::any_of(Targets->getValues(),
+                     [](auto S) { return llvm::Triple(S).isNVPTX(); })) {
+      Libraries.insert("-lcgpu-nvptx");
+      Libraries.insert("-lmgpu-nvptx");
+    }
   }
+
+  for (StringRef Arch : Args.getAllArgValues(options::OPT_offload_arch_EQ)) {
+    if (llvm::any_of(llvm::split(Arch, ","), [](StringRef Str) {
+          return IsAMDGpuArch(StringToCudaArch(Str));
+        })) {
+      Libraries.insert("-lcgpu-amdgpu");
+      Libraries.insert("-lmgpu-amdgpu");
+    }
+    if (llvm::any_of(llvm::split(Arch, ","), [](StringRef Str) {
+          return IsNVIDIAGpuArch(StringToCudaArch(Str));
+        })) {
+      Libraries.insert("-lcgpu-nvptx");
+      Libraries.insert("-lmgpu-nvptx");
+    }
+  }
+
+  llvm::append_range(CmdArgs, Libraries);
 }
 
 void tools::addOpenMPRuntimeLibraryPath(const ToolChain &TC,
 
@@ -630,11 +630,6 @@ void NVPTX::Linker::ConstructJob(Compilation &C, const JobAction &JA,
       continue;
     }
 
-    // Currently, we only pass the input files to the linker, we do not pass
-    // any libraries that may be valid only for the host.
-    if (!II.isFilename())
-      continue;
-
     AddStaticDeviceLibsLinking(C, *this, JA, Inputs, Args, CmdArgs, "nvptx",
                              GPUArch, /*isBitCodeSDL=*/false,
                              /*postClangLink=*/false);
@@ -643,26 +638,30 @@ void NVPTX::Linker::ConstructJob(Compilation &C, const JobAction &JA,
     // file and device linking when given a '.cubin' file. We always want to
     // perform device linking, so just rename any '.o' files.
     // FIXME: This should hopefully be removed if NVIDIA updates their tooling.
-    auto InputFile = getToolChain().getInputFilename(II);
-    if (llvm::sys::path::extension(InputFile) != ".cubin") {
-      // If there are no actions above this one then this is direct input and we
-      // can copy it. Otherwise the input is internal so a `.cubin` file should
-      // exist.
-      if (II.getAction() && II.getAction()->getInputs().size() == 0) {
-        const char *CubinF =
-            Args.MakeArgString(getToolChain().getDriver().GetTemporaryPath(
-                llvm::sys::path::stem(InputFile), "cubin"));
-        if (llvm::sys::fs::copy_file(InputFile, C.addTempFile(CubinF)))
-          continue;
+    if (II.isFilename()) {
+      auto InputFile = getToolChain().getInputFilename(II);
+      if (llvm::sys::path::extension(InputFile) != ".cubin") {
+        // If there are no actions above this one then this is direct input and
+        // we can copy it. Otherwise the input is internal so a `.cubin` file
+        // should exist.
+        if (II.getAction() && II.getAction()->getInputs().size() == 0) {
+          const char *CubinF =
+              Args.MakeArgString(getToolChain().getDriver().GetTemporaryPath(
+                  llvm::sys::path::stem(InputFile), "cubin"));
+          if (llvm::sys::fs::copy_file(InputFile, C.addTempFile(CubinF)))
+            continue;
 
-        CmdArgs.push_back(CubinF);
+          CmdArgs.push_back(CubinF);
+        } else {
+          SmallString<256> Filename(InputFile);
+          llvm::sys::path::replace_extension(Filename, "cubin");
+          CmdArgs.push_back(Args.MakeArgString(Filename));
+        }
       } else {
-        SmallString<256> Filename(InputFile);
-        llvm::sys::path::replace_extension(Filename, "cubin");
-        CmdArgs.push_back(Args.MakeArgString(Filename));
+        CmdArgs.push_back(Args.MakeArgString(InputFile));
       }
-    } else {
-      CmdArgs.push_back(Args.MakeArgString(InputFile));
+    } else if (!II.isNothing()) {
+      II.getInputArg().renderAsInput(Args, CmdArgs);
     }
   }
 
 
@@ -22,6 +22,21 @@
 // RUN: %clang -target armv7-windows -### %s 2> %t
 // RUN: FileCheck --check-prefix=CHECK-UNALIGNED-ARM < %t %s
 
+// RUN: %clang --target=armv6 -### %s 2> %t
+// RUN: FileCheck --check-prefix=CHECK-ALIGNED-ARM < %t %s
+
+// RUN: %clang --target=armv7 -### %s 2> %t
+// RUN: FileCheck --check-prefix=CHECK-UNALIGNED-ARM < %t %s
+
+// RUN: %clang -target thumbv6m-none-gnueabi -mcpu=cortex-m0 -### %s 2> %t
+// RUN: FileCheck --check-prefix CHECK-ALIGNED-ARM <%t %s
+
+// RUN: %clang -target thumb-none-gnueabi -mcpu=cortex-m0 -### %s 2> %t
+// RUN: FileCheck --check-prefix CHECK-ALIGNED-ARM <%t %s
+
+// RUN: %clang -target thumbv8m.base-none-gnueabi -### %s 2> %t
+// RUN: FileCheck --check-prefix CHECK-ALIGNED-ARM <%t %s
+
 // RUN: %clang --target=aarch64 -munaligned-access -### %s 2> %t
 // RUN: FileCheck --check-prefix=CHECK-UNALIGNED-AARCH64 < %t %s
 
 
@@ -69,6 +69,13 @@
 // LOWERING: -cc1" "-triple" "nvptx64-nvidia-cuda" {{.*}} "-mllvm" "--nvptx-lower-global-ctor-dtor"
 
 //
+// Test passing arguments directly to nvlink.
+//
+// RUN: %clang -target nvptx64-nvidia-cuda -Wl,-v -Wl,a,b -### %s 2>&1 \
+// RUN:   | FileCheck -check-prefix=LINKER-ARGS %s
+
+// LINKER-ARGS: nvlink{{.*}}"-v"{{.*}}"a" "b"
+
 // Tests for handling a missing architecture.
 //
 // RUN: not %clang -target nvptx64-nvidia-cuda %s -### 2>&1 \
@@ -80,4 +87,4 @@
 // RUN: %clang -target nvptx64-nvidia-cuda -flto -c %s -### 2>&1 \
 // RUN:   | FileCheck -check-prefix=GENERIC %s
 
-// GENERIC-NOT: -cc1" "-triple" "nvptx64-nvidia-cuda" {{.*}} "-target-cpu"
+// GENERIC-NOT: -cc1" "-triple" "nvptx64-nvidia-cuda" {{.*}} "-target-cpu"
@@ -393,14 +393,28 @@
 //
 // RUN:   %clang -### --target=x86_64-unknown-linux-gnu -fopenmp=libomp \
 // RUN:      --libomptarget-nvptx-bc-path=%S/Inputs/libomptarget/libomptarget-nvptx-test.bc \
+// RUN:      --libomptarget-amdgpu-bc-path=%S/Inputs/hip_dev_lib/libomptarget-amdgpu-gfx803.bc \
 // RUN:      --cuda-path=%S/Inputs/CUDA_102/usr/local/cuda \
-// RUN:      --offload-arch=sm_52 -gpulibc -nogpuinc %s 2>&1 \
+// RUN:      --rocm-path=%S/Inputs/rocm \
+// RUN:      --offload-arch=sm_52,gfx803 -gpulibc -nogpuinc %s 2>&1 \
 // RUN:   | FileCheck --check-prefix=LIBC-GPU %s
-// LIBC-GPU: "-lcgpu"{{.*}}"-lmgpu"
+// RUN:   %clang -### --target=x86_64-unknown-linux-gnu -fopenmp=libomp \
+// RUN:      --libomptarget-nvptx-bc-path=%S/Inputs/libomptarget/libomptarget-nvptx-test.bc \
+// RUN:      --libomptarget-amdgpu-bc-path=%S/Inputs/hip_dev_lib/libomptarget-amdgpu-gfx803.bc \
+// RUN:      --cuda-path=%S/Inputs/CUDA_102/usr/local/cuda \
+// RUN:      --rocm-path=%S/Inputs/rocm \
+// RUN:      -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_52 \
+// RUN:      -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx803 \
+// RUN:      -fopenmp-targets=nvptx64-nvidia-cuda,amdgcn-amd-amdhsa -gpulibc -nogpuinc %s 2>&1 \
+// RUN:   | FileCheck --check-prefix=LIBC-GPU %s
+// LIBC-GPU-DAG: "-lcgpu-amdgpu"
+// LIBC-GPU-DAG: "-lmgpu-amdgpu"
+// LIBC-GPU-DAG: "-lcgpu-nvptx"
+// LIBC-GPU-DAG: "-lmgpu-nvptx"
 
 // RUN:   %clang -### --target=x86_64-unknown-linux-gnu -fopenmp=libomp \
 // RUN:      --libomptarget-nvptx-bc-path=%S/Inputs/libomptarget/libomptarget-nvptx-test.bc \
 // RUN:      --cuda-path=%S/Inputs/CUDA_102/usr/local/cuda \
 // RUN:      --offload-arch=sm_52 -nogpulibc -nogpuinc %s 2>&1 \
 // RUN:   | FileCheck --check-prefix=NO-LIBC-GPU %s
-// NO-LIBC-GPU-NOT: "-lcgpu"{{.*}}"-lmgpu"
+// NO-LIBC-GPU-NOT: -lmgpu{{.*}}-lcgpu