1 =============================
2 User Guide for NVPTX Back-end
3 =============================
13 To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
14 along with a defined set of conventions used to represent GPU programming
15 concepts. This document provides an overview of the general usage of the back-
end, including a description of the conventions used and the set of accepted
LLVM IR.
21 This document assumes a basic familiarity with CUDA and the PTX
22 assembly language. Information about the CUDA Driver API and the PTX assembly
23 language can be found in the `CUDA documentation
24 <http://docs.nvidia.com/cuda/index.html>`_.
Conventions
===========

Marking Functions as Kernels
----------------------------
34 In PTX, there are two types of functions: *device functions*, which are only
35 callable by device code, and *kernel functions*, which are callable by host
36 code. By default, the back-end will emit device functions. Metadata is used to
37 declare a function as a kernel function. This metadata is attached to the
38 ``nvvm.annotations`` named metadata object, and has the following format:
42 !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}
44 The first parameter is a reference to the kernel function. The following
45 example shows a kernel function calling a device function in LLVM IR. The
46 function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.
50 define float @my_fmad(float %x, float %y, float %z) {
51 %mul = fmul float %x, %y
  %add = fadd float %mul, %z
  ret float %add
}
56 define void @my_kernel(float* %ptr) {
57 %val = load float* %ptr
58 %ret = call float @my_fmad(float %val, float %val, float %val)
  store float %ret, float* %ptr
  ret void
}
63 !nvvm.annotations = !{!1}
64 !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}
66 When compiled, the PTX kernel functions are callable by host-side code.
.. _address_spaces:

Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:
============= ======================
Address Space Memory Space
============= ======================
0             Generic
1             Global
2             Internal Use
3             Shared
4             Constant
5             Local
============= ======================
87 Every global variable and pointer type is assigned to one of these address
88 spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.
As an example, the following IR will define an array ``@g`` that resides in
global device memory.
97 @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]
99 LLVM IR functions can read and write to this array, and host-side code can
100 copy data to it by name with the CUDA Driver API.
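For instance, the host can look up ``@g`` by name in a loaded module and copy
data into it. The following is a hypothetical sketch using the CUDA Driver API
(error checking omitted); ``cudaModule`` is assumed to be a ``CUmodule`` that
has already been loaded from the generated PTX:

.. code-block:: c++

  // Look up the device address and size of @g by name, then copy
  // four integers from the host into it.
  CUdeviceptr devG;
  size_t devGSize;
  int hostData[4] = {10, 11, 12, 13};
  cuModuleGetGlobal(&devG, &devGSize, cudaModule, "g");
  cuMemcpyHtoD(devG, hostData, sizeof(hostData));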
102 Note that since address space 0 is the generic space, it is illegal to have
103 global variables in address space 0. Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.
Triples
-------

The NVPTX target uses the module triple to select between 32/64-bit code
112 generation and the driver-compiler interface to use. The triple architecture
113 can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The
114 operating system should be one of ``cuda`` or ``nvcl``, which determines the
115 interface used by the generated code to communicate with the driver. Most
116 users will want to use ``cuda`` as the operating system, which makes the
117 generated PTX compatible with the CUDA Driver API.
119 Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``
121 Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``
125 .. _nvptx_intrinsics:
NVPTX Intrinsics
================

Address Space Conversion
------------------------
133 '``llvm.nvvm.ptr.*.to.gen``' Intrinsics
134 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
139 These are overloaded intrinsics. You can use these on any pointer types.
143 declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
144 declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
145 declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
146 declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)
151 The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
152 address space to a generic address space pointer.
These intrinsics modify the pointer value to be a valid generic address space
pointer.
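For example, a pointer to a variable in shared memory can be converted to a
generic pointer and then accessed with ordinary (generic) loads and stores.
This is a minimal sketch using the LLVM 3.4-era IR syntax used elsewhere in
this document; ``@scratch`` and ``@read_scratch`` are hypothetical names:

.. code-block:: llvm

  @scratch = internal addrspace(3) global [64 x i8] zeroinitializer

  define i8 @read_scratch(i32 %idx) {
    ; Address of one element of the shared-memory array (addrspace(3))
    %p = getelementptr [64 x i8] addrspace(3)* @scratch, i32 0, i32 %idx
    ; Convert to a generic pointer so a generic load can be used
    %gen = call i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)* %p)
    %val = load i8* %gen
    ret i8 %val
  }

  declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)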
161 '``llvm.nvvm.ptr.gen.to.*``' Intrinsics
162 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
167 These are overloaded intrinsics. You can use these on any pointer types.
171 declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
172 declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
173 declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
174 declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)
179 The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
180 address space to a pointer in the target address space. Note that these
intrinsics are only useful if the target address space of the pointer is
known. It is not legal to use address space conversion
183 intrinsics to convert a pointer from one non-generic address space to another
184 non-generic address space.
189 These intrinsics modify the pointer value to be a valid pointer in the target
190 non-generic address space.
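As an illustration, a device function whose callers guarantee (an assumption
in this hypothetical sketch) that its argument points into global memory can
recover an ``addrspace(1)`` pointer from the generic one:

.. code-block:: llvm

  define i8 @load_from_global(i8* %p) {
    ; %p is assumed to point into global memory, so the conversion is valid
    %g = call i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8* %p)
    %val = load i8 addrspace(1)* %g
    ret i8 %val
  }

  declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)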
193 Reading PTX Special Registers
194 -----------------------------
196 '``llvm.nvvm.read.ptx.sreg.*``'
197 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
204 declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
205 declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
206 declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
207 declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
208 declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
209 declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
210 declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
211 declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
212 declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
213 declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
214 declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
215 declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
216 declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
221 The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
222 special registers, in particular the kernel launch bounds. These registers
223 map in the following way to CUDA builtins:
============= =====================================
CUDA Builtin  PTX Special Register Intrinsic
============= =====================================
``threadIdx`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
``blockIdx``  ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
``blockDim``  ``@llvm.nvvm.read.ptx.sreg.ntid.*``
``gridDim``   ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
============= =====================================
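For example, the common CUDA expression
``blockIdx.x * blockDim.x + threadIdx.x`` can be written in terms of these
intrinsics. The function name below is hypothetical:

.. code-block:: llvm

  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

  ; Compute blockIdx.x * blockDim.x + threadIdx.x
  define i32 @global_thread_index() {
    %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    %tmp   = mul i32 %ctaid, %ntid
    %gid   = add i32 %tmp, %tid
    ret i32 %gid
  }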
Barriers
--------

'``llvm.nvvm.barrier0``'
239 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
246 declare void @llvm.nvvm.barrier0()
251 The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
252 instruction, equivalent to the ``__syncthreads()`` call in CUDA.
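A typical use is to separate writes to shared memory from subsequent reads by
other threads in the same CTA. In the hypothetical sketch below, ``%buf`` is
assumed to point to a shared-memory buffer, and ``%tid``/``%other`` are thread
indices supplied by the caller:

.. code-block:: llvm

  declare void @llvm.nvvm.barrier0()

  define float @exchange(float addrspace(3)* %buf, i32 %tid, i32 %other) {
    ; Each thread publishes a value into its own slot ...
    %dst = getelementptr float addrspace(3)* %buf, i32 %tid
    store float 1.0, float addrspace(3)* %dst, align 4
    ; ... and must synchronize before reading another thread's slot
    call void @llvm.nvvm.barrier0()
    %src = getelementptr float addrspace(3)* %buf, i32 %other
    %val = load float addrspace(3)* %src, align 4
    ret float %val
  }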
258 For the full set of NVPTX intrinsics, please see the
259 ``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.
.. _libdevice:

Linking with Libdevice
265 ======================
267 The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
268 implements many common mathematical functions. This library can be used as a
high-performance math library for any compiler using the LLVM NVPTX target.
270 The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit and
271 there is a separate version for each compute architecture.
273 For a list of all math functions implemented in libdevice, see
274 `libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.
276 To accommodate various math-related compiler flags that can affect code
277 generation of libdevice code, the library code depends on a special LLVM IR
278 pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
279 pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
280 with constants based on the defined reflection parameters. Such conditional
281 code often follows a pattern:
285 float my_function(float a) {
286 if (__nvvm_reflect("FASTMATH"))
287 return my_function_fast(a);
  return my_function_precise(a);
}
292 The default value for all unspecified reflection parameters is zero.
294 The ``NVVMReflect`` pass should be executed early in the optimization
295 pipeline, immediately after the link stage. The ``internalize`` pass is also
296 recommended to remove unused math functions from the resulting PTX. For an
297 input IR module ``module.bc``, the following compilation flow is recommended:
299 1. Save list of external functions in ``module.bc``
300 2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
301 3. Internalize all functions not in list from (1)
302 4. Eliminate all unused internal functions
303 5. Run ``NVVMReflect`` pass
304 6. Run standard optimization pipeline
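A rough command-line sketch of steps 2-6 (file names, the ``compute_20.10``
libdevice version, and the function list are placeholders; a concrete instance
appears in the libdevice tutorial below):

.. code-block:: text

  # llvm-link module.bc libdevice.compute_20.10.bc -o linked.bc
  # opt -internalize -internalize-public-api-list=<externals from step 1> -nvvm-reflect -O3 linked.bc -o ready.bc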
``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
libdevice functions, because it is possible to link two IR modules that have
been linked against libdevice using different reflection variables; with such
linkage, the linker could then merge function bodies that were specialized
differently.
312 Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
313 often leave behind dead code of the form:
319 br i1 true, label %foo, label %bar
326 Therefore, it is recommended that ``NVVMReflect`` is executed early in the
327 optimization pipeline before dead-code elimination.
330 Reflection Parameters
331 ---------------------
333 The libdevice library currently uses the following reflection parameters to
334 control code generation:
==================== ======================================================
Flag                 Description
==================== ======================================================
339 ``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
340 ==================== ======================================================
346 To ensure that all dead code caused by the reflection pass is eliminated, it
347 is recommended that the reflection pass is executed early in the LLVM IR
348 optimization pipeline. The pass takes an optional mapping of reflection
349 parameter name to an integer value. This mapping can be specified as either a
350 command-line option to ``opt`` or as an LLVM ``StringMap<int>`` object when
351 programmatically creating a pass pipeline.
357 # opt -nvvm-reflect -nvvm-reflect-list=<var>=<value>,<var>=<value> module.bc -o module.reflect.bc
360 With programmatic pass pipeline:
364 extern ModulePass *llvm::createNVVMReflectPass(const StringMap<int>& Mapping);
366 StringMap<int> ReflectParams;
367 ReflectParams["__CUDA_FTZ"] = 1;
368 Passes.add(createNVVMReflectPass(ReflectParams));
Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
376 Driver API. This API is a low-level interface to the GPU driver and allows for
377 JIT compilation of PTX code to native GPU machine code.
379 Initializing the Driver API:
CUdevice device;
CUcontext context;

// Initialize the driver API
cuInit(0);
388 // Get a handle to the first compute device
389 cuDeviceGet(&device, 0);
390 // Create a compute device context
391 cuCtxCreate(&context, 0, device);
393 JIT compiling a PTX string to a device binary:
CUmodule module;
CUfunction function;

// JIT compile a null-terminated PTX string
401 cuModuleLoadData(&module, (void*)PTXString);
403 // Get a handle to the "myfunction" kernel function
404 cuModuleGetFunction(&function, module, "myfunction");
406 For full examples of executing PTX assembly, please see the `CUDA Samples
407 <https://developer.nvidia.com/cuda-downloads>`_ distribution.
Common Issues
=============

ptxas complains of undefined function: __nvvm_reflect
414 -----------------------------------------------------
416 When linking with libdevice, the ``NVVMReflect`` pass must be used. See
417 :ref:`libdevice` for more information.
420 Tutorial: A Simple Compute Kernel
421 =================================
423 To start, let us take a look at a simple compute kernel written directly in
424 LLVM IR. The kernel implements vector addition, where each thread computes one
425 element of the output vector C from the input vectors A and B. To make this
426 easier, we also assume that only a single CTA (thread block) will be launched,
427 and that it will be one dimensional.
435 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
436 target triple = "nvptx64-nvidia-cuda"
438 ; Intrinsic to read X component of thread ID
439 declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
441 define void @kernel(float addrspace(1)* %A,
442 float addrspace(1)* %B,
443 float addrspace(1)* %C) {
446 %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
448 ; Compute pointers into A, B, and C
449 %ptrA = getelementptr float addrspace(1)* %A, i32 %id
450 %ptrB = getelementptr float addrspace(1)* %B, i32 %id
451 %ptrC = getelementptr float addrspace(1)* %C, i32 %id
454 %valA = load float addrspace(1)* %ptrA, align 4
455 %valB = load float addrspace(1)* %ptrB, align 4
458 %valC = fadd float %valA, %valB
  store float %valC, float addrspace(1)* %ptrC, align 4

  ret void
}
466 !nvvm.annotations = !{!0}
!0 = metadata !{void (float addrspace(1)*,
                      float addrspace(1)*,
                      float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}
472 We can use the LLVM ``llc`` tool to directly run the NVPTX code generator:
476 # llc -mcpu=sm_20 kernel.ll -o kernel.ptx
481 If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
in the module data layout string and use ``nvptx-nvidia-cuda`` as the target
triple.
486 The output we get from ``llc`` (as of LLVM 3.4):
491 // Generated by LLVM NVPTX Back-End
500 .visible .entry kernel(
501 .param .u64 kernel_param_0,
502 .param .u64 kernel_param_1,
503 .param .u64 kernel_param_2
511 ld.param.u64 %rl1, [kernel_param_0];
513 mul.wide.s32 %rl2, %r1, 4;
514 add.s64 %rl3, %rl1, %rl2;
515 ld.param.u64 %rl4, [kernel_param_1];
516 add.s64 %rl5, %rl4, %rl2;
517 ld.param.u64 %rl6, [kernel_param_2];
518 add.s64 %rl7, %rl6, %rl2;
519 ld.global.f32 %f1, [%rl3];
520 ld.global.f32 %f2, [%rl5];
521 add.f32 %f3, %f1, %f2;
	st.global.f32 	[%rl7], %f3;
	ret;
}
527 Dissecting the Kernel
528 ---------------------
530 Now let us dissect the LLVM IR that makes up this kernel.
535 The data layout string determines the size in bits of common data types, their
ABI alignment, and their storage size. For NVPTX, you should use one of the
following (32-bit and 64-bit, respectively):
543 target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
549 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
555 In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
556 read the X component of the current thread's ID, which corresponds to a read
557 of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
558 intrinsics. A short list is shown below; please see
559 ``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.
562 ================================================ ====================
563 Intrinsic CUDA Equivalent
564 ================================================ ====================
565 ``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}`` threadIdx.{x,y,z}
566 ``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}`` blockIdx.{x,y,z}
567 ``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}`` blockDim.{x,y,z}
568 ``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}`` gridDim.{x,y,z}
569 ``void @llvm.cuda.syncthreads()`` __syncthreads()
570 ================================================ ====================
576 You may have noticed that all of the pointer types in the LLVM IR example had
577 an explicit address space specifier. What is address space 1? NVIDIA GPU
578 devices (generally) have four types of memory:
580 - Global: Large, off-chip memory
581 - Shared: Small, on-chip memory shared among all threads in a CTA
582 - Local: Per-thread, private memory
583 - Constant: Read-only memory shared across all threads
585 These different types of memory are represented in LLVM IR as address spaces.
586 There is also a fifth address space used by the NVPTX code generator that
587 corresponds to the "generic" address space. This address space can represent
588 addresses in any other address space (with a few exceptions). This allows
589 users to write IR functions that can load/store memory using the same
590 instructions. Intrinsics are provided to convert pointers between the generic
591 and non-generic address spaces.
593 See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.
599 In PTX, a function can be either a `kernel` function (callable from the host
600 program), or a `device` function (callable only from GPU code). You can think
601 of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
602 function as a `kernel` function, we make use of special LLVM metadata. The
603 NVPTX back-end will look for a named metadata node called
604 ``nvvm.annotations``. This named metadata must contain a list of metadata that
605 describe the IR. For our purposes, we need to declare a metadata node that
606 assigns the "kernel" attribute to the LLVM IR function that should be emitted
607 as a PTX `kernel` function. These metadata nodes take the form:
611 metadata !{<function ref>, metadata !"kernel", i32 1}
613 For the previous example, we have:
617 !nvvm.annotations = !{!0}
!0 = metadata !{void (float addrspace(1)*,
                      float addrspace(1)*,
                      float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}
622 Here, we have a single metadata declaration in ``nvvm.annotations``. This
623 metadata annotates our ``@kernel`` function with the ``kernel`` attribute.
629 Generating PTX from LLVM IR is all well and good, but how do we execute it on
630 a real GPU device? The CUDA Driver API provides a convenient mechanism for
631 loading and JIT compiling PTX to a native GPU device, and launching a kernel.
632 The API is similar to OpenCL. A simple example showing how to load and
633 execute our vector addition code is shown below. Note that for brevity this
634 code does not perform much error checking!
You can also use the ``ptxas`` tool provided by the CUDA Toolkit to compile
PTX offline into machine code (SASS) for a specific GPU architecture. Such
640 binaries can be loaded by the CUDA Driver API in the same way as PTX. This
641 can be useful for reducing startup time by precompiling the PTX kernels.
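For example, the PTX generated earlier could be compiled ahead of time for an
``sm_20`` device (the architecture here is chosen just for illustration):

.. code-block:: text

  # ptxas -arch=sm_20 kernel.ptx -o kernel.cubin

The resulting ``kernel.cubin`` can then be loaded through the same
module-loading Driver API calls that are used below for PTX.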
#include <iostream>
#include <fstream>
#include <cassert>
#include "cuda.h"

void checkCudaErrors(CUresult err) {
  assert(err == CUDA_SUCCESS);
}
656 /// main - Program entry point
657 int main(int argc, char **argv) {
  CUdevice    device;
  CUmodule    cudaModule;
  CUcontext   context;
  CUfunction  function;
  int         devCount;

  // CUDA initialization
666 checkCudaErrors(cuInit(0));
667 checkCudaErrors(cuDeviceGetCount(&devCount));
668 checkCudaErrors(cuDeviceGet(&device, 0));
  char name[128];
  checkCudaErrors(cuDeviceGetName(name, 128, device));
672 std::cout << "Using CUDA Device [0]: " << name << "\n";
674 int devMajor, devMinor;
675 checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
676 std::cout << "Device Compute Capability: "
677 << devMajor << "." << devMinor << "\n";
  if (devMajor < 2) {
    std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
    return 1;
  }
683 std::ifstream t("kernel.ptx");
  if (!t.is_open()) {
    std::cerr << "kernel.ptx not found\n";
    return 1;
  }
688 std::string str((std::istreambuf_iterator<char>(t)),
689 std::istreambuf_iterator<char>());
691 // Create driver context
692 checkCudaErrors(cuCtxCreate(&context, 0, device));
694 // Create module for object
695 checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));
697 // Get kernel function
698 checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));
701 CUdeviceptr devBufferA;
702 CUdeviceptr devBufferB;
703 CUdeviceptr devBufferC;
705 checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
706 checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
707 checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));
709 float* hostA = new float[16];
710 float* hostB = new float[16];
711 float* hostC = new float[16];
  // Populate the host input arrays
  for (unsigned i = 0; i != 16; ++i) {
    hostA[i] = (float)i;
    hostB[i] = (float)(2*i);
    hostC[i] = 0.0f;
  }
720 checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
721 checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));
724 unsigned blockSizeX = 16;
725 unsigned blockSizeY = 1;
726 unsigned blockSizeZ = 1;
727 unsigned gridSizeX = 1;
728 unsigned gridSizeY = 1;
729 unsigned gridSizeZ = 1;
732 void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };
734 std::cout << "Launching kernel\n";
737 checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
738 blockSizeX, blockSizeY, blockSizeZ,
739 0, NULL, KernelParams, NULL));
741 // Retrieve device data
742 checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));
745 std::cout << "Results:\n";
746 for (unsigned i = 0; i != 16; ++i) {
    std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
  }
  // Clean up after ourselves
  delete [] hostA;
  delete [] hostB;
  delete [] hostC;

  // Clean up driver state
757 checkCudaErrors(cuMemFree(devBufferA));
758 checkCudaErrors(cuMemFree(devBufferB));
759 checkCudaErrors(cuMemFree(devBufferC));
760 checkCudaErrors(cuModuleUnload(cudaModule));
  checkCudaErrors(cuCtxDestroy(context));

  return 0;
}
767 You will need to link with the CUDA driver and specify the path to cuda.h.
771 # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda
773 We don't need to specify a path to ``libcuda.so`` since this is installed in a
774 system location by the driver, not the CUDA toolkit.
776 If everything goes as planned, you should see the following output when
777 running the compiled program:
781 Using CUDA Device [0]: GeForce GTX 680
782 Device Compute Capability: 3.0
You will likely see a different device identifier based on your hardware.
807 Tutorial: Linking with Libdevice
808 ================================
810 In this tutorial, we show a simple example of linking LLVM IR with the
811 libdevice library. We will use the same kernel as the previous tutorial,
812 except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
813 Libdevice provides an ``__nv_powf`` function that we will use.
817 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
818 target triple = "nvptx64-nvidia-cuda"
820 ; Intrinsic to read X component of thread ID
821 declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
823 declare float @__nv_powf(float, float)
825 define void @kernel(float addrspace(1)* %A,
826 float addrspace(1)* %B,
827 float addrspace(1)* %C) {
830 %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
832 ; Compute pointers into A, B, and C
833 %ptrA = getelementptr float addrspace(1)* %A, i32 %id
834 %ptrB = getelementptr float addrspace(1)* %B, i32 %id
835 %ptrC = getelementptr float addrspace(1)* %C, i32 %id
838 %valA = load float addrspace(1)* %ptrA, align 4
839 %valB = load float addrspace(1)* %ptrB, align 4
841 ; Compute C = pow(A, B)
842 %valC = call float @__nv_powf(float %valA, float %valB)
  store float %valC, float addrspace(1)* %ptrC, align 4

  ret void
}
850 !nvvm.annotations = !{!0}
!0 = metadata !{void (float addrspace(1)*,
                      float addrspace(1)*,
                      float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}
856 To compile this kernel, we perform the following steps:
858 1. Link with libdevice
859 2. Internalize all but the public kernel function
860 3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
861 4. Optimize the linked module
862 5. Codegen the module
865 These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
tools. In a complete compiler, these steps can also be performed entirely
programmatically by setting up an appropriate pass configuration (see the
programmatic pass pipeline example in the libdevice section above).
872 # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
873 # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
874 # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx
The ``-nvvm-reflect-list=__CUDA_FTZ=0`` option is not strictly required, as
any undefined variables will default to zero. It is shown here for evaluation
purposes.
883 This gives us the following PTX (excerpt):
888 // Generated by LLVM NVPTX Back-End
897 .visible .entry kernel(
898 .param .u64 kernel_param_0,
899 .param .u64 kernel_param_1,
900 .param .u64 kernel_param_2
909 ld.param.u64 %rl2, [kernel_param_0];
911 ld.param.u64 %rl3, [kernel_param_1];
912 mul.wide.s32 %rl4, %r3, 4;
913 add.s64 %rl5, %rl2, %rl4;
914 ld.param.u64 %rl6, [kernel_param_2];
915 add.s64 %rl7, %rl3, %rl4;
916 add.s64 %rl1, %rl6, %rl4;
917 ld.global.f32 %f1, [%rl5];
918 ld.global.f32 %f2, [%rl7];
919 setp.eq.f32 %p1, %f1, 0f3F800000;
920 setp.eq.f32 %p2, %f2, 0f00000000;
921 or.pred %p3, %p1, %p2;
925 mov.f32 %f110, 0f3F800000;
926 st.global.f32 [%rl1], %f110;
928 BB0_2: // %__nv_isnanf.exit.i
930 setp.gtu.f32 %p4, %f4, 0f7F800000;
932 // BB#3: // %__nv_isnanf.exit5.i
934 setp.le.f32 %p5, %f5, 0f7F800000;
936 BB0_4: // %.critedge1.i
937 add.f32 %f110, %f1, %f2;
938 st.global.f32 [%rl1], %f110;
940 BB0_5: // %__nv_isinff.exit.i
944 BB0_26: // %__nv_truncf.exit.i.i.i.i.i
945 mul.f32 %f90, %f107, 0f3FB8AA3B;
946 cvt.rzi.f32.f32 %f91, %f90;
947 mov.f32 %f92, 0fBF317200;
948 fma.rn.f32 %f93, %f91, %f92, %f107;
949 mov.f32 %f94, 0fB5BFBE8E;
950 fma.rn.f32 %f95, %f91, %f94, %f93;
951 mul.f32 %f89, %f95, 0f3FB8AA3B;
953 ex2.approx.ftz.f32 %f88,%f89;
955 add.f32 %f96, %f91, 0f00000000;
956 ex2.approx.f32 %f97, %f96;
957 mul.f32 %f98, %f88, %f97;
958 setp.lt.f32 %p15, %f107, 0fC2D20000;
959 selp.f32 %f99, 0f00000000, %f98, %p15;
960 setp.gt.f32 %p16, %f107, 0f42D20000;
961 selp.f32 %f110, 0f7F800000, %f99, %p16;
962 setp.eq.f32 %p17, %f110, 0f7F800000;
965 fma.rn.f32 %f110, %f110, %f108, %f110;
966 BB0_28: // %__internal_accurate_powf.exit.i
967 setp.lt.f32 %p18, %f1, 0f00000000;
968 setp.eq.f32 %p19, %f3, 0f3F800000;
969 and.pred %p20, %p18, %p19;
974 xor.b32 %r10, %r9, -2147483648;
976 BB0_30: // %__nv_powf.exit
977 st.global.f32 [%rl1], %f110;