docs/NVPTXUsage.rst

   1 =============================
   2 User Guide for NVPTX Back-end
   3 =============================
   4
   5 .. contents::
   6    :local:
   7    :depth: 3
   8
   9
  10 Introduction
  11 ============
  12
  13 To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
  14 along with a defined set of conventions used to represent GPU programming
  15 concepts. This document provides an overview of the general usage of the back-
  16 end, including a description of the conventions used and the set of accepted
  17 LLVM IR.
  18
  19 .. note::
  20
  21    This document assumes a basic familiarity with CUDA and the PTX
  22    assembly language. Information about the CUDA Driver API and the PTX assembly
  23    language can be found in the `CUDA documentation
  24    <http://docs.nvidia.com/cuda/index.html>`_.
  25
  26
  27
  28 Conventions
  29 ===========
  30
  31 Marking Functions as Kernels
  32 ----------------------------
  33
  34 In PTX, there are two types of functions: *device functions*, which are only
  35 callable by device code, and *kernel functions*, which are callable by host
  36 code. By default, the back-end will emit device functions. Metadata is used to
  37 declare a function as a kernel function. This metadata is attached to the
  38 ``nvvm.annotations`` named metadata object, and has the following format:
  39
  40 .. code-block:: llvm
  41
  42    !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}
  43
  44 The first parameter is a reference to the kernel function. The following
  45 example shows a kernel function calling a device function in LLVM IR. The
  46 function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.
  47
  48 .. code-block:: llvm
  49
  50     define float @my_fmad(float %x, float %y, float %z) {
  51       %mul = fmul float %x, %y
  52       %add = fadd float %mul, %z
  53       ret float %add
  54     }
  55
  56     define void @my_kernel(float* %ptr) {
  57       %val = load float* %ptr
  58       %ret = call float @my_fmad(float %val, float %val, float %val)
  59       store float %ret, float* %ptr
  60       ret void
  61     }
  62
  63     !nvvm.annotations = !{!1}
  64     !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}
  65
  66 When compiled, the PTX kernel functions are callable by host-side code.
  67
  68
  69 .. _address_spaces:
  70
  71 Address Spaces
  72 --------------
  73
  74 The NVPTX back-end uses the following address space mapping:
  75
  76    ============= ======================
  77    Address Space Memory Space
  78    ============= ======================
  79    0             Generic
  80    1             Global
  81    2             Internal Use
  82    3             Shared
  83    4             Constant
  84    5             Local
  85    ============= ======================
  86
  87 Every global variable and pointer type is assigned to one of these address
  88 spaces, with 0 being the default address space. Intrinsics are provided which
  89 can be used to convert pointers between the generic and non-generic address
  90 spaces.
  91
  92 As an example, the following IR will define an array ``@g`` that resides in
  93 global device memory.
  94
  95 .. code-block:: llvm
  96
  97     @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]
  98
  99 LLVM IR functions can read and write to this array, and host-side code can
 100 copy data to it by name with the CUDA Driver API.
 101
 102 Note that since address space 0 is the generic space, it is illegal to have
 103 global variables in address space 0.  Address space 0 is the default address
 104 space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
 105 variables.
 106
 107
 108 Triples
 109 -------
 110
 111 The NVPTX target uses the module triple to select between 32/64-bit code
 112 generation and the driver-compiler interface to use. The triple architecture
 113 can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The
 114 operating system should be one of ``cuda`` or ``nvcl``, which determines the
 115 interface used by the generated code to communicate with the driver.  Most
 116 users will want to use ``cuda`` as the operating system, which makes the
 117 generated PTX compatible with the CUDA Driver API.
 118
 119 Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``
 120
 121 Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``
 122
 123
 124
 125 .. _nvptx_intrinsics:
 126
 127 NVPTX Intrinsics
 128 ================
 129
 130 Address Space Conversion
 131 ------------------------
 132
 133 '``llvm.nvvm.ptr.*.to.gen``' Intrinsics
 134 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 135
 136 Syntax:
 137 """""""
 138
 139 These are overloaded intrinsics.  You can use these on any pointer types.
 140
 141 .. code-block:: llvm
 142
 143     declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
 144     declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
 145     declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
 146     declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)
 147
 148 Overview:
 149 """""""""
 150
 151 The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
 152 address space to a generic address space pointer.
 153
 154 Semantics:
 155 """"""""""
 156
 157 These intrinsics modify the pointer value to be a valid generic address space
 158 pointer.
 159
 160
 161 '``llvm.nvvm.ptr.gen.to.*``' Intrinsics
 162 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 163
 164 Syntax:
 165 """""""
 166
 167 These are overloaded intrinsics.  You can use these on any pointer types.
 168
 169 .. code-block:: llvm
 170
 171     declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
 172     declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
 173     declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
 174     declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)
 175
 176 Overview:
 177 """""""""
 178
 179 The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
 180 address space to a pointer in the target address space.  Note that these
 181 intrinsics are only useful if the address space of the target address space of
 182 the pointer is known.  It is not legal to use address space conversion
 183 intrinsics to convert a pointer from one non-generic address space to another
 184 non-generic address space.
 185
 186 Semantics:
 187 """"""""""
 188
 189 These intrinsics modify the pointer value to be a valid pointer in the target
 190 non-generic address space.
 191
 192
 193 Reading PTX Special Registers
 194 -----------------------------
 195
 196 '``llvm.nvvm.read.ptx.sreg.*``'
 197 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 198
 199 Syntax:
 200 """""""
 201
 202 .. code-block:: llvm
 203
 204     declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
 205     declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
 206     declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
 207     declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 208     declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
 209     declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
 210     declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
 211     declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
 212     declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
 213     declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
 214     declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
 215     declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
 216     declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
 217
 218 Overview:
 219 """""""""
 220
 221 The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
 222 special registers, in particular the kernel launch bounds.  These registers
 223 map in the following way to CUDA builtins:
 224
 225    ============ =====================================
 226    CUDA Builtin PTX Special Register Intrinsic
 227    ============ =====================================
 228    ``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
 229    ``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
 230    ``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*``
 231    ``gridDim``  ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
 232    ============ =====================================
 233
 234
 235 Barriers
 236 --------
 237
 238 '``llvm.nvvm.barrier0``'
 239 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 240
 241 Syntax:
 242 """""""
 243
 244 .. code-block:: llvm
 245
 246   declare void @llvm.nvvm.barrier0()
 247
 248 Overview:
 249 """""""""
 250
 251 The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
 252 instruction, equivalent to the ``__syncthreads()`` call in CUDA.
 253
 254
 255 Other Intrinsics
 256 ----------------
 257
 258 For the full set of NVPTX intrinsics, please see the
 259 ``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.
 260
 261
 262 .. _libdevice:
 263
 264 Linking with Libdevice
 265 ======================
 266
 267 The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
 268 implements many common mathematical functions. This library can be used as a
 269 high-performance math library for any compilers using the LLVM NVPTX target.
 270 The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit and
 271 there is a separate version for each compute architecture.
 272
 273 For a list of all math functions implemented in libdevice, see
 274 `libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.
 275
 276 To accommodate various math-related compiler flags that can affect code
 277 generation of libdevice code, the library code depends on a special LLVM IR
 278 pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
 279 pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
 280 with constants based on the defined reflection parameters. Such conditional
 281 code often follows a pattern:
 282
 283 .. code-block:: c++
 284
 285   float my_function(float a) {
 286     if (__nvvm_reflect("FASTMATH"))
 287       return my_function_fast(a);
 288     else
 289       return my_function_precise(a);
 290   }
 291
 292 The default value for all unspecified reflection parameters is zero.
 293
 294 The ``NVVMReflect`` pass should be executed early in the optimization
 295 pipeline, immediately after the link stage. The ``internalize`` pass is also
 296 recommended to remove unused math functions from the resulting PTX. For an
 297 input IR module ``module.bc``, the following compilation flow is recommended:
 298
 299 1. Save list of external functions in ``module.bc``
 300 2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
 301 3. Internalize all functions not in list from (1)
 302 4. Eliminate all unused internal functions
 303 5. Run ``NVVMReflect`` pass
 304 6. Run standard optimization pipeline
 305
 306 .. note::
 307
 308   ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
 309   libdevice functions. It is possible to link two IR modules that have been
 310   linked against libdevice using different reflection variables.
 311
 312 Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
 313 often leave behind dead code of the form:
 314
 315 .. code-block:: llvm
 316
 317   entry:
 318     ..
 319     br i1 true, label %foo, label %bar
 320   foo:
 321     ..
 322   bar:
 323     ; Dead code
 324     ..
 325
 326 Therefore, it is recommended that ``NVVMReflect`` is executed early in the
 327 optimization pipeline before dead-code elimination.
 328
 329
 330 Reflection Parameters
 331 ---------------------
 332
 333 The libdevice library currently uses the following reflection parameters to
 334 control code generation:
 335
 336 ==================== ======================================================
 337 Flag                 Description
 338 ==================== ======================================================
 339 ``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
 340 ==================== ======================================================
 341
 342
 343 Invoking NVVMReflect
 344 --------------------
 345
 346 To ensure that all dead code caused by the reflection pass is eliminated, it
 347 is recommended that the reflection pass is executed early in the LLVM IR
 348 optimization pipeline. The pass takes an optional mapping of reflection
 349 parameter name to an integer value. This mapping can be specified as either a
 350 command-line option to ``opt`` or as an LLVM ``StringMap<int>`` object when
 351 programmatically creating a pass pipeline.
 352
 353 With ``opt``:
 354
 355 .. code-block:: text
 356
 357   # opt -nvvm-reflect -nvvm-reflect-list=<var>=<value>,<var>=<value> module.bc -o module.reflect.bc
 358
 359
 360 With programmatic pass pipeline:
 361
 362 .. code-block:: c++
 363
 364   extern ModulePass *llvm::createNVVMReflectPass(const StringMap<int>& Mapping);
 365
 366   StringMap<int> ReflectParams;
 367   ReflectParams["__CUDA_FTZ"] = 1;
 368   Passes.add(createNVVMReflectPass(ReflectParams));
 369
 370
 371
 372 Executing PTX
 373 =============
 374
 375 The most common way to execute PTX assembly on a GPU device is to use the CUDA
 376 Driver API. This API is a low-level interface to the GPU driver and allows for
 377 JIT compilation of PTX code to native GPU machine code.
 378
 379 Initializing the Driver API:
 380
 381 .. code-block:: c++
 382
 383     CUdevice device;
 384     CUcontext context;
 385
 386     // Initialize the driver API
 387     cuInit(0);
 388     // Get a handle to the first compute device
 389     cuDeviceGet(&device, 0);
 390     // Create a compute device context
 391     cuCtxCreate(&context, 0, device);
 392
 393 JIT compiling a PTX string to a device binary:
 394
 395 .. code-block:: c++
 396
 397     CUmodule module;
 398     CUfunction funcion;
 399
 400     // JIT compile a null-terminated PTX string
 401     cuModuleLoadData(&module, (void*)PTXString);
 402
 403     // Get a handle to the "myfunction" kernel function
 404     cuModuleGetFunction(&function, module, "myfunction");
 405
 406 For full examples of executing PTX assembly, please see the `CUDA Samples
 407 <https://developer.nvidia.com/cuda-downloads>`_ distribution.
 408
 409
 410 Common Issues
 411 =============
 412
 413 ptxas complains of undefined function: __nvvm_reflect
 414 -----------------------------------------------------
 415
 416 When linking with libdevice, the ``NVVMReflect`` pass must be used. See
 417 :ref:`libdevice` for more information.
 418
 419
 420 Tutorial: A Simple Compute Kernel
 421 =================================
 422
 423 To start, let us take a look at a simple compute kernel written directly in
 424 LLVM IR. The kernel implements vector addition, where each thread computes one
 425 element of the output vector C from the input vectors A and B.  To make this
 426 easier, we also assume that only a single CTA (thread block) will be launched,
 427 and that it will be one dimensional.
 428
 429
 430 The Kernel
 431 ----------
 432
 433 .. code-block:: llvm
 434
 435   target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
 436   target triple = "nvptx64-nvidia-cuda"
 437
 438   ; Intrinsic to read X component of thread ID
 439   declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
 440
 441   define void @kernel(float addrspace(1)* %A,
 442                       float addrspace(1)* %B,
 443                       float addrspace(1)* %C) {
 444   entry:
 445     ; What is my ID?
 446     %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
 447
 448     ; Compute pointers into A, B, and C
 449     %ptrA = getelementptr float addrspace(1)* %A, i32 %id
 450     %ptrB = getelementptr float addrspace(1)* %B, i32 %id
 451     %ptrC = getelementptr float addrspace(1)* %C, i32 %id
 452
 453     ; Read A, B
 454     %valA = load float addrspace(1)* %ptrA, align 4
 455     %valB = load float addrspace(1)* %ptrB, align 4
 456
 457     ; Compute C = A + B
 458     %valC = fadd float %valA, %valB
 459
 460     ; Store back to C
 461     store float %valC, float addrspace(1)* %ptrC, align 4
 462
 463     ret void
 464   }
 465
 466   !nvvm.annotations = !{!0}
 467   !0 = metadata !{void (float addrspace(1)*,
 468                         float addrspace(1)*,
 469                         float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}
 470
 471
 472 We can use the LLVM ``llc`` tool to directly run the NVPTX code generator:
 473
 474 .. code-block:: text
 475
 476   # llc -mcpu=sm_20 kernel.ll -o kernel.ptx
 477
 478
 479 .. note::
 480
 481   If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
 482   in the module data layout string and use ``nvptx-nvidia-cuda`` as the
 483   target triple.
 484
 485
 486 The output we get from ``llc`` (as of LLVM 3.4):
 487
 488 .. code-block:: text
 489
 490   //
 491   // Generated by LLVM NVPTX Back-End
 492   //
 493
 494   .version 3.1
 495   .target sm_20
 496   .address_size 64
 497
 498     // .globl kernel
 499                                           // @kernel
 500   .visible .entry kernel(
 501     .param .u64 kernel_param_0,
 502     .param .u64 kernel_param_1,
 503     .param .u64 kernel_param_2
 504   )
 505   {
 506     .reg .f32   %f<4>;
 507     .reg .s32   %r<2>;
 508     .reg .s64   %rl<8>;
 509
 510   // BB#0:                                // %entry
 511     ld.param.u64    %rl1, [kernel_param_0];
 512     mov.u32         %r1, %tid.x;
 513     mul.wide.s32    %rl2, %r1, 4;
 514     add.s64         %rl3, %rl1, %rl2;
 515     ld.param.u64    %rl4, [kernel_param_1];
 516     add.s64         %rl5, %rl4, %rl2;
 517     ld.param.u64    %rl6, [kernel_param_2];
 518     add.s64         %rl7, %rl6, %rl2;
 519     ld.global.f32   %f1, [%rl3];
 520     ld.global.f32   %f2, [%rl5];
 521     add.f32         %f3, %f1, %f2;
 522     st.global.f32   [%rl7], %f3;
 523     ret;
 524   }
 525
 526
 527 Dissecting the Kernel
 528 ---------------------
 529
 530 Now let us dissect the LLVM IR that makes up this kernel.
 531
 532 Data Layout
 533 ^^^^^^^^^^^
 534
 535 The data layout string determines the size in bits of common data types, their
 536 ABI alignment, and their storage size.  For NVPTX, you should use one of the
 537 following:
 538
 539 32-bit PTX:
 540
 541 .. code-block:: llvm
 542
 543   target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
 544
 545 64-bit PTX:
 546
 547 .. code-block:: llvm
 548
 549   target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
 550
 551
 552 Target Intrinsics
 553 ^^^^^^^^^^^^^^^^^
 554
 555 In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
 556 read the X component of the current thread's ID, which corresponds to a read
 557 of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
 558 intrinsics.  A short list is shown below; please see
 559 ``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.
 560
 561
 562 ================================================ ====================
 563 Intrinsic                                        CUDA Equivalent
 564 ================================================ ====================
 565 ``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}``     threadIdx.{x,y,z}
 566 ``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}``   blockIdx.{x,y,z}
 567 ``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}``    blockDim.{x,y,z}
 568 ``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}``  gridDim.{x,y,z}
 569 ``void @llvm.cuda.syncthreads()``                __syncthreads()
 570 ================================================ ====================
 571
 572
 573 Address Spaces
 574 ^^^^^^^^^^^^^^
 575
 576 You may have noticed that all of the pointer types in the LLVM IR example had
 577 an explicit address space specifier. What is address space 1? NVIDIA GPU
 578 devices (generally) have four types of memory:
 579
 580 - Global: Large, off-chip memory
 581 - Shared: Small, on-chip memory shared among all threads in a CTA
 582 - Local: Per-thread, private memory
 583 - Constant: Read-only memory shared across all threads
 584
 585 These different types of memory are represented in LLVM IR as address spaces.
 586 There is also a fifth address space used by the NVPTX code generator that
 587 corresponds to the "generic" address space.  This address space can represent
 588 addresses in any other address space (with a few exceptions).  This allows
 589 users to write IR functions that can load/store memory using the same
 590 instructions. Intrinsics are provided to convert pointers between the generic
 591 and non-generic address spaces.
 592
 593 See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.
 594
 595
 596 Kernel Metadata
 597 ^^^^^^^^^^^^^^^
 598
 599 In PTX, a function can be either a `kernel` function (callable from the host
 600 program), or a `device` function (callable only from GPU code). You can think
 601 of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
 602 function as a `kernel` function, we make use of special LLVM metadata. The
 603 NVPTX back-end will look for a named metadata node called
 604 ``nvvm.annotations``. This named metadata must contain a list of metadata that
 605 describe the IR. For our purposes, we need to declare a metadata node that
 606 assigns the "kernel" attribute to the LLVM IR function that should be emitted
 607 as a PTX `kernel` function. These metadata nodes take the form:
 608
 609 .. code-block:: text
 610
 611   metadata !{<function ref>, metadata !"kernel", i32 1}
 612
 613 For the previous example, we have:
 614
 615 .. code-block:: llvm
 616
 617   !nvvm.annotations = !{!0}
 618   !0 = metadata !{void (float addrspace(1)*,
 619                         float addrspace(1)*,
 620                         float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}
 621
 622 Here, we have a single metadata declaration in ``nvvm.annotations``. This
 623 metadata annotates our ``@kernel`` function with the ``kernel`` attribute.
 624
 625
 626 Running the Kernel
 627 ------------------
 628
 629 Generating PTX from LLVM IR is all well and good, but how do we execute it on
 630 a real GPU device? The CUDA Driver API provides a convenient mechanism for
 631 loading and JIT compiling PTX to a native GPU device, and launching a kernel.
 632 The API is similar to OpenCL.  A simple example showing how to load and
 633 execute our vector addition code is shown below. Note that for brevity this
 634 code does not perform much error checking!
 635
 636 .. note::
 637
 638   You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline
 639   compile PTX to machine code (SASS) for a specific GPU architecture. Such
 640   binaries can be loaded by the CUDA Driver API in the same way as PTX. This
 641   can be useful for reducing startup time by precompiling the PTX kernels.
 642
 643
 644 .. code-block:: c++
 645
 646   #include <iostream>
 647   #include <fstream>
 648   #include <cassert>
 649   #include "cuda.h"
 650
 651
 652   void checkCudaErrors(CUresult err) {
 653     assert(err == CUDA_SUCCESS);
 654   }
 655
 656   /// main - Program entry point
 657   int main(int argc, char **argv) {
 658     CUdevice    device;
 659     CUmodule    cudaModule;
 660     CUcontext   context;
 661     CUfunction  function;
 662     CUlinkState linker;
 663     int         devCount;
 664
 665     // CUDA initialization
 666     checkCudaErrors(cuInit(0));
 667     checkCudaErrors(cuDeviceGetCount(&devCount));
 668     checkCudaErrors(cuDeviceGet(&device, 0));
 669
 670     char name[128];
 671     checkCudaErrors(cuDeviceGetName(name, 128, device));
 672     std::cout << "Using CUDA Device [0]: " << name << "\n";
 673
 674     int devMajor, devMinor;
 675     checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
 676     std::cout << "Device Compute Capability: "
 677               << devMajor << "." << devMinor << "\n";
 678     if (devMajor < 2) {
 679       std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
 680       return 1;
 681     }
 682
 683     std::ifstream t("kernel.ptx");
 684     if (!t.is_open()) {
 685       std::cerr << "kernel.ptx not found\n";
 686       return 1;
 687     }
 688     std::string str((std::istreambuf_iterator<char>(t)),
 689                       std::istreambuf_iterator<char>());
 690
 691     // Create driver context
 692     checkCudaErrors(cuCtxCreate(&context, 0, device));
 693
 694     // Create module for object
 695     checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));
 696
 697     // Get kernel function
 698     checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));
 699
 700     // Device data
 701     CUdeviceptr devBufferA;
 702     CUdeviceptr devBufferB;
 703     CUdeviceptr devBufferC;
 704
 705     checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
 706     checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
 707     checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));
 708
 709     float* hostA = new float[16];
 710     float* hostB = new float[16];
 711     float* hostC = new float[16];
 712
 713     // Populate input
 714     for (unsigned i = 0; i != 16; ++i) {
 715       hostA[i] = (float)i;
 716       hostB[i] = (float)(2*i);
 717       hostC[i] = 0.0f;
 718     }
 719
 720     checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
 721     checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));
 722
 723
 724     unsigned blockSizeX = 16;
 725     unsigned blockSizeY = 1;
 726     unsigned blockSizeZ = 1;
 727     unsigned gridSizeX  = 1;
 728     unsigned gridSizeY  = 1;
 729     unsigned gridSizeZ  = 1;
 730
 731     // Kernel parameters
 732     void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };
 733
 734     std::cout << "Launching kernel\n";
 735
 736     // Kernel launch
 737     checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
 738                                    blockSizeX, blockSizeY, blockSizeZ,
 739                                    0, NULL, KernelParams, NULL));
 740
 741     // Retrieve device data
 742     checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));
 743
 744
 745     std::cout << "Results:\n";
 746     for (unsigned i = 0; i != 16; ++i) {
 747       std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
 748     }
 749
 750
 751     // Clean up after ourselves
 752     delete [] hostA;
 753     delete [] hostB;
 754     delete [] hostC;
 755
 756     // Clean-up
 757     checkCudaErrors(cuMemFree(devBufferA));
 758     checkCudaErrors(cuMemFree(devBufferB));
 759     checkCudaErrors(cuMemFree(devBufferC));
 760     checkCudaErrors(cuModuleUnload(cudaModule));
 761     checkCudaErrors(cuCtxDestroy(context));
 762
 763     return 0;
 764   }
 765
 766
 767 You will need to link with the CUDA driver and specify the path to cuda.h.
 768
 769 .. code-block:: text
 770
 771   # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda
 772
 773 We don't need to specify a path to ``libcuda.so`` since this is installed in a
 774 system location by the driver, not the CUDA toolkit.
 775
 776 If everything goes as planned, you should see the following output when
 777 running the compiled program:
 778
 779 .. code-block:: text
 780
 781   Using CUDA Device [0]: GeForce GTX 680
 782   Device Compute Capability: 3.0
 783   Launching kernel
 784   Results:
 785   0 + 0 = 0
 786   1 + 2 = 3
 787   2 + 4 = 6
 788   3 + 6 = 9
 789   4 + 8 = 12
 790   5 + 10 = 15
 791   6 + 12 = 18
 792   7 + 14 = 21
 793   8 + 16 = 24
 794   9 + 18 = 27
 795   10 + 20 = 30
 796   11 + 22 = 33
 797   12 + 24 = 36
 798   13 + 26 = 39
 799   14 + 28 = 42
 800   15 + 30 = 45
 801
 802 .. note::
 803
 804   You will likely see a different device identifier based on your hardware
 805
 806
 807 Tutorial: Linking with Libdevice
 808 ================================
 809
 810 In this tutorial, we show a simple example of linking LLVM IR with the
 811 libdevice library. We will use the same kernel as the previous tutorial,
 812 except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
 813 Libdevice provides an ``__nv_powf`` function that we will use.
 814
 815 .. code-block:: llvm
 816
 817   target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
 818   target triple = "nvptx64-nvidia-cuda"
 819
 820   ; Intrinsic to read X component of thread ID
 821   declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
 822   ; libdevice function
 823   declare float @__nv_powf(float, float)
 824
 825   define void @kernel(float addrspace(1)* %A,
 826                       float addrspace(1)* %B,
 827                       float addrspace(1)* %C) {
 828   entry:
 829     ; What is my ID?
 830     %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
 831
 832     ; Compute pointers into A, B, and C
 833     %ptrA = getelementptr float addrspace(1)* %A, i32 %id
 834     %ptrB = getelementptr float addrspace(1)* %B, i32 %id
 835     %ptrC = getelementptr float addrspace(1)* %C, i32 %id
 836
 837     ; Read A, B
 838     %valA = load float addrspace(1)* %ptrA, align 4
 839     %valB = load float addrspace(1)* %ptrB, align 4
 840
 841     ; Compute C = pow(A, B)
 842     %valC = call float @__nv_powf(float %valA, float %valB)
 843
 844     ; Store back to C
 845     store float %valC, float addrspace(1)* %ptrC, align 4
 846
 847     ret void
 848   }
 849
 850   !nvvm.annotations = !{!0}
 851   !0 = metadata !{void (float addrspace(1)*,
 852                         float addrspace(1)*,
 853                         float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}
 854
 855
 856 To compile this kernel, we perform the following steps:
 857
 858 1. Link with libdevice
 859 2. Internalize all but the public kernel function
 860 3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
 861 4. Optimize the linked module
 862 5. Codegen the module
 863
 864
 865 These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
 866 tools. In a complete compiler, these steps can also be performed entirely
 867 programmatically by setting up an appropriate pass configuration (see
 868 :ref:`libdevice`).
 869
 870 .. code-block:: text
 871
 872   # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
 873   # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
 874   # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx
 875
 876 .. note::
 877
 878   The ``-nvvm-reflect-list=_CUDA_FTZ=0`` is not strictly required, as any
 879   undefined variables will default to zero. It is shown here for evaluation
 880   purposes.
 881
 882
 883 This gives us the following PTX (excerpt):
 884
 885 .. code-block:: text
 886
 887   //
 888   // Generated by LLVM NVPTX Back-End
 889   //
 890
 891   .version 3.1
 892   .target sm_20
 893   .address_size 64
 894
 895     // .globl kernel
 896                                           // @kernel
 897   .visible .entry kernel(
 898     .param .u64 kernel_param_0,
 899     .param .u64 kernel_param_1,
 900     .param .u64 kernel_param_2
 901   )
 902   {
 903     .reg .pred  %p<30>;
 904     .reg .f32   %f<111>;
 905     .reg .s32   %r<21>;
 906     .reg .s64   %rl<8>;
 907
 908   // BB#0:                                // %entry
 909     ld.param.u64  %rl2, [kernel_param_0];
 910     mov.u32   %r3, %tid.x;
 911     ld.param.u64  %rl3, [kernel_param_1];
 912     mul.wide.s32  %rl4, %r3, 4;
 913     add.s64   %rl5, %rl2, %rl4;
 914     ld.param.u64  %rl6, [kernel_param_2];
 915     add.s64   %rl7, %rl3, %rl4;
 916     add.s64   %rl1, %rl6, %rl4;
 917     ld.global.f32   %f1, [%rl5];
 918     ld.global.f32   %f2, [%rl7];
 919     setp.eq.f32 %p1, %f1, 0f3F800000;
 920     setp.eq.f32 %p2, %f2, 0f00000000;
 921     or.pred   %p3, %p1, %p2;
 922     @%p3 bra  BB0_1;
 923     bra.uni   BB0_2;
 924   BB0_1:
 925     mov.f32   %f110, 0f3F800000;
 926     st.global.f32   [%rl1], %f110;
 927     ret;
 928   BB0_2:                                  // %__nv_isnanf.exit.i
 929     abs.f32   %f4, %f1;
 930     setp.gtu.f32  %p4, %f4, 0f7F800000;
 931     @%p4 bra  BB0_4;
 932   // BB#3:                                // %__nv_isnanf.exit5.i
 933     abs.f32   %f5, %f2;
 934     setp.le.f32 %p5, %f5, 0f7F800000;
 935     @%p5 bra  BB0_5;
 936   BB0_4:                                  // %.critedge1.i
 937     add.f32   %f110, %f1, %f2;
 938     st.global.f32   [%rl1], %f110;
 939     ret;
 940   BB0_5:                                  // %__nv_isinff.exit.i
 941
 942     ...
 943
 944   BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i
 945     mul.f32   %f90, %f107, 0f3FB8AA3B;
 946     cvt.rzi.f32.f32 %f91, %f90;
 947     mov.f32   %f92, 0fBF317200;
 948     fma.rn.f32  %f93, %f91, %f92, %f107;
 949     mov.f32   %f94, 0fB5BFBE8E;
 950     fma.rn.f32  %f95, %f91, %f94, %f93;
 951     mul.f32   %f89, %f95, 0f3FB8AA3B;
 952     // inline asm
 953     ex2.approx.ftz.f32 %f88,%f89;
 954     // inline asm
 955     add.f32   %f96, %f91, 0f00000000;
 956     ex2.approx.f32  %f97, %f96;
 957     mul.f32   %f98, %f88, %f97;
 958     setp.lt.f32 %p15, %f107, 0fC2D20000;
 959     selp.f32  %f99, 0f00000000, %f98, %p15;
 960     setp.gt.f32 %p16, %f107, 0f42D20000;
 961     selp.f32  %f110, 0f7F800000, %f99, %p16;
 962     setp.eq.f32 %p17, %f110, 0f7F800000;
 963     @%p17 bra   BB0_28;
 964   // BB#27:
 965     fma.rn.f32  %f110, %f110, %f108, %f110;
 966   BB0_28:                                 // %__internal_accurate_powf.exit.i
 967     setp.lt.f32 %p18, %f1, 0f00000000;
 968     setp.eq.f32 %p19, %f3, 0f3F800000;
 969     and.pred    %p20, %p18, %p19;
 970     @!%p20 bra  BB0_30;
 971     bra.uni   BB0_29;
 972   BB0_29:
 973     mov.b32    %r9, %f110;
 974     xor.b32   %r10, %r9, -2147483648;
 975     mov.b32    %f110, %r10;
 976   BB0_30:                                 // %__nv_powf.exit
 977     st.global.f32   [%rl1], %f110;
 978     ret;
 979   }
 980