=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
NVIDIA CUDA Toolkit v4.0 RC
Release Notes for Windows, Linux, and Mac OS X
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

--- Contents:
-------------
--- Release Highlights
--- Documentation
--- List of important files
--- Supported NVIDIA hardware
--- Supported Operating Systems for:
------ Windows
------ Linux
------ Mac OS X
--- Installation Notes
--- Upgrading from CUDA Toolkit Release 3.2
--- Notes on New Features and Performance Improvements
--- Known Issues
--- Resolved Issues
--- Source code for Open64 and cuda-gdb
--- Revision History
--- More information
--- Acknowledgements

This release contains:
* NVIDIA CUDA Toolkit documentation
* NVIDIA OpenCL documentation
* NVIDIA CUDA compiler (nvcc) and supporting tools
* NVIDIA CUDA runtime libraries
* NVIDIA CUBLAS, CUFFT, CUSPARSE, CURAND, Thrust, and NPP libraries

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Release Highlights
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

NVIDIA CUDA Toolkit version 4.0 has the following new features:

Easier application porting
* Share GPUs across multiple threads
* Single-thread access to all GPUs
* No-copy pinning of system memory
* New CUDA C/C++ language features
* Thrust templated primitives library
* NPP image/video processing library
* Layered textures

Faster multi-GPU programming
* Unified virtual addressing
* GPUDirect v2.0 with peer-to-peer communication

New and improved developer tools
* Automated performance analysis
* C++ debugging
* cuda-gdb debugger for Mac OS
* GPU binary disassembler

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Documentation
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

For a list of documents supplied with this release, please refer to the /doc
directory of your CUDA Toolkit installation.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
List of important files
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

bin/nvcc                      Command-line compiler

include/
  cuda.h                      CUDA driver API header
  cudaGL.h                    CUDA OpenGL interop header for driver API
  cudaVDPAU.h                 CUDA VDPAU interop header for driver API
                              (Linux only)
  cuda_gl_interop.h           CUDA OpenGL interop header for toolkit API
                              (Linux only)
  cuda_vdpau_interop.h        CUDA VDPAU interop header for toolkit API
                              (Linux only)
  cudaD3D9.h                  CUDA DirectX 9 interop header (Windows only)
  cudaD3D10.h                 CUDA DirectX 10 interop header (Windows only)
  cudaD3D11.h                 CUDA DirectX 11 interop header (Windows only)
  cufft.h                     CUFFT API header
  cublas.h                    CUBLAS API header
  cusparse.h                  CUSPARSE API header
  curand.h                    CURAND API header
  curand_kernel.h             CURAND device API header
  thrust/*                    Thrust headers
  npp.h                       NPP API header
  nvcuvid.h                   CUDA Video Decoder header (Windows and Linux)
  cuviddec.h                  CUDA Video Decoder header (Windows and Linux)
  NVEncodeDataTypes.h         CUDA Video Encoder (C library or DirectShow);
                              required for projects (Windows only)
  NVEncodeAPI.h               CUDA Video Encoder (C library); required for
                              projects (Windows only)
  INvTranscodeFilterGUIDs.h   CUDA Video Encoder (DirectShow); required for
                              projects (Windows only)
  INVVESetting.h              CUDA Video Encoder (DirectShow); required for
                              projects (Windows only)
------------------
Windows lib files
------------------
lib/
  cuda.lib          CUDA driver library
  cudart.lib        CUDA runtime library
  cublas.lib        CUDA BLAS library
  cufft.lib         CUDA FFT library
  cusparse.lib      CUDA Sparse Matrix library
  curand.lib        CUDA Random Number Generation library
  npp.lib           NVIDIA Performance Primitives library
  nvcuvenc.lib      CUDA Video Encoder library
  nvcuvid.lib       CUDA Video Decoder library

------------------
Linux lib files
------------------
lib/
  libcuda.so        CUDA driver library
  libcudart.so      CUDA runtime library
  libcublas.so      CUDA BLAS library
  libcufft.so       CUDA FFT library
  libcusparse.so    CUDA Sparse Matrix library
  libcurand.so      CUDA Random Number Generation library
  libnpp.so         NVIDIA Performance Primitives library

------------------
Mac OS X lib files
------------------
lib/
  libcuda.dylib     CUDA driver library
  libcudart.dylib   CUDA runtime library
  libcublas.dylib   CUDA BLAS library
  libcufft.dylib    CUDA FFT library
  libcusparse.dylib CUDA Sparse Matrix library
  libcurand.dylib   CUDA Random Number Generation library
  libnpp.dylib      NVIDIA Performance Primitives library

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Supported NVIDIA Hardware
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

* See http://www.nvidia.com/object/cuda_gpus.html

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Supported Operating Systems for Windows, Linux, and Mac OS X
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

------------------
Windows
------------------
* Supported Operating Systems (32-bit and 64-bit)
  - Windows XP
  - Windows Vista
  - Windows Server 2003
  - Windows Server 2008
  - Windows Server 2008 R2
  - Windows 7

------------------
Linux
------------------
* Supported Operating Systems (32-bit and 64-bit)
  - Red Hat Enterprise Linux 5.5*
  - OpenSUSE 11.2
  - SUSE Linux Enterprise 11 SP1*
  - Fedora 13*
  - Ubuntu 10.04*

* Eliminated Operating Systems Support
  - Fedora 12
  - Red Hat Enterprise Linux 5.4
  - SUSE Linux Enterprise 11
  - Ubuntu 9.10

--------------------------------------------------------------------------------
Supported Linux Distros
--------------------------------------------------------------------------------
  Distro           Kernel               GCC           GLIBC
  ------           ------               ---           -----
  SLES11-SP1       2.6.32.12-0.7-pae    4.3-62.198    2.11.1-0.17.4
  RHEL-6.0         2.6.18-194.el5       4.1.2         2.5
  Ubuntu-10.10     2.6.35-23-generic    4.4.5         2.12.1
  OpenSUSE-11.2    2.6.31.5-0.1         4.4.1         2.10.1
  Fedora 13        2.6.33.3-85          4.4.4         2.12
  RHEL-4.8         2.6.9-89.ELsmp       3.4.6         2.3.4
  RHEL-5.5         2.6.18-194.el5       4.1.2         2.5

------------------
Mac OS X
------------------
  Mac OS X 10.6

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Installation Notes
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Windows:

Silent Installation: install using msiexec.exe from the shell and pass the
following arguments:

    msiexec.exe /i cudatoolkit.msi /qn

To uninstall, use /x instead of /i.

--------------------------------------------------------------------------------

Linux:

On some Linux releases, due to a GRUB bug in the handling of upper memory and
a default vmalloc that is too small on 32-bit systems, it may be necessary to
pass this information to the bootloader:

    vmalloc=256MB, uppermem=524288

Example of grub conf:

    title Red Hat Desktop (2.6.9-42.ELsmp)
        root (hd0,0)
        uppermem 524288
        kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet
            vmalloc=256MB pci=nommconf
        initrd /initrd-2.6.9-42.ELsmp.img
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Upgrading from CUDA Toolkit Release 3.2
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Please refer to the CUDA_4.0_Readiness_Tech_Brief.pdf.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Notes on New Features and Performance Improvements
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

CUDA Driver Features:
---------------------

* cudaMemcpyAsync now works with non-pinned heap memory.
  The asynchronous copy APIs (cudaMemcpyAsync et al. in the runtime API and
  cuMemcpyHtoDAsync et al. in the driver API) may now take ordinary pageable
  host memory as their source or destination arguments. This is in contrast
  to CUDA 3.2, where host memory could be used only if it had been allocated
  through CUDA (using cudaMallocHost et al. in the runtime API or
  cuMemAllocHost in the driver API). Note that while pageable host memory is
  now permitted with the asynchronous copy APIs, such copies are still
  performed synchronously.
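  The following is a minimal sketch of this relaxed requirement (the size N
  and all error checking are assumptions left out for brevity). Because the
  source buffer is pageable, the "asynchronous" copy below still executes
  synchronously:

      // Pageable host memory with the asynchronous copy API (CUDA 4.0).
      float *h_data = (float *)malloc(N * sizeof(float)); // pageable, not pinned
      float *d_data;
      cudaMalloc((void **)&d_data, N * sizeof(float));

      cudaStream_t stream;
      cudaStreamCreate(&stream);

      // Accepted in CUDA 4.0 even though h_data was not allocated with
      // cudaMallocHost; the copy is performed synchronously as a result.
      cudaMemcpyAsync(d_data, h_data, N * sizeof(float),
                      cudaMemcpyHostToDevice, stream);

  To obtain a truly overlapped copy, allocate the host buffer with
  cudaMallocHost (or cuMemAllocHost) as before.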
* cudaMemcpy is supported across contexts.
  The ability to copy memory between devices in the runtime API (and between
  contexts in the driver API) has been added. When using unified addressing,
  the function cudaMemcpy (and its variants) with the copy direction
  cudaMemcpyDefault may be used to copy between devices in the runtime API
  (the function cuMemcpy may be used in the driver API). When not using
  unified addressing, the function cudaMemcpyPeer in the runtime API (and
  cuMemcpyPeer in the driver API) and its variants may be used to copy
  between devices. This functionality is supported on all platforms and all
  devices, and takes advantage of direct peer access where it is enabled.
  Note that this functionality may not be optimal on compute capability 1.0
  devices, or across non-SLI-linked devices using the WDDM driver model on
  Vista and Win7.

* cudaStreamWaitEvent is supported across contexts.
  The function cudaStreamWaitEvent (or cuStreamWaitEvent in the driver API)
  may be used to effect cross-device (or cross-context, in the driver API)
  synchronization: an event recorded on one device may be waited on by a
  stream created on another device. The dependency is resolved
  asynchronously and efficiently. Note that this may not yet be optimally
  efficient for compute capability 1.0 devices, or for devices that are not
  SLI-linked on Windows Vista/7 using the WDDM driver model.
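  As an illustration of such cross-device synchronization, the sketch below
  (with two hypothetical empty kernels, kernelA and kernelB, standing in for
  real work) records an event on a stream belonging to device 0 and makes a
  stream on device 1 wait for it:

      __global__ void kernelA(void) { /* ... */ }
      __global__ void kernelB(void) { /* ... */ }

      cudaEvent_t event;
      cudaStream_t s0, s1;

      cudaSetDevice(0);
      cudaStreamCreate(&s0);
      cudaEventCreate(&event);
      kernelA<<<128, 256, 0, s0>>>();
      cudaEventRecord(event, s0);         // recorded on device 0

      cudaSetDevice(1);
      cudaStreamCreate(&s1);
      cudaStreamWaitEvent(s1, event, 0);  // device 1 waits on device 0
      kernelB<<<128, 256, 0, s1>>>();     // runs only after kernelA completes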
* Added a flag for the property "Concurrent Data Transfer" to indicate
  support for two simultaneous DMA transfers.
  The ability of the device to concurrently pull data (from the host or a
  peer device) and push data (to the host or a peer) may be queried. In the
  runtime API, this is done by examining the device property
  asyncEngineCount, which will be set to 1 if only one direction of a
  transfer may be active at a time, and to 2 if both directions may be
  active at a time. The corresponding driver API device property query is
  CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT.

* (Windows and Linux) Added support for a unified virtual address space.
  With 64-bit applications, devices of compute capability 2.0 and higher now
  share a single unified address space between the host and all devices.
  This means that the pointer used to access memory on the host is the same
  as the pointer used to access that memory on the device, so the location
  of a piece of memory may be queried directly from its pointer value and
  the direction of a memory copy need not be specified. The function
  cudaPointerGetAttributes in the runtime API (and cuPointerGetAttribute in
  the driver API) may be used to query attributes of a pointer. The copy
  direction cudaMemcpyDefault in the runtime API (and the function cuMemcpy,
  its variants, and the memory type CU_MEMORYTYPE_UNIFIED in the driver API)
  may be used to copy data without specifying the direction. Note that this
  functionality is available only on Linux-64, Windows XP-64, and Windows
  Vista/7 using the TCC driver model.
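  A short sketch of unified virtual addressing in the runtime API (buffer
  size and error checking are assumptions): the copy direction is inferred
  from the pointer values, and the attributes of a pointer can be inspected
  directly:

      float *h_buf, *d_buf;
      size_t bytes = 1 << 20;
      cudaMallocHost((void **)&h_buf, bytes);   // pinned host memory
      cudaMalloc((void **)&d_buf, bytes);

      // No explicit cudaMemcpyHostToDevice; the direction is inferred.
      cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);

      struct cudaPointerAttributes attr;
      cudaPointerGetAttributes(&attr, d_buf);
      // attr.memoryType is cudaMemoryTypeDevice for d_buf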
* The ability to directly access memory on peer devices has been added.
  If direct access of memory on a peer device is possible (which can be
  queried with the runtime API cudaDeviceCanAccessPeer or the driver API
  cuDeviceCanAccessPeer), this functionality can be enabled with
  cudaDeviceEnablePeerAccess (or cuCtxEnablePeerAccess). Memory on the peer
  device can then be read and written directly after it is registered using
  cudaPeerRegister (or cuMemPeerRegister). This functionality is supported
  on NVIDIA Tesla devices of compute capability 2.0 and up under Linux, and
  under Windows XP and Vista/Win7 with the TCC driver.
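  A minimal sketch of the query-and-enable sequence in the runtime API (two
  devices are assumed; the registration step described above is not shown):

      int canAccess = 0;
      cudaDeviceCanAccessPeer(&canAccess, 0, 1); // can device 0 reach device 1?
      if (canAccess) {
          cudaSetDevice(0);
          cudaDeviceEnablePeerAccess(1, 0);      // flags argument must be 0
          // Kernels running on device 0 may now dereference pointers
          // returned by cudaMalloc on device 1.
      }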
* (Linux) DX and OGL textures (shared through interop), mapped as CUDA
  arrays, can now be bound to surface references in CUDA. To do so, the
  DX/OGL resource must be registered with the appropriate register flag:
  CU_GRAPHICS_REGISTER_FLAGS_SURFACE_LDST for the driver API, or
  cudaGraphicsRegisterFlagsSurfaceLoadStore for the runtime API. Please note
  that surfaces have smaller width/height restrictions than textures; if a
  texture registered with the surface load/store flag exceeds those limits,
  an error is generated.

* Removed alignment requirements from the cuMemcpy* functions.
  All restrictions on the alignment of the source and destination pointers
  and pitches for all 2D and 3D copies (using cudaMemcpy3D et al. in the
  runtime API and cuMemcpy3D et al. in the driver API) have been removed.
  Note that using unaligned operands for a copy may result in poorer
  performance than using aligned operands.

* Added 64-bit support to WinXP-64.

* (Windows and Linux) CUDA-OpenGL interop currently supports the following
  set of texture formats:
      {GL_R, GL_RG, GL_RGBA, GL_LUMINANCE, GL_LUMINANCE_ALPHA, GL_ALPHA,
       GL_INTENSITY} X {,8,16,16F,32F,8UI,16UI,32UI,8I,16I,32I}
  These formats are also supported for OpenCL-OpenGL interop. For further
  details on these texture formats, please refer to the OpenGL
  specification.

* Event and stream creation/destruction have been improved in this version.
  The functions cudaStreamDestroy and cudaEventDestroy (cuStreamDestroy and
  cuEventDestroy) are now asynchronous and lightweight: destroying a stream
  or event returns immediately, even if there is still pending work in the
  stream or pending work behind the event. The stream's or event's resources
  are released asynchronously once the stream or event has completed its
  work.

* Added device attributes for the memory clock and the number of threads per
  SM. The following new device attributes are supported in the CUDA driver
  API:
  - CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE: the peak memory clock frequency
    in kilohertz.
  - CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH: the global memory bus width
    in bits.
  - CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE: the size of the L2 cache in bytes.
  - CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR: the maximum number
    of threads that can be resident at one time on a multiprocessor.

* (Windows) This version allows a single CUcontext to be current to multiple
  threads simultaneously.

* A kernel compiled with a __launch_bounds__ directive will have its maximum
  threads per block taken into account when the maximum thread count is
  queried via cuFuncGetAttribute(&i,
  CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, f). Also, cuFuncSetBlockShape(f,
  x, y, z) will reject block shapes that exceed the maximum threads per
  block set via __launch_bounds__. These changes in behavior will likewise
  be visible in the CUDART counterparts of these CUDA APIs.

* Querying the maximum grid Z dimension on Fermi and later architectures now
  returns values greater than 1 (on Fermi it is 65535). The methods for
  querying the maximum grid Z dimension are as follows:
  CUDART:
    1) Call cudaGetDeviceProperties(&prop, dev) and check
       prop.maxGridSize[2].
  CUDA driver:
    1) Call cuDeviceGetProperties(&devProps, hDev) and check
       devProps.maxGridSize[2].
    2) Call cuDeviceGetAttribute(&i, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Z,
       hDev).
  Launching 3D grids is accomplished in CUDART by passing a third grid
  dimension in <<< >>> or in cudaConfigureCall(). Launching 3D grids with
  the CUDA driver requires the new cuLaunchKernel API, which has gridDimX,
  gridDimY, and gridDimZ parameters. It is important to note that only on
  Fermi and later architectures can an application actually use 3D grid
  launches.
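  A brief sketch of querying and using the Z grid dimension from the runtime
  API (the empty kernel myKernel and a Fermi-class device 0 are assumptions):

      __global__ void myKernel(void) { /* ... */ }

      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);
      if (prop.maxGridSize[2] > 1) {      // 3D grid launches supported
          dim3 grid(64, 64, 8);           // third grid dimension now meaningful
          dim3 block(8, 8, 8);
          myKernel<<<grid, block>>>();
      }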
* (Windows) Layered textures (2D) implemented.
  Note: layered textures are currently not supported on the Tesla
  architecture (sm_1x).
  Additional information: "layered textures" are better known as "array
  textures" in graphics APIs. A layered texture is a collection of either 1D
  or 2D textures of identical size and format, arranged in layers. Such
  textures can be created as follows:
  - by specifying the flag CUDA_ARRAY3D_LAYERED when creating the CUDA array
    using the driver API;
  - by specifying the flag cudaArrayLayered when creating the CUDA array
    using the runtime API.
  Kernels can access any texel from any particular layer using a new set of
  intrinsics of the following form:
  - tex1DLayered(texref, float x, int layer)
  - tex2DLayered(texref, float x, float y, int layer)
  Note that in a 2D layered texture, no filtering is performed between
  layers; that is, there is no trilinear filtering as there is for 3D
  textures. Similarly, for a 1D layered texture, there is no bilinear
  filtering as there is for 2D textures.
  The second argument in the template for texture references now means
  "texture type" instead of "dim"; that is, instead of
  texture<DataType, dim, readMode> it is texture<DataType, textureType,
  readMode>. The textureType argument can be one of the following #defines:
      #define cudaTextureType1D        0x01
      #define cudaTextureType2D        0x02
      #define cudaTextureType3D        0x03
      #define cudaTextureType1DLayered 0xF1
      #define cudaTextureType2DLayered 0xF2
  Backward compatibility for the existing 1D, 2D, and 3D textures is
  maintained by aliasing the corresponding #defines to their "dim" values.
  As a result, sample texture references would look like:
      texture<float, cudaTextureType3D>        texRef3D;
      texture<float, cudaTextureType1DLayered> texRef1DLayered;
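  The following sketch creates and reads a 2D layered texture through the
  runtime API (width, height, numLayers, and the omitted error checking are
  assumptions):

      texture<float, cudaTextureType2DLayered> texLayered;  // file scope

      __global__ void readLayer(float *out, int layer)
      {
          // No filtering is performed between layers.
          out[threadIdx.x] =
              tex2DLayered(texLayered, threadIdx.x + 0.5f, 0.5f, layer);
      }

      // Host side: allocate the layered array and bind it.
      int width = 256, height = 256, numLayers = 4;
      cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
      cudaArray *arr;
      cudaMalloc3DArray(&arr, &desc,
                        make_cudaExtent(width, height, numLayers),
                        cudaArrayLayered);
      cudaBindTextureToArray(texLayered, arr, desc);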
* This version has a new launch API called cuLaunchKernel, which offers
  several improvements over the previous launch APIs:
  1) All function state associated with a launch is specified in one API
     call. This makes multithreaded launching of kernels feasible.
  2) Support for 3D grid launches on hardware that supports them.
  3) Kernel parameter passing can be done either via an easy-to-use mode, in
     which the addresses of the parameters are passed in and the driver
     takes care of packing the parameters together, or via an expert mode
     (much like cuParamSetv) in which all parameters are pre-packed by the
     application in one chunk.
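  The sketch below shows the easy parameter-passing mode; hModule, the
  kernel name "myKernel", and the device allocation behind dPtr are assumed
  to have been set up already:

      CUfunction f;
      cuModuleGetFunction(&f, hModule, "myKernel");

      int n = 1024;
      CUdeviceptr dPtr;                 // assumed allocated with cuMemAlloc
      void *params[] = { &dPtr, &n };   // addresses of the kernel parameters

      // Grid and block dimensions, shared memory size, stream, and
      // parameters are all specified in a single call.
      cuLaunchKernel(f,
                     256, 1, 1,         // gridDimX, gridDimY, gridDimZ
                     128, 1, 1,         // blockDimX, blockDimY, blockDimZ
                     0, NULL,           // sharedMemBytes, hStream
                     params, NULL);     // kernelParams, extra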
* Added a mechanism for registering system memory for DMA.

CUDA Compiler Features:
------------------------

* Among the new features added in the CUDA 4.0 compiler are:
  - Support for inline PTX: much like an __asm__ directive, PTX can now be
    inlined within CUDA C/C++ code (see the sketch below).
  - Support for driver-loadable fatbins: fatbin files can contain multiple
    PTX and cubin files targeted at different architectures. In previous
    releases, only applications that used the runtime API were able to use
    fatbin files. With CUDA 4.0, driver API applications can use them too.
  For more details on these features, please consult the nvcc documentation
  (nvcc.pdf) that comes with the release.
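  As a small sketch of the inline PTX feature, the device function below
  reads the %laneid special register (the surrounding kernel is left out):

      __device__ unsigned int laneid(void)
      {
          unsigned int id;
          asm("mov.u32 %0, %%laneid;" : "=r"(id));  // inline PTX instruction
          return id;
      }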
* Starting with the CUDA 4.0 release, the compiler implements enhanced error
  checks for function calls. The compiler checks that the calling function
  and the called function have compatible __host__, __device__, and
  __global__ attributes. The compatibility rules for calls between functions
  with such attributes are documented in the CUDA Programming Guide. If the
  compiler detects an incompatible call, it will generate error or warning
  messages. Warnings may be turned into errors in a future release, and
  additional error checks may be implemented in a future release. It is
  recommended that users modify the calling function or the called function
  to ensure compatibility with the function call restrictions documented in
  the CUDA Programming Guide.

CUDA Libraries Features:
------------------------

* The CUBLAS Library now supports a new API that is thread-safe and allows
  the application to more easily take advantage of parallelism using streams
  (especially for functions with scalar return parameters). Because this new
  API is thread-safe, the CUBLAS library works cleanly with applications
  that use the new multi-threading features of the CUDA Runtime Library
  (CUDART) in the CUDA Toolkit v4.0. The legacy CUBLAS API is still
  supported, but it is not thread-safe and does not offer as many
  opportunities for parallelism with streams as the new API. Existing
  applications that use CUBLAS should work without any changes to the
  existing code; however, they now need to explicitly link to the CUDART
  dynamic library during compilation. Note that this link requirement was
  not necessary with previous versions of CUBLAS if the application used
  only CUBLAS entry points (and hence no explicit CUDART entry points). We
  recommend that new applications use the new API, and that existing
  applications that need maximum stream parallelism or correct operation in
  a multi-threaded scenario be converted to the new API. The documentation
  in doc/CUBLAS_Library.pdf has been rewritten to focus on the new API; some
  treatment of the legacy API is still included.
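  A minimal sketch of the new handle-based API (the dimensions m, n, k, the
  leading dimensions, the stream, and the device pointers d_A, d_B, d_C are
  assumptions): scalars are passed by reference, and a stream may be
  attached to the handle:

      #include <cublas_v2.h>

      cublasHandle_t handle;
      cublasCreate(&handle);
      cublasSetStream(handle, stream);   // optional: execute in a stream

      const float alpha = 1.0f, beta = 0.0f;
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                  &alpha, d_A, lda, d_B, ldb, &beta, d_C, ldc);

      cublasDestroy(handle);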
* The TRMM routines in the CUBLAS Library can now selectively operate either
  out-of-place or in-place (the traditional BLAS interface operates only
  in-place). The out-of-place option, which is new in this release, offers a
  significant speedup, up to 3x, on the Fermi architecture compared to the
  previous release, and a modest speedup on the Tesla architecture compared
  to the previous release. In general, as the input matrix sizes get larger,
  the performance of the TRMM routines operating out-of-place can now
  approach the performance of the corresponding raw GEMM routines.

* Added the cublasGetVersion() function to the CUBLAS Library.

* Added the cufftGetVersion() function to the CUFFT Library.

* In the previous version of the CUFFT Library, the "Bluestein" or "chirp"
  FFT algorithm was used to accelerate transforms for sizes that cannot be
  factored into a combination of powers of 2, 3, 5, or 7 for 1D transforms
  only. This release employs the Bluestein algorithm to accelerate 2D and 3D
  transforms as well.

* The CUFFT Library APIs now support multiple batches for all 1D, 2D, and 3D
  transforms. The previous release had limited support for multiple batches
  for 2D and 3D transforms.

* In this version of the CUDA Toolkit (v4.0 RC), the CUFFT Library supports
  more complicated input and output data layouts as a Beta feature, via the
  advanced data layout parameters inembed, istride, idist, onembed, ostride,
  and odist, as accepted by the cufftPlanMany() API. In this release, these
  parameters are supported only for complex-to-complex (C2C) transforms.
  This feature allows transforming a subset of an input array, or outputting
  to only a portion of a larger data structure. If the user sets inembed or
  onembed to NULL, the CUFFT Library functions as it did in previous
  releases: it assumes a basic data layout and ignores the other advanced
  parameters. If the user intends to use the advanced parameters, all of the
  advanced interface parameters must be specified correctly. The advanced
  parameters are defined in units of the relevant data type (cufftReal,
  cufftDoubleReal, cuComplex, or cuDoubleComplex).
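  The sketch below uses the advanced layout to transform a 256-point signal
  embedded at the start of each 512-element row of a larger array, for 64
  batches (the device buffers d_in and d_out are assumptions):

      cufftHandle plan;
      int n[1]       = { 256 };   // logical size of each 1D transform
      int inembed[1] = { 512 };   // storage size of each containing row
      int onembed[1] = { 512 };

      cufftPlanMany(&plan, 1, n,
                    inembed, 1, 512,    // istride, idist
                    onembed, 1, 512,    // ostride, odist
                    CUFFT_C2C, 64);     // transform type, batch count
      cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);
      cufftDestroy(plan);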
* The CUSPARSE library now provides a solver for triangular sparse linear
  systems, via the cusparse*csrsv_analysis() and cusparse*csrsv_solve()
  APIs. Refer to CUSPARSE_Library.pdf for detailed usage information.

* The cusparse*csrmv() and cusparse*csrmm() routines in the CUSPARSE library
  now support symmetric (CUSPARSE_MATRIX_TYPE_SYMMETRIC) and Hermitian
  (CUSPARSE_MATRIX_TYPE_HERMITIAN) matrix types.

* Renamed cudaDeviceBlockingSync to cudaDeviceScheduleBlockingSync.

* The cospi() routine has been added for single-precision and
  double-precision floating-point datatypes. The function cospi(x)
  implements cos(x * PI). No special include file is required to access this
  routine. Note: the sinpi() routine was already available in previous
  releases.

* In previous releases of the CUDA Toolkit, the CUBLAS and CUSPARSE
  libraries included compiled kernel PTX and compiled kernel binaries for
  compute capabilities 1.0, 1.3, and 2.0. Starting with this release, the
  compiled kernel PTX is shipped only for the highest supported compute
  capability (i.e., 2.0 for this release). This results in a significant
  reduction in file size for the CUBLAS and CUSPARSE dynamically linked
  libraries on all platforms. Note: there is no change to the compiled
  kernel binaries.

* The CURAND library now supports generation of double-precision
  floating-point Sobol' quasi-random sequences with 53 bits of randomness,
  as well as 64-bit integer Sobol' quasi-random sequences. These are
  accessed via the CURAND_RNG_QUASI_SOBOL64 and
  CURAND_RNG_QUASI_SCRAMBLED_SOBOL64 generator types in the host API and the
  curandStateSobol64_t and curandStateScrambledSobol64_t generator
  structures in the device API.

* The CURAND library now supports generation of log-normally distributed
  random numbers, via the curandGenerateLogNormal() and
  curandGenerateLogNormalDouble() host API functions and the
  curand_log_normal(), curand_log_normal2(), curand_log_normal_double(), and
  curand_log_normal2_double() device API functions.

* The CURAND library now supports generation of scrambled Sobol'
  quasi-random numbers, via the CURAND_RNG_QUASI_SCRAMBLED_SOBOL32 and
  CURAND_RNG_QUASI_SCRAMBLED_SOBOL64 generator types in the host API and the
  curandStateScrambledSobol32_t and curandStateScrambledSobol64_t generator
  structures in the device API.

* The CURAND library documentation (doc/CURAND_Library.pdf) now contains a
  summary and selected detailed results of the statistical quality tests run
  against the generators provided by CURAND.

* Beginning with this release, the NVIDIA Performance Primitives (NPP)
  library is included directly within the CUDA Toolkit. Currently, the NPP
  library supports a variety of basic signal and image processing primitives
  that are optimized across the range of CUDA-capable GPUs. Documentation is
  found at doc/NPP_Library.pdf, and the public header file is at
  include/npp.h.

* Added a complete set of arithmetic and logical signal processing
  primitives to NPP.

* NPP has added Beta support for asynchronous operation using CUDA streams,
  via the nppSetStream() and nppGetStream() functions. This feature is
  provided in an early form in this release and will be provided in a
  non-Beta, fully tested form in a future release.

* The Thrust CUDA library is now included with the CUDA Toolkit in the
  /include/thrust directory. A "Quick Start" document is available at
  doc/Thrust_Quick_Start_Guide.pdf. Additionally, several code samples in
  the NVIDIA GPU Computing SDK now employ Thrust. The Thrust library source
  code, additional detailed documentation, example programs, and a
  discussion group will continue to be available at the project's original
  home at http://code.google.com/p/thrust/.

* This version of Thrust introduces discard_iterator, an output iterator
  that ignores values assigned to it. discard_iterator is useful for
  discarding unnecessary output from algorithms with multiple output ranges
  (such as reduce_by_key), and for measuring in advance the total size of
  the result of algorithms that produce variably-sized output (such as
  set_intersection); see the sketch below.

* The Thrust library now provides set operations for sorted ranges,
  including union, difference, and symmetric difference. These new
  operations are exposed via thrust/set_operations.h.
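  As a short sketch of discard_iterator, the call below runs reduce_by_key
  and throws away its keys output, keeping only the reduced sums (the size N
  and the code that fills the vectors are assumptions):

      #include <thrust/device_vector.h>
      #include <thrust/reduce.h>
      #include <thrust/iterator/discard_iterator.h>

      thrust::device_vector<int> keys(N), vals(N), sums(N);
      // ... fill keys (sorted) and vals ...

      thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                            thrust::make_discard_iterator(), // drop keys out
                            sums.begin());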
---------------------------------------
Performance
---------------------------------------

* The performance of transforms in the CUFFT library whose sizes are pure
  powers of 3, 5, and 7 has been optimized significantly in this release,
  especially for double precision.

* In version 3.2 of CUSPARSE, the csrmv() and csrmm() functions ran slower
  when the beta parameter was 0 than when it was 1. In this version, that
  performance variation has been removed, and csrmv() and csrmm() should run
  slightly faster when beta is 0.

* The GEMV routines in the CUBLAS library, for all datatypes, have been
  significantly optimized for the case in which the input matrix A is
  transposed. Performance has improved by up to 2x, especially when the
  input matrix A is large. The performance improvements apply to both the
  Tesla (GT200) and Fermi (GF100) architectures.

* The performance of the TRSM routines in the CUBLAS library for large
  matrices has been significantly improved on Fermi and Tesla architecture
  platforms.

* The performance of the double-precision hyperbolic sine function, sinh(),
  has been improved significantly on GF100 (Fermi architecture) and GT200
  (Tesla architecture). The exact improvement achieved by end applications
  using sinh will vary based on the specific characteristics of each
  application.

* Improved the performance of CUFFT on R2C and C2R transforms whose input
  data size along the X (least significant) dimension is a multiple of 2 but
  not a multiple of 4. In the previous release, performance was much better
  when this size was a full multiple of 4; now both cases should run at the
  same, higher performance.

* The performance of double-precision floating-point division on the Fermi
  architecture has been significantly optimized for the
  round-to-nearest-even case, which is the default rounding mode employed by
  the '/' operator in CUDA C device code. The round-to-nearest-even mode can
  be explicitly employed in CUDA using the __ddiv_rn() intrinsic. The exact
  improvement achieved by end applications that perform double-precision
  divides will vary based on the specific characteristics of each
  application.

* CURAND supports a new ordering technique for pseudo-random generators
  (CURAND_ORDERING_PSEUDO_SEEDED) that significantly reduces state setup
  time. However, since this ordering technique uses a different starting
  seed for each thread on the device, it may result in statistical
  weaknesses of the pseudorandom output for some user seed values.

* The performance of the SYR2K and HER2K routines in the CUBLAS library has
  been optimized for the Fermi architecture.

* The SYMM and HEMM routines in CUBLAS have been significantly optimized for
  the Fermi architecture. For instance, in some cases there is a 3x
  performance improvement over the previous version of these routines, for
  both single and double precision.

* The performance of the double-precision reciprocal square-root function,
  rsqrt(), has been improved significantly on GT200 (the Tesla architecture)
  and GF100 (the Fermi architecture). The exact improvement achieved by end
  applications that use rsqrt will vary based on the specific
  characteristics of each application.

* The performance and accuracy of the double-precision erfc() function have
  been improved. This function is now accurate to 4 ulps, and its
  performance has improved significantly on both the Tesla and Fermi
  architectures. The exact improvement achieved by end applications that use
  erfc will vary based on the specific characteristics of each application.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Known Issues
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

* GPUs without a display attached are not subject to the 2-second runtime
  restriction. For this reason, it is recommended that CUDA be run on a GPU
  that is NOT attached to a display and does not have the Windows desktop
  extended onto it. In this case, the system must contain at least one
  NVIDIA GPU that serves as the primary graphics adapter. Thus, for devices
  like the S1070 that do not have an attached display, users may disable the
  Windows TDR timeout. Disabling the TDR timeout allows kernels to run for
  extended periods of time without triggering an error. The following is an
  example .reg script:

      Windows Registry Editor Version 5.00

      [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
      "TdrLevel"=dword:00000000

* The header file search locations, and the order in which they are visited,
  have been revised.
  Until CUDA 3.2, nvcc searched the following locations, in order:
    (1) the toolkit include paths,
    (2) the current working directory,
    (3) the paths specified with -I,
    (4) the paths specified with -isystem, and
    (5) the system include paths.
  Header files in the toolkit include paths could not be overridden, because
  the toolkit include paths were always visited first.
  From CUDA 4.0, nvcc searches the include paths in the following order:
    (1) the paths specified with -I,
    (2) the toolkit include paths,
    (3) the paths specified with -isystem, and
    (4) the system include paths.
  The current working directory is no longer added to the include paths by
  default, adhering to the C/C++ compiler convention. That is, to add the
  current working directory to the include search paths, -I. or -isystem.
  must be given to nvcc, depending on the desired search order.
  Alternatively, #include directives can be used in the quoted form, instead
  of the angle-bracket form, to include header files in the current working
  directory.

* A CUDA program may not compile correctly if a type or typedef 'T' is
  private to a class or a structure, and at least one of the following is
  satisfied:
  - 'T' is a parameter type for a __global__ function.
  - 'T' is an argument type for a template instantiation of a __global__
    function.
  This restriction will be fixed in a future release.

* (Windows) Structure and union types with bit fields may not work correctly
  in device code on the Windows platform. In addition:
  - Transferring variables that contain such types from host to device, or
    from device to host, may not work correctly.
  - Use of variables with such types in device code may not work correctly.
  This issue will be addressed in a future release.

* When compiling code that uses thrust::reduce, cudafe generates uses of
  private typedefs, which runs into the private-typedef restriction
  described above.

* For certain configurations, the CUFFT Library will produce slightly
  different results for the same input when ECC is on versus when ECC is
  off, even on the same architecture. Note: in both cases the results are
  mathematically within the expected tolerance. The difference arises from
  optimizations, specific to the ECC-on and ECC-off cases, that result in
  slightly different factorizations of the overall transform into smaller
  radixes.

* The CUFFT library is not thread-safe, and hence cannot be accessed
  concurrently from multiple threads in the same process. This will be fixed
  in a future release.

* In the NPP Library, the Rotate primitives incorrectly set pixel values
  inside the destination ROI to 0 (black) if there is no pixel value from
  the source image that corresponds to a particular destination pixel. The
  correct behavior would be to leave those destination pixels untouched so
  that they stay at the original background color.

* When a program is terminated while waiting on a breakpoint, the system
  needs to be rebooted. This affects the TCC driver on Windows Vista and
  Windows 7: there is a known driver bug when debugging CUDA applications
  that use TCC whereby, if the application terminates while paused at a GPU
  breakpoint, internal driver state can be corrupted. Until the system is
  rebooted, further attempts to create CUDA contexts will enter an infinite
  loop during cuCtxCreate().

* GPU enumeration order on multi-GPU systems is non-deterministic and may
  change with this or future releases. Users should make sure to enumerate
  all CUDA-capable GPUs in the system and select the most appropriate one(s)
  to use.

Vista, Server 2008 and Windows 7 related:
-----------------------------------------

* In order to run CUDA on a non-Tesla GPU, either the Windows desktop must
  be extended onto the GPU, or the GPU must be selected as the PhysX GPU.

* Individual kernels are limited to a 2-second runtime by Windows Vista.
  Kernels that run for longer than 2 seconds will trigger the Timeout
  Detection and Recovery (TDR) mechanism. For more information, see
  http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx.

* The CUDA Profiler does not support performance counter events on Windows
  Vista. All profiler configuration regarding performance counter events is
  ignored.

* The maximum size of a single allocation created by cudaMalloc or
  cuMemAlloc is limited to:
      MIN( (System Memory Size in MB - 512 MB) / 2, PAGING_BUFFER_SEGMENT_SIZE )
  For Vista, PAGING_BUFFER_SEGMENT_SIZE is approximately 2 GB.

* The OS may impose artificial limits on the amount of memory you can
  allocate using the CUDA APIs, for both system and video memory. In many
  cases these limits are significantly less than the size of physical system
  and video memory, but there are exceptions that make it difficult to
  quantify the expected behavior for a particular application.

XP, Vista, Server 2008 and Windows 7 related:
---------------------------------------------

* Applications that try to use too much memory may cause a CUDA memcopy or
  kernel to fail with the error CUDA_ERROR_OUT_OF_MEMORY. If this happens,
  the CUDA context is placed into an error state and must be destroyed and
  recreated if the application wants to continue using CUDA.

* malloc may fail due to running out of virtual memory space. This address
  space limitation is fixed by a Microsoft-issued hotfix; if this is an
  issue, please install the patch located at
  http://support.microsoft.com/kb/940105. Windows Vista SP1 includes this
  hotfix.

* When compiling a source file that includes vector_types.h with the
  Microsoft compiler on a 32-bit Windows system, the 16-byte aligned vector
  types are not properly aligned at 16 bytes.

XP related:
-----------

* OpenGL interoperability:
  - OpenGL cannot access a buffer that is currently *mapped*. If the buffer
    is registered but not mapped, OpenGL can perform any requested
    operations on the buffer.
  - Deleting a buffer while it is mapped for CUDA results in undefined
    behavior.
  - Attempting to map or unmap a buffer while a context other than the one
    that was current during the buffer register operation is bound will
    generally result in a program error and should thus be avoided.
  - Interoperability will use a software path on SLI.
  - Interoperability will use a software path if monitors are attached to
    multiple GPUs and a single desktop spans more than one GPU (i.e., WinXP
    dualview).

* OpenCL program binary formats may change in this or future releases. Users
  should create programs from source and should not rely on compatibility of
  generated binaries between different versions of the driver.

* (Windows and Linux) Individual GPU program launches are limited to a run
  time of less than 5 seconds on a GPU with a display attached. Exceeding
  this time limit usually causes a launch failure reported through the CUDA
  driver or the CUDA runtime. GPUs without a display attached are not
  subject to the 5-second runtime restriction. For this reason, it is
  recommended that CUDA be run on a GPU that is NOT attached to a display
  and does not have the Windows desktop extended onto it. In this case, the
  system must contain at least one NVIDIA GPU that serves as the primary
  graphics adapter.

* (Windows and Linux) It is a known issue that cudaThreadExit() may not be
  called implicitly on host thread exit. Due to this, developers are
  recommended to explicitly call cudaThreadExit() while the issue is being
  resolved.

* For maximum performance when using multiple byte sizes to access the same
  data, coalesce adjacent loads and stores when possible, rather than using
  a union or individual byte accesses. Accessing the data via a union may
  result in the compiler reserving extra memory for the object, and
  accessing the data as individual bytes may result in non-coalesced
  accesses. This will be improved in a future compiler release.

Linux Only:
-----------

* (Linux) There is a known bug in ICC with respect to passing 16-byte
  aligned types by value to GCC-built code such as the CUDA Toolkit
  libraries (e.g., CUBLAS). At this time, passing a double2, a
  cuDoubleComplex, or any other 16-byte aligned type by value from ICC-built
  code to GCC-built code will pass incorrect data. Intel has been informed
  of this bug. As a workaround, a GCC-built wrapper function that accepts
  the data by reference from the ICC-built code can be linked with the
  ICC-built code; the GCC-built wrapper can then, in turn, pass the data by
  value to the CUDA Toolkit libraries.

* In order to run CUDA applications, the CUDA kernel module must be loaded
  and the entries in /dev created. This may be achieved by initializing X
  Windows, or by creating a script to load the kernel module and create the
  entries. An example script (to be run at boot time):

      #!/bin/bash
      /sbin/modprobe nvidia
      if [ "$?" -eq 0 ]; then
          # Count the number of NVIDIA controllers found.
          N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
          NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`
          N=`expr $N3D + $NVGA - 1`
          for i in `seq 0 $N`; do
              mknod -m 666 /dev/nvidia$i c 195 $i
          done
          mknod -m 666 /dev/nvidiactl c 195 255
      else
          exit 1
      fi

Linux and Mac:
--------------

* When compiling with GCC, special care must be taken for structs that
  contain 64-bit integers. This is because GCC aligns long longs to a 4-byte
  boundary by default, while nvcc aligns them to an 8-byte boundary by
  default. Thus, when using GCC to compile a file that has such a struct or
  union, users must give the -malign-double option to GCC. When using nvcc,
  this option is automatically passed to GCC.

* It is a known issue that cudaThreadExit() may not be called implicitly on
  host thread exit. Due to this, developers are recommended to explicitly
  call cudaThreadExit() while the issue is being resolved.

Mac Only:
---------

* OpenGL interop will always use a software path, leading to reduced
  performance when compared to interop on other platforms.

* CUDA kernels that do not terminate, or that run without interruption for
  several tens of seconds, may trigger the GPU to reset, causing a
  disruption of any attached displays. The display image may become
  corrupted; the corruption will disappear upon a reboot.

* The kernel driver may leak wired (i.e., unpageable) memory if CUDA
  applications terminate in unexpected ways. Continued leaks will lead to
  severely degraded system performance and require a reboot to fix.

* On systems with multiple GPUs installed, or systems with multiple monitors
  connected to a single GPU, OpenGL interoperability always copies shared
  buffers through host memory.

* Current hardware limits the number of asynchronous memcopies that can be
  overlapped with kernel execution. Overlap is also limited to kernels
  executing for less than 1 second. These limitations are expected to
  improve on future hardware.

* The following APIs exhibit high CPU utilization if they wait on the
  hardware for a significant amount of time. As a workaround, apps may use
  cu(da)StreamQuery and/or cu(da)EventQuery to check whether the GPU is busy
  and yield the thread as desired.
  - cuCtxSynchronize
  - cuEventSynchronize
  - cuStreamSynchronize
  - cudaThreadSynchronize
  - cudaEventSynchronize
  - cudaStreamSynchronize

* When the profiler gathers performance signals on G80-based products, the
  driver reduces the clock rate on the device. If the CUDA app crashes or
  otherwise exits uncleanly, the clocks are not reset to their previous
  values, and the system must be rebooted to restore the original clock
  rate.

* The MacBook Pro currently presents both GPUs as available for use in
  Performance mode. This is incorrect behavior, as only one GPU is available
  at a time. CUDA applications that try to run on the second GPU (device ID
  1) will potentially hang. Such a hang may be terminated by pressing Ctrl-C
  or by closing the offending application.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Resolved Issues
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The following known issues that were published in the CUDA Toolkit 3.2
release notes and errata documents have been fixed:

Mac Related:
------------

* To save power, some Apple products automatically power down the
  CUDA-capable GPU in the system. If the operating system has powered down
  the CUDA-capable GPU, CUDA fails to run and the system returns an error
  that no device was found. To ensure that your CUDA-capable GPU is not
  powered down by the operating system, do the following:
  1. Go to "System Preferences".
  2. Open the "Energy Saver" section.
  3. Un-check the "Automatic graphics switching" check box in the upper
     left.

* The following issue described in the previous version has been fixed in
  CUDA Toolkit 4.0: on Mac OS only, the NVIDIA C compiler (nvcc) handled
  size_t incorrectly during 64-bit compilation. The version of nvcc included
  with CUDA Toolkit 3.2 fails to handle variables of type size_t as an
  8-byte entity in PTX when compiling 64-bit device code. To address this
  issue, NVIDIA released a patch that updates components of nvcc. The patch
  is available as "CUDA Toolkit: GFEC Patch for MacOS" from the following
  location: http://developer.nvidia.com/object/cuda_3_2_downloads.html.
  Please refer to the additional information and installation instructions
  in the README file distributed with the patch.

* The following issue reported in the previous version has been fixed in
  CUDA Toolkit 4.0: in CUBLAS 3.2, the GEMM, SYRK, and HERK routines for
  Fermi GPUs could enter an infinite recursion, leading to an application
  crash, for certain input sizes meeting the criteria below. To work around
  this problem, the input to CUBLAS had to be recursively subdivided until
  the individual calls to these CUBLAS routines did not match these
  criteria. Given threshold size T, where T equals 2^27 - 512 (i.e.,
  134217216), the crash might be seen in any of the following circumstances:
  1) A is not transposed, lda * k >= T, and T is divisible by lda.
  2) B is not transposed, ldb * n >= T, T is divisible by n, and n is
     divisible by 32.
  3) A is transposed, lda * m >= T, T is divisible by m, and m is divisible
     by 32.
  4) B is transposed, ldb * k >= T, and T is divisible by ldb.

* In the previous release of the CUBLAS Library, the cublasDgemm() routine
  produced incorrect results in some cases when k < 32 and matrix A is
  transposed. This has been fixed in this release.

* (Windows and Linux) In the previous version, the divergent_branch counter
  in the Visual Profiler reported an incorrect value (of zero) on Fermi.
  This issue has been fixed in CUDA Toolkit 4.0.

* (Windows) cudaMemcpy3D no longer ignores the src and dst position
  parameters for host memory.

* The cublasCgemm() routine in the CUBLAS library would crash in a few
  specific cases in the previous release; this has been fixed in this
  release.

* In previous releases, the nppiNormDiff_8u_C1R function in the NPP library
  returned both output values into host pointers. In this release, the
  semantics of this API function have been changed: the pointers provided
  for the two outputs are now assumed to point to device memory. There will
  be no compilation error, as the prototype of the function has not changed,
  and the program may fail silently; hence, if this function is being used,
  we recommend that the code be updated proactively.

* Improved the accuracy of the generation of normally distributed
  single-precision pseudo-random numbers in the CURAND library. The main
  observed impacts of this improvement are (1) the maximum difference
  between the results generated by a GPU generator and a HOST generator is
  much smaller for single-precision normally distributed random numbers, and
  (2) the performance of GPU random number generation is now slower than in
  the previous version for single-precision normally distributed random
  numbers.

* The Sobol' direction vectors used by the CURAND library have been updated
  using the latest Joe-Kuo file, new-joe-kuo-6.21201, obtained from
  http://web.maths.unsw.edu.au/~fkuo/sobol/. The smallest dimension with
  updated values in the new file is the 212th dimension. Therefore, the
  exact Sobol' sequences generated by CURAND may differ from the previous
  release, even for the same input parameters, if more than 211 dimensions
  are requested. The authors of the direction vectors indicate that the
  previous set of vectors was corrupted and that their use should be
  discontinued.

* The previous version of the NPP library had a bug in the nppsDiv_32s_C1R
  primitive when dividing by 0. This bug has been fixed, and the primitive
  now correctly returns NPP_MAX_32S or NPP_MIN_32S when dividing by 0.

* (Windows) In the previous version, a setup consisting of a GF100 M2070-Q
  with the R260.27 driver resulted in the SDK sample DeviceQuery not running
  when switched from the OS to a regular user account. This has been fixed
  in this version. (Operating Systems: Windows2008 Server64, WinXP-x64.)

* In the previous release of the NPP Library, the nppiMinMax_8u_C1R()
  function would not work in certain situations; this has been fixed in this
  release.

* For an OpenCL C program, the maximum alignment of a function-scope local
  variable and of a function parameter variable is limited to 16 bytes.

* In previous releases, the nppiMean_StdDev_8u_C1R function in the NPP
  library returned both output values into host pointers. In this release,
  the semantics of this API function have been changed: the pointers
  provided for the two outputs are now assumed to point to device memory.
  There will be no compilation error, as the prototype of the function has
  not changed, and the program may fail silently; hence, if this function is
  being used, we recommend that the code be updated proactively.

* In the previous release, the *Filter_8u_C1R functions in the NPP library
  produced incorrect results when the nSrcStep input parameter was not a
  multiple of 4. This has been corrected, and the functions now work for all
  values of nSrcStep. The exact list of impacted functions is:
  nppiFilterRow_8u_C1R, nppiFilterBox_8u_C1R, nppiFilter_8u_C1R,
  nppiFilterMax_8u_C1R, and nppiFilterMin_8u_C1R.

* In previous releases, the nppiMinMax_8u_C1R function in the NPP library
  returned both output values into host pointers. In this release, the
  semantics of this API function have been changed: the pointers provided
  for the two outputs are now assumed to point to device memory. There will
  be no compilation error, as the prototype of the function has not changed,
  and the program may fail silently; hence, if this function is being used,
  we recommend that the code be updated proactively.

* The accuracy of single-precision transforms in the CUFFT Library has been
  significantly improved, especially for larger transforms and
  multi-dimensional transforms.

* In previous versions of the CUFFT Library, for some 1D transform sizes
  larger than 32M elements, the first call to cufftExec*() would fail due to
  insufficient memory or grid size limitations. These resource limitations
  are now properly checked for and reported by cufftPlan*(), so that if
  sufficient resources are not available to execute an FFT of the requested
  size, the error is reported at plan time rather than at execution time.

* Thrust no longer supports scatter and gather directly between host and
  device memory; instead, the output needs to be staged through a temporary
  object and copied explicitly with thrust::copy().

* Thrust no longer supports operations on device_vector when the backend is
  CUDA in the absence of nvcc. Hence, operations that modify a
  device_vector's size or elements are unavailable in a .cpp file.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Source code for Open64 and cuda-gdb
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The Open64 and cuda-gdb source files are controlled under the terms of the
GPL license. Current and previously released versions are located at
ftp://download.nvidia.com/CUDAOpen64.

Linux users:

* Please refer to the "Release Notes" and "Known Issues" sections in the
  cuda-gdb User Manual (cuda-gdb.pdf).
* Please refer to cuda-memcheck.pdf for notes on supported error detection
  and known issues.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Revision History
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

02/2011 - Version 4.0 RC
11/2010 - Version 3.2
10/2010 - Version 3.2 RC2
09/2010 - Version 3.2 RC

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
More Information
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

For more information and help with CUDA, please visit
http://www.nvidia.com/cuda.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Acknowledgements
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

NVIDIA extends thanks to EM Photonics (http://www.emphotonics.com) for their
contributions to the matrix-vector multiplication functions in the CUBLAS
library incorporated into the v4.0 release.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=