nVidia CUDA Buffering use on RedHawk 5.4
November 16, 2009
John Blackwood

Contents
========

1. Introduction
2. Hardware / Software Used
3. Example Program Buffer Use
4. PDL results
   4.1 'reduction' CUDA example
       4.1.1 Without preallocated graphics buffer support
       4.1.2 With preallocated graphics buffer support
   4.2 'smokeParticles' CUDA example
       4.2.1 Without preallocated graphics buffer support
       4.2.2 With preallocated graphics buffer support
5. Summary


1. Introduction
===============

This document summarizes a very short study of graphics buffer use by
CUDA applications on RedHawk 5.4. The study used the existing CUDA
example programs as its basis.

The main focus was to examine nVidia graphics buffer usage and its
impact on a shielded CPU and the resulting process dispatch latency
(PDL) values.


2. Hardware / Software Used
===========================

The hardware used was an nVidia GeForce 8500 GT card in a system with
an AMD Athlon(tm) 64 X2 Dual Core Processor 4200+, running pre-alpha
RedHawk 5.4.


3. Example Program Buffer Use
=============================

A quick look was taken to determine approximately how many graphics
buffers were used by various example programs in the
/usr/share/doc/cuda/sdk/C/bin/linux/release directory. These examples
are provided in the ccur-nvidia-cuda RPM.

The table below shows the number of graphics buffer pages used by each
test:

    test              number of pages
    ----              ---------------
    clock                  2586
    SobelFilter            3458
    SobolQRNG              2586
    3dfd                   2586
    Mandelbrot             3458
    oceanFFT               3714
    particles              3458
    eigenvalues            2586
    radixSort              2586
    smokeParticles         3938
    reduction              2564

Note that while the CUDA examples certainly make use of a fair number
of buffers, graphical tests such as viewperf use far more: from 20,000
to 60,000 buffers, depending upon the type of graphics card being used.


4. PDL results
==============

On a 2 cpu system, one cpu was shielded from the CUDA activity, with
just the RCIM interrupt routed to the shielded CPU:

    > /usr/bin/cpu
    cpu  chip core ht   ht-siblings state shielding
    ---- ---- ---- ---- ----------- ----- --------------
    0    0    0    -    -           up    proc irq ltmr
    1    0    1    -    -           up    none

    rcim interrupt on cpu 1:

    > cat /proc/irq/19/smp_affinity
    1 user 1 actual

The CUDA test was executed on the non-shielded CPU, with our standard
test/pdl/pdltest executing on the shielded CPU.

Each of the following two CUDA tests was executed twice: once with a
RedHawk trace kernel that had no preallocated graphics buffer support
enabled, and once with a standard RedHawk trace kernel with graphics
buffer support.

All tests were run in an i386 32-bit environment.

4.1 'reduction' CUDA example

The CUDA 'reduction' example program was executed first. This test was
run in a loop on the non-shielded cpu while pdltest was executing for
60 seconds on the shielded cpu.

Note that this test uses the nVidia GPU for calculations only; it
produces no visual/graphical display.

4.1.1 Without preallocated graphics buffer support

Max PDL is 105.0 microseconds.

Summary:
PDL (60000 samples): 8.0, 12.0, 105.0 uS min/avg/max

PDL Histogram:
      0 ..  10 : 25948 **********************
     11 ..  20 : 32621 ****************************
     21 ..  30 :   600 *
     31 ..  40 :   703 *
     41 ..  50 :    45 *
     51 ..  60 :    12 *
     61 ..  70 :    25 *
     71 ..  80 :    33 *
     81 ..  90 :    11 *
     91 .. 100 :     1 *
    101 .. 150 :     1 *

And the cross processor interrupt counts (in decimal) were:

    retrigger_next_event_forced          1
    do_flush_tlb_all                   271
    __cpa_flush_range              345,184
    cache_flush [nvidia]             1,741

4.1.2 With preallocated graphics buffer support

Max PDL is 17.0 microseconds.

Summary:
PDL (60000 samples): 9.0, 11.7, 17.0 uS min/avg/max

PDL Histogram:
      0 ..  10 : 26740 ***********************
     11 ..  20 : 33260 ****************************

And the cross processor interrupt counts (in decimal) were:

    retrigger_next_event_forced          1
    do_flush_tlb_all                    34

4.2 'smokeParticles' CUDA example

The smokeParticles example was also executed on both types of kernels.
In addition to using the nVidia GPU for CUDA-based computations, this
example also uses nVidia graphics as it executes.

4.2.1 Without preallocated graphics buffer support

Max PDL is 87.0 microseconds.

Summary:
PDL (60000 samples): 9.0, 12.0, 87.0 uS min/avg/max

PDL Histogram:
      0 ..  10 :  8902 ********
     11 ..  20 : 51078 *******************************************
     21 ..  30 :     4 *
     31 ..  40 :    11 *
     61 ..  70 :     3 *
     71 ..  80 :     1 *
     81 ..  90 :     1 *

And the cross processor interrupt counts (in decimal) were:

    retrigger_next_event_forced          1
    do_flush_tlb_all                     4
    __cpa_flush_range                3,982
    cache_flush [nvidia]                23

4.2.2 With preallocated graphics buffer support

Max PDL is 22.0 microseconds.

Summary:
PDL (60000 samples): 9.0, 12.1, 22.0 uS min/avg/max

PDL Histogram:
      0 ..  10 :  9167 ********
     11 ..  20 : 50831 *******************************************
     21 ..  30 :     2 *

And the cross processor interrupt counts (in decimal) were:

    retrigger_next_event_forced          3


5. Summary
==========

This small study of CUDA application buffer use shows that the existing
preallocated graphics buffer support benefits CUDA applications as well
as graphical applications: it reduces the amount of jitter/interference
on a shielded CPU when the nVidia card(s) are processing CUDA-based
requests. With preallocated buffers, the maximum PDL dropped from 105.0
to 17.0 microseconds for 'reduction' and from 87.0 to 22.0 microseconds
for 'smokeParticles', and the __cpa_flush_range and cache_flush [nvidia]
cross processor interrupts no longer appeared in the counts.
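
As a rough cross-check of the jitter reduction described in the
summary, the histograms in sections 4.1 and 4.2 can be reduced to a
count of samples that fell above 20 microseconds. The Python sketch
below is not part of the original study: the 20 microsecond threshold,
the function names and the run labels are illustrative assumptions, and
the bin counts are simply copied from the histograms in this document.

```python
# Histogram bins copied from sections 4.1.1 - 4.2.2 of this document.
# Each entry is (upper bound of the bin in microseconds, sample count).
# The run labels are illustrative, not names used by pdltest.
HISTOGRAMS = {
    "reduction, no prealloc": [
        (10, 25948), (20, 32621), (30, 600), (40, 703), (50, 45),
        (60, 12), (70, 25), (80, 33), (90, 11), (100, 1), (150, 1),
    ],
    "reduction, prealloc": [(10, 26740), (20, 33260)],
    "smokeParticles, no prealloc": [
        (10, 8902), (20, 51078), (30, 4), (40, 11),
        (70, 3), (80, 1), (90, 1),
    ],
    "smokeParticles, prealloc": [(10, 9167), (20, 50831), (30, 2)],
}


def tail_count(bins, threshold_us=20):
    """Count samples in bins that lie entirely above threshold_us."""
    return sum(count for upper, count in bins if upper > threshold_us)


def summarize(histograms, threshold_us=20):
    """Return {label: (total samples, samples above threshold, percent)}."""
    rows = {}
    for label, bins in histograms.items():
        total = sum(count for _, count in bins)
        tail = tail_count(bins, threshold_us)
        rows[label] = (total, tail, 100.0 * tail / total)
    return rows


if __name__ == "__main__":
    for label, (total, tail, pct) in summarize(HISTOGRAMS).items():
        print(f"{label:30s} {total:6d} samples, "
              f"{tail:5d} above 20 us ({pct:.3f}%)")
```

With the counts above, each run totals 60000 samples, and the number of
samples above 20 microseconds is 1431 versus 0 for 'reduction' and 20
versus 2 for 'smokeParticles' (without versus with preallocated
buffers).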