nVidia CUDA Buffering use on RedHawk 5.4
                        November 16, 2009
                          John Blackwood


  Contents
  ========
  1. Introduction
  2. Hardware / Software Used
  3. Example Program Buffer Use
  4. PDL results
  4.1 'reduction' CUDA example
  4.1.1 Without preallocated graphics buffer support
  4.1.2 With preallocated graphics buffer support
  4.2 'smokeParticles' CUDA Example
  4.2.1 Without preallocated graphics buffer support
  4.2.2 With preallocated graphics buffer support
  5.0 Summary


1. Introduction
===============

This document summarizes a brief study of graphics buffer use by
CUDA applications on RedHawk 5.4, using the existing CUDA example
programs as test cases.

The main focus was to examine nVidia graphics buffer usage, its
impact on a shielded CPU, and the resulting process dispatch
latency (PDL) values.


2. Hardware / Software Used
===========================

The hardware used was an nVidia GeForce 8500 GT card in a system with
an AMD Athlon(tm) 64 X2 Dual Core 4200+ processor, running
pre-alpha RedHawk 5.4.


3. Example Program Buffer Use
=============================

A quick survey determined approximately how many graphics buffer
pages are used by various example programs in the
/usr/share/doc/cuda/sdk/C/bin/linux/release directory.
These examples are provided in the ccur-nvidia-cuda RPM.

The table below shows the number of graphics buffer pages used by each test:

 test              number of pages
 ----              ---------------
 clock                  2586
 SobelFilter            3458
 SobolQRNG              2586
 3dfd                   2586
 Mandelbrot             3458
 oceanFFT               3714
 particles              3458
 eigenvalues            2586
 radixSort              2586
 smokeParticles         3938
 reduction              2564

Note that while the CUDA examples certainly use a fair number of
buffers, graphical tests such as viewperf use far more: from 20,000
to 60,000 buffers, depending upon the type of graphics card being used.
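
To put the page counts above in more familiar units, the following is a
minimal sketch that converts them to memory sizes. It assumes 4 KiB
pages (the standard page size on 32-bit i386); the names PAGE_SIZE_BYTES,
PAGES_USED and pages_to_mib are illustrative only and are not part of any
RedHawk or CUDA tool.

```python
# Assumption: 4 KiB pages, the standard page size on 32-bit i386.
PAGE_SIZE_BYTES = 4096

# Page counts copied from the table in this section.
PAGES_USED = {
    "clock": 2586,
    "SobelFilter": 3458,
    "SobolQRNG": 2586,
    "3dfd": 2586,
    "Mandelbrot": 3458,
    "oceanFFT": 3714,
    "particles": 3458,
    "eigenvalues": 2586,
    "radixSort": 2586,
    "smokeParticles": 3938,
    "reduction": 2564,
}


def pages_to_mib(pages, page_size=PAGE_SIZE_BYTES):
    """Convert a page count to mebibytes for the given page size."""
    return pages * page_size / (1024 * 1024)


if __name__ == "__main__":
    # Print the tests ordered by page count, smallest first.
    for name, pages in sorted(PAGES_USED.items(), key=lambda kv: kv[1]):
        print(f"{name:<16}{pages:>6} pages  {pages_to_mib(pages):6.2f} MiB")
```

Under that page-size assumption, each example program uses roughly
10 to 15 MiB of graphics buffer pages.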


4. PDL results
==============

On a two-CPU system, one CPU was shielded from the CUDA activity, with
only the RCIM interrupt routed to the shielded CPU:

 > /usr/bin/cpu
  cpu   chip core  ht   ht-siblings  state  shielding
  ----  ---- ---- ----  -----------  -----  --------------
   0     0    0    -    -            up     proc irq ltmr
   1     0    1    -    -            up     none

  rcim interrupt affinity (mask 1 = cpu 0, the shielded cpu):
  > cat /proc/irq/19/smp_affinity
  1 user 1 actual
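
The smp_affinity value is a hexadecimal CPU bitmask in which bit N
selects CPU N, so a mask of 1 selects CPU 0, the shielded CPU in the
listing above. The following minimal sketch decodes such a mask; the
function name cpus_in_mask is hypothetical, and it handles only the mask
value itself, not the "user"/"actual" labels in the output shown.

```python
def cpus_in_mask(mask_hex):
    """Return the CPU numbers selected by a hexadecimal affinity mask.

    Commas, which the kernel uses to separate 32-bit groups on large
    systems, are ignored.
    """
    mask = int(mask_hex.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


if __name__ == "__main__":
    for value in ("1", "2", "3"):
        print(f"mask {value} -> CPUs {cpus_in_mask(value)}")
```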


The CUDA test was executed on the non-shielded CPU, with
our standard test/pdl/pdltest executing on the shielded CPU.

Each of the following two CUDA tests was executed twice: once with a
RedHawk trace kernel that had no preallocated graphics buffer support
enabled, and once with a standard RedHawk trace kernel with preallocated
graphics buffer support.

All tests were run in a 32-bit i386 environment.


4.1 'reduction' CUDA example

The CUDA 'reduction' example program was executed first.
It was run in a loop on the non-shielded CPU while pdltest
executed for 60 seconds on the shielded CPU.

Note that this test uses the nVidia GPU only for calculations;
it produces no visual/graphical display.

4.1.1 Without preallocated graphics buffer support

  Max PDL is 105.0 microseconds.

  Summary: PDL (60000 samples):     8.0,    12.0,   105.0 uS min/avg/max
  PDL Histogram:
      0 .. 10   :  25948 **********************
     11 .. 20   :  32621 ****************************
     21 .. 30   :    600 *
     31 .. 40   :    703 *
     41 .. 50   :     45 *
     51 .. 60   :     12 *
     61 .. 70   :     25 *
     71 .. 80   :     33 *
     81 .. 90   :     11 *
     91 .. 100  :      1 *
    101 .. 150  :      1 *

The cross-processor interrupt counts (in decimal) were:

  retrigger_next_event_forced         1
  do_flush_tlb_all                  271
  __cpa_flush_range             345,184
  cache_flush  [nvidia]           1,741
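
To quantify the tail of the 4.1.1 histogram, the following minimal
sketch parses the bucket lines and counts the samples that fall above
20 microseconds. The histogram text is copied from 4.1.1; the parser and
the names parse_histogram and samples_above are illustrative and are not
part of pdltest.

```python
import re

# Histogram lines copied from section 4.1.1.
HISTOGRAM_4_1_1 = """\
      0 .. 10   :  25948 **********************
     11 .. 20   :  32621 ****************************
     21 .. 30   :    600 *
     31 .. 40   :    703 *
     41 .. 50   :     45 *
     51 .. 60   :     12 *
     61 .. 70   :     25 *
     71 .. 80   :     33 *
     81 .. 90   :     11 *
     91 .. 100  :      1 *
    101 .. 150  :      1 *
"""

# Matches "<low> .. <high> : <count>", ignoring the trailing bar of stars.
BUCKET_RE = re.compile(r"^\s*(\d+)\s*\.\.\s*(\d+)\s*:\s*(\d+)")


def parse_histogram(text):
    """Return a list of (low_us, high_us, count) tuples, one per bucket."""
    buckets = []
    for line in text.splitlines():
        match = BUCKET_RE.match(line)
        if match:
            buckets.append(tuple(int(group) for group in match.groups()))
    return buckets


def samples_above(buckets, threshold_us):
    """Count samples in buckets whose lower bound exceeds threshold_us."""
    return sum(count for low, _high, count in buckets if low > threshold_us)


if __name__ == "__main__":
    buckets = parse_histogram(HISTOGRAM_4_1_1)
    total = sum(count for _low, _high, count in buckets)
    tail = samples_above(buckets, 20)
    print(f"{tail} of {total} samples above 20 uS ({100.0 * tail / total:.2f}%)")
```

The same functions can be applied to the histograms in the other
subsections to compare their tails.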


4.1.2 With preallocated graphics buffer support

  Max PDL is 17.0 microseconds.

  Summary: PDL (60000 samples):     9.0,    11.7,    17.0 uS min/avg/max
  PDL Histogram:
      0 .. 10   :  26740 ***********************
     11 .. 20   :  33260 ****************************


The cross-processor interrupt counts (in decimal) were:

  retrigger_next_event_forced  1
  do_flush_tlb_all            34


4.2 'smokeParticles' CUDA Example

The smokeParticles example was also executed on both types of
kernels.  In addition to using the nVidia GPU for CUDA-based
computations, this example also uses nVidia graphics as it
executes.

4.2.1 Without preallocated graphics buffer support

  Max PDL is 87.0 microseconds.

  Summary: PDL (60000 samples):     9.0,    12.0,    87.0 uS min/avg/max
  PDL Histogram:
      0 .. 10   :   8902 ********
     11 .. 20   :  51078 *******************************************
     21 .. 30   :      4 *
     31 .. 40   :     11 *
     61 .. 70   :      3 *
     71 .. 80   :      1 *
     81 .. 90   :      1 *

The cross-processor interrupt counts (in decimal) were:

  retrigger_next_event_forced     1
  do_flush_tlb_all                4
  __cpa_flush_range           3,982
  cache_flush  [nvidia]          23


4.2.2 With preallocated graphics buffer support

  Max PDL is 22.0 microseconds.

  Summary: PDL (60000 samples):     9.0,    12.1,    22.0 uS min/avg/max
  PDL Histogram:
      0 .. 10   :   9167 ********
     11 .. 20   :  50831 *******************************************
     21 .. 30   :      2 *

The cross-processor interrupt counts (in decimal) were:

  retrigger_next_event_forced   3


5.0 Summary
===========

This small study of CUDA application buffer use shows that the existing
preallocated graphics buffer support benefits CUDA applications as well
as graphical applications: it reduces the jitter/interference on a
shielded CPU while the nVidia card(s) are processing CUDA-based requests.
With preallocated buffers, the maximum PDL fell from 105.0 to 17.0
microseconds for 'reduction' and from 87.0 to 22.0 microseconds for
'smokeParticles'.
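
The effect can be expressed numerically from the figures reported in
sections 4.1 and 4.2. The following minimal sketch computes the
percentage reduction in maximum PDL and the total cross-processor
interrupt counts for each test. The RESULTS layout and the function
names are illustrative; only the numbers come from this study.

```python
# Values copied from sections 4.1 and 4.2. "ipis" lists the individual
# cross-processor interrupt counts reported for each run.
RESULTS = {
    "reduction": {
        "without": {"max_pdl_us": 105.0, "ipis": [1, 271, 345184, 1741]},
        "with": {"max_pdl_us": 17.0, "ipis": [1, 34]},
    },
    "smokeParticles": {
        "without": {"max_pdl_us": 87.0, "ipis": [1, 4, 3982, 23]},
        "with": {"max_pdl_us": 22.0, "ipis": [3]},
    },
}


def percent_reduction(before, after):
    """Return how much smaller 'after' is than 'before', in percent."""
    return 100.0 * (before - after) / before


def summarize(results):
    """Return (test, max PDL reduction %, IPIs without, IPIs with) rows."""
    rows = []
    for test, runs in results.items():
        pdl_drop = percent_reduction(
            runs["without"]["max_pdl_us"], runs["with"]["max_pdl_us"]
        )
        rows.append(
            (test, pdl_drop, sum(runs["without"]["ipis"]), sum(runs["with"]["ipis"]))
        )
    return rows


if __name__ == "__main__":
    for test, pdl_drop, ipis_before, ipis_after in summarize(RESULTS):
        print(
            f"{test:<16} max PDL reduced by {pdl_drop:.1f}%, "
            f"cross-processor interrupts {ipis_before} -> {ipis_after}"
        )
```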