GPGPU (General Purpose Graphics Processing Unit)



A GPU is a specialized device designed to rapidly manipulate large numbers of pixels. Historically, GPUs were developed for advanced graphics and video games.

More recently, interfaces have been built that let GPUs run codes unrelated to graphics, for example linear algebra operations.

General-purpose GPU computing or GPGPU computing is the use of a GPU (graphics processing unit) to do general purpose scientific and engineering computing.

The model for GPU computing is to use a CPU and a GPU together in a heterogeneous co-processing computing model. The sequential part of the application runs on the CPU and the computationally intensive part is accelerated by the GPU. From the user’s perspective, the application simply runs faster because the compute-heavy work is offloaded to the GPU.

(courtesy of http://www.nvidia.com/object/GPU_Computing.html)



The GPU has evolved over the years to have teraflops of floating point performance.

NVIDIA revolutionized the GPGPU and accelerated computing world in 2006-2007 by introducing its new massively parallel architecture called “CUDA”.

Much of the success of GPGPUs in the past few years is due to the ease of programming offered by the associated CUDA parallel programming model. In this model, application developers modify their application to take the compute-intensive kernels and map them to the GPU, while the rest of the application remains on the CPU. Mapping a function to the GPU involves rewriting it to expose its parallelism and adding “C” keywords to move data to and from the GPU.

The Tesla GPUs at CINECA are based on the "Fermi" and "Kepler" architectures, Kepler being the more recent of the two. Both architectures are optimized for scientific applications, with key features such as 500+ gigaflops of IEEE-standard double-precision floating-point hardware support, L1 and L2 caches, ECC memory error protection, local user-managed data caches in the form of shared memory distributed throughout the GPU, and coalesced memory accesses.


The GPU resources of the Eurora cluster consist of 2 NVIDIA Tesla K20 "Kepler" GPUs per node, with compute capability 3.x. The GPU resources of the PLX cluster consist instead of 2 NVIDIA Tesla M2070 GPUs (codenamed "Fermi") per node, with compute capability 2.0. In addition, ten of the PLX nodes are equipped with 2 NVIDIA QuadroPlex 2200 S4.

Regardless of the cluster, all GPUs are configured with Error Correction Code (ECC) support active, which protects data in memory and so improves data integrity and reliability for applications. Registers, L1/L2 caches, shared memory, and DRAM are all ECC protected.

At present (Nov 2014), the billing policy is based on the wall-clock time used on the requested cores and does not take into account the GPUs used. In other words, GPUs are free of charge on PLX and Eurora, and only the core-hours consumed on CPUs are accounted. The rationale is to encourage users to take as much advantage as possible of the GPUs.

Programming environment (how to write GPU enabled applications)

All tools and libraries required in the GPU programming environment are contained in the CUDA toolkit. The CUDA toolkit is made available through the “cuda” module. To start a programming session with GPUs, the first thing to do is to load the CUDA module.

> module load cuda

This loads the most recent version of the package. At present (Nov 2014), that is CUDA 5.0.35 on Eurora and CUDA 4.2.9 on PLX. To list all available versions, type:

> module available cuda

CUDA, in addition to the C compiler, provides optimized GPU-enabled scientific libraries for linear algebra, FFT, random number generators, and basic algorithms (such as sorting, reductions, signal processing, image processing, etc) through the following libraries:

  • CUBLAS: GPU-accelerated BLAS library
  • CUFFT: GPU-accelerated FFT library
  • CUSPARSE: GPU-accelerated Sparse Matrix library
  • CURAND: GPU-accelerated RNG library
  • CUDA NPP: NVIDIA Performance Primitives
  • THRUST: a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL).

The CUDA C compiler is nvcc. It is important to remember that the modules for the required compilers and MPI/OpenMP libraries must be loaded before the CUDA module.

To take full advantage of the GPUs' capabilities, add the -arch=sm_XX switch to the nvcc command, where XX=30 on Eurora and XX=20 on PLX.
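The choice of architecture switch can be captured in a small shell helper. This is only a sketch: the function name cuda_arch_flag and the lowercase cluster labels are hypothetical, and only the two values quoted above are covered. The compile line is printed rather than executed, and myprog.cu is a placeholder file name.

```shell
# Hypothetical helper: print the nvcc architecture switch for a cluster.
# The labels "eurora" and "plx" are illustrative, not set by the system.
cuda_arch_flag() {
    case "$1" in
        eurora) echo "-arch=sm_30" ;;   # Tesla K20 (Kepler)
        plx)    echo "-arch=sm_20" ;;   # Tesla M2070 (Fermi)
        *)      echo "unknown cluster: $1" >&2; return 1 ;;
    esac
}

# Print the compile line instead of running it, so the sketch also
# works on a machine where nvcc is not installed.
echo "nvcc $(cuda_arch_flag eurora) -o myprog myprog.cu"
```

Printing the command first makes it easy to check the switch before compiling on the target cluster.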

Example1: how to compile a serial C program with CUDA (using the cuBLAS library) on Eurora

module load gnu
module load cuda
nvcc -arch=sm_30 -I$CUDA_INC -L$CUDA_LIB -lcublas -o myprog myprog.c

Example2: how to compile a C MPI program with cuda (using a built in makefile) on Eurora

module load …
module load gnu
module load openmpi/1.6.4--gnu--4.6.3
module load cuda

Note that the PGI C and Fortran compilers provide their own CUDA libraries and CUDA extensions to the programming languages. Therefore, when using PGI, you do not need to load any CUDA module.

Production environment (how to run a GPU enabled application)

Access to computational resources is granted through job requests to the resource manager. The resource manager is a program that runs on the front-end and listens to users’ requests.

A job request typically consists of:

  •  resource specification: the kind and amount of resources you want for your job;
  •  job script: a shell script with the sequence of commands and controls needed to carry out your job.

Submitted jobs are put in a queue (with many priority attributes) and wait until requested resources become available. The job is then processed and subsequently removed from the queue.

On our GPGPU systems the resource manager is PBS. More information on PBS can be found in the HPC User Guide.

Job submission

Job requests are submitted to the resource manager (PBS) using the qsub command:

$ qsub [opts] my_job_script.sh

where [opts] specifies the resources and settings required by the job:

-l select=<N>:ncpus=<C>:ngpus=<G>:mpiprocs=<P>

asks for <N> nodes, and for each of them: <C> cores, <G> gpus and <P> MPI tasks;

-l walltime=hh:mm:ss 

specifies the maximum duration of your job;

-A <account_no>

specifies the project account to be charged. If needed, you can find it with the "saldo -b" command (see "New accounting policy" for more details).

-q <PBSqueue>

specifies the PBS queue in which your job will be placed.

Other useful qsub switches can be found in the PLX PBS and Eurora PBS sections of the HPC UserGuide and in the manual pages (see “man qsub”).
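To illustrate how the options above fit together, the following sketch assembles the select specification from its four values and prints the resulting qsub command without submitting it. The helper name build_select is hypothetical, and <account_no> and <PBSqueue> are left as placeholders.

```shell
# Hypothetical helper: assemble the "-l select" resource specification.
# Arguments: nodes, cores per node, GPUs per node, MPI tasks per node.
build_select() {
    printf 'select=%s:ncpus=%s:ngpus=%s:mpiprocs=%s\n' "$1" "$2" "$3" "$4"
}

# Print the resulting qsub command rather than submitting it.
# <account_no> and <PBSqueue> must be replaced with real values.
echo "qsub -l $(build_select 2 12 2 2) -l walltime=3:00:00 -A <account_no> -q <PBSqueue> my_job_script.sh"
```

Note that each field inside the select specification is separated by a colon and each value is assigned with an equals sign.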

For example, if you need one core and one GPU for three hours, submit your jobs as follows:

$ qsub -l select=1:ncpus=1:ngpus=1 -l walltime=3:00:00 -A <project> \
       -q parallel my_script.sh

or, if you need 4 cores and two GPUs for three hours,

$ qsub -l select=1:ncpus=4:ngpus=2 -l walltime=3:00:00 -A <project> \
       -q parallel my_script.sh

As another example, if you want your job script my_script.sh to run for 3 hours on 2 nodes with 12 cores and two GPUs for each node (with a total of 24 cores and 4 GPUs), using credit from project gran10, you can use the following command:

$ qsub -l select=2:ncpus=12:ngpus=2:mpiprocs=2 -l walltime=3:00:00 \
       -A gran10 -q parallel my_script.sh

The previous example will assign you two “full” nodes, i.e. 24 cores and 4 GPUs. If you do not specify the walltime resource, your job will be assigned the default value of the selected PBS queue.
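The totals quoted above follow from multiplying the per-node values by the number of nodes. A minimal sketch of that arithmetic, using a hypothetical helper named job_totals:

```shell
# Hypothetical helper: total cores and GPUs implied by a select request.
# Arguments: nodes, cores per node, GPUs per node.
job_totals() {
    echo "cores=$(( $1 * $2 )) gpus=$(( $1 * $3 ))"
}

job_totals 2 12 2   # prints: cores=24 gpus=4
```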

Please do not ask for a whole node unless you also intend to use both GPUs, since such a request prevents other users from accessing the GPUs on that node.

For any other information regarding features and limitations on each PBS queue as well as how to write job scripts see our HPC User Guide.
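As an illustration of what a job script may look like, the sketch below writes a minimal one to my_job_script.sh using a here-document, so that it can be inspected before submission. It assumes that resource requests are placed in the script as #PBS directive lines (a standard PBS feature) instead of on the qsub command line, that the executable is the myprog binary from Example1, and that the queue is called parallel as in the examples above; <account_no> is a placeholder to be replaced.

```shell
# Write a minimal GPU job script; nothing is submitted here.
cat > my_job_script.sh <<'EOF'
#!/bin/bash
#PBS -l select=1:ncpus=1:ngpus=1
#PBS -l walltime=3:00:00
#PBS -A <account_no>
#PBS -q parallel

# PBS starts the job in the home directory; move to the directory
# from which qsub was invoked.
cd "$PBS_O_WORKDIR"

# Load the compiler module before the CUDA module.
module load gnu
module load cuda

./myprog
EOF

# Show the generated script for review.
cat my_job_script.sh
```

With the directives inside the script, submission reduces to "qsub my_job_script.sh".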


At present the use of GPUs and other accelerators is not accounted; only the time spent on the CPUs is considered.
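A rough estimate of the charge for a job can be sketched as follows. This assumes the charge is simply the number of requested cores multiplied by the wall-clock time in hours, with GPUs not counted; the helper name core_hours is hypothetical and the actual accounting rules are those given in the UserGuide.

```shell
# Hypothetical helper: estimate core-hours for a job, assuming
# charge = requested cores x wall-clock hours (GPUs are not counted).
# Arguments: total cores, wall-clock time as hh:mm:ss.
core_hours() {
    echo "$2" | awk -F: -v cores="$1" \
        '{ printf "%.2f\n", cores * ($1 * 3600 + $2 * 60 + $3) / 3600 }'
}

core_hours 24 3:00:00   # prints: 72.00
```

For instance, two full nodes (24 cores) used for three hours come to 72 core-hours, whether or not the 4 GPUs are used.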

More details about "Accounting" can be found in the UserGuide (http://www.hpc.cineca.it/content/accounting-0).