EZTrace
 

Analyzing a CUDA application with EZTrace

Installing EZTrace

First, configure EZTrace as described in the support page.
$ ./configure --prefix=<INSTALLATION_DIR>

If everything is OK, compile and install EZTrace:
$ make && make install

Check the installation of EZTrace:
$ eztrace_avail
3       stdio   Module for stdio functions (read, write, select, poll, etc.)
2       pthread Module for PThread synchronization functions (mutex, semaphore, spinlock, etc.)
1       omp     Module for OpenMP parallel regions
4       mpi     Module for MPI functions
5       memory  Module for memory functions (malloc, free, etc.)
7       cuda  Module for cuda functions (cuMemAlloc, cuMemcopy, etc.)

You should see a module called cuda, which lets you analyze CUDA applications.
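
For scripted setups, the same check can be automated. The following is a minimal sketch, assuming that eztrace_avail prints one module per line with the module name in the second column, as in the listing above. has_module is a hypothetical helper name, not part of EZTrace.

```shell
# has_module NAME: read an eztrace_avail-style listing on stdin and
# succeed if NAME appears as the second whitespace-separated field.
has_module() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Only query the real tool when it is on the PATH.
if command -v eztrace_avail >/dev/null 2>&1; then
    if eztrace_avail | has_module cuda; then
        echo "cuda module available"
    else
        echo "cuda module not listed by eztrace_avail" >&2
    fi
fi
```

Matching on the second column rather than grepping the whole line avoids a false positive when another module's description happens to mention cuda.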

Running an application with EZTrace

Let's analyze this CUDA application, which performs a matrix multiplication [1]. First, compile it:
$ tar xzvf matmul.tgz
[...]
$ make
/usr/local/cuda/bin/nvcc -o matmul matmul.cu

You can run the application with:
$ ./matmul
Executing Matrix Multiplcation
Matrix size: 160x160
Finished.
CUDA matmul took 366 ms to complete.
CPU matmul took 25936 ms to complete.

You can analyze it with EZTrace by simply adding eztrace -t cuda to the command line:
$ eztrace -t cuda ./matmul
Starting EZTrace... done
Executing Matrix Multiplcation
Matrix size: 160x160
[EZTrace-CUDA] Only CPU events will be recorded. To enable the recording of GPU-events, set the EZTRACE_CUDA_CUPTI_ENABLED environment variable
Finished.
CUDA matmul took 379 ms to complete.
CPU matmul took 38884 ms to complete.
Stopping EZTrace... saving trace /tmp/<username>_eztrace_log_rank_1

However, this only records events that occur on the CPU (for instance, calls to cudaMemcpy or the invocation of a kernel); it does not show the occupancy of the GPU.
To generate an execution trace that contains both CPU and GPU events, set the EZTRACE_CUDA_CUPTI_ENABLED environment variable:
$ export EZTRACE_CUDA_CUPTI_ENABLED=1
$ eztrace -t cuda ./matmul
CUPTI is enabled
Starting EZTrace... done
Executing Matrix Multiplcation
Matrix size: 160x160
Finished.
CUDA matmul took 379 ms to complete.
CPU matmul took 38884 ms to complete.
Stopping EZTrace... saving trace /tmp/<username>_eztrace_log_rank_1

This should generate a trace file in /tmp/<username>_eztrace_log_rank_1.
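
Before moving on, it can be useful to confirm that the trace was actually written. This is a minimal sketch, assuming that the <username> part of the path matches the value of $USER (an assumption; the output above only shows the placeholder). trace_path is a hypothetical helper name.

```shell
# trace_path USERNAME: print the trace location reported in the
# EZTrace output above for the given user name.
trace_path() {
    printf '/tmp/%s_eztrace_log_rank_1\n' "$1"
}

# Assumption: <username> is the login name stored in $USER.
trace=$(trace_path "${USER:-unknown}")
if [ -e "$trace" ]; then
    echo "trace found: $trace"
else
    echo "no trace at $trace (has the application been run under eztrace?)"
fi
```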

Analyzing the execution of the application

Once the execution traces are generated, we need to analyze them. EZTrace allows you to visualize an execution trace or to compute statistics on it.

Visualizing an execution trace

In order to generate a PAJE/OTF trace file that can be visualized with ViTE or Vampir, you need to run eztrace_convert:

$ eztrace_convert -t PAJE /tmp/<username>_eztrace_log_rank_1
module stdio loaded
module pthread loaded
module omp loaded
module mpi loaded
module memory loaded
module cuda loaded
6 modules loaded
no more block for trace #0
28 events handled

This generates a file named eztrace_output.trace that you can visualize with ViTE:
$ vite eztrace_output.trace
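
The tracing and conversion steps of this tutorial can be collected into one script. The sketch below only uses commands and options shown above (eztrace -t cuda, EZTRACE_CUDA_CUPTI_ENABLED, eztrace_convert -t PAJE). The function names run and trace_and_convert, the DRY_RUN switch, and the use of $USER for <username> are assumptions made for illustration.

```shell
# run CMD...: execute a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# trace_and_convert APP: trace APP with CPU and GPU events, then
# convert the resulting trace to the PAJE format.
trace_and_convert() {
    app=$1
    # Assumption: <username> in the trace path is the value of $USER.
    trace="/tmp/${USER:-unknown}_eztrace_log_rank_1"
    export EZTRACE_CUDA_CUPTI_ENABLED=1   # also record GPU events
    run eztrace -t cuda "$app" &&
        run eztrace_convert -t PAJE "$trace"
}

# Preview the commands without running them.
DRY_RUN=1
trace_and_convert ./matmul
```

With DRY_RUN set, the script only prints the two commands. Unset DRY_RUN to run them; the conversion is attempted only if the traced run succeeds.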



Here, you can see the status of the main thread as it allocates data on the GPU (the yellow status corresponds to a call to cudaMalloc), copies data from host memory to GPU memory (black status, with an arrow from the CPU to the GPU), invokes a kernel (green status, with an arrow from the CPU to the GPU), and copies the data back from the GPU (black status, with an arrow from the GPU to the CPU).

The status of the GPU and the quantity of memory allocated on the GPU are also depicted.

Computing statistics

EZTrace currently cannot compute statistics on CUDA applications. This feature should be implemented soon.


[1] The CUDA matrix multiplication program comes from Virginia Tech Advanced Research Computing.

This tutorial was designed for EZTrace 1.0 or higher. If you encounter any problems, feel free to contact the EZTrace developers.