.. index::
   single: Valgrind

.. _mdoc_valgrind:

========
Valgrind
========

Valgrind is an open source set of debugging and profiling tools. It is most
commonly used to locate memory errors, including leaks, but can also be used to
debug threaded codes and profile cache efficiency. See the
`Valgrind Online Documentation `_ for more information.

Usage
=====

The following snippet shows how to load the valgrind module and use it to
analyze a C code.

.. code:: bash

    module load valgrind
    gcc -g source.c
    valgrind --tool=[memcheck,cachegrind,helgrind] ./a.out

If no tool is specified, valgrind will default to the memory checker.

Memcheck
--------

Memcheck is a tool that detects a wide range of memory errors, including buffer
overruns, memory leaks, double-freeing of heap blocks, and use of uninitialized
variables.

Download :download:`memleak.c` for a simple example of using the memcheck tool
to identify a memory leak::

    $ module load valgrind
    $ gcc -g memleak.c
    $ valgrind --tool=memcheck ./a.out
    ==3153== Memcheck, a memory error detector
    ==3153== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
    ==3153== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
    ==3153== Command: ./a.out
    ==3153==
    ==3153==
    ==3153== HEAP SUMMARY:
    ==3153==     in use at exit: 800 bytes in 10 blocks
    ==3153==   total heap usage: 10 allocs, 0 frees, 800 bytes allocated
    ==3153==
    ==3153== LEAK SUMMARY:
    ==3153==    definitely lost: 800 bytes in 10 blocks
    ==3153==    indirectly lost: 0 bytes in 0 blocks
    ==3153==      possibly lost: 0 bytes in 0 blocks
    ==3153==    still reachable: 0 bytes in 0 blocks
    ==3153==         suppressed: 0 bytes in 0 blocks
    ==3153== Rerun with --leak-check=full to see details of leaked memory
    ==3153==
    ==3153== For counts of detected and suppressed errors, rerun with: -v
    ==3153== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)

Running memcheck without any options shows that 800 bytes, allocated in 10
distinct blocks, had not been freed when the program terminated. To find out
where those blocks were allocated, use the option :option:`--leak-check=full`::

    $ valgrind --tool=memcheck --leak-check=full ./a.out
    ==3154== Memcheck, a memory error detector
    ==3154== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
    ==3154== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
    ==3154== Command: ./a.out
    ==3154==
    ==3154==
    ==3154== HEAP SUMMARY:
    ==3154==     in use at exit: 800 bytes in 10 blocks
    ==3154==   total heap usage: 10 allocs, 0 frees, 800 bytes allocated
    ==3154==
    ==3154== 800 bytes in 10 blocks are definitely lost in loss record 1 of 1
    ==3154==    at 0x4C278FE: malloc (vg_replace_malloc.c:270)
    ==3154==    by 0x400575: main (memleak.c:i24)
    ==3154==
    ==3154== LEAK SUMMARY:
    ==3154==    definitely lost: 800 bytes in 10 blocks
    ==3154==    indirectly lost: 0 bytes in 0 blocks
    ==3154==      possibly lost: 0 bytes in 0 blocks
    ==3154==    still reachable: 0 bytes in 0 blocks
    ==3154==         suppressed: 0 bytes in 0 blocks
    ==3154==
    ==3154== For counts of detected and suppressed errors, rerun with: -v
    ==3154== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 6 from 6)

Memcheck has now identified where the 10 leaked blocks were allocated (the
``malloc`` call in ``main`` shown in the stack trace), and the user can modify
the code to free those allocations at the appropriate place.

Cachegrind
----------

Cachegrind is a Valgrind tool that simulates (rather than measures) how a code
interacts with the multi-level caches found in modern computer architectures.
It is very useful for determining whether cache misses are a performance
problem, and for identifying the parts of the code responsible. Cachegrind has
several limitations and can dramatically increase the time it takes to execute
a code. See the `cachegrind manual `_ for full details.

Download :download:`matrixmult.c` for a simple example of using the cachegrind
tool to estimate cache efficiency::

    $ module load valgrind
    $ gcc -g matrixmult.c
    $ valgrind --tool=cachegrind ./a.out
    ==2548== Cachegrind, a cache and branch-prediction profiler
    ==2548== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
    ==2548== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
    ==2548== Command: ./a.out
    ==2548==
    --2548-- warning: L3 cache found, using its data for the LL simulation.
    ==2548==
    ==2548== I   refs:      3,252,178,387
    ==2548== I1  misses:              745
    ==2548== LLi misses:              738
    ==2548== I1  miss rate:          0.00%
    ==2548== LLi miss rate:          0.00%
    ==2548==
    ==2548== D   refs:      1,082,643,679  (720,139,382 rd   + 362,504,297 wr)
    ==2548== D1  misses:      406,465,246  (405,103,433 rd   +   1,361,813 wr)
    ==2548== LLd misses:          313,706  (      1,950 rd   +     311,756 wr)
    ==2548== D1  miss rate:          37.5% (       56.2%     +         0.3%  )
    ==2548== LLd miss rate:           0.0% (        0.0%     +         0.0%  )
    ==2548==
    ==2548== LL refs:         406,465,991  (405,104,178 rd   +   1,361,813 wr)
    ==2548== LL misses:           314,444  (      2,688 rd   +     311,756 wr)
    ==2548== LL miss rate:            0.0% (        0.0%     +         0.0%  )

The above output shows that the example code has a first-level data cache (D1)
read miss rate above 50% (56.2%), which will significantly degrade performance.
Since the code was compiled with the :option:`-g` compiler flag, the
cg_annotate tool can be used to parse the cachegrind output and produce a
line-by-line annotated report::

    $ cg_annotate --auto=yes cachegrind.out.2548
    --------------------------------------------------------------------------------
    I1 cache:         32768 B, 64 B, 8-way associative
    D1 cache:         32768 B, 64 B, 8-way associative
    LL cache:         20971520 B, 64 B, 20-way associative
    Command:          ./a.out
    Data file:        cachegrind.out.2548
    Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
    Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
    Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
    Thresholds:       0.1 100 100 100 100 100 100 100 100
    Include dirs:
    User annotated:
    Auto-annotation:  on

    --------------------------------------------------------------------------------
               Ir I1mr ILmr          Dr        D1mr  DLmr          Dw      D1mw    DLmw
    --------------------------------------------------------------------------------
    3,252,178,387  745  738 720,139,382 405,103,433 1,950 362,504,297 1,361,813 311,756  PROGRAM TOTALS

    --------------------------------------------------------------------------------
               Ir I1mr ILmr          Dr        D1mr  DLmr          Dw      D1mw    DLmw  file:function
    --------------------------------------------------------------------------------
    3,251,974,540    6    6 720,090,005 405,100,952     0 362,490,010 1,361,251 311,250  /home/drudd/debug/matrixmult.c:main

    --------------------------------------------------------------------------------
    -- Auto-annotated source: /home/drudd/debug/matrixmult.c
    --------------------------------------------------------------------------------
               Ir I1mr ILmr          Dr        D1mr  DLmr          Dw      D1mw    DLmw

    -- line 12 ----------------------------------------
                .    .    .           .           .     .           .         .       .  *******************************************************************/
                .    .    .           .           .     .           .         .       .  #include
                .    .    .           .           .     .           .         .       .  #include
                .    .    .           .           .     .           .         .       .  #include
                .    .    .           .           .     .           .         .       .
                .    .    .           .           .     .           .         .       .  #define N 300
                .    .    .           .           .     .           .         .       .  #define M 4000
                .    .    .           .           .     .           .         .       .
              909    0    0           0           0     0           3         0       0  int main( int argc, char *argv[] ) {
                .    .    .           .           .     .           .         .       .      int i, j, k;
                .    .    .           .           .     .           .         .       .      double *A, *B, *C;
                .    .    .           .           .     .           .         .       .      double tmp;
                .    .    .           .           .     .           .         .       .
                3    1    1           0           0     0           1         0       0      A = (double *)malloc(N*M*sizeof(double));
                3    0    0           0           0     0           1         0       0      B = (double *)malloc(N*M*sizeof(double));
                2    0    0           0           0     0           1         0       0      C = (double *)malloc(N*N*sizeof(double));
                .    .    .           .           .     .           .         .       .
                7    1    1           0           0     0           0         0       0      if ( A == NULL || B == NULL || C == NULL ) {
                .    .    .           .           .     .           .         .       .          fprintf(stderr,"Error allocating memory!\n");
                .    .    .           .           .     .           .         .       .          exit(1);
                .    .    .           .           .     .           .         .       .      }
                .    .    .           .           .     .           .         .       .
                .    .    .           .           .     .           .         .       .      /* initialize A & B */
            1,801    0    0           0           0     0           0         0       0      for ( i = 0; i < N; i++ ) {
        6,000,600    1    1           0           0     0           0         0       0          for ( j = 0; j < M; j++ ) {
        2,400,000    0    0           0           0     0   1,200,000   150,000 150,000              A[M*i+j] = 3.0;
        2,400,000    0    0           0           0     0   1,200,000 1,199,999 150,000              B[N*j+i] = 2.0;
                .    .    .           .           .     .           .         .       .          }
                .    .    .           .           .     .           .         .       .      }
                .    .    .           .           .     .           .         .       .
          180,001    0    0           0           0     0           0         0       0      for ( i = 0; i < N*N; i++ ) {
          180,000    0    0           0           0     0      90,000    11,251  11,250          C[i] = 0.0;
                .    .    .           .           .     .           .         .       .      }
                .    .    .           .           .     .           .         .       .
              600    0    0           0           0     0           0         0       0      for ( i = 0; i < N; i++ ) {
          630,600    2    2      90,000      11,251     0           0         0       0          for ( j = 0; j < N; j++ ) {
    1,800,180,000    0    0           0           0     0           0         0       0              for ( k = 0; k < M; k++ ) {
    1,440,000,000    0    0 720,000,000 405,089,701     0 360,000,000         0       0                  C[N*i+j] += A[M*i+k]*B[N*k+j];
                .    .    .           .           .     .           .         .       .              }
                .    .    .           .           .     .           .         .       .          }
                .    .    .           .           .     .           .         .       .      }
                .    .    .           .           .     .           .         .       .
                3    0    0           0           0     0           2         1       0      free(A);
                2    0    0           0           0     0           1         0       0      free(B);
                3    0    0           1           0     0           1         0       0      free(C);
                .    .    .           .           .     .           .         .       .
                .    .    .           .           .     .           .         .       .      return 0;
                6    1    1           4           0     0           0         0       0  }

    --------------------------------------------------------------------------------
               Ir I1mr ILmr          Dr        D1mr  DLmr          Dw      D1mw    DLmw
    --------------------------------------------------------------------------------
              100    1    1         100         100     0         100       100     100  percentage of events annotated

Note that in this example the loop order causes very poor cache performance for
the innermost line of the nested loop: the inner k loop reads ``B[N*k+j]`` with
a stride of N elements, so nearly every read of B misses the D1 cache.
Exchanging the k and j indexed loops will give significantly better
performance. Still better performance can be obtained through blocking, or, as
this is a standard linear algebra operation, by using a tuned BLAS/LAPACK
library or the Intel Math Kernel Library, which provide tuned routines for
performing such calculations.

Helgrind
--------

Helgrind is a thread error checking tool. Unfortunately it interacts poorly
with gcc's OpenMP implementation and can produce a large number of distracting
messages. Still, it can be useful in identifying races or unprotected critical
sections within shared-memory parallel code.
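
As a minimal sketch of the kind of code Helgrind is meant to check, the
following uses POSIX threads to increment a shared counter under a mutex. The
names ``run_workers``, ``worker``, ``shared_counter``, and ``MAX_THREADS`` are
hypothetical and are not part of any example distributed with this page. With
the lock in place Helgrind should have nothing to report for the counter.
Removing the lock and unlock calls leaves an unsynchronized update that
Helgrind would be expected to flag as a possible data race.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16  /* hypothetical upper bound chosen for this sketch */

/* Shared state: a counter and the mutex that protects it. */
struct shared_counter {
    long value;
    long increments;       /* number of increments each thread performs */
    pthread_mutex_t lock;
};

/* Thread body: update the shared counter, holding the lock for each update. */
static void *worker(void *arg)
{
    struct shared_counter *c = arg;
    for (long i = 0; i < c->increments; i++) {
        pthread_mutex_lock(&c->lock);
        c->value++;  /* without the lock this read-modify-write is a race */
        pthread_mutex_unlock(&c->lock);
    }
    return NULL;
}

/* Start nthreads workers, wait for them, and return the final count.
 * Returns -1 if the mutex or any thread could not be created. */
long run_workers(int nthreads, long increments)
{
    pthread_t tid[MAX_THREADS];
    struct shared_counter c;
    int started = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    c.value = 0;
    c.increments = increments;
    if (pthread_mutex_init(&c.lock, NULL) != 0)
        return -1;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, &c) != 0)
            break;
        started++;
    }
    /* Join only the threads that were actually started. */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&c.lock);
    return started == nthreads ? c.value : -1;
}
```

To try it, call ``run_workers(4, 100000)`` from a ``main`` function, compile
with ``gcc -g -pthread``, and run the result with
``valgrind --tool=helgrind ./a.out``.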