Debugging multi-threaded C++ applications with Helgrind
Introduction
Multi-threaded programs are becoming more important because today’s software doesn’t benefit from multi-processor machines unless it utilizes multiple threads. Unfortunately, multi-threaded applications are prone to programming mistakes that are difficult to detect and localize due to the unpredictable and therefore nondeterministic execution order.
Faults that are unique to concurrent programs are data races, where unsynchronized access to shared memory locations results in inconsistent data, and dead locks, where two or more threads block each other. Hence, the need for tools that help in finding these kind of faults is hugely important.
Helgrind: a thread error detector
Maybe you’ve heard of the freely available tool Valgrind before. Its default tool Memcheck is able to detect memory-related bugs (memory leaks, doubly freed memory, use of deleted memory, use of uninitialized memory, etc.). It turns out that Valgrind also comes with a tool called Helgrind for detecting synchronization errors in C and C++ programs.
So let’s have a try and access a shared memory location from two threads without using synchronization:
#include <pthread.h> /* Unprotected global variable */ int var = 0; /* this function is run by the second thread */ void* child_inc( void* arg ) { /* increment var by second thread */ var++; return NULL; } int main( void ) { /* this variable is our reference to the second thread */ pthread_t child; /* create a second thread which executes child_inc() */ pthread_create(&child, NULL, child_inc, NULL); /* increment var by main thread */ var++; /* wait for the second thread to finish */ pthread_join(child, NULL); /* what is the result of var? */ return 0; }
The problem with this example is that ‘var’ is not protected for being updated simultaneously by both threads! A correct program would protect ‘var’ e.g. with a lock of type ‘pthread_mutex_t’, which is acquired before each access and released afterwards. Such missing locking can cause obscure timing dependent bugs!
Debugging process
Now, let’s run g++ compiler over this source code to create the executable:
user@host:~$ g++ -Wall -g -pthread simple_race.cpp -o simple_race
We compile our program with -g to include debugging information so that Helgrind’s error messages include exact line numbers.
Using Helgrind (provided that we’re on a Unix platform) is as simple as:
user@host:~$ valgrind --tool=helgrind --log-file=./helgrind.log ./simple_race &
As clear from the command above, the main binary is ‘valgrind’ and the tool which we want to use is specified by the option ‘–tool’. The ‘./simple_race’ above signifies the executable over which we want to run Helgrind. And the ‘./helgrind.log’ specifies that Valgrind sends all of its messages to this file.
Helgrind’s output for this program is:
==6395== Helgrind, a thread error detector ==6395== Copyright (C) 2007-2013, and GNU GPL'd, by OpenWorks LLP et al. ==6395== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info ==6395== Command: ./simple_race ==6395== Parent PID: 5667 ==6395== ==6395== ---Thread-Announcement------------------------------------------ ==6395== ==6395== Thread #1 is the program's root thread ==6395== ==6395== ---Thread-Announcement------------------------------------------ ==6395== ==6395== Thread #2 was created ==6395== at 0x51472DE: clone (clone.S:77) ==6395== by 0x4E3CF6F: do_clone.constprop.4 (createthread.c:75) ==6395== by 0x4E3E57F: pthread_create@@GLIBC_2.2.5 (createthread.c:256) ==6395== by 0x4C2F460: pthread_create_WRK (hg_intercepts.c:269) ==6395== by 0x4C2F5E2: pthread_create@* (hg_intercepts.c:300) ==6395== by 0x4005F4: main (simple_race.cpp:20) ==6395== ==6395== ---------------------------------------------------------------- ==6395== ==6395== Possible data race during read of size 4 at 0x601038 by thread #1 ==6395== Locks held: none ==6395== at 0x4005F5: main (simple_race.cpp:23) ==6395== ==6395== This conflicts with a previous write of size 4 by thread #2 ==6395== Locks held: none ==6395== at 0x4005C5: child_inc(void*) (simple_race.cpp:10) ==6395== by 0x4C2F5CB: mythread_wrapper (hg_intercepts.c:233) ==6395== by 0x4E3DE99: start_thread (pthread_create.c:308) ==6395== ==6395== ---------------------------------------------------------------- ==6395== ==6395== Possible data race during write of size 4 at 0x601038 by thread #1 ==6395== Locks held: none ==6395== at 0x4005FE: main (simple_race.cpp:23) ==6395== ==6395== This conflicts with a previous write of size 4 by thread #2 ==6395== Locks held: none ==6395== at 0x4005C5: child_inc(void*) (simple_race.cpp:10) ==6395== by 0x4C2F5CB: mythread_wrapper (hg_intercepts.c:233) ==6395== by 0x4E3DE99: start_thread (pthread_create.c:308) ==6395== ==6395== ==6395== For counts of detected and suppressed errors, rerun with: -v ==6395== Use --history-level=approx or =none to gain increased speed, at ==6395== the cost of reduced accuracy of conflicting-access information ==6395== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
A lot of details for an apparently very simple bug! But the lines above show that the checker was able to correctly detect the data race. The main error message says that there is a race as a result of a read of size 4 (bytes) at 0x601038, which is the address of ‘var’, happening in function ‘main’ at line 23 in the program.
In order to suppress false reportings in subsequent runs of Helgrind, it is possible to write a so-called suppressionfile that contains report-type and call-stack-patterns of locations that are false positives or part of code that is not modifiable like third-party libraries.
Conclusion
Helgrind can be very helpful in finding and resolving threading-related problems. Like all sophisticated tools, it is most effective when you understand how use it. The tool may generate a huge amount of warnings containing many false positives, but you can find real bugs, too! So play around with Helgrind!
And happy debugging!