Testing multi-threaded race conditions

Question

Reading the comments to this answer, specifically:

Just because you can't write a test doesn't mean it's not broken. Undefined behaviour which usually happens to work as expected (C and C++ are full of that), race conditions, potential reordering due to a weak memory model... – CodesInChaos 7 hours ago

@CodesInChaos if it can't be reproduced then the code written to 'fix' it can't be tested either. And putting untested code into live is a worse crime in my opinion – RhysW 5 hours ago

...has me wondering whether there are any good general ways to consistently trigger, in a test case, problems caused by race conditions that occur only very infrequently in production.

John R. Strohm · Accepted Answer · 2013-04-25 14:42:09Z

101

After having been in this crazy business since about 1978, having spent almost all of that time in embedded real-time computing, working multitasking, multithreaded, multi-whatever systems, sometimes with multiple physical processors, having chased more than my fair share of race conditions, my considered opinion is that the answer to your question is quite simple.

No.

There's no good general way to trigger a race condition in testing.

Your ONLY hope is to design them completely out of your system.

When and if you find that someone else has stuffed one in, you should stake him out on an anthill, and then redesign to eliminate it. After you have designed his faux pas (pronounced f***up) out of your system, you can go release him from the ants. (If the ants have already consumed him, leaving only bones, put up a sign saying "This is what happens to people who put race conditions into XYZ project!" and LEAVE HIM THERE.)


26

I completely agree. In other words, this is much like the joke - Patient: "Doctor, it hurts when I do this..." Doctor: "Then stop doing it!"
– Mark Rushakoff
Commented Apr 25, 2013 at 14:43
Nice answer. If something causes an un-testable problem, try to work around it to start with, avoid the problem altogether!
– user78252
Commented Apr 25, 2013 at 14:58
1

This doesn't answer how you redesign to eliminate it without the ability to test the new design.
– Noma4i
Commented Aug 18, 2016 at 14:18
1

@Noma4i: Nobody ever said designing multitasking multithreaded multi-whatever systems is EASY. There is a large body of knowledge on how to do it, written mostly by Edsger Dijkstra, Tony Hoare, Nico Habermann, and Per Brinch Hansen. You could do FAR worse than to embark on a serious study of their writings.
– John R. Strohm
Commented Aug 18, 2016 at 15:25
I am wondering about the feasibility of using ThreadSanitizer to instrument your software and then simply fuzzing it to detect the crash. Isn't that doable?
– lllllllllllll
Commented Mar 14, 2020 at 3:01


Julien · Accepted Answer · 2013-04-25 19:25:32Z

19

The best tool I know for this sort of problem is an extension of Valgrind called Helgrind.

Basically, Valgrind simulates a virtual processor and runs your (unmodified) binary on top of it, so it can check every single memory access. Using that framework, Helgrind watches synchronization calls (such as pthread mutex operations) to infer when an access to a shared variable is not properly protected by a mutual exclusion mechanism. That way it can detect a theoretical race condition even if it has not actually happened.

Intel sells a very similar tool called Intel Inspector.

These tools give great results but your program will be considerably slower during analysis.
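To make the "detect it even if it has not happened" idea concrete, here is a toy Python sketch of a lockset check. It is an illustration only, not how Helgrind is implemented: the class `LocksetChecker` and the hand-written trace of events are hypothetical, and a real tool records such events by instrumenting the running binary.

```python
from collections import defaultdict


class LocksetChecker:
    """Toy lockset checker (hypothetical; for illustration only)."""

    def __init__(self):
        self._held = defaultdict(set)     # thread -> locks currently held
        self._candidates = {}             # variable -> locks held on EVERY access so far
        self._threads = defaultdict(set)  # variable -> threads that accessed it
        self.warnings = []                # variables with no common protecting lock

    def acquire(self, thread, lock):
        self._held[thread].add(lock)

    def release(self, thread, lock):
        self._held[thread].discard(lock)

    def access(self, thread, var):
        held = set(self._held[thread])
        if var in self._candidates:
            # Keep only the locks that were held on all accesses so far.
            self._candidates[var] &= held
        else:
            self._candidates[var] = held
        self._threads[var].add(thread)
        shared = len(self._threads[var]) > 1
        if shared and not self._candidates[var] and var not in self.warnings:
            self.warnings.append(var)


# A recorded trace: the accesses happen one after another, so nothing
# actually collides, but thread B touches "counter" without the mutex.
checker = LocksetChecker()
checker.acquire("A", "mutex")
checker.access("A", "counter")
checker.access("A", "total")
checker.release("A", "mutex")
checker.acquire("B", "mutex")
checker.access("B", "total")
checker.release("B", "mutex")
checker.access("B", "counter")  # unprotected access

print(checker.warnings)  # ['counter']
```

The trace never exhibits a bad interleaving, yet the checker flags `counter` because no single lock was held across all of its accesses. That is the sense in which such tools report potential races rather than observed ones.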


1

is Valgrind still a *nix only tool?
– Dan Is Fiddling By Firelight
Commented Apr 25, 2013 at 20:20
1

Yes: Linux, Mac OS X, Android and some BSDs: valgrind.org/info/platforms.html
– Julien
Commented Apr 25, 2013 at 20:35
3

ThreadSanitizer is a similar tool. It works differently than Helgrind, which gives it the advantage of being much faster, but requires integration into the toolchain.
– Sebastian Redl
Commented Apr 26, 2013 at 17:16


rerun · Accepted Answer · 2013-04-25 15:21:04Z

18

If you are on the Microsoft toolchain, Microsoft Research has created a tool called CHESS that forces a new interleaving on each run and can reproduce failed runs.

Here is a video showing it in use.
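The underlying idea of systematically forcing interleavings and replaying the failing ones can be sketched in a few lines of Python. This is a toy model, not the CHESS API: the helpers `make_thread`, `all_schedules` and `run` are hypothetical, and each "thread" is just a list of atomic steps so that the scheduler is fully under the test's control.

```python
from itertools import combinations


def make_thread(shared):
    """Return the two atomic steps of a non-atomic `counter += 1`."""
    local = {}

    def read():
        local["tmp"] = shared["counter"]

    def write():
        shared["counter"] = local["tmp"] + 1

    return [read, write]


def all_schedules(steps_a, steps_b):
    """Yield every interleaving of two threads as a tuple of 'A'/'B' labels."""
    total = steps_a + steps_b
    for a_slots in combinations(range(total), steps_a):
        yield tuple("A" if i in a_slots else "B" for i in range(total))


def run(schedule):
    """Replay one schedule deterministically and return the final counter."""
    shared = {"counter": 0}
    threads = {"A": iter(make_thread(shared)), "B": iter(make_thread(shared))}
    for name in schedule:
        next(threads[name])()  # execute that thread's next atomic step
    return shared["counter"]


# Explore all six interleavings and keep the ones that lose an update.
failing = [s for s in all_schedules(2, 2) if run(s) != 2]
print(len(failing))  # 4
print(failing[0])    # ('A', 'B', 'A', 'B')
```

Because a schedule is just data, any failing one can be passed back to `run` to reproduce the failure exactly, which is the property that makes this style of tool useful for regression tests.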



Kilian Foth · Accepted Answer · 2013-04-25 15:06:46Z

11

Exposing a multi-threading bug requires forcing different threads of execution to perform their steps in a particular interleaved order. Usually this is hard to do without manual debugging or manipulating the code to get some kind of "handle" to control this interleaving. But changing code that behaves unpredictably will often influence that unpredictability, so this is hard to automate.

A nice trick is described by Jaroslav Tulach in Practical API Design: if you have logging statements in the code under question, manipulate the consumer of those logging statements (e.g. an injected pseudo-terminal) so that it accepts the individual log messages in a particular order based on their content. This allows you to control the interleaving of steps in different threads without having to add anything to production code that isn't already there.
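A minimal Python sketch of that logging trick follows. It is my own illustration rather than the code from the book: `OrderingHandler`, the logger name `race_demo` and the `increment` function are all hypothetical, and it assumes the code under test already logs after its read and before its write.

```python
import logging
import threading


class OrderingHandler(logging.Handler):
    """Hypothetical test-only handler: known messages pass only in a fixed order."""

    def __init__(self, expected_order, timeout=5.0):
        super().__init__()
        self._expected = list(expected_order)
        self._position = 0
        self._cond = threading.Condition()
        self._timeout = timeout

    def handle(self, record):
        # handle() is overridden instead of emit() so that the handler's
        # built-in I/O lock is not held while a thread waits for its turn.
        message = record.getMessage()
        if message not in self._expected:
            return True  # unrelated messages pass straight through

        def my_turn():
            return (self._position < len(self._expected)
                    and self._expected[self._position] == message)

        with self._cond:
            if not self._cond.wait_for(my_turn, timeout=self._timeout):
                raise TimeoutError(f"log message never got its turn: {message!r}")
            self._position += 1
            self._cond.notify_all()
        return True


log = logging.getLogger("race_demo")
log.setLevel(logging.DEBUG)
log.propagate = False
# Force both reads to complete before either write is allowed to start.
log.addHandler(OrderingHandler(["A read", "B read", "A writing", "B writing"]))

counter = {"value": 0}


def increment(name):
    """Code under test: a non-atomic increment that happens to log its steps."""
    current = counter["value"]
    log.debug("%s read", name)
    log.debug("%s writing", name)
    counter["value"] = current + 1


threads = [threading.Thread(target=increment, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["value"])  # 1 instead of 2: the lost update is forced every time
```

Only the log consumer changed; `increment` is untouched. Note that the placement of the existing log statements limits which interleavings can be forced, so this works best when the code logs on both sides of the step you care about.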

edited Apr 25, 2013 at 15:06

answered Apr 25, 2013 at 14:37


2

I have done something similar before, using injected repositories to sleep the threads that call them in specific orders to force the interleaving I want. Having written code that does it, I'm inclined to +1 @John's answer above. Seriously, this stuff is so painful to employ correctly, and it still gives only best-guess guarantees, because there could be slightly different interleavings with different results. The better approach is to just eliminate all possible race conditions through static analysis and/or careful combing of the code for any and all shared state.
– Jimmy Hoffa
Commented Apr 25, 2013 at 14:48


Sebastian Redl · Accepted Answer · 2013-04-26 17:23:35Z

There is no way to be absolutely sure various kinds of undefined behavior (in particular race conditions) don't exist.

However, there are a number of tools that show up a good number of such situations. You may be able to prove that a problem exists currently with such tools, even though you cannot prove that your fix is valid.

Some interesting tools for this purpose:

Valgrind is a memory checker. It finds memory leaks, reads of uninitialized memory, uses of dangling pointers and out-of-bounds accesses.

Helgrind is a thread safety checker. It finds race conditions.

Both work by dynamic instrumentation, i.e. they take your program as-is and execute it in a virtualized environment. This makes them unintrusive, but slow.

UBSan is an undefined behavior checker. It finds various cases of C and C++ undefined behavior, such as integer overflows, out-of-range shifts and similar stuff.

MSan is a memory checker. It has similar goals as Valgrind.

TSan is a thread safety checker. It has similar goals as Helgrind.

These three are built into the Clang compiler and generate code at compile time. This means that you need to integrate them into your build process (in particular, you have to compile with Clang), which makes them much harder to initially set up than *grind, but on the other hand they have a much lower runtime overhead.

All the tools I listed work on Linux and some of them on MacOS. I don't think any work on Windows reliably yet.

Community · Accepted Answer · 2017-05-23 12:40:17Z

It seems most of the answers here mistake this question as "how do I automatically detect race conditions?" when the question is really "how do I reproduce race conditions in testing when I find them?"

The way to do it is to introduce synchronization points into your code that are used only for testing. For example, if a race condition occurs when Event X happens in between Event A and Event B, then for testing your application, write some code that waits for Event X to happen after Event A happens. You will likely need some way for your tests to talk to your application to tell it "hey, I'm testing this thing, so wait for this event at this location".

I'm using Node.js and MongoDB, where some actions involve creating consistent data in multiple collections. In these cases, my unit tests call the application to tell it "set up a wait for Event X"; once the application has set it up, the test for Event X runs, and the tests then tell the application "I'm done with the wait for Event X" so that the remaining tests run normally.

The answer here explains this type of thing in detail in the context of Python: https://stackoverflow.com/questions/19602535/how-can-i-reproduce-the-race-conditions-in-this-python-code-reliably
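As a rough Python sketch of the same idea (not my actual Node.js code), the application can call a hook that does nothing unless a test has armed it. The names `SyncPoints`, `withdraw`, `deposit` and the point name `after_read` are hypothetical; here Event A is the read, Event B is the write, and Event X is a competing deposit.

```python
import threading


class SyncPoints:
    """Hypothetical test-only hooks: a no-op unless a test arms a point."""

    def __init__(self):
        self._armed = {}  # name -> (reached, resume) events
        self._lock = threading.Lock()

    def arm(self, name):
        """Test side: make wait_at(name) block until release(name)."""
        with self._lock:
            self._armed[name] = (threading.Event(), threading.Event())

    def wait_at(self, name, timeout=5.0):
        """Application side: pause here only if the point is armed."""
        with self._lock:
            pair = self._armed.get(name)
        if pair is None:
            return
        reached, resume = pair
        reached.set()
        if not resume.wait(timeout):
            raise TimeoutError(f"sync point {name!r} was never released")

    def wait_until_reached(self, name, timeout=5.0):
        """Test side: block until the application is parked at the point."""
        with self._lock:
            reached, _ = self._armed[name]
        if not reached.wait(timeout):
            raise TimeoutError(f"sync point {name!r} was never reached")

    def release(self, name):
        """Test side: disarm the point and let the application continue."""
        with self._lock:
            _, resume = self._armed.pop(name)
        resume.set()


sync = SyncPoints()
balance = {"value": 100}


def withdraw(amount):
    current = balance["value"]            # Event A: read
    sync.wait_at("after_read")            # test-only hook
    balance["value"] = current - amount   # Event B: write


def deposit(amount):
    balance["value"] += amount            # Event X


# The test: park withdraw() between A and B, run X, then let B proceed.
sync.arm("after_read")
worker = threading.Thread(target=withdraw, args=(30,))
worker.start()
sync.wait_until_reached("after_read")
deposit(50)
sync.release("after_read")
worker.join()

print(balance["value"])  # 70: the deposit was overwritten (120 would be correct)
```

The cost of this approach is a hook call in production code, so it helps to keep the hook trivially cheap when unarmed, as `wait_at` is here.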
