The Wrong Pointer
In which I get very confused.
A few months ago, I was working on a C exercise that involved coordinating a number of identical threads with the goal of minimizing runtime. In principle it was a simple enough problem, but I somehow managed to produce a solution that successfully minimized runtime while logging egregiously incorrect timing data.
Without going into too much detail, the problem was presented as an aardvark and anthill simulation, so you'll have to pardon the names and nouns.
The simulation emitted timestamps on standard output, so the following output was what tipped me off to the fact that something strange was happening:
00.002668 Aardvark A slurping Anthill 0
1519874814.003379 Aardvark B slurping Anthill 1
-7070059777.995059 Aardvark C slurping Anthill 2
-7070059777.542307 Aardvark D slurping Anthill 0
-7070064072.509315 Aardvark E slurping Anthill 1
These are times in seconds, relative to the moment the program started. It was immediately clear that at most one of these timestamps could possibly be correct, and other runs produced differently unreasonable values, so a race condition of some sort was clearly present. In every other respect, however, the simulation performed exactly as expected.
The timestamps were produced using gettimeofday, and the two struct timeval were declared not only in static global space, but also in a separate translation unit from the one I was responsible for implementing. The compiler shouldn't have allowed me to even refer to the timing structures, and a quick grep confirmed that I wasn't.
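For context, the timing code amounted to something like the sketch below. This is my reconstruction rather than the course's actual source: the helper name log_event and the format string are invented, but the two static struct timeval values and the gettimeofday calls reflect the setup described above.

#include <stdio.h>
#include <sys/time.h>

/* Both of these lived in a translation unit I didn't control. */
static struct timeval start; /* populated once, at startup */
static struct timeval now;   /* refreshed on every log line */

static void log_event(const char *msg)
{
    gettimeofday(&now, NULL);
    long sec  = now.tv_sec  - start.tv_sec;
    long usec = now.tv_usec - start.tv_usec;
    if (usec < 0) { sec -= 1; usec += 1000000; } /* borrow */
    printf("%02ld.%06ld %s\n", sec, usec, msg);
}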
At this point, I was less motivated than I should have been, because I could independently verify that my simulated aardvarks were coordinating correctly: they finished at the published target time of approximately fifty-six seconds. That said, I couldn't in good conscience let a bug this silly slip by.
Since this was C, there was a good chance that I was looking at a memory error. Since something was stomping over memory that I had no way of legally accessing, there was a very good chance that I was looking at a memory error.
==7940== Memcheck, a memory error detector
==7940== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==7940== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==7940== Command: ./anthills
==7940==
1519876676.016039 Aardvark C slurping Anthill 1
1519876676.041669 Aardvark A slurping Anthill 0
-7070057915.958050 Aardvark D slurping Anthill 2
-7070062210.509055 Aardvark B slurping Anthill 0
...
==7822== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
But valgrind's memcheck reported no memory errors, and the program didn't crash or show any other symptoms. This wasn't too unexpected. After all, I didn't have many references to heap memory floating around. In fact, I had no naked calls to malloc at all. Since this program used the printf family of functions and pthreads, there was undoubtedly some heap memory being used behind the scenes, but I wasn't managing any of it directly.
As it turns out, most of the memory accesses that occurred in code that I controlled happened to be in static memory, which memcheck doesn't track, so it won't complain if you overshoot by a few bytes there. valgrind wasn't really going to be too helpful, so it was out.
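To see why, here's a contrived standalone example (my own, not the exercise's). memcheck tracks heap allocations, so the heap overrun below is reported as an invalid write, while the identical mistake on the static array passes without comment:

#include <stdlib.h>

static int static_buf[4];

int main(void)
{
    int *heap_buf = malloc(4 * sizeof *heap_buf);
    static_buf[4] = 1; /* out of bounds: memcheck stays silent */
    heap_buf[4] = 1;   /* out of bounds: memcheck reports an invalid write */
    free(heap_buf);
    return 0;
}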
At this point I was still wondering how I could possibly have stomped over memory that I had no way of getting a pointer to, so on a whim I swapped compilers from gcc to clang.
==8039== Memcheck, a memory error detector
==8039== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==8039== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==8039== Command: ./anthills
==8039==
00.017880 Aardvark F slurping Anthill 2
00.041125 Aardvark B slurping Anthill 0
00.047397 Aardvark A slurping Anthill 0
00.042611 Aardvark H slurping Anthill 2
00.043128 Aardvark E slurping Anthill 1
00.045085 Aardvark G slurping Anthill 1
...
==8039== Invalid read of size 4
==8039== at 0x4E428D9: pthread_join (pthread_join.c:45)
==8039== by 0x401F87: main (anthills.c:390)
==8039== Address 0x2000002d1 is not stack'd, malloc'd or (recently) free'd
==8039==
==8039==
==8039== Process terminating with default action of signal 11 (SIGSEGV)
==8039== Access not within mapped region at address 0x2000002D1
==8039== at 0x4E428D9: pthread_join (pthread_join.c:45)
==8039== by 0x401F87: main (anthills.c:390)
Compiling with clang instead of gcc gave correct timing data, but caused the program to segfault upon completion. That wasn't what I had wanted to see, but it was still helpful. Given the relative simplicity of our code, it wasn't likely that I'd found a compiler bug. So I was almost certainly hitting undefined behavior that gcc and clang handled differently, and since just about the only things happening here were memory accesses through pointers, I had to be going out of bounds somewhere.
valgrind is cognitively cheap, but in hindsight I probably should have reached for gdb first anyway. I took advantage of gdb's ability to watch a location in memory and to interrupt the program whenever a change is detected.
For a few reasons, I decided to stick with gcc. First and foremost among them: a segfault upon completion seemed worse than incorrect timing data. Beyond that, I knew the names of the memory locations associated with the timing data, but I didn't know which pthread_t was causing problems. And even if I had figured out which one it was, with over twenty threads I had no guarantee that it would be the same one every time.
There were only two timeval structures: start and now. Since time proceeded forward normally between jumps, now was being used correctly. gettimeofday gives time relative to the UNIX epoch, so I was fairly sure that now wasn't the problem. That left just start to worry about.
I fired up gdb and watched start.
(gdb) p &start
$1 = (struct timeval *) 0x6036d0 <start>
(gdb) watch (struct timeval) *0x6036d0
Hardware watchpoint 1: (struct timeval) *0x6036d0
The timing data is usually correct for the first couple of ticks, and, more or less as expected, I see the following a few times:
Hardware watchpoint 1: (struct timeval) *0x6036d0
Old value = {tv_sec = 0, tv_usec = 0}
New value = {tv_sec = 1519952936, tv_usec = 0}
0x00007ffff7ffac6d in gettimeofday ()
I tell gdb to continue a few times, and then something more interesting crops up:
Thread 4 "anthills" hit Hardware watchpoint 1: (struct timeval) *0x6036d0
Old value = {tv_sec = 1519952940, tv_usec = 882809}
New value = {tv_sec = 1, tv_usec = 882809}
actually_slurp (aardvark=66 'B', hill=1, order=1) at aardvarks.c:170
170 if (0 != sem_trywait(ants_remaining + hill)) return;
This is much more interesting, and looking at the new value tells me that this is when things go wrong. Since elapsed time is calculated as now - start, this also explains why the timestamps become so absurdly large once garbage lands in start.
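To check the arithmetic (my own back-of-the-envelope reconstruction, not anything taken from the logs): if a later stray write drops a hill value of 2 into the high four bytes of start.tv_sec, then start.tv_sec becomes 0x200000001, or 8589934593. With now.tv_sec around 1519874815, now - start works out to -7070059778 seconds, which lines up almost exactly with the -7070059777.995059 in the very first output. The tv_sec = 8589934593 that turns up in a later gdb session below looks like the same two-write pattern.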
That's progress, but something isn't right here. sem_trywait operates on semaphores, and ants_remaining is very much an array of semaphores. Further, hill is guaranteed to be a valid index. So why is sem_trywait trampling over start?
I could try using gdb to watch all of the semaphores in ants_remaining, but hardware watchpoints are scarce: x86 provides only four debug registers, each able to watch at most eight bytes, and a single sem_t is bigger than that. My hardware is simply not capable of watching the three semaphores that make up ants_remaining, so that's not going to work.
I now know that for some reason, sem_trywait is stomping over memory that holds struct timeval start, but I don't know why. What else can I get out of gdb?
I know I'm in a function, so I can inspect the call stack.
(gdb) where
#0 actually_slurp (aardvark=66 'B', hill=1, order=1) at aardvarks.c:170
#1 0x000000000040102e in aardvark (input=0x7fffffffdd71) at aardvarks.c:271
...
If I ignore all of the extra code I used while I was still working out how to coordinate the aardvarks, actually_slurp boils down to the following:
void actually_slurp(char aardvark, int hill, int order)
{
    if (hill == -1)
        hill = next_robin();
    last_robins[order - 9] = hill;
    if (0 != sem_trywait(ants_remaining + hill)) return;
    clock_in(hill, aardvark - 'A');
    slurp(aardvark, hill);
}
Oh.
I'm an idiot: I've omitted the braces around the body of a conditional block. That section was meant to read:
    if (hill == -1) {
        hill = next_robin();
        last_robins[order - 9] = hill;
    }
This fixes the timestamps, and the simulation no longer travels backwards into negative time.
As it turns out, the rest of the program works hard to guarantee that hill is only ever -1 when order is 9 or greater. This means that the condition hill == -1 is an implicit bounds check for the assignment into last_robins.
But why would an out-of-bounds write into last_robins cause sem_trywait to clobber start when operating on ants_remaining?
I took a peek at the declarations of last_robins and ants_remaining.
int last_robins[AARDVARKS % ANTHILLS];
// Keep at most 3 aardvarks on an anthill at once
static sem_t occupancy_per_hill[ANTHILLS];
static sem_t ants_remaining[ANTHILLS];
They're suspiciously close. AARDVARKS and ANTHILLS are 11 and 3, respectively, so intuitively it's not too much of a stretch to believe that going out of bounds in one of them might land me inside one of the others.
I decided to take a look at what memory got clobbered by the line that was being executed when it shouldn't have been: last_robins[order - 9] = hill. Recalling that order appeared to be consistently 1, gdb says the following:
(gdb) p &(last_robins[1 - 9])
$1 = (int *) 0x6036d0 <start>
(gdb) p &start
$2 = (struct timeval *) 0x6036d0 <start>
Oh.
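As a quick aside on the arithmetic: since last_robins is an array of four-byte ints, index -8 reaches back exactly 32 bytes, which puts last_robins itself at 0x6036f0. A struct timeval is sixteen bytes here, so start occupies 0x6036d0 through 0x6036df, leaving exactly enough room for one more timeval in between; presumably that's where now lives, which would explain why start was the only casualty.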
If we compile with clang, the same exercise reports:
(gdb) p &(last_robins[1 - 9])
$1 = (int *) 0x603840 <threads+64>
threads turns out to be yet another static global array in a translation unit that my code can't see. It contains the pthread_t objects that need to be cleaned up before the program terminates, so now I know why compiling with clang gave me an executable that behaved perfectly correctly until it tried to clean up after itself. Since out-of-bounds accesses are undefined behavior, the compiler is free to lay out global static memory however it wishes, and it turns out that gcc and clang chose different layouts.
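Here's a standalone illustration of that freedom (again my own toy program, not the exercise code). Nothing pins down where distinct globals land relative to one another, so the same out-of-bounds index can alias a different victim under each toolchain:

#include <stdio.h>

static int small[2];  /* stand-in for last_robins */
static long victim_a; /* stand-in for start       */
static long victim_b; /* stand-in for threads     */

int main(void)
{
    /* The compiler and linker choose this layout; the C standard
     * guarantees nothing about the ordering or the gaps. */
    printf("small:    %p\n", (void *)small);
    printf("victim_a: %p\n", (void *)&victim_a);
    printf("victim_b: %p\n", (void *)&victim_b);
    /* small[-8] reaches 32 bytes below the array; which object,
     * if any, lives there is entirely up to the build. */
    return 0;
}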
I have just one last niggling concern: this isn't entirely consistent with my initial observation. gdb clearly called out a call to sem_trywait as the culprit, but we can see here that it was in fact the previous line, last_robins[order - 9] = hill, that caused the erroneous write. My best guess is that a hardware watchpoint only fires after the offending store has completed, by which point execution has already moved on to the next statement. It might be interesting to see whether gdb consistently points at the line immediately following the one that writes to the watched memory.
void actually_slurp(char aardvark, int hill, int order)
{
    if (hill == -1) {
        hill = next_robin();
    }
    last_robins[order - 9] = hill;
    fprintf(stderr, "Oh no\n");
    if (0 != sem_trywait(ants_remaining + hill)) return;
    clock_in(hill, aardvark - 'A');
    slurp(aardvark, hill);
}
The buggy code above yields the following:
Thread 3 "anthills" hit Hardware watchpoint 1: (struct timeval) * 0x6036d0
Old value = {tv_sec = 1520234254, tv_usec = 914875}
New value = {tv_sec = 8589934593, tv_usec = 914875}
actually_slurp (aardvark=66 'B', hill=1, order=1) at aardvarks.c:165
165 fprintf(stderr, "Oh no\n");
(gdb) where
#0 actually_slurp (aardvark=66 'B', hill=1, order=1) at aardvarks.c:165
#1 0x000000000040104c in aardvark (input=0x7fffffffdcd1) at aardvarks.c:273
Well alright then.