Wesley Wei

Read, Eval, Print: A First Look

Wesley Wei — Tue, 04 Sep 2018 12:00:00 GMT

With some mental gymnastics, the read, eval, print pattern drives most of our interactions with computers. A couple more contortions, and it sort of describes how we interact with each other in day to day life.

Don't get me wrong: I'm not trying to make some deep statement about life or people or society. My point is that this is how we often start off thinking about programming. Tell the computer what to do, let it do it, and then have it spit out results for the world to see. From the computer's perspective, it reads the instructions you give, evaluates them to produce a result, and then prints them out for you to read.

This is easiest to see in any of the numerous languages for which there exists an interactive interpreter, like Python. But I'm not here to talk to you about Python.

Okay, that's a lie. But the spotlight isn't on Python.

The Worst REPL

Today, I'm going to introduce what I rather unimaginatively call The Worst REPL, or REPL for short. REPL isn't really a general purpose programming language like Python. In fact, it's intended to be used much more like a POSIX shell. The difference, however, is that unlike a shell, which lends itself to executing executables stored on the filesystem, REPL is much more adept at executing Python functions.

A terrible analogy that I came up with 15 seconds ago is that REPL is to Python as a shell is to the operating system. This is a wildly inaccurate analogy with more holes in it than a sponge, but hopefully it holds water about as well.

Explain Yourself

Many who are comfortable working inside of a terminal emulator will agree that there exist tasks that are more suited to a terminal environment than a graphical environment. REPL grew out of solving one such problem.

I, as one member of a three person team, was working on a project that involved client applications communicating with a remote server. In any case, it eventually came time to test our server, and we were sunk. We had tested individual parts and we'd done some integration testing as well, but we needed to test the whole package from the client's perspective too.

Here was the problem: the GUI that we had could display what we needed just fine, it just wasn't yet capable of driving interaction with the remote server.

Forget testing, we didn't have a way for anyone to use the client even if everything other than the GUI worked flawlessly.

Enter REPL's Ancestor

From the beginning we had been entertaining the idea of having some scripting support, and that was reflected in our design enough for an early draft of REPL to emerge. I won't go into too much detail about this ancestor to REPL, but I will say the following: It was interesting enough and powerful enough that it spawned REPL, but also designed so horribly that I threw out almost everything when I started work on REPL.

Motivation

In that terrible hack of an interactive application shell I saw some things that worked well, as well as many shortcomings that I wanted to fix. So when I set out to create REPL, I had the fuzzy goal of making it "better," whatever that meant.

REPL isn't good for much on its own. In fact, I'll argue that if it is good for anything on its own, then I've probably done something wrong design-wise.

See, REPL's whole shtick is that it's embeddable into other applications as a scriptable driver for testing. At it's core, what REPL does is very, very simple: it forwards text arguments from the keyboard to functions written in Python. This allows users or developers to run Python code without worrying about too much syntax, and scripting support enables automation.

What struck me was that because execution was driven by a person hammering away on a keyboard, functions exposed through REPL were forced to serialize/deserialize any data that they exposed to the user. Since RPEL's precursor was embedded in an application whose purpose was largely to make API calls over a text based protocol, this forced serialization was immensely valuable because it made us think about API boundaries.

So to me, one of REPL's most fundamental uses is to force developers to think harder about whatever API they plan to inflict on themselves and on their users.

What Now?

REPL has a simple goal from my perspective, but it's vague, not terribly inspiring, and worst of all, not particularly useful - REPL won't make any design suggestions. Over time, REPL subtly changed directions and started to feel more and more like a shell (like Bash, for example). So while REPL has always been, and will always be, an embeddable shell for testing, it is now possible to write applications centered around REPL instead of the other way around.

Next time, I'll talk about what you need to know in order to plug your own functions into REPL, as well as what you'd need to know in order to build an application using REPL

The Wrong Pointer

Wesley Wei — Wed, 21 Mar 2018 01:17:23 GMT

In which I get very confused.

A few months ago, I was working on a C exercise that involved coordinating a number of identical threads with the goal of minimizing runtime. In principle, it was a simple enough problem, but I somehow managed to produce a solution that successfully minimized runtime and logged egregiously incorrect timing data.

Without going into too much detail, the problem was presented as an aardvark and anthill simulation, so you'll have to pardon the names and nouns.

The simulation emitted timestamps on standard output, so the following output was what tipped me off to the fact that something strange was happening:

00.002668 Aardvark A slurping Anthill 0
1519874814.003379 Aardvark B slurping Anthill 1
-7070059777.995059 Aardvark C slurping Anthill 2
-7070059777.542307 Aardvark D slurping Anthill 0
-7070064072.509315 Aardvark E slurping Anthill 1

These are times in seconds, relative to the time when the program was started. It was immediately clear that at most one of these timestamps could possibly be correct. In other cases no reasonable timestamps were produced, so it was clear that a race condition of some sort was present. However, the simulation performed exactly as expected in all other aspects.

The timestamps were produced using gettimeofday, and the two struct timeval were declared not only in static global space, but also in a separate translation unit from the one I was responsible for implementing. The compiler shouldn't have allowed me even make reference to the timing structures, and a quick grep confirmed that I was not.

At this point, I was less motivated than I should have been, because I could independently verify that my simulated aardvarks were coordinating correctly. They finished at the published target time of approximately fifty-six seconds. That said, I couldn't really let a bug this silly slip by in good faith.

Since this was C, there was a good chance that I was looking at a memory error. Since something was stomping over memory that I had no way of legally accessing, there was a very good chance that I was looking at a memory error.

==7940== Memcheck, a memory error detector
==7940== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==7940== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==7940== Command: ./anthills
==7940==
1519876676.016039 Aardvark C slurping Anthill 1
1519876676.041669 Aardvark A slurping Anthill 0
-7070057915.958050 Aardvark D slurping Anthill 2
-7070062210.509055 Aardvark B slurping Anthill 0
...
==7822== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

But valgrind's memcheck reported no memory errors, and the program didn't crash or show any other symptoms. This wasn't too unexpected. After all, I didn't have many references to heap memory floating around. In fact, I had no naked calls to malloc at all. Since this program used the printf family of functions and pthreads, there was undoubtedly some heap memory being used behind the scenes, but I wasn't managing any of it directly.

As it turns out, most of the memory accesses that occurred in code that I controlled happened to be in static memory, where memcheck won't complain if you overshoot by a few bytes. valgrind wasn't really going to be too helpful, so it was out.

At this point I was still wondering how I could possibly have stomped over memory that I had no way of getting a pointer to, so on a whim I swapped compilers from gcc to clang.

==8039== Memcheck, a memory error detector
==8039== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==8039== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==8039== Command: ./anthills
==8039==
00.017880 Aardvark F slurping Anthill 2
00.041125 Aardvark B slurping Anthill 0
00.047397 Aardvark A slurping Anthill 0
00.042611 Aardvark H slurping Anthill 2
00.043128 Aardvark E slurping Anthill 1
00.045085 Aardvark G slurping Anthill 1
...
==8039== Invalid read of size 4
==8039==    at 0x4E428D9: pthread_join (pthread_join.c:45)
==8039==    by 0x401F87: main (anthills.c:390)
==8039==  Address 0x2000002d1 is not stack'd, malloc'd or (recently) free'd
==8039==
==8039==
==8039== Process terminating with default action of signal 11 (SIGSEGV)
==8039==  Access not within mapped region at address 0x2000002D1
==8039==    at 0x4E428D9: pthread_join (pthread_join.c:45)
==8039==    by 0x401F87: main (anthills.c:390)

Compiling with clang instead of gcc gave correct timing data, but caused the program to segfault upon completion. That wasn't what I had wanted to see, but it was still helpful. Given the relative simplicity of our code, it wasn't likely that I'd found a compiler bug. So I was almost certainly hitting undefined behavior that gcc and clang handled differently, and since just about the only things happening here were memory accesses through pointers, I had to be going out of bounds somewhere.

valgrind is cognitively cheap, but in hindsight I probably should have reached for gdb first anyway. I took advantage of gdb's ability to watch a location in memory and to interrupt the program whenever a change is detected.

For a few reasons, I decided to stick with gcc. First and foremost among them is that a segfault upon completion seems worse than incorrect timing data. In any case, I knew the names of the memory locations associated with the timing data, but I didn't know which pthread_t was causing problems. Even if I did figure out which one it was, with over twenty threads, I had no guarantee that it would be the same one every time.

There were only two timeval structures: start and now. Since time proceeded forward normally between jumps, now was being used correctly. gettimeofday gives time relative to the UNIX epoch, so I was fairly sure that now wasn't the problem. That left me with just start to worry about.

I fired up gdb and watched start.

(gdb) p &start
$1 = (struct timeval *) 0x6036d0 
(gdb) watch (struct timeval) *0x6036d0
Hardware watchpoint 1: (struct timeval) *0x6036d0

The timing data is usually correct for the first couple of ticks, and sort of as expected, I see the following a few times.

Hardware watchpoint 1: (struct timeval) *0x6036d0

Old value = {tv_sec = 0, tv_usec = 0}
New value = {tv_sec = 1519952936, tv_usec = 0}
0x00007ffff7ffac6d in gettimeofday ()

I tell gdb to continue a few times, and then something more interesting crops up:

Thread 4 "anthills" hit Hardware watchpoint 1: (struct timeval) *0x6036d0

Old value = {tv_sec = 1519952940, tv_usec = 882809}
New value = {tv_sec = 1, tv_usec = 882809}
actually_slurp (aardvark=66 'B', hill=1, order=1) at aardvarks.c:170
170     if (0 != sem_trywait(ants_remaining + hill)) return;

This is much more interesting, and looking at the new value tells me that this is when things go wrong. Since the form of the equation used to calculate elapsed time is start - now, this also explains why the timestamps become extremely negative after this point.

That's progress, but something isn't right here. sem_trywait operates on semaphores, and ants_remaining is very much an array of semaphores. Further, hill is guaranteed to be a valid index. So why is sem_trywait trampling over start?

I could try using gdb to watch all of the semaphores in ants_remaining, but my hardware is not capable of watching the three semaphores that make up ants_remaining, so that's not going to work.

I now know that for some reason, sem_trywait is stomping over memory that holds struct timeval start, but I don't know why. What else can I get out of gdb?

I know I'm in a function, so I can inspect the call stack.

(gdb) where
#0  actually_slurp (aardvark=66 'B', hill=1, order=1) at aardvarks.c:170
#1  0x000000000040102e in aardvark (input=0x7fffffffdd71) at aardvarks.c:271
...

If I ignore all of the extra code I used while I was still working out how to coordinate the aardvarks, actually_slurp boils down to the following:

void actually_slurp(char aardvark, int hill, int order)
{
    if (hill == -1)
        hill = next_robin();
        last_robins[order - 9] = hill;

    if (0 != sem_trywait(ants_remaining + hill)) return;

    clock_in(hill, aardvark - 'A');

    slurp(aardvark, hill);
}

Oh.

I'm an idiot: I've omitted the braces around the body of a conditional block. That section was meant to read:

    if (hill == -1) {
        hill = next_robin();
        last_robins[order - 9] = hill;
    }

This fixes the timestamps, and the simulation no longer travels backwards into negative time.

As it turns out, the rest of the program works hard to guarantee that hill is only ever -1 when order is 9 or greater. This means that the condition hill == -1 is an implicit bounds check for the assignment into last_robins.

But why would an out-of-bounds write into last_robins cause sem_trywait to clobber start when operating on ants_remaining?

I took a peek at the declarations of last_robins and ants_remaining.

int last_robins[AARDVARKS % ANTHILLS];

// Keep at most 3 aardvarks on an anthill at once
static sem_t occupancy_per_hill[ANTHILLS];
static sem_t ants_remaining[ANTHILLS];

They're suspiciously close. AARDVARKS and ANTHILLS are 11 and 3, respectively, so intuitively it's not too much of a stretch to believe that going out of bounds in one of them might land me inside one of the others.

I decided to take a look at what memory got clobbered by the line that was being executed when it shouldn't have been, last_robins[order - 9].

Recalling that order was 1 and appeared to be consistently 1, gdb says the following:

(gdb) p &(last_robins[1 - 9])
$1 = (int *) 0x6036d0 
(gdb) p &start
$2 = (struct timeval *) 0x6036d0

Oh.

If we compile with clang, the same exercise reports:

(gdb) p &(last_robins[1 - 9])
$1 = (int *) 0x603840

threads turns out to be yet another global static array in a translation unit that my code can't see. It contains the pthread_t objects that need to be cleaned up before the program terminates, so now I know why compiling with clang gave me an executable that behaved perfectly correctly until it tried to clean up after itself. Since out of bounds references are undefined behavior, the compiler is free to rearrange global static memory in whatever way it wishes to, and it turns out that gcc and clang chose different layouts.

I have just one last niggling concern: this isn't entirely consistent with my initial observation. gdb clearly called out a call to sem_trywait as the culprit, but we can clearly see here that it is in fact the previous line, last_robins[order - 9] that caused the erroneous write. I'm not too sure why this is the case, and it might be interesting to see if gdb will consistently point at the line immediately following the one that causes the write to memory.

void actually_slurp(char aardvark, int hill, int order)
{
    if (hill == -1) {
        hill = next_robin();
    }
    last_robins[order - 9] = hill;

    fprintf(stderr, "Oh no\n");

    if (0 != sem_trywait(ants_remaining + hill)) return;

    clock_in(hill, aardvark - 'A');
    slurp(aardvark, hill);
}

The buggy code above yields the following:

Thread 3 "anthills" hit Hardware watchpoint 1: (struct timeval) * 0x6036d0

Old value = {tv_sec = 1520234254, tv_usec = 914875}
New value = {tv_sec = 8589934593, tv_usec = 914875}
actually_slurp (aardvark=66 'B', hill=1, order=1) at aardvarks.c:165
165     fprintf(stderr, "Oh no\n");
(gdb) where
#0  actually_slurp (aardvark=66 'B', hill=1, order=1) at aardvarks.c:165
#1  0x000000000040104c in aardvark (input=0x7fffffffdcd1) at aardvarks.c:273

Well alright then.