virtual insanity

Now that I’ve (apparently) fixed the loader, my mammoth WebKit test binary loads and runs, and so I’ve begun implementing the stub functions in earnest. To start, my method has been to run the program until it crashes, find out where the crash happened (usually a NULL pointer dereference), and then provide a basic implementation of the class that the pointer is supposed to be pointing to.

The current problem is a crash that occurs inside a regular method call, for no apparent reason. The offending method, in its entirety:

void DocumentLoader::setFrame(Frame* frame)
{
    if (m_frame == frame)
        return;
    ASSERT(frame && !m_frame);
    m_frame = frame;
    attachToFrame();
}

Good old printf() tracing shows that the crash occurs after m_frame = frame but before attachToFrame(). That is, that method is never called. This is highly unusual, and tedious to debug, because it means we have no choice but to drop down to assembly code, which I can muddle through well enough but can’t really wrap my brain around.

Disassembling the last two lines of the method, we get this:

    mov    0x8(%ebp),%edx
    mov    0xc(%ebp),%eax
    mov    %eax,0xc(%edx)

    mov    0x8(%ebp),%eax
    mov    (%eax),%eax
    add    $0x8,%eax
    mov    (%eax),%eax
    sub    $0xc,%esp
    pushl  0x8(%ebp)
    call   *%eax
    add    $0x10,%esp

The pointer to the current object, this, is on the stack, 8 bytes in, as is the frame pointer argument, 12 bytes in. So we see the value of this being loaded from the stack and stored in %edx, and then the same for the frame argument, which is stored in %eax. Then %eax (the location of the frame object) is stored at the location 12 bytes into the object proper, which is where m_frame lives. Thus, m_frame = frame.

The next chunk, predictably, is the call to attachToFrame(). The important thing about this method is that it’s what C++ calls a virtual method. It wasn’t until Friday that it was actually explained to me what that meant, and I found it hilarious. Consider:

    Object *o = new Object;
    o->method();

    o = new SubObject;
    o->method();

(where SubObject is a subclass of Object).

Now, if method() is a virtual function, this will do what you’d expect from most other OO languages: the first call will go to Object::method(), the second to SubObject::method(). If it’s not virtual, then both calls will go to Object::method(), because the function is chosen from the type of the pointer, not the type of the object itself.
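To make the difference concrete, here is a minimal sketch. The classes are made up for illustration (nothing here comes from WebKit), and plain() is a hypothetical non-virtual method added alongside the virtual method().

```cpp
#include <string>

// Hypothetical classes for illustration only.
struct Object {
    virtual ~Object() = default;
    // Virtual: chosen at run time from the actual type of the object.
    virtual std::string method() const { return "Object::method"; }
    // Non-virtual: chosen at compile time from the type of the pointer.
    std::string plain() const { return "Object::plain"; }
};

struct SubObject : Object {
    std::string method() const override { return "SubObject::method"; }
    // This hides Object::plain() rather than overriding it.
    std::string plain() const { return "SubObject::plain"; }
};

// Both helpers call through a base-class pointer, as in the snippet above.
std::string call_virtual(const Object* o) { return o->method(); }
std::string call_plain(const Object* o) { return o->plain(); }
```

Passing a SubObject to call_virtual() reaches SubObject::method(), while call_plain() always lands in Object::plain() regardless of what the pointer really points at.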

I don’t know if this was considered counterintuitive when it was first designed, but it’s certainly not the way most OO languages work these days. Usually you have to be explicit when you want to call a superclass version.

In any case, the code generated is different. In the simple non-virtual case, the call can be done via an absolute address, as the compiler can know exactly where the method() function is for the type. The virtual case is more complicated as the object itself needs to be interrogated to find out where its function is.

To do this, a table for each class that the object inherits from is placed inside the object, containing pointers to the functions that the object wants to use for its virtual methods. A virtual method call might then be rendered in C as:

    o->_vtbl_Object.method();

That is, go through the table of implementations of methods defined in the Object class to find the method, and call it.
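Spelled out a little further, a hand-rolled version of the same idea might look like the sketch below. All the names (Obj, ObjVtbl, call_method and friends) are invented for illustration, and it assumes the common arrangement where the object holds a pointer to a shared per-class table and the object itself is passed as a hidden first argument. It imitates what a compiler generates; it is not what GCC literally emits.

```cpp
// Hypothetical names throughout; an imitation of compiler-generated dispatch.
struct Obj;

// One slot per virtual method. Each slot takes the object as its first argument.
struct ObjVtbl {
    int (*method)(Obj* self);
};

struct Obj {
    const ObjVtbl* vtbl;  // pointer to the static table for the object's class
    int value;
};

static int object_method(Obj* self) { return self->value; }
static int subobject_method(Obj* self) { return self->value * 2; }

// One table per class, shared by every instance of that class.
static const ObjVtbl object_vtbl = { object_method };
static const ObjVtbl subobject_vtbl = { subobject_method };

// The "constructors" do little more than point the object at its class's table.
Obj make_object(int v) { return Obj{ &object_vtbl, v }; }
Obj make_subobject(int v) { return Obj{ &subobject_vtbl, v }; }

// A virtual call: fetch the table, fetch the slot, call it with the object.
int call_method(Obj* o) { return o->vtbl->method(o); }
```

The caller never names object_method() or subobject_method() directly. Which one runs depends entirely on which table the object was pointed at when it was created.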

So, getting back to our disassembly. attachToFrame() is a virtual method. The code gets this from the stack, 8 bytes in, and puts it in %eax. Then it dereferences that pointer to load the first word of the object, which is the pointer to the virtual method table. It then adds 8 to that to get the location of the third slot in the table, and dereferences that to get a pointer to the attachToFrame() function, which goes into %eax.

Then it does the usual function call setup, making room on the stack for the arguments and return address, and then calls the function at the location in %eax. It is here that the crash occurs, because %eax has 0 in it.
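Read as C++, the call sequence in the disassembly amounts to roughly the following. This is only a sketch: Loader, Slot, mark_attached, fetch_slot and call_slot are made-up names, and it assumes 4-byte pointers (32-bit x86), so that a byte offset of 8 into the table corresponds to slot index 2.

```cpp
#include <cstddef>

struct Loader;                    // hypothetical stand-in for DocumentLoader
using Slot = void (*)(Loader*);   // each slot takes `this` as its only argument

struct Loader {
    const Slot* vptr;  // first word of the object: pointer to the class's table
    bool attached;
};

// Hypothetical stand-in for attachToFrame().
void mark_attached(Loader* self) { self->attached = true; }

// Fetch a slot the way the generated code does.
Slot fetch_slot(const Loader* self, std::size_t index) {
    const Slot* table = self->vptr;  // mov (%eax),%eax
    return table[index];             // add $0x8,%eax ; mov (%eax),%eax (index 2)
}

// Make the call, but report a null slot instead of jumping to address 0.
bool call_slot(Loader* self, std::size_t index) {
    Slot fn = fetch_slot(self, index);
    if (fn == nullptr)
        return false;  // the real code has no such check and faults at call *%eax
    fn(self);          // pushl 0x8(%ebp) ; call *%eax
    return true;
}
```

With a properly filled table, call_slot(&loader, 2) runs mark_attached(). With a table of zeroes it returns false, which is the situation the real binary is in, minus the safety net.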

I was floored when I first saw this. I checked a number of times in different places, finally checking the constructor itself. And sure enough, the virtual table contains all zeroes. To me this smelt suspiciously like a relocation problem - if the ELF loader is not correctly doing the relocations for virtual tables, then they’ll point to garbage memory, causing a crash.

I’m not entirely sure how this can be, and haven’t figured it out yet. I need to check the place where the virtual table is normally initialised, but I don’t know where that is! I can theorise by thinking about the structure of an object and the virtual table internally.

The first critical thing is that the virtual table inside the object is a pointer. That is, when the memory for the object is allocated, space is not allocated for the virtual table too. The pointer needs to be set to point to a valid virtual table. There are two ways this could be done: setting the pointer to some known static data that contains the table for this class, or allocating some more memory and copying the pointers from the same known static data.

The former seems the more likely to me. The extra allocation and copy seems unnecessary, as the table for the object will not change during the lifetime of the object. There are separate tables for each class the object inherits from, so there’s no need for a group of tables to be mixed into a single one.

So given that as a theory, we should be able to find some code somewhere around the constructor that sets up the virtual table pointer. It’ll probably be the first thing after the memory allocation is done. This code might not exist in the binary itself though but may be part of a language support library (libgcc or similar). Regardless, the thing that will need to be there is the virtual table location.
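One way to see from plain C++ that the constructor is where the table gets hooked up, without reading any assembly: a virtual call made inside a base-class constructor resolves to the base class's version, even when a derived object is being built. The classes below are made up for illustration. The language only guarantees the behaviour; that it is implemented by each constructor storing its own class's table pointer is my assumption about the compiler.

```cpp
#include <string>

// Hypothetical classes for illustration only.
struct Base {
    std::string seen_in_ctor;
    // While Base's constructor runs, the object still behaves as a Base,
    // so this virtual call reaches Base::name().
    Base() { seen_in_ctor = name(); }
    virtual ~Base() = default;
    virtual std::string name() const { return "Base"; }
};

struct Derived : Base {
    // By the time Derived's constructor body runs, virtual calls reach
    // Derived::name().
    std::string name() const override { return "Derived"; }
};
```

Constructing a Derived leaves seen_in_ctor set to "Base", while calling name() on the finished object gives "Derived". That fits the theory that the table pointer is written early in each constructor.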

I’m expecting to find that the location of the virtual table is not being relocated properly by the ELF loader. Basically, I trust GCC to produce correct code more than I trust our loader to do the right thing. The problem could also be within our linker, collect-aros, but it’s so simple that I’m happy to rule it out initially.

Stuart, get back to work!

Update 3pm: Found it. I missed one section header table index conversion when I was updating the loader for large numbers of sections. Stupid, but it never hurts to exercise my brain on the really low level stuff.
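For the curious, the kind of conversion involved looks roughly like this. This is not the loader's actual code, and I'm not showing which particular conversion was the missing one. It is just a sketch of the ELF rules for files with more sections than fit in the 16-bit fields, and every name in it is made up.

```cpp
#include <cstddef>
#include <cstdint>

namespace sketch {

constexpr std::uint16_t kShnXindex = 0xffff;  // SHN_XINDEX in the ELF spec

// Only the fields of section header 0 that matter for the escapes below.
struct SectionZero {
    std::uint32_t sh_size;  // real section count when e_shnum is 0
    std::uint32_t sh_link;  // real string table index when e_shstrndx is SHN_XINDEX
};

// Real number of sections. Assumes the file does have a section header table;
// a file with none also has e_shnum == 0 and must be handled by the caller.
std::uint32_t section_count(std::uint16_t e_shnum, const SectionZero& s0) {
    return e_shnum != 0 ? e_shnum : s0.sh_size;
}

// Real index of the section name string table.
std::uint32_t string_table_index(std::uint16_t e_shstrndx, const SectionZero& s0) {
    return e_shstrndx == kShnXindex ? s0.sh_link : e_shstrndx;
}

// Real section index for symbol number `sym`. `shndx_table` is the contents
// of the SHT_SYMTAB_SHNDX section, or nullptr if the file has none. Other
// reserved values (absolute, common and so on) are passed through unchanged.
std::uint32_t symbol_section(std::uint16_t st_shndx,
                             const std::uint32_t* shndx_table,
                             std::size_t sym) {
    if (st_shndx == kShnXindex && shndx_table != nullptr)
        return shndx_table[sym];
    return st_shndx;
}

}  // namespace sketch
```

Miss any one of these and a symbol or relocation quietly refers to the wrong section, which is just the sort of thing that could leave a table full of zeroes.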