27 July, 2009

putting things in context

The last couple of months have been busy but I’ve managed to find bits of time here and there to hack on the new AROS hosted port. Last week I really got the guts of the task switching and interrupt code working the way I wanted, which is what I’m here to tell you about today.

Task switching in a typical multitasking system is very simple in concept. Imagine a computer running a single task. There’s a big pile of instructions in memory somewhere, and the processor just runs them in sequence. It will keep doing that until something stops it. That something is the most important requirement to make preemptive multitasking work.

What usually happens (again in very simple terms) is that there’s an extra bit of circuitry somewhere in the computer that works as a timer. Every now and again (though tens or hundreds of times a second), it will prod the CPU. In response, the CPU will stop what it’s doing and go and run a different bit of code somewhere else in memory. The “prod” is known as an interrupt (or Interrupt Request (IRQ)), and the bit of code that runs is the interrupt handler (or more formally, the Interrupt Service Routine (ISR)). It’s the handler’s job to arrange for a different task to run.

Something the CPU will do when responding to the interrupt is to save its complete state (known as the context) before it calls the handler. That is, somewhere in memory (typically on the stack) it will save a copy of all its registers, the stack pointer, the program counter and everything else it needs to continue running the program from where it was stopped. This is necessary as the handler will need to use those registers in order to do its work. Many CPUs provide a single instruction to restore the entire CPU state in one go.

To make task switching work, the interrupt handler will take a copy of the context and store it inside the OS task state, which usually contains lots of other info about the running task, such as memory it has allocated, files it has open, etc. Then, the handler chooses another task based on some criteria (this is the scheduler). Finally, it copies the saved context from the state of the newly chosen task to wherever the CPU needs it, then tells the CPU to reload the context and leave the handler. The handler “returns” to running the newly selected task. This process continues ad infinitum and you get the illusion that your computer is doing lots of things at the same time.
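As a rough sketch of that save, choose and restore sequence, here is a toy model in C. Everything in it is made up for illustration: the two-field struct context stands in for a real register set, the task table is fixed-size, and round-robin is just an assumed scheduling policy.

```c
#include <stddef.h>

/* Stand-in for the real CPU state; a real context holds every register. */
struct context { unsigned long pc, sp; };

/* Per-task OS state. Real systems keep much more here (memory, files...). */
struct task { struct context saved; };

#define NTASKS 3
static struct task tasks[NTASKS];
static size_t current = 0;

/* The "scheduler": plain round-robin, purely an assumption for this sketch. */
static size_t schedule(void) {
    return (current + 1) % NTASKS;
}

/* The "interrupt handler". cpu points at the context the CPU saved when the
 * interrupt arrived; on return it holds the context the CPU should reload. */
size_t timer_interrupt(struct context *cpu) {
    tasks[current].saved = *cpu;      /* stash the interrupted task's context */
    current = schedule();             /* pick someone else to run */
    *cpu = tasks[current].saved;      /* hand the CPU the new task's context */
    return current;
}
```

Calling timer_interrupt() repeatedly with the same struct context walks through the tasks in turn, each one picking up with whatever pc and sp it had when it was last interrupted.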

The existing Unix-hosted version of AROS does fundamentally the same thing, but in a highly convoluted way. The main thing to note is that all tasks run inside a single Unix process, which then does some deep magic with Unix signals to make it look like interrupts and task switches are happening. The kind of magic employed is highly OS-specific, and although I don’t know exactly why it was done the way it was, I can guess that it was one of:

The facilities for user-space task switching weren’t available or were incomplete when it was first written (I know this was the case for Linux)
Originally AROS was much more tightly integrated with the Linux desktop (eg one AROS window per X11 window, etc)

Times have changed though, and so what I’m trying to do is make a new port that is designed to be much closer structurally to its native cousins. I’m realising this through a number of mechanisms provided by POSIX: threads, signals and the ucontext set of functions (though somewhat ironically these have been removed from the latest versions of POSIX and SUS).

What I do is this. I create a thread to mimic the function of the timer interrupt delivery circuit. It sits in a tight loop, waiting a little while then sending a signal to the “main” thread. This obviously mimics the interrupt that would exist on a real system, and causes the main thread to stop what it’s doing and jump to a signal handler.
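A minimal sketch of such a timer thread might look like the following. This is not the AROS code: the choice of SIGALRM, the 10ms period and the names on_tick, timer_thread and count_ticks are all assumptions made for the example, and the handler here only counts ticks rather than switching tasks.

```c
#define _POSIX_C_SOURCE 200809L   /* expose the POSIX signal and thread APIs */
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t tick_count;   /* only touched on the main thread */
static atomic_int timer_running;
static pthread_t main_thread;

/* Stand-in for the interrupt handler: just count the "interrupts". */
static void on_tick(int signo) {
    (void) signo;
    tick_count++;
}

/* The fake timer circuit: wait a little, prod the main thread, repeat. */
static void *timer_thread(void *arg) {
    struct timespec period = { 0, 10 * 1000 * 1000 };   /* 10ms, arbitrary */
    (void) arg;
    while (atomic_load(&timer_running)) {
        nanosleep(&period, NULL);
        pthread_kill(main_thread, SIGALRM);
    }
    return NULL;
}

/* Run the fake timer until the calling thread has seen `wanted` ticks. */
int count_ticks(int wanted) {
    struct sigaction sa = { 0 };
    pthread_t timer;

    sa.sa_handler = on_tick;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    tick_count = 0;
    main_thread = pthread_self();
    atomic_store(&timer_running, 1);
    if (pthread_create(&timer, NULL, timer_thread, NULL) != 0)
        return -1;

    while (tick_count < wanted)
        pause();                      /* sleep until the next signal arrives */

    atomic_store(&timer_running, 0);
    pthread_join(timer, NULL);
    return tick_count;
}
```

pthread_kill() is used rather than kill() so that the signal lands on one specific thread, which is what lets the main thread play the part of the CPU being interrupted.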

When a signal is delivered to a Unix process, the kernel saves the current process state (context) onto the stack and then calls a signal handler function. When the handler returns, the kernel reloads the state from the stack and continues from where it was. This sounds like almost exactly what we want, except Unix typically doesn’t provide a portable way to get at the saved state on the stack. The existing hosted AROS implementation for Linux uses a bunch of Linux-specific knowledge to dig into the stack and get the data it needs, but that’s obviously not portable. These days however, we have the ucontext functions which, while not without their quirks, are far more useful.

The prototypes look like this:

int getcontext(ucontext_t *ucp);
int setcontext(const ucontext_t *ucp);
void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);
int swapcontext(ucontext_t *oucp, ucontext_t *ucp);

For those who’ve seen setjmp() and longjmp() before, getcontext() and setcontext() will be quite familiar in function. getcontext() takes a copy of the current process state, including the CPU context, and drops it into the memory pointed to by ucp. setcontext() restores the process state and CPU context from whatever is saved in ucp, effectively causing a direct jump to the point just after the getcontext(). What this means is that you get the appearance of setcontext() never returning, whereas getcontext() can return multiple times. Interesting times indeed.
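To see the “returns multiple times” behaviour for yourself, something like this small sketch will do. The function name is made up, and the counter is declared volatile so that it lives in memory and keeps its value when the registers are reloaded.

```c
#include <ucontext.h>

/* Returns how many times execution passed the point just after getcontext(). */
int count_getcontext_returns(void) {
    ucontext_t ctx;
    volatile int passes = 0;   /* volatile: must survive the register reload */

    if (getcontext(&ctx) != 0)
        return -1;

    passes++;                  /* reached on the first return and after each jump */
    if (passes < 3)
        setcontext(&ctx);      /* on success this never returns; we jump back up */

    return passes;
}
```

Nothing in the return value of getcontext() distinguishes the first pass from the later ones, so the only way to tell them apart is a piece of state like passes that you maintain yourself.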

makecontext() takes an existing context and modifies it such that when setcontext() is called on it, it will jump to func with the specified arguments on the stack. You actually need to do a bit of fiddling inside ucp before calling it, to set up an alternate stack for the context to run on and so forth. For the most part this call is not particularly useful except when setting up a new context.

Finally, swapcontext() is an atomic context get-and-set. That is, it does this:

getcontext(oucp);
setcontext(ucp);

except that a later setcontext(oucp) will return to the point after the call to swapcontext().
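Here is a small sketch of makecontext() and swapcontext() working together, bouncing between a caller and a worker function running on its own stack. The names (worker, run_worker, trace) and the 64KB stack size are assumptions for the example. The trace array records the order in which things happen.

```c
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t caller_ctx, worker_ctx;
static int trace[4];
static int trace_len;

static void worker(void) {
    trace[trace_len++] = 2;
    swapcontext(&worker_ctx, &caller_ctx);   /* save ourselves, resume caller */
    trace[trace_len++] = 4;
    /* falling off the end resumes the context named in uc_link */
}

int run_worker(void) {
    size_t stack_size = 64 * 1024;           /* arbitrary size for the sketch */
    char *stack = malloc(stack_size);

    if (stack == NULL || getcontext(&worker_ctx) != 0)
        return -1;

    /* the "fiddling": give the new context a stack and somewhere to go next */
    worker_ctx.uc_stack.ss_sp = stack;
    worker_ctx.uc_stack.ss_size = stack_size;
    worker_ctx.uc_stack.ss_flags = 0;
    worker_ctx.uc_link = &caller_ctx;
    makecontext(&worker_ctx, worker, 0);

    trace_len = 0;
    trace[trace_len++] = 1;
    swapcontext(&caller_ctx, &worker_ctx);   /* runs worker up to its swapcontext */
    trace[trace_len++] = 3;
    swapcontext(&caller_ctx, &worker_ctx);   /* resumes worker just after it */

    free(stack);                             /* worker has returned via uc_link */
    return trace_len;
}
```

After run_worker() the trace reads 1, 2, 3, 4: each swapcontext() call “returns” only when something later switches back to the context it saved.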

Armed with this knowledge, we can now take a look at the (slightly simplified) implementation. The task switch “interrupt” handler is a two-stage process. The first part, which as far as the Unix kernel is concerned is the actual signal handler, looks like this:

ucontext_t irq_ctx;
char *irq_stack;    /* allocated at startup, see below */

void irq_trampoline (int signo, siginfo_t *si, void *vctx) {
    /* build a fresh context that will enter irq_handler() on its own stack */
    getcontext(&irq_ctx);
    irq_ctx.uc_stack.ss_sp = (void *) irq_stack;
    irq_ctx.uc_stack.ss_size = SIGSTKSZ;
    irq_ctx.uc_stack.ss_flags = 0;
    makecontext(&irq_ctx, (void (*)()) irq_handler, 0);

    /* save the interrupted task's context and switch to the handler */
    swapcontext((ucontext_t *) GetIntETask(SysBase->ThisTask)->iet_Context, &irq_ctx);
}

(irq_stack is initialised during startup as irq_stack = malloc(SIGSTKSZ))

So the signal from the timer thread arrives, and the current task gets interrupted and we arrive here. The getcontext() and makecontext() bit sets up a new context that, when called, will call the actual interrupt handler (ie the scheduler etc) and select a new task.

It’s the call to swapcontext() that is most interesting. What this does is save the current context into the current task structure, and switch to the interrupt handler proper. The handler calls into the scheduler to choose another task then calls setcontext() on its saved context to start it up. The subtlety is in the fact that when the saved context is later used to start the task up again, it will return to the point just after the call to swapcontext(), immediately drop off the end of the signal handler and head back to where it was.
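The same two-stage shape can be tried out without any signals or AROS structures at all. In the sketch below, which is only an illustration and not the real implementation, tasks give up the CPU voluntarily by calling a made-up yield() function that plays the role of the trampoline. The handler is a bare round-robin, and all the names and sizes are assumptions.

```c
#include <stdlib.h>
#include <ucontext.h>

#define NTASKS 2
#define STACK_SIZE (64 * 1024)

static ucontext_t task_ctx[NTASKS];   /* saved context, one per task */
static ucontext_t irq_ctx, main_ctx;
static char *irq_stack;
static int current;
static char log_buf[16];
static int log_len;

/* Stage two: runs on irq_stack, picks the next task and never returns. */
static void irq_handler(void) {
    current = (current + 1) % NTASKS;        /* toy round-robin scheduler */
    setcontext(&task_ctx[current]);
}

/* Stage one: save the running task and switch to the handler's context. */
static void yield(void) {
    getcontext(&irq_ctx);
    irq_ctx.uc_stack.ss_sp = irq_stack;
    irq_ctx.uc_stack.ss_size = STACK_SIZE;
    irq_ctx.uc_stack.ss_flags = 0;
    irq_ctx.uc_link = NULL;
    makecontext(&irq_ctx, irq_handler, 0);
    swapcontext(&task_ctx[current], &irq_ctx);
    /* we only get here when the handler later reloads this task */
}

static void task_body(int id) {
    for (int i = 0; i < 2; i++) {
        log_buf[log_len++] = (char) ('A' + id);
        yield();
    }
    /* returning resumes main_ctx through uc_link */
}

int run_tasks(void) {
    char *stacks[NTASKS] = { NULL, NULL };

    irq_stack = malloc(STACK_SIZE);
    if (irq_stack == NULL)
        return -1;
    for (int i = 0; i < NTASKS; i++) {
        stacks[i] = malloc(STACK_SIZE);
        if (stacks[i] == NULL || getcontext(&task_ctx[i]) != 0)
            return -1;
        task_ctx[i].uc_stack.ss_sp = stacks[i];
        task_ctx[i].uc_stack.ss_size = STACK_SIZE;
        task_ctx[i].uc_stack.ss_flags = 0;
        task_ctx[i].uc_link = &main_ctx;
        makecontext(&task_ctx[i], (void (*)(void)) task_body, 1, i);
    }

    current = 0;
    log_len = 0;
    swapcontext(&main_ctx, &task_ctx[0]);    /* back here when task 0 returns */
    log_buf[log_len] = '\0';

    for (int i = 0; i < NTASKS; i++)
        free(stacks[i]);
    free(irq_stack);
    return log_len;
}
```

The two tasks alternate, leaving "ABAB" in log_buf. The sketch stops as soon as the first task finishes, so the second is simply abandoned part-way through; a real scheduler would of course keep track of finished tasks.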

You might wonder why the more obvious method of using getcontext() to save the context then calling the scheduler directly isn’t used. The problem comes from the fact that when getcontext() “returns”, the caller has no way of knowing if it was the initial call to save the context, or if it was as a result of setcontext() being called. Without this knowledge, we’re left to this kind of trickery so that the only time we end up after the context being saved is when the context is reloaded.

(This is the opposite of setjmp(), which returns zero from its initial call and non-zero after a call to longjmp(). It perhaps makes the code easier to read to just have a call and test to determine what to do next, but it’s slightly slower and it would also result in the handler being run on the task stack, which means making the handler more complicated to make sure it rewinds correctly when the task is switched back. Or tricks can be played with sigaltstack(), which further complicates things.)

The actual implementation is naturally a little more complicated, mostly because it has to deal with so-called “system calls”, which is what happens when an application triggers a task switch (eg by calling Wait()). To allow that, each interrupt signal carries a numeric id that allows the trampoline and handler to determine what type of interrupt was requested. Then, when Exec wants to force a task switch, it will trigger the interrupt requesting it, which will run the scheduler with the main task “stopped”, as above, but with slightly different semantics. It doesn’t add much code though, and the technique is identical.
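One way a signal can carry a numeric id is POSIX sigqueue() together with an SA_SIGINFO handler. Whether the real port does it this way is not something covered here, so treat the following purely as a sketch of the idea. The id values, the choice of SIGUSR1 and the function names are all invented.

```c
#define _POSIX_C_SOURCE 200809L   /* expose sigqueue() and friends */
#include <signal.h>
#include <unistd.h>

enum { IRQ_TIMER = 1, IRQ_SYSCALL = 2 };   /* hypothetical interrupt ids */

static volatile sig_atomic_t last_irq;

/* Signal handler: pull the id out of the siginfo the kernel hands us. */
static void irq_entry(int signo, siginfo_t *si, void *vctx) {
    (void) signo;
    (void) vctx;
    last_irq = si->si_value.sival_int;
}

/* Send ourselves an "interrupt" tagged with irq; returns the id the handler saw. */
int raise_irq(int irq) {
    struct sigaction sa = { 0 };
    union sigval val;

    sa.sa_sigaction = irq_entry;
    sa.sa_flags = SA_SIGINFO;              /* ask for the three-argument handler */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    last_irq = 0;
    val.sival_int = irq;
    if (sigqueue(getpid(), SIGUSR1, val) != 0)
        return -1;

    /* in a single-threaded process with SIGUSR1 unblocked, the handler has
     * already run by the time sigqueue() returns */
    return last_irq;
}
```

A handler written this way can branch on the id, for instance treating IRQ_SYSCALL as a request from the task itself and IRQ_TIMER as an ordinary preemption.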

There’s still lots to be done to clean up the scheduler, which so far is a hack job of the hack job already present in the mingw32 port. The next thing to do is continue to work on the boot sequence, which is almost there but is just a tiny bit finicky at the moment (that’s a technical term). Next time I think I’ll write about the new host module setup which blows hostlib.resource out of the water (if you know what that is)!