How to Share

This week, I’m going to write about a technical topic – something I learned a bit more about in the course of working on the group project thus far. My team’s project goal is to develop a Real-Time Operating System (RTOS) in a microcontroller. The topic I’m writing about today is semaphores.

I’m learning that one of the most challenging aspects of working on an operating system is managing issues of concurrency. Race conditions have the potential to emerge practically anywhere there is a resource shared by multiple tasks. Since it is the job of the operating system to provide and allocate resources to tasks, race conditions are a common concern.

What we generally need, in order to avoid race conditions, is the ability to ensure that a single actor can perform a series of operations involving some resource, without that resource being subject to any modification by any other actor for the duration of time those operations are performed. This series of operations is said to be the critical section. This requirement may seem simple enough, but for the operating system, providing sort of insurance is a challenge. The operating systems core scheduling work involves constant juggling from one task to another, and in a manner which is mostly asynchronous to the tasks’ logic and execution. This is because the operating system does most of its work in response to interrupts, which may fire at any time. Needless to say, this makes it quite difficult to have any certainty that a given task is will not be interrupted precisely while performing operations within in the critical section. Once this sort of interruption occurs, shared resources are subject to modification, unbeknownst to the interrupted task, which are precisely the circumstances in which race conditions can occur.

So how can we solve this problem. One possible approach might be to simply disable interrupts just prior to each critical section, and reenable them just after. If our task can be interrupted, then the critical section will execute linearly and our problem goes away, right? Well, I see two issues with this approach. Firstly, there are many, many critical sections of code and disabling interrupts at a high frequency might start to result in latency issues in the operating system. More importantly, however, this approach wouldn’t provide any insurance against concurrent modification of shared resources on a system with multiple processors. On such a system, the mere fact that a task is able to execute work through the critical section without interruption does nothing to guarantee that another task is not being run on a separate processor and doing work which interferes. Clearly, we need a better solution.

Enter the semaphore. At it’s core, the semaphore is little more than a convention – a shared mechanism tasks can use to take turns. We learned how to take turns in kindergarten, and, as it turns out, the underlying mechanism used to support semaphores, at least in the Cortex-M4 architecture I’ve been studying, isn’t much more complicated on a functional level than many other things we learn in kindergarten, like drawing inside the lines and counting to 20.

In order to share, we have to (a) know when it is someone else’s turn, (b) recognize when there is an opportunity to take our own turn, and (c) jump in and actually take our turn when the time is right. But what if there is a moment of hesitation between (b) and (c) and someone else jumps in and takes a turn instead of us? How might we respond? Would we be crestfallen and give up? Would we pretend like the usurper doesn’t exist and blindly grab for the shared resource, hoping we get to it first? Or would we politely wait until another opportunity for a turn arises?

Now I probably don’t need to spell this out, but the correct answer is politely wait. The designers of the Cortex-M4 architecture clearly graduated from kindergarten, because they applied the same principles in the assembly instructions which underpin the semaphore: LDREX, STREX, and CLREX. LDREX loads a value from memory into a chosen register, and, in so doing, sets an exclusive monitor. This is the Cortex-M4 equivalent of raising your hand in class. If you raised your hand while someone else was talking, the teacher tells you to put your hand down and so you will try again later. In that case, you checked the value loaded from LDREX, and that value signaled that another task is currently holding access to the shared resource, so you should wait and try again later. But let’s say you raise your hand and the teacher calls on you to speak. You’ve loaded the value with LDREX and checked it. The value signals the availability of a shared resource. Now is your chance! You muster up your courage and open your mouth to begin speaking. At this critical moment, one of two things can happen: either you speak without incident, or you are rudely interrupted before you can speak by Artos, who didn’t raise his hand and blurted out some comment totally irrelevant to the subject at hand. In the former case STREX was to successfully executed to store a new value at the same address previously loaded via LDREX – a value signaling that you now hold access to the shared resource. In the later case, a context switch executed by the Operating System, involving use of CLREX, or a memory access by another processor cleared the exclusive monitor which was set after LDREX executed. As a result, STREX, when attempted by the current task fails to update the target memory address, and instead sets a flag in another register signaling the failure. Referencing this flag, then, the responsible student politely waits for the disruption caused by Artos to subside, and politely returns to LDREX, again awaiting their turn to speak.

Comments

Leave a Reply Cancel reply