
I just read the chapter on locks in the book Operating Systems: Three Easy Pieces, and it discusses a lock implementation (see Figure 28.9) that relies on a test-and-set hardware instruction.

The textbook emphasizes that both hardware support and OS support are necessary to build an efficient locking mechanism. Wikipedia echoes this sentiment:

Locks typically require hardware support for efficient implementation. This support usually takes the form of one or more atomic instructions such as "test-and-set", "fetch-and-add" or "compare-and-swap". These instructions allow a single process to test if the lock is free, and if free, acquire the lock in a single atomic operation.

I think I've come up with an alternative lock implementation that doesn't require hardware support. It seems to have comparable performance overhead while still guaranteeing mutual exclusion, so I'm making this post to find out if I'm missing something.

Textbook Lock Implementation

I'll first summarize the textbook implementation. For setup, we need to:

  1. Maintain a single flag (set to 0 or 1) and a queue (of thread IDs).
  2. Maintain a separate flag for a "guard" lock, which will be a simple spin lock. Because the guard lock is meant to protect a relatively small critical section, the performance overhead of a spin lock won't be as pronounced.

When a thread wants to acquire a lock, it will do the following:

  1. First, spin with the test-and-set instruction until you acquire the guard lock.
  2. Once acquired, check if the flag (of the main lock) is set to 0. If it is, simply set it to 1, and the lock has been acquired! If the flag is set to 1, proceed to step 3.
  3. Push the current thread ID onto the queue.
  4. "Park" the current thread. This is a Solaris primitive that puts the calling thread to sleep.

On the other hand, when a thread wants to release a lock, it will do the following:

  1. Spin until you acquire the guard lock.
  2. If the queue is empty, set the flag to 0, and you're done. If the queue is nonempty, proceed to step 3.
  3. Pop a thread ID from the queue, and "unpark" the corresponding thread (wake it up).

My Implementation (without hardware support)

First, some setup:

  1. Every lock must be initialized with some unique lock ID.
  2. Inside the OS, maintain a "lock table." For each lock ID, the lock table will maintain a flag (0 or 1) and a queue (of thread IDs).

Then, when you want to acquire the lock:

  1. Make a system call (which might be called something like lock) and pass in the lock ID as an argument.
  2. Enter the OS. Since the OS manages scheduling, we guarantee mutual exclusion for the following steps.
  3. Search the lock table for the specified lock ID. If the flag is 0, set it to 1. Then return; the lock has been acquired! If the flag is 1, proceed to step 4.
  4. Push the calling thread's ID onto that lock's queue.
  5. Block the calling thread, so for now, it cannot be scheduled.

To release a lock:

  1. Again, make a system call with the lock ID as an argument.
  2. Entering the OS (again guaranteeing mutual exclusion), search the lock table for the specified lock ID. If the queue is empty, set the flag to 0 and you're done. If not, proceed to step 3.
  3. Pop a thread from the queue, and unblock it. It should be scheduled soon.

Note that when we are acquiring a lock, a context switch is not necessary if the lock is available, so the performance overhead should be minimal.

It's likely that I'm missing a small detail that will make me look stupid, but that's why I'm asking this question in the first place.

  • I'm not an expert on this stuff, but I think making a system call is a lot slower than a test-and-set instruction, so in the common-case path (where we acquire the lock without contention), your alternative seems like it will be slower.
    – D.W.
    Commented Nov 10, 2020 at 6:26
  • In step 2, are you proposing that the entire OS stops doing everything until the locks are taken care of? How would this be efficient on modern hardware with multiple CPU cores, all of which might be in kernel mode simultaneously?
    Commented Nov 10, 2020 at 11:40

2 Answers


Since the OS manages scheduling, we guarantee mutual exclusion for the following steps.

But how do you guarantee mutual exclusion inside the OS?

On a single-processor machine¹, you know that only the kernel code serving the lock system call is executing; all other threads are in a known state. However, the kernel code can be interrupted at any time by a hardware interrupt, for example the timer interrupt that tells the kernel it's time to schedule a different thread. You can implement locks just by manipulating the bits in the lock table, but only if the interrupt handlers never touch the lock table themselves. So it is possible to implement a lock this way, but it imposes design constraints on the kernel.

On a multi-processor machine, all you know is that this processor is serving the lock system call. The other processors could be manipulating the lock table at the same time. At step 3, you read an entry from the lock table and then possibly write to that entry. Another processor could have read or written the entry between your read and your write². Avoiding this race condition is precisely why processors have instructions such as compare-and-swap that combine a read and a write.

Even on a single-processor machine, making a system call is a heavyweight operation compared to a single instruction such as compare-and-swap. It's no good for high-speed concurrency.

¹ By single-processor, I mean a machine that executes a single thread at a time. In this context, a machine with a single piece of silicon that contains multiple logical cores is a multi-processor machine.
² And in the real world, it can get worse, because processors don't always have a consistent view of memory, so two processors could, for example, take the same lock independently even if their read/write sequences don't overlap. But even with consistent memory, interleaving read/write sequences spells doom for a naive lock implementation.


OS locks require hardware support because every instruction in the ISA takes multiple clock cycles to complete. Each instruction goes through an execution cycle with fetch, decode, execute, and write-back stages (some processors have more than 20 stages). Moreover, most processors today are pipelined: since each stage of the instruction cycle uses different hardware, multiple instructions are in flight (at different stages) at the same time. The situation is further complicated by the fact that an instruction's memory operand can live anywhere in the memory hierarchy (the L1, L2, or L3 cache, or main memory), so reading an operand can take a long time (hundreds of cycles for a main-memory read).

A lock is a memory variable, so modifying it can take multiple cycles (depending on the number of pipeline stages and on where the lock variable sits in the memory hierarchy). With hardware support (atomic instructions), the processor ensures that no other instruction can access the lock's memory location until the instruction that requested access has completed all of its stages. An atomic instruction thus ensures a total order among all instructions that modify the lock variable.

Even spin-lock functions use an atomic instruction to gain exclusive access to the spin-lock variable.

