Somebody recently asked in a blog comment whether the new ReaderWriterLockSlim uses a full barrier (a.k.a. two-way fence, CMPXCHG, etc.) on lock exit. It does, and I claimed that “it has to”. It turns out that my statement was actually too strong. Doing so prevents a certain class of potentially surprising results, so it’s a matter of preference to the lock designer whether these results are so surprising as to incur the cost of a full barrier. Vance Morrison’s “low lock” article, for instance, shows a spin lock that doesn’t make this guarantee. And, FWIW, this is also left unspecified in the CLR 2.0 memory model. Java’s memory model permits non-barrier lock releases, though I will also note the JMM is substantially weaker in areas when compared to the CLR’s.
Here’s an example of a possibly surprising result that non-barrier releases can cause:
Initially x == y == 0.
Thread 0 Thread 1
=============== ===============
lock(A); lock(B);
x = 1; y = 1;
unlock(A); unlock(B);
t0 = y; t1 = x;
Is it possible after executing both that: t0 == t1 == 0?
It is simple to reason that this is impossible with sequential consistency. In SC the only way that t0 == 0 is if Thread 0’s t0 = y statement (and therefore x = 1, assuming a memory model in which writes happen in order) were to occur before Thread 1’s y = 1 (and therefore t1 = x). In this case, t1 = x must subsequently see t1 == x == 1, otherwise the history contradicts SC. The only other possibility is that Thread 1’s t1 = x happens before Thread 0’s x = 1 and therefore also t0 = y, in which case it must be the case that the subsequent t0 = y by Thread 0 yields t0 == y == 1. In both cases, either x or y must be seen as 1. (Interleavings are clearly possible that result in x == y == 1.)
The CLR 2.0’s memory model guarantees that, if the unlock incurs a barrier, the same SC reasoning applies. Unfortunately, if the unlock is not a barrer, then in both cases the load of x or y may occur before the write buffer has been flushed, meaning the write to x or y and the unlocking write itself, possibly leading to t0 == t1 == 0. This happens even on relatively strong processor consistency memory models such as X86, and on weaker ones such as IA-64 (even when all loads are acquire and all stores are release, which only happens for volatile CLR fields). To ensure the write buffer has been flushed before the read happens, the unlock statement must flush the buffer (or an explicit barrier is needed), accomplished with CMPXCHG on X86 ISAs.
Many would argue that, because the locks taken are different between the two threads, SC does not apply and therefore implementing unlock as a non-barrier write is legal. JMM takes this stance. This actually seems like a fine argument to me, and after thinking about it for a bit, it’s probably what I would choose if I were defining a memory model. But the CLR 2.0 MM is generally stronger than most, so people might actually depend on this and expect it to work. This could cause Monitor-based code to break when moved to alternative lock implementations that don’t issue full barriers at release time. This is just one example of why it’d be really great to have a canonical specification for the CLR’s MM. At least we’d then have a leg to stand on when faced with tricky compatability trade-offs some day down the line.