To gather meaningful performance metrics, it’s usually a good idea to run several iterations of the same test and average the numbers in some way, to eliminate noise from the results. This is true of sequential and fine-grained parallel performance analysis alike, but data locality can add enough noise to your parallel tests that you’ll want to do something extra about it. For example, if iteration #1 enjoys some form of temporal locality left over from iteration #0, then all but the first iteration receive an unfair advantage. This advantage isn’t usually present in the real world – most library code isn’t called over and over again in a tight loop – and could cause test results to appear rosier than what customers will actually experience. Therefore, we probably want to get rid of it.
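As a rough sketch of what that might look like in practice (the harness, the 64MB scratch buffer, and the cache-eviction trick are my own illustration, not a prescribed methodology):

```csharp
using System;
using System.Diagnostics;

static class Benchmark
{
    // Scratch buffer sized well beyond a typical last-level cache (the size
    // here is an arbitrary illustrative choice), used to evict the test's
    // working set between runs.
    private static readonly byte[] s_scratch = new byte[64 * 1024 * 1024];

    // Runs 'action' for 'iterations' timed runs, disturbing locality between
    // runs so later iterations don't inherit warm caches from earlier ones.
    public static double Measure(Action action, int iterations)
    {
        double totalMs = 0.0;
        for (int i = 0; i < iterations; i++)
        {
            EvictCaches();
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            totalMs += sw.Elapsed.TotalMilliseconds;
        }
        return totalMs / iterations;   // average of the timed runs
    }

    private static void EvictCaches()
    {
        // Walking a large buffer pushes the previous iteration's data out of
        // the processor caches -- a crude approximation of a "cold" start.
        for (int i = 0; i < s_scratch.Length; i += 64)
            s_scratch[i]++;
    }
}
```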
I wrote an article that appears in the May 2007 issue of MSDN Magazine. It’s now online for your reading pleasure:
Late last summer, an interesting issue with traditional optimistic read-based software transactional memory (STM) systems surfaced. We termed this “privatization” and there has been a good deal of research on possible solutions since then. I won’t talk about solutions here, but I will give a quick overview of the problem and a pointer to recent work.
One of the motivations for doing a new reader/writer lock in Orcas (ReaderWriterLockSlim) was to do away with one particular scalability issue that customers commonly experienced with the old V1.1 reader/writer lock type (ReaderWriterLock). The basic issue stems from exactly how the lock decides when (or in this case, when not) to wake up waiting writers. Jeff Richter’s MSDN article from June of last year highlights this problem. This of course wasn’t the primary motivation, but it was another straw on the camel’s back.
The CLR commits the entire reserved stack for managed threads. This is 1MB per thread by default, though you can change the value with compiler settings, a PE file editor, or by changing the way you create threads. We’ve been having a fascinating internal discussion on the topic recently, and I’ve been surprised by how many people were unaware that the CLR engages in this practice. I figure there are bound to be plenty of customers in the real world who are also unaware.
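One of those knobs is the Thread constructor overload that takes a maximum stack size; here’s a small sketch (the 256KB figure and the Worker method are just illustrative choices):

```csharp
using System;
using System.Threading;

class Program
{
    static void Main()
    {
        // Ask for a 256KB stack instead of the default 1MB reservation.
        // Because the CLR commits the entire reserved stack for managed
        // threads, this reduces committed memory per thread as well.
        Thread t = new Thread(Worker, 256 * 1024);
        t.Start();
        t.Join();
    }

    static void Worker()
    {
        Console.WriteLine("Running with a smaller stack reservation.");
    }
}
```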
In 2.0 SP1, we changed the threadpool’s default maximum worker thread count from 25/CPU to 250/CPU.
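You can observe, or override, this maximum yourself. A quick sketch (restoring the old cap is shown purely as an example for code that relied on it as a throttle):

```csharp
using System;
using System.Threading;

class Program
{
    static void Main()
    {
        int workerMax, ioMax;
        ThreadPool.GetMaxThreads(out workerMax, out ioMax);
        // On 2.0 SP1 this should report roughly 250 * Environment.ProcessorCount
        // worker threads, versus 25 * ProcessorCount before the change.
        Console.WriteLine("Max worker threads: {0}, max I/O threads: {1}",
                          workerMax, ioMax);

        // Code that relied on the old, lower cap as a throttle can restore it
        // explicitly (this value is just the pre-SP1 default).
        ThreadPool.SetMaxThreads(25 * Environment.ProcessorCount, ioMax);
    }
}
```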
A reader asked for clarification on a past article of mine, regarding my claim that one particular variant of the double-checked locking pattern won’t work under the .NET 2.0 memory model. The confusion arose because my advice seems to contradict Vance’s MSDN article on the topic.
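For reference, this is the commonly cited form of the pattern in C#, using a volatile field; it is not necessarily the variant in question, just a reminder of the basic shape:

```csharp
public sealed class Singleton
{
    // 'volatile' keeps the publication of the reference from being reordered
    // with the writes that initialize the object, and keeps readers from
    // seeing a stale null.
    private static volatile Singleton s_instance;
    private static readonly object s_lock = new object();

    public static Singleton Instance
    {
        get
        {
            if (s_instance == null)            // first check, no lock taken
            {
                lock (s_lock)
                {
                    if (s_instance == null)    // second check, under the lock
                        s_instance = new Singleton();
                }
            }
            return s_instance;
        }
    }

    private Singleton() { }
}
```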
Somebody recently asked in a blog comment whether the new ReaderWriterLockSlim uses a full barrier (a.k.a. a two-way fence, CMPXCHG, etc.) on lock exit. It does, and I claimed that “it has to”. It turns out that my statement was actually too strong: doing so prevents a certain class of potentially surprising results, so it’s a matter of preference to the lock designer whether those results are surprising enough to justify the cost of a full barrier. Vance Morrison’s “low lock” article, for instance, shows a spin lock that doesn’t make this guarantee. And, FWIW, this is also left unspecified in the CLR 2.0 memory model. Java’s memory model permits non-barrier lock releases too, though I will also note that the JMM is substantially weaker in some areas than the CLR’s.
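To make the difference concrete, here is a toy spin lock sketch showing both flavors of exit; this is my own illustration and not how ReaderWriterLockSlim is actually implemented:

```csharp
using System.Threading;

// A toy spin lock, shown only to illustrate the two exit choices.
public sealed class ToySpinLock
{
    private int m_taken;   // 0 == free, 1 == held

    public void Enter()
    {
        // CMPXCHG: a full (two-way) barrier on the way in.
        while (Interlocked.CompareExchange(ref m_taken, 1, 0) != 0)
            Thread.SpinWait(1);
    }

    public void ExitWithFullBarrier()
    {
        // Interlocked.Exchange is itself a full fence, which rules out the
        // surprising reorderings at the cost of an interlocked operation.
        Interlocked.Exchange(ref m_taken, 0);
    }

    public void ExitWithoutFullBarrier()
    {
        // A volatile (release-only) write: earlier operations in the critical
        // section can't move below it, but it is not a two-way fence. This is
        // the style of the spin lock in Vance's "low lock" article.
        Volatile.Write(ref m_taken, 0);
    }
}
```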
In Orcas, we offer a new reader/writer lock: System.Threading.ReaderWriterLockSlim.
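Basic usage looks something like the following; the cache-style class and its fields are just a contrived example of a read-mostly data structure:

```csharp
using System.Collections.Generic;
using System.Threading;

// A contrived read-mostly cache, just to show the lock's basic API surface.
public sealed class Cache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> m_map = new Dictionary<TKey, TValue>();
    private readonly ReaderWriterLockSlim m_lock = new ReaderWriterLockSlim();

    public bool TryGet(TKey key, out TValue value)
    {
        m_lock.EnterReadLock();          // many readers may hold this at once
        try
        {
            return m_map.TryGetValue(key, out value);
        }
        finally
        {
            m_lock.ExitReadLock();
        }
    }

    public void Set(TKey key, TValue value)
    {
        m_lock.EnterWriteLock();         // writers are exclusive
        try
        {
            m_map[key] = value;
        }
        finally
        {
            m_lock.ExitWriteLock();
        }
    }
}
```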
I previously mentioned the X86 JIT contains a “hack” to ensure that thread aborts can’t sneak in between a Monitor.Enter(o) and the subsequent try-block. This ensures that a lock won’t be leaked due to a thread abort occurring in the middle of a lock(o) { S1; } block. In the following example, that means an abort can’t be triggered at S0:
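A sketch of the expansion being described (the method wrapper is mine; S0 and S1 are the placeholders from the prose, and the original post’s example may differ in its exact shape):

```csharp
using System.Threading;

static class LockExpansion
{
    static void LockedRegion(object o)
    {
        Monitor.Enter(o);
        // S0: an abort delivered here (after Enter succeeded, but before the
        // try is entered) would skip the finally and leak the lock; the X86
        // JIT guarantees aborts can't be injected at this point.
        try
        {
            // S1: the body of the lock(o) { S1; } block goes here.
        }
        finally
        {
            Monitor.Exit(o);
        }
    }
}
```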