What might be surprising is that Windows 7's multithreading changes did not deliver more of a performance punch. The explanation lies in what changed in how Windows 7 manages threads. The principal changes are increased processor affinity (see the sidebar, "How Nehalem processors and Windows 7 work together") and changes to the Windows kernel dispatcher lock. That eye-glazing term refers to a core aspect of modern operating systems: how the kernel prevents two threads from accessing the same data or resource at the same time.
Anytime a thread wants to access an item that might be claimed by another thread, it must use a lock to make sure that only one thread at a time can modify the item. Prior to Windows 7, when a thread needed to acquire a lock, its request had to go through a global locking mechanism, the kernel dispatcher lock. Because this lock was unique and global, it handled potentially thousands of requests from every processor on which Windows ran. As a result, the dispatcher lock was becoming a major bottleneck. In fact, it was a principal gating factor that kept Windows Server from running on more than 64 processors.
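To make the bottleneck concrete, here is a minimal user-space sketch of a single global lock guarding unrelated resources. It is an analogy only, not Windows kernel code; the names `global_lock`, `counters`, `bump`, and `run_global` are hypothetical and chosen for illustration.

```python
import threading

# Hypothetical analogy: one lock shared by every resource in the program.
global_lock = threading.Lock()
counters = {"a": 0, "b": 0}  # two unrelated resources


def bump(name, times):
    """Increment one counter; every update must first win the global lock."""
    for _ in range(times):
        with global_lock:
            counters[name] += 1


def run_global(threads_per_counter=4, times=10_000):
    """Start several threads per counter and return the final counts."""
    for key in counters:
        counters[key] = 0
    workers = [
        threading.Thread(target=bump, args=(name, times))
        for name in counters
        for _ in range(threads_per_counter)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return dict(counters)
```

The counts come out correct, but threads updating "a" must wait for threads updating "b" even though the two counters have nothing to do with each other. That cross-resource waiting is the kind of contention that grows as more processors are added.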
New locking mechanism
Windows 7 includes a wholly new mechanism that gets rid of the global locking concept and pushes the management of lock access down to the locked resources. This permits Windows 7 to scale up to 256 processors without performance penalty. On systems with only a few processors, however, the old kernel dispatcher lock was not overburdened, so this new mechanism provides no noticeable improvement in threading performance on desktops and small servers.
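The idea of pushing lock management down to the locked resources can be sketched in user space as follows. This is an illustrative analogy under stated assumptions, not a description of how the Windows 7 kernel is actually implemented; the `Resource` class and `run_per_resource` function are hypothetical names.

```python
import threading


class Resource:
    """Hypothetical resource that carries its own lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.value = 0

    def bump(self, times):
        """Increment this resource, contending only for its own lock."""
        for _ in range(times):
            with self.lock:
                self.value += 1


def run_per_resource(n_resources=2, threads_per_resource=4, times=10_000):
    """Start several threads per resource and return each final value."""
    resources = [Resource() for _ in range(n_resources)]
    workers = [
        threading.Thread(target=resource.bump, args=(times,))
        for resource in resources
        for _ in range(threads_per_resource)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return [resource.value for resource in resources]
```

Here threads contend only with other threads that want the same resource, so adding resources does not add contention on any one lock. The sketch shows the structure rather than a speedup: CPython's global interpreter lock serializes these threads anyway, and with only a handful of threads a single shared lock would rarely be contended, which mirrors the article's point about systems with few processors.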
The improved processor affinity discussed in the sidebar does not show up in the performance results. On runs with SMT disabled, this was expected: the benchmarks use all available resources, so no Turbo Mode boost is possible. When we ran the four-thread Viewperf benchmark with SMT enabled (giving the benchmark eight processing pipelines), the results were essentially unchanged; any differences were immaterial. This suggests that Turbo Mode works best in narrowly constrained settings rather than in the typical threaded applications we tested. Despite several requests, Microsoft would not comment on these results.