There is one problem with the new Intel CPUs that becomes more noticeable with quad-socket configurations. As mentioned earlier, the PCIe bus is on the CPU socket so with four sockets you have four PCIe buses with 40 lanes each for a total of 160 lanes of 1Gbps PCIe bandwidth. That is a lot of I/O bandwidth, but looking a bit deeper there is a problem:
The QPI connections between sockets is a dual-channel 12.8Gbps channel for a total performance of 25.6Gbps
The PCIe express bandwidth of a socket is 40x 1Gbps per lane or 40 Gbps of PCIe bandwidth to the socket.
Problems quickly arise when PCIe bandwidth exceeds 25.6Gbps and the process requesting access to the PCIe bus is not on the socket with the bus where the access is being requested. Some of the workarounds attempted would lock processes on sockets with the PCIe bus that needs to be read or written. But it did not work for all applications. For example, those with data coming in and going out of multiple locations such as a striped file system are affected because you cannot break the request and move each request to each PCIe bus.
The real-world performance for general purpose applications running on a four-socket system is likely an estimated 90% of the QPI bandwidth between sockets (or 23Gbps) unless the data goes out on the socket with the PCIe bus. Every fourth I/O, if they are equality distributed, will run at 40Gbps, so the average performance would be (3x23Gbps +40Gbps)/4 or an average performance of about 27.25Gbps per socket for a quad-socket system.
This is, of course, the average based on equal distribution of the processes and I/O to the PCIe bus. A process that has PCIe processor affinity will significantly improve that average, but it is often difficult to architect and meet the requirements of putting every task on a PCIe bus and ensuring that the process runs on the CPU with that bus. The probability of this limitation is higher with a quad-socket system than with a dual-socket system.
The diagram below shows an example of a dual-socket system that, though having the same issues, reduces the potential of hitting that architectural limitation.
My estimate for performance for a dual-socket system is (23Gbps +40Gbps) or average socket performance of 31.5Gbps. On a dual-socket system it is much easier to architect the system so that you can put the right I/O on the right CPU and achieve near-peak performance.
CPU Conclusions Are Counter-Intuitive