August 12, 2014, 2:06 PM — Nvidia is finally discussing its Project Denver combined CPU/GPU, giving a speech at the Hot Chips conference taking place at Stanford University this week, and it could be the most powerful ARM processor ever made (until the next one).
Nvidia's beefiest chip to date is the Tegra K1, with four ARM Cortex-A15 cores. The next version of the K1 will replace the four 32-bit A15 cores with two 64-bit Denver cores. Up to now, that's about all that's been known about Denver.
Tirias Research, which follows the semiconductor market, just published a white paper on the chip, having gotten a sneak preview prior to Hot Chips. Nvidia also published a blog post discussing the new processor.
The biggest development, according to Tirias, is dynamic code optimization. The core microarchitecture of the CPU has an in-order pipeline, which uses much less power but can also stall when an instruction has to wait for data that is not yet available, holding up everything queued behind it. Out-of-order processors are faster but also use much more power.
If Denver finds a repetitive code sequence, an optimizer analyzes it to find a more efficient way to run it the next time it comes around. It's all very complicated, but what it essentially does is deliver the benefits of out-of-order execution without using a power-hungry out-of-order pipeline. Nvidia says this effectively doubles the performance of the base-level hardware through optimization while improving energy efficiency.
The optimized code is then stored in a 128MB optimization cache that is invisible to the OS and apps. If an app needs the optimized code again, it is fetched from the cache.
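To make the idea concrete, here is a toy Python model of the "optimize hot code once, then reuse it from a cache" pattern described above. It is only a sketch, not Nvidia's implementation: the `OptimizationCache` class, the `HOT_THRESHOLD` value and the `drop_nops` "optimizer" are all hypothetical names invented for illustration, and the real optimizer works on machine code, not Python lists.

```python
from collections import Counter

# Hypothetical: how many times a sequence must run before it counts as "hot".
HOT_THRESHOLD = 3


class OptimizationCache:
    """Toy model: count runs per code address, optimize hot code once, reuse it."""

    def __init__(self, optimize, threshold=HOT_THRESHOLD):
        self.optimize = optimize      # function that rewrites a code sequence
        self.threshold = threshold
        self.run_counts = Counter()   # address -> number of unoptimized runs
        self.cache = {}               # address -> optimized code
        self.hits = 0                 # times the cached version was reused

    def fetch(self, address, code):
        """Return the code that should run for the sequence at `address`."""
        if address in self.cache:
            self.hits += 1
            return self.cache[address]
        self.run_counts[address] += 1
        if self.run_counts[address] >= self.threshold:
            # The sequence is hot: optimize it now so later runs use the cache.
            self.cache[address] = self.optimize(code)
        return code


def drop_nops(code):
    """Stand-in 'optimizer' that removes instructions doing no work."""
    return [op for op in code if op != "nop"]


if __name__ == "__main__":
    cache = OptimizationCache(drop_nops)
    loop_body = ["load", "nop", "add", "nop", "store"]
    for run in range(1, 6):
        executed = cache.fetch(0x1000, loop_body)
        print(run, executed)
```

In this sketch the first three runs execute the original sequence, and every run after that gets the shorter cached version without the program itself ever knowing a cache exists.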
Also, Denver uses a seven-way superscalar execution engine, which makes it capable of processing up to seven operations per clock cycle, the same as a Haswell processor. Most ARM processors use three, maybe four pipelines. And combined with the code optimization, you will get faster execution due to the elimination of unnecessary instructions.
Nvidia licensed the A15 rather than the 64-bit A57 for a reason. The A15 license allowed it to build a custom microarchitecture, as long as the final processor maintains ARMv8 instruction set compatibility. So what Denver does is convert ARMv8 instructions into its own internal ISA on the fly. This allows for even more dynamic code optimization.
As it converts the instructions, Denver analyzes the code just before execution and looks for places where it can bundle together multiple instructions that don't depend on one another, so they can execute in parallel.
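The bundling step can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not Denver's actual algorithm: the `Instr` type and `bundle` function are hypothetical, instructions are reduced to one destination register plus source registers, and the grouping is a deliberately conservative in-order pass that starts a new bundle whenever an instruction touches a register already used in the current one, or when the bundle reaches the seven-wide limit.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical instruction: writes `dest`, reads every register in `srcs`."""
    dest: str
    srcs: tuple


def bundle(instrs, width=7):
    """Greedily group consecutive, mutually independent instructions.

    A new bundle starts when the next instruction reads a register written
    in the current bundle, writes a register the bundle already reads or
    writes, or when the bundle already holds `width` instructions.
    """
    bundles = []
    current, written, read = [], set(), set()
    for ins in instrs:
        reads_pending_write = any(s in written for s in ins.srcs)
        clobbers = ins.dest in written or ins.dest in read
        full = len(current) >= width
        if current and (reads_pending_write or clobbers or full):
            bundles.append(current)
            current, written, read = [], set(), set()
        current.append(ins)
        written.add(ins.dest)
        read.update(ins.srcs)
    if current:
        bundles.append(current)
    return bundles


if __name__ == "__main__":
    program = [
        Instr("r1", ("r0",)),       # independent of the next instruction
        Instr("r2", ("r0",)),       # can share a bundle with the one above
        Instr("r3", ("r1", "r2")),  # needs r1 and r2, so it starts a new bundle
        Instr("r4", ("r0",)),       # independent of r3, joins its bundle
    ]
    for group in bundle(program):
        print([ins.dest for ins in group])
```

Here four instructions collapse into two bundles, which is the kind of parallelism a wide in-order core can exploit without reordering hardware.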
Nvidia provided the benchmarks for the Tirias paper, so take them with a grain of salt. They show Denver outperforming Intel's Bay Trail Atom, Apple's A7 processor and, in some cases, a Haswell-generation Celeron. Celeron is the low end of Intel's x86 line, but that's still impressive if the numbers hold up. We'll see when hardcore testers like AnandTech finally get their hands on it later this year. For now, it looks like Chromebooks (which use the Tegra K1) are about to get a lot more powerful.