AMD has gone all-in on ‘combined processing’ in recent years, and their design philosophy here has been deliberate and focused. Although they did release a niche, pre-overclocked FX CPU with a 5 GHz turbo, they haven’t launched a new ‘pure’ CPU since Vishera in 2012. And it should be no surprise that AMD has indicated it will not be releasing any further products in the FX lineup.
The payoff hasn’t quite arrived yet, but Kaveri is designed to take full advantage of a new style of coding when it does. AMD has added new features that lean even further in the direction of combined processing, including an all-new way of allowing the GPU and CPU to access the same memory.
Kaveri Review – What’s New?
AMD’s first Kaveri product for the desktop is the A10 7850K APU. It features a long-awaited switch to a 28nm fab process on the desktop side, where AMD had been at 32nm since Bulldozer in 2011. The A10 7850K is a 12 “compute core” APU, which shows just how intent AMD is on counting GPU and CPU cores equally.
Specifically, the A10 7850K includes 8 GPU cores and 4 CPU cores. As you may know, modern AMD CPU modules are designed with two integer schedulers and one floating-point scheduler each, but for the most part we can treat each module as a dual-core unit. Since the A10 7850K has two modules, it’s effectively a quad-core CPU.
Kaveri introduces quite a few new features, both on the CPU and GPU side, when compared to previous generation Richland or Trinity APUs. Let’s take a look at how Kaveri is put together.
The first thing you’ll notice is that the GPU side takes up much more space than the CPU side; 47% of the die area, in fact. Compare this die map to the Trinity die map, or to Intel’s Haswell die map, for a better sense of scale. AMD packed in 8 GCN cores, bringing their APU architecture up to date with the latest GPU lineup. Each core contains 64 “Radeon Cores,” or stream processors, for 512 in total. The closest desktop GPU to this shader configuration is the OEM-only R9 255. On the A10 7850K it’s designated simply as “R7,” though, and it can be paired with other R7 video cards in Dual Graphics mode.
If the Steamroller CPU module looks familiar, that’s because not much has changed compared to Piledriver or even Bulldozer. The way the architecture works is identical, but a lot of effort has been put into making it more efficient, with fewer cache misses, fewer mispredictions, and better scheduling in the Fetch and Decode portions of the module. Max-width dispatches in the integer scheduler have been improved by 25%, and major improvements have been made to the L1 cache’s store handling. Overall, this leads to around a 10-20% performance improvement over Richland at the same clock speed. Single-core performance has improved, and so has efficiency in terms of performance per watt.
We don’t talk too much about power consumption in our desktop CPU/APU coverage unless it’s a real issue (like it was with Bulldozer). It is worth noting, though, that AMD has been able to improve performance per watt. We’ll take a closer look at this later in the review. The A10 7850K has a 95W TDP, slightly lower than Richland’s 100W TDP.
As for the GPU modules, if you are familiar with the GCN architecture, you know how this works. Kaveri uses the latest version, the same one found in the R9 290 and 290X. This brings TrueAudio technology, which delivers convincing 3D positional audio even over stereo speakers, along with the latest video encoding and decoding acceleration blocks, VCE2 and UVD4.
What is HSA? True Combined Processing
There has been a lot of talk about HSA, its various features, and what it all means, and most people’s first question is what the acronym itself stands for. HSA is “Heterogeneous System Architecture”. In essence, it’s an entire philosophy of combining the power of the CPU and GPU to make full use of what each has to offer.
There are two key features of Kaveri that make HSA work: hUMA (Heterogeneous Uniform Memory Access) and hQ (Heterogeneous Queuing).
To understand how HSA is made possible, and how the CPU and GPU can be combined at a closer level than before, we’ll need to go back and look at the evolution of AMD’s platforms. Specifically at how the GPU and CPU interact with memory. We’ll have to go back in time, and dust off some old AMD slides to fully understand this.
The first product to be called an “APU” was Llano, so we’ll start there. What made Llano different from the previous generation of CPUs with integrated graphics? On those earlier products, the IGP actually resided on the Northbridge die itself, rather than on the CPU die, and it communicated with the CPU and memory via the Northbridge over a HyperTransport link.
Llano moved the GPU cores onto the CPU die, along with the DDR3 memory controller and Northbridge, although everything remained logically separate. Bandwidth between the GPU and memory improved, but requests still had to take an extra hop to get there. The “Fusion Compute Link” was the first sign of AMD’s focus on combined processing, as it gave the GPU a direct 128-bit bus to the CPU.
Trinity introduced the Unified Northbridge, which connects to the GPU by a widened 256-bit Fusion Compute Link (labeled the “Control” link on this diagram, for some reason) that passes through an IOMMU. As you can see, the major change was that the memory controller became integrated with the Northbridge, along with the link controller. The key, though, is that everything is ‘smarter’ thanks to the System Request Interface, which prioritizes certain tasks instead of letting everything clog up the various busses.
At this point, there is still a separate partition of memory going to either the CPU or GPU; they can’t directly access each others’ portions.
Kaveri adds a third 256-bit FCL, and the end result is that the CPU and GPU both have access to the same pool of memory and, more importantly, are coherent with each other: each side is aware of what the other is working on. If the GPU needs access to a portion of memory the CPU is currently working on, the CPU simply hands over a pointer to that address, and the GPU continues from there. The CPU can access the results afterward, with no need to copy data in either direction.
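The difference between the old split-memory model and hUMA’s pointer passing can be sketched in a few lines of plain Python. This is purely conceptual (not real driver or HSA runtime code, and all the names are illustrative): the point is that the legacy path copies data into a separate partition and back, while the hUMA path hands over the same memory and works in place.

```python
# Conceptual sketch only: simulating copy-based offload vs. hUMA-style
# zero-copy sharing. A Python list stands in for a memory region.

def legacy_offload(cpu_memory, work):
    """Pre-hUMA model: data is copied to the GPU's partition and back."""
    gpu_memory = list(cpu_memory)            # copy CPU data into GPU partition
    results = [work(x) for x in gpu_memory]  # GPU computes on its own copy
    return list(results)                     # copy results back to CPU memory

def huma_offload(shared_memory, work):
    """hUMA model: the CPU passes a pointer into the shared pool; the GPU
    works through the same addresses, so no copies are ever made."""
    for i in range(len(shared_memory)):
        shared_memory[i] = work(shared_memory[i])
    return shared_memory                     # same object; nothing was copied

data = [1, 2, 3, 4]
copied = legacy_offload(data, lambda x: x * 2)   # -> [2, 4, 6, 8], a new buffer

shared = [1, 2, 3, 4]
out = huma_offload(shared, lambda x: x * 2)
assert out is shared                             # zero-copy: same memory region
```

Both paths compute the same answer; the hUMA version just never pays for the round-trip copies, which is exactly where the savings come from on large buffers.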
As you probably know, the GPU is more suited for certain types of processing than the CPU. If you compare mining cryptocurrency on a CPU to a GPU for instance, the difference is huge. If you were into Folding@Home when GPU processing was first introduced, you know first hand what a difference it makes.
And that is the key to HSA and the APU in general. The ability to hand over parallel workloads to the GPU and serial workloads to the CPU, and let them work together coherently, is a pretty huge deal.
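That division of labor can be illustrated with a small Python sketch. This is an assumption-laden analogy, not HSA code: a thread pool stands in for the GPU, and the function names are made up. The point is the split itself: dependent, branchy work runs sequentially (the CPU’s strength), while independent per-element work is handed off to many workers at once (the GPU’s strength).

```python
# Illustrative only: the "serial on CPU, parallel on GPU" split, with a
# thread pool standing in for the GPU's many cores.
from concurrent.futures import ThreadPoolExecutor

def serial_step(n):
    # Sequential, dependent work: each value depends on the previous one,
    # so this can't be spread across many cores.
    values, x = [], n
    while x != 1:
        values.append(x)
        x = x // 2 if x % 2 == 0 else 3 * x + 1
    values.append(1)
    return values

def parallel_step(values):
    # Independent per-element work: every element can be processed at the
    # same time, which is exactly what a GPU is built for.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda v: v * v, values))

chain = serial_step(6)            # [6, 3, 10, 5, 16, 8, 4, 2, 1]
squares = parallel_step(chain)    # each element squared, all in parallel
```

HSA’s promise is that handing `chain` to the GPU for the parallel step costs nothing extra, because both sides already share the same memory.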
Some of the new architecture should lead to immediate performance improvements, but much of HSA has to be programmed for specifically to take full advantage of it. It is up to AMD not only to provide the hardware, which they have done, but to convince developers to make use of it. Judging by the HSA documentation, they are putting a lot of effort into doing so. We’ll be looking at a few applications that make use of HSA, as well as some traditional OpenCL computing that works in a somewhat similar way (at least in terms of combining the CPU with the GPU). As you’ll see, AMD has a bright future ahead if major applications adopt this.
Kaveri Platform – FM2+ Required
As is usually the case with a new CPU architecture, a new platform is needed to use it. Except not really: the new platform, A88X, has actually been around for a few months now, and it is fully backwards compatible with FM2 APUs.
As far as I can tell, A88X is 100% identical to A85X, which was the previous top end chipset. The only difference is the FM2+ socket, which has 2 extra pins required to install Kaveri.
So it still has four lanes of PCI-E 2.0, four USB 3.0 ports, ten USB 2.0 ports, eight SATA 3.0 ports, and so on. The 16 PCI-E lanes coming from the Kaveri APU itself are, of course, 3rd generation.