Chips and Cheese recently ran a GPU memory latency test on AMD’s RDNA 2 and NVIDIA’s Ampere GPU architectures, and the results are more than interesting.
Latency has become a crucial factor with the growing use of multi-chiplet designs that place several dies, including IO dies, on the same package. So, these results give us a glimpse of how the memory subsystems of the two GPUs actually perform.
Coming to the results, AMD’s Radeon RX 6800 XT (RDNA 2 GPU) and the NVIDIA GeForce RTX 3090 (Ampere GPU) were pitted against each other. According to the test, RDNA 2 showed lower latency than Ampere at every cache level, and its VRAM latency was about the same as Ampere’s despite having to check two more levels of cache on the way to memory. Infinity Cache adds only about 20 ns over an L2 hit and is still faster than Ampere’s L2.
The testers noted that NVIDIA’s Ampere-based GA102 GPU is much larger and uses a more conventional GPU memory subsystem with only two cache levels, and that getting around the die seems to take a lot of cycles: going from L1 to L2 takes over 100 ns. On RDNA 2, by contrast, the L2 is only about 66 ns away from the L0, even with an L1 cache in between.
We know that AMD’s Navi 21 GPU features a 4 MB L2 cache while the NVIDIA GA102 GPU features a 6 MB L2 cache for the whole chip. The NVIDIA A100 Ampere GPU for HPC features a massive 40 MB L2 cache.
The folks at Chips and Cheese had this to say about the results:
RDNA 2’s cache is fast and there’s a lot of it. Compared to Ampere, latency is low at all levels. Infinity Cache only adds about 20 ns over an L2 hit and has lower latency than Ampere’s L2. Amazingly, RDNA 2’s VRAM latency is about the same as Ampere’s, even though RDNA 2 is checking two more levels of cache on the way to memory.
In contrast, Nvidia sticks with a more conventional GPU memory subsystem with only two levels of cache and high L2 latency. Going from Ampere’s SM-private L1 to L2 takes over 100 ns. RDNA’s L2 is ~66 ns away from L0, even with an L1 cache between them. Getting around GA102’s massive die seems to take a lot of cycles.
This could explain AMD’s excellent performance at lower resolutions. RDNA 2’s low latency L2 and L3 caches may give it an advantage with smaller workloads, where occupancy is too low to hide latency. Nvidia’s Ampere chips in comparison require more parallelism to shine.
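To put the quoted figures side by side, here is a minimal Python sketch of the arithmetic. It uses only the approximate numbers reported above; the constant and function names are made up for illustration, and it assumes the ~66 ns and ~20 ns figures simply add, which the test write-up implies but does not state outright. Ampere’s figure is quoted as "over 100 ns", so 100 is treated as a lower bound.

```python
# Approximate latency deltas quoted in the article, in nanoseconds.
# These are the reported figures, not new measurements.
RDNA2_L0_TO_L2_NS = 66              # RDNA 2: L2 is "~66 ns away from L0"
RDNA2_INFINITY_CACHE_EXTRA_NS = 20  # Infinity Cache: "about 20 ns over an L2 hit"
AMPERE_L1_TO_L2_NS = 100            # Ampere: "over 100 ns", used as a lower bound


def rdna2_l0_to_infinity_cache_ns() -> int:
    """Estimate the RDNA 2 delta from L0 to Infinity Cache.

    Assumption: the two quoted RDNA 2 figures are additive.
    """
    return RDNA2_L0_TO_L2_NS + RDNA2_INFINITY_CACHE_EXTRA_NS


def margin_vs_ampere_l2_ns() -> int:
    """Minimum gap between Ampere's L1-to-L2 delta and RDNA 2's L0-to-Infinity-Cache delta."""
    return AMPERE_L1_TO_L2_NS - rdna2_l0_to_infinity_cache_ns()


if __name__ == "__main__":
    print(f"RDNA 2 L0 -> Infinity Cache: ~{rdna2_l0_to_infinity_cache_ns()} ns")
    print(f"Margin vs Ampere L1 -> L2:   at least ~{margin_vs_ampere_l2_ns()} ns")
```

Under these assumptions, an Infinity Cache hit on RDNA 2 sits roughly 86 ns from L0, which is still below the 100 ns or more that Ampere needs to get from L1 to L2.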