Intel Ponte Vecchio and Xe HPC Architecture: Built for Big Data

Intel dropped a ton of new information during its Intel Architecture Day 2021, and you can check out our other articles for deep dives into the Alder Lake CPUs, Sapphire Rapids, Arc Alchemist GPU, and more. That last one is particularly relevant to what we will discuss here, Intel’s Ponte Vecchio and the Xe HPC architecture. It’s big. Scratch that: It’s massive, particularly in the maximum configuration that sports eight GPUs all working together. The upcoming Aurora supercomputer will be using Sapphire Rapids and Ponte Vecchio in a bid to be the first exascale supercomputer in the U.S., and there’s a good reason the Department of Energy opted to go with Intel’s upcoming hardware. 

(Image credit: Intel)

Like the built-for-gaming Xe HPG, the base building block for Xe HPC starts with the Xe-core. There are still eight vector engines and eight matrix engines in the Xe-core, but this Xe-core is fundamentally very different from Xe HPG. The vector engine uses a 512-bit register (for 64-bit floating point), and the XMX matrix engine has been expanded to 4096-bit chunks of data. That’s double the potential performance for the vector engine and quadruple the FP16 throughput on the matrix engine. L1 cache sizes and load/store bandwidth have similarly increased to feed the engines. 

(Image credit: Intel)

Besides being bigger, Xe HPC also supports additional data types. The Xe HPG MXM only works on FP16 and BF16 data, but the Xe HPC also supports the TF32 (Tensor Float 32), which has gained popularity in the machine learning community. The vector engine also adds support for FP64 data, though only at the same rate as FP32 data.

With eight vector engines per Xe-core, the total potential throughput for a single Xe-core is 256 FP64 or FP32 operations, or 512 FP16 operations on the vector engine. For the matrix engines, each Xe-core can do 4096 FP16 or BF16 operations per clock, 8192 INT8 ops per clock, or 2048 TF32 operations per clock. But of course, there’s more than one Xe-core in Ponte Vecchio.

(Image credit: Intel)

Xe HPC groups 16 Xe-core units into a single slice, where the consumer Xe HPG only has up to eight units. One interesting point here is that, unlike Nvidia’s GA100 architecture, Xe HPC includes the ray tracing units (RTUs). We don’t know how fast the RTU is relative to Nvidia’s RT cores, but that’s a big potential boost to performance for professional ray tracing applications.

Source link

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button