Intel’s Rajeeb Hazra announced an interesting new product last month at ISC 2012. Knights Corner is the first product under the new Xeon Phi brand: a co-processor that looks roughly like a GPU without the display ports. Intel has referred to this architecture as “Many Integrated Core” (MIC), but it’s not clear to me whether that name is being dropped. So far we have very little technical information. AnandTech has a good summary of what’s been made available. There’s a “solution brief” on the Intel website indicating that Knights Corner uses a 22 nm process with Intel’s new 3D Tri-Gate transistors, and that the chip contains “more than 50 cores” (shall we assume 64?).
Intel follows the trend we’re seeing with most Chip Multi-Processors (CMPs), in which complicated cores are being thrown out in favor of dozens of simpler ones. We clearly see this in GPUs (where the processing cores are essentially single-precision floating point units), and it’s also the cornerstone of Tilera’s TILE architecture. Knights Corner cores are Pentium-class x86 cores with 64-bit (double-precision) floating point units, which should provide strong compute capability per core for a variety of applications while still being power efficient thanks to the state-of-the-art 22 nm process.
It’s the Network
I can only speculate about the on-chip network. Intel released a MIC prototype (codenamed Knights Ferry) to developers in 2010. Knights Ferry featured a 1,024-bit ring bus network connecting its 32 cores to memory, and the entire board had 2 GB of shared GDDR5 memory with a coherent 8 MB L2 cache (thanks, Wikipedia).
What will we see in Knights Corner? The marketing material is really pushing programmability and compatibility with x86, so I would guess we’ll see more of the same: global DDR memory, a coherent L2 cache, and a ring bus network. But I have to ask: have we now reached the scalability limit of a shared memory architecture? One would think that at some point CMPs will have to embrace Non-Uniform Memory Access (NUMA) and start treating processing cores (or groups of them) as network nodes. Once these on-chip networks get large enough, a shared global memory is going to be the bottleneck for almost every application, and developers will need some control over the communication patterns in order to get optimal performance.
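To put a rough number on that scalability worry, here is a toy model of a ring network. It is only a sketch under my own assumptions (a bidirectional ring, one hop per link, uniform traffic between cores) and says nothing about Intel’s actual design; the function names are mine. It shows that the average distance between two cores grows roughly linearly with the core count.

```python
def ring_distance(a, b, n):
    """Shortest hop count between nodes a and b on a bidirectional ring of n nodes."""
    d = abs(a - b) % n
    return min(d, n - d)


def average_hops(n):
    """Average shortest-path hop count from one node to every other node.

    By symmetry every node sees the same average, so node 0 is enough.
    Assumes n >= 2 and uniform traffic (a simplification, not a measurement).
    """
    total = sum(ring_distance(0, j, n) for j in range(1, n))
    return total / (n - 1)


if __name__ == "__main__":
    # Compare a Knights Ferry-sized ring with hypothetical larger ones.
    for cores in (32, 64, 128):
        print(cores, "cores:", round(average_hops(cores), 2), "hops on average")
```

For an even number of nodes the average works out to n²/(4(n−1)), so doubling the cores roughly doubles the typical trip across the ring. That is the kind of growth that makes me think locality will eventually have to be exposed to the programmer.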
How Do We Program It?
This architecture should present a very simple programming model to the developer. OpenCL should make it easy to write code for both data-parallel (like CUDA) and task-parallel (like Pthreads) applications, with data movement handled under the hood. I expect a host/kernel model, where the main CPU launches parallel kernels on the co-processor. I’m sure Intel Parallel Studio will provide C++ and Fortran programmers with some cool tools for improving performance. I’m also pleased to see standalone Windows and Linux versions of Parallel Studio (the product was originally a Microsoft Visual Studio plug-in). It would also be nice if Intel released a free Xeon Phi Linux SDK, which would encourage academia to try out this new product (assuming the cost of the actual card isn’t over the top). If they provide Pthreads support, I have some code I’d love to benchmark.
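For readers who haven’t met the two styles, here is a minimal sketch of the difference. It uses an ordinary Python thread pool as a stand-in “device”, purely to illustrate the host/kernel shape; it is not OpenCL, it is not anything Intel has published, and all the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_chunk(chunk, factor):
    """The 'kernel': the same operation applied to one slice of the data."""
    return [x * factor for x in chunk]


def data_parallel_scale(data, factor, workers=4):
    """Data-parallel style: split the data, run the same kernel on every chunk."""
    size = max(1, (len(data) + workers - 1) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves chunk order, so the result lines up with the input.
        results = pool.map(scale_chunk, chunks, [factor] * len(chunks))
        return [x for part in results for x in part]


def task_parallel(data):
    """Task-parallel style: different, independent jobs run side by side."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        total = pool.submit(sum, data)
        largest = pool.submit(max, data)
        return total.result(), largest.result()


if __name__ == "__main__":
    print(data_parallel_scale([1, 2, 3, 4, 5], 2))
    print(task_parallel([3, 1, 4]))
```

In both cases the “host” code only decides what to launch and collects the results. Whether Knights Corner hides the data movement as neatly as a thread pool does is exactly the part I’m still guessing at.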
What This Chip Is Not
This is potentially great for scientific research and other High Performance Computing (HPC) applications, but it’s not going to do much for the armies of developers doing “real work” using distributed databases and writing MapReduce jobs. Someday there will be a market for a cluster-on-chip: distributed memory with NUMA, addressable processing cores, and real message passing.
I’d really like to try this out for math-intensive HPC applications. I’m sure we’ll see Knights Corner in many of the biggest supercomputers. I predict CUDA developers will embrace the Xeon Phi if the price is right (Intel will need to get it down to around $1,000 after a high initial launch price) and the software ecosystem is strong.