Intel’s Rajeeb Hazra announced an interesting new product last month at ISC 2012. Knights Corner is the first of the new Xeon Phi brand, and it’s a co-processor that looks roughly like a GPU without the display ports. Intel has referred to this architecture as “Many Integrated Core” (MIC), but it’s not clear to me whether that name is being dropped. So far we have very little technical information. AnandTech has a good summary of what’s been made available. There’s a “solution brief” on the Intel website indicating Knights Corner uses a 22 nm process and Intel’s new 3D Tri-Gate transistors, and that the chip contains “more than 50 cores” (shall we assume 64?).
Is it a GPU? No, but it’s not far off.
Intel follows the trend we’re seeing in most Chip Multi-Processors (CMPs), in which a few complicated cores are being thrown out in favor of dozens of simpler ones. We clearly see this in GPUs (where the processing cores are essentially single-precision floating-point units), and it’s also the cornerstone of Tilera’s TILE architecture. Knights Corner cores are Pentium-derived cores with 64-bit floating-point units, which should provide strong per-core compute for a variety of applications while still boasting power efficiency, thanks to the state-of-the-art 22 nm process.
It’s the Network
I can only speculate about the on-chip network. Intel released a MIC prototype (codenamed Knights Ferry) to developers in 2010. Knights Ferry featured a 1,024-bit ring bus network connecting its 32 cores to memory, and the entire board had 2 GB shared DDR memory with a coherent 8 MB L2 cache (thanks Wikipedia).
What will we see in Knights Corner? The marketing material is really pushing programmability and compatibility with x86, so I would guess we’ll see more of the same: global DDR memory, a coherent L2 cache, and a ring bus network. But I have to ask: have we now reached the scalability limit of a shared-memory architecture? One would think that at some point CMPs will have to embrace Non-Uniform Memory Access (NUMA) and start treating processing cores (or groups of them) as network nodes. Once these on-chip networks get large enough, a shared global memory is going to be the bottleneck for almost every application, and developers will need some control over communication patterns to get optimal performance.
How Do We Program It?
This architecture presents a very simple programming model to the developer. OpenCL will make it easy to write both data-parallel (like CUDA) and task-parallel (like pThreads) code, with data movement handled under the hood. I expect a host/kernel model, where the main CPU launches parallel kernels on the co-processor. I’m sure Intel Parallel Studio will provide C++ and Fortran programmers with some cool tools for improving performance, and I’m pleased to see standalone Windows and Linux versions (the product was originally a Microsoft Visual Studio plug-in). It would also be nice if Intel released a free Xeon Phi Linux SDK, which would encourage academia to try out this new product (assuming the cost of the actual card isn’t over the top). If they provide pThreads support, I have some code I’d love to benchmark.
What This Chip Is Not
This is potentially great for scientific research and other High Performance Computing (HPC) applications, but it’s not going to do much for the armies of developers doing “real work” using distributed databases and writing MapReduce jobs. Someday there will be a market for a cluster-on-chip: distributed memory with NUMA, addressable processing cores, and real message passing.
I’d really like to try this out for math-intensive HPC applications. I’m sure we’ll see Knights Corner in many of the biggest supercomputers. I predict CUDA developers will embrace the Xeon Phi if the price is right (they’ll need to get it down around $1,000 after a high initial launch price) and the software ecosystem is strong.