GPU Supercomputing: Are we there yet?

Wen-mei Hwu, PCI Chief Scientist
11/30/2011 - 12:58

Supercomputers need to cut their power bills. Today’s supercomputers consume between 1 and 4 watts of electricity for each gigaFLOPS (billion floating-point operations per second) of peak calculation capability. The leading supercomputers in 2012 will have about 10 petaFLOPS of peak calculation capability, which translates into 10 to 40 megawatts. With electrical power costing about $1 a year per watt, these machines will rack up power bills ranging from $10 million to $40 million annually. This is an important reason why leading-edge supercomputers are increasingly using GPUs, which are five times more power efficient than today's CPUs.
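The arithmetic behind these figures can be checked with a short sketch. The function name and its parameters are illustrative assumptions; the only inputs taken from the text are the 10 petaFLOPS peak, the 1 to 4 watts per gigaFLOPS range, and the rough $1 per watt per year price.

```python
def annual_power_cost(peak_gflops, watts_per_gflops, dollars_per_watt_year=1.0):
    """Return (power draw in watts, yearly electricity cost in dollars).

    Hypothetical helper: it simply multiplies the figures quoted in the
    text and ignores cooling overhead and real utilization.
    """
    watts = peak_gflops * watts_per_gflops
    return watts, watts * dollars_per_watt_year


# 10 petaFLOPS expressed in gigaFLOPS: 10 * 10^6 = 10^7.
PEAK_GFLOPS = 10 * 10**6

for watts_per_gflops in (1, 4):
    watts, dollars = annual_power_cost(PEAK_GFLOPS, watts_per_gflops)
    print(f"{watts_per_gflops} W/GFLOPS: {watts / 1e6:.0f} MW, "
          f"${dollars / 1e6:.0f} million per year")
```

The two ends of the efficiency range give the 10 and 40 megawatt figures, and the $10 million and $40 million annual bills, quoted above.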
In 2010, three of the top five supercomputers in the Top500 list used GPUs. Tianhe-1A (#2), Nebulae (#4), and Tsubame (#5) use one, one, and three GPUs in each computing node, respectively. The power efficiency of these machines is reflected in their Green500 rankings: Tianhe-1A (#13), Nebulae (#16), and Tsubame (#4). By using more GPUs in each computing node, Tsubame operates at the highest power efficiency among the three systems. It should be no surprise that the Blue Waters system, set to operate at more than 10 petaFLOPS in 2012, will use GPUs in at least some of its computing nodes.

So, are we there yet? The answer is no, but hopefully soon. The use of GPUs in high-performance computing has come a long way since the late 1990s, when physicists had to write their simulation models as graphics shaders painting texture onto pixels. Today, they can write their models as C functions extended with CUDA or OpenCL keywords. So, we should be all set? Not quite. There are many functions that are yet to be created for GPUs before they become as usable as traditional CPUs. The main challenge is the need to change the algorithms for these functions into scalable parallel algorithms that can effectively overcome memory bandwidth limitations.
In some cases, such as finding the shortest path in graphs, there are no known massively parallel algorithms. It may take decades for such algorithms to be invented. In other cases, the non-uniform, sparse nature of real-world data can cause catastrophic load imbalance in large-scale parallel execution, which makes the parallel algorithms run much slower than expected. In many other cases, massively parallel algorithms exist but require clever optimizations to run efficiently within the limited available memory bandwidth. Most of the advances in GPU computing have been in this third category. Many researchers, including my team at the University of Illinois, have created such optimized algorithms for linear algebra, N-body simulation, partial differential equations, and graph optimization. The challenge is that there are still a lot of functions to be created for many application areas. In most cases, we also need to rethink the application strategy to deploy these functions effectively.
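To make the load-imbalance point concrete, here is a minimal sketch of what happens when work items of very uneven size are split into equal-sized contiguous chunks. It is a toy cost model, not a GPU program: the function name, the four-worker setup, and the work counts are all made-up assumptions, and the only idea borrowed from the text is that parallel run time is set by the most heavily loaded worker.

```python
def static_partition_speedup(work_per_item, num_workers):
    """Estimate speedup when items are split into equal-sized contiguous chunks.

    Toy model (an assumption for illustration): run time is the load of the
    busiest worker, and scheduling and memory costs are ignored.
    """
    chunk = -(-len(work_per_item) // num_workers)  # ceiling division
    loads = [
        sum(work_per_item[start:start + chunk])
        for start in range(0, len(work_per_item), chunk)
    ]
    # Serial time is the total work; parallel time is the largest chunk.
    return sum(work_per_item) / max(loads)


# Hypothetical data: eight rows of a sparse matrix, counted by nonzeros.
uniform_rows = [1] * 8            # every row costs the same
skewed_rows = [1000] + [1] * 7    # one dense row dominates

print(static_partition_speedup(uniform_rows, 4))
print(static_partition_speedup(skewed_rows, 4))
```

With uniform rows the four workers finish together and the estimate is the ideal 4x. With the skewed rows, the worker that receives the dense row does almost all of the work, so the estimate stays barely above 1x no matter how many workers are added.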
So, why do we bother? Why not just use CPUs? The simple answer is that we need to do the same for CPUs too. All CPUs have multiple cores with vector units and limited memory bandwidth today. The number of cores in the CPUs is increasing. In order to effectively use CPUs, we will need the same type of work!
The good news is that as we gain experience in developing and deploying parallel functions, we also learn how to create new tools to make the job easier. There are quite a few new tools coming out of research groups and commercial companies. These tools need to evolve and improve. Nevertheless, we as a community are making steady progress. After all, four out of 25 planned applications for Blue Waters can use GPUs effectively today. Obviously, we want the number to be much higher by the time we launch the system in 2012. We are planning on getting there on time.