GPUs: Path into the future
With the announcement of a new Blue Waters petascale system that includes a considerable amount of GPU capability, it is clear GPUs are the future of supercomputing. Access magazine's Barbara Jewett recently sat down with Wen-mei Hwu, a professor of electrical and computer engineering at the University of Illinois, a co-principal investigator on the Blue Waters project, and an expert in computer architecture, especially GPUs.
Q. Why should we care about graphics processing units (GPUs)?
A. There are a few things about GPUs that are especially attractive. One is that if you compare a typical CPU chip with a GPU chip today, the GPU chip tends to give you 10 times more peak execution throughput and somewhere around six times the DRAM memory access bandwidth, with maybe around 50 percent, in some cases only 20-30 percent, more power consumption. So if you calculate the ratios in terms of execution throughput per watt or memory bandwidth per watt, we're probably talking about eight times more attractive on the GPU side for execution throughput and somewhere around five times more attractive for memory bandwidth.
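The per-watt ratios quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `per_watt_advantage` is hypothetical, and the inputs are the rough figures from the answer (10 times throughput, six times bandwidth, 20 to 50 percent more power).

```python
def per_watt_advantage(speedup: float, extra_power_fraction: float) -> float:
    """GPU-to-CPU ratio of performance per watt.

    speedup: how many times faster the GPU is than the CPU.
    extra_power_fraction: additional GPU power draw, e.g. 0.25 for 25 percent more.
    """
    return speedup / (1.0 + extra_power_fraction)


# Rough figures quoted in the interview (assumed, not measured here).
for extra in (0.2, 0.3, 0.5):
    throughput = per_watt_advantage(10.0, extra)
    bandwidth = per_watt_advantage(6.0, extra)
    print(f"+{extra:.0%} power: throughput/W x{throughput:.1f}, "
          f"bandwidth/W x{bandwidth:.1f}")
```

With 20 to 30 percent more power, the result lands near the "eight times" and "five times" figures; at 50 percent more power the advantage is smaller.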
Q. So more work is done for the same wattage?
A. Same wattage. That's why, when people build these huge machines, GPUs have become more attractive. Because these huge machines are power hungry.
The past year was really the year where a lot of these projects turned in the GPU direction for various reasons. For two or three years we have known that those ratios are very attractive; it's just that there were other factors that deterred people from making that turn. But many of those factors have disappeared or been resolved to the point that these ratios have started to draw people into building machines with GPUs.
Q. Do people have to rewrite their codes in order to run on GPUs?
A. They have to rewrite their code. That goes into some of those factors I mentioned. There have been probably about five or six factors that have been deterring people from making widespread use of GPUs in the high-performance computing world. Probably the most obvious one is that the first generation of these GPUs only had single precision; the number representation used only 32 bits. Most scientific computing people want to use double precision, 64 bits.
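The difference between 32-bit and 64-bit representation can be seen by storing a value in single precision and reading it back. This is a minimal sketch using only Python's standard `struct` module; the helper name `round_trip_float32` is hypothetical and the value 0.1 is chosen just for illustration.

```python
import struct


def round_trip_float32(x: float) -> float:
    """Store x in 32 bits (single precision) and read it back as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Size in bits of the single- and double-precision formats.
print(struct.calcsize("<f") * 8, struct.calcsize("<d") * 8)

value = 0.1  # a Python float is already double precision
single = round_trip_float32(value)
print(f"double: {value:.17f}")
print(f"single: {single:.17f}")
print(f"difference: {abs(single - value):.3e}")
```

The single-precision copy agrees with the double-precision value only to about seven significant digits, which is why long-running scientific computations usually ask for 64 bits.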
That was a serious problem in the first- and second-generation GPUs. If you used double precision the performance dropped dramatically: by more than 10 times in the first generation, and by about eight times in the second generation. Go back to the performance ratio I mentioned earlier; if it drops about eight times then the GPU is about the same as the CPU, so there is no real advantage.
The new generation, what we call the Fermi generation, is the generation that pretty much everyone has been using since last year. Double precision now runs at half the single-precision rate, a 1:2 ratio, exactly as on a CPU. That makes the GPU much more attractive for scientific computing.
So that's number one. Number two is it took quite a bit of time for vendors to catch up. A lot of things people use in the field were not really there for the GPUs. If you look at the scientific libraries for GPUs, say, linear algebra libraries and partial differential equation solver libraries, we have finally reached the point where they may not all be there, but there are enough of them that people can begin to really use them. That is why, starting last year, a lot more people were using GPUs.
Another interesting aspect is that when you try to use GPUs, you can use them in two ways. One way is to just call the libraries; you still need to do some adjustment, but it is nothing major and you avoid rewriting a lot of code. But if you want to do some form of computation that is not available in the form of libraries, you actually need to rewrite your code to run on the GPU. So this was also one of the major deficiencies.
Cray has been working on this deficiency. One of the most popular ways of programming multi-core CPUs today is OpenMP, where people write their code in C but put in pragmas and directives so that the Intel compiler and other compilers, such as the PGI compiler, can generate multi-threaded code for applications. With new compiler technology, people won't need to rewrite their code for the GPU. They will still need to rewrite their code if they want to get a lot of benefit, but they won't need to rewrite to get some benefit.
So those are the factors that have been changing recently. If you look at the top machines in the world last year, four of the top 10 machines were GPU clusters. That is an indication that GPUs have reached critical mass.
Q. Will GPU codes being developed now transfer to the next generation computer? And to an exascale?
A. GPUs need to have at least thousands of threads all running at the same time to get the full benefit of the hardware. Whereas the CPUs today typically have no more than 16 threads in a chip, even in the very high-end CPUs. When you are writing an algorithm it is very different if you are writing for 16 threads or thousands of threads. In many ways it is a redesign of your algorithm so that you can actually partition your work into much, much smaller chunks, but then all the chunks can be done in parallel. In some cases, like in solving graph problems, it is very hard.
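The idea of partitioning work into many small, independent chunks can be sketched with a simple sum. This is an illustration only, not code from the interview: the names `chunk_bounds` and `chunked_sum` are hypothetical, and the chunks are processed one after another here, whereas on a GPU each chunk would go to its own thread.

```python
def chunk_bounds(n_items: int, n_chunks: int) -> list[tuple[int, int]]:
    """Split range(n_items) into n_chunks contiguous, near-equal [start, stop) ranges."""
    base, extra = divmod(n_items, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def chunked_sum(data: list[int], n_chunks: int) -> int:
    """Sum data by computing one partial sum per chunk, then combining them."""
    # Each partial sum reads only its own chunk, so all of them
    # could be computed at the same time by separate threads.
    partials = [sum(data[a:b]) for a, b in chunk_bounds(len(data), n_chunks)]
    return sum(partials)


data = list(range(100_000))
# The same algorithm works for 16 chunks (CPU-like) or 4096 chunks (GPU-like).
print(chunked_sum(data, 16) == chunked_sum(data, 4096) == sum(data))
```

A sum partitions easily because the partial results are independent; the graph problems mentioned above are harder because the work items depend on one another.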
On the other hand, if you look at exascale computing, even CPUs will have probably hundreds of threads. So we are going to cross over to the regime where you will need to be able to partition your work into smaller chunks. What is really happening is that as people begin to rewrite their code, rethink their algorithms, and do all that hard work, they make their code suitable not only for GPUs today but also much more suitable for CPUs in the future, and definitely GPUs in the future. So that's where you are paying forward, for your future. Many application teams understand that. Essentially they are now making a one-time investment and saying, "Once I make that investment, I make my code scalable in terms of the number of threads. And then I have a path into the future."
One of the research projects we do in my lab is to actually make sure that people can truly write just one piece of code. They can just focus on getting their algorithm to be scalable. We actually have various pieces of technology to translate the code into a particular form for the GPU or a particular form for the CPU. As long as the code is based on work-efficient parallel algorithms, we have been successful in this approach. At that point, when people invest in developing their new code, they are not investing in GPUs only; they are investing in a path into the future.
Q. Gets them thinking of the big picture?
A. Another important activity is that we're teaching a lot of these chip-level scalable programming techniques to domain people, people in physics, mechanical engineering, and so on. It's not that we expect everyone to be writing detailed code, but we're actually training them to think about scalable programming, so they can begin to renew their models and their methods, and begin to think about parallelism and data locality in a competent way. They can then work with people who are really good with libraries and coding to generate the new generation of libraries faster.
We recognized this need for education and planned for it when we wrote the Blue Waters project proposal. That is why there has been the Virtual School of Computational Science and Engineering (VSCSE) summer school for the past three years; we already knew these things were coming and made it part of the project. I have also been teaching the European Union summer school in Barcelona for the past two years. These schools are really about training the domain scientists to think about scalable computation.
Q. So GPUs have arrived. Are people willingly converted?
A. There are a couple of things that are interesting. In some application domains, molecular dynamics for example, if you look at the entire application after you move parts of it to the GPU, not just the parts you moved, often we are talking about a two times speedup, in some cases three to four times, versus the 10 times peak execution throughput I mentioned earlier. In some cases people have demonstrated that for some routines you can actually get close to a 100 times speedup because of the interaction between the memory bandwidth and the peak performance. But that is less common. So people will say, what's the big deal? For the whole application, you are only getting a two or three times speedup. But I think it is something we need to put into perspective.
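One common way to see why a fast routine yields a modest whole-application speedup is Amdahl's law, which the interview does not name but which fits the description. The sketch below is an assumption-laden illustration: the function name `overall_speedup` and the accelerated fractions (55 and 75 percent of the original run time) are hypothetical, not measurements from any application.

```python
def overall_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-application speedup when only part of the run time is accelerated.

    accelerated_fraction: share of the original run time that moves to the GPU (0 to 1).
    kernel_speedup: how many times faster that share runs on the GPU.
    """
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


# Hypothetical cases: the rest of the application still runs at CPU speed.
for fraction, kernel in ((0.55, 10.0), (0.75, 10.0), (0.75, 100.0)):
    total = overall_speedup(fraction, kernel)
    print(f"{fraction:.0%} of run time at x{kernel:.0f}: whole application x{total:.2f}")
```

Even a 100 times faster routine cannot push the whole application past the limit set by the part that stays on the CPU, which is consistent with the two to four times figures quoted above.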
We are at the point in the industry where just the computing power grows about two times every two years. But in terms of real application, often you are getting less than 40 percent. Again, the growth is two times in terms of peak performance, but not in terms of real application. People don't realize this. They think, "I'll just wait another two years and I'll get two times." I have to tell them no, you don't understand. In another two years you'll be getting probably a 40 percent performance increase and you still have to work for it.
So in order to actually get a two times increase in real application performance through this natural evolution, you will need around a six times increase in peak performance. And guess what? That's going to be eight years from now. And then the situation becomes extremely obvious: if it is going to be eight years, it is much better if you can get some increase today.
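The widening gap between peak and real application performance can be sketched as compound growth. The sketch assumes, for illustration only, that both rates stay constant: peak doubles every two years and applications gain 40 percent every two years, which the interview gives as an upper bound. The function name `growth_after` is hypothetical, and because the quoted figures are conversational estimates the output is not expected to match them exactly.

```python
def growth_after(years: float, factor_per_period: float, period_years: float = 2.0) -> float:
    """Cumulative growth factor after `years`, compounding once per period."""
    return factor_per_period ** (years / period_years)


# Assumed constant rates: peak x2.0 and application x1.4 per two-year period.
for years in (2, 4, 6, 8):
    peak = growth_after(years, 2.0)
    app = growth_after(years, 1.4)
    print(f"{years} years: peak x{peak:.1f}, application x{app:.2f}")
```

At the 40 percent upper bound the application gain lags far behind peak growth, and with smaller gains per period the wait for a two times application speedup gets correspondingly longer.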
Q. Is that some of the problem the HPC industry has had—we get so focused on the Top 500 list and peak performance?
A. Yes. The whole HPC industry is going towards lower clock frequency and higher throughput types of systems. That is why most people are going into a GPU type of arrangement. When you think about development efforts and when you think about sustained performance, real application performance, this is different than what we talk about in terms of peak.
For the new Blue Waters design, the GPU nodes are going to be very, very strong nodes. So as long as we can get good node-level algorithms going, these nodes are going to be even stronger than the old Blue Waters design. To me, that is extremely important because that means if we do a good job getting the node-level algorithms going, then we can actually sustain a higher percentage of peak performance for the applications. But you know, these are the challenges that we still need to meet. These are real challenges.
Q. Take it from scientific computing into personal computing. Are we going to be having GPUs on our desks and in our laptops?
A. Oh, absolutely. Every computer has a GPU in there; that's the graphics card. Not all of those graphics cards are what we call CUDA-enabled, or GPU-computing-enabled, but at this point pretty much every graphics card shipped by NVIDIA is GPU-computing-enabled. So in a few years, everything that you can buy, and pretty much everything that people use, will be GPU-computing-enabled.
That's actually extremely important, because when people bother to write their code in a form that can execute well on GPUs, they want to be able to run them on millions and millions of systems. We've had these exotic parallel machines and have maybe 100 of them in the world. People just ignore them because there is no real money in selling software or licensing software for those machines. With only 100, how many customers can there be?
But here you have hundreds of millions of machines. That's the fundamental reason why people are willing to port all the libraries, because those libraries cost a lot of effort and money. Since all these machines will have GPU-computing capabilities, those libraries can now run on hundreds of millions of machines instead of just being used by HPC people. So now there is a business case for that.
Another important consideration is, if you look at the mobile world there is now a design style called fusion that essentially has a single chip with a modest CPU and a modest GPU on it. These chips are very power efficient, but they have a fair amount of computing power from the GPU. We're talking probably 100 gigaflops. Compared to the high-end GPU that is about 2 teraflops today, 100 gigaflops is not a whole lot, but if you compare that with the high-end CPUs today, most CPUs are still at about 60 gigaflops. So these chips you can put into tablets and the like will have 100 gigaflops, and that gives you tremendous computing power with very low energy consumption for very sophisticated applications. Even on cell phones in the future.
That actually brings out some very interesting client applications. Suddenly smartphones don't necessarily need to rely on servers for executing applications; they can have much, much better user interfaces and much better latency, because for some of these applications the communication with a server incurs intolerable latency. So now we're beginning to see even more interest in developing applications and libraries for GPUs for that reason.
Q. With solving some of these issues with GPUs, does that, in your opinion, make exascale something we are now going to be able to hit?
A. No! It is still not clear to me if we can hit exascale. Even with GPUs we're still not quite there yet. We're still probably another order of magnitude short. If you look at the GPU floating-point performance per gigabyte of main memory, it is still not quite sufficient. In the future we're going to have dramatically less memory available per gigaflop of performance we're looking for. That will require some even worse brain surgery on some of the algorithms. When I look at some of these things, it is still not clear to me how some of those problems will be solved. So much of that is still going to be research.
Q. So we still have things to dream about.
A. Yes. And there are still going to be a lot of inventions. And that's also the scary part. People still need to invent new algorithms and people still need to invent probably some of these new models to be able to fill that one order of magnitude gap that nobody really knows how to fill at this point.
Q. Do you think funding constraints will hinder innovation, or will they encourage even more creativity?
A. That's hard to tell. Funding has always been kind of funny. In reality, nobody really funds these things. If you look at the way the National Science Foundation and the Department of Energy spend money on these machines, they are always quick in spending money on the hardware. But they are always reluctant to spend money on the applications and the software. Moving forward, I think it is going to be the same. But people always manage to put together some money for these innovations. So I think it may still happen. But there is going to be a lot of uncertainty, and I think it may come slower than people would imagine.