So, what is new in parallel computing? Revisiting "Tutorial on Parallel Processing"

Robert H. Kuhn and David A. Padua
02/24/2012 - 14:11

Blowing now toward the south, then toward the north, the wind turns again and again, resuming its rounds. What has been, that will be; what has been done, that will be done. Nothing is new under the sun. -- Ecclesiastes

When work on Illiac IV started at Illinois in the 1960s, parallel computing was the focus of only a few researchers and was a niche area with few venues where results could be presented. Half a century later, the situation is quite different. Since the 1960s, there has been immense and continuous research activity, punctuated by two peaks of interest. In the mid 1980s, the almost sudden realization of the importance of supercomputing attracted many to the area. Then, in the last five years, the advent of multicores brought renewed interest and a significant increase in research activity.

In 1981, we were fresh out of graduate school and were invited (thanks to a recommendation from our shared PhD thesis advisor) to give a tutorial at the main parallel computing conference of the time, ICPP, the International Conference on Parallel Processing. As part of this, we put together a collection of papers on hardware, software, algorithms, and applications entitled Tutorial on Parallel Processing, published by the IEEE Press. In the days before the internet, such collections of papers were valuable and useful. Our tutorial was even reprinted once.

We thought it would be interesting to revisit that 30-year-old collection to see what we can learn about the rate of progress and the durability of ideas. Most of the papers in the tutorial dealt with four topics: machines, languages and compilers, applications, and algorithms. We will comment on the first two topics, since blogs are supposed to be short.

Machines

Ten of the 21 papers on machines described “big-iron” systems: pipelined processors like the Cray-1 and array processors like Goodyear’s STARAN. These classes of machines are mostly extinct now, although their descendants (microprocessor vector extensions, pipelined functional units) are still with us in the same way that dinosaurs survive in today’s birds. Indeed, they are now more widespread than ever before. Even cell phones contain multiple cores and GPUs. There were two papers on dataflow machines, which did not survive either. It has been said that support for instruction-level parallelism (ILP) is a legacy of dataflow machine research, but most see Tomasulo’s algorithm and the CDC 6600 scoreboard (which predates dataflow) as the true ancestors of ILP. Dataflow does have a legacy, however; for example, some proposals for the runtime systems of future exascale machines are based on dataflow ideas.

Five of the papers in the Tutorial described multiprocessors, all with shared memory. In a way, there hasn’t been a striking change here. On the other hand, shared memory is now much richer, with cache coherence and multiple levels of shared or private caches. (“NUMAness” -- it’s a terrible word, isn’t it?!) One area where dataflow did contribute to current multicores is synchronization acceleration, from full/empty bits then to lock elision and transactional memory today.

There were no papers on distributed-memory machines, today’s clusters. In those days, most people (including us) seemed to believe that these machines would be quite hard to program. PVM and MPI demonstrate that networking software has made great strides. While programming distributed-memory machines is difficult and in some cases can be truly convoluted, in practice the difficulty turned out not to be as great as expected. One of the four papers on special-purpose processors was on database machines, which never materialized into popular systems.

Languages and compilers

The Tutorial contained four papers on parallel programming languages and three on compiler technology. Two of the language papers focused on message-passing languages (CSP and Concurrent Pascal), one on a vector language, and the last was a survey of dataflow languages. Then, as now, languages and compilers responded to the need to program the available hardware. It is interesting that, while message passing has been the dominant form of parallelism in high-end machines, languages like CSP and Concurrent Pascal never became popular. Libraries are today the notation of choice for message passing. The main reasons seem to be market size (compilers are expensive to build and maintain, and the number of users of message-passing notations is relatively small) and speed of adaptation (libraries can be modified faster than compilers).
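
To illustrate the library approach, here is a minimal MPI sketch in C (our own illustration, not taken from the Tutorial): the parallelism lives entirely in library calls, with no new language constructs for a compiler to understand.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal point-to-point exchange: rank 0 sends an integer to rank 1.
   Run with at least two ranks, e.g., mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Because the message-passing interface is just a library, it could evolve (nonblocking operations, collectives, one-sided communication) without waiting for compilers to catch up.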

Although vector notation is quite appealing independent of parallelism (as the many years of fascination with APL show), it has never become mainstream. Today even data-parallel languages such as Chapel and CUDA rely mostly on other constructs (Chapel on forall and CUDA on SPMD). Intel’s ArBB is the exception, but it has not been widely accepted. The difficulty of developing efficient implementations of dataflow, functional, and vector languages is possibly the main reason that these languages are rarely adopted. This area is still fertile ground for compiler research, in our opinion.

Quite striking is the absence in the Tutorial of papers on languages or notations for shared-memory programming. There were of course notations for shared-memory programming, such as co-begin/co-end, but they apparently were not used to program parallel machines but to represent concurrency (much like Java threads today). Notations like OpenMP, Cilk, and TBB, with support for scheduling and load balancing, were yet to be developed. Only hinted at in the ‘80s was the now popular idea of programming an SMP with a parallel loop. Parallel loops back then relied only on static scheduling and had no primitives for reductions, nor the notion of private variables.
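
As a point of contrast, here is a minimal OpenMP sketch in C (our illustration, not part of the Tutorial) showing exactly the features those early parallel-loop proposals lacked: non-static scheduling, a reduction, and a private variable.

```c
#include <stdio.h>
#include <math.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;
    double t;

    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* Dynamic scheduling, a reduction, and a private temporary: three
       features absent from the parallel-loop ideas of the early 1980s. */
    #pragma omp parallel for schedule(dynamic, 1000) private(t) reduction(+:sum)
    for (int i = 0; i < N; i++) {
        t = sqrt(a[i]);   /* t is private: each thread gets its own copy */
        sum += t;         /* per-thread partial sums are combined at the end */
    }

    printf("sum = %f\n", sum);
    return 0;
}
```

Compiled without OpenMP support, the pragma is simply ignored and the loop runs serially, which is part of why the pragma approach caught on.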

There was also one paper in the Tutorial on automatic detection of parallelism. Today, practically all compilers implement the technology described in that paper for vectorization and parallelization, but today’s vectorizers work best when guided by the user with pragmas, and nobody uses compilers for automatic parallelization. In place of vectorization, SIMD languages like CUDA and OpenCL have grown, stimulated by the rise of GPUs.
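
As one illustration of user-guided vectorization, the sketch below (our own, assuming a compiler that supports OpenMP 4.0 or later) uses the omp simd pragma to assert that a loop is safe to vectorize, rather than leaving the compiler to prove independence on its own.

```c
#include <stddef.h>

/* The programmer asserts there is no loop-carried dependence, so the
   compiler may emit SIMD instructions without proving independence itself. */
void saxpy(size_t n, float alpha, const float *x, float *y) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```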

Final words

It is clear that much has changed in 30 years. Most classes of machines described in the Tutorial have disappeared, giving way at the high end to clusters and highly parallel, tightly coupled systems that are immensely more powerful than those of 1981. These performance advances are in part the result of microprocessor technology and in part the result of architectural features that take advantage of advances in hardware technology and benefit today’s applications.

On the programming side, although none of the notations described in the Tutorial became mainstream, all the major concepts of yesteryear are still part of our conversation today. Array computers and array programming (think GPUs, CUDA, and ArBB) are widely available. Functional programming is still mentioned as a good thing and is an area of active research (for example, parallel Haskell), and so on. But there have also been changes: today’s popular notations (threads, OpenMP, and MPI) have numerous features not available in any form in the early 1980s.

In both software and hardware, our community's work over the past 30 years has led to impressive advances.  And with the advent of multicores and the challenge of exascale, future changes are likely to happen even faster. While there has been a complete change of scenery, the concepts of the 1980s are still with us. This endurance across decades is a great motivation for those working on the next wave of ideas.