Personal Parallel Supercomputers
Carlos Silesky, John Sobolewski
University of New Mexico
Computer & Information Resources & Technology
Albuquerque, NM 87131
silesky@unm.edu, jssob@unm.edu
Abstract
Supercomputers, or high performance computers, are generally believed to be large and expensive systems affordable only by large corporations and well endowed research universities. What is not generally known is that relatively powerful computer systems, with more than an order of magnitude greater performance than a CRAY-1 computer for many scientific and engineering applications, can be built relatively quickly and inexpensively using public domain software, commodity personal computers (PCs) and networking components. Such systems, with a peak theoretical performance of more than one Gigaflop, can be easily built for well under $50,000 and are creating a price-performance revolution for many applications that previously were confined to running on expensive supercomputers. This paper describes such systems, together with their implementation, operational characteristics, limitations, and price-performance comparisons with their commercial counterparts. These PC-based supercomputers place high performance computing within reach of community colleges, departments and even individuals, in which case they may be called personal supercomputers.
Not too long ago, supercomputers could only be acquired by national laboratories, larger research universities and corporations with large research budgets. However, recent advances in commodity microcomputers and network interconnect technologies, coupled with the availability of a free UNIX operating system (Linux [2,3]), have made it possible to easily assemble a cluster of personal computers into a parallel "supercomputer" with a theoretical peak performance of well over one Gigaflop at a cost of less than $10-20 per Megaflop. Such systems rival the performance of much more expensive systems for many types of engineering and scientific applications and are affordable to organizations and even to individual researchers.
Over the past 2-3 years, several important events have enabled the assembly of these personal supercomputers from commodity parts that can be readily purchased from local personal computer and network component vendors. These events include [1]:
These commodity hardware and software components allow reasonably self-sufficient computer users to build themselves an inexpensive personal supercomputer that is simple, reliable, inexpensive to maintain and reasonably user-friendly, with excellent cost/performance for many applications, especially those that are naturally parallel (e.g., Monte Carlo techniques), are insensitive to communication latency, or require relatively few nodes to run. Recent research shows that many scientific and engineering applications fall in these classes. An analysis of about 178,000 batch jobs run at the Maui High Performance Computing Center (MHPCC), for example, shows that 78% of all jobs submitted by a very heterogeneous user base used eight or fewer processors and 94% of all jobs used 32 or fewer processors [4]. The cumulative distribution of these jobs is shown in Figure 1. This, together with the fact that a 300 or 350 MHz Pentium personal computer (or equivalent) has approximately the same performance as a processor node at the MHPCC, suggests that a relatively large percentage of these jobs could be run on a personal supercomputer consisting of a cluster of 8-32 personal computers. However, users with very large jobs that are sensitive to communication latency, require large address spaces, use many processors, and require fast execution times should continue to use commercial supercomputers designed for this purpose.
The architecture of a personal supercomputer is very simple and is illustrated in Figure 2. It consists of n processors or nodes and their peripherals (where n > 2 for a parallel system), an interconnect network fabric to allow each node to communicate with all others, a suitable operating system and message passing software. In effect, this is identical to the architecture of a conventional shared-nothing parallel computer (shared-nothing in the sense that the processors or nodes do not share memory or peripherals), where the message passing software is used to exchange (share) data and other information among the processors over the interconnect fabric using the explicit message passing programming paradigm.
There are many commercial systems using this architecture with proprietary processors, interconnects and software. To reduce the cost of such proprietary systems, less costly commodity hardware and software must be used. The system that was built and evaluated consisted of the following:
It is desirable to attach an additional processor to the cluster to serve as a control, software development, and compile processor. In general, this processor should have additional disk for storing large data sets and needed software tools as well as a larger memory to compile application programs.
3. Cost and Performance Issues
For a given parallel application code that is not memory bound, the performance of the shared-nothing parallel architecture described depends primarily upon the following:
Obviously, better overall system performance can be obtained by using the fastest processors available, such as the proprietary DEC Alpha, Sun Sparc or similar machines. However, for an n node system, that increases the cost, since each of these more powerful nodes may cost ten or more times as much as a 300 MHz personal computer, which can be obtained for under $1,000 today. Similarly, proprietary network interconnects (such as Myrinet) with proprietary software drivers can greatly increase the effective communication bandwidth, but usually at much greater cost. In short, using higher performance proprietary processors and interconnects involves a cost-performance tradeoff. Proprietary components not manufactured in volume can also result in significantly higher initial and recurring maintenance costs, although they can deliver significantly better performance for certain types of applications.
For parallel programs, it is important to understand the factors that affect their performance on a given system. Perhaps the most important ones are the degree of parallelization (the ratio of parallel to serial code), and the ratio of computation to communication among the nodes, both of which should be as large as possible. Figure 3 shows typical communication patterns for a parallel program running on four nodes. The greatest speedup is obtained when all processors compute all the time. However, when data or other information needs to be exchanged between the nodes, processing is interrupted and needed messages are exchanged using explicit message passing calls (e.g., MPI sends and receives) as shown. The total time for which processing is interrupted must be minimized, and it is, therefore, important to minimize both the frequency and the length of time taken to pass the messages. The latter depends on the effective communication bandwidth, while the former is a function of the parallel application and the algorithm used to implement it.
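The interplay of these two factors can be illustrated with a simple analytic model. The sketch below is not derived from the measurements in this paper: it assumes Amdahl's law for the serial fraction of the code and adds a fixed total communication time, and the names `speedup`, `parallel_fraction`, `serial_time` and `comm_time` are hypothetical.

```python
def speedup(n_nodes, parallel_fraction, serial_time=1.0, comm_time=0.0):
    """Estimated speedup on n_nodes under an assumed, simplified model.

    parallel_fraction: share of the single-node run time that parallelizes (0..1).
    serial_time:       single-node run time, in arbitrary time units.
    comm_time:         total time processing is interrupted to pass messages,
                       in the same units (assumed independent of n_nodes).
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_part = (1.0 - parallel_fraction) * serial_time
    parallel_part = parallel_fraction * serial_time / n_nodes
    return serial_time / (serial_part + parallel_part + comm_time)


if __name__ == "__main__":
    # Hypothetical inputs: four nodes, with and without communication cost.
    for p, c in [(1.0, 0.0), (0.9, 0.0), (1.0, 0.25)]:
        print(f"parallel fraction {p}, comm time {c}: {speedup(4, p, comm_time=c):.2f}x")
```

In this model, a perfectly parallel code on four nodes loses half of its ideal speedup when communication takes a quarter of the single-node run time, which is why both the frequency and the duration of message exchanges matter.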
4. Performance Evaluation
To evaluate the performance of both single and parallel processor systems, a number of standard benchmarks were used. They included:
The PARKBENCH suite consists of programs that test clock accuracy, vector operations, memory bottlenecks, system interconnect performance, and parallel performance using a number of kernels commonly used in computational fluid dynamics, computational chemistry as well as scientific modeling and simulation. Specifically:
The above benchmarks were used to evaluate the performance of personal computers and workstations with various commodity interconnects and of commercial parallel scalable machines with high performance proprietary interconnects. The personal computers evaluated were conventional 233 and 266 MHz systems with 512 KB of cache and 128 MB of memory. The commodity interconnects included 10/100 Mbps ethernet switches and OC-3 (155 Mbps) ATM switches, while the proprietary interconnect was the high speed switch used on IBM SP scalable parallel systems. Furthermore, the SP systems evaluated included four different types of nodes:
5. Results and Observations
The results of the single processor benchmarks are shown in Table 1. From these figures, the following general observations can be made:
Figure 4 shows the effective bandwidth, in MBytes/sec, as a function of message size for a variety of interconnects and protocols. Here, the effective bandwidth is assumed to be equal to m/tc where m is the message size and tc is the total time to transmit a message of length m from one processor to another. The following observations can be made:
In general, results from the COMMS2 and COMMS3 benchmarks support the above observations. The switch saturation benchmark (COMMS3), for example, showed that with 100 Mbps ethernet, each processor could transmit at 6.01 MBytes/sec, while the IBM proprietary switch supported a transmission rate of 71.16 MBytes/sec per processor.
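The effective bandwidth definition m/tc can be written out directly. The sketch below also includes an assumed linear cost model (a fixed startup latency plus transfer time at a peak rate) to show why small messages see low effective bandwidth; that model, the function names, and the 100 microsecond latency are illustrative assumptions, not values read from Figure 4. MBytes are taken as 10^6 bytes.

```python
def effective_bandwidth(message_bytes, total_time_s):
    """Effective bandwidth m / tc in MBytes/sec (1 MByte = 10**6 bytes)."""
    return message_bytes / total_time_s / 1e6


def modeled_time(message_bytes, latency_s, peak_mbytes_per_s):
    """Assumed cost model: startup latency plus transfer time at the peak rate."""
    return latency_s + message_bytes / (peak_mbytes_per_s * 1e6)


def line_rate_fraction(measured_mbytes_per_s, link_mbps):
    """Measured rate as a fraction of the nominal link rate (8 bits per byte)."""
    return measured_mbytes_per_s / (link_mbps / 8.0)


if __name__ == "__main__":
    # Hypothetical link: 100 microsecond latency, 12.5 MBytes/sec peak rate.
    for m in (100, 10_000, 1_000_000):
        tc = modeled_time(m, 100e-6, 12.5)
        print(f"{m:>9} bytes: {effective_bandwidth(m, tc):.2f} MBytes/sec")

    # COMMS3 figure from the text: 6.01 MBytes/sec on 100 Mbps ethernet.
    print(f"fraction of line rate: {line_rate_fraction(6.01, 100):.1%}")
```

Under these assumptions the effective bandwidth approaches the peak rate only for large messages, and the measured 6.01 MBytes/sec corresponds to a little under half of the nominal 12.5 MBytes/sec of a 100 Mbps link.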
Table 1. Single processor benchmark results.

| Benchmark | IBM Power2 66 MHz Thin | IBM Power2 66 MHz Wide | IBM P2SC 120 MHz | IBM P2SC 160 MHz | PC 233 MHz | PC 266 MHz |
|---|---|---|---|---|---|---|
| Max. Th. Mflops | 266 | 266 | 480 | 640 | 233 | 266 |
| Specint base 95 | - | 3.19 | - | - | 5.24 | 5.92 |
| Specfp base 95 | - | 8.51 | - | - | 7.04 | - |
| 100 x 100 (Mflops) | - | 60.81 | 61.33 | 98.77 | 19.02 | 23.52 |
| Poly 1 (Mflops) | - | 89.05 | 280.33 | 385.26 | 76.76 | 120.21 |
| Poly 2 (Mflops) | - | 81.03 | 338.92 | 855.57 | 77.69 | 132.35 |
| Rinfl (secs) | - | 12.5 | 7.47 | 4.01 | 59.27 | 45.24 |
| PTSTWM (secs) | - | 4.01 | 3.37 | 2.03 | 8.81 | 9.61 |
| BT (Mflops) | - | 53.47 | 76.05 | 90.32 | 16.04 | 20.83 |
| LU (Mflops) | - | 54.96 | 70.62 | 99.92 | 22.96 | 19.71 |
| MG (Mflops) | - | 44.52 | 50.55 | 65.53 | 9.5 | 10.70 |
| SP (Mflops) | - | 41.27 | 55.01 | 64.1 | 11.16 | 15.27 |

The results of the parallel benchmarks for 4 processors are shown in Table 2. The interconnect fabric was the 100 Mbps ethernet switch for the personal computers and the proprietary high performance switch for the IBM SP system. Comparing these results with those shown in Table 1, the following observations can be made:
ratio is highest (e.g., LU benchmark) and worse when that ratio is lowest (PTSTWM benchmark).
It is anticipated that Gigabit (1000 Mbps) ethernet interface cards and switches will become commodity parts soon, which, together with special low latency drivers, should greatly improve the parallel performance of the personal computer based system. If performance is important for PC-based supercomputers, a proprietary interconnect such as Myrinet could be used. Myrinet has a relatively low latency (about 14 microseconds) and a sustained bandwidth in the 70-90 MBps range. However, it requires special drivers, does not scale easily beyond 32 processors, and doubles the cost of PC-based supercomputers.
Table 2. Parallel benchmark results for 4 processors.

| Benchmark | IBM Power2 66 MHz Thin | IBM Power2 66 MHz Wide | IBM P2SC 120 MHz | IBM P2SC 160 MHz | PC 233 MHz | PC 266 MHz |
|---|---|---|---|---|---|---|
| BT (Mflops) | - | 163.77 | 211.15 | 287.7 | 35.4 | 47.65 |
| LU (Mflops) | - | 207.04 | 262.65 | 429.18 | 84.59 | 86.45 |
| MG (Mflops) | - | 44.52 | 81.97 | 127.19 | 14.0 | 16.8 |
| SP (Mflops) | - | 125.79 | 181.23 | 229.84 | 42.38 | 51.87 |
| PTSTWM (secs) | 2.3 | 2.06 | 1.73 | 0.94 | 16.5 | 9.24 |

These results also show that much work remains to be done to better understand how the combination of interprocessor communication characteristics and interconnect network performance affect the scalability (in terms of speedup, scaleup and CPU efficiency) of application codes running on parallel machines using commodity parts. Specifically, modest investment in faster interconnects such as Myrinet, Gigabit ethernet, or Fiber Channel interconnects may result in much improved cost/performance ratios for these "new" commodity systems and may make these personal supercomputers useful for a much broader set of applications than is currently possible.
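Speedup and CPU efficiency can be estimated from the published figures by dividing each 4-processor rate in Table 2 by the corresponding single-processor rate in Table 1. The sketch below does this for two of the systems; it assumes that both tables report the same problem size for each benchmark, which the text does not state, and the function and variable names are hypothetical.

```python
N_PROCS = 4

# (single-processor Mflops from Table 1, 4-processor Mflops from Table 2)
RESULTS = {
    "PC 233 MHz": {
        "BT": (16.04, 35.4), "LU": (22.96, 84.59),
        "MG": (9.5, 14.0), "SP": (11.16, 42.38),
    },
    "P2SC 160 MHz": {
        "BT": (90.32, 287.7), "LU": (99.92, 429.18),
        "MG": (65.53, 127.19), "SP": (64.1, 229.84),
    },
}


def speedup_and_efficiency(single_mflops, parallel_mflops, n_procs=N_PROCS):
    """Return (speedup, efficiency), where efficiency = speedup / n_procs."""
    s = parallel_mflops / single_mflops
    return s, s / n_procs


if __name__ == "__main__":
    for system, rows in RESULTS.items():
        for bench, (single, parallel) in rows.items():
            s, e = speedup_and_efficiency(single, parallel)
            print(f"{system:13s} {bench}: speedup {s:.2f}, efficiency {e:.0%}")
```

On these figures, the 233 MHz PC cluster reaches a speedup of roughly 3.7 on LU but only about 1.5 on MG, which illustrates how strongly the result depends on each code's communication behavior over the 100 Mbps ethernet switch.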
Recent hardware and software developments have made it possible for individuals and small organizations to easily build simple parallel computers using personal computers and other commodity parts. While such machines have only a fraction of the performance of their commercial counterparts for most applications, that performance is achieved at an even smaller fraction of the cost, resulting in an overall cost/performance improvement. Such machines can be built by reasonably computer literate users. They are reliable, relatively easy to use and support, and have relatively low recurring hardware and software maintenance costs. While they are not as scalable as their commercial counterparts, they make education and research in parallel computing affordable for small educational institutions, individual departments and even for individuals.
Improvements in interconnect performance and a deeper understanding of how that affects the performance of parallel application codes are needed to make these systems cost effective for a broader set of applications.
References