Cluster Computing

Gerik Zayatz

zayatz2@tcnj.edu

Submitted: December 13, 2001

Research Advisor: Dr. Deborah Knox

Abstract

Cluster computing is one of the most interesting innovations in the field of parallel computing in the recent past. Due to record low prices on hardware and the availability of free, open source software, massive amounts of computing power are available to the general population. This availability allows for the creation of clusters of computers that can serve the function of one large supercomputer.

My research focused on the creation and use of a small cluster of six computers for an undergraduate research seminar. Issues addressed included software choice, installation, configuration, and networking. In addition to the creation of the cluster, a free implementation of the MPI standard, MPICH, was studied and used for parallel programming. Furthermore, visualization libraries and tools were examined and used for the study of the programs written and examined.

  1. Introduction
  2. A cluster is defined as being a parallel system that is made up of a number of complete computers that are used as if they were a single computer. A complete computer is defined as a computer containing all the necessary hardware and software to run independently. Clustering has flourished in the current world due to the decreasing cost of computers and the ease of creating networks to connect them. One of the most famous clusters is the Beowulf cluster created in 1996 by NASA that was able to achieve low end supercomputer power with 16 Pentium Pro based machines.

    The basic unit of the cluster is a computer. Clusters use complete off the shelf units instead of expensive specially designed computers made for parallel computing. Due to this, clusters can typically be built for far less money, as no special architecture needs to be used.

    The second item required for a cluster is the network. The network is used for all intracluster communication such as message passing. Clusters need the fastest network possible since the sending of data and messages between computers in the cluster is the main reason for slowdown in a parallel program. This is due to the fact that compared to transfer speeds internal to the computer network technology is extremely slow. Even a delay of a tenth of a second can effect the performance of a program immensely.

    The final required item for a cluster is a message-passing library. Message-Passing libraries allow for the creation of parallel programs on a cluster by providing functions used to send messages and data between cluster components. The two main message-passing libraries are PVM and MPI. PVM has seen a large decrease in popularity in recent years due to the creation of the MPI standards. MPI is an open standard and has a few different implementations by various groups and is the library most used today.

  3. Cluster Setup
  4. Cluster setup was an area in which my research partner and I spent a good deal of our time on. It involves hardware setup, operating system installation and configuration, network design and implementation, the setup of software packages useful to a cluster such as NFS and RSH, and the installation of the MPICH programming libraries.

    1. Hardware
    2. Our cluster consists of 6 brand new identical Dell Optiplex machines acquired for us by our Research Advisor. The machines have Pentium III 1 GHz processors and 256 megabytes of ram. They each have 40-gigabyte hard drives and 10/100 Mbit Ethernet controllers built into the motherboard.

    3. Operating System
    4. The operating system we chose to use was Linux, specifically the RedHat 6.2 distribution of Linux. The reason we chose Linux was due to the fact that it is a free, extremely stable OS with a large history in cluster computing. In addition, Linux is easier to setup and maintain for a beginner user compared to FreeBSD, another free Unix based operating system that runs on Intel hardware.

      The reason we chose to use RedHat 6.2 was that we had CDs with the operating system on it readily available for us to use for installation. Due to the fact that we received our hardware a few weeks after the beginning of the semester, we felt it better to go with what we had instead of evaluating various different distributions. In addition, a previous research project on clusters at our school the year previous used RedHat 6.2 and didn’t have any major issues with it.

    5. Network Setup
    6. The network our cluster uses runs at 10 Mbits per second, and is an Ethernet network. The reason we chose Ethernet as our physical network is because it is cheap and easy to use, our computers had Ethernet network adapters built in, and our campus has an Ethernet network also. Each computer is joined together by a 10/100 Mbit 3Com switch we acquired for free. We used a switch because it allows two computers to communicate at the full rate of the network even if other computers are talking at the same time. This is in contrast to a hub where network performance degrades as more and more computers try to use the network at the same time. A picture of our network is shown as Figure 1.

      The optimal setup for our network would be to run our network at 100 Mbits/s but other factors caused us to use it at the slower speed of 10 Mbits/s. Due to the fact that we wanted our machines to have access to the campus network and also the Internet, we hooked them all up to the campus network that runs at 10 Mbits/s in our building. This causes the entire interconnection between the cards to run at 10 Mbits/s instead of the optimal 100 Mbits/s. After coming across issues with slowdowns in our programs, we are considering finding a way in which to run our cluster at 100 Mbits/s, but also have an Internet connection. This is generally accomplished by using a gateway computer that connects to the cluster network and also the outside network. We avoided this because it would involve opening up our computers and voiding the warranty.

    7. NFS Setup
    8. NFS stands for Network File System, and it allows directories and hard drives to be shared across a network and used by multiple other machines. NFS is not required of a cluster, but is mainly included as a way to increase the ease of use and transparency of the cluster. The reason we installed NFS on our cluster was to eliminate the need for users to maintain multiple copies of their programs across our 6 different machines. Instead users are able to access the same home directory from all of the machines on our cluster and only need to maintain, compile, and run one copy of the program. NFS requires an NFS server to be setup that shares a directory or hard drive, and is accessed by clients.

      The version of NFS that came with our distribution of Linux was NFS 2. We decided to use NFS 2 instead of 3 because it is generally regarded as more stable due to the fact that it is older, and also because of time constraints.

      NFS server setup requires the NFS package to be installed on a computer. The next step is to add the startup script to the proper directory in the "/etc/rc.d" directory. This makes sure that NFS is started on boot so that a system administrator does not need to manually start the NFS server. The only file needed to setup the directories that NFS will share with other machines is the "/etc/exports" file. Each directory to be shared is inserted onto a new line and is followed on the same line by the hostnames of the machines that are able to access the directory and what permissions they have. An example exports file line is as follows:

      /home knox-cluster-10(rw) knox-cluster-11(rw)

      This line tells the NFS daemon to share the /home directory with the machines knox-cluster-10 and knox-cluster-11, and to give them read and write privileges in that directory. In order to share this directory with other machines, they only need to be added on the same line.

      NFS client setup is extremely simple and only requires the editing of one line in the "/etc/fstab" file. The fstab file is the filesystem table and contains information on which disks to mount at startup. In order for a machine to access a shared directory, a new line is put into the fstab file that lists the hostname of the NFS server followed by a colon and the directory name that is shared, then the mount point, filesystem type which is NFS, permissions, and the dump and fsck values. A sample line in a fstab file is as follows:

      knox-cluster-13:/home /mnt/home nfs rwx 0 0

      This line tells the filesystem mounter to mount the home directory from knox-cluster-13 at /mnt/home on the local machine, that this directory is an NFS volume, and also that users should be able to read, write, and execute files on the shared directory.

    9. MPICH Setup

    MPICH was the implementation of MPI that we used for our cluster, and it provides C, C++, and Fortran libraries for parallel programming. MPICH is freely available and implements the MPI Standard version 1.2, and also supports some elements of version 2 of the MPI Standard. It is available for all Unix based operating systems such as FreeBSD and Linux, and there is also a version for Microsoft Windows 2000. MPICH also allows for clusters to be created from computers running different operating systems, which can be helpful in setup.

    MPICH setup was extremely easy and only required a few steps. The first step was to download MPICH from its’ website and to unpack the tar file into the directory we wanted it in. The second step was to run the configure command which configures all of the required scripts and header files for use by a user. The final step was to run a make command which built various tools and finished the setup.

    MPICH requires a communication program to allow it to execute programs on the machines in the cluster. The two available for use are RSH and SSH. RSH is simply a remote shell program that can setup a connection to another computer and execute programs without user input. SSH is like RSH but is a much more secure method of communicating. We setup our installation of MPICH with RSH because it is much easier to setup and involves less overhead when setting up a connection, such as cryptographic key checking.

    The only setup needed for RSH to run was to install the rpm package from our CD that the operating system came on and to edit the "hosts.equiv" file. The "hosts.equiv" file tells the machine what external machines can use RSH to connect without prompting for a password. This is required because MPICH can not enter the password for the user every time an RSH connection needs to be setup.

  5. MPI
  6. MPI stands for Message Passing Interface, and is a standard used in the creation of libraries to assist in parallel programming on clusters. The version of the MPI standard that MPICH implements is version 1.2 as stated previously. MPICH provides implementations for all the MPI constants, datatypes, and functions and is available for use in C, C++, and Fortran.

    MPI provides some valuable functions to let processes being executed know about themselves and the "world" in which they are running. Processes are able to get the hostname of the machine they are on, and their rank in the world. Ranks are used in every message sent, and are extremely useful. The "world" is also an extremely important part of any MPI program. It is a constant known as MPI_COMM_WORLD and contains all the processes in the cluster that are executing the program containing the MPI code. It is also used in every MPI function call that sends messages.

    1. Datatypes
    2. MPICH provides wrappers for all the datatypes that can be sent by the message passing functions. All normal datatypes built into C, C++, and Fortran are supported such as char, int, long, float, double, short, and also supports unsigned versions of some of the datatypes. The reason MPI and MPICH implement these wrappers is to ensure date integrity when data is sent between machines running two different operating systems. For example, ints can be stored with different precision and format in Linux and Windows 2000. This becomes an issue if a cluster is created from different types of machines and data is sent between them. The wrappers ensure that data is sent properly and understood at both ends.

    3. Important MPI Functions
    4. There is a wide range of functions available in MPI and MPICH for message passing and also file handling. The most important and basic of these are Send, Receive, Broadcast, Gather, Reduce, and Barrier. These functions all have multiple forms such as non-blocking versions and are at the heart of any parallel program programmed using the MPICH libraries.

      1. Send and Receive
      2. Send and receive are the two most basic functions in MPI. They are available in blocking and non-blocking forms, and provide safe message passing between computers in a cluster. Message passing in MPI is safe because the functions for send and receive include a tag value that ensure that a message sent with a certain tag can only be received by a receive function waiting to receive a message with the same tag number. If the tag was not included in the function calls, programs could receive the wrong data and cause various errors to occur. Sends and Receives require that a known number of elements be sent or else buffer under runs and over runs can occur and cause the program to exit.

        The values passed into send are the address of the send buffer, the number of elements to be sent, the MPI datatype of the elements to be sent, the id number of the receiving machine, the message tag, and the MPI world in which the send takes place.

        The values passed into a call to the receive function are the address of the receive buffer, the number of elements to be received, the datatype of the elements, the message tag, the source, and the MPI world.

      3. Broadcast
      4. Broadcast is one of the most useful functions in MPI as it allows a process to send out a buffer from itself to every other process in the cluster group. Broadcasts are used to populate lists created by a single process and also when a control message needs to be sent by a programmer to all the machines in a cluster. Broadcast eliminates the need for multiple sends by simply using one statement. A picture of data flow in a Broadcast is shown in Figure 2.

        The values passed into a broadcast function call are the address of the send buffer, the number of elements to be sent, the datatype of the elements, the rank of the root process (the sender), and the MPI world.

      5. Gather
      6. Gather is another important MPI function that essentially does the opposite of Broadcast. Gather is executed by all processes simultaneously and sends a set of values from all processes back to a root process. The basic Gather function requires that all machines send the exact same number of elements, but there is also a Gatherv function that can accept different numbers of elements from each sending machine. A picture of Gather is shown in Figure 3.

        The values passed into the normal Gather function are the address of the send buffer, the send count, the send datatype, the address of the receive buffer, the receive count, the receive datatype, the id of the root machine, and the MPI world. The receive data is only relevant at the root machines and is ignored by the sending machines.

      7. Reduce
      8. Reduce is a function that operates in nearly the same manner as the Gather function except that it takes the values sent by the machines and executes a reducing function on them and outputs the result from the function. MPI provides a variety of functions to operate on the data such as max, sum, product, min, the standard logical operators, and bitwise operators. An example of when this function is used is to take a number of values generated by the machines and reduce them to a single value that can be compared to a value known. An example program that calculated pi made use of the Reduce function to do this. A picture of Reduce is shown in Figure 4.

        The values passed into reduce are the address of the send buffer, the send count, the address of the receive buffer, the datatype of the elements, the MPI defined operation to be executed on the elements, the id of root, and the MPI world.

      9. Barrier
      10. Barriers are extremely useful for synchronizing the machines in a cluster running a program. Barriers work by blocking a calling process until all other processes in the communication world have executed the barrier. Barriers are typically used in order to make sure that all functions have arrived at a certain point so that output or messages can be sent in the proper order. A picture of a barrier in action is shown in Figure 5. The only value passed into the barrier function call is the MPI world the machines are in.

  7. MPE
  8. MPE stands for Multi-Processing Environment and is a visualization library for use with MPI and MPICH. It provides functions that can create log files that can be read and interpreted by viewers provided with the MPICH package. These log files provide an excellent trace of program execution and can aid greatly in optimizing code and seeing where bottlenecks occur.

    MPE function calls are embedded directly into program code and are placed around segments the programmer wants to know information about. When the MPE function is called, it writes timing information to the log file, and at the end a file is created.

    Log files are viewed using a variety of viewers provided with MPICH. The viewer we used most was upshot because we could not get Jumpshot to compile on our machines. Upshot provides a simple picture of each process’ execution based on the log file generated during execution. This is normally color-coded and the times are displayed at the bottom. A picture of an upshot screen is shown in Figure 5.

  9. Example Program
  10. One of the programs we wrote was a bucket sort program that operated on a file of 120,000 ints we generated randomly. The reason we chose bucket sort is because it is one of the easiest sorting algorithms to program in parallel, and also is conceptually easy for first time parallel programmers. Bucket sort works by scanning the list and placing the numbers into a series of buckets based on value range. All the processes executing do this in parallel on separate areas of the list. Once these small buckets are created by each process, each process gathers a different set of buckets based on range. The larger buckets created by this process are then sorted using a normal sorting algorithm suck as quicksort, merge sort, or heap sort. The overall worst case time complexity for this algorithm is O( n/p + n/p (log2 n/p) ) where n is the number of elements in the list, and p is the number of processes.

    The program we created when run actually took longer than a normal heap sort running on one processor. This was due to the number of gathers we had to use in order to gather up the buckets to the proper process. If we had used a faster network, and more numbers, my research partner and I believe we could have shown a gain over the normal heap sort algorithm. In fact, when we eliminated the time needed to transmit the values from the total time, we came up with a 7x speed up, which leads us to believe we can see a gain with more values.

  11. Issues Experienced
  12. Overall the research project proceeded smoothly and everything proceeded according to plan. Of course there were the requisite problems that my partner and I encountered during our research in the semester.

    The first problem encountered was during our setup of Linux and the network because Linux was incorrectly identifying our network cards. After searching various websites, the correct driver was found and we were able to get the cluster network up and running without problems.

    The second issue we came across was during the installation of MPICH. The problem was that MPICH refused to install in the directory we told it to during installation, but we worked around the problem by only allowing one user access to MPICH for the length of the research project.

    The third issue was the fact that MPICH has poor documentation and we struggled with what some functions did at the beginning of programming. Eventually we found the website with the MPI standard, and that has extremely detailed and helpful documentation.

    Our fourth issue was that whenever something went wrong during the execution of our programs, MPICH always echoed back connection refused errors. It was extremely difficult to figure out what went wrong because of the limited amount of information given.

  13. Future Plans
  14. My research into parallel programming and cluster computing is continuing into a second semester. My goals for next semester are to do more programming and little to none hardware setup, try to expand our cluster with some extra machines we have access to, and the implementation of various graph algorithms in parallel.

    In addition, I plan on expanding my knowledge of MPE and visualization in general. MPE provides methods to interact with an X server to create windows on a desktop, and this would be an interesting way to create visualization instead of log files. A

    Another area that I plan on looking into is real time visualization using these functions. While log files are interesting to see, it is more worthwhile in my opinion to be able to view the program visualized in real time during execution.

  15. Summary

Clusters have opened the world of parallel computing to the average user and small college. They are easy to setup, and give a taste of the power of a supercomputer to a programmer.

Overall I felt like I learned a large amount from doing my research on parallel programming. Learning to administrate and setup Linux was worthwhile, and the knowledge can be used in setting up and maintaining other Unix like operating systems. In addition learning MPI was extremely valuable. Parallel programming is deceptively different from normal programming for single processor systems. This is due in part to the fact that communication between processors must be kept to a minimum, and network speed must be taken into account during programming since it is much less than IO internal to the computers. Also, debugging a parallel program is much harder due to the fact that the program may be working right for all but one processor. I felt that the visualization tools and libraries were extremely useful, and helped me to understand the programs much better as well.

Bibliography and References

Wilkinson, Barry; Allen, C. Michael. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall: 1998.

NSF/TFCC Workshop on Teaching Cluster Computing by Barry Wilkinson:

http://www.coe.uncc.edu/~abw/CCworkshop2001/index.html

MPICH Homepage:

http://www-unix.mcs.anl.gov/mpi/mpich/

MPI Standard v1.2 and 2.0:

http://www.mpi-forum.org/docs/mpi-20-html/node306.htm