Part of my day job is managing our HPC cluster. This is an 8-node Rocks cluster that was installed maybe a week after I started. I was still a bit green at that point and failed to get a good grasp on some things at the time, like how to maintain and upgrade the thing, and I have recently been paying for that.
Apparently, the install we have doesn’t have a clear-cut way to do errata and bug fixes. It was an early version of the cluster software. Well, after some heated discussions with our Dell rep about this, I decided what I really needed to do was a bit of research to see what the deal really was and if I could get us upgraded to something a bit better and more current.
Along came my June 2009 issue of The Linux Journal which just happened to have a GREAT article in it about installing your very own Rocks Cluster (YAY!). Well, I hung on to that issue with the full intention of setting up a development/testing cluster when I had the chance. And that chance came just the other day.
Some of you probably don’t have a copy of the article, and I needed to do some things a bit differently anyhow, so I am going to try and summarize here what I did to get my new dev cluster going.
Now what I needed is probably a little different than what most people will, so you will have to adjust things accordingly, and I’ll try to mention the differences as I go along where I can. First off, I needed to run the cluster on Red Hat proper and not on CentOS, even though CentOS is much easier to get going. I am also running my entire dev cluster virtually on an ESX box, while most of you would be doing this with physical hardware.
To start things off I headed over to the Rocks Cluster website, where I went to the download section and then to the page for Rocks 5.2 (Chimichanga) for Linux. At this point, those of you who do not specifically need Red Hat should pick the appropriate version of the Jumbo DVD (either 32 or 64 bit). What I did was grab the ISOs for the Kernel and Core Rolls. Those 2 CD images plus my DVD image for RHEL 5.4 are the equivalent of the one Jumbo DVD ISO on the website, which uses CentOS as the default Linux install.
Now at this point, you can follow the installation docs there (which are maybe *slightly* outdated?), or just follow along here, as the install is pretty simple really. You will need a head node and one or more compute nodes for your cluster. Your head node should have 2 network interfaces and each compute node 1. The idea here is that your head node will be the only node of your cluster that is directly accessible on your local area network, and that head node will communicate with the compute nodes on a separate private network. Plug the eth0 interface of every node, head and compute alike, into a separate switch, and plug eth1 of your head node into your LAN. Turn on your head node and boot it from the Jumbo DVD or, in the case of the RHEL people, from the Kernel CD.
The Rocks installer is really quite simple. Enter “build” at the welcome screen. Soon you will be at the configuration screen. There you will choose the “CD/DVD Based Rolls” selection, where you can pick the Rolls you want. I chose everything except the Sun specific stuff (descriptions of which Rolls do what are in the download section). Since I was using RHEL instead of the CentOS on the Jumbo DVD, I had to push that “CD/DVD” button once per CD/DVD and select what I needed from each one.
Once the selections are made, the installer asks you for information about the cluster. Only the FQDN and cluster name are really necessary. After that you are given the chance to configure your public (LAN) and private network settings, your root password, time zone and disk partitioning. My best advice here would be to go with the defaults where possible, although I did change my private network address settings and they worked perfectly. Letting the partitioner handle your disk partitioning is probably best too.
A quick note about disk space: if you are going to have a lot of disk space anywhere, it’s best on the head node, as that space will be put in a partition that will be shared between compute nodes. Also, each node should have at least 30 GB of hard disk space to get the install done correctly. I tried with 16 GB on one compute node and the install failed!
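If you want to check a disk against that number before kicking off an install, here is a minimal sketch. It assumes a Linux box where /sys/block/<device>/size reports the disk size in 512-byte sectors. The function names and the sda device name are made up for illustration, and the 30 GB minimum is just the figure that worked for me above, not an official Rocks requirement.

```shell
#!/bin/sh
# Minimum disk size in GB. 30 is the figure from this post (an assumption,
# not an official requirement).
MIN_GB=30

# Convert a count of 512-byte sectors to whole GB (2^30 bytes).
# 2097152 sectors * 512 bytes = 1 GB.
sectors_to_gb() {
    echo $(( $1 / 2097152 ))
}

# Succeed if the size in GB ($1) meets the minimum ($2, default MIN_GB).
big_enough() {
    [ "$1" -ge "${2:-$MIN_GB}" ]
}

# Report on one disk, e.g. "check_disk sda" (the device name is hypothetical).
check_disk() {
    sectors=$(cat "/sys/block/$1/size") || return 2
    gb=$(sectors_to_gb "$sectors")
    if big_enough "$gb"; then
        echo "$1: ${gb} GB, ok"
    else
        echo "$1: ${gb} GB, below the ${MIN_GB} GB minimum"
        return 1
    fi
}
```

Running check_disk sda from a rescue or live environment on the node would then tell you whether the disk clears the bar.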
After all that (which really is not much at all), you just sit back and wait for your install to complete. After completion the install docs tell you to wait a few minutes for all the post install configs (behind the scenes I guess) to finish up before logging in.
Once you are at that point and logged into your head node, it is absolutely trivial to get a compute node running. First, from the command line on your head node, run “insert-ethers” and select “Compute”. Then power on your compute node (do one at a time) and make sure it’s set to network boot (PXE). You will see the MAC address and compute node name pop up on your insert-ethers screen, and shortly thereafter your node will install itself from the head node, reboot, and you’ll be rockin’ and rollin’!
Once your nodes are going, you can get to that shared drive space on /state/partition1. You can run commands on the hosts by doing “rocks run host uptime”, which would give you an uptime on all the hosts in the cluster. “rocks help” will help you out with more commands. You can ssh into any one of the nodes by simply doing “ssh compute-0-1” or whichever node you want.
Now the only problem I have encountered so far is a compute node that didn’t want to install correctly (probably because I was impatient). I tried reinstalling it and it somehow got a new node name from insert-ethers. In order to delete my bad info in the node database that insert-ethers maintains, I needed to do a “rocks remove host compute-0-1” and then “rocks sync config” before I was able to make a new compute-0-1 node.
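That cleanup could be wrapped in a small shell function. This is only a sketch: the function names run and forget_node are made up for illustration, and the two rocks commands are the ones quoted above. Setting DRY_RUN=1 prints the commands instead of running them, which is handy for a sanity check before touching the node database.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove a stale node record ($1 is the node name), then regenerate the
# cluster config. The sync only runs if the remove succeeded.
forget_node() {
    run rocks remove host "$1" && run rocks sync config
}
```

On the head node that would be forget_node compute-0-1, followed by another round of insert-ethers and a PXE boot of the node.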
So now you and I have a functional cluster. What do you do with it? Well, you can do anything on there that requires the horsepower of multiple computers. Some things come to mind, like graphics rendering, and there are programs and instructions on the web on how to do those. I ran Folding@home on mine. With a simple shell script I was able to set up and start Folding@home on all my nodes. You could probably do most anything the same way. If any of you find something fantastic you like to run on your cluster, be sure to pass it along and let us know!
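For anyone wondering what such a script might look like, here is a minimal sketch of the general idea. It is not the script I used. It assumes the nodes follow the compute-0-<n> naming seen above, counting up from 0, and that ssh from the head node to each compute node works as described earlier. The function names are made up for illustration.

```shell
#!/bin/sh
# Print the names of the first N compute nodes, one per line.
# Assumes the compute-0-<n> naming from this post, counting from 0.
node_names() {
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "compute-0-$i"
        i=$((i + 1))
    done
}

# Run one command on each of the first N nodes over ssh.
# Usage: run_on_nodes <node count> <command> [args...]
# Set DRY_RUN=1 to print the ssh commands instead of running them.
run_on_nodes() {
    count=$1
    shift
    for node in $(node_names "$count"); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh $node $*"
        else
            ssh "$node" "$@"
        fi
    done
}
```

With 8 nodes, run_on_nodes 8 uptime would visit each node in turn. Swap uptime for whatever starts your job.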