Posted by Jan van den Ende on Friday November 28 2003 @ 08:55AM EST
If cluster uptimes are going to be boasted about, here is our story.
We are running several major applications for the Regio Politie Amsterdam-Amstelland (roughly, the Greater Amsterdam Police). Call room, Criminal Investigation, Recognition Service etc. are really in demand 7*24!
On April 13 1997 (a Sunday morning), after a lot of planning and preparing to allow for offline police services, there was a 'big bang' total network upgrade, which interrupted the online services for several hours. We (the VMS group) took the opportunity to go down and make various changes that are so much more cumbersome to perform "rolling". At that time the cluster consisted of 2 Alpha 2100's and a VaxStation 4000-90, running VMS 6.2-1H3 (some supporting programs were available only on VAX; they were to be replaced by other means of doing the same work). In June 1997 a third 2100 was added. In March 1999 we did a rolling upgrade from V6.2 to 7.1-1H2. A major change came in May 1999 when a second location (7 km away) was activated. An FDDI ring was established. A test system (Alpha 1200) was configured into the cluster, and moved to the other (the "dark") site.
Then two of the 2100's were removed, transported over, and added again. The 1200 left again. The hardest part was explaining to management that we didn't go down for the move.
In September 2000 an ES40 was added. In February 2001 VMS went from 7.1-1H2 to 7.2-1; the VaxStation was no longer needed and left. In May the "intestines" of the 2100's were moved into 2100A's to satisfy the need for more PCI. December 2002/January 2003 saw the upgrade to VMS 7.3-1, to prepare for the big change: the 2100A's were to be replaced and a SAN deployed. After adding 2 ES45's and an ES40, the data was moved from the HSZ40-connected SCSI disks to the (HSG80-connected) SAN. (Over 850 concealed devices, each moved at a moment when that specific device was unused. Only 5 of those had to be forced by deliberately breaking availability during the SLA-specified "potential maintenance window", 0:30-1:00.) Thereafter the 2100A's and the old disks were removed. So now we are running a cluster with an uptime of 2420 days, in which the oldest hardware (the FDDI concentrators) is only 1650 days old, and the oldest system's age is less than half the cluster uptime.
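For anyone who wants to check the day counts, here is a minimal sketch in Python. The cluster start date and the posting date come from the text above. The exact day the first ES40 joined is not stated (only "September 2000"), so the first of that month is assumed here, which gives the largest possible age for that system; treating that ES40 as the oldest system is likewise an inference from the history above.

```python
from datetime import date

# Dates taken from the post itself.
cluster_start = date(1997, 4, 13)   # the 'big bang' network upgrade
post_date = date(2003, 11, 28)      # date of this post

# Assumption: the post only says "September 2000" for the first ES40,
# so the earliest possible day is used (largest possible age).
oldest_system_added = date(2000, 9, 1)

uptime_days = (post_date - cluster_start).days
oldest_system_days = (post_date - oldest_system_added).days

print(f"cluster uptime:    {uptime_days} days")
print(f"oldest system age: at most {oldest_system_days} days")
print(f"less than half:    {oldest_system_days < uptime_days / 2}")
```

Under those assumptions the uptime comes out at the 2420 days quoted above, and the oldest system stays under half of that.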
Daily peak concurrent usage is some 600+ interactive users, 50+ batch jobs, 30+ network jobs, and 180+ detached processes ("other mode", mostly call-room service processes and the server end of the radio-connected MobileDataTerminals in the police cars). Even more interesting, because they signify the need to not go down: weekly LOW usage is some 50+ interactive, 40+ batch, 20+ net, and 170+ detached.
The total environment is far from unchanging: about twice a month one application or another is upgraded (most support rolling upgrades), and there are about 100-150 personnel changes and some 200-300 application (de-)authorizations per week.
All this is maintained by only 3 people: Frank Wagenaar (full time), Anton van Ruitenbeek (40%), and myself (full time).