Archive for the ‘cics’ Category
I love it when a plan (almost) comes together
Last weekend a number of colleagues and I did some fairly major upheaval of the uSpace setup, most of which should be fairly unnoticeable to our customers with the exception that it will now hopefully be a bit more reliable! There were two main four main components to this work:
- Upgrade the uSpace software from 4.0.14 to 4.5.5. This brings us up to date with the manufacturers current release, which gives us a number of benefits, including the ability to properly cluster the front end web servers (the servers that people actually connect to when you click on uSpace in MUSE or go to http://uspace.shef.ac.uk/).
- Enable clustering across a pair of web servers, rather than the single web server we had before. This means if we need to update or restart the webservers, we should be able to minimise the unavailability of the uSpace service.
- Move the uSpace software from physical to virtual servers running on VMware. This makes it easy for us to scale and manage the service.
- Move the uSpace binary content (eg, Word Documents) out of the database and into a filesystem. Again, this is a technical measure which makes it easier to manage the service.
The upgrade process had been thoroughly tested before it was started, but like the best laid schemes of mice and men, it went a bit wrong whilst we upgrading the live system. This required a slight alteration to the upgrade process which hadn't been planned. It all worked out OK in the end.
There were a couple of User Interface bugs which slipped through QA unnoticed and these were corrected yesterday afternoon and yesterday evening. There was also a problem with some uploaded documents not appearing correclty in uSpace. Unfortunately, these documents will have to be uploaded again, but I believe the numbers are very small.
What happened to our e-mail servers, or… when RAID arrays go bad
NB: I started writing this blog post back in September, a few days after the initial problems with the Staff and PGR e-mail service, so if it seems like I'm talking about "last week" when actually I'm referring to September, that's why!
NB: This blog posts deals with the technical work we did to get the e-mail service working again. It doesn't discuss the decision making process that we went through. A number of people were involved in planning the work.
I'm sure everyone knows we've had some problems with e-mail over the past few days. I thought I'd write a technical post of what happened and what we did to fix it.
The initial problem started on one of mail servers called "gazelle" at 21:02 on the 30th August. The first incident is recorded in the mail logs as:
2010-08-30 21:02:05 1OqAXR-0001j0-05 == <USER>@gazelle.shef.ac.uk R=cyrus_user T=cyrus_deliver defer (-44): LMTP error after RCPT TO:<USER@gazelle.shef.ac.uk>: 451 4.3.0 System I/O error
This error basically indicates a problem with the machine's ability to write the incoming e-mail out to it's hard disks. The recipient's name has been changed to protect the innocent, although it happened to be one of our helpdesk staff!
The server also keeps more "low-level" logs which details the specific problem:
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g600c0ff000000000007d4d6366716900 (ssd8):
Aug 30 21:02:00 gazelle Error for Command: read(10) Error Level: Retryable
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] Requested Block: 483039280 Error Block: 483039280
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] Vendor: SUN Serial Number: 63667169-00
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g600c0ff000000000007d4d6366716900 (ssd8):
Aug 30 21:02:00 gazelle Error for Command: read(10) Error Level: Fatal
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] Requested Block: 483039280 Error Block: 483039280
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] Vendor: SUN Serial Number: 63667169-00
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] Sense Key: Hardware Error
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
The first error is indicated as "Retryable", which means what the server can try to talk to the storage again. The second is indicated as "Fatal", which means that the storage is now unavailable.
So, what happened?
gazelle has a RAID 5 disk array attached to it (if you're not sure what RAID is, the wikipedia RAID article has a good description). This RAID array is the main storage for all staff e-mail and it is split into 4 "Logical Drives" each with 4 "Volumes" on them, making 16 in total.
RAID 5 is designed to allow you to put together a group of hard disks, and if any one of them fails, you can keep going with degraded performance. However, if you lose two or more drives, all the data across those drives is useless. The technical details of this are explained in the wikipedia article about Parity.
In our disk array, each of our Logical Drives is constructed of 5 hard disks. At some point prior to 9:00pm on Monday night, one of the disks in the array failed. When this happens, the array picks a spare disk and reconstructs the data off the failed disk using the parity algorithm. It started doing this automatically. Unfortunately, whilst rebuilding the spare disk, the array marked another disk in the array as failed. This had the effect of taking that entire Logical Drive offline and marking it as dead.
This single failed Logical Drive meant that the 4 volumes on that Logical Drive were now unavailable. That is the equivalent of 25% of the staff e-mail. Normally we would notice such a failure immediatly through our automated monitoring, but because 75% of the e-mail was still working, the monitoring systems didn't notice.
When I arrived in the office, I already knew there must be a problem. My mobile had rung about 10 times on my way in to work and when I got in there were a number of the top brass in our office talking with my colleague.
The problem with the RAID array made itself apparent and it said that two disks had failed. However, after a colleague and I pulled some disks out, restarted the controller and reinserted some disks, it claimed the volume was back in a "CRITICAL" (but repairable) state. At this point we attempted to rebuild the critical volume onto a spare disk. This process takes quite a while, up to 10 hours on this array, so we let it go and decided to keep the mail server unavailable whilst it was repairing. We could have re-enabled access but given the fragile state the array was in we left the service unavailable. With hindsight, this turned out to be a prudent move as at about 3pm, the array failed in a similar way to the way it had previously.
Up until this point, we were confident that we would be able to recover all the data from the failed array, but now the array had failed again, we decided to start the tape recovery. Tape backups are a very useful thing to have in these circumstances, but they suffer from a couple of flaws:
1. They are slow. Tape restore takes a very long time.
2. Restoring from tape means that we were guaranteed to lose data. Initially this sounds like a strange thing to say, but it is because there is a gap between where the last backup was taken and when the array failed. In that time, many people will have moved e-mail into folders, deleted mail, received new mail. Restoring from a previous tape backup means we lose all of that data.
After a bit of scratching of heads, I thought of another way to start recovering data from the failed array. We started doing that by about 5pm on the Tuesday evening and left it going overnight. That process finished at about 4am the next morning and I had set my alarm clock so I could catch it and start a second rebuild of the array. I arrived in the office at about 7 and the process was still running, but as previously, it failed around 11am. By this time we had managed to restore a quarter of the failed volumes user accounts back onto some other storage, so we took the chance to bring the e-mail service back up for users where we had accounts. We also kicked off the restore of the remaining affected accounts from tape.
As a last attempt to recover e-mail from the array, I removed some disks and slotted them back in and restarted the disk controllers. This brought the array back to the critical state. In this state it is possible to access the data, but it is not a good way to run the e-mail service. We knew that there would be some e-mails missing in the copies that we had recovered from tape, but if we could recover the e-mails from the failed storage array then we could compare the two and redeliver the missing emails to those accounts.
The email server manager wrote a tool to work out which e-mails were missing whilst I worked on getting the data off the fragile array. In these circumstances, it is usually a good idea to use the simplest tool possible, so I chose one of the oldest and most reliable tools available - tar. With the array still in the critical state, we used tar to copy the files off the and onto our NetApp filers. Over the weekend, our e-mail administrator wrote a tool to redeliver mails that were missing from the gap between the tape backup and when the array failed and then the following week, we ran it to redeliver the mails.
In the end, I don't think we lost any e-mail, although some were missing from inboxes for a few days whilst we recovered them. It's probably worth mentioning that we are currently going through a process to migrate all our e-mail to Google, so we shouldn't have a problem like this again. We'll have different ones instead!
Lunch in Weston Park and getting down to some work!
Today I managed to get out for a walk with Chris and Dan for some lunch. We popped into the Weston Park Museum for something to eat and then went for a quick wander and took a couple of pictures. It's great to be back in Sheffield, especially when the weather is so good and I can get an hour out at lunch to have a walk about!
One of the first things I have been asked to do is to try and break into the machines. I am in the unusual position of having quite a bit of knowledge about the internal systems from my previous employment at the University but as yet I haven't been given access to much. My task has been to try and see what I can break into. We manage user access to most of our systems centrally but administrator access is managed using a much less formal approach, mainly because it is not a common event to add or remove systems administrators, so the effort involved in trying to formalise it isn't very worthwhile.
So far, the systems are holding up pretty well; I managed to log in to an old server related to the MUSE portal and I managed to gain administrator access to the iceberg HPC cluster without too much trouble. In both cases I managed to get in using old credentials from when I worked here 3 years ago. These don't present an immediate threat to our systems security but it does show an area where we should look to improve the way we manage access.
I've also been getting up to speed with the way our Solaris systems work now. When I left the University we used a fairly traditional approach to running systems. A server would be set up to run specific applications. Now we use features like Solaris Cluster and Solaris Zones to create virtualised servers that allow us consolidate many different applications onto fewer larger servers. This gives many benefits in terms of the way we manage our servers but also lowers the total number of physical machines we have. That has quite a few cost saving benefits for us, partly in terms of saving electricity but also lowering our total maintenance spend, which is a significant part of the total cost of running a server. Over time we are aiming to move more of our services onto Zones. One of the major project we in our team have is to move the servers that run the SAP software to a new cluster of 3 machines. They currently run on 13 power-hungry servers, which at some point we hope to be able to flog back to our supplier! Good for the planet and good for the University!







