Archive for the ‘Email’ Category
What happened to our e-mail servers, or… when RAID arrays go bad
NB: I started writing this blog post back in September, a few days after the initial problems with the Staff and PGR e-mail service, so if it seems like I'm talking about "last week" when actually I'm referring to September, that's why!
NB: This blog posts deals with the technical work we did to get the e-mail service working again. It doesn't discuss the decision making process that we went through. A number of people were involved in planning the work.
I'm sure everyone knows we've had some problems with e-mail over the past few days. I thought I'd write a technical post of what happened and what we did to fix it.
The initial problem started on one of mail servers called "gazelle" at 21:02 on the 30th August. The first incident is recorded in the mail logs as:
2010-08-30 21:02:05 1OqAXR-0001j0-05 == <USER>@gazelle.shef.ac.uk R=cyrus_user T=cyrus_deliver defer (-44): LMTP error after RCPT TO:<USER@gazelle.shef.ac.uk>: 451 4.3.0 System I/O error
This error basically indicates a problem with the machine's ability to write the incoming e-mail out to it's hard disks. The recipient's name has been changed to protect the innocent, although it happened to be one of our helpdesk staff!
The server also keeps more "low-level" logs which details the specific problem:
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g600c0ff000000000007d4d6366716900 (ssd8):
Aug 30 21:02:00 gazelle Error for Command: read(10) Error Level: Retryable
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] Requested Block: 483039280 Error Block: 483039280
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] Vendor: SUN Serial Number: 63667169-00
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g600c0ff000000000007d4d6366716900 (ssd8):
Aug 30 21:02:00 gazelle Error for Command: read(10) Error Level: Fatal
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] Requested Block: 483039280 Error Block: 483039280
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] Vendor: SUN Serial Number: 63667169-00
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] Sense Key: Hardware Error
Aug 30 21:02:00 gazelle scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
The first error is indicated as "Retryable", which means what the server can try to talk to the storage again. The second is indicated as "Fatal", which means that the storage is now unavailable.
So, what happened?
gazelle has a RAID 5 disk array attached to it (if you're not sure what RAID is, the wikipedia RAID article has a good description). This RAID array is the main storage for all staff e-mail and it is split into 4 "Logical Drives" each with 4 "Volumes" on them, making 16 in total.
RAID 5 is designed to allow you to put together a group of hard disks, and if any one of them fails, you can keep going with degraded performance. However, if you lose two or more drives, all the data across those drives is useless. The technical details of this are explained in the wikipedia article about Parity.
In our disk array, each of our Logical Drives is constructed of 5 hard disks. At some point prior to 9:00pm on Monday night, one of the disks in the array failed. When this happens, the array picks a spare disk and reconstructs the data off the failed disk using the parity algorithm. It started doing this automatically. Unfortunately, whilst rebuilding the spare disk, the array marked another disk in the array as failed. This had the effect of taking that entire Logical Drive offline and marking it as dead.
This single failed Logical Drive meant that the 4 volumes on that Logical Drive were now unavailable. That is the equivalent of 25% of the staff e-mail. Normally we would notice such a failure immediatly through our automated monitoring, but because 75% of the e-mail was still working, the monitoring systems didn't notice.
When I arrived in the office, I already knew there must be a problem. My mobile had rung about 10 times on my way in to work and when I got in there were a number of the top brass in our office talking with my colleague.
The problem with the RAID array made itself apparent and it said that two disks had failed. However, after a colleague and I pulled some disks out, restarted the controller and reinserted some disks, it claimed the volume was back in a "CRITICAL" (but repairable) state. At this point we attempted to rebuild the critical volume onto a spare disk. This process takes quite a while, up to 10 hours on this array, so we let it go and decided to keep the mail server unavailable whilst it was repairing. We could have re-enabled access but given the fragile state the array was in we left the service unavailable. With hindsight, this turned out to be a prudent move as at about 3pm, the array failed in a similar way to the way it had previously.
Up until this point, we were confident that we would be able to recover all the data from the failed array, but now the array had failed again, we decided to start the tape recovery. Tape backups are a very useful thing to have in these circumstances, but they suffer from a couple of flaws:
1. They are slow. Tape restore takes a very long time.
2. Restoring from tape means that we were guaranteed to lose data. Initially this sounds like a strange thing to say, but it is because there is a gap between where the last backup was taken and when the array failed. In that time, many people will have moved e-mail into folders, deleted mail, received new mail. Restoring from a previous tape backup means we lose all of that data.
After a bit of scratching of heads, I thought of another way to start recovering data from the failed array. We started doing that by about 5pm on the Tuesday evening and left it going overnight. That process finished at about 4am the next morning and I had set my alarm clock so I could catch it and start a second rebuild of the array. I arrived in the office at about 7 and the process was still running, but as previously, it failed around 11am. By this time we had managed to restore a quarter of the failed volumes user accounts back onto some other storage, so we took the chance to bring the e-mail service back up for users where we had accounts. We also kicked off the restore of the remaining affected accounts from tape.
As a last attempt to recover e-mail from the array, I removed some disks and slotted them back in and restarted the disk controllers. This brought the array back to the critical state. In this state it is possible to access the data, but it is not a good way to run the e-mail service. We knew that there would be some e-mails missing in the copies that we had recovered from tape, but if we could recover the e-mails from the failed storage array then we could compare the two and redeliver the missing emails to those accounts.
The email server manager wrote a tool to work out which e-mails were missing whilst I worked on getting the data off the fragile array. In these circumstances, it is usually a good idea to use the simplest tool possible, so I chose one of the oldest and most reliable tools available - tar. With the array still in the critical state, we used tar to copy the files off the and onto our NetApp filers. Over the weekend, our e-mail administrator wrote a tool to redeliver mails that were missing from the gap between the tape backup and when the array failed and then the following week, we ran it to redeliver the mails.
In the end, I don't think we lost any e-mail, although some were missing from inboxes for a few days whilst we recovered them. It's probably worth mentioning that we are currently going through a process to migrate all our e-mail to Google, so we shouldn't have a problem like this again. We'll have different ones instead!





