
August 30, 2007

Storage woes

The CIS problem report is quoted below for posterity's (and humor's) sake, but in brief:

Teague machine room tops 140F.

Clariion drive array fails, comes back up after a manual power cycle.

ACNC drive arrays do not visibly fail, but running scrub jobs on our ZFS pools reveals corruption and disk problems.

Then we discover we do not actually have enough spare drives on the shelf to replace all of those eaten by the heat.

And ZFS decides to hang on the command to add new hot spares to various pools.

It's been, seriously, ridiculous and a tad frightening. That's a lot of data to have to restore from backups, so the pressure is on to get these pools out of degraded states.
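For anyone who hasn't had the pleasure, the ZFS side of this boils down to a handful of `zpool` subcommands: scrub a pool, ask which pools are unhealthy, and add a hot spare. Here is a minimal Python sketch of those three steps, not our actual tooling. The pool name `tank`, the device name `c1t2d0`, and the helper function names are all made up for illustration, and the timeout is only meant to keep a script from waiting forever on a command that hangs; it does nothing to unstick ZFS itself.

```python
import subprocess


def scrub_cmd(pool):
    """Command that starts a scrub on the given pool."""
    return ["zpool", "scrub", pool]


def add_spare_cmd(pool, device):
    """Command that adds a device to the pool as a hot spare."""
    return ["zpool", "add", pool, "spare", device]


def unhealthy_report(timeout_s=60):
    """Run `zpool status -x` and return its output, or None if all pools are healthy.

    Raises subprocess.TimeoutExpired if the command does not finish in time.
    """
    result = subprocess.run(
        ["zpool", "status", "-x"],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if "all pools are healthy" in result.stdout:
        return None
    return result.stdout


if __name__ == "__main__":
    # Hypothetical pool and device names; substitute real ones.
    print(" ".join(scrub_cmd("tank")))
    print(" ".join(add_spare_cmd("tank", "c1t2d0")))
```

Running the file as-is only prints the commands it would issue; calling `unhealthy_report()` requires a machine with `zpool` installed.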

Around 5:45 this morning we lost power to the machine room, and any systems that didn't have a backup UPS went down.

* As of 2007/08/29 09:36: The power and air-conditioning failure in the CIS Data Center on Sunday 8/23 was the result of multiple campus power outages and a failure in the software that monitors environmental conditions in the computer room.

At around 10 pm Saturday, a feeder supplying electrical power to the building blew out and caused a dirty power failure. The UPS picked it up and switched to the backup generator. After about 12 minutes the power was restored and the UPS switched back to utility power.
This was only one of three major feeder failures that occurred Saturday night.

The power surge generated by the feeder problem caused a fault in the chillers that supply utility chilled water to the Computing Services Center building. When that happened, CIS was automatically switched over to backup chillers. Unfortunately, due to a malfunctioning valve, the backup chillers shut down sometime early Sunday morning.

The building management system is monitored by the Operations Center staff 24 hours a day via the Internet using a software product called APOGEE. This software is administered by the Physical Plant's energy management group, and the OC has access to it via the Internet.
Network hubs were down in many locations all over campus as a result of the multiple power outages, so the alarms that would normally have been generated by the APOGEE system were not available.

The software that is used to monitor the power and air-conditioning equipment in the Data Center is a product purchased from Liebert called SiteScan. It failed as well, and as a result the OC staff did not receive any notification that the Data Center was overheating until nearly 5 AM, when one of the vendors' automated call-home features sent out a notification that it had reached a temperature in excess of 100 degrees.

Soon after that, the mechanical room where the UPS is located overheated to the point that the UPS shut down and went into bypass mode. An attempt to put the UPS back online resulted in a total blackout of the whole facility.

There are some obvious improvements that need to be made to ensure that the OC does not depend on the Internet to monitor the machine room environment, and we have already taken steps to correct that.

Anyone needing more information about this outage may contact John Rauser at 979-845-8461. This is also the number to call to register a complaint or offer helpful suggestions.
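The part of the report that sticks with me is that both monitoring systems failed silently: no alarms arrived, and nobody could tell "everything is fine" from "the monitor is dead." One common way to guard against that is to treat a missing reading as an alarm in its own right. Below is a minimal sketch of that idea in Python, under my own assumptions. The `Reading` type, the `check_room` function, the 100-degree threshold, and the five-minute silence limit are all hypothetical and have nothing to do with how APOGEE or SiteScan actually work.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    timestamp_s: float  # when the sensor reported, in seconds
    temp_f: float       # reported temperature in degrees Fahrenheit


def check_room(readings, now_s, max_temp_f=100.0, max_silence_s=300.0):
    """Return a list of alarm names; an empty list means no alarm.

    Silence from the sensor is treated as an alarm, so a dead monitor
    cannot be mistaken for a healthy room.
    """
    if not readings:
        return ["no readings received"]

    alarms = []
    latest = max(readings, key=lambda r: r.timestamp_s)
    if now_s - latest.timestamp_s > max_silence_s:
        alarms.append("sensor silent")
    if latest.temp_f > max_temp_f:
        alarms.append("over temperature")
    return alarms
```

A check like this would have to run on a machine local to the room, with its own way of paging someone, for it to help in an outage that also takes down the network.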

Posted by jeff at August 30, 2007 05:26 PM
