Ever wonder what's going on behind the scenes (besides some elevated heart rates) when SCLS is having problems?
Tuesday, around 1:30, our building lost power due to a transmission line outage. Luckily, our server room equipment and many of our PCs and phones are on backup power units called UPSes. Unfortunately, these battery backups can't power us forever.
So what happened?
- We updated the SCLS Status Wiki -- it's a super-quick way for us to get information out about what's going on. In this case the network was also down, so the wiki couldn't be reached from libraries' SCLS network PCs, but it could still be accessed from smartphones or other non-SCLS devices.
- We called our power provider to see if they had information.
- We divided up a phone list and everybody with access to a phone started calling libraries.
- We started shutting down the server room equipment in a controlled fashion. Doing this ensures that when the UPSes run out of juice the servers and other equipment don't come crashing to a halt in a very unfriendly, bad-for-recovery sort of way.
Because the server room equipment was running on backup, libraries didn't even realize we were having issues right away. Often as we were calling to let them know we had problems, they were telling us they "just went down."
After the power was restored (around 1:56), we started bringing the equipment back up. This is a tricky process, because there's a definite order to follow to make sure everything works. A very simplified version looks something like this:
- Start up device A. Wait for it to finish booting up. Confirm it is working correctly.
- Start up devices B and C (which depend on being able to talk to device A). Wait for them to finish booting up. Confirm they are working correctly.
- ...and so on, and so on until everything is back up and working.
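For the curious, here's a rough sketch in Python of that "start it, wait, confirm, then move on" idea. It is only an illustration, not the procedure we actually run: the device names, the extra device D, and the `start` and `check` functions are all made up for the example. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to work out which devices are safe to start next.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each device lists the devices it must be
# able to talk to before it can start. "D" is invented for illustration.
DEPENDS_ON = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def bring_up(depends_on, start, check):
    """Start devices in dependency order, confirming each batch.

    `start(name)` powers a device on; `check(name)` returns True once the
    device is confirmed working. Both are placeholders supplied by the caller.
    Returns the list of devices in the order they were confirmed.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    confirmed = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies confirmed.
        batch = sorted(sorter.get_ready())
        for name in batch:
            start(name)
        for name in batch:
            if not check(name):
                # Stop here: nothing that depends on this device gets started.
                raise RuntimeError(f"{name} did not come up cleanly")
            confirmed.append(name)
            sorter.done(name)
    return confirmed


if __name__ == "__main__":
    order = bring_up(
        DEPENDS_ON,
        start=lambda name: print(f"starting {name}"),
        check=lambda name: True,  # pretend every device comes up fine
    )
    print("confirmed order:", order)
```

Starting a whole batch and then checking it mirrors the "start up devices B and C, wait, confirm" step above. If a check fails, the sketch stops rather than starting anything that depends on the broken device.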
As soon as everything was powered up and we had confirmation that all the services were working again, we called the libraries to let them know that we were officially back up, everything should be working, and they should call the Help Desk if they ran into any services that weren't working.
A power outage is a bigger-than-usual service disruption, but other types of service outages are handled in a similar way. Usually some folks are troubleshooting and fixing the problem, while others are doing their best to communicate it to the libraries.
Whew! So glad these sorts of things don't happen every day!
Feedback? Please let us know!