OH &^%$, What Just Happened?!?

August 23, 2009 Michael Cruse

The morning was just humming along, and we were working quite diligently on our list of maintenance activities. We had scheduled eight hours of system downtime and were progressing with only a few bumps in the plan. It actually looked like we were going to finish our entire maintenance list and get out of the office a little early. Everyone was moving along with focus, so we could get back to enjoying our weekend. Then the computer gods looked down on us mortals and laughed.

Within a few moments, the gentle rumbling of “damn, what just happened” started to echo among the IT staff. It did not take long to track down the problem: the core network switch was in an error state. This is a bit problematic, as most of our network traffic flows through our core HP Procurve 5308xl switches. We are a small shop with a relatively simple infrastructure. These HP Procurve switches have done exceedingly well for us over the years and have been absolutely bulletproof. That is, until now.

We had just completed a firmware upgrade on one of these devices with no problems, but now it was unresponsive. The console port told us to reapply the firmware or replace the chassis. This is not the type of device that you just go to Fry’s and pick up. This is the ‘Oh Crap’ moment that IT managers and staff dread.

We were fortunate: after an hour and a half we had the firmware updated and the switch resumed normal operation. It was a tense hour and a half while we waited for the firmware to load and then did our best to make the switch fail.

While we were waiting, I reviewed our networking failover plan. I wrote this plan a while ago and thought that it was still accurate. Fortunately, the plan was still spot on with what should be done if the switch had failed the second update. Unfortunately, the assumptions of the plan were not so accurate. They referenced standby equipment that had been disposed of or decommissioned due to age. After a bit of scrambling, I had identified equipment that could be re-purposed, counted the ports we needed, identified the ports that needed certain classes of service, and felt comfortable we could be fully operational in a few hours. I cannot express my delight when we realized we would not have to implement our failover plan.
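The scrambling I did by hand that morning boils down to a simple check, and it could be run any time rather than during an outage. Below is a minimal sketch in Python, not something we actually had in place. The `Spare` record, the `check_plan` function, the device names, and the port counts are all hypothetical and only illustrate the idea: confirm that every piece of standby equipment the plan references still exists, then count whether the remaining gear has enough ports, including ports that can provide the needed class of service.

```python
from dataclasses import dataclass


@dataclass
class Spare:
    """One piece of standby equipment (hypothetical record layout)."""
    name: str
    ports: int        # total usable ports
    qos_ports: int    # ports that can provide the needed class of service
    in_service: bool  # False once disposed of or decommissioned


def check_plan(plan_equipment, inventory, ports_needed, qos_ports_needed):
    """Return a list of problems; an empty list means the assumptions hold."""
    problems = []
    usable = []

    # Every device the plan names must still exist and be usable.
    for name in plan_equipment:
        spare = inventory.get(name)
        if spare is None or not spare.in_service:
            problems.append(f"{name}: referenced by the plan but not available")
        else:
            usable.append(spare)

    # The usable devices must cover the port counts the plan depends on.
    total_ports = sum(s.ports for s in usable)
    total_qos = sum(s.qos_ports for s in usable)
    if total_ports < ports_needed:
        problems.append(f"need {ports_needed} ports, only {total_ports} available")
    if total_qos < qos_ports_needed:
        problems.append(
            f"need {qos_ports_needed} class-of-service ports, only {total_qos} available"
        )
    return problems


if __name__ == "__main__":
    # Made-up inventory and requirements, purely for illustration.
    inventory = {
        "spare-a": Spare("spare-a", ports=24, qos_ports=8, in_service=True),
        "spare-b": Spare("spare-b", ports=24, qos_ports=0, in_service=False),
    }
    for problem in check_plan(["spare-a", "spare-b"], inventory, 40, 4):
        print(problem)
```

Running something like this against a current inventory list would have flagged the decommissioned standby gear long before the switch gave us a reason to look.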

It is important for any IT group to document their failover plans and regularly review them to make sure that they can actually be implemented. One of the challenges of working in a small environment is that you tend to feel like you are “all knowing and all seeing.” Unfortunately, there are too many moving parts, even in a small network, to remember everything. This is why failover plans must be documented and reviewed at least semiannually. My failover plan had not been reviewed in the last year, and that was my failure. Let me tell you that Sunday was a fun day as I dug out much of our documentation and started reading and reading.
