On 3/31/2014, I experienced one of the largest IT outages of my career. I remember the specific date because it was just before my birthday and I had plans to hang out with my girlfriend (now wife). Shortly before I got ready to leave for the day, I received a phone call from our IT Operations department. One of our store-facing systems was offline.
I’ve worked in IT for a long time and have yet to see a situation as urgent as an outage that impacts a store. This felt like it would be an all-hands-on-deck situation. There weren’t many people left in the office at that time. I yelled to my friend Kurt to see if he had seen any alerts or was aware of the issue.
Kurt and I started to dig through system logs. We were trying to identify the root cause of the problem. We quickly found out that critical software had been uninstalled from a server. The log message included a technician’s name. I gave the tech a call on their cell phone.
The technician ran over to my desk. They were testing a future upgrade. The tech wasn’t aware that the testing would impact our production system. They were obviously very upset.
Kurt and I began working out how to resolve the issue. We quickly realized that the best course of action was to complete the upgrade. We started to contact the other impacted teams so that we could share our plan. The teams all agreed that this was the best way to proceed.
The technician who inadvertently caused the problem was still fretting about the issue itself. Kurt and I took a few moments to remind the technician to keep the situation in perspective. Although the issue was store-facing, it was relatively small in scope, and a manual process had already been implemented at the stores as a stopgap. The world would keep spinning.
Was this a big deal? Absolutely. However, Kurt and I both realized that we shouldn’t spend too much time worrying about the impact of the issue. We needed to keep perspective and focus on fixing the problem.
We eventually fixed the issue by proceeding with the upgrade. I could see the relief on the tech’s face after we received confirmation that the testing was successful. We knew that we would need to make some changes to ensure this would not happen again. Until then, we knew that the world would keep spinning.