Double Downtime: Good and Bad
I made the glorious mistake of informing my coworkers that I was going to take the week off. My first vacation in over four years. I jinxed us. I’d apologize if I didn’t think the cosmos should say sorry first. You see, while I’ve certainly gone out of town with the family, I’ve always ended up working what most would consider a full week. This week was to be different. We were making plans to head down to Destin, FL, for some R&R. We ultimately cancelled those plans the day before departure because the weather there wasn’t going to be any different from the weather at home. The beach at 65 degrees with a strong breeze is not that fun. So we ended up with a stay-cation. While it wasn’t quite as fun, it turned out to be a blessing in disguise: Onspring’s disaster recovery (DR) procedures would be tested that week.
Good vs. Bad Downtime
For the first time in over two years, Onspring had unplanned downtime in our primary Cloud infrastructure. Being down for even a minute sucks. While we were ultimately unavailable for around four hours in the early morning, a series of events over the preceding four days led to the outage. My breakdown of those events is detailed below.
But first, a preface to the ultimate issue: Onspring encrypts *all* network communications between systems. This relies on a centralized [not] redundant [enough] Certificate Authority for issuing and validating certificates.
The Rough Ride to Downtime
On Tuesday, we arose to the joy of numerous early morning alerts regarding the primary disk cluster for a system which held a tertiary Active Directory (AD) Domain Controller, our original Certificate Authority (CA) and a half-dozen test systems of various configurations. The primary concern was the AD and CA functionality. Within a few hours we had the machine in working order, and we proceeded to back up the one critical function that required it (the CA). We saw no production downtime during this process.
We formulated a plan to fail over our CA to a secondary system built on newer technology that allows for better, automated fail-over. By Wednesday afternoon we had completely removed all AD functionality from the primary system and had fully migrated all of the CA duties to the secondary system. We disabled the CA service on the original system to avoid conflicts and reliability issues while we confirmed that full functionality had returned.
Friday morning, I awoke later than normal and looked at the clock on my phone. But instead of the time, I saw the face of our support director, who was trying to reach me. Why in the world would she be calling me while I’m on vacation? My heart immediately sank. I knew it: downtime.
As it turns out (and this is not heavily documented), we cannot simply move our CA without also keeping the exact same hostname, and in some cases the IP address. What we experienced was a strange mix of most-things-work-but-some-things-don’t. Nearly all of our network was still encrypting communication in proper order. Those systems paid no attention to CA validation if they had already validated a certificate in the past. They assumed it was fine unless and until they received a specific revocation.
When connecting from application code, however, that cached validation only lasts for the lifetime of the running software. Our production web servers flush their memory and restart their application code every few days in the normal course of operations, so after the migration they came back up without any ability to validate the certificates of our databases. And there’s the problem. Our dedication to a secure and redundant environment created the problem, because we were not aware of the super-strict rule our code enforced: the CA functionality had to be found at the exact same hostname.
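To make the failure mode concrete, here is a toy model of it in Python. It is only a sketch under stated assumptions, not our actual code: the Validator class, the hostnames, and the certificate name are all made up for illustration, and the real validation is handled by the platform's certificate stack rather than anything this simple. It shows the one behavior that bit us: a validation result cached in memory survives a CA move, but a freshly restarted process has an empty cache and must find the CA at the hostname it expects.

```python
from dataclasses import dataclass, field


class CAUnreachable(Exception):
    """Raised when no CA is answering at the hostname the client expects."""


@dataclass
class Validator:
    """Toy stand-in for a process that validates certificates (hypothetical)."""

    # Hostname this process expects the CA to live at (made-up value below).
    expected_ca_host: str
    # In-memory cache of already-validated certificates; lost on restart.
    cache: set = field(default_factory=set)

    def validate(self, cert_name: str, live_ca_hosts: set) -> str:
        # A certificate validated earlier in this process is trusted as-is.
        if cert_name in self.cache:
            return "cached"
        # Otherwise the CA must be reachable at the exact expected hostname.
        if self.expected_ca_host not in live_ca_hosts:
            raise CAUnreachable(self.expected_ca_host)
        self.cache.add(cert_name)
        return "validated"


def simulate() -> list:
    """Replay the outage: validate, migrate the CA, then restart the process."""
    outcomes = []
    live = {"ca-old.example.internal"}  # hypothetical original CA host
    web = Validator("ca-old.example.internal")
    outcomes.append(web.validate("db-cert", live))  # first contact with the CA

    live = {"ca-new.example.internal"}  # CA duties moved to a new hostname
    outcomes.append(web.validate("db-cert", live))  # still fine: cached

    web = Validator("ca-old.example.internal")  # routine restart, cache is empty
    try:
        outcomes.append(web.validate("db-cert", live))
    except CAUnreachable:
        outcomes.append("failed")
    return outcomes


if __name__ == "__main__":
    print(simulate())
```

The long-running process keeps working after the migration, which is why nothing looked wrong for a day and a half. The failure only appears once a process restarts and has to validate from scratch.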
Upon diagnosing the issue, we were able to restore functionality immediately, bringing our downtime to an end. We also validated that communications were fully encrypted.
Going forward, we also have new policies and procedures (plus a new employee) to help prevent future downtime. And bonus: I can reschedule my vacation for a time when the weather on the beach is quite a bit nicer. Good downtime is in my future.