Earlier this week, Amazon Web Services had some internal issues that rippled across the Internet. I’m not talking about your Amazon shopping cart; I’m talking about Amazon’s cloud hosting platform, which a whole array of Internet-enabled businesses like Snap, Nest and Slack use to serve their customers. Ironically, one of the companies affected was Down Detector, which is where companies go to find out if there are site outages, unless of course Down Detector is, um, down.
AWS provided a detailed explanation of why the service outage occurred and why it took so long to fix. In a nutshell, their S3 billing system was running slowly and they wanted to take a few servers offline to improve throughput. Due to a typo, more servers were taken offline than intended, including those supporting the index and placement subsystems. To add fuel to the fire, being taken offline required these systems to do a complete restart, and according to Amazon, “we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.” It’s not just a matter of flipping a switch: a massive amount of data had to be validated before the systems could come back online. Meanwhile, transaction requests were piling up waiting to be processed. So, what can print companies learn?
I’ve recently been working with several print organizations that are updating their infrastructures to support high-speed inkjet. Software is a key factor in that process, and disaster recovery (DR) and business continuity are often sticking points. Companies don’t want to pay $$$ for multiple production instances, but they do need fast cut-over. Some proposals for “on-demand” software availability have been woefully inadequate. They assume that software can be downloaded and configured during an outage. Others allow the software to be pre-installed, but if it is never actually used in a DR test, there can be no confidence that communications, applications and software versions will be in sync when needed. And volume will be backing up while you figure it out.
In many other cases, I have seen reasonable attention paid to DR considerations at the time of purchase, followed by months or years of neglect. A successful, growing business like Amazon is the most susceptible to this type of neglect. It is easy to miss that you have outgrown your backup systems. Production inkjet is the source of the greatest growth in the print industry and attracts customers who demand quick turnaround. It’s critical that businesses size, scale and test their fail-over systems and processes to meet this demand.
Meanwhile, thank you to @DownDetect for providing some humor during a stressful interlude.