After years of building IT infrastructure meant to deliver highly available applications, adding redundancy, chasing the impossible 100% uptime, I came to a very sad realization. Downtime is impossible to avoid, despite the best efforts of a lot of professional people like me. The reason is "too many variables", and often the downtime is caused by the very tools that are supposed to prevent it.
Linux-HA, Virtual IPs, Oracle RAC, DB2 HADR, OCFS2, etc.; you name it. These technologies are fragile: the simplest hiccup in the environment causes adverse effects. Now, if you have been in the business for a long time, like me, building serious IT infrastructures, tell me that I am wrong and I don't know what I am talking about.
So, now you ask, what's the point? Well, my point is: instead of trying to avoid downtime, embrace it as part of the infrastructure. At the design level, plan for unexpected downtime, and create an infrastructure that minimizes its impact and reacts fast when it happens. Virtualization technologies can help a lot here.
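The "react fast" idea above can be sketched as a watchdog loop: rather than trying to prevent failure, detect it quickly and recover automatically. This is a minimal illustrative sketch, not a production design; the `FlakyService` and `Watchdog` classes are hypothetical stand-ins for a real service and a real monitoring agent.

```python
class FlakyService:
    """Hypothetical stand-in for a real service that can go down unexpectedly."""
    def __init__(self):
        self.healthy = True
        self.restarts = 0

    def health_check(self):
        # A real check might be an HTTP ping or a process probe.
        return self.healthy

    def restart(self):
        self.restarts += 1
        self.healthy = True


class Watchdog:
    """Polls the service and restarts it on failure: react, don't just alert."""
    def __init__(self, service):
        self.service = service
        self.recoveries = 0

    def poll_once(self):
        if not self.service.health_check():
            self.service.restart()
            self.recoveries += 1


svc = FlakyService()
dog = Watchdog(svc)
svc.healthy = False   # simulate an unexpected outage
dog.poll_once()       # watchdog detects it and restarts the service
print(svc.healthy, dog.recoveries)  # → True 1
```

The design choice mirrors the argument: the watchdog assumes the service *will* fail, so the recovery path is exercised constantly instead of being an exceptional, rarely-tested branch.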
One more point: the more complicated the environment, the longer the downtime. Troubleshooting time grows exponentially with complexity. This becomes a vicious cycle: you spend more money on tools to prevent downtime, and those tools integrate into, and further complicate, these already complex grids.