Bulletproofing the cloud

Here’s the thing about running your own cloud infrastructure: once you make the decision to rely on it, then it had better work.  The whole thing.  Every part of it.  Under heavy load.  All the time.

Obvious, right?  But it bears repeating.  When you decide to make the move to doing things The Cloud Way, you are placing a gigantic bet on your infrastructure layer — and that bet is placed not only on the Cloud As A Whole, but on every individual component that comprises that cloud.  In the open source world, these are frequently components that you didn’t write and do not control.  I can assure you that customers don’t care in the least.

At Eucalyptus, we have smart and demanding customers, with extremely high expectations.  They are not content with assurances that things will be production-ready at some magical release point in the future. They don’t care whether the bugs are in the cloud controller code, or the node controller code, or in libvirt, or in the kernel.  They are using Eucalyptus at extreme scale, right now, to solve extreme business problems, right now.  Which means that when their cloud breaks, they expect fixes right now — and if that means libvirt patches or kernel patches, that’s what it means.  That’s why they give us all that nice money.  That’s why customers pay us for free software.

Our customers try to squeeze every ounce of performance out of their machines; that’s part of the point of having a cloud, after all. And when the virtualization technologies we depend upon experience heavy load over a long period of time, we see some crazy things.  Like segfaults in libvirtd, for instance.  Or libvirt handlers that suddenly and inexplicably lose their mind.  Or other weird occurrences that might lead one to believe that libvirt isn’t quite as thread safe as advertised.  These failures may only occur at times of very high load, and they may not happen often — but they do happen.  And when they happen, we have to handle them.  The 3.1.2 release is the result of many hours of hard work by our engineers to find and fix these issues.

It’s a challenge and a privilege to serve customers like this.  At times it can put incredible stress on the entire organization — and it’s at precisely these times when we are at our very best.  Watching great engineers solve critical problems under pressure is a lot like watching great athletes at the end of a big game — and when they win, it’s just as exhilarating.  These engineers are at the heart of what we do. Compared to them, I’m just selling tickets and fetching Gatorade.

It’s not that hard to put together a bunch of components and call it a cloud.  But making a cloud bulletproof?  That’s hard.  And that, friends, is where we are the best in the world.

Bulletproofing the cloud