Greg DeKoenigsberg Speaks

Bulletproofing the cloud

Posted in Uncategorized by Greg DeKoenigsberg on October 8, 2012

Here’s the thing about running your own cloud infrastructure: once you make the decision to rely on it, then it had better work.  The whole thing.  Every part of it.  Under heavy load.  All the time.

Obvious, right?  But it bears repeating.  When you decide to make the move to doing things The Cloud Way, you are placing a gigantic bet on your infrastructure layer, and that bet is placed not only on the Cloud As A Whole, but on every individual component that makes up that cloud.  In the open source world, these are frequently components that you didn’t write and do not control.  I can assure you that customers don’t care in the least.

At Eucalyptus, we have smart and demanding customers, with extremely high expectations.  They are not content with assurances that things will be production-ready at some magical release point in the future. They don’t care whether the bugs are in the cloud controller code, or the node controller code, or in libvirt, or in the kernel.  They are using Eucalyptus at extreme scale, right now, to solve extreme business problems, right now.  Which means that when their cloud breaks, they expect fixes right now — and if that means libvirt patches or kernel patches, that’s what it means.  That’s why they give us all that nice money.  That’s why customers pay us for free software.

Our customers try to squeeze every ounce of performance out of their machines; that’s part of the point of having a cloud, after all.  And when the virtualization technologies we depend upon experience heavy load over a long period of time, we see some crazy things.  Like segfaults in libvirtd, for instance.  Or libvirt handlers that suddenly and inexplicably lose their minds.  Or other weird occurrences that might lead one to believe that libvirt isn’t quite as thread-safe as advertised.  These failures may only occur at times of very high load, and they may not happen often, but they do happen.  And when they happen, we have to handle them.  The 3.1.2 release is the result of many hours of hard work by our engineers to find and fix these issues.

It’s a challenge and a privilege to serve customers like this.  At times it can put incredible stress on the entire organization, and it’s at precisely these times that we are at our very best.  Watching great engineers solve critical problems under pressure is a lot like watching great athletes at the end of a big game, and when they win, it’s just as exhilarating.  These engineers are at the heart of what we do.  Compared to them, I’m just selling tickets and fetching Gatorade.

It’s not that hard to put together a bunch of components and call it a cloud.  But making a cloud bulletproof?  That’s hard.  And that, friends, is where we are the best in the world.
