Bigcommerce engineers have been very proactive in working with our storage provider, IBM Softlayer, to find solutions. Unfortunately, it takes two parties to come to a solution. In this case, IBM Softlayer intentionally let their Object Storage cluster fall into disrepair and chose not to scale it. This has impacted Bigcommerce, IBM and many other Softlayer customers.
Our engineers placed too much trust in IBM Softlayer and that's on us. However, the catastrophic failures to see metrics and rapidly scale capacity, the decisions to let hard drives sit at 90% utilization for weeks and months, and the cascading failures of an undersized cluster of 52 nodes for the busiest data center in their business speak to IBM Softlayer’s lack of concern for their customers. We found this out 3 days ago.
We should have pressed harder and held their feet to the fire; for that, we are sorry. I'm the head of Technical Operations and this pains me, because of the high uptime and reliability our engineering teams have built in the past year. Unfortunately, our trust in IBM Softlayer was misplaced. They have failed at every level of an operations team, they have failed as a business unit, and they have failed to care about how their customers are affected.
We care deeply about our customers and have been trying to work around Softlayer’s bad decisions. We are at the point where we feel we need to say, "This isn't us; this isn't how we think about high availability."
We are already planning how we will move off Softlayer Object Storage and how to better prepare for a single vendor failing, no matter how well regarded it is. Our best information at this time is that no data loss has occurred, and we are working around the clock to ensure that remains the case.
I take this personally. I've crafted highly available solutions at Apple, Digg, Eventbrite and now, Bigcommerce. When the site isn't performing, I get angry. And, as with the Hulk, no one wants to see me angry. I am fighting for all of you and will continue to do so.
—Scott Baker, Head of Operations and Site Reliability
Posted Apr 03, 2015 - 21:51 CDT