Sites Slow due to storage system issues

Incident Report for BigCommerce

Resolved

Response times have been normal for one week. We are resolving this ticket. Projects are already in progress to add multiple layers of redundancy to this part of our platform. Stay tuned.

Posted Apr 14, 2015 - 12:39 CDT

Update

Good news! Response times from Softlayer's Object Storage cluster have improved to pre-outage levels. So, page response times are back to normal. There are still some files that aren't available yet and we are working with Softlayer to track them down. We are hoping to close this incident this weekend. Thanks for all your patience and shout out to http://surgicalcaps.com for being awesome!

Posted Apr 10, 2015 - 13:03 CDT

Update

The situation has been stable over the last 24 hours. Bigcommerce engineers are still watching this very closely and will continue to post updates as the situation changes.

Posted Apr 08, 2015 - 20:42 CDT

Update

SoftLayer is continuing the replication process on the object storage cluster. We've observed no negative impact but will continue to monitor system performance.

Posted Apr 08, 2015 - 00:59 CDT

Update

While object storage performance is better and errors have declined in frequency, there are still objects that are unavailable or show a size of 0 bytes. SoftLayer tells us that this is because most of these files are only partial copies from when their disks filled up.

As soon as replication catches up and the complete copy is available, those files should be back to normal.

Posted Apr 07, 2015 - 13:37 CDT

Monitoring

Error rates are again returning to normal and we will continue to monitor the situation.

Posted Apr 06, 2015 - 23:16 CDT

Update

As part of our ongoing monitoring of this issue, we have detected another increase in error rates from our storage service provider. They have been alerted to this and are working to remedy the problem ASAP.

Posted Apr 06, 2015 - 23:00 CDT

Update

We're seeing error rates and page load times return to within normal bounds.

We're continuing to closely monitor the storage environment and all Bigcommerce stores.

Posted Apr 06, 2015 - 20:48 CDT

Update

Bigcommerce engineers are currently investigating a higher than normal error rate when talking to our storage service, and an increase in page load times across Bigcommerce stores. Whilst we are seeing an increase in error rates and page load times across the platform, only a small fraction of store traffic is affected at this stage.

We've engaged our service provider for assistance and will have another update momentarily.

Posted Apr 06, 2015 - 20:37 CDT

Update

Templates/design are now enabled in the Bigcommerce Control Panel for all stores. This is today's last update unless we have an issue or we start to see more of the few missing objects return. We appreciate all the feedback we've been getting. It will only make our service better.

Posted Apr 06, 2015 - 18:37 CDT

Update

All stores have WebDAV enabled. Performance is steady. We are now turning on templates/design for the Control Panel. There may still be objects that are not available. We expect that a lot of the still missing objects will show up over the next few days.

Posted Apr 06, 2015 - 18:08 CDT

Update

Currently, 50% of stores have WebDAV enabled with no issues. Response times and errors are holding steady. It's going a little slower than anticipated, so we are still looking at two hours until completion.

Posted Apr 06, 2015 - 16:07 CDT

Update

We're about 25% through enabling WebDAV clients with no issues. We anticipate completion in the next two hours. At that time, if all is still performing well, we'll enable template/design access for all stores in the control panel. Thanks again for your patience as we cautiously work through this.
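The report does not say how stores were selected at each stage of the re-enablement. As a minimal sketch only, one common way to stage a rollout like this is to hash each store into a stable bucket and compare it with the current percentage. The function names and store identifiers below are hypothetical and are not taken from the Bigcommerce platform.

```python
import hashlib


def rollout_bucket(store_id: str) -> int:
    """Map a store ID to a stable bucket in the range 0..99 (hypothetical scheme)."""
    digest = hashlib.sha256(store_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def webdav_enabled(store_id: str, rollout_percent: int) -> bool:
    """Return True if this store falls inside the current rollout percentage."""
    return rollout_bucket(store_id) < rollout_percent


if __name__ == "__main__":
    # Assumed store identifiers, for illustration only.
    for store in ("store-1001", "store-1002", "store-1003"):
        print(store, webdav_enabled(store, 25))
```

Because the bucket depends only on the store ID, a store enabled at 25% stays enabled at 50% and 100%, so raising the percentage only ever adds stores.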

Posted Apr 06, 2015 - 14:52 CDT

Update

We'll be slowly re-enabling WebDAV access over the course of the day. We believe the storage cluster, while still replicating data, is performing well enough to handle the extra load. Errors are not increasing and response times have mostly improved. In the next few hours, if all continues to go well, WebDAV will be re-enabled. We regret the inconvenience this has caused.

Posted Apr 06, 2015 - 12:32 CDT

Update

The storage cluster is continuing to make progress with the replication cycle. SoftLayer have managed to increase replication throughput while maintaining stable error rates and minimal impact to response times. Our engineering team is continuing to monitor the situation and has seen a downward trend in the error rates associated with missing images and stylesheets. WebDAV access remains disabled to support the recovery efforts.

Posted Apr 06, 2015 - 04:16 CDT

Update

Bigcommerce engineers are continuing to work and communicate with engineers on the SoftLayer Object Storage team to ensure the storage platform returns to normal operations ASAP.

SoftLayer are currently working on tuning the recovery/replication process that storage nodes are currently undergoing in order to speed up recovery efforts. This replication process is responsible for shuffling around data on their storage nodes into the correct/new locations and will further reduce the number of intermittent 404s some Bigcommerce stores may be experiencing for stylesheets and images. We are closely monitoring the performance of the storage environment while these changes are rolling out.

At this point in time, most Bigcommerce stores are up and operational and we are continuing to work through a number of isolated edge cases on some stores. Page load times across Bigcommerce stores are still slightly higher than normal - we are closely monitoring them and will provide another update should we believe it's impacting the ability to access stores.

As per our previous update, we have decided to leave the WebDAV functionality used to update/modify assets on Bigcommerce stores disabled. This is to minimize the amount of data changing on the storage cluster while the recovery process is underway and data is moved around.

Posted Apr 05, 2015 - 22:21 CDT

Update

We found a tuning parameter that has decreased 404 errors further. Performance is also improved. At this time, there are still some stores affected by the storage cluster outage, but the number is steadily decreasing as SoftLayer completes their repairs. We are keeping this issue open as some stores are still feeling the impact, but most stores should be available and performant. We anticipate being able to turn WebDAV on again tomorrow morning Pacific time.

Posted Apr 05, 2015 - 19:41 CDT

Update

We are still seeing some errors retrieving files, but the rate has remained low and steady. Response times have been much improved over the past few hours. SoftLayer is going to be accelerating object replication soon, which may cause some increased response times and errors. We will monitor how much that affects stores and, if the impact is too great, ask them to roll it back. Also, SoftLayer is adding more capacity and will be moving requests for objects to less utilized storage nodes to make the cluster more resilient and balanced.

Posted Apr 05, 2015 - 15:33 CDT

Update

Response times remain steady. SoftLayer have confirmed error rates are due to cluster replication. The majority of stores are still operational. The remaining stores are gradually recovering as cluster replication completes and missing assets become accessible. WebDAV remains disabled to assist with recovery.

Posted Apr 05, 2015 - 11:27 CDT

Update

Response times have returned to normal. While the majority of stores are now operational, there may be some stores which aren't loading correctly due to missing content, images, styles or other assets. We are continuing to work with SoftLayer to investigate the remaining low rate of errors that results in the missing assets. WebDAV access remains disabled to allow for a faster restoration.

Posted Apr 05, 2015 - 07:37 CDT

Update

Nothing new to report except continued low error rate and acceptable page load speed. The storage cluster appears to be stabilizing. This will be the last update for tonight, unless there is a change for the better or the worse. WebDAV access is still disabled as we want to give the storage cluster tonight to get ahead on object replication before we put more stress on it. Goodnight and have a pleasant tomorrow!

Posted Apr 05, 2015 - 01:21 CDT

Update

For the past 40 minutes, errors have been steadily decreasing. Many stores that were unavailable before are now available and near normal page load speeds. Response times, while higher than we'd like right now, are moving towards acceptable. Traffic levels to Bigcommerce stores are still at the same levels as when we had the increased error rate so we are beginning to believe that IBM Softlayer's configuration changes are helping the cluster recover. We will be continuing to monitor, but are encouraged by the progress over the past hour. More updates as we have news to share. Thanks again for your patience.

Posted Apr 05, 2015 - 00:02 CDT

Update

Softlayer has rolled out all their configuration tunings. We have seen a slight improvement in error rates and page load speeds. The storage cluster is still replicating data and under heavy load repairing itself. We will continue with updates as we receive new information.

Posted Apr 04, 2015 - 21:15 CDT

Update

Bigcommerce Systems Engineers are still awaiting an update from Softlayer regarding the latest on the Object Storage cluster. Last we heard Softlayer Engineers were still working with SwiftStack consultants to bring the cluster 100% online, and were making some headway. As soon as we have more information we will update here.

Posted Apr 04, 2015 - 20:08 CDT

Update

There have been no updates from Softlayer. Stores continue to experience slow response times. Error rates remain higher than normal.

Posted Apr 04, 2015 - 16:50 CDT

Update

SoftLayer is testing new performance tuning configurations on the cluster. They should be rolled out in an hour. Increased load due to replication is causing increased 404 errors on stores. Page load times are also currently sub-par. Bigcommerce engineers are still investigating other things we can do to improve performance, but we have found little that will help until SoftLayer is able to make enhancements to the Object Storage cluster. We will update again in about an hour or as we get new information.

Posted Apr 04, 2015 - 15:42 CDT

Update

The Object Storage cluster continues to be under heavy load due to replication of objects. SoftLayer has received some advice from SwiftStack on how to tune the cluster to recover faster and is working on implementing those changes. As a result, response times and errors will increase and there may be availability issues with some stores.

Posted Apr 04, 2015 - 13:52 CDT

Update

Bigcommerce engineers are applying some new request timeout thresholds to allow Object Storage to take a little longer to fetch an object and hopefully fail less. Object replication is still causing intermittent performance issues and errors retrieving objects but is remaining relatively stable.
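The trade-off behind a longer timeout can be illustrated with a small calculation. This is a sketch under stated assumptions: the latency samples and both thresholds are invented for illustration and are not measurements from this incident.

```python
from typing import Sequence


def failure_rate(latencies_s: Sequence[float], timeout_s: float) -> float:
    """Fraction of requests that would time out at the given threshold."""
    timed_out = sum(1 for latency in latencies_s if latency > timeout_s)
    return timed_out / len(latencies_s)


if __name__ == "__main__":
    # Hypothetical object fetch latencies, in seconds, from a degraded cluster.
    samples = [0.2, 0.8, 1.5, 2.5, 4.0]
    print(failure_rate(samples, timeout_s=1.0))  # 0.6
    print(failure_rate(samples, timeout_s=3.0))  # 0.2
```

Raising the threshold turns some slow requests into slow successes rather than errors, at the cost of holding connections open for longer.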

Posted Apr 04, 2015 - 13:03 CDT

Update

SoftLayer are currently observing more load from SoftLayer customer traffic on the object storage platform in Dallas. This is leading to additional I/O and replication delays, which are impacting their ability to stabilise the storage cluster.

SoftLayer are currently implementing rate limiting for write operations across the storage cluster in an attempt to minimise outside writes while the cluster stabilises. This may result in issues across Bigcommerce stores with product imports, and addition/modification of product images - if you experience difficulty saving or updating product images, please try again.
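For anyone scripting imports or image uploads during this period, the "try again" advice can be automated with exponential backoff. The sketch below is illustrative only: `RateLimitedError` and the `operation` callable are hypothetical stand-ins for whatever client and error a given integration uses, not part of any Bigcommerce or SoftLayer API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Hypothetical error raised when a write is rejected by rate limiting."""


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, waiting 0.5s, 1s, 2s, ... between rate-limited attempts."""
    for attempt in range(attempts):
        try:
            return operation()
        except RateLimitedError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            sleep(base_delay_s * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

Doubling the delay between attempts keeps retries from adding to the write load that the rate limiting is meant to reduce.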

At this time, Bigcommerce engineers are still observing intermittent failures with objects such as template files, product images, and other uploaded content, as well as slow page loads across a number of Bigcommerce stores that do not yet have this content cached. We expect this behaviour to continue for a while.

We will provide another update as soon as we have additional information.

Posted Apr 04, 2015 - 12:35 CDT

Update

We are continuing to receive updates from SoftLayer regarding the availability and recovery of their object storage service.

From the latest information we have, all storage nodes have been reintroduced into the object storage cluster and are online. As part of the recovery process, a number of storage nodes are currently under high load while the consistency of all objects is checked and data is shuffled around to the correct locations.

Due to the high load placed upon the storage cluster at the moment, SoftLayer continue to advise us that we may see intermittent spikes in load times and page timeouts.

At this point in time, we're continuing to receive information from a number of merchants that their stores are back up and operational. Some stores may still be inaccessible, or missing certain assets such as product images or stylesheets - from what we've observed so far these issues will continue to correct themselves over the coming hours.

Due to the ongoing operations to stabilise the object storage environment, we are going to continue to leave the WebDAV and template editing functionality across Bigcommerce stores disabled. We'll continue to review this decision as the object storage recovers, and as soon as we and SoftLayer are confident the functionality can be enabled again, we will enable it.

We will continue to monitor the recovery process and provide updates as soon as we have additional information.

Posted Apr 04, 2015 - 10:19 CDT

Update

SoftLayer have just provided us with a detailed update regarding the recovery efforts of their object storage service, which has been impacting the ability to access Bigcommerce stores.

SoftLayer are reporting that since the cluster upgrade which was completed overnight, they are seeing better handling of storage devices, and correct behaviour for full storage nodes and failed devices.

Recovery efforts are still underway to bring the rest of the storage nodes back online. There are approximately 8 storage nodes remaining to come online out of the 52 storage nodes assigned to the cluster. As this process completes, we expect more and more objects to become available to Bigcommerce stores, and these stores to return to normal operations.

SoftLayer have advised us that the load on the object storage cluster is very high at the moment due to the replication that the service is undertaking in order to ensure all data is correct and assigned to the right storage nodes. This process is necessary to get the cluster into a healthy state but may introduce intermittent periods of slowness for Bigcommerce stores that have data which is yet to be repopulated into our cache.

We're continuing to monitor the SoftLayer recovery effort and the impact to Bigcommerce stores and will provide updates as soon as we have them.

Posted Apr 04, 2015 - 07:53 CDT

Update

Some really good news: We are seeing more objects become available, and response times are closer to normal. Several stores have reported that their sites are fully functional again. So, while we are still waiting for 100% of objects to be available, we're seeing progress from SoftLayer's efforts tonight to revive this storage cluster. We will update as we can confirm more data.

Posted Apr 04, 2015 - 06:55 CDT

Update

Softlayer has deployed their software upgrade and we are slowly seeing missing objects become available. This caused a brief period of slower than normal response times while we adjusted timings to allow the data to flow into our cache servers. WebDAV is still disabled for reads and writes until we are confident the Object Storage cluster has stabilized. We'll continue to update as we get more data.

Posted Apr 04, 2015 - 05:22 CDT

Update

Softlayer is responding to the crisis and is about to do something pretty drastic to hasten getting objects back online. In order to aid them, we are disabling WebDAV functionality for a few hours while upgrades and new code are rolled out in the Softlayer Object Storage Cluster. We are working in concert with Softlayer to restore full accessibility and acceptable performance. There may be worse performance before it gets better. Please stand by.

Posted Apr 04, 2015 - 00:22 CDT

Update

Bigcommerce engineers have been very pro-active in working with our storage provider, IBM Softlayer, in finding solutions. Unfortunately, it takes two parties to come to a solution. In this case, IBM Softlayer intentionally let their Object Storage cluster fall into disrepair and chose not to scale it. This has impacted Bigcommerce, IBM and many other Softlayer customers.

Our engineers placed too much trust in IBM Softlayer and that's on us. However, the catastrophic failures to see metrics and rapidly scale capacity, the decisions to let hard drives sit at 90% utilization for weeks and months, the cascading failures of an undersized cluster of 52 nodes for the busiest data center in their business speak to IBM Softlayer’s lack of concern for their customers. We found this out 3 days ago.

We should have pressed more and held them to the fire; for that, we are sorry. I'm the head of Technical Operations and this pains me, because of the high uptime and reliability our engineering teams have built in the past year. Unfortunately, our trust in IBM Softlayer was misplaced. They have failed at every level of an operations team, they have failed as a business unit, they have failed in caring about how their customers are affected.

We care deeply about our customers and have been trying to work around Softlayer’s bad decisions. We are at the point where we feel we need to say, "This isn't us; this isn't how we think about high availability."

We are already planning and working toward how we will move off Softlayer Object Storage and better plan for a single vendor failing, no matter how well regarded it is. Our best information at this time is that no data loss has occurred, and we are working around the clock to ensure that remains the case.

I take this personally. I've crafted highly available solutions at Apple, Digg, Eventbrite and now, Bigcommerce. When the site isn't performing, I get angry. And, like the Hulk, no one wants to see me angry. I am fighting for all of you and will continue to do so.

—Scott Baker, Head of Operations and Site Reliability

Posted Apr 03, 2015 - 21:51 CDT

Update

Bigcommerce Systems Engineers have been alerted that the situation with storefronts has recently degraded. We are continuing to work with our storage provider to come to a solution. We will update as we have information.

Posted Apr 03, 2015 - 20:26 CDT

Update

Performance issues with our storage provider continue but some code changes have provided some improvement in response times and decreased error rates. We have had reports that performance issues have been severe enough on some sites to appear as though the store is unavailable. We apologize for that and are still trying to find ways to lessen the impact from our storage provider's issues.

Posted Apr 03, 2015 - 15:52 CDT

Update

Bigcommerce engineers have implemented a few changes to help with performance while our storage provider continues to work on bringing their performance back to acceptable levels. Response times have improved slightly with few errors. Our engineers are continuing to test and implement workarounds in response to our storage provider's issues.

Posted Apr 03, 2015 - 12:22 CDT

Update

After a quiet 8 hours, our storage provider is suddenly experiencing higher load and an error rate 3 times the worst we had previously measured. We are waiting for updates on when this will stabilize. Our apologies again for this situation.

Bigcommerce engineers started testing a solution to move all Bigcommerce data to a new storage cluster. We'll update as we finish testing and start to move to the new solution.

Posted Apr 03, 2015 - 08:37 CDT

Update

Bigcommerce engineers are still working on new solutions to the issues with our storage provider. We will update status as we receive test results or updates from the provider.

Posted Apr 02, 2015 - 21:49 CDT

Update

Bigcommerce engineers have eliminated some errors but are still testing further solutions to fix all the errors. Thanks again for your patience.

Posted Apr 02, 2015 - 17:54 CDT

Update

The storage provider is still experiencing heavy load. Bigcommerce engineers are investigating solutions to work around the storage provider's issues. We are hoping to update on the results of testing these solutions soon.

Posted Apr 02, 2015 - 16:47 CDT

Update

Our storage provider has made us aware of several hard drive failures in their storage clusters where we keep storefront resources such as images and other template files. This is causing intermittent availability of these resources. At this time, we have no reason to believe there has been data loss. The storage provider has all hands on deck working to restore the service to full performance. We will provide further updates as they are available.

Posted Apr 02, 2015 - 15:02 CDT

Update

Work continues with our storage vendor to find a resolution. Updates will be made as more information becomes available.

Posted Apr 02, 2015 - 12:22 CDT

Update

We are continuing to work with our storage provider on a resolution to this issue.

Posted Apr 02, 2015 - 09:49 CDT

Identified

Bigcommerce engineers have identified a problem with the storage service for WebDAV. Due to issues with the storage provider's system, some files are being cached as existing but having no data, resulting in intermittent availability of these files. We are working with the provider to correct this issue. Please do not edit affected files in WebDAV until we are able to restore their availability, as changing the files may cause them to become unrecoverable. We will be updating as we get more information from our storage provider.
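One possible guard against this failure mode is a cache that refuses to store missing or empty responses. The sketch below is an illustration only and does not describe the actual Bigcommerce caching layer: `ValidatingCache` and the injected `fetch` callable are hypothetical names.

```python
from typing import Callable, Dict, Optional


class ValidatingCache:
    """Tiny in-memory cache that refuses to store missing or empty objects."""

    def __init__(self, fetch: Callable[[str], Optional[bytes]]) -> None:
        self._fetch = fetch  # Hypothetical call into the storage backend.
        self._entries: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        cached = self._entries.get(key)
        if cached is not None:
            return cached
        body = self._fetch(key)
        # Treat None (not found) and b"" (zero-byte copy) the same way:
        # report a miss and cache nothing, so a later request retries.
        if not body:
            return None
        self._entries[key] = body
        return body
```

The trade-off is that a file which is legitimately empty is never cached, so a real system would more likely compare the body against an expected size or checksum.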

Posted Apr 02, 2015 - 09:07 CDT

This incident affected: Control Panel and Storefront.