S3 Cluster Outage
Incident Report for CloudAfrica
Resolved
All S3 services are fully restored, and all PGs and RocksDB OSD databases have been repaired following the corruption caused by the upgrade a few weeks ago.

Thank you for your patience.

Our S3 cluster is now fully operational.
Posted Dec 11, 2023 - 11:40 SAST
Update
We are continuing to monitor for any further issues.
Posted Dec 05, 2023 - 13:05 SAST
Monitoring
We've made good progress with restoring our S3 services.

The current status is as follows:

1. All PGs and RocksDB OSD databases across the cluster have been repaired, so all object data across the cluster is now properly referenced following the RocksDB and PG corruption caused by the update that originally precipitated the outage;

2. The cluster is still rebalancing and backfilling (i.e. re-creating redundancy across the various replicated and EC pools), so customers will still experience 50x errors on roughly 50% of write attempts. These errors are subsiding as the backfilling progresses, and we are taking active steps to increase the backfill rate while ensuring the cluster is not overloaded. This impacts S3 buckets created prior to the outage; buckets created on or after Wednesday 29th November 2023 should not be impacted and should be fully readable and writable with minimal or no 50x errors.
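For customers writing to pre-outage buckets during the backfill window, the intermittent 50x errors can be worked around client-side with retries and exponential backoff. The sketch below is illustrative only; `put_object` stands in for whatever upload call your S3 client makes, and the parameter values are assumptions, not CloudAfrica recommendations.

```python
import random
import time


def put_with_retry(put_object, max_attempts=6, base_delay=0.5):
    """Retry a flaky S3 write with exponential backoff and jitter.

    `put_object` is any callable that raises on a 50x response
    (e.g. a lambda wrapping your SDK's upload call). The names and
    defaults here are hypothetical, chosen for illustration.
    """
    for attempt in range(max_attempts):
        try:
            return put_object()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Back off base_delay * 2^attempt, plus jitter to avoid
            # synchronised retries from many clients.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

With transient failure rates around 50%, a handful of attempts makes the overall success probability very high, at the cost of some added latency.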

As of now, all steps have been taken successfully to stabilise the cluster again; we are still investigating, with our engineering team in Germany, the cause of the bug that resulted in this outage.

It's difficult to predict when services will be fully restored, but we anticipate that between 50 and 100 TB of data (a small percentage of the total cluster size) still needs to move across the cluster, which may take a few more days.
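As a rough sanity check on the "few more days" estimate, the remaining data movement can be converted into a duration given an assumed sustained recovery rate. The rate below is purely illustrative; actual Ceph backfill throughput depends on settings such as osd_max_backfills and on client load.

```python
def backfill_eta_days(remaining_tb, rate_mb_per_s):
    """Rough ETA in days for moving `remaining_tb` terabytes at an
    assumed aggregate recovery rate of `rate_mb_per_s` MB/s.
    Both inputs are estimates, not measured cluster figures."""
    seconds = remaining_tb * 1_000_000 / rate_mb_per_s  # 1 TB = 1e6 MB
    return seconds / 86_400  # seconds per day
```

For example, 75 TB at a sustained 500 MB/s works out to a little under two days; at 100 MB/s the same movement takes over a week, which is consistent with a multi-day recovery window.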

We will keep you updated in this regard.

We apologise again for the extended outage, but are happy to report that good progress is being made towards full recovery.
Posted Nov 30, 2023 - 09:01 SAST
Update
We are continuing to work on a fix for this issue.
Posted Nov 27, 2023 - 15:58 SAST
Update
We are still working on the recovery of the final placement groups, towards bringing the cluster back online.
Posted Nov 24, 2023 - 15:39 SAST
Update
We are continuing to work on a fix for this issue.
Posted Nov 23, 2023 - 13:47 SAST
Update
Recovery is still progressing slowly, with just a few of our Ceph placement groups outstanding.

We will continue to keep you updated as this progresses.
Posted Nov 22, 2023 - 12:53 SAST
Update
Recovery is progressing slowly and our Ceph platform indexes are about 94% restored.

We will continue to keep you updated as this progresses.

Once again, we apologise for the ongoing outage.
Posted Nov 21, 2023 - 14:31 SAST
Update
Recovery is progressing slowly and our Ceph platform indexes are about 92% restored.

We will continue to keep you updated as this progresses.

For further information on the bug we encountered that impacted our cluster during the upgrade, please see:

https://tracker.ceph.com/issues/63558

Once again, we apologise for the ongoing outage.
Posted Nov 20, 2023 - 13:32 SAST
Update
We are continuing to work on a fix for this issue.
Posted Nov 20, 2023 - 13:28 SAST
Update
Work is progressing well with the restoration of services, albeit more slowly than we anticipated.

In the interim, customers utilising this platform from both within and outside of the CloudAfrica environment will have limited/very intermittent access to the storage platform while the issue is being resolved.

We apologise for the downtime, and are working expeditiously with our engineering partners to resolve the issue soonest.
Posted Nov 18, 2023 - 16:16 SAST
Identified
Our S3 storage cluster (BigStorage) remains impacted by a bug in Ceph.

The issue began just after 16h00 (SAST; UTC+2) on Wednesday 15th November 2023 following a routine version upgrade of the nodes in this S3 cluster.

We were unable to stabilise the cluster in line with our previous estimate of Thursday, 16th November 2023, and we, together with our engineering partners, continue to work on this issue as our top priority.

We have been able to bring back over 90% of all cluster services, thus far, and work continues on completing the remainder.

In the interim, customers utilising this platform from both within and outside of the CloudAfrica environment will have limited/very intermittent access to the storage platform while the issue is being resolved.

We apologise for the downtime, and are working expeditiously with our engineering partners to resolve the issue soonest.
Posted Nov 17, 2023 - 14:12 SAST
Investigating
We are currently investigating an issue resulting in an outage of our S3 storage cluster in Teraco JB1 Isando, Johannesburg.

The issue began just after 16h00 (SAST; UTC+2) on Wednesday 15th November 2023 following a routine version upgrade of the nodes in this S3 cluster.

Our engineering partners in Germany believe they have identified the cause of the issue, which appears to be a previously unidentified bug in this Ceph release. They are currently working to fully test the fix/patch prior to rollout, and we anticipate that it will be ready early tomorrow morning, i.e. Thursday 16th November 2023.

In the interim, customers utilising this platform from both within and outside of the CloudAfrica environment will have limited/very intermittent access to the storage platform while the issue is being resolved.

We apologise for the downtime, and are working expeditiously with our engineering partners to resolve the issue soonest.
Posted Nov 15, 2023 - 23:22 SAST
This incident affected: Storage Services.