Host Crash

Incident Report for CloudAfrica

Postmortem

Host Cluster Network Outage

Summary of impact:

Between 22:46 SAST (UTC+2) on 13 April 2023 and 05:03 SAST (UTC+2) on 14 April 2023, a number of Virtual Machines (VMs) were impacted by instability of our host cluster network across all DCs (including Teraco Bredell (JB2), Teraco Isando (JB1) and Teraco Cape Town (CT1)).

At the peak of the cluster network instability, fewer than 5% of customer VMs appear to have been impacted, along with a number of core CloudAfrica platform services. The platform impact included:

Limited/no access to our customer self-service portal
Limited/no access to VM consoles
Limited/no access to the main CloudAfrica website (at www.cloudafrica.net)
Limited/no access to the BigStorage website (at www.bigstorage.io)

The affected customer VMs (<5%) intermittently lost connectivity to our core backbone network, which effectively rendered them unreachable for varying periods during this window.

Subsequently, on 17 April 2023 at 04:05 SAST (UTC+2), a single host became unreachable and required a reboot, which was initiated at 04:10 SAST (UTC+2). The affected host was back online at 04:52 SAST (UTC+2).
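For reference, the durations implied by the timestamps above can be worked out with a short Python sketch. The helper name `duration` and the fixed UTC+2 offset object are illustrative assumptions and not part of any CloudAfrica tooling; the timestamps themselves are taken from this report.

```python
from datetime import datetime, timedelta, timezone

# SAST is a fixed UTC+2 offset with no daylight saving time.
SAST = timezone(timedelta(hours=2))


def duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two 'YYYY-MM-DD HH:MM' SAST timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    started = datetime.strptime(start, fmt).replace(tzinfo=SAST)
    ended = datetime.strptime(end, fmt).replace(tzinfo=SAST)
    return ended - started


# Cluster network instability window (13 to 14 April 2023).
cluster_outage = duration("2023-04-13 22:46", "2023-04-14 05:03")

# Single-host event on 17 April 2023: unreachable until back online,
# and reboot initiated until back online.
host_unreachable = duration("2023-04-17 04:05", "2023-04-17 04:52")
host_reboot = duration("2023-04-17 04:10", "2023-04-17 04:52")

print(cluster_outage)    # 6:17:00
print(host_unreachable)  # 0:47:00
print(host_reboot)       # 0:42:00
```

On these figures, the cluster network instability window lasted 6 hours 17 minutes, and the single host was unreachable for 47 minutes, of which 42 minutes elapsed after the reboot was initiated.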

Root Cause:

CloudAfrica operates a 10Gbps network across all DC locations, which carries both VM and cluster network traffic. Network utilisation currently almost never exceeds 50% of the 10Gbps network capacity, but it is growing. High traffic volumes, and the increased network switch load that comes with them, adversely impacted the cluster network, which is more sensitive to network-related latency and jitter.

As network and switch load has increased, the cluster network has become increasingly unstable, culminating in the two events described above.

Essentially, the cluster network instability caused the primary host network interfaces of the affected hosts to fail, along with the networking of some (not all) VMs on those hosts. Recovering from this requires a restart of the cluster network and, in the case of the host where that was not possible, a host reboot (as described above).

Mitigation:

During the evening of Thursday 20 April 2023, we finalised installation of a separate 10Gbps network backbone that services only our cluster network, so we are now running a 10Gbps host and VM network and a separate 10Gbps cluster network. During the evening of Friday 21 April 2023, final cluster network configuration changes were made and all cluster network traffic was moved off the normal VM network onto this dedicated cluster network. We have been monitoring this carefully and have seen no further cluster network issues since Friday 21 April 2023.

Next Steps:

We apologise for the impact to affected customers. We are continuously taking steps to improve the CloudAfrica Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

Expediting the rollout of our 25/40/100Gbps network core across the Teraco Isando (JB1) and Teraco Bredell (JB2) Data Centres to cater for increased network traffic. This will be completed by the end of June 2023.

Sincerely,

The CloudAfrica Team.

Posted Apr 25, 2023 - 08:57 SAST

Resolved

This incident has been resolved.

Posted Apr 25, 2023 - 08:48 SAST

Identified

The affected host is rebooting and will be back online shortly.

Posted Apr 17, 2023 - 04:30 SAST

Investigating

We are currently looking into a host that has become unresponsive and requires a reboot. We are attending to it and will provide further feedback.

Posted Apr 17, 2023 - 04:10 SAST

This incident affected: Web Sites, API, and Cloud Services.