Cluster outage

Incident Report for CloudAfrica

Postmortem

Network and Cluster Services Outage Affecting Platform Availability

Incident Status: Resolved
Duration: January 28, 2026, 08:36 - 11:32 SAST (UTC+2)
Date: Wednesday, January 28th, 2026

Impact

A network outage affecting our core infrastructure resulted in widespread service disruption across our platform:

  • All customer virtual machines were unreachable via SSH/RDP for 7-8 minutes during the initial network outage (08:36 - 08:44 SAST) and were fully reachable again from 08:44 SAST
  • CloudAfrica (app.cloudafrica.net) and BigStorage web applications were unavailable during the network outage
  • VM management functionality was degraded from 08:44 - 11:32 SAST - customers could not stop, start, shut down, or otherwise manage VMs via the web interface
  • Backup operations (creation and restoration) were unavailable via the UI during the cluster services recovery period

The incident affected multiple customers across our infrastructure.

Timeline

Jan 28, 08:36 SAST - A port reset operation on a compute node triggered a bug in the switch firmware, causing a core network switch to reboot unexpectedly. This resulted in a complete network outage across the affected infrastructure segment.

Jan 28, 08:44 SAST - Network connectivity was restored. Customer VMs became reachable via SSH/RDP and web applications (app.cloudafrica.net and BigStorage) became accessible again.

Jan 28, 08:48 SAST - Investigation identified that the sudden network interruption had caused cluster services to fail and not automatically restart. VM management functionality remained degraded despite network restoration and VMs being reachable and fully functional.

Jan 28, 08:56 SAST - Engineering team began systematic restart of cluster services across affected nodes.

Jan 28, 10:09 SAST - Cluster services continued starting up sequentially. VM management via app.cloudafrica.net remained unavailable.

Jan 28, 11:32 SAST - All cluster services successfully restarted and synchronized. Full functionality restored to app.cloudafrica.net, including VM management and backup operations.
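
For reference, the phase durations implied by the timeline above can be checked with a short calculation. This is only an illustrative sketch: the helper names (at, minutes) are hypothetical, and the timestamps are taken directly from the timeline entries.

```python
from datetime import datetime, timedelta, timezone

# SAST is UTC+2, as stated in the incident header.
SAST = timezone(timedelta(hours=2), "SAST")


def at(hhmm: str) -> datetime:
    """Return a timezone-aware timestamp on the incident date (hypothetical helper)."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2026, 1, 28, hour, minute, tzinfo=SAST)


def minutes(delta: timedelta) -> int:
    """Convert a timedelta to whole minutes."""
    return int(delta.total_seconds() // 60)


# Phase boundaries taken from the timeline.
network_outage = at("08:44") - at("08:36")       # VMs and web apps unreachable
degraded_management = at("11:32") - at("08:44")  # VM management and backups degraded
total_incident = at("11:32") - at("08:36")

print(minutes(network_outage))       # 8
print(minutes(degraded_management))  # 168 (2 h 48 min)
print(minutes(total_incident))       # 176 (2 h 56 min)
```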

Root Cause

The outage was triggered by an unexpected reboot of a core network switch, caused by a firmware bug that was activated during a routine port reset operation on a compute node. This switch reboot resulted in:

  1. Primary Network Outage (08:36 - 08:44 SAST) - Complete loss of network connectivity for VMs and management infrastructure
  2. Secondary Cluster Service Failure (08:44 - 11:32 SAST) - The sudden network interruption caused cluster services to lose quorum and fail. The cluster services did not automatically recover when network connectivity was restored, requiring manual intervention to restart services across multiple nodes.

The switch firmware bug specifically relates to improper handling of port reset operations, which should not trigger a full switch reboot under normal circumstances.
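
To illustrate why a brief network interruption can cause a loss of quorum, here is a minimal sketch. It assumes a strict-majority quorum rule and a hypothetical five-node cluster; this report does not specify the cluster software, its quorum rule, or the number of nodes, so the function name and figures are illustrative only.

```python
def has_quorum(reachable_members: int, total_members: int) -> bool:
    """Assumed strict-majority rule: more than half of all members must be reachable."""
    return reachable_members > total_members // 2


TOTAL_NODES = 5  # hypothetical cluster size, for illustration only

# While the switch is rebooting, each node can only reach itself.
print(has_quorum(1, TOTAL_NODES))  # False

# A partial partition that still leaves a majority connected keeps quorum.
print(has_quorum(3, TOTAL_NODES))  # True

# Once the network is back, all members are reachable again.
print(has_quorum(5, TOTAL_NODES))  # True
```

Note that regaining connectivity only makes quorum possible again. As described above, services that had already failed did not restart on their own and required manual intervention.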

Resolution

Immediate Resolution:

  • Network connectivity was restored automatically when the switch completed its reboot sequence at 08:44 SAST
  • Cluster services were manually restarted across affected nodes between 08:48 and 11:32 SAST
  • All platform functionality was verified operational by 11:32 SAST

Emergency Response:

  • Engineering team immediately identified the network outage and monitored switch recovery
  • Once the network was restored, cluster service states were assessed
  • Systematic restart of cluster services was performed to restore quorum and functionality
  • Continuous monitoring was maintained throughout the recovery period

Follow-up Actions

We sincerely apologize to our customers for this service disruption and the impact it had on your operations.

In-Place Monitoring and Response Measures:

  • Enhanced monitoring alerts configured for switch reboot events and cluster service failures
  • Documented cluster service recovery procedures for faster resolution in future incidents
  • Implemented automated health checks for cluster service status following network interruptions
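
As a rough sketch of what a post-interruption health check could look like, the example below scans every node for services that are not running so that they can be flagged for restart. It is not our production tooling: the node names, service names, and the is_running callback are hypothetical placeholders, and a real check would query the actual cluster and service manager.

```python
from typing import Callable, Iterable


def find_failed_services(
    nodes: Iterable[str],
    services: Iterable[str],
    is_running: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    """Return (node, service) pairs for which the status check reports 'not running'."""
    service_list = list(services)  # materialise so it can be reused for every node
    return [
        (node, service)
        for node in nodes
        for service in service_list
        if not is_running(node, service)
    ]


# Hypothetical status snapshot taken after a network interruption.
observed = {
    ("node-a", "cluster-manager"): True,
    ("node-a", "vm-api"): False,
    ("node-b", "cluster-manager"): False,
    ("node-b", "vm-api"): True,
}


def lookup(node: str, service: str) -> bool:
    """Treat any service with no recorded status as not running."""
    return observed.get((node, service), False)


failed = find_failed_services(
    ["node-a", "node-b"], ["cluster-manager", "vm-api"], lookup
)
print(failed)  # [('node-a', 'vm-api'), ('node-b', 'cluster-manager')]
```

Passing the status check in as a callback keeps the scan logic separate from how status is actually obtained, so the same function can be exercised against recorded data before being pointed at live systems.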

Scheduled Maintenance:

  • Preventive maintenance is scheduled for the evening of February 10th, 2026 to update switch firmware across our infrastructure
  • This firmware update will address the port reset bug that triggered the switch reboot
  • Maintenance will be performed during a low-traffic window to minimize any potential impact
  • Customers will receive advance notification of the maintenance window

Ongoing Actions:

  • Conducting comprehensive audit of all network switches to identify units requiring firmware updates
  • Enhancing cluster resilience to better handle transient network interruptions
  • Developing automated monitoring for switch stability and unexpected reboot events

Long-term Improvements:

  • Establishing proactive firmware update schedules for all critical network infrastructure
  • Creating automated runbooks for cluster service recovery procedures
  • Enhancing change management processes to ensure network firmware updates are deployed proactively

We remain committed to maintaining the highest levels of service availability and reliability for our infrastructure platform.

Once again, we apologize to affected customers, and thank you for your continued support.

The CloudAfrica Team.

Posted Feb 03, 2026 - 15:09 SAST

Resolved

All cluster services have been successfully restarted. All functionality is restored to app.cloudafrica.net and users can fully control their resources via the web app. We will continue to monitor; however, we do not anticipate any further disruptions.
Posted Jan 28, 2026 - 11:32 SAST

Update

Cluster services are still starting, but should be up soon.
VM management via app.cloudafrica.net is still unavailable.
Posted Jan 28, 2026 - 10:09 SAST

Update

We are continuing to work on a fix for this issue.
Posted Jan 28, 2026 - 08:56 SAST

Identified

We have identified and resolved the network outage. VMs should be reachable via SSH/RDP and our front-end web apps are accessible.
Some cluster services still need to be restarted, which we are doing now. In the meantime, front-end VM management functionality is degraded: VMs cannot be stopped, started, or shut down, and backups cannot be taken or restored via the UI.
Posted Jan 28, 2026 - 08:48 SAST

Investigating

We are aware of an outage affecting our services and are investigating. The leading suspicion is that our cluster services failed and did not start back up automatically. At the moment, some customer VMs are unreachable, and our web apps for CloudAfrica and BigStorage are also unavailable.
Posted Jan 28, 2026 - 08:41 SAST
This incident affected: Web Sites, Cloud Services, and Network.