Network and Cluster Services Outage Affecting Platform Availability
Incident Status: Resolved
Duration: January 28, 2026, 08:36 - 11:32 SAST (UTC+2)
Date: Wednesday, January 28th, 2026
Impact
A network outage affecting our core infrastructure resulted in widespread service disruption across our platform:
- Customer virtual machines were unreachable via SSH/RDP during the initial network outage (08:36 - 08:44 SAST)
- All virtual machines were unreachable for approximately 8 minutes during this window and were fully reachable again from 08:44 SAST
- CloudAfrica (app.cloudafrica.net) and BigStorage web applications were unavailable during the network outage
- VM management functionality was degraded from 08:44 to 11:32 SAST: customers could not stop, start, shut down, or otherwise manage VMs via the web interface
- Backup operations (creation and restoration) were unavailable via the UI during the cluster services recovery period
The incident affected multiple customers across our infrastructure.
Timeline
Jan 28, 08:36 SAST - A core network switch unexpectedly rebooted following a port reset operation on a compute node, triggering a bug in the switch firmware. This caused a complete network outage across the affected infrastructure segment.
Jan 28, 08:44 SAST - Network connectivity was restored. Customer VMs became reachable via SSH/RDP and web applications (app.cloudafrica.net and BigStorage) became accessible again.
Jan 28, 08:48 SAST - Investigation identified that the sudden network interruption had caused cluster services to fail and not restart automatically. VM management functionality remained degraded even though the network had been restored and VMs were reachable and fully functional.
Jan 28, 08:56 SAST - Engineering team began systematic restart of cluster services across affected nodes.
Jan 28, 10:09 SAST - Cluster services continued starting up sequentially. VM management via app.cloudafrica.net remained unavailable.
Jan 28, 11:32 SAST - All cluster services successfully restarted and synchronized. Full functionality restored to app.cloudafrica.net, including VM management and backup operations.
Root Cause
The outage was triggered by an unexpected reboot of a core network switch, caused by a firmware bug that was activated during a routine port reset operation on a compute node. This switch reboot resulted in:
- Primary Network Outage (08:36 - 08:44 SAST) - Complete loss of network connectivity for VMs and management infrastructure
- Secondary Cluster Service Failure (08:44 - 11:32 SAST) - The sudden network interruption caused cluster services to lose quorum and fail. The cluster services did not automatically recover when network connectivity was restored, requiring manual intervention to restart services across multiple nodes.
The switch firmware bug specifically relates to improper handling of port reset operations, which should not trigger a full switch reboot under normal circumstances.
Resolution
Immediate Resolution:
- Network connectivity was restored automatically when the switch completed its reboot sequence at 08:44 SAST
- Cluster services were manually restarted across affected nodes between 08:48 and 11:32 SAST
- All platform functionality was verified operational by 11:32 SAST
Emergency Response:
- Engineering team immediately identified the network outage and monitored switch recovery
- Once the network was restored, cluster service states were assessed
- A systematic restart of cluster services was performed to restore quorum and functionality (illustrated below)
- Continuous monitoring was maintained throughout the recovery period
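For illustration, the sketch below shows one way such a node-by-node restart could be scripted. The node hostnames and service unit names are placeholders, since this report does not identify the specific cluster stack; the example assumes systemd-managed services on nodes reachable over SSH.

```python
#!/usr/bin/env python3
"""Illustrative sketch: restart cluster services node by node and wait for
them to report active. Node names and service units are placeholders; the
actual cluster stack is not identified in this report."""
import subprocess
import time

NODES = ["node01", "node02", "node03"]          # hypothetical node hostnames
SERVICES = ["cluster-manager", "cluster-sync"]  # hypothetical systemd units

def ssh(node: str, command: str) -> subprocess.CompletedProcess:
    """Run a command on a node over SSH and capture its output."""
    return subprocess.run(["ssh", node, command], capture_output=True, text=True)

def restart_and_verify(node: str) -> bool:
    """Restart each cluster service on one node, then poll until active."""
    for unit in SERVICES:
        ssh(node, f"systemctl restart {unit}")
    deadline = time.time() + 300  # allow up to 5 minutes per node
    while time.time() < deadline:
        states = [ssh(node, f"systemctl is-active {unit}").stdout.strip()
                  for unit in SERVICES]
        if all(state == "active" for state in states):
            return True
        time.sleep(10)
    return False

if __name__ == "__main__":
    for node in NODES:
        ok = restart_and_verify(node)
        print(f"{node}: {'recovered' if ok else 'NEEDS ATTENTION'}")
```

The exact restart order and verification steps depend on the cluster stack in use; this sketch only conveys the general node-by-node approach.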
Follow-up Actions
We sincerely apologize for this service disruption and the impact it had on your operations.
Monitoring and Response Measures Now in Place:
- Enhanced monitoring alerts configured for switch reboot events and cluster service failures
- Documented cluster service recovery procedures for faster resolution in future incidents
- Implemented automated health checks for cluster service status following network interruptions
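As an illustration of the kind of automated health check described above, the minimal sketch below checks whether the (placeholder) cluster service units are running and raises an alert if any are not. In practice such a check would run from a scheduler, for example shortly after a network interruption is detected.

```python
#!/usr/bin/env python3
"""Illustrative sketch of a post-interruption health check: verify that
cluster service units came back after a network blip and raise an alert
if any did not. Unit names and the alert hook are placeholders."""
import subprocess
import sys

SERVICES = ["cluster-manager", "cluster-sync"]  # hypothetical systemd units

def is_active(unit: str) -> bool:
    """Return True if systemd reports the unit as active."""
    result = subprocess.run(["systemctl", "is-active", unit],
                            capture_output=True, text=True)
    return result.stdout.strip() == "active"

def main() -> int:
    failed = [unit for unit in SERVICES if not is_active(unit)]
    if failed:
        # In production this would page the on-call engineer; here we
        # simply print and exit non-zero so a scheduler can flag it.
        print(f"ALERT: cluster services not running: {', '.join(failed)}")
        return 1
    print("OK: all cluster services active")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```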
Scheduled Maintenance:
- Preventive maintenance will be scheduled for the evening of February 10th, 2026 to update switch firmware across our infrastructure
- This firmware update will address the port reset bug that triggered the switch reboot
- Maintenance will be performed during a low-traffic window to minimize any potential impact
- Customers will receive advance notification of the maintenance window
Ongoing Actions:
- Conducting a comprehensive audit of all network switches to identify units requiring firmware updates
- Enhancing cluster resilience to better handle transient network interruptions
- Developing automated monitoring for switch stability and unexpected reboot events
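The sketch below illustrates one possible approach to reboot detection: polling the switch's SNMP sysUpTime value and flagging any decrease, which indicates a reboot since the previous poll. The management address, community string, and polling interval are placeholders, and the example assumes the net-snmp command-line tools are available.

```python
#!/usr/bin/env python3
"""Illustrative sketch: detect unexpected switch reboots by polling SNMP
sysUpTime and flagging any decrease between polls. The switch address and
community string are placeholders; requires the net-snmp 'snmpget' tool."""
import re
import subprocess
import time

SWITCH = "192.0.2.10"        # placeholder management address
COMMUNITY = "public"         # placeholder SNMP community string
SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"

def get_uptime_ticks() -> int:
    """Read sysUpTime from the switch and return it in timeticks."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, SWITCH, SYS_UPTIME_OID],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"\((\d+)\)", out)  # e.g. "Timeticks: (8675309) ..."
    if not match:
        raise RuntimeError(f"could not parse sysUpTime from: {out!r}")
    return int(match.group(1))

if __name__ == "__main__":
    last = get_uptime_ticks()
    while True:
        time.sleep(60)
        current = get_uptime_ticks()
        if current < last:
            # Uptime went backwards: the switch rebooted since the last poll.
            print(f"ALERT: switch {SWITCH} appears to have rebooted")
        last = current
```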
Long-term Improvements:
- Establishing proactive firmware update schedules for all critical network infrastructure
- Creating automated runbooks for cluster service recovery procedures
- Enhancing change management processes to ensure network firmware updates are deployed proactively
We remain committed to maintaining the highest levels of service availability and reliability for our infrastructure platform.
Once again, we apologize to affected customers, and thank you for your continued support.
The CloudAfrica Team.