Top-of-Rack (ToR) Switch Stack Failure at Teraco Isando JB1, 14:59 (SAST / UTC+2) on 8th December 2023
Incident and Mitigation Details
Incident Timeline
[14:59:49]: Initial loss of connectivity reported
[15:01:29]: Network operations team alerted
[15:02:46]: Preliminary diagnosis began
[15:06:40]: Issue isolated to the specific ToR switch
[15:33:28]: Switch rebooted as a temporary fix
[15:35-19:35]: During this window, customers experienced three episodes of connectivity loss, each lasting 3-5 minutes, caused by host cluster network instability under excessively high cluster network activity. Cluster network stability was fully restored at 19:35.
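The response intervals implied by the timeline above can be worked out with a short script. This is a minimal sketch for illustration only: the function name `elapsed_since_start` is hypothetical, and the timestamps are copied from the timeline entries as written.

```python
from datetime import datetime

# Events and timestamps copied from the incident timeline (SAST).
TIMELINE = [
    ("Initial loss of connectivity reported", "14:59:49"),
    ("Network operations team alerted", "15:01:29"),
    ("Preliminary diagnosis began", "15:02:46"),
    ("Issue isolated to the specific ToR switch", "15:06:40"),
    ("Switch rebooted", "15:33:28"),
]


def elapsed_since_start(timeline):
    """Return (label, timedelta since the first event) for each entry."""
    start = datetime.strptime(timeline[0][1], "%H:%M:%S")
    return [
        (label, datetime.strptime(stamp, "%H:%M:%S") - start)
        for label, stamp in timeline
    ]


for label, delta in elapsed_since_start(TIMELINE):
    print(f"+{delta}  {label}")
```

On these figures, the team was alerted 1 minute 40 seconds after the first report, and the fault was isolated to the switch within 6 minutes 51 seconds.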
Investigation and Findings
Physical Inspection: Physical inspection of the rack in question was unremarkable.
Log Analysis: Error logs indicated multiple failed attempts to handle incoming traffic, suggesting an internal cluster network processing issue.
Hardware Diagnostics: The affected Cisco switches were found to be subject to a firmware bug that results in high CPU utilisation.
Environmental Factors: No abnormal environmental conditions (temperature, humidity) were detected.
Vendor Consultation: Engaged with the switch manufacturer (Cisco) for detailed analysis. Cisco noted similar issues in other cases caused by a firmware bug.
Root Cause
Primary Cause: Firmware bug in the switch, leading to internal switch stack processing failures under specific traffic conditions.
Contributing Factors: Spike in inter-host cluster network traffic, which affected the stability of the cluster. All traffic in CA data centres currently runs on a redundant 10 Gbps backbone, which is in the process of being upgraded.
Impact Assessment
Service Downtime: Approximately 45 minutes of total service disruption across the affected hosts (and associated VMs) within the rack.
Data Loss: No data loss reported.
Performance Degradation: Temporary degradation in network performance until the issue was resolved.
Services Impacted: Interruption in network services for servers connected to the affected switch. This impacted VMs running in the affected rack.
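As a rough cross-check of the downtime figure above, the estimate can be reproduced from the timeline. This is a sketch under stated assumptions, not an official calculation: it assumes connectivity returned at the time of the switch reboot (15:33:28) and that each of the three later episodes lasted between 3 and 5 minutes. The name `downtime_range` is hypothetical.

```python
from datetime import datetime, timedelta

# Timestamps from the incident timeline (SAST, 8 December 2023).
FIRST_LOSS = datetime(2023, 12, 8, 14, 59, 49)
SWITCH_REBOOT = datetime(2023, 12, 8, 15, 33, 28)

# Assumption: the initial outage ended when the switch was rebooted.
initial_outage = SWITCH_REBOOT - FIRST_LOSS

# Later instability: three episodes of 3-5 minutes each.
EPISODES = 3
EPISODE_MIN = timedelta(minutes=3)
EPISODE_MAX = timedelta(minutes=5)


def downtime_range():
    """Return (lowest, highest) estimated total downtime."""
    low = initial_outage + EPISODES * EPISODE_MIN
    high = initial_outage + EPISODES * EPISODE_MAX
    return low, high


low, high = downtime_range()
print(f"Initial outage: {initial_outage}")
print(f"Estimated total downtime: {low} to {high}")
```

Under these assumptions the total falls between roughly 43 and 49 minutes, which is consistent with the reported figure of approximately 45 minutes.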
Corrective and Preventive Measures
We apologise for the impact on affected customers. We are continuously taking steps to improve the CloudAfrica Platform and our processes to help ensure that such incidents do not occur in the future.
Sincerely,
The CloudAfrica Team.