Network Degradation
Incident Report for CloudAfrica
Postmortem

Top-of-Rack (ToR) Switch Stack Failure at Teraco Isando JB1 at 14:59 (SAST / UTC+2) on 8th December 2023

Incident and Mitigation Details

Incident Timeline
[14:59:49]: Initial loss of connectivity reported
[15:01:29]: Network operations team alerted
[15:02:46]: Preliminary diagnosis began
[15:06:40]: Issue isolated to the specific ToR switch
[15:33:28]: Switch rebooted for temporary fix
[15:35-19:35]: During this window, customers experienced three episodes of 3-5 minute loss of connectivity, caused by host cluster network instability resulting from excessively high cluster network activity. Full cluster network stability was restored at 19:35
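
For context, the sketch below illustrates the kind of simple reachability probe that automated monitoring commonly uses to raise an alert such as the one at 14:59:49. It is a minimal, hypothetical example: the host addresses, port, interval, and alerting behaviour are assumptions for illustration and do not describe CloudAfrica's actual monitoring stack.

    #!/usr/bin/env python3
    # Illustrative sketch only: a minimal reachability probe of the kind used by
    # automated monitoring to flag loss of connectivity to hosts behind a ToR switch.
    # All addresses and thresholds below are hypothetical.
    import socket
    import time

    HOSTS = ["10.0.8.11", "10.0.8.12", "10.0.8.13"]  # hypothetical host management IPs
    PORT = 22          # probe the SSH management port
    TIMEOUT_S = 2      # per-host connection timeout (seconds)
    INTERVAL_S = 30    # seconds between probe rounds

    def reachable(host: str) -> bool:
        """Return True if a TCP connection to the host succeeds within the timeout."""
        try:
            with socket.create_connection((host, PORT), timeout=TIMEOUT_S):
                return True
        except OSError:
            return False

    while True:
        down = [h for h in HOSTS if not reachable(h)]
        if down:
            # A production system would page the network operations team here.
            print(f"ALERT {time.strftime('%H:%M:%S')}: unreachable hosts: {down}")
        time.sleep(INTERVAL_S)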

Investigation and Findings
Physical Inspection: Physical inspection of the rack in question was unremarkable.
Log Analysis: Error logs indicated multiple failed attempts to handle incoming traffic, suggesting an internal cluster network processing issue.
Hardware Diagnostics: The affected Cisco switches were found to be running firmware with a bug that results in excessive CPU utilisation (an illustrative monitoring sketch follows these findings).
Environmental Factors: No abnormal environmental conditions (temperature, humidity) were detected.
Vendor Consultation: Engaged with the switch manufacturer (Cisco) for detailed analysis. They noted similar issues in other instances due to a firmware bug.
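
To make the high-CPU symptom above concrete, the sketch below shows one way a switch's CPU utilisation could be polled over SNMP and alerted on when it stays abnormally high. It is a hypothetical example: the switch address, community string, alert threshold, and OID are assumptions, and the correct CPU OID for a given platform should be confirmed against the vendor's MIB documentation.

    #!/usr/bin/env python3
    # Illustrative sketch only: poll a switch's CPU utilisation via SNMP using the
    # Net-SNMP 'snmpget' command-line tool and alert when it exceeds a threshold.
    # The address, community string, threshold, and OID are assumptions.
    import subprocess
    import time

    SWITCH = "10.0.8.1"      # hypothetical ToR switch management address
    COMMUNITY = "public"     # hypothetical read-only community string
    CPU_OID = "1.3.6.1.4.1.9.9.109.1.1.1.1.8.1"  # assumed 5-minute CPU OID; verify against the MIB
    THRESHOLD = 90           # alert when utilisation is at or above 90%

    def cpu_percent() -> int:
        """Query the switch and return the CPU utilisation value as an integer."""
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH, CPU_OID],
            capture_output=True, text=True, check=True,
        )
        return int(out.stdout.strip())

    while True:
        try:
            cpu = cpu_percent()
            if cpu >= THRESHOLD:
                print(f"ALERT {time.strftime('%H:%M:%S')}: switch CPU at {cpu}%")
        except (subprocess.CalledProcessError, ValueError) as exc:
            print(f"WARN: SNMP poll failed: {exc}")
        time.sleep(60)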

Root Cause
Primary Cause: A firmware bug in the switch leading to internal switch stack processing failures under specific traffic conditions.
Contributing Factors: A spike in inter-host cluster network traffic, which impacted the stability of the cluster. All traffic in CA data centres currently runs over a redundant 10Gbps backbone, which is in the process of being upgraded.

Impact Assessment
Service Downtime: Approximately 45 minutes of total service disruption across the affected hosts (and associated VMs) within the rack.
Data Loss: No data loss reported.
Performance Degradation: Temporary degradation in network performance until the issue was resolved.
Services Impacted: Interruption in network services for servers connected to the affected switch. This impacted VMs running in the affected rack.

Corrective and Preventive Measures

  1. The firmware of all affected Cisco switches/switch stacks has been updated to the latest version post-incident; the update is pending a reboot, which will be performed after the ongoing network upgrade at Teraco JB1, Teraco JB2, and OADC Isando.
  2. CA is expediting the go-live of our new 2x 100Gbps network backbone across Teraco Isando JB1, Teraco Bredell JB2, and OADC Isando. The infrastructure and fibre links for this new backbone are in place, and we will be switching all hosts and network devices across to it during December 2023.

We apologise for the impact to affected customers. We are continuously taking steps to improve the CloudAfrica Platform and our processes to help ensure such incidents do not occur in the future.

Sincerely,
The CloudAfrica Team.

Posted Dec 12, 2023 - 18:50 SAST

Resolved
Summary RCA Report: Top-of-Rack (ToR) Switch Stack Failure at Teraco Isando JB1 at 14:59 (SAST / UTC+2) on 8th December 2023

Incident Overview

Date of Incident: Friday 08 December 2023
Time of Incident Window Start: 15:02 (SAST / UTC+2)
Time of Incident Window End: 19:35 (SAST / UTC+2)
Detected By: Automated Monitoring

Initial Symptoms: Loss of connectivity to servers/hosts in the affected rack; increased packet loss and latency observed for all VMs on hosts within the rack; and three episodes of intermittent, multi-minute loss of connectivity and packet loss between 15:35 and complete resolution at 19:35.

Summary: The failure of the ToR switch stack was primarily caused by a firmware bug. Corrective, supportive and preventive measures include finalising the upgrade of CA’s core datacentre network to 100Gbps (due for completion during December 2023), and upgrading the current 10Gbps network devices (which will provide a third layer of network redundancy) with the latest firmware.

Date of Report: 12th December 2023
Posted Dec 12, 2023 - 18:46 SAST
Identified
At 14:59 we were alerted to degraded network performance.
The issue was traced to a switch that had hung.

The switch was rebooted, and we are restoring access to services.
Posted Dec 08, 2023 - 16:00 SAST
This incident affected: Cloud Services.