Power Outage in Network Device Rack at Teraco Isando JB1 at 01:05 (SAST / UTC+2) on 30th January 2024
Incident and Mitigation Details
Incident Timeline
[01:05]: Initial loss of connectivity detected by monitoring systems
[01:07]: Operations team alerted and preliminary diagnosis begins
[01:20]: Issue determined to be likely related to a power problem in the network device rack at Teraco JB1
[01:25]: Support Ticket logged at Teraco JB1 and engineer dispatched
[01:55]: Engineer onsite at Teraco JB1
[02:05]: Engineer accessed the affected rack; immediate assessment identified a failed Automatic Transfer Switch (ATS) and a tripped Power Feed A in the rack
[03:30]: Power fully restored in the rack; ATS removed, devices re-cabled, and all affected network devices powered up
[03:45]: Power issues found to have corrupted the configuration of a number of network devices; configurations restored from backup, and all network services fully restored by 05:15
[06:10]: Host cluster network restarted, as the network device issues had affected the cluster network and consequently impacted connectivity to approximately 15% of all VMs in Teraco JB1. Cluster network services fully restored, with all hosts available and fully operational by 07:00.
Investigation and Findings
Physical Inspection: Physical inspection of the rack in question identified a failed ATS unit and a tripped Power Feed A
Log Analysis: N/A
Hardware Diagnostics: Various core/edge routers and network switches hung or powered off due to the power issue
Environmental Factors: No abnormal environmental conditions (temperature, humidity) were detected.
Vendor Consultation: N/A
Root Cause
Primary Cause: Failure of the Automatic Transfer Switch (ATS) in the network device rack, which tripped the Power Feed A side and caused loss of output to connected devices
Contributing Factors: The loss of power corrupted a number of network device configuration files, requiring their re-installation from backup, and necessitated a restart of the main host cluster network
Impact Assessment
Service Downtime: Approximately 4 hours of core network downtime, plus loss of connectivity for approximately 15% of VMs until the cluster network was restored roughly 2 hours later (full restoration by 07:05)
Data Loss: No data loss reported.
Performance Degradation: Significant degradation/loss of network availability until the issue was resolved.
Services Impacted: All services impacted from 01:05 to 05:15
Corrective and Preventive Measures
We apologize to all affected customers for the impact of this incident. We are continually taking steps to improve the CloudAfrica Platform and our processes to help prevent such incidents from occurring in the future.
Sincerely,
The CloudAfrica Team.