Power Outage in Network Device Rack at Teraco Isando JB1 at 01:05 (SAST / UTC+2) on 30th January 2024
Incident and Mitigation Details
Incident Timeline
[01:05]: Initial loss of connectivity detected by monitoring systems
[01:07]: Operations team alerted and preliminary diagnosis begins
[01:20]: Issue determined to be likely related to a power problem in the network device rack at Teraco JB1
[01:25]: Support Ticket logged at Teraco JB1 and engineer dispatched
[01:55]: Engineer onsite at Teraco JB1
[02:05]: Engineer accessed the affected rack; immediate assessment identified a failed Automatic Transfer Switch (ATS) and a tripped Power Feed A in the rack
[03:30]: Power fully restored in the rack; ATS removed, devices re-cabled, and all affected network devices powered up
[03:45]: Power issues found to have corrupted the configuration of a number of network devices; configurations restored from backup, and all network services fully restored by 05:15
[06:10]: Host cluster network restarted, as the network device issues had affected the cluster network and consequently impacted connectivity to approximately 15% of all VMs in Teraco JB1. Cluster network services fully restored, with all hosts available and fully operational by 07:00.
Investigation and Findings
Physical Inspection: Physical inspection of the rack in question identified a failed ATS unit and a tripped Power Feed A
Log Analysis: N/A
Hardware Diagnostics: Various core/edge routers and network switches hung or powered off due to the power issue
Environmental Factors: No abnormal environmental conditions (temperature, humidity) were detected.
Vendor Consultation: N/A
Root Cause
Primary Cause: Failure of the Automatic Transfer Switch (ATS) in the network device rack, which tripped the Power Feed A side and caused loss of output to connected devices
Contributing Factors: The loss of power corrupted a number of network device configuration files, requiring their re-installation from backup, and necessitated a restart of the main host cluster network
Impact Assessment
Service Downtime: Approximately 4 hours of core network downtime, plus loss of connectivity for approximately 15% of VMs until the cluster network was restored roughly 2 hours later (full restoration by 07:05)
Data Loss: No data loss reported.
Performance Degradation: Significant degradation/loss of network availability until the issue was resolved.
Services Impacted: All services impacted from 01:05 to 05:15
Corrective and Preventive Measures
We apologize to all affected customers for the impact of this incident. We are continually taking steps to improve the CloudAfrica Platform and our processes to help prevent such incidents from occurring in the future.
Sincerely,
The CloudAfrica Team.