Network Incident

Incident Report for CloudAfrica

Postmortem

RCA Report: DDOS Attack against IPs in the 41.79.76.0/22 CloudAfrica subnet announced to upstream peers via BGP on ASN37352 on Friday 26th July 2024

 

Incident Overview

Date/Time of Incident Window Start: Friday 26th July 2024 at 13:36 (SAST / UTC+2)

Date/Time of Incident Window End: Saturday 27th July 2024 at 09:40 (SAST / UTC+2)

Detected By: Automated Monitoring

Initial Symptoms: At 13:36 on Friday 26th July 2024, our networks experienced a rapid and sudden loss of Internet connectivity affecting all our Internet peers at the Teraco Isando JB1 and OADC JHB1 data centres. Internal VM, storage and network availability were not impacted.

Summary: A DDOS attack impacted Internet-facing connectivity from our edge firewall and routing devices in Teraco JB1 (Isando) and OADC JHB1 (Isando). Further details of the scale and scope of the DDOS attack are provided below.

 

Date of Report: 1st August 2024

 

Post-Mortem

Incident and Mitigation Details

Incident Timeline

The following were the specific periods of loss of Internet connectivity during the DDOS attack window:

Friday 26th July 2024

[13:36] – [15:12]

[16:06] – [16:21]

[19:23] – [19:52]

Saturday 27th July 2024

[03:00] – [03:44]

[07:26] – [09:40]

The total downtime during the DDOS attack window was 5 hours and 18 minutes.

 

Our response timeline was as follows:

[13:36]: Initial loss of connectivity triggered alarms from our monitoring systems, immediately notifying the CloudAfrica team.

[13:40]: Review of network elements (routers, switches and firewalls) showed almost 100% CPU utilisation on our edge firewall devices. This complicated diagnostics, as these network elements had become significantly less responsive.

[14:00]: A DDOS attack was confirmed after analysis of router connection and traffic-flow statistics.

[14:10]: Review of options began to block the DDOS at the edge-router level in order to decrease or nullify the impact on the edge firewalls.

[16:00]: Installation and configuration of BGP black-holing capability began on the edge routers.

[16:50]: Initial implementation of BGP black-holing capability on the edge routers completed.

[17:00]–(Sat)[09:40]: Ongoing tuning of the volumetric attack parameters for the BGP black-holing solution on the control VM, together with re-installation of one of our firewalls to repair a load-associated configuration corruption. The DDOS attack was fully mitigated at 09:40.
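To illustrate the black-holing mechanism referenced above: a common way to implement BGP black-holing is to have a control process announce host routes for offending IPs tagged with the well-known BLACKHOLE community (65535:666, RFC 7999), so that peers drop the matching traffic. The sketch below is a minimal, hypothetical example in the style of an ExaBGP control script; the next-hop, tooling, and exact policy (source-based versus destination-based black-holing) are assumptions, not a description of CloudAfrica's actual implementation.

```python
# Hypothetical sketch of a black-holing control process. The discard
# next-hop (a documentation address here) and the use of ExaBGP-style
# "announce route" commands are assumptions for illustration only.

BLACKHOLE_COMMUNITY = "65535:666"  # RFC 7999 well-known BLACKHOLE community


def blackhole_announcement(ip: str, next_hop: str = "192.0.2.1") -> str:
    """Format an ExaBGP-style announce line for a /32 host route tagged
    with the BLACKHOLE community, asking routers that honour the
    community to discard matching traffic."""
    return (f"announce route {ip}/32 next-hop {next_hop} "
            f"community [{BLACKHOLE_COMMUNITY}]")
```

In a source-based variant, routes for attacker IPs would be announced and combined with loose uRPF on the edge routers so traffic *from* those sources is dropped before it reaches the firewalls.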

Investigation and Findings

Physical Inspection: N/A

Log Analysis: In excess of 600,000 live connections at any one time, and up to 200,000 connection requests per second, were identified on the edge routers during the attack window.
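The kind of per-second rate analysis described above can be sketched very simply. The snippet below is illustrative only: the baseline figure is taken from the 'normal' blocked-attempt rate reported later in this document, while the alert multiplier and the flow-log tuple format are assumptions.

```python
# Minimal sketch of a connection-rate check over flow-log events,
# assuming each event is a (timestamp_second, source_ip) tuple.
from collections import Counter

BASELINE_CPS = 3_000    # upper bound of the 'normal' blocked-attempt rate
ALERT_MULTIPLIER = 10   # assumed alerting threshold for this sketch


def flag_anomalous_seconds(events):
    """Return the timestamp seconds whose connection-request count
    exceeds ALERT_MULTIPLIER times the normal baseline."""
    per_second = Counter(ts for ts, _src in events)
    limit = BASELINE_CPS * ALERT_MULTIPLIER
    return sorted(ts for ts, n in per_second.items() if n > limit)
```

At the observed peak of 200,000 requests per second, any threshold of this form would have fired immediately; the difficulty during the incident was not detection but the CPU exhaustion of the devices doing the counting.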

Hardware Diagnostics: Cisco ASA Firepower firewall devices and Cisco edge routers were placed under massive load by the flood of connections generated by the DDOS, driving CPU utilisation on these devices to near 100%.

Environmental Factors: No abnormal environmental conditions (temperature, humidity) were detected.

Vendor Consultation: N/A

Root Cause

Primary Cause: Significant DDOS attack on one of our subnets overwhelmed the capacity of edge routers and firewalls at Teraco JB1 and OADC JHB1.

Secondary Cause: N/A

Background:

  1. During a ‘normal’ weekday 24-hour period, our firewalls and access filters block between 2,500 and 3,000 attack attempts per second – in the order of 200 million such attempts per 24-hour period.
  2. These attempts are successfully blocked, so customer VMs are not aware of them and they do not appear in VM logs.
  3. During this specific event, connection rates topped 200,000 attack attempts per second – an increase of nearly two orders of magnitude over ‘normal’ activity.
  4. The attack emanated from a dynamically changing pool of between 2,000 and 3,000 IPs at any one time, largely from the US, Russia, Belize, the Seychelles, and the Philippines. An initial geolocation assessment had also misidentified China as an additional source, but this was incorrect.
  5. The attack consisted of significant SYN and UDP floods, as well as aggressive, repeated port-scanning of the entire 41.79.76.0/22 subnet.
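A SYN flood of the kind described above is commonly identified by the gap between connections opened (bare SYNs) and connections ever completed. The sketch below shows that heuristic; the ratio threshold and parameter names are assumptions for illustration, not the actual detection logic used during this incident.

```python
# Hedged sketch of a common SYN-flood heuristic: flag an attack when
# far more connections are opened (SYN seen) than are ever established
# (three-way handshake completed). The 50:1 threshold is an assumption.


def syn_flood_suspected(syn_count: int, established_count: int,
                        ratio_threshold: float = 50.0) -> bool:
    """Return True when the ratio of SYNs to established connections
    exceeds the threshold, a strong indicator of a SYN flood."""
    if established_count == 0:
        return syn_count > 0
    return syn_count / established_count > ratio_threshold
```

UDP floods need a different signal (raw packet or byte rate per destination), since UDP has no handshake to compare against.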

Contributing Factors: N/A

Impact Assessment

Service Downtime: 5 hours and 18 minutes of Internet connectivity loss was experienced during the DDOS incident window.

Data Loss: No data loss reported.

Performance Degradation: Significant disruption of Internet-facing services as per the previous timeline review. No internal network, VM or storage availability was impacted.

Services Impacted: Interruption of Internet Connectivity

 

Corrective and Preventive Measures

Immediate Actions:

  1. Implementation of BGP black-holing capability on edge-routers to block connection attempts from malicious IPs - COMPLETED
  2. Upgrade of firewall infrastructure to cater for a 20-fold increase in traffic and connection flows. (Upgrades such as this provide some initial mitigation and breathing space, but it is simply not possible to mitigate attacks of this scale purely with additional hardware.) – IN PROGRESS – we expect to conclude all upgrades within 2 weeks of the date of this report. Equipment is on site and installed, but we are still developing a cutover approach that does not disrupt customer connectivity.
  3. Implementation of DDOS scrubbing service – we are currently in talks with various possible vendors offering DDOS scrubbing services (including Allot, Radware, Imperva, and Cloudflare) - IN PROGRESS.

We apologise sincerely for the impact on affected customers.

We are continuously taking steps to improve the CloudAfrica Platform and our processes to help ensure such incidents do not occur in the future.

Sincerely,

The CloudAfrica Team.

Posted Aug 01, 2024 - 17:45 SAST

Resolved

This incident has been resolved.
Posted Aug 01, 2024 - 17:36 SAST

Update

We have successfully mitigated the major DDOS attack against VMs on one of our subnets.

Systems and networks have been fully stable as of Saturday 27 July at 09:40 SAST (UTC+2).

Further, we have put in place significant additional resources to be able to deal with such attacks in the future.

As part of this process, we have now upgraded our edge firewall clusters over the weekend and will be switching over to them during this evening.

We are scheduling network maintenance for tonight in order to get the new firewalls in place - a maintenance notice will go out to all customers shortly advising of the details for the maintenance window.

After the successful migration this evening to the new firewall clusters, an RCA will be supplied to all customers detailing the course of events and the mitigation steps we have taken. It will be issued 24 hours after the successful migration to the new firewalls.

We apologise sincerely for the impact this DDOS attack has caused over periods of last Friday afternoon and early Saturday morning, and we appreciate your support and patience.
Posted Jul 29, 2024 - 08:31 SAST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jul 27, 2024 - 09:59 SAST

Update

The co-ordinated DDOS attack is still ongoing and we continue to urgently update and adapt our mitigation measures in response. Unfortunately, our network connectivity is still impacted.
Posted Jul 27, 2024 - 07:56 SAST

Identified

We identified a Distributed Denial of Service (DDOS) attack targeting our infrastructure. This incident resulted in intermittent service disruptions affecting ingress and egress traffic.

Our engineering team has been actively engaged in mitigating the DDOS attack.
We have been filtering, rate limiting and blocking offending IP ranges, depending on the severity.

Our immediate priority is to fully mitigate the ongoing attack and restore normal service levels. Further updates will be provided as we make progress in tracing and neutralizing the attack source.
Posted Jul 26, 2024 - 17:26 SAST

Investigating

We are currently investigating a network incident affecting traffic reaching our cloud environment.
Posted Jul 26, 2024 - 13:55 SAST
This incident affected: Web Sites, API, Cloud Services, and Network.