DoS Attack Causing Network and Connectivity Disruption (07h50-10h29 SAST / UTC+2 on Saturday 29th March 2025)

Incident Report for CloudAfrica

Postmortem

Incident Report: CloudAfrica.net Service Disruptions on March 29, 2025

Summary

On March 29, 2025, CloudAfrica.net experienced a series of intermittent service disruptions between 07:31 and 10:29 (GMT+02:00). The total cumulative downtime was approximately 1 hour and 50 minutes, impacting availability and performance across our hosted services. The issue was caused by a denial-of-service (DoS) attack originating from a compromised virtual machine (VM) within the CloudAfrica platform.

Timeline of Events

Monitoring via Pingdom recorded multiple distinct short outages, ranging from 1 to 26 minutes each:

  • Total downtime: ~110 minutes

Root Cause

A small virtual machine (2 vCPU, 2 GB RAM) was compromised by a third party and used to launch a high-throughput DoS attack. The VM opened over 1.5 million concurrent active connections, which overwhelmed CloudAfrica’s edge firewalls. Although this traffic did not breach any resource limit in force on the platform at the time, it far exceeded any expected behavior pattern.

CloudAfrica’s DDoS protection infrastructure, strengthened after the external DDoS attack of July 26, 2024 (incident link), successfully guards against external threats. This attack, however, originated inside the platform, and this internal vector exposed a new class of attack risk.

CloudAfrica Platform Philosophy & Current Context

Unlike most other cloud providers, CloudAfrica provides generous, out-of-the-box resource allocations for virtual machines, including CPU, RAM, and most notably, network bandwidth and connection concurrency. We offer industry-leading price-to-performance in the African market, and this includes allowing VMs to operate without strict artificial limits on network throughput.

Most VMs on our platform generate moderate traffic, but some consistently sustain tens of thousands of concurrent connections, and at present these are not restricted.

This open performance model enables customers to achieve high efficiency and low latency, but the trade-off is a potential vulnerability in the case of compromised VMs, which can unleash excessive internal traffic. That is exactly what occurred during this incident, when a small VM unexpectedly created over 1.5 million concurrent connections, flooding our infrastructure.

Immediate Mitigation Actions

CloudAfrica’s operations team:

  • Identified and isolated the offending VM
  • Flushed all affected connection states at the network edge
  • Restored full platform availability by 10:29

Post-incident forensics confirmed that the platform was stable and that no other customers or services had been compromised.

Preventative Measures Going Forward

To mitigate this class of issue, and to extend our DDoS protections from external to internal traffic, we are implementing the following platform-wide policy:

All virtual machines will be limited to a maximum of 75,000 concurrent active connections.

This is roughly three times the peak load of our busiest VMs, which typically sustain around 25,000 concurrent connections. For customers with higher requirements, custom limits will be available upon request.

This safeguard strikes a balance between maintaining our performance-first approach and protecting the platform from extreme, anomalous behaviors caused by misconfigurations or compromises.
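To illustrate how a per-VM connection cap with custom overrides behaves, here is a minimal Python sketch. It is not CloudAfrica's implementation: the class name ConnectionLimiter, the VM identifiers and the 200,000 override value are hypothetical, and in practice a limit like this would more likely be enforced in the firewall or hypervisor connection-tracking layer than in application code.

```python
from collections import defaultdict

# Platform-wide default taken from this report. The override mechanism models
# "custom limits upon request"; all names here are hypothetical.
DEFAULT_MAX_CONNECTIONS = 75_000


class ConnectionLimiter:
    """Tracks concurrent connections per VM and refuses new ones above a cap."""

    def __init__(self, default_limit=DEFAULT_MAX_CONNECTIONS, overrides=None):
        self.default_limit = default_limit
        # Hypothetical per-VM custom limits, keyed by VM identifier.
        self.overrides = dict(overrides or {})
        # Current number of open connections per VM.
        self.active = defaultdict(int)

    def limit_for(self, vm_id):
        """Return the VM's custom limit if one exists, else the default."""
        return self.overrides.get(vm_id, self.default_limit)

    def try_open(self, vm_id):
        """Admit a new connection unless the VM is already at its cap."""
        if self.active[vm_id] >= self.limit_for(vm_id):
            return False
        self.active[vm_id] += 1
        return True

    def close(self, vm_id):
        """Release one connection slot when a connection ends."""
        if self.active[vm_id] > 0:
            self.active[vm_id] -= 1


if __name__ == "__main__":
    limiter = ConnectionLimiter(overrides={"vm-high-traffic": 200_000})
    # A compromised VM attempting 1.5 million connections, as in this incident.
    admitted = sum(limiter.try_open("vm-compromised") for _ in range(1_500_000))
    print(admitted)  # prints 75000
    print(limiter.limit_for("vm-high-traffic"))  # prints 200000
```

In this sketch, a VM attempting the 1.5 million connections seen during the incident has only its first 75,000 admitted; further attempts are refused until existing connections close, while a VM with a custom override is measured against its own higher limit.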

These changes are currently being tested and will be gradually rolled out by 5th April 2025.

Conclusion

CloudAfrica.net is committed to delivering a high-performance, secure, and transparent cloud experience. This incident has helped us identify and close a critical edge case in our infrastructure design. We regret the inconvenience caused and are taking proactive steps to ensure such an incident does not recur.

Please visit status.cloudafrica.net for further updates, or reach out to our support team with any questions or feedback.

Posted Apr 01, 2025 - 12:09 SAST

Resolved

At approximately 07h50 on Saturday 29th March 2025, the CloudAfrica platforms at Teraco Isando (JB1) began experiencing intermittent loss of network connectivity between internal systems and to external Internet peers.

The issue was identified as a DoS attack originating from a compromised internal customer VM, which was located and shut down.

Total network outage time was in the order of 110 minutes, and full recovery of all connectivity was completed by 10h29 on the same day.

We sincerely apologise to customers and partners, and will provide a post-mortem review and details of mitigation steps being undertaken.
Posted Mar 29, 2025 - 08:00 SAST