Loss of connectivity - Teraco Isando
Incident Report for CloudAfrica
Resolved
Background

1. CloudAfrica connects to the Internet through four upstream IP transit providers;
2. This is done using standard peering mechanisms for IP transit utilising BGP (Border Gateway Protocol);
3. BGP allows us to fail over between different transit providers (as well as providing least 'cost' routing) via our various peering arrangements;
4. Failover typically occurs when the BGP session with one or more IP transit providers is disrupted: the routes learned over the lost session are withdrawn and traffic is routed over the remaining BGP peers;
5. Our default preferred upstream peer is WorkOnline (WOL), Africa's leading IP transit provider;
6. As part of securing BGP, more and more IP transit providers are making use of RPKI (Resource Public Key Infrastructure) to cryptographically verify that route announcements originate from networks authorised to make them. This ensures that routes are accepted only from legitimate network owners and operators, and minimises the possibility of network and IP address hijacking;
a. WOL utilises RPKI extensively and this is one of the reasons that it is our preferred IP transit partner.
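
As an illustration of the RPKI states referred to in this report, the following is a minimal sketch of route origin validation along the lines of RFC 6811. It is not WOL's or CloudAfrica's implementation. The names Vrp and validate_origin are hypothetical, and the prefixes and AS numbers are taken from the ranges reserved for documentation.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Vrp:
    """A validated ROA payload: prefix, maximum length, authorised origin AS."""
    prefix: str
    max_length: int
    asn: int


def validate_origin(route_prefix: str, origin_asn: int, vrps: list[Vrp]) -> str:
    """Classify a route as 'valid', 'invalid' or 'not-found'."""
    route = ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ip_network(vrp.prefix)
        # Only VRPs of the same address family that cover the route are relevant.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        # A covering VRP matches if the origin AS agrees and the route is not
        # more specific than the VRP's maximum length allows.
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    # Covered but never matched is 'invalid'; no covering VRP is 'not-found'.
    return "invalid" if covered else "not-found"


# Hypothetical data for illustration only.
example_vrps = [Vrp("192.0.2.0/24", 24, 64500)]
print(validate_origin("192.0.2.0/24", 64500, example_vrps))     # valid
print(validate_origin("192.0.2.0/24", 64501, example_vrps))     # invalid
print(validate_origin("198.51.100.0/24", 64500, example_vrps))  # not-found
print(validate_origin("192.0.2.0/24", 64500, []))               # not-found
```

The last call shows that a router with no validated data available classifies every route as "not-found", including routes that would otherwise be "valid".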

Root Cause Analysis for Connectivity loss between 18h47-19h17 (SAST/UTC+2) on Wednesday 22nd February 2023

1. At 18h47 our networks and systems experienced severely degraded and intermittent connectivity to the Internet;
2. Initial fault-finding showed that all peers were active and there was no clear indication as to the cause of the outage;
3. Deeper analysis showed that in spite of all BGP peers (and sessions) being up, no traffic was flowing across the WOL peering link;
4. While we were manually re-routing traffic and disabling the WOL peer, traffic flow across the WOL peering link was restored at 19h17;
5. Feedback received from WOL:
a. "The edge service router on which your service terminates lost connectivity to both of our RPKI servers during that time. This caused a routing issue because of the differences in the way our different routing platforms handle "RPKI NotFound" and "RPKI Valid" routes."
6. This event did not result in loss of the BGP session with WOL, but effectively scrambled routing on the WOL side of our peering link:
a. Whilst routing to us was disrupted on the WOL side, the BGP session remained up. As far as our peering routers were concerned, WOL was still available as a transit peer, so no failover was triggered.

Mitigation

1. WOL has indicated that they are in the process of optimising their RPKI infrastructure - they will communicate timeframes to us as and when they engage with their vendor as part of this project;
2. Whilst this is a fairly complex (and rare) issue to deal with from the CloudAfrica perspective, we are looking at various mechanisms to automate optimal route selection across BGP peers that take into account more than whether a BGP session is active, and we'll communicate decisions made in this regard in due course.
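
To illustrate the kind of mechanism meant in point 2, the following is a minimal sketch of peer selection that combines BGP session state with the results of data-plane probes (for example, pings sent to external targets via each peer). It does not describe a decision CloudAfrica has made. PeerStatus, is_usable, select_peer, the peer names other than WOL, the preference values and the 50% loss threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PeerStatus:
    name: str
    preference: int            # higher is preferred (hypothetical ranking)
    session_established: bool  # control plane: is the BGP session up?
    probes_sent: int           # data plane: probes sent via this peer
    probes_answered: int       # data plane: probes that received a reply


def is_usable(peer: PeerStatus, max_loss: float = 0.5) -> bool:
    """A peer is usable if its session is up and probe loss is acceptable."""
    if not peer.session_established:
        return False
    if peer.probes_sent == 0:
        # No data-plane evidence: fall back to session state alone.
        return True
    loss = 1 - peer.probes_answered / peer.probes_sent
    return loss <= max_loss


def select_peer(peers: list[PeerStatus], max_loss: float = 0.5) -> Optional[str]:
    """Return the name of the most preferred usable peer, or None."""
    usable = [p for p in peers if is_usable(p, max_loss)]
    if not usable:
        return None
    return max(usable, key=lambda p: p.preference).name


# Session up on both peers, but no probe replies via the preferred one.
peers = [
    PeerStatus("WOL", 200, True, probes_sent=10, probes_answered=0),
    PeerStatus("transit-b", 100, True, probes_sent=10, probes_answered=10),
]
print(select_peer(peers))  # transit-b
```

With session state as the only input (probes_sent set to 0), the same function would keep selecting the preferred peer, which corresponds to the behaviour described in the root cause analysis above.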

We sincerely apologise for the network disruption experienced during this event by our customers and partners.

The CloudAfrica Team
Posted Feb 23, 2023 - 15:43 SAST
Update
All access is restored.

We are still investigating and will report back.
Posted Feb 22, 2023 - 19:30 SAST
Investigating
We are experiencing intermittent connectivity failures into DC1 and DC2.

We are investigating the cause and will report back as soon as possible.
Posted Feb 22, 2023 - 19:11 SAST
This incident affected: Web Sites, API, Storage Services, and Cloud Services.