We're performing emergency network maintenance tonight to address what we believe to be the root cause of recent service interruptions.
MAINTENANCE DETAILS:
Date: February 19, 2026
Time: 00:30 to 01:30 SAST (UTC+2)
Duration: 60-minute window, with individual interruptions of 30-90 seconds expected within it
Impact: Brief service interruptions, automatic reconnection
Scope: All services in Teraco JB1 East, Teraco JB1 West, and OADC JNB1
THE PROBLEM:
Over the past 16 days, we've experienced 15 unplanned service interruptions totaling 72 minutes of downtime, with 4 of those incidents (32 minutes combined) occurring today alone. These outages have been characterized by CPU saturation on core switches during traffic peaks, leading to packet drops and service degradation.
TROUBLESHOOTING:
Initially, we believed these issues might be firmware-related. Last night we completed firmware upgrades across all 15 core switches in our network fabric, which interconnects the Teraco JB1 East, Teraco JB1 West, and OADC JNB1 datacenters at 200Gbps. While these upgrades improved stability marginally, the core outage pattern persisted.
After further analysis of switch logs, CPU utilization patterns, and packet drop statistics, the root cause appears to be excessive protocol overhead in our MSTP (Multiple Spanning Tree Protocol) configuration. Our switches run MSTP with ~4000 VLANs, which forces the CPU to perform MD5 hash calculations and region verification every 2 seconds. This overhead consumes 10-15% of CPU at baseline and spikes to 100% during traffic bursts, at which point packet forwarding degrades and outages occur.
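For readers who want to see what the MD5 calculation refers to, here is a minimal Python sketch of the MSTP configuration digest, assuming the scheme described in IEEE 802.1Q: an HMAC-MD5 over a 4096-entry table that maps each VLAN ID to a spanning-tree instance. The function name mst_config_digest and the sample mapping are hypothetical, and the signature key is quoted as we understand it from the standard, so please verify it there before relying on it. This is an illustration only, not our switch vendor's implementation.

```python
import hashlib
import hmac
import struct

# Fixed signature key for the MST Configuration Digest, as we understand it
# from IEEE 802.1Q (assumption: verify against the standard before use).
MST_DIGEST_KEY = bytes.fromhex("13AC06A62E47FD51F95D2BA243CD0346")


def mst_config_digest(vlan_to_instance):
    """Return the 16-byte HMAC-MD5 digest of a VLAN-to-instance mapping.

    vlan_to_instance maps a VLAN ID (1-4094) to a spanning-tree instance
    number. VLANs not listed stay in instance 0 (the common spanning tree).
    """
    table = [0] * 4096  # entries for VLAN IDs 0 and 4095 always stay zero
    for vid, instance in vlan_to_instance.items():
        if not 1 <= vid <= 4094:
            raise ValueError(f"VLAN ID out of range: {vid}")
        table[vid] = instance
    # Pack as 4096 big-endian 16-bit values (8192 bytes in total).
    packed = struct.pack(">4096H", *table)
    return hmac.new(MST_DIGEST_KEY, packed, hashlib.md5).digest()


# Hypothetical usage: moving a single VLAN to another instance changes the digest.
default_digest = mst_config_digest({})
changed_digest = mst_config_digest({100: 1})
print(default_digest.hex())
print(changed_digest.hex())
```

Two switches treat each other as members of the same MSTP region only when their region name, revision number, and this digest all match, which is the region verification mentioned above.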
SOLUTION:
Based on this analysis, we're migrating to RSTP (Rapid Spanning Tree Protocol), which:
- Provides identical network protection and redundancy for our environment, since we do not use the features that MSTP is designed for
- Eliminates unnecessary protocol overhead (3-5x reduction)
- Frees CPU resources to handle traffic instead of redundant calculations
- Aligns with industry best practices for our network topology
This change is analogous to disabling resource-intensive features that provide no functional benefit: we maintain the same protection with significantly lower overhead.
EXPECTED RESULTS:
If our analysis is correct (which we believe it is), this change should deliver:
- 50-70% reduction in CPU utilization during traffic peaks
- Significantly improved stability and fewer service interruptions
- Faster packet processing and better overall performance
- Elimination of the outage pattern we've observed
ONGOING ANALYSIS:
Between now and the maintenance window (00:30 tonight), we will continue conducting detailed core switch reviews to verify this diagnosis and ensure we're addressing the root cause. We're monitoring CPU patterns, packet drops, queue depths, and protocol overhead in real-time.
We're confident in our analysis and are scheduling this emergency maintenance on the basis that MSTP overhead is the primary issue. We will also remain vigilant for any additional contributing factors.
TECHNICAL ASSURANCE:
- No change to network topology, routing, or VLAN configuration is planned
- Identical loop prevention and failover capabilities
- Industry-standard protocol (IEEE 802.1w, since incorporated into IEEE 802.1D-2004) used globally in production
- Comprehensive rollback plan ready if needed (though not expected)
- Change will be applied systematically across affected switches
WHY EMERGENCY TIMING:
With the frequency and severity of outages increasing (4 incidents today), the current configuration is actively degrading service quality. Based on the recent average (72 minutes over 16 days), each day of delay adds approximately 4-5 minutes of unplanned downtime.
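The per-day figure above follows from the incident totals quoted earlier in this notice. As a quick check, the arithmetic can be sketched in Python. The variable names are ours, and the sketch assumes that today's 4 incidents are included in the 16-day totals.

```python
# Figures quoted in this notice.
DAYS_OBSERVED = 16
INCIDENTS = 15
TOTAL_DOWNTIME_MIN = 72
TODAY_INCIDENTS = 4
TODAY_DOWNTIME_MIN = 32

# Average unplanned downtime per day over the observation period.
avg_per_day = TOTAL_DOWNTIME_MIN / DAYS_OBSERVED

# Average length of a single interruption.
avg_per_incident = TOTAL_DOWNTIME_MIN / INCIDENTS

# Share of the 16-day downtime that occurred today
# (assumes today falls inside the 16-day window).
today_share = TODAY_DOWNTIME_MIN / TOTAL_DOWNTIME_MIN

print(f"Average downtime per day: {avg_per_day:.1f} min")            # 4.5 min
print(f"Average downtime per incident: {avg_per_incident:.1f} min")  # 4.8 min
print(f"Share of downtime occurring today: {today_share:.0%}")       # 44%
```

The 4.5 minutes per day is an average over the whole period. Today's 32 minutes is well above it, which is why we are treating the trend as worsening rather than steady.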
POST-MAINTENANCE:
We will monitor network performance continuously for 48 hours following the change. If the outage pattern continues despite this optimization, we'll immediately escalate to additional diagnostic measures and corrective actions.
You'll receive a completion notice with initial results shortly after the maintenance window closes.
We sincerely apologize for the short notice and any inconvenience. Your service reliability is our highest priority, and we're taking systematic, evidence-based action to restore the stability you expect from us.
Posted Feb 18, 2026 - 15:16 SAST