Loss of Network Connectivity from a Single Host

Incident Report for CloudAfrica

Postmortem

Network Stack Failure Affecting Virtual Machine Connectivity from a Single Host

Incident Status: Resolved
Duration: December 1, 2025, 13:15 - 15:40 SAST (UTC+2)
Date: Monday, December 1st, 2025

Impact

Customer virtual machines hosted on a single compute node (pve-127-dc5-r08-vox-teraco-jb1) at our Teraco JB1 East Data Centre facility in Isando, Johannesburg, were impacted by a complete network stack failure.

The incident affected less than 1% of all virtual machines across our platform.

All VMs on this host experienced complete loss of network connectivity during this outage.

Timeline

Dec 1, 13:15 SAST - The host's Intel E810 4x25Gbps network interface controller (NIC) began experiencing hardware monitoring read failures. The RDMA subsystem reported an HMC (Host Memory Cache) error and initiated a reset request.

Dec 1, 13:15 SAST - IOMMU/DMAR faults occurred as the NIC attempted DMA operations with invalid page table entries. The network bond interface failed as the slave interface (ens1f3np3) went offline, resulting in "bond0: now running without any active interface" and complete loss of VM network connectivity.

Dec 1, 13:16 SAST - The ice network driver attempted automatic recovery but failed to rebuild the VSI (Virtual Station Interface), generating the error "Rebuild failed, unload and reload driver." Continuous hardware monitoring failures indicated the NIC was in an unrecoverable state.

Dec 1, 15:40 SAST - After repeated attempts to restart networking services and multiple reboot cycles, the host was stabilized and network services were fully restored.

Dec 1, 15:50 SAST - All virtual machines on the affected host successfully restarted and were fully operational with network connectivity restored.

Root Cause

The host experienced a catastrophic failure of its Intel E810 4x25Gbps network interface controller (NIC), which runs the ice kernel driver. Analysis of system logs indicates the following failure sequence:

  1. Hardware monitoring subsystem failures - Repeated HW read failures (error code -5, -EIO), indicating the NIC firmware or hardware had entered an unstable state
  2. RDMA stack collapse - The RDMA driver (irdma) detected an HMC error and requested a reset
  3. IOMMU/DMAR violations - DMA read operations failed due to invalid page table entries, suggesting corrupted DMA address mappings in the IOMMU
  4. Driver recovery failure - The ice driver's automatic recovery mechanism failed to rebuild the network VSI, leaving the interface in an unrecoverable state

This failure pattern is consistent with known instability in Intel E810 NICs operating with RDMA enabled, in particular firmware/driver interaction issues that can leave the network stack in an unrecoverable state requiring a full driver reload or a system reboot.
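To make this failure mode easier to catch early, signatures like those above can be matched directly against the kernel log. The following is a minimal Python sketch, not our production tooling, assuming systemd's journalctl is available on the host; the patterns are illustrative, based on the messages quoted in this report, and exact log text varies by driver and firmware version.

    import re
    import subprocess

    # Illustrative signatures based on the messages quoted in this report;
    # exact kernel log text varies by driver and firmware version.
    SIGNATURES = [
        re.compile(r"HW read failure"),                            # NIC hardware monitoring reads (-5/-EIO)
        re.compile(r"irdma.*HMC"),                                 # RDMA stack HMC error / reset request
        re.compile(r"DMAR:.*fault"),                               # IOMMU/DMAR violation
        re.compile(r"Rebuild failed, unload and reload driver"),   # ice VSI rebuild failure
    ]

    def scan_kernel_log(since="-15min"):
        """Return recent kernel log lines matching any failure signature."""
        out = subprocess.run(
            ["journalctl", "-k", "--since", since, "--no-pager"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [line for line in out.splitlines()
                if any(sig.search(line) for sig in SIGNATURES)]

    if __name__ == "__main__":
        hits = scan_kernel_log()
        if hits:
            print(f"ALERT: {len(hits)} ice/irdma/IOMMU failure signature(s) in the last 15 minutes")
            for line in hits:
                print("  " + line)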

Resolution

The immediate issue was resolved by performing a controlled reboot of the affected host server at 15:40 SAST. All virtual machines were restored to full operation by 15:50 SAST.

Emergency Preventive Maintenance was performed from 23:15 SAST on December 2nd, 2025 to 00:05 SAST on December 3rd, during which:

  • Network adapter firmware was updated to the latest stable version for the Intel E810 25Gbps NICs
  • RDMA functionality was disabled on the host to eliminate the known instability vector
  • Network bond configuration was verified and tested (a verification sketch follows this list)
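The bond verification step can be scripted against the Linux bonding driver's status file. Below is a minimal Python sketch, assuming an active-backup bond exposed at /proc/net/bonding/bond0; the field names follow the kernel bonding driver's output format, and our actual verification checklist is broader.

    from pathlib import Path

    def bond_status(bond="bond0"):
        """Parse key fields from the kernel bonding driver's status file."""
        text = Path(f"/proc/net/bonding/{bond}").read_text()
        status = {}
        for line in text.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                # keep the first occurrence: bond-level fields precede per-slave ones
                status.setdefault(key.strip(), value.strip())
        return status

    if __name__ == "__main__":
        s = bond_status()
        mii = s.get("MII Status", "unknown")
        active = s.get("Currently Active Slave", "None")
        print(f"bond0: MII status={mii}, active slave={active}")
        if mii != "up" or active in ("None", ""):
            raise SystemExit("FAIL: bond0 is running without an active interface")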

Since the emergency maintenance, the host has operated without any network-related issues.

Follow-up Actions

We sincerely apologise to our customers for this network outage and the service disruption it caused.

Immediate Actions Completed:

  • Emergency firmware updates deployed to the affected host's network adapters
  • RDMA disabled to eliminate the primary failure vector
  • Enhanced monitoring alerts configured for ice driver errors and IOMMU faults

Ongoing Actions:

  • Conducting a comprehensive audit of all Intel E810 NICs across our infrastructure to identify hosts requiring firmware updates
  • Evaluating RDMA configuration across our fleet and making risk-based decisions about where RDMA functionality is critical versus optional
  • Implementing proactive alerting on ice driver hardware monitoring (HW read) failures as an early warning indicator
  • Developing automated recovery procedures for network stack failures to reduce mean time to recovery (a sketch of the driver-reload path follows this list)
  • Scheduling preventive maintenance windows for firmware updates across remaining hosts with Intel E810 NICs
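For the driver-reload path implied by the "Rebuild failed, unload and reload driver" error, a simplified Python sketch of the recovery sequence is shown below. The commands and ordering are illustrative, assuming a Debian-style networking service on the host; a production procedure would add health checks, timeouts, and escalation to a controlled reboot.

    import subprocess
    import time

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def reload_ice_driver():
        """Unload and reload the ice module, then reapply network configuration."""
        run(["modprobe", "-r", "irdma"])   # remove the RDMA driver first; it depends on ice
        run(["modprobe", "-r", "ice"])     # unload the NIC driver
        time.sleep(2)                      # brief settle time before reload
        run(["modprobe", "ice"])           # reload the driver
        run(["systemctl", "restart", "networking"])  # re-create bond0 and bring interfaces up

    if __name__ == "__main__":
        try:
            reload_ice_driver()
        except subprocess.CalledProcessError as exc:
            raise SystemExit(f"Driver reload failed ({exc}); escalating to controlled reboot")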

Long-term Improvements:

  • Establishing a regular firmware update cadence for critical network infrastructure components
  • Implementing automated failover testing for bonded network configurations
  • Enhancing our change management processes to ensure firmware updates are deployed proactively rather than reactively

We remain committed to maintaining the highest levels of service availability and reliability for our infrastructure platform.

Once again, we apologise to affected customers, and thank you for your continued support.

The CloudAfrica Team.

Posted Dec 03, 2025 - 16:35 SAST

Investigating

We have experienced a loss of network connectivity from a single host on the CloudAfrica platform. The host in question is located within our racks at Teraco Isando JB1.
Posted Dec 01, 2025 - 13:30 SAST