Incident Status: Resolved
Duration: December 1, 2025, 13:15 - 15:40 SAST (UTC+2)
Date: Monday, December 1st, 2025
Customer virtual machines hosted on a single compute node (pve-127-dc5-r08-vox-teraco-jb1) at our Teraco JB1 East Data Centre facility in Isando, Johannesburg were impacted by a complete network stack failure.
The incident affected less than 1% of all virtual machines across our platform.
All VMs on this host experienced complete loss of network connectivity during this outage.
Dec 1, 13:15 SAST - The host's Intel E810 4x25Gbps network interface controller (NIC) began experiencing hardware monitoring read failures. The RDMA subsystem reported HMC (Host Memory Cache) errors and initiated a reset request.
Dec 1, 13:15 SAST - IOMMU/DMAR faults occurred as the NIC attempted DMA operations with invalid page table entries. The slave interface (ens1f3np3) went offline and the network bond failed, logging "bond0: now running without any active interface" and leaving all VMs on the host without network connectivity.
Dec 1, 13:16 SAST - The ice network driver attempted automatic recovery but failed to rebuild the VSI (Virtual Station Interface), generating the error "Rebuild failed, unload and reload driver." Continuous hardware monitoring failures indicated the NIC was in an unrecoverable state.
Dec 1, 15:40 SAST - After repeated attempts to restart networking services and multiple reboot cycles, the host was successfully stabilized and network services were fully restored.
Dec 1, 15:50 SAST - All virtual machines on the affected host successfully restarted and were fully operational with network connectivity restored.
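The two kernel messages quoted in the timeline above are distinctive enough to search for in logs. The following is a minimal sketch of such a scan, not a description of the tooling used during the incident. The label names and the function `find_signatures` are hypothetical, and the sample input is built only from the quoted messages rather than from real log output.

```python
from typing import Dict, Iterable, List

# Substrings quoted in the timeline above. The labels are hypothetical
# names chosen for this sketch.
SIGNATURES: Dict[str, str] = {
    "bond_no_active_interface": "now running without any active interface",
    "ice_rebuild_failed": "Rebuild failed, unload and reload driver",
}


def find_signatures(lines: Iterable[str]) -> List[str]:
    """Return the labels of all signatures seen, in first-seen order."""
    seen: List[str] = []
    for line in lines:
        for label, needle in SIGNATURES.items():
            if needle in line and label not in seen:
                seen.append(label)
    return seen


if __name__ == "__main__":
    # Illustrative input assembled from the quoted messages only,
    # not real log output from the affected host.
    sample = [
        "bond0: now running without any active interface",
        "Rebuild failed, unload and reload driver",
    ]
    print(find_signatures(sample))
```

Plain substring matching is used here so that the sketch does not depend on any particular log prefix or timestamp format.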
The host experienced a catastrophic failure of its Intel E810 4x25Gbps network interface controller (NIC) running the ice kernel driver. Analysis of system logs indicates the failure sequence began with:
This failure pattern is consistent with known instability in Intel E810 NICs operating with RDMA enabled, particularly firmware/driver interactions that can leave the network stack in an unrecoverable state requiring a full driver reload or system reboot.
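At the bond layer, the state described above appears as a bond with no usable slave interface. As a rough sketch, the Linux bonding driver's status file (typically `/proc/net/bonding/bond0`) can be parsed for each slave's link state. The function names and the `SAMPLE` text below are assumptions for illustration. The sample is an abbreviated stand-in for that file's format, not output captured from the affected host, and treating "MII Status: up" as "usable" is a simplification.

```python
from typing import Dict, Optional

# Abbreviated, illustrative stand-in for /proc/net/bonding/bond0.
# Not captured from the affected host.
SAMPLE = """\
MII Status: down

Slave Interface: ens1f3np3
MII Status: down
"""


def parse_bond_slaves(text: str) -> Dict[str, str]:
    """Map each slave interface name to its MII status string."""
    slaves: Dict[str, str] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and current is not None:
            # Keep only the first MII Status line after a slave header;
            # the bond-level status line before any slave is skipped.
            slaves.setdefault(current, line.split(":", 1)[1].strip())
    return slaves


def has_active_slave(text: str) -> bool:
    """True if at least one slave reports MII status 'up' (a simplification)."""
    return any(status == "up" for status in parse_bond_slaves(text).values())


if __name__ == "__main__":
    print(parse_bond_slaves(SAMPLE))
    print(has_active_slave(SAMPLE))
```

In practice the text would be read from the status file on the host. The sketch takes a string so it can be exercised without access to a bonded interface.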
The immediate issue was ultimately resolved by a controlled reboot of the affected host at 15:40 SAST. All virtual machines were restored to full operation by 15:50 SAST.
Emergency Preventive Maintenance was performed on December 2nd, 2025, from 23:15 to 00:05 SAST, during which:
Since the emergency maintenance, the host has operated without any network-related issues.
We sincerely apologize to our customers for this network outage and the service disruption it caused.
Immediate Actions Completed:
Ongoing Actions:
Long-term Improvements:
We remain committed to maintaining the highest levels of service availability and reliability for our infrastructure platform.
Once again, we apologize to affected customers and thank you for your continued support.
The CloudAfrica Team.