Key Takeaways
- Training almost always needs a lossless fabric, while single-request inference often runs fine on a well-provisioned lossy network. Large-model inference that shards across GPUs is the exception and needs the same lossless treatment as training.
- For most enterprises, RoCEv2 on Ethernet is the mature choice today. Ultra Ethernet is the forward path, arriving in 2025 and beyond on platforms with committed upgrade paths.
- Cost-per-port planning misses the real budget impact: optics, cabling, and host adapters often exceed switch port costs, especially at 800G speeds.
- A dedicated AI fabric typically pays for itself once you cross 16 to 32 GPUs doing distributed training, providing performance and operational isolation from production traffic.
- Modern lossless Ethernet with RoCEv2 performs within a few points of InfiniBand on real AI benchmarks while avoiding vendor lock-in and reusing existing operational skills.
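To make the cost-per-port takeaway concrete, here is a minimal Python sketch of a per-link budget breakdown. The `LinkCosts` class, the `per_link_breakdown` function, and every price in the usage example are hypothetical placeholders, not vendor figures. The sketch assumes a host-to-switch link that needs one switch port, one host adapter, one cable, and an optic at each end; substitute real quotes before drawing any conclusion.

```python
from dataclasses import dataclass


@dataclass
class LinkCosts:
    """Per-link cost inputs. All values are placeholders; use real quotes."""
    switch_port: float      # switch cost divided by usable ports
    optics_per_end: float   # one transceiver; set to 0 for DAC cabling
    cable: float            # fiber or copper run for this link
    host_adapter: float     # NIC cost attributed to this link


def per_link_breakdown(c: LinkCosts) -> dict:
    """Split one host-to-switch link into switch and non-switch cost."""
    # Assumption: an optic is needed at both ends of the link.
    non_switch = 2 * c.optics_per_end + c.cable + c.host_adapter
    total = c.switch_port + non_switch
    return {
        "switch_port": c.switch_port,
        "non_switch": non_switch,
        "total": total,
        "non_switch_share": non_switch / total,
    }


if __name__ == "__main__":
    # Hypothetical numbers chosen only to exercise the function.
    example = LinkCosts(switch_port=1000, optics_per_end=800,
                        cable=100, host_adapter=1500)
    for key, value in per_link_breakdown(example).items():
        print(f"{key}: {value:.2f}")
```

Multiplying the `total` by the number of links gives a fabric-level estimate, and `non_switch_share` shows how much of the budget a switch-only cost-per-port figure would leave out.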
Operational considerations
Beyond the technical design, enterprise GPU fabrics require clear operational boundaries and ongoing visibility into performance that traditional monitoring does not provide.