7 Things That Quietly Kill Performance on an AI or NVMe Fabric
Lossless Ethernet sounds simple. Don't drop packets. Feed the GPUs. Feed the storage. It isn't simple. It's precision engineering at the physical, link, and transport layers, and the failure modes are rarely loud.
A poorly designed RoCEv2 fabric doesn't fall over. It quietly underperforms: training jobs take 20% longer than they should, storage latency spikes for reasons nobody can trace, and GPU utilization sits below 70%. Here are the seven failure modes we see most often on AI and NVMe-oF fabrics, and the design decisions that keep them out of your environment.
1 - You built a lossy fabric and told it to be lossless
RoCEv2 is the dominant RDMA transport for AI east-west traffic and high-performance NVMe-oF storage, and it's acutely sensitive to packet loss. A single dropped packet triggers retransmissions that tank throughput for the affected flow. Actual lossless behavior requires meticulous DCB configuration: PFC to pause only the RoCE class during congestion, ETS for dedicated bandwidth allocation, and ECN to signal congestion before drops occur. Get any of these partially right and you have the operational overhead of a lossless fabric with the performance of a lossy one.
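To make "partially right" concrete, here is a minimal sketch of a consistency check over a DCB intent table. The TrafficClass shape, the check_dcb function, and the class name "roce" are all hypothetical, not a vendor schema, and the model is simplified (it assumes every class carries an ETS percentage and that only the RoCE class should be lossless).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficClass:
    """One row of a hypothetical DCB intent table (not a vendor schema)."""
    name: str
    priority: int        # 802.1p priority, 0-7
    pfc_enabled: bool    # PFC pause allowed for this priority
    ets_percent: int     # ETS bandwidth share
    ecn_enabled: bool    # ECN marking enabled on the queue


def check_dcb(classes, roce_name="roce"):
    """Return a list of problems; an empty list means the intent is consistent."""
    problems = []

    total = sum(c.ets_percent for c in classes)
    if total != 100:
        problems.append(f"ETS shares sum to {total}%, expected 100%")

    priorities = [c.priority for c in classes]
    if len(set(priorities)) != len(priorities):
        problems.append("two or more classes share an 802.1p priority")

    if not any(c.name == roce_name for c in classes):
        problems.append(f"no class named {roce_name!r}")

    for c in classes:
        if c.name == roce_name:
            if not c.pfc_enabled:
                problems.append(f"{c.name}: PFC is off, so the class is not lossless")
            if not c.ecn_enabled:
                problems.append(f"{c.name}: ECN is off, so congestion is only visible as pauses")
        elif c.pfc_enabled:
            problems.append(f"{c.name}: PFC is on for a class that should stay lossy")
    return problems


if __name__ == "__main__":
    intent = [
        TrafficClass("roce", 3, True, 50, False),     # ECN forgotten
        TrafficClass("default", 0, True, 40, False),  # PFC leaked, ETS short
    ]
    for problem in check_dcb(intent):
        print(problem)
```

Running a check like this against the intended configuration of every switch and NIC is cheap, and it catches the "two out of three" configurations before they reach production.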
2 - You under-bought on switch buffers
AI and storage traffic is bursty, and synchronized bursts produce incast: many flows converging on a single egress port faster than the port can drain them. Shallow-buffer switches drop packets in this scenario long before the "congestion" has any meaningful duration. Deep-buffer platforms (Arista's R-series, specifically the 7800R3 and 7700R3) are purpose-built for exactly this pattern. This isn't a marginal optimization; it's the difference between a fabric that works and one that doesn't.
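The incast arithmetic is worth doing once by hand. The sketch below is a back-of-the-envelope model, not a sizing tool: it assumes every sender starts its burst at the same instant, transmits at full line rate, and that nothing (PFC, ECN, pacing) slows the senders down. The function name and the example numbers are illustrative.

```python
def peak_incast_queue_bytes(senders, burst_bytes, ingress_gbps, egress_gbps):
    """Worst-case egress queue depth for a synchronized incast burst.

    Each sender transmits burst_bytes at ingress_gbps. For the duration of
    the burst, data arrives at senders * ingress_gbps and leaves at
    egress_gbps; whatever cannot leave has to sit in the buffer.
    """
    arrival_gbps = senders * ingress_gbps
    if arrival_gbps <= egress_gbps:
        return 0.0  # the egress port keeps up, so no queue builds
    # burst duration * excess arrival rate, with the unit conversions cancelled out
    return burst_bytes * (arrival_gbps - egress_gbps) / ingress_gbps


if __name__ == "__main__":
    # Illustrative numbers: 32 senders, 256 KiB each, 400G in and out.
    depth = peak_incast_queue_bytes(32, 256 * 1024, 400, 400)
    print(f"{depth / 2**20:.2f} MiB of buffer needed for one egress port")
```

With equal ingress and egress speeds the result collapses to (senders - 1) x burst size, which is why fan-in matters as much as link speed. Compare the figure against the buffer actually available to a single port and queue on the platform you are considering.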
3 - QoS is the one knob everyone assumes someone else tuned
Not all traffic on an AI cluster is RoCEv2. Management, general IP, storage, and tenant traffic share the same physical infrastructure. Without an explicit QoS policy that maps RoCEv2 to a dedicated lossless strict-priority queue and maps every other class appropriately, your critical flows get head-of-line blocked by something unimportant. Priority queue assignment has to be consistent end to end; one device with the wrong mapping can nullify the design.
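"Consistent end to end" is something you can test rather than assume. A minimal sketch, assuming you can export each device's DSCP-to-traffic-class map into a dictionary: the device names, the DSCP values (26 for RoCE, 48 for congestion notifications), and the traffic-class numbers below are illustrative placeholders, not a recommendation.

```python
def find_mapping_mismatches(device_maps, reference):
    """Compare each device's DSCP-to-traffic-class map against a reference.

    device_maps: {device_name: {dscp: traffic_class}}
    reference:   {dscp: traffic_class} that every device is expected to honor
    Returns {device_name: {dscp: (actual, expected)}} for deviations only.
    A DSCP missing from a device's map is reported with actual=None.
    """
    mismatches = {}
    for device, dscp_map in device_maps.items():
        diffs = {
            dscp: (dscp_map.get(dscp), expected)
            for dscp, expected in reference.items()
            if dscp_map.get(dscp) != expected
        }
        if diffs:
            mismatches[device] = diffs
    return mismatches


if __name__ == "__main__":
    reference = {26: 3, 48: 6, 0: 0}          # hypothetical intent
    device_maps = {
        "leaf1": {26: 3, 48: 6, 0: 0},
        "leaf2": {26: 0, 48: 6, 0: 0},        # RoCE lands in the default class
        "spine1": {26: 3, 0: 0},              # CNP mapping missing
    }
    for device, diffs in find_mapping_mismatches(device_maps, reference).items():
        print(device, diffs)
```

The same comparison applies to the NICs, which are the devices most often left out of the audit.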
4 - You're operating the fabric without enough telemetry to verify it
Running a lossless fabric blind is a failure mode of its own. You need continuous visibility into PFC pause frames (per port, per priority), ECN markings, buffer utilization, queue depths, and RDMA counters from the NICs. Without that data in real time, you can't verify PFC is protecting the right traffic class, or that a slow drain on one port isn't silently affecting cluster-wide performance. Streaming telemetry from Arista EOS into CloudVision, correlated with NIC counters, is the default starting point.
Operating a lossless fabric without detailed telemetry is a more common failure than misconfigured PFC. The design is usually fine on paper; nobody's watching the counters that would tell them it's misbehaving in production.
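Whatever collects the counters, the check itself is simple. The sketch below assumes two snapshots of cumulative PFC pause-frame counters keyed by (port, priority); that data shape, the function names, and the thresholds are hypothetical and do not represent the EOS or CloudVision API.

```python
def pause_rates(prev, curr, interval_s):
    """Per-second PFC pause-frame rate between two cumulative counter snapshots.

    prev, curr: {(port, priority): cumulative_pause_frames}
    Keys without a previous sample, or whose counter went backwards
    (for example after a reset), are skipped rather than guessed at.
    """
    rates = {}
    for key, now in curr.items():
        before = prev.get(key)
        if before is None or now < before:
            continue
        rates[key] = (now - before) / interval_s
    return rates


def flag_unexpected(rates, lossless_priority, max_rate):
    """Flag pauses on any lossy priority, and excessive pauses on the lossless one."""
    flagged = {}
    for (port, priority), rate in rates.items():
        if priority != lossless_priority and rate > 0:
            flagged[(port, priority)] = rate      # PFC is active where it should not be
        elif priority == lossless_priority and rate > max_rate:
            flagged[(port, priority)] = rate      # likely a slow drain or a hot spot
    return flagged


if __name__ == "__main__":
    prev = {("Ethernet1/1", 3): 1000, ("Ethernet1/1", 0): 0, ("Ethernet2/1", 3): 50}
    curr = {("Ethernet1/1", 3): 7000, ("Ethernet1/1", 0): 20, ("Ethernet2/1", 3): 60}
    rates = pause_rates(prev, curr, interval_s=10)
    print(flag_unexpected(rates, lossless_priority=3, max_rate=100))
```

A steady pause rate on one port is the signature of a slow drain; pauses on a priority that was never meant to be lossless point to a classification or PFC configuration error.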
5 - You're mapping too many traffic classes to one priority
PFC's dirty secret is head-of-line blocking within a paused priority. When PFC pauses a queue, everything in that queue stops, not just the congested flow. If you've crammed RoCE, critical storage, and important IP traffic into the same priority class, a pause event for one will block the others. The fix is deliberate classification, combined with ECN to reduce reliance on PFC where the transport supports it. RoCEv2 with DCQCN, its ECN-driven congestion control, is better behaved than RoCEv2 with PFC alone.
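For ECN to reduce reliance on PFC, marking has to begin well before the queue reaches the depth at which PFC would pause the link. Switches commonly implement this with a RED-style ramp between a minimum and maximum threshold. The sketch below illustrates that ramp under stated assumptions: the parameter names (kmin, kmax, pmax), the pfc_xoff_bytes threshold, and the example values are illustrative, not tuned recommendations or any platform's defaults.

```python
def ecn_mark_probability(queue_bytes, kmin, kmax, pmax):
    """RED-style ECN marking probability for a given queue depth.

    Below kmin nothing is marked, above kmax everything is marked, and in
    between the probability rises linearly from 0 to pmax.
    """
    if not 0 <= kmin < kmax:
        raise ValueError("need 0 <= kmin < kmax")
    if not 0.0 <= pmax <= 1.0:
        raise ValueError("pmax must be a probability")
    if queue_bytes <= kmin:
        return 0.0
    if queue_bytes > kmax:
        return 1.0
    return pmax * (queue_bytes - kmin) / (kmax - kmin)


def ecn_leads_pfc(kmax, pfc_xoff_bytes):
    """True if full ECN marking kicks in before the PFC pause threshold is reached."""
    return kmax < pfc_xoff_bytes


if __name__ == "__main__":
    kmin, kmax, pmax = 100_000, 400_000, 0.2   # hypothetical thresholds
    for depth in (50_000, 250_000, 500_000):
        print(depth, ecn_mark_probability(depth, kmin, kmax, pmax))
    print("ECN acts before PFC:", ecn_leads_pfc(kmax, pfc_xoff_bytes=600_000))
```

If the ordering check fails, senders never get a chance to slow down before the pause fires, and the fabric behaves like PFC-only regardless of what the ECN configuration says.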
6 - Your physical layer is creating the problem you're trying to engineer out
This one hurts. You designed for lossless, configured DCB perfectly, tuned QoS, deployed deep-buffer switches, instrumented everything, and you're still seeing packet errors. Check the cabling and optics. High-quality optics, well-terminated fiber, and correct link speed negotiation matter more than expected. Physical-layer errors produce corruption and drops that DCB cannot compensate for. The number of AI fabric "mysteries" that trace back to a $40 optic is higher than most operators would admit.
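A quick calculation shows why a marginal link matters so much at these speeds. This is a rough sketch that assumes independent bit errors at a fixed bit error rate after any FEC, ignores preamble and inter-frame gap, and treats any bit error as a lost frame. The function names and the example figures are illustrative.

```python
import math


def frame_error_probability(ber, frame_bytes):
    """Probability that a frame contains at least one bit error.

    Computes 1 - (1 - ber) ** bits using log1p/expm1 so that very small
    error rates do not vanish in floating-point rounding.
    """
    bits = frame_bytes * 8
    return -math.expm1(bits * math.log1p(-ber))


def corrupted_frames_per_second(ber, frame_bytes, line_rate_gbps, utilization=1.0):
    """Expected corrupted frames per second on a link at the given utilization."""
    frames_per_second = line_rate_gbps * 1e9 * utilization / (frame_bytes * 8)
    return frames_per_second * frame_error_probability(ber, frame_bytes)


if __name__ == "__main__":
    for ber in (1e-15, 1e-12, 1e-9):
        rate = corrupted_frames_per_second(ber, frame_bytes=4096, line_rate_gbps=400)
        print(f"BER {ber:g}: about {rate * 3600:.1f} corrupted frames per hour")
```

Under these assumptions, a fully loaded 400G link at a residual error rate of 1e-12 corrupts roughly 0.4 frames per second, and every one of those is a drop that no amount of PFC or ECN tuning can prevent.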
7 - You picked InfiniBand when Ethernet would have won the decision
Here's the one that will annoy some readers. For enterprises (as distinct from hyperscalers), the economics and operational profile of InfiniBand are hard to justify in 2025. InfiniBand creates vendor lock-in across switches, NICs, cables, and management software. Your team has to learn a separate operational stack, and your existing monitoring, automation, and security tooling doesn't apply.
Modern Ethernet with RoCEv2, ECN, deep buffers, and proper DCB delivers latency competitive with NVMe/FC, and increasingly close to InfiniBand for most AI workloads. It runs on open standards. It uses the tools your team already knows. Arista's AI-optimized platforms (7060X6 leaf, 7800R3 spine, 7700R3 DES) are purpose-built for this use case, and the Ultra Ethernet Consortium roadmap is closing remaining gaps.
InfiniBand is impressive engineering and a bad business decision for most enterprises. The performance gap over a well-engineered RoCEv2 fabric doesn't justify the lock-in, the operational silo, or the cost, outside of narrow workloads where every microsecond matters.
How We Build Lossless Fabrics
We build lossless Ethernet fabrics for AI clusters and NVMe-oF storage using a consistent pattern: Arista hardware sized to the workload, deep buffers where incast is expected, RoCEv2 with ECN-aware congestion control, deliberate QoS mapping, and continuous observability via CloudVision. Post-deployment, we stay in the loop through Aegis co-managed services: fabric health, workload performance, proactive tuning.
FAQ
Is RoCEv2 actually comparable to InfiniBand for AI training workloads?
For most enterprise AI workloads, yes. With deep buffers, ECN-aware congestion control, and the UEC roadmap closing remaining differences, the decision usually comes down to operational economics rather than raw performance. Hyperscalers running tightly coupled training at extreme scale are a different conversation.
How do we size switch buffers for our specific workload?
Buffer sizing is a function of oversubscription ratio, incast fan-in, traffic class mix, and burst profile. We run traffic characterization against production telemetry or a sizing workload before committing to a platform. Under-buffering causes drops; over-buffering costs money.
Can we retrofit a lossless configuration onto an existing Arista fabric?
Often yes. If the existing hardware has the buffer depth for a lossless traffic class, the retrofit is primarily a DCB, QoS, and telemetry project. Shallow-buffer designs are a harder story; the performance ceiling will be constrained by hardware.
What telemetry do we need to trust a lossless design?
At minimum: PFC pause frames per port per priority, ECN markings, queue depth distributions, buffer utilization, and RDMA counters from the NICs. Streaming telemetry from Arista EOS into CloudVision is the fastest path. Without this, you can't distinguish a healthy fabric from one that happens not to be failing loudly yet.
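Queue depth is only useful as a distribution, because averages hide the bursts that cause the damage. A minimal sketch of the summary we mean, assuming you already have a list of sampled queue depths in bytes for one port and queue. The function name and the sample values are hypothetical, and the percentile method is one reasonable choice among several.

```python
import statistics


def summarize_queue_depth(samples_bytes):
    """Median, 99th percentile, and maximum of sampled queue depths."""
    if len(samples_bytes) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(samples_bytes, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98], "max": max(samples_bytes)}


if __name__ == "__main__":
    # Mostly idle queue with a handful of deep bursts (made-up values).
    samples = [2_000] * 95 + [900_000, 1_200_000, 1_500_000, 2_000_000, 4_000_000]
    print(summarize_queue_depth(samples))
```

A large gap between the median and the 99th percentile is the typical shape of an incast-heavy port, and the maximum only means something if the sampling interval is short enough to catch microbursts.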