Designing a RoCEv2 Backend Fabric for a 4,000-GPU B200 Cluster

A practical redesign of a 4,000-GPU B200 cluster using RoCEv2 over Ethernet. This companion dives into lossless fabric engineering, BGP underlay, congestion control, and real-world trade-offs versus InfiniBand at hyperscale.

This post is a companion to Designing the Network Fabric for a 4,000-GPU B200 Cluster: A Complete Walkthrough, which covers the same 4,000-GPU B200 cluster with an InfiniBand backend fabric using NVIDIA QM9700 switches. That design uses a 3-tier rail-optimized fat-tree with Subnet Manager routing, SHARP in-network computing, and hardware adaptive routing. This post redesigns the backend from the ground up using RoCEv2 over standard Ethernet. Same GPU count, same rail-optimized topology, fundamentally different fabric engineering.

Most hyperscaler GPU clusters run InfiniBand for the backend training fabric. It's the default for good reasons: a mature RDMA stack, hardware adaptive routing, in-network computing via SHARP, and a proven ecosystem from NVIDIA. But InfiniBand comes with trade-offs. A proprietary control plane (Subnet Manager), vendor lock-in to NVIDIA networking, unfamiliar operational tooling for teams raised on Ethernet, and premium switch pricing.

RoCEv2 (RDMA over Converged Ethernet v2) offers an alternative. It delivers the same RDMA semantics (zero-copy, kernel-bypass, microsecond-latency transfers) over standard Ethernet and IP infrastructure. The same ConnectX-8 NICs that run InfiniBand can run RoCEv2 with a firmware mode change. The switches are standard Ethernet platforms from Cisco, Arista, Broadcom, or any vendor. The routing is BGP. The management is SNMP, gNMI, and Terraform. All the tools your network team already knows.

The catch? Lossless Ethernet is hard. InfiniBand was born lossless; Ethernet was not. Making Ethernet behave losslessly at 400 Gb/s across a 3-tier fat-tree with 4,000 endpoints requires precise configuration of Priority Flow Control (PFC), ECN marking, DCQCN congestion control, and PFC storm mitigation. Getting any one of these wrong can cascade into a fabric-wide deadlock. There is no SHARP for in-network reduction. There is no hardware adaptive routing; you get ECMP. And the tail latency profile is wider than InfiniBand's.
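To make the DCQCN behavior concrete, here is a minimal sketch of the sender-side (reaction point) rate control loop: on each CNP the rate is cut in proportion to the congestion estimate α, and between CNPs the sender decays α and recovers toward its target rate. The parameter values (gain `G`, additive-increase step) are illustrative assumptions, not vendor defaults; real NICs implement this in firmware.

```python
# Hypothetical DCQCN reaction-point sketch. G and RATE_AI are assumed
# illustrative values, not ConnectX firmware defaults.
G = 1 / 16          # gain for the alpha (congestion estimate) update
RATE_AI = 5.0       # additive-increase step in Gb/s (assumed)
LINE_RATE = 400.0   # 400GbE link

class DcqcnSender:
    def __init__(self, line_rate_gbps: float):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate for recovery
        self.alpha = 1.0           # estimate of congestion severity

    def on_cnp(self):
        """Receiver saw ECN-marked packets and sent a CNP: cut the rate."""
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - G) * self.alpha + G   # alpha rises toward 1

    def on_timer_no_cnp(self):
        """Timer expired with no CNP: decay alpha, recover toward target."""
        self.alpha = (1 - G) * self.alpha
        self.rt = min(self.rt + RATE_AI, LINE_RATE)  # additive increase
        self.rc = (self.rc + self.rt) / 2            # fast recovery phase

s = DcqcnSender(LINE_RATE)
s.on_cnp()
print(round(s.rc, 1))   # first CNP at alpha=1.0 halves the rate: 200.0
```

The key tuning interaction is between the switch's ECN marking thresholds and this loop: mark too late and PFC fires first (spreading backpressure hop by hop); mark too early and you leave bandwidth on the floor.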

This post walks through the complete design of a 4,000-GPU NVIDIA B200 cluster with a RoCEv2 Ethernet backend, a dedicated storage Ethernet fabric, a frontend management fabric, and an out-of-band network. It covers the 3-tier Clos topology, lossless Ethernet engineering, BGP routing, congestion control tuning, and the full physical infrastructure including power, cooling, rack layout, and storage placement. If you haven't read the InfiniBand version first, I'd recommend starting there. This post assumes familiarity with the baseline cluster architecture and focuses on what changes (and what doesn't) when you swap IB for RoCEv2.

The Starting Point: 500 Nodes, 8 GPUs Each

The B200 ships in the HGX form factor: 8 SXM GPUs per baseboard, connected internally via NVLink 5.0 and NVSwitch. Each GPU gets 192 GB of HBM3e and a dedicated ConnectX-8 400GbE NIC operating in RoCEv2 mode for the backend fabric. Each server also carries two 100GbE NICs for the frontend management network, two 100GbE NICs for the dedicated storage network, and a 1GbE BMC port for out-of-band management.

That gives us:

  • 4,000 GPUs / 8 per node = 500 compute nodes
  • 4,000 backend RoCEv2 ports (one 400GbE per GPU)
  • 1,000 frontend Ethernet ports (two 100GbE per node, management/scheduling)
  • 1,000 storage Ethernet ports (two 100GbE per node, dedicated storage I/O)
  • 500 OOB management ports (one 1GbE BMC per node)

The intra-node interconnect is NVSwitch over NVLink 5.0, delivering 1.8 TB/s bidirectional per GPU: 36× the bandwidth of a single 400GbE RoCEv2 link. This ratio reinforces that tensor parallelism stays intra-node on NVSwitch, while data parallelism and pipeline parallelism flow over the RoCEv2 fabric.
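The port counts and the bandwidth ratio above fall straight out of the node inventory; a short sanity check (using only numbers stated in this post):

```python
# Sanity-check the fabric port counts and the NVLink-to-RoCEv2 ratio.
gpus = 4000
gpus_per_node = 8
nodes = gpus // gpus_per_node          # 500 compute nodes

backend_ports = gpus                   # one 400GbE RoCEv2 port per GPU
frontend_ports = nodes * 2             # two 100GbE per node (management)
storage_ports = nodes * 2              # two 100GbE per node (storage I/O)
oob_ports = nodes                      # one 1GbE BMC per node

nvlink_bw_bytes = 1.8e12               # 1.8 TB/s bidirectional per GPU
roce_bw_bytes = 400e9 / 8              # 400 Gb/s -> 50 GB/s

print(nodes, backend_ports, frontend_ports, storage_ports, oob_ports)
print(nvlink_bw_bytes / roce_bw_bytes)  # 36.0
```

That 36:1 gap is why parallelism strategy maps onto the physical topology the way it does: the traffic that needs the most bandwidth per step (tensor parallelism) never leaves the NVSwitch domain.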

Four Fabrics, Four Purposes
