When you're building a GPU cluster at the 4,000-GPU scale, the network fabric stops being just another design decision — it becomes the defining constraint. Everything flows from it: rack placement, power distribution, cooling zones, cable tray routing, and ultimately whether your training jobs hit peak FLOPS or spend half their time blocked on collectives.
This post walks through the full fabric design for a 4,000-GPU NVIDIA B200 cluster — covering the InfiniBand backend (where training traffic lives), the Ethernet frontend (management and scheduling), the dedicated storage Ethernet fabric (where checkpoint and dataset I/O lives), the out-of-band management network (where you go when everything else is down), storage subsystem placement, and the physical infrastructure that keeps it all running. At this scale, the 2-tier fat-tree that worked for smaller clusters hits its ceiling, and a 3-tier fabric becomes the only path to full bisection bandwidth.
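To see why the 2-tier design runs out of room, a quick sizing sketch helps, assuming 64-port NDR leaf/spine switches (a QM9700-class radix; the specific switch model is an assumption here, not part of the design above):

```python
# Sketch: maximum full-bisection endpoints for 2-tier vs 3-tier fat-trees,
# assuming a uniform 64-port switch radix throughout the fabric.
def fat_tree_capacity(radix: int, tiers: int) -> int:
    """Max endpoints in a non-blocking fat-tree with the given tier count."""
    if tiers == 2:
        return radix ** 2 // 2   # each leaf spends half its ports on uplinks
    if tiers == 3:
        return radix ** 3 // 4   # classic 3-level folded Clos
    raise ValueError("unsupported tier count")

print(fat_tree_capacity(64, 2))  # 2048 endpoints -- below 4,000 GPU ports
print(fat_tree_capacity(64, 3))  # 65536 endpoints -- ample headroom
```

At radix 64, two tiers top out at 2,048 endpoints, well short of 4,000 GPU ports, while three tiers clear the requirement with room to grow.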
The Starting Point: 500 Nodes, 8 GPUs Each
The B200 ships in the HGX form factor: 8 SXM GPUs per baseboard, connected internally via NVLink 5.0 and NVSwitch. Each GPU gets 192 GB of HBM3e and a dedicated ConnectX-7 NDR 400Gb/s HCA for the backend fabric. Each server also carries two 100GbE NICs for the frontend management network, two 100GbE NICs for the dedicated storage network, and a 1GbE BMC port for out-of-band management.
That gives us:
- 4,000 GPUs / 8 per node = 500 compute nodes
- 4,000 backend IB ports (one per GPU)
- 1,000 frontend Ethernet ports (two per node, management/scheduling)
- 1,000 storage Ethernet ports (two per node, dedicated storage I/O)
- 500 OOB management ports (one BMC per node)
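This inventory is simple arithmetic, sketched below for sanity-checking (the dictionary keys are illustrative labels, not names used anywhere in the design):

```python
# Port inventory for the four fabrics, derived from 500 HGX B200 nodes.
nodes = 4000 // 8               # 8 GPUs per baseboard -> 500 compute nodes
ports = {
    "backend_ib":   nodes * 8,  # one NDR HCA per GPU
    "frontend_eth": nodes * 2,  # management / scheduling NICs
    "storage_eth":  nodes * 2,  # dedicated storage I/O NICs
    "oob_mgmt":     nodes * 1,  # one BMC port per node
}
print(nodes, ports)
```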
The intra-node interconnect is NVSwitch over NVLink 5.0, delivering 1.8 TB/s of bidirectional bandwidth per GPU, roughly 18× the 100 GB/s bidirectional throughput of a single NDR 400Gb/s IB link. That ratio reinforces the design principle that tensor parallelism should stay intra-node, on NVSwitch, while data parallelism and pipeline parallelism flow over the IB fabric.
Four Fabrics, Four Purposes
The backend InfiniBand fabric carries GPU-to-GPU RDMA traffic — AllReduce, AllGather, ReduceScatter — the collective operations that dominate distributed training. This traffic is latency-sensitive at the microsecond level, bandwidth-hungry, and must never contend with storage or management flows. It gets its own dedicated fabric with its own routing domain.
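To get a rough feel for why this fabric must stay uncontended, here is a bandwidth-only estimate for a ring AllReduce, using the standard 2(n−1)/n traffic factor (the 1 GB bucket size and 500-rank ring are illustrative assumptions; real collectives add latency and protocol overhead on top):

```python
# Sketch: lower-bound completion time for a ring AllReduce over the IB fabric.
def ring_allreduce_seconds(payload_gb: float, n_ranks: int, link_gbs: float) -> float:
    """Bandwidth-only estimate; ignores latency and protocol overhead."""
    traffic_gb = 2 * (n_ranks - 1) / n_ranks * payload_gb  # GB sent per rank
    return traffic_gb / link_gbs

# 1 GB gradient bucket across 500 ranks at NDR line rate (~50 GB/s per direction)
print(round(ring_allreduce_seconds(1.0, 500, 50.0), 4))  # ~0.04 s per bucket
```

Even this best case is tens of milliseconds per gigabyte; any storage or management flow stealing link bandwidth stretches it directly, which is why the backend gets a fabric to itself.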
The frontend Ethernet fabric carries management and orchestration traffic: job scheduling (Slurm), container image pulls, in-band SSH access, monitoring telemetry (Prometheus, Grafana), syslog, DNS, NTP, and LDAP/AD authentication. This is a routed IP network with VXLAN-EVPN overlay segmentation. It does not carry storage I/O.
The storage Ethernet fabric is a dedicated, physically separate network for all storage I/O: distributed filesystem traffic (Lustre, GPFS, Weka), model checkpoint writes, training dataset reads, and artifact staging. Separating storage from the frontend eliminates the risk of a checkpoint storm saturating the management plane — a failure mode that has killed clusters in production. Storage traffic is bursty, high-bandwidth, and unrelated to the lighter control-plane flows on the frontend.
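A back-of-the-envelope burst calculation shows the scale of checkpoint traffic (the 2 TB checkpoint size is a hypothetical figure; in practice the storage backend, not the fabric, is usually the bottleneck):

```python
# Sketch: checkpoint burst across the dedicated storage fabric.
nodes = 500
node_storage_gbs = 2 * 100 / 8            # 2x100GbE -> 25 GB/s per node
aggregate_gbs = nodes * node_storage_gbs  # 12,500 GB/s fabric-side ceiling

ckpt_tb = 2.0                             # hypothetical sharded checkpoint
print(ckpt_tb * 1000 / aggregate_gbs)     # seconds if perfectly parallel
```

The fabric-side ceiling is 12.5 TB/s, so a well-sharded checkpoint clears in well under a second of line-rate traffic. If that burst shared links with the management plane, it would flatten Slurm, monitoring, and DNS for its entire duration.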
The out-of-band (OOB) management network is the lifeline you use when the other three fabrics are unreachable. BMC/IPMI access, serial console, firmware updates, power cycling — all of this runs on a physically isolated 1GbE network that has zero dependency on the production fabrics.
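Sizing the OOB network is simple arithmetic; a sketch assuming commodity 48-port 1GbE access switches (the 48-port count is an assumption, not taken from the design above):

```python
# Sketch: OOB access-switch count for 500 BMC ports.
import math

bmc_ports = 500
switch_ports = 48                          # assumed 1GbE access-switch radix
print(math.ceil(bmc_ports / switch_ports)) # access switches needed, plus uplinks
```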
Mixing these traffic domains on shared infrastructure is a common shortcut that creates operational nightmares. Keeping them separate is non-negotiable at this scale.