Abstract
This paper presents a standards-based framework for designing data center networks that support artificial intelligence (AI) and machine learning (ML) workloads using Ethernet VPN (EVPN) and Virtual Extensible LAN (VXLAN) technologies. We outline the key network requirements of AI/ML clusters – including high throughput, low latency, and scalability – and show how an EVPN-VXLAN fabric can meet these demands with a standardized, interoperable approach. The paper provides an overview of EVPN-VXLAN mechanisms based on the latest IETF standards and RFCs, and details best practices for building a leaf-spine IP fabric that enables multi-tenancy, efficient east-west traffic handling, and lossless data delivery for distributed AI training. Architectural considerations such as routing design, congestion control (for RDMA over Converged Ethernet), and network automation are discussed. We conclude with recommendations and emerging trends for future-proof AI network designs leveraging EVPN-VXLAN.
Introduction
Artificial intelligence (AI) and machine learning (ML) workloads are becoming increasingly commonplace in modern data centers. These workloads – such as training large neural networks or running inference on massive datasets – impose stringent demands on the network infrastructure. Unlike traditional enterprise applications, AI/ML clusters generate extremely high east-west traffic (server-to-server) and require near line-rate throughput with minimal latency and packet loss. For example, a distributed training job across many GPU servers may exchange terabytes of data during an all-reduce operation, saturating links and stressing network buffers. The network must therefore provide maximum throughput, minimal latency, and minimal interference for AI traffic flows. Additionally, consistency and scale are crucial – clusters can range from a few nodes to thousands of nodes, and network architectures must scale out accordingly while isolating different teams or tenants running concurrent AI workloads.
To meet these challenges, cloud and data center networks are evolving beyond traditional Layer-2 VLAN or Layer-3 only designs. A popular modern approach is to build an IP fabric (often in a leaf-spine Clos topology) with an overlay that provides Layer-2 extension and tenant segmentation as needed. Ethernet VPN (EVPN) combined with VXLAN encapsulation has emerged as a leading standards-based solution for such scenarios. EVPN-VXLAN offers the flexibility of Layer-2 adjacency over a routed fabric, enabling large-scale virtual networks on top of the physical infrastructure. At the same time, it uses a control-plane (BGP EVPN) to disseminate network reachability information, avoiding the flooding and scaling issues of older overlay approaches. This paper explores how an EVPN-VXLAN fabric can be designed specifically to support AI/ML workloads, leveraging the latest standards (IETF RFCs) and industry best practices. We discuss the key technical components of EVPN-VXLAN, then outline design considerations – such as multitenancy, routing efficiency, congestion management for RDMA traffic, and high availability – for AI-centric networks. Finally, we present recommendations and highlight emerging trends influencing next-generation AI networking.
EVPN-VXLAN Overview and Standards
VXLAN was initially introduced as an overlay mechanism to stretch Layer-2 networks over a Layer-3 infrastructure, and is documented in RFC 7348 (2014). The original VXLAN approach relied on flood-and-learn behavior, using multicast or head-end replication to handle unknown MAC address discovery and broadcast traffic (as described in RFC 7348). While this allowed virtual Layer-2 networks (identified by a 24-bit VNI) to be created over IP networks, the lack of a control plane in early VXLAN meant limited scalability and potential efficiency issues. For example, the flood-and-learn VXLAN model required underlying network support for multicast and could lead to excessive broadcast traffic in large deployments.
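To make the encapsulation concrete, the following minimal sketch builds the 8-byte VXLAN header defined in RFC 7348 (an I-flag, reserved bits, and the 24-bit VNI) using Python's struct module; the VNI value is illustrative only.

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned VXLAN destination port (RFC 7348)

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header from RFC 7348.

    Layout: flags (8 bits, I-flag = 0x08), 24 reserved bits,
    24-bit VNI, 8 reserved bits.
    """
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags_and_reserved = 0x08 << 24   # I-flag set, reserved bits zero
    vni_and_reserved = vni << 8       # VNI occupies the upper 24 bits
    return struct.pack("!II", flags_and_reserved, vni_and_reserved)

# Example: header for VNI 1010 (an illustrative GPU-network segment)
print(vxlan_header(1010).hex())       # -> 080000000003f200
```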
To overcome these limitations, the networking industry adopted BGP EVPN as the control-plane for VXLAN overlays. EVPN was originally defined for MPLS-based VPNs (RFC 7432, 2015) as a way to carry Layer-2 and Layer-3 VPN information in BGP and support features like multipath and redundancy. Subsequently, RFC 8365 (2018) extended EVPN for network virtualization overlays, specifying how EVPN can operate with VXLAN and similar encapsulations in data center networks. In an EVPN-VXLAN fabric, each switch (typically a top-of-rack leaf) that participates in the overlay runs BGP and advertises tenant MAC addresses and IP prefixes using EVPN route updates. This eliminates the need for data-plane flooding for MAC learning — instead, endpoints (e.g., servers or VMs) are learned locally and distributed to all relevant fabric nodes through BGP EVPN. The result is a highly scalable solution where Layer-2 segments can be extended across the data center over a Layer-3 IP underlay, with efficient control of broadcast, unknown unicast, and multicast (BUM) traffic and optimized unicast forwarding.
Several IETF standards govern this EVPN-VXLAN framework. Key among them, RFC 7432 defines the foundational EVPN BGP routes and attributes (for MPLS networks originally), including route types for MAC/IP advertisement, Ethernet segment discovery, etc. RFC 8365 (“Ethernet VPN (EVPN) as a Network Virtualization Overlay”) defines how those EVPN route types are used with VXLAN and other encapsulations, and introduces enhancements for split-horizon filtering (to prevent loops on multi-homed links) and mass-withdraw (to efficiently withdraw routes on link failure) in an NVO3 context. Furthermore, EVPN supports integrated routing and bridging (IRB), which is critical for Layer-3 communications in the overlay: RFC 9135 (2021) specifies how EVPN is used for inter-subnet routing within the fabric, providing options for symmetric and asymmetric IRB models. In practice, modern data center deployments use symmetric IRB with EVPN-VXLAN – meaning each leaf acts as the default gateway for its local hosts and routes traffic to remote subnets via VXLAN, using EVPN route type-5 (IP prefix) or type-2 (MAC/IP) advertisements for cross-subnet forwarding. This ensures optimal East-West routing without hair-pinning traffic through a centralized router.
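The wire encoding of these routes is defined by BGP, but the information they carry can be summarized in a few fields. The sketch below models EVPN Type-2 (MAC/IP) and Type-5 (IP prefix) advertisements as simple Python data classes; the field names are simplified and the example values are hypothetical, not an actual BGP implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MacIpRoute:          # EVPN route type 2 (RFC 7432): MAC/IP advertisement
    rd: str                # route distinguisher of the advertising MAC-VRF
    esi: str               # Ethernet segment identifier (all-zero if single-homed)
    mac: str
    ip: Optional[str]      # optional host IP, used for ARP suppression and IRB
    l2_vni: int            # VXLAN VNI of the bridge domain (RFC 8365)
    l3_vni: Optional[int]  # L3 VNI for symmetric IRB (RFC 9135)

@dataclass
class IpPrefixRoute:       # EVPN route type 5: IP prefix advertisement
    rd: str
    prefix: str            # e.g. "10.1.2.0/24"
    l3_vni: int            # tenant IP-VRF the prefix belongs to
    gateway_ip: str        # VTEP/next hop behind which the prefix lives

# A host learned locally on a leaf and advertised to the rest of the fabric:
host = MacIpRoute(rd="10.0.0.11:10", esi="00:00:00:00:00:00:00:00:00:00",
                  mac="aa:bb:cc:dd:ee:01", ip="10.1.1.21",
                  l2_vni=1010, l3_vni=50001)
```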
A standard EVPN-VXLAN based network fabric thus consists of an IP underlay (often running eBGP/iBGP or an IGP for basic IP reachability) and an overlay where BGP EVPN sessions exchange tenant network information. This approach is vendor-neutral and standardized, with multi-vendor interoperability widely demonstrated. By adhering to these RFC standards (and relevant IEEE standards for data center bridging, discussed later), network architects can build fabrics that are future-proof and not locked into proprietary protocols.
Network Requirements for AI/ML Workloads
AI/ML workloads introduce unique network requirements that influence the design of the EVPN-VXLAN fabric. First and foremost is bandwidth and scalability: AI training clusters interconnect dozens or hundreds of servers (each with multiple GPUs or specialized AI accelerators) and exchange large volumes of data, often saturating 25 GbE/50 GbE/100 GbE links today and moving towards 200 GbE/400 GbE in the near future. The network must support non-blocking or low-oversubscription pathways so that East-West traffic can scale linearly as more nodes are added. A leaf-spine Clos topology is well-suited here, as it provides predictable scaling – new spine switches increase fabric capacity, and new leaf switches add rack capacity without impacting existing connections. EVPN-VXLAN fits naturally with this Clos design, since it allows an arbitrary mesh of tunnels between leaf switches (VTEPs) and leverages equal-cost multi-path (ECMP) load balancing across the IP underlay. This ensures that large flows can utilize multiple links in parallel and that no single path becomes a bottleneck. Advanced hashing or flow-distribution techniques (e.g., flowlet-based load balancing) may be employed to avoid out-of-order packets while maximizing utilization of available links.
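As a back-of-the-envelope check on fabric sizing, the sketch below computes a leaf's oversubscription ratio from its server-facing and spine-facing capacity; the port counts and speeds are hypothetical.

```python
def oversubscription(server_ports: int, server_speed_gbps: int,
                     uplinks: int, uplink_speed_gbps: int) -> float:
    """Ratio of downlink (server-facing) to uplink capacity on a leaf.

    A value of 1.0 means a non-blocking leaf; AI fabrics typically aim
    for 1:1 on the GPU-facing network.
    """
    down = server_ports * server_speed_gbps
    up = uplinks * uplink_speed_gbps
    return down / up

# Hypothetical leaf: 16 servers at 200 GbE, 8 uplinks at 400 GbE to 8 spines
print(oversubscription(16, 200, 8, 400))   # 1.0 -> non-blocking, 8-way ECMP
```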
Another critical requirement is low latency and minimal packet loss, especially for training workloads that use synchronization primitives or perform iterative computations sensitive to network delays. Distributed AI jobs often use protocols like AllReduce or parameter servers that aggregate gradients from many workers; excessive latency in the network can slow down the entire training convergence. Packet loss is equally problematic, as it can drastically reduce the throughput of TCP flows or pause RDMA transfers. Therefore, the network must behave in a quasi-lossless manner under load. This is where data center transport enhancements come into play. Many AI clusters today employ RDMA over Converged Ethernet v2 (RoCEv2) to allow direct memory-to-memory transfers between GPU servers with minimal CPU involvement. RoCEv2 operates on UDP/IP (so it can traverse a VXLAN/IP fabric seamlessly), but it requires a lossless Ethernet underlay since RDMA has very limited tolerance for packet loss. Achieving losslessness in an Ethernet network is done via Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) mechanisms. PFC (IEEE 802.1Qbb) provides per-priority link-level flow control, meaning a switch can pause traffic on a specific class (e.g., the RDMA traffic class) when its egress buffers fill beyond a threshold, rather than dropping packets. ECN (as defined in RFC 3168 for IP) allows network devices to mark packets (setting an ECN bit in the IP header) instead of dropping them when experiencing congestion; the endpoints then react by slowing their send rate. Together, PFC and ECN – often implemented via algorithms like DCQCN (Data Center Quantized Congestion Notification) – enable a feedback loop that controls congestion proactively for RDMA traffic. An AI-oriented network design must incorporate these features by dedicating at least one lossless traffic class for RDMA and configuring ECN thresholds appropriately on all switches. This ensures that even at very high loads, the training traffic experiences negligible packet drops, preserving throughput and keeping latency low.
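The following sketch illustrates the RED-style ECN marking curve typically paired with DCQCN: nothing is marked below a minimum queue depth, the marking probability ramps linearly up to a maximum, and everything is marked beyond that. The Kmin/Kmax/Pmax values are placeholders, not vendor recommendations.

```python
import random

def ecn_mark(queue_kb: float, kmin_kb: float = 150.0,
             kmax_kb: float = 1500.0, pmax: float = 0.1) -> bool:
    """RED-style ECN marking decision as used with DCQCN-capable switches.

    Below kmin nothing is marked; between kmin and kmax the marking
    probability ramps linearly up to pmax; at or above kmax every packet
    is marked (PFC remains the last-resort backstop beyond that point).
    Thresholds here are placeholders, not tuned values.
    """
    if queue_kb <= kmin_kb:
        return False
    if queue_kb >= kmax_kb:
        return True
    p = pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)
    return random.random() < p
```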
Beyond raw performance, multitenancy and segmentation are important in AI data centers. It is common for a shared AI infrastructure to be used by multiple teams or user groups, or to host different types of workloads (e.g., training vs. inference, or production vs. development). Isolation between these is often required for security, resource management, or performance guarantees. EVPN-VXLAN provides a built-in solution for this via virtual routing and forwarding instances (VRFs) and VXLAN network identifiers. Each tenant or workload can be assigned its own Layer-2 VNI and/or Layer-3 VRF (with one or more associated VNIs), so that their traffic is separated from others at the overlay level. According to Cisco’s design blueprint, if a network must serve multiple tenants and functions, one can leverage an MP-BGP EVPN VXLAN fabric – VXLAN overlays allow network separation between tenants on a shared infrastructure. Similarly, Juniper’s AI data center design guidance notes that using an EVPN-VXLAN IP fabric overlay improves scalability by enabling multitenancy within the same data center, so that different AI workloads (for example, training different large language models for different departments) can run concurrently in isolation over a shared physical fabric. This segmentation extends both at Layer 2 (through EVPN MAC-VRF instances that keep broadcast domains separate) and at Layer 3 (through EVPN IP-VRF instances using EVPN Type-5 routes for tenant-specific routing tables). In practice, this means an AI cluster network can be logically partitioned into multiple virtual networks – for instance, a research network and a production network – using the same physical switches but with traffic and control-plane information cleanly isolated by EVPN policy.
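As an illustration of this partitioning, the sketch below models tenants, their IP-VRFs (L3 VNIs), and their Layer-2 segments the way a fabric automation tool might; the tenant names and VNI numbers are purely illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Tenant:
    name: str
    l3_vni: int                                              # one IP-VRF per tenant
    networks: Dict[str, int] = field(default_factory=dict)   # segment name -> L2 VNI

fabric = {
    "AI-Research":   Tenant("AI-Research",   l3_vni=50001,
                            networks={"gpu": 1010, "storage": 1020}),
    "AI-Production": Tenant("AI-Production", l3_vni=50002,
                            networks={"gpu": 2010, "storage": 2020}),
}
# Traffic stays inside a tenant's VRF unless a border device explicitly leaks routes.
```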
Finally, network reliability and manageability are key requirements. AI workloads can run for days or weeks, so any network downtime or instability can be costly. The EVPN-VXLAN fabric should therefore be designed for high availability, leveraging features like redundant spines and fast convergence. EVPN itself supports fast convergence by using BGP’s reliable transport and can leverage BFD (Bidirectional Forwarding Detection) for rapid failure detection on links. Moreover, EVPN’s multi-homing capabilities allow servers or racks to be dual-homed to two leaf switches in an all-active fashion (enabled by EVPN Ethernet Segment routes and designated forwarder election as per RFC 7432), ensuring that a leaf failure does not isolate the servers connected to it. From an operations standpoint, automation and telemetry are vital. Many operators deploy centralized SDN controllers or fabric management tools to automate EVPN-VXLAN deployments (for example, using Cisco Nexus Dashboard Fabric Controller or Juniper Apstra to push configurations and manage BGP EVPN route reflectors). Streaming telemetry data – such as switch buffer occupancy, ECN marking rates, and PFC pause frame counters – helps network engineers proactively identify congestion hotspots or verify that the QoS mechanisms are working as intended. In summary, networks for AI/ML must not only deliver high performance, but also robust fault-tolerance and operational visibility to support the continuous and intensive nature of these workloads.
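A simple telemetry check might look like the sketch below, which flags ports whose PFC-pause or ECN-mark counters grew unusually fast over a polling interval. The counter names and thresholds are hypothetical; a real deployment would pull these values from the platform's streaming-telemetry schema (for example via gNMI).

```python
def check_port_health(counters: dict, pfc_pause_limit: int = 1000,
                      ecn_mark_limit: int = 100000) -> list:
    """Flag ports whose PFC-pause or ECN-mark deltas exceed rough thresholds.

    `counters` maps port name -> {"pfc_pause_rx": int, "ecn_marked": int}
    deltas over the last polling interval; the field names are hypothetical.
    """
    alerts = []
    for port, c in counters.items():
        if c.get("pfc_pause_rx", 0) > pfc_pause_limit:
            alerts.append(f"{port}: heavy PFC pausing, review ECN thresholds")
        if c.get("ecn_marked", 0) > ecn_mark_limit:
            alerts.append(f"{port}: sustained ECN marking, possible incast hotspot")
    return alerts
```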
Designing an EVPN-VXLAN Fabric for AI Workloads
Given the above requirements, we now outline a reference design for an EVPN-VXLAN based network tailored to AI/ML workloads. The design balances Layer-3 routed scalability with Layer-2 flexibility and incorporates advanced traffic management for performance. Key aspects of the design are discussed below.
Physical Topology and Underlay
The baseline physical topology is a two-tier leaf-spine architecture (a 3-stage Clos). Each rack of AI servers connects to one or two Top-of-Rack (ToR) leaf switches, and all leaf switches are interconnected through a set of spine switches. This provides a fabric where any two servers are at most two hops apart (leaf → spine → leaf). High-bandwidth uplinks (e.g., 100 Gbps or 400 Gbps) between leaves and spines ensure ample capacity for east-west traffic, and multiple spines provide parallel ECMP paths. The underlay network runs pure IP routing – typically eBGP on each leaf-to-spine link, or an IGP like OSPF/IS-IS – to advertise loopback addresses and establish basic IP reachability among all switches. Each leaf is configured with a unique loopback IP address that serves as its VTEP endpoint in the overlay.
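A small script can generate the underlay addressing and ASN plan. The sketch below assumes one illustrative convention (a shared spine ASN, a unique ASN per leaf as in common eBGP underlay designs, and loopbacks carved from 10.0.0.0/24 that double as VTEP addresses); none of these values are prescriptive.

```python
from ipaddress import ip_address

def underlay_plan(num_spines: int, num_leaves: int, base_asn: int = 65000):
    """Assign loopback/VTEP addresses and eBGP ASNs for a leaf-spine underlay.

    Convention (purely illustrative): spines share one ASN, each leaf gets
    its own ASN, and loopbacks come from 10.0.0.0/24. The leaf loopback
    doubles as its VTEP source address in the overlay.
    """
    loop = ip_address("10.0.0.1")
    plan = {}
    for s in range(num_spines):
        plan[f"spine{s+1}"] = {"loopback": str(loop + s), "asn": base_asn}
    for l in range(num_leaves):
        plan[f"leaf{l+1}"] = {"loopback": str(loop + num_spines + l),
                              "asn": base_asn + 1 + l}
    return plan

# Example: 4 spines, 8 leaves
for name, attrs in underlay_plan(4, 8).items():
    print(name, attrs)
```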
Routing and Gateway Design
On top of the physical underlay, the EVPN-VXLAN overlay is deployed to provide tenant connectivity. All leaf switches (and optionally spines acting as route reflectors) establish BGP sessions for the EVPN address family. These BGP sessions can form a full mesh or use a route-reflector design (often the spine switches act as route reflectors to simplify configuration). Every tenant network (or VLAN) in the cluster is mapped to an EVPN instance and a VXLAN VNI. For example, the AI cluster’s GPU interconnect network might correspond to VLAN 10 mapped to VNI 1010, and a storage network could be VLAN 20 mapped to VNI 1020, each either in the same VRF or different VRFs depending on the need for inter-routing. The EVPN control-plane ensures that if a server in Rack A advertises a MAC address or issues an ARP request, that information is distributed to other racks that have the same segment, without flooding the entire network.
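The VLAN 10 to VNI 1010 and VLAN 20 to VNI 1020 examples suggest a simple base-offset numbering convention; the sketch below assumes that convention and is illustrative only.

```python
L2_VNI_BASE = 1000   # assumed convention: L2 VNI = 1000 + VLAN ID

def vlan_to_vni(vlan_id: int) -> int:
    """Map an access VLAN to its VXLAN VNI using the base-offset convention
    implied by the examples above (VLAN 10 -> VNI 1010, VLAN 20 -> VNI 1020)."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("invalid VLAN ID")
    return L2_VNI_BASE + vlan_id

assert vlan_to_vni(10) == 1010 and vlan_to_vni(20) == 1020
```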
For inter-subnet routing, the best practice is to use distributed anycast gateways on the leaf switches (the symmetric IRB model). In this design, each VLAN/VNI is extended to every leaf that has hosts in that network, and each of those leaves is configured with the same gateway IP/MAC for that VLAN (e.g., all leaf switches have interface VLAN 10 with IP 10.1.1.1/24 and the same virtual MAC). Hosts simply use the default gateway IP which is present on their local leaf. When a host in one subnet needs to communicate with a host in another subnet, the traffic is routed by the local leaf and sent across the fabric to the remote leaf, already encapsulated with the destination VNI of the target subnet. EVPN conveys the reachability of hosts via MAC/IP route advertisements, and can also distribute the anycast gateway MAC so that each leaf recognizes remote gateway endpoints. The symmetric IRB approach ensures that routing happens at the edge and that traffic between subnets travels optimally (it doesn’t require a dedicated centralized router device). It also simplifies ARP/NDP handling – EVPN can perform ARP suppression, where a leaf responds to ARP requests for remote hosts using information learned via EVPN, thus reducing broadcast traffic. In summary, each leaf acts as a combined L2/L3 gateway for its attached servers, and BGP EVPN coordinates the exchange of MAC and IP reachability so that both Layer-2 and Layer-3 connectivity are provided across the fabric.
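To show what a distributed anycast gateway boils down to, the sketch below renders a vendor-neutral pseudo-configuration from Python; the keywords are placeholders rather than the CLI of any particular switch OS, and the gateway MAC is an arbitrary example value.

```python
def anycast_gateway_config(vlan: int, vni: int, gw_ip_cidr: str,
                           gw_mac: str = "00:00:5e:00:01:01") -> str:
    """Render a vendor-neutral snippet for a distributed anycast gateway.

    The same IP/MAC is configured on every leaf that hosts this VLAN, so a
    server always finds its default gateway one hop away. The keywords are
    placeholders, not any specific switch OS syntax.
    """
    return "\n".join([
        f"vlan {vlan}",
        f"  l2-vni {vni}",
        f"interface Vlan{vlan}",
        f"  ip address {gw_ip_cidr}",
        f"  anycast-gateway-mac {gw_mac}",
        f"  anycast-gateway enable",
    ])

print(anycast_gateway_config(10, 1010, "10.1.1.1/24"))
```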
Multitenancy and Security
For a multi-tenant AI cluster or any environment where multiple distinct user groups share the infrastructure, the design should allocate separate logical network segments to each tenant. In EVPN-VXLAN, this is achieved by using multiple EVPN instances/VRFs. For example, one might define VRF “AI-Research” and VRF “AI-Production”, each with their own set of VLANs/VNIs. Traffic is completely isolated between these VRFs (no EVPN route leaking occurs unless explicitly configured on a firewall or border node). If communication is required between them, it can be done via an external router or firewall that connects to both VRFs and enforces policy. This aligns with the security principle of micro-segmentation: even if teams share physical switches, their training jobs and data flows are isolated at the network level.
EVPN-VXLAN also supports more fine-grained network segmentation within a tenant, if needed. For instance, one could separate the storage network and the compute network of the same AI cluster into different VNIs, isolating their layer-2 domains while still keeping them in a common VRF for routing. This way, broadcast domains are kept minimal and fault domains are limited (a broadcast storm or misbehaving NIC on the storage network won’t affect the compute network, for example). The EVPN control plane enforces these separations by using distinct route distinguisher (RD) and route target (RT) values for each EVPN instance, so that a given MAC/IP route is only imported by switches that share the same RT (i.e., belong to the same tenant context).
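The import decision itself is simple, as the sketch below shows: a route enters a local MAC-VRF or IP-VRF only if it carries a route target that the instance is configured to import. The RT values are hypothetical.

```python
def should_import(route_rts: set, local_import_rts: set) -> bool:
    """A route is imported into a local MAC-VRF/IP-VRF only if it carries at
    least one route target the instance imports. This is the mechanism that
    keeps tenants' MAC/IP routes out of each other's tables."""
    return bool(route_rts & local_import_rts)

# Leaf configured only for the AI-Research tenant (RT 65000:50001):
assert should_import({"65000:50001"}, {"65000:50001"}) is True
assert should_import({"65000:50002"}, {"65000:50001"}) is False
```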
In terms of security, using EVPN-VXLAN means the core of the network (the IP fabric) does not need to carry any customer or tenant VLANs; it only sees encapsulated VXLAN traffic. The fabric can thus be an IP-only network, which is easier to secure and operate (for example, standard IP ACLs or routing policies can be applied at the borders). Moreover, because the EVPN control plane advertises MAC/IP bindings learned at a host's legitimate attachment point and supports proxy ARP/ND handling, it inherently hardens the network against certain L2 attacks (e.g., ARP spoofing is mitigated by the fact that only the leaf where a host actually resides advertises its MAC/IP). Additional measures like DHCP snooping or dynamic ARP inspection can also be integrated at the leaf level if needed, similar to traditional enterprise networks, and EVPN will propagate only valid information. Overall, the EVPN-VXLAN architecture allows the AI network to be partitioned and secured in a manner analogous to cloud multi-tenant networks, which is important as AI becomes a shared service across large organizations.
Congestion Control and Losslessness
As noted, enabling RoCEv2 for AI training traffic is a key design consideration in these networks. Each leaf and spine switch in the EVPN-VXLAN fabric should be configured with at least one no-drop queue (PFC-enabled queue) for the RDMA traffic class. Typically, one of the 8 priority code points (PCP values) in Ethernet is reserved for AI/RDMA traffic (for example, priority 3 could be designated as the no-drop class). PFC is then enabled on that priority on all switch ports. This means that if a switch’s egress buffer for priority 3 traffic hits the threshold, a PFC pause frame is sent to the upstream device (e.g., to the sending leaf or server), indicating it should pause sending priority 3 traffic for a moment. Simultaneously, ECN marking is configured on that same queue with a lower threshold so that early congestion is signaled to end hosts by marking packets instead of stopping them. The NICs in the servers (which support RoCEv2 and DCQCN) will react to the ECN marks by slowing down the flow rate, hopefully preventing the buffers from ever overflowing and thus minimizing the need for PFC pauses.
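The per-queue policy described above can be captured in a small data model, sketched below; the priority assignment and buffer thresholds are placeholders meant only to show the intended ordering (ECN engages before PFC), not tested values.

```python
from dataclasses import dataclass

@dataclass
class QueuePolicy:
    priority: int          # 802.1p / PCP value
    pfc_no_drop: bool      # pause instead of dropping when the buffer fills
    ecn_min_kb: int        # start ECN marking here (below the PFC threshold)
    ecn_max_kb: int        # mark every packet at/above this occupancy
    pfc_xoff_kb: int       # send a PFC pause when the queue reaches this point

# Placeholder policy: priority 3 is the RoCEv2 no-drop class, with ECN engaging
# well before PFC so pauses stay rare; everything else is a normal drop class.
ROCE_CLASS = QueuePolicy(priority=3, pfc_no_drop=True,
                         ecn_min_kb=150, ecn_max_kb=1500, pfc_xoff_kb=2000)
DEFAULT_CLASS = QueuePolicy(priority=0, pfc_no_drop=False,
                            ecn_min_kb=0, ecn_max_kb=0, pfc_xoff_kb=0)
```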
Tuning these parameters is an important part of the deployment. If ECN thresholds are set too high (or not used at all), the network may fall back on PFC too often, which can cause head-of-line blocking when many flows are paused behind a congested flow. On the other hand, if ECN marking is too aggressive or NICs respond too sharply, available link capacity goes underused. Industry best practices suggest starting with vendor-recommended settings (for example, Cisco and Mellanox/NVIDIA provide tested configurations for ECN and PFC for various link speeds and traffic patterns) and then iteratively fine-tuning in a staging environment that simulates AI traffic. It is also beneficial to define traffic classes beyond a simple split between RDMA and everything else. For instance, control traffic (SSH, NFS metadata, etc.) could use a priority that receives weighted fair queueing treatment to ensure it is not starved by bulk data. Similarly, if training and inference workloads share the network and one is more latency-sensitive, they could be assigned different priorities or even separate VNIs with different QoS policies.
From the EVPN-VXLAN perspective, it is important to note that VXLAN encapsulation can carry the inner packet's priority across the network. Most implementations copy the inner 802.1p priority or DSCP to the DSCP field of the outer IP header (and, where applicable, to the outer 802.1p priority), so that the underlay can treat AI traffic with high priority even though it is encapsulated. This behavior should be verified and configured accordingly on the switches (for example, using QoS maps). By doing so, the entire path from source server, through the fabric, to destination server honors the no-drop policy for AI traffic. Moreover, when extending the network or connecting to external networks, one should ensure those policies carry over. For instance, if the AI cluster connects to a storage array in another network segment, that segment should also run PFC on the relevant class to maintain end-to-end losslessness.
Resiliency and High Availability
The EVPN-VXLAN fabric is designed to be highly resilient to both link and node failures. At the spine layer, multiple spine switches provide redundancy; if one spine goes down, traffic is redistributed over the remaining spines via ECMP without disruption (flows may see a momentary pause as hashing adjusts, but protocols like TCP or RoCE handle that gracefully). Spine failures are handled entirely at Layer 3: the underlay routing protocol reconverges and ECMP hashing redistributes flows across the remaining spines.
At the leaf layer, redundancy is achieved through dual-homing of servers and the EVPN multi-homing mechanism. In many AI deployments, each server has two network interfaces (or dual-ported NICs) that can connect to two separate leaf switches in a rack. Those two leaf switches form an “Ethernet segment” for the server (identified by an Ethernet Segment Identifier, ESI). EVPN allows both leafs to advertise the server’s MAC/IP routes with a special community that indicates multi-homing, enabling remote switches to know that those routes are reachable via either leaf interchangeably. One of the leaf switches will be elected as the Designated Forwarder (DF) for each VLAN on that segment, which means it handles broadcast/multicast traffic to the server to avoid duplication. In all-active mode, both leafs can forward unicast traffic to/from the server simultaneously (effectively the server is using a LAG across the two leafs, without requiring MLAG protocols on the switches – EVPN handles the coordination). If one leaf fails, the other leaf simply takes over all traffic for that server; the failover is fast because BGP detects the session loss and withdraws routes, and the remaining leaf had already been advertising the server’s routes as well (or can quickly send an update indicating it now is the sole forwarder). This multi-homing via EVPN is a significant improvement over traditional MLAG setups in terms of standardization and scale, since it doesn’t rely on pairwise switch syncing and can support an arbitrary number of leaf pairs in the fabric.
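The default DF election ("service carving") from RFC 7432 is simple enough to sketch: the PEs attached to an Ethernet segment are ordered by IP address and the DF for VLAN V is the PE at index V mod N. The addresses and VLAN IDs below are illustrative.

```python
from ipaddress import ip_address

def designated_forwarder(pe_ips: list, vlan_id: int) -> str:
    """Default DF election ("service carving") per RFC 7432, Section 8.5.

    The PEs on an Ethernet segment are sorted by IP address, and the DF for
    VLAN V is the PE at index (V mod N). Each VLAN gets a deterministic DF,
    and BUM forwarding duty is spread across the multi-homing peers.
    """
    ordered = sorted(pe_ips, key=ip_address)
    return ordered[vlan_id % len(ordered)]

# Server dual-homed to two leaves: VLANs 10 and 11 land on different DFs.
print(designated_forwarder(["10.0.0.12", "10.0.0.11"], 10))  # 10.0.0.11
print(designated_forwarder(["10.0.0.12", "10.0.0.11"], 11))  # 10.0.0.12
```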
In addition to node/link failure resilience, the design should consider maintenance scenarios. EVPN-VXLAN fabrics support hitless upgrades and incremental changes more gracefully than older designs. For example, adding a new leaf switch (to add more racks) involves assigning it an IP, configuring BGP, and integrating it into the EVPN overlay – this can be done with zero impact on existing traffic, especially if using an automation system that pre-provisions the required policies, since new tenants or devices can be added without changing the intermediate fabric configuration. Similarly, spine switch software can be upgraded in a rolling fashion (draining traffic by adjusting BGP announcements, etc.) as in any IP network, and the EVPN control plane gracefully handles routes moving around.
Inter-Data Center Connectivity
While our focus is on a single data center fabric, many AI infrastructures span multiple sites or availability zones for geo-redundancy or collaboration between teams in different regions. EVPN-VXLAN can be extended to support multi-site connectivity. One approach is to deploy EVPN Multi-Site or EVPN gateways at the edge of each fabric: essentially, a pair of border leaf switches can perform VXLAN stitching between two sites, advertising routes from one fabric into the EVPN domain of the other (with appropriate route-target controls). This is described in various IETF drafts and solutions (e.g., use of EVPN Type-5 routes across data center interconnect). The benefit is that all the segmentation and tenant constructs remain consistent across sites – a given tenant’s network in Data Center 1 can be extended to Data Center 2 using EVPN, allowing VMs or containers to migrate or communicate across sites as if on one network.
However, inter-DC networking for AI also must account for latency and throughput constraints. Often, AI clusters within one site will have very high-bandwidth, low-latency connections (e.g., within one campus or metro area), but inter-site links might be more limited. A common practice is to keep the training cluster’s primary communications local, and use inter-DC links for periodic synchronization or replication (which can tolerate higher latency). If EVPN-VXLAN is used across the WAN, it may be beneficial to enable features like EVPN ARP proxy and IPv6 ND proxy so that even broadcasts like ARP/ND are not sent over the WAN unless necessary. Also, mapping of DSCP to WAN QoS classes should ensure that important traffic (perhaps check-pointing of AI models) gets priority over less critical traffic.
In summary, the EVPN-VXLAN fabric design for a single site can serve as a building block for a larger multi-site network, with EVPN providing a control-plane that can stitch these blocks together. This capability is useful as AI deployments grow and require disaster recovery, active-active clustering, or cloud bursting between on-premises and cloud.
Emerging Trends and Future Outlook
The landscape of networking for AI/ML is rapidly evolving. One emerging trend is the increasing link speeds and the transition to new network interface technologies. 400 Gbps Ethernet is becoming standard in cutting-edge AI clusters (with 800 Gbps on the horizon), driven by the immense bandwidth needs of clusters with hundreds or thousands of GPUs. EVPN-VXLAN as a technology is capable of scaling to these speeds, but it places demands on switch hardware to support larger forwarding tables (for potentially tens of thousands of endpoints in massive deployments) and deeper buffers to handle incast traffic from many-to-one communication patterns. Network vendors are responding with next-generation switch ASICs that include features tailored for AI workloads – for example, more on-chip buffer memory and intelligent traffic management algorithms to handle the unpredictable, bursty nature of AI traffic. It’s important for network engineers to track these hardware developments, as they will influence design choices (such as how many servers per leaf, or whether a 2-tier Clos is enough or a 3-tier is needed at ultra-large scale).
Another development is the rise of SmartNICs and DPUs (Data Processing Units) in servers. These are intelligent NICs that can offload networking functions such as VXLAN encapsulation/decapsulation, firewalling, and even portions of the RDMA congestion control algorithm. In AI clusters, where each server may be equipped with a DPU (for example, NVIDIA BlueField), the network design could leverage these capabilities by pushing certain functions to the server edge. For instance, a DPU could terminate the VXLAN tunnel on the server itself, meaning the server is directly participating in EVPN – effectively acting as its own leaf for overlay purposes. This could improve efficiency (reducing the load on ToR switches) and provide finer control of traffic right at the source. While this blurs the line between compute and networking, it is an area to watch: standards bodies and vendors are working on protocols to integrate such smart endpoints. EVPN, being based on BGP, is flexible enough that a server with a BGP agent on a DPU could join the EVPN fabric securely. Over time, we may see “leafless” fabrics where the concept of a ToR switch is partly virtualized into the server SmartNICs for certain deployments.
On the standards front, the IETF is exploring network virtualization beyond VXLAN. Geneve (Generic Network Virtualization Encapsulation) is a newer encapsulation designed to be more extensible and feature-rich than VXLAN (which adds roughly 50 bytes of fixed encapsulation overhead and offers no extension fields). Geneve allows optional TLV (Type-Length-Value) options in the header, making it easier to carry metadata such as tenant identifiers, traffic tags, or even in-band telemetry data. EVPN's control plane has been designed to be largely encapsulation-agnostic; RFC 8365 focuses on VXLAN and NVGRE, but its procedures are intended to extend to other NVO3 encapsulations such as Geneve. As Geneve matures (and if hardware support for Geneve options becomes common), future AI networks might migrate to Geneve overlays for additional capabilities. For example, one could carry application-level identifiers in Geneve options to facilitate more granular routing or policy decisions in an AI workflow. For now, VXLAN is ubiquitous and well-supported, but Geneve is an important part of the future-proofing conversation.
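For comparison with the VXLAN header sketched earlier, the following builds a Geneve header (RFC 8926) with a single TLV option; the option class and type values are made up for illustration, and this is a sketch of the header layout rather than a production encoder.

```python
import struct

GENEVE_UDP_PORT = 6081
ETH_BRIDGED = 0x6558            # "Transparent Ethernet Bridging" payload type

def geneve_header(vni: int, options: bytes = b"") -> bytes:
    """Build a Geneve header (RFC 8926): 8 fixed bytes plus TLV options.

    The options field is the extensibility the text refers to: arbitrary
    metadata can ride there, which VXLAN's fixed header cannot carry.
    """
    assert len(options) % 4 == 0, "options must be padded to 4-byte multiples"
    opt_len_words = len(options) // 4          # 6-bit field, in 4-byte units
    first_byte = (0 << 6) | opt_len_words      # Ver=0 plus Opt Len
    flags = 0                                  # O and C bits clear
    fixed = struct.pack("!BBH", first_byte, flags, ETH_BRIDGED)
    fixed += struct.pack("!I", vni << 8)       # 24-bit VNI plus reserved byte
    return fixed + options

# One hypothetical 4-byte option with no data (class/type values made up):
opt = struct.pack("!HBB", 0x0102, 0x01, 0)     # option class, type, rsvd|len=0
print(geneve_header(1010, opt).hex())
```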
Additionally, there is active research and development in improved congestion control and load balancing tailored for high-performance workloads. We’ve seen algorithms like HPCC (High Precision Congestion Control) and others being proposed to further optimize Ethernet for HPC/AI traffic. These might leverage more detailed telemetry (e.g., per-flow queue occupancy feedback) to adjust rates faster and with more stability than current ECN-based schemes. In an EVPN-VXLAN context, such enhancements would largely be implemented at the device/NIC level and configured via software – the network design principles (having a lossless queue, marking, etc.) would remain, but the specific algorithms could change. The good news is that an EVPN-VXLAN fabric, being IP-based and standards-based, can usually accommodate these upgrades with firmware or software updates, as opposed to closed proprietary fabrics where one might be locked in.
Finally, automation and intent-based networking are increasingly important as AI networks scale. The complexity of managing thousands of VXLAN segments and BGP configurations can be mitigated by automation. We anticipate more integration of AI/ML techniques in network operations as well – for instance, using machine learning to predict congestion or failures in the EVPN fabric before they happen, or to automatically tune parameters like ECN thresholds based on traffic patterns. The concept of the network being “self-driving” might eventually apply to AI fabrics, where the network can adjust itself to best serve the AI workloads running on top of it. Operators should design with automation in mind: using programmatic interfaces (NETCONF/RESTCONF, gNMI for telemetry, etc.) and perhaps modeling their network in an intent framework so that as new technologies (like those mentioned above) become available, they can be adopted with minimal manual reconfiguration.
Conclusion
AI and ML workloads push data center networks to their limits in terms of performance, scale, and reliability. Designing a network that can accommodate these demands requires leveraging modern, standards-based technologies. EVPN-VXLAN offers a powerful and flexible framework to build such networks, combining the benefits of Layer-3 IP fabrics (optimal routing, massive scale-out) with the ability to overlay Layer-2 networks and segmented tenants as needed for AI workloads. In this paper, we reviewed how EVPN-VXLAN operates and why it is well-suited for AI clusters: it enables massive east-west scalability, seamless multi-tenancy for isolating different workloads, and the integration of lossless transport techniques for RDMA traffic. By adhering to open standards (IETF RFC 7348, 7432, 8365, 9135, and related specifications), network engineers can design fabrics that are interoperable across vendors and ready to incorporate future advancements.
We presented an overview of key design considerations, including the use of a leaf-spine Clos topology with distributed anycast gateways, proper QoS configuration with PFC/ECN to support high-performance training traffic, and EVPN multi-homing for resiliency. Implementing these best practices ensures that the resulting network can deliver the high throughput and low latency needed by AI applications, while also being resilient to failures and straightforward to manage at scale.
The industry is continuing to innovate in this space. As emerging trends like 400G/800G Ethernet, SmartNIC offloads, and advanced congestion control algorithms gain traction, the EVPN-VXLAN based design is expected to adapt and incorporate these improvements. The fundamental blueprint, however, remains consistent: an IP fabric with an EVPN overlay provides a robust, standards-based backbone for AI workloads today and tomorrow. Network engineers architecting AI/ML infrastructures are encouraged to build on this framework, leveraging the collective best practices from cloud data centers and the latest advancements in networking. By doing so, they can ensure that the network will not be a bottleneck – but rather a powerful enabler – for the next generation of AI breakthroughs.
References
[1] M. Mahalingam et al., “Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks,” IETF RFC 7348 (Informational), Aug. 2014.
[2] A. Sajassi et al., “BGP MPLS-Based Ethernet VPN,” IETF RFC 7432 (Proposed Standard), Feb. 2015.
[3] A. Sajassi et al., “A Network Virtualization Overlay Solution Using Ethernet VPN (EVPN),” IETF RFC 8365 (Proposed Standard), Mar. 2018.
[4] A. Sajassi et al., “Integrated Routing and Bridging in Ethernet VPN (EVPN),” IETF RFC 9135 (Proposed Standard), Oct. 2021.
[5] Cisco Systems, “Cisco Data Center Networking Blueprint for AI/ML Applications,” White Paper, accessed 2023. [Online]. Available: Cisco.com (Cisco White Papers).
[6] A. Chatterjee and V. V., “Designing Data Centers for AI Clusters,” Juniper Networks, White Paper, 2023.
[7] Juniper Networks, “EVPN-VXLAN for AI-ML Data Centers – Configuration Example,” Junos OS Evolved Technical Documentation, 2023.