The NSX-T Tier0 Gateway supports ECMP in BGP routing with the Leaf Switches, which helps utilize all the Edge Uplinks for Egress and Ingress traffic. ECMP mode is available only when the Tier0 Gateway is deployed in Active-Active mode. ECMP is applied at three levels:
- T0 DR-SR ECMP: Between the T0 DR Component and the T0 SR Components. This is NSX-T managed and is auto-enabled when the T0 Gateway is deployed in Active-Active mode, regardless of whether we enable the ECMP Toggle button under the T0 BGP Configuration. Up to 8 T0 DR-SR ECMP paths are supported, which means we could have up to 8 T0 SR Components attached to the T0 DR Component.
- BGP Uplink ECMP: At the Uplink interfaces of each T0 SR Component sitting on the Edge nodes. This is managed by the ECMP feature of the BGP routing process.
- iBGP Inter-SR ECMP: This is enabled on the Inter-SR iBGP links between the T0 SR Components on the Edge nodes.
Note that when we enable the ECMP Boolean Toggle button under the BGP Configuration of the T0 Gateway, ECMP is applied at the uplinks of the T0 SR Component on each Edge node. Enabling the Toggle button simply sets the “maximum-paths” parameter of the BGP process on each SR Component to 8. For example, suppose the T0 Gateway has 4 uplinks, two connected to one SR and the other two to another SR. With BGP ECMP disabled, only one Uplink from each Edge will be utilized, because BGP selects only a single best path by default. To utilize all 4 Uplinks, BGP ECMP needs to be enabled.
Edge VMs are usually deployed with 2 uplinks. With a two node Edge VM cluster, we can have 2 T0 DR-SR ECMP paths and 4 Uplink ECMP paths. If we need more, we have to spin up additional Edge nodes in the cluster.
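The path arithmetic above can be sketched in a few lines of Python. This is only an illustration of the counting described in this post, not anything NSX-T exposes: the function name and dictionary keys are made up, and the only values taken from the text are the limit of 8 DR-SR paths and the “maximum-paths” values of 1 (ECMP OFF) and 8 (ECMP ON).

```python
# Values taken from the article text.
MAX_DR_SR_PATHS = 8          # up to 8 T0 SR Components per T0 DR
MAX_PATHS_ECMP_ON = 8        # "maximum-paths" when the ECMP toggle is ON
MAX_PATHS_ECMP_OFF = 1       # "maximum-paths" when the ECMP toggle is OFF


def ecmp_path_counts(edge_nodes, uplinks_per_edge, bgp_ecmp_enabled):
    """Hypothetical helper: count usable ECMP paths for an Active-Active T0."""
    if edge_nodes < 1 or uplinks_per_edge < 1:
        raise ValueError("need at least one Edge node and one uplink")

    # One SR Component per Edge node, capped by the DR-SR ECMP limit.
    dr_sr_paths = min(edge_nodes, MAX_DR_SR_PATHS)

    # Each SR can only use as many uplinks as maximum-paths allows.
    max_paths = MAX_PATHS_ECMP_ON if bgp_ecmp_enabled else MAX_PATHS_ECMP_OFF
    uplinks_used_per_sr = min(uplinks_per_edge, max_paths)

    return {
        "dr_sr_paths": dr_sr_paths,
        "uplink_paths_per_sr": uplinks_used_per_sr,
        "total_uplink_paths": dr_sr_paths * uplinks_used_per_sr,
    }


if __name__ == "__main__":
    # Two Edge VMs with two uplinks each, as in this post.
    print(ecmp_path_counts(2, 2, bgp_ecmp_enabled=True))
    print(ecmp_path_counts(2, 2, bgp_ecmp_enabled=False))
```

For the two node cluster in this post, the sketch gives 2 DR-SR paths and 4 uplink paths with ECMP ON, and 2 DR-SR paths but only 2 uplink paths (one per Edge) with ECMP OFF.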
There are some requirements to be met for uplink ECMP to work. Firstly, the below BGP attributes of the routes on the SR Components of the Edge nodes should match.
- Weight (set locally on T0 Gateway)
- Local Preference (set locally on T0 Gateway)
- AS Path and AS Path length (both Leaf switches should be in the same BGP ASN). This requirement can be relaxed by using AS-Multipath-Relax, which we will cover in a separate article.
- Origin Code (of received routes from the Leaf switches)
- MED (of routes advertised from the Leaf switches)
- IGP Metric to reach the Leaf switch. It is directly connected, so the metric should be 0.
Secondly, ECMP should be enabled on the BGP process on the Leaf switches, so that Ingress traffic to the NSX-T networks from Leaf switches could utilize multiple NSX-T Edges.
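To make the attribute-matching requirement concrete, here is a minimal Python sketch that compares two received paths and reports which of the attributes listed above would stop them from being installed together. It is a simplified model for illustration, not the BGP implementation on the Edge: the `BgpPath` class, the `ecmp_mismatches` function, the next-hop labels and the ASNs are all hypothetical, and the `as_path_relax` flag only approximates the AS-Multipath-Relax behaviour by comparing AS Path length instead of content.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BgpPath:
    """Hypothetical, simplified view of one received BGP path."""
    next_hop: str
    weight: int
    local_pref: int
    as_path: Tuple[int, ...]
    origin: str          # "i" (IGP), "e" (EGP) or "?" (incomplete)
    med: int
    igp_metric: int      # 0 for a directly connected Leaf switch


def ecmp_mismatches(a: BgpPath, b: BgpPath, as_path_relax: bool = False) -> List[str]:
    """Return the attribute names that prevent a and b from being ECMP peers."""
    mismatches = []
    if a.weight != b.weight:
        mismatches.append("weight")
    if a.local_pref != b.local_pref:
        mismatches.append("local_pref")
    if as_path_relax:
        # Relaxed mode (assumption): only the AS Path length has to match.
        if len(a.as_path) != len(b.as_path):
            mismatches.append("as_path_length")
    elif a.as_path != b.as_path:
        mismatches.append("as_path")
    if a.origin != b.origin:
        mismatches.append("origin")
    if a.med != b.med:
        mismatches.append("med")
    if a.igp_metric != b.igp_metric:
        mismatches.append("igp_metric")
    return mismatches


if __name__ == "__main__":
    # Example ASNs and labels are made up for the illustration.
    via_leaf1 = BgpPath("leaf1", 0, 100, (65001,), "i", 0, 0)
    via_leaf2 = BgpPath("leaf2", 0, 100, (65001,), "i", 0, 0)
    via_other_asn = BgpPath("leaf2", 0, 100, (65002,), "i", 0, 0)

    print(ecmp_mismatches(via_leaf1, via_leaf2))                          # same ASN
    print(ecmp_mismatches(via_leaf1, via_other_asn))                      # different ASN
    print(ecmp_mismatches(via_leaf1, via_other_asn, as_path_relax=True))  # relaxed
```

An empty list means the two paths are candidates for ECMP. Any name in the list points at the attribute to align, either on the T0 Gateway or on the Leaf switches.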
Let’s look at the Egress / Ingress traffic directions with and without ECMP enabled on an Active-Active T0 Gateway.
Before we proceed, I would recommend reading my previous post on establishing eBGP peering between the T0 Gateway and Dell Networking switches in the link below. I have used the same infrastructure for this post, and many of the statements in this article build on it.
Let’s get started.
T0 Active-Active Gateway and BGP Configuration with L3 Leaf Switches
This is a summary of the deployed configuration. The Active-Active T0 Gateway is deployed over two Edge nodes in the Edge Cluster.
The T0 Gateway has 4 Uplinks:
- Two Uplinks via Edge node 1 (nsxedge01) over VLANs 60 & 70 respectively. Each uplink on Edge node 1 connects to separate Leaf Switches.
- Another two Uplinks via Edge node 2 (nsxedge02) over VLANs 60 & 70 respectively. Each uplink on Edge node 2 connects to separate Leaf Switches.
The L3 Leaf Switches used are DellEMC Networking S5048-ON in VLT. They appear as a single logical switch to the ESXi host but not to the Edge nodes as they are connected to N-VDS VLAN Logical Segments on the host networking.
The T0 Gateway has a total of 4 eBGP connections with the Leaf Switches: two from Edge node 1 and two from Edge node 2.
There are two Tier 1 Gateways attached to the T0 Gateway, and they advertise the below subnets:
Let’s check the eBGP peering on Edge node 1. We should see two neighborships: one with Leaf Switch 1 over VLAN 60 and the other with Leaf Switch 2 over VLAN 70.
And from Edge node 2.
Let’s check the BGP peering on Leaf Switch 1. We should see three neighborships: one eBGP peering with Edge node 1 over VLAN 60, another eBGP peering with Edge node 2 over VLAN 70, and a third iBGP peering with its VLT peer, Leaf Switch 2.
And from Leaf switch 2.
Egress / Ingress traffic pattern with ECMP OFF
Turning ECMP OFF sets the “maximum-paths” parameter in the BGP process of the T0 SR Components to 1. This means that only one eBGP path is selected from the two uplinks on each T0 SR Component, based on the BGP path selection process. In effect, both SR Components (Edge nodes) are active, but each uses a single uplink. Refer to the architecture discussed earlier.
This is the BGP Configuration of the SR Component on one of the Edge nodes. You can export this from the Edge nodes using the below command:
get service router running-config
An Active-Active T0 Gateway automatically enables ECMP between the T0 DR Component and the T0 SR Components. The T0 DR Component on the hosts will have two default routes: one to T0 SR Component 1 (on Edge node 1) and the other to T0 SR Component 2 (on Edge node 2). This configuration does not depend on the ECMP Toggle button under the T0 BGP configuration option.
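With two equal default routes on the DR, each flow has to be pinned to one of the two SR Components so that its packets stay in order. The sketch below shows the general idea of hash-based ECMP next-hop selection in Python. It is a generic illustration only: this post does not describe the hash NSX-T actually uses, so the SHA-256 hash, the 5-tuple layout, the function name and the next-hop labels are all assumptions.

```python
import hashlib
from typing import Sequence, Tuple

# (source IP, destination IP, protocol, source port, destination port)
Flow = Tuple[str, str, int, int, int]


def pick_next_hop(flow: Flow, next_hops: Sequence[str]) -> str:
    """Map a flow to one of the equal-cost next hops, deterministically."""
    if not next_hops:
        raise ValueError("no next hops available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 4 bytes of the digest as an integer bucket selector.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


if __name__ == "__main__":
    # Hypothetical labels for the two default-route next hops on the DR.
    sr_next_hops = ["sr-on-edge01", "sr-on-edge02"]
    for src_port in range(40000, 40005):
        flow = ("10.0.0.1", "192.168.1.10", 6, src_port, 443)
        print(src_port, "->", pick_next_hop(flow, sr_next_hops))
```

The same flow always lands on the same SR, while different flows spread across both, which is what allows both Edge nodes to forward traffic at the same time.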
Let’s look at the BGP table on SR Component 1 (on Edge node 1). We can see only a single path to reach the customer networks advertised from the Leaf switches (on 192.168.X.X/24 networks).
And the same for SR Component 2 (on Edge node 2).
This shows that in an Active-Active T0 Gateway with ECMP turned OFF, both Edge nodes forward traffic, but each uses only a single Uplink.
Egress / Ingress traffic pattern with ECMP ON
Let’s turn ON ECMP under the BGP Configuration option on the T0 Gateway. I have also enabled Inter-SR Routing, as I don’t see a reason for not using it. For more explanation on Inter-SR Routing, please refer to my previous post:
Enabling ECMP sets the “maximum-paths” parameter in the BGP process of the T0 SR Components to 8. This means that up to 8 paths on each SR Component (Edge node) could be selected to achieve ECMP. However, we can have only a maximum of 2 uplinks on each Edge node. Overall, we can achieve ECMP via 4 uplinks on two Edge nodes. If we want more ECMP paths, we add additional Edge nodes to the cluster.
Let’s look at the BGP table on SR Component 1 (on Edge node 1). We can see that both uplink paths are used to reach the customer networks advertised from the Leaf switches (on 192.168.X.X/24 networks). We can also see an iBGP route plumbed by the Inter-SR routing process, which is used as a backup link.
And from SR Component 2 on Edge node 2, we see the same ECMP paths.
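Why do the two eBGP uplink paths become active while the Inter-SR iBGP route stays as a backup? When all the other attributes tie, BGP prefers eBGP-learned paths over iBGP-learned ones, so only the eBGP paths compete for the “maximum-paths” slots. The Python sketch below models just that step. It is a simplification that assumes every other attribute is already equal, and the `Path` class, function name and next-hop labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Path:
    next_hop: str
    learned_via: str  # "ebgp" or "ibgp"


def split_active_and_backup(paths: List[Path], maximum_paths: int) -> Tuple[List[Path], List[Path]]:
    """Split candidate paths into the installed ECMP set and the backups.

    Assumes weight, local preference, AS Path, origin, MED and IGP metric
    are already equal, so only the eBGP-over-iBGP preference matters.
    """
    if maximum_paths < 1:
        raise ValueError("maximum_paths must be at least 1")
    ebgp = [p for p in paths if p.learned_via == "ebgp"]
    ibgp = [p for p in paths if p.learned_via == "ibgp"]
    # eBGP paths win whenever at least one exists.
    preferred, fallback = (ebgp, ibgp) if ebgp else (ibgp, [])
    active = preferred[:maximum_paths]
    backup = preferred[maximum_paths:] + fallback
    return active, backup


if __name__ == "__main__":
    candidates = [
        Path("leaf-switch-1", "ebgp"),
        Path("leaf-switch-2", "ebgp"),
        Path("peer-sr-inter-sr", "ibgp"),
    ]
    for max_paths in (1, 8):  # ECMP OFF vs ECMP ON
        active, backup = split_active_and_backup(candidates, max_paths)
        print(max_paths, [p.next_hop for p in active], [p.next_hop for p in backup])
```

With “maximum-paths” at 8, both Leaf switch paths are active and the Inter-SR route is the only backup. With “maximum-paths” at 1, a single uplink is active and everything else waits as a backup, which matches the ECMP OFF behaviour seen earlier.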
Now the Egress traffic from the NSX-T Edges utilizes ECMP. But to achieve ECMP for the Ingress traffic, we have to enable ECMP on the Leaf Switches as well.
Enabling ECMP on the DellEMC S5048-ON Leaf Switches
This is the configuration to enable BGP ECMP on the Leaf Switches.
Let’s look at the BGP table on Leaf Switch 1. We can see that two active paths are selected to reach the NSX-T networks: one via Edge node 1 and the second via Edge node 2. We can also see an iBGP route, learned over the iBGP peering with the VLT peer (Leaf Switch 2), which is used as a backup link.
And from Leaf Switch 2, we see the same ECMP paths to Edge nodes.
In this way, we achieved ECMP for both Egress and Ingress traffic. This shows that in an Active-Active T0 Gateway with BGP ECMP turned ON, both Edge nodes forward traffic utilizing all the uplinks.
Now, what if the T0 Edges peer with Leaf Switches that differ in their BGP AS Path attribute values? We can’t achieve BGP ECMP in that case unless we enable the AS-Multipath-Relax feature in the BGP configuration. I have covered this in a separate article:
I hope this post was informative. Thanks for reading.
8 thoughts on “NSX-T Tier0 ECMP Architecture and Routing Explained”
I found your blog to be very informative, thanks for taking the time to write it.
Thanks Gerald, glad to hear that you enjoyed it.
A fantastic article, I learned a ton about ECMP! I did have a question as I am confused by the diagram. As both Edge Nodes are part of the Overlay Transport Zone, they would both have a T0-DR correct? I only see one T0-DR. Shouldn’t there be two, one for each Edge Node?
Hi Kent, you are exactly correct. The T0 DR is distributed and is available on both Edge nodes as well as on the Transport nodes which are a part of the same transport zone. One thing to note is that T0 DR-SR ECMP is available only on the host transport nodes and not on the edge nodes. Each T0 DR on the edge node has only a single default route to an SR construct which is hosted within itself. I will modify the diagram to avoid the confusion.
Thanks and glad to hear that this article was helpful.
Thank you so much for explaining and sorry for posting twice. Keep up the excellent work, absolutely love reading your articles, very informative:)
Thanks Kent, glad to hear this. 🙂
very good explanation Hari