NSX-T introduced Inter-SR iBGP routing between the SR Components of the T0 Gateway in version 2.4. This feature helps tolerate an asymmetric failure of the SR component on an Edge node (e.g., uplink failures) by routing the internal traffic reaching it (from the T0 DR Component) over the iBGP Inter-SR link to the SR Components on the other Edge nodes for northbound reachability. This feature is available only when the Tier0 Gateway is deployed in Active-Active mode with ECMP enabled under the BGP process.
In case you missed it, I have already covered the Tier0 Inter-SR Routing basics in my earlier article below.
In this article, we will look at how ECMP is achieved on the Inter-SR routing link, so that in the event of uplink failures on one T0 SR Component, traffic is load shared to all the other T0 SR components avoiding a possible bottleneck.
ECMP is applied at three levels:
- T0 DR-SR ECMP: Between the T0 DR component and the T0 SR Components. This is NSX-T managed and is auto-enabled when the T0 Gateway is deployed in Active-Active mode, regardless of whether we enable the ECMP toggle under the T0 BGP configuration. Up to 8 T0 DR-SR ECMP paths are supported, which means we could have up to 8 T0 SR Components attached to the T0 DR Component.
- BGP Uplink ECMP: At the uplink interfaces of each T0 SR Component sitting on the Edge nodes. This is managed by the ECMP feature of the BGP routing process.
- iBGP Inter-SR ECMP: This is enabled on the Inter-SR iBGP links between the T0 SR Components on the Edge nodes.
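To build some intuition for what ECMP does at any of these levels, here is a toy sketch (not NSX-T's actual hashing implementation; the addresses are made up) of per-flow next-hop selection: a hash of the 5-tuple picks one of the equal-cost next hops, so packets of a given flow always take the same path while flows overall are load-shared.

```python
# Hypothetical sketch of per-flow ECMP next-hop selection. A hash of the
# flow's 5-tuple deterministically maps it to one of the equal-cost next
# hops, keeping packets of one flow on one path.
import hashlib

def pick_next_hop(flow, next_hops):
    """Deterministically map a flow (5-tuple) to one ECMP next hop."""
    key = "|".join(str(f) for f in flow).encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return next_hops[digest % len(next_hops)]

# Up to 8 T0 SR Components can sit behind the DR as ECMP next hops;
# here we model the 4 Edge nodes in this lab (illustrative addresses).
sr_next_hops = [f"169.254.0.{i}" for i in range(2, 6)]

flow_a = ("10.1.1.10", "192.168.60.5", 6, 40001, 443)
flow_b = ("10.1.1.11", "192.168.60.5", 6, 40002, 443)

print(pick_next_hop(flow_a, sr_next_hops))  # same flow -> same SR every time
print(pick_next_hop(flow_b, sr_next_hops))  # other flows may land on other SRs
```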
The first two cases (T0 DR-SR ECMP and BGP Uplink ECMP) are already covered in my previous article below:
We will go through the third case here. Let’s get started.
Inter-SR iBGP ECMP Architecture
Let’s look at the deployed configuration in my infra. A Tier0 Gateway is deployed in Active-Active mode on an Edge cluster with 4 Edge nodes.
The T0 Gateway has 8 Uplinks:
- Two Uplinks via Edge node 1 (nsxedge01) over VLANs 60 & 70 respectively. Each uplink on Edge node 1 connects to separate Leaf Switches.
- Two Uplinks via Edge node 2 (nsxedge02) over VLANs 60 & 70 respectively. Each uplink on Edge node 2 connects to separate Leaf Switches.
- Two Uplinks via Edge node 3 (nsxedge03) over VLANs 60 & 70 respectively. Each uplink on Edge node 3 connects to separate Leaf Switches.
- Two Uplinks via Edge node 4 (nsxedge04) over VLANs 60 & 70 respectively. Each uplink on Edge node 4 connects to separate Leaf Switches.
ECMP and Inter-SR iBGP are enabled under the BGP configuration of the Tier0 Gateway. Each T0 SR Component establishes a full-mesh iBGP connection with each of the others.
As seen below, all the SR Components over the 4 Edge nodes are Active.
Each T0 SR Component establishes eBGP peering with the L3 Leaf switches over both of its uplinks.
I have enabled ECMP for the BGP process on the Leaf switches, so traffic to the NSX-T networks is load-shared over all four Edge nodes.
Looking at the BGP neighbor table on the T0 SR Components, we can see full-mesh iBGP sessions established between each other. The Inter-SR link connects to a system-generated overlay logical segment which is transparent to us.
We have a total of N*(N-1) iBGP Inter-SR neighbor sessions across the cluster (each SR peers with the other N-1), which equals 12 in our case with four Edge nodes.
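The full-mesh session count above can be checked with a trivial sketch: each of the N SRs shows the other N-1 as neighbors, so N*(N-1) neighbor entries appear across the cluster.

```python
# With N Active SR Components in a full iBGP mesh, each SR peers with the
# other N-1, giving N*(N-1) neighbor entries across the cluster (each
# bidirectional session shows up once on each end).
def inter_sr_neighbor_entries(n_edges):
    return n_edges * (n_edges - 1)

print(inter_sr_neighbor_entries(4))  # 12, matching the 4-Edge-node lab
```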
Now let's look at the BGP table and forwarding table on the T0 SR Components.
We can see 5 paths in the BGP table to reach a VLAN network advertised from the Leaf switches.
- Two eBGP paths via the Leaf switches. These are the preferred paths, as eBGP paths have higher preference over iBGP paths in the BGP path-selection process.
- Three iBGP paths via Inter-SR routing, which work as backup.
The eBGP paths via the Leaf switches are kept in the routing table. Note that BGP ECMP is enabled, so we see both eBGP paths in the forwarding table.
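The behaviour above can be sketched as a toy model of the relevant slice of BGP best-path selection (greatly simplified; real BGP compares many more attributes, and the addresses are illustrative): eBGP routes win over iBGP routes, and with BGP multipath (ECMP) enabled, every equal-preference best path is installed in the forwarding table.

```python
# Toy model: eBGP routes are preferred over iBGP routes, and with ECMP
# enabled all equally-preferred best paths are installed. This is a
# simplification of BGP best-path selection for illustration only.
def installed_paths(paths, ecmp=True):
    """paths: list of (protocol, next_hop) tuples; returns installed next hops."""
    rank = {"ebgp": 0, "ibgp": 1}          # lower rank = more preferred
    best = min(rank[proto] for proto, _ in paths)
    winners = [nh for proto, nh in paths if rank[proto] == best]
    return winners if ecmp else winners[:1]

# The 5 paths seen on one T0 SR: 2 eBGP via the Leaf switches,
# 3 iBGP via the other SR Components (hypothetical next hops).
bgp_table = [
    ("ebgp", "192.168.60.1"), ("ebgp", "192.168.70.1"),
    ("ibgp", "sr2"), ("ibgp", "sr3"), ("ibgp", "sr4"),
]
print(installed_paths(bgp_table))  # only the two eBGP paths are installed
```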
Now let's look at the BGP table and forwarding table on the Leaf switches.
We can see 5 paths in the BGP table to reach an NSX-T network advertised from the T0 Gateway.
- Four eBGP paths via the T0 SR Edge nodes. These are the preferred paths, as eBGP paths have higher preference over iBGP paths in the BGP path-selection process.
- One iBGP path learned via the VLT peer, which works as backup.
The eBGP paths via the T0 SR Edge nodes are kept in the routing table. Note that BGP ECMP is enabled, so we see four eBGP paths in the forwarding table.
Uplink failure on the T0 SR Component (Edge node)
Let’s invoke a failure on both the uplinks on one of the T0 SR components. With Inter-SR routing enabled and with ECMP, we should see all the iBGP paths to the other T0 SR Components getting installed in the forwarding table.
What if no Inter-SR iBGP ECMP?
Without ECMP, we would see only one iBGP Inter-SR path installed in the forwarding table, which means all northbound traffic reaching this T0 SR Component would be routed to only the single best T0 SR Component for northbound reachability. This is suboptimal and becomes a bottleneck.
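The contrast can be illustrated with the same toy best-path model as earlier (a simplification, with hypothetical next-hop names): once both eBGP uplink paths are withdrawn, only the iBGP Inter-SR paths remain, and ECMP decides whether all of them or just one is installed.

```python
# Sketch of the failure case: with both uplinks down on an SR, its eBGP
# paths are withdrawn and the iBGP Inter-SR paths become best. With ECMP
# all of them are installed; without it only one, creating a bottleneck.
def installed_paths(paths, ecmp=True):
    """Toy best-path model: eBGP preferred over iBGP, multipath if ecmp."""
    rank = {"ebgp": 0, "ibgp": 1}
    best = min(rank[proto] for proto, _ in paths)
    winners = [nh for proto, nh in paths if rank[proto] == best]
    return winners if ecmp else winners[:1]

# BGP table on the failed SR: only the 3 Inter-SR iBGP paths are left.
after_failure = [("ibgp", "sr2"), ("ibgp", "sr3"), ("ibgp", "sr4")]
print(installed_paths(after_failure, ecmp=True))   # all 3 backup SRs share load
print(installed_paths(after_failure, ecmp=False))  # single SR: a bottleneck
```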
Note that we cannot enable Inter-SR routing without ECMP, so we won't encounter these scenarios.
Let’s confirm we have northbound reachability by pinging a VLAN network from this T0 SR Component.
SUCCESS!!! This T0 SR Component has northbound reachability via the other SR Components.
I hope the article was informative. Thanks for reading.