Inter-SR Routing is an iBGP peering feature between the SR components of a Tier 0 Gateway deployed in Active-Active mode. It helps tolerate an asymmetric failure of the SR component on an Edge node by routing internal traffic reaching that SR over the iBGP inter-SR link to the SR component on the other Edge node for northbound reachability. Currently, eBGP routes and static routes (both NSX-created and manual) are synced over the iBGP inter-SR link. Note that this is only applicable to Active-Active T0 Gateway deployments.
In this article, we will discuss what the T0 gateway Active-Active architecture looks like, the Ingress/Egress traffic patterns, how to enable Inter-SR routing and how northbound failure scenarios are handled. Before we proceed, I would recommend reading my previous post (linked below) on establishing eBGP peering between the T0 gateway and Dell Networking ToR switches, as I have used the same infrastructure for this post and many of the statements in this article build on it.
Let’s get started.
T0 Gateway Active-Active Architecture and Traffic flow
The sketch below shows the logical architecture of a T0 Gateway in an Active-Active deployment. It also illustrates the Ingress and Egress traffic directions.
This shows how the traffic direction is affected during an uplink failure of the T0 SR Component. We will discuss more about this towards the end of the article.
Let’s discuss the architecture. All the logical segments attach to the DR component of the Tier 1 Gateway. Depending on the stateful services in use, the T1 Gateway also has an SR component deployed. Note that the SR components are not distributed in nature; they always sit on the Edge nodes. Currently, only the Active/Standby option is available for the T1 SR components. The Standby SR component is in an operationally down state and takes over only when the Active SR component fails. The DR and SR components of the T1 gateway attach to each other using an NSX-T managed link on a 169.254.0.0/28 network (can be modified).
The T1 SR components attach to the DR component of the T0 Gateway over a T0-T1 transit link on a 100.64.160.X/31 subnet (can be modified). In an Active-Active deployment (with ECMP), the T0 DR component load-shares traffic across both SR components of the T0 Gateway. This is achieved with two default routes on the T0 DR component, each pointing to one of the T0 SR components as the next hop. The DR and SR components attach to each other using an NSX-T managed link on a 169.254.0.0/28 network (can be modified). In reality, a Transit Overlay logical segment is created for the T0 DR-SR connectivity, and it is completely transparent to the user.
On the T0 DR component, the routing entry looks like this:
- One default route 0.0.0.0/0 pointing to 169.254.0.2 (SR1 on Edge node 1) with AD value 1
- A second default route 0.0.0.0/0 pointing to 169.254.0.3 (SR2 on Edge node 2) with AD value 1
Load sharing is then achieved by the T0 DR routing process.
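As a rough illustration of this load sharing (the function and names here are hypothetical – the real T0 DR hashes flows in the datapath, not in Python), picking one of the two equal-cost next-hops per flow might be sketched like this:

```python
import hashlib

# Next-hops taken from the two default routes listed above (SR1, SR2).
NEXT_HOPS = ["169.254.0.2", "169.254.0.3"]

def pick_next_hop(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    """Pick an ECMP next-hop with a stable per-flow hash so that
    all packets of one flow always take the same SR component."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return NEXT_HOPS[digest % len(NEXT_HOPS)]

# The same 5-tuple always maps to the same SR component:
hop = pick_next_hop("172.16.10.5", "8.8.8.8", 49152, 443)
assert hop == pick_next_hop("172.16.10.5", "8.8.8.8", 49152, 443)
```

Because the hash is computed over the flow 5-tuple, packets of a given flow are never split across the two SR components, which avoids reordering while still spreading different flows over both Edge nodes.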
The T0 router has two Uplinks (on the SR components) via each of the Edge nodes. The SR component uplink connectivity looks like this:
- SR1 has an Uplink on VLAN 60 over Edge node 1 connecting to Dell Networking L3 Leaf 1
- SR2 has an Uplink on VLAN 70 over Edge node 2 connecting to Dell Networking L3 Leaf 2
Both SR Components establish eBGP peering on their respective VLANs with the Dell Networking L3 Leaf switches.
With Inter-SR Routing enabled on the BGP process of the T0 Gateway, the two SR components establish an internal SR link between each other over an NSX-managed subnet 169.254.0.0/25. An iBGP peering is then established between them. We will discuss more on this later in this article.
Let’s have a look at the Egress traffic pattern. Any traffic reaching a T0 SR component in the northbound direction will prefer the eBGP path to the Leaf switch rather than the iBGP link to the other SR component, because eBGP paths are preferred over iBGP paths in BGP path selection. The Inter-SR link is used only in a failure scenario, for example a failure of the uplink on SR2.
The same rule applies to Ingress traffic: the Leaf switches forward traffic to the T0 SR component with which they have an eBGP relationship, rather than using the iBGP link to their VLT peer.
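To make the eBGP-over-iBGP preference concrete, here is a simplified, illustrative Python sketch of the relevant best-path comparison steps (higher LOCAL_PREF, then shorter AS_PATH, then eBGP over iBGP). The class and field names are my own for illustration, not an NSX or BGP API:

```python
from dataclasses import dataclass

@dataclass
class BgpPath:
    prefix: str
    next_hop: str
    source: str        # "ebgp" or "ibgp"
    local_pref: int = 100
    as_path_len: int = 1

def best_path(paths):
    """Simplified BGP best-path: prefer higher LOCAL_PREF, then shorter
    AS_PATH, then eBGP over iBGP (the step that matters here)."""
    return min(paths, key=lambda p: (-p.local_pref, p.as_path_len,
                                     0 if p.source == "ebgp" else 1))

ebgp = BgpPath("10.0.0.0/24", "192.168.60.1", "ebgp")   # via the Leaf uplink
ibgp = BgpPath("10.0.0.0/24", "169.254.0.3", "ibgp")    # via the inter-SR link
assert best_path([ibgp, ebgp]) is ebgp  # eBGP wins; inter-SR stays as backup
```

With all earlier tie-breakers equal, the eBGP path always wins, which is why the inter-SR link carries no traffic in steady state.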
Enabling Inter-SR Routing for the BGP Process
For more details on how to configure BGP peering of the T0 Gateway with the external L3 Leaf DellEMC Networking, please refer to my previous post below. I am not covering the BGP configuration in this post.
Let’s do a quick overview of what the configuration looks like:
We have a Tier 0 Gateway deployed in Active-Active mode with two uplinks – One on VLAN 60 via Edge node 1 and second on VLAN 70 via Edge node 2.
We have a Tier 1 Gateway attached to the T0 Gateway. This T1 Gateway has 3 segments attached to it, using the networks 172.16.10.0/24, 172.16.20.0/24 and 172.16.30.0/24. They are advertised to the T0 Gateway (NSX-T managed).
BGP is enabled on the T0 Gateway and is peered with the L3 Leaf Switches over two VLANS – 60 & 70.
This is where Inter-SR Routing is enabled on the T0 Gateway. We are presented with just a boolean toggle button for this; NSX-T takes care of the configuration.
Now there are two SR components created for the T0 Gateway, one sitting on each of the Edge nodes. The T0 DR component connects to both of them over an Overlay Transit Segment, as discussed earlier.
Route Redistribution is enabled on the T0 Gateway to advertise all the T1 Connected Segments in BGP.
We can see that the iBGP session is established between the SR components. BFD is not used for this iBGP peering; instead, the keepalive timer is lowered to 1 second and the hold-down timer to 3 seconds to quickly detect a failure on this link.
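The effect of these timers can be sketched as a simple hold-timer check. This is purely illustrative (the class is hypothetical), using the 1 s keepalive / 3 s hold-down values mentioned above:

```python
HOLD_TIME = 3.0   # seconds, matching the inter-SR iBGP hold-down timer

class PeerMonitor:
    """Declare the iBGP peer down if no keepalive arrives within HOLD_TIME."""
    def __init__(self, now=0.0):
        self.last_keepalive = now

    def keepalive(self, now):
        self.last_keepalive = now

    def is_alive(self, now):
        return (now - self.last_keepalive) < HOLD_TIME

mon = PeerMonitor(now=0.0)
mon.keepalive(now=1.0)             # keepalives arrive every second
assert mon.is_alive(now=2.5)       # within the hold time
assert not mon.is_alive(now=4.5)   # 3.5 s of silence: peer declared down
```

With a 1-second keepalive interval, an inter-SR link failure is therefore detected within roughly 3 seconds, without needing BFD.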
Let’s login to the Edge nodes and check the iBGP Inter-SR links that are created.
Look at the iBGP Neighborship of SR components:
Inter-SR iBGP Route Advertisement
Both T0 SR components exchange eBGP and static routes with each other, with the next hop set to themselves. All these advertised routes carry a BGP community tag of NO_EXPORT. This means that any routes learned from the iBGP SR peer won’t be advertised to the external Leaf switches over the eBGP relationship.
Let’s look at the routes received by each SR component from its iBGP inter-SR peer.
Let’s look at the advertised routes as well.
Let’s look at the Community tagging in the iBGP relationship.
This is the running-configuration of the T0 SR Component. We can see that there is a Route-map named “autogenerated_rmap_for_inter_sr_peers_out” attached to the iBGP neighbor statement. This route map attaches the community tag of NO_EXPORT.
Let’s look at the community tagged routes. We can see that SR 2 on Edge node 2 advertises all the eBGP routes learned from the External Leaf switches and all the T1 Connected routes to its iBGP peer SR1 on Edge node 1 and vice-versa.
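The NO_EXPORT behavior above can be sketched as a simple outbound filter. This is illustrative Python only; the RIB structure is hypothetical, but 65535:65281 is the well-known NO_EXPORT community value:

```python
NO_EXPORT = "65535:65281"  # well-known NO_EXPORT community

def routes_to_advertise(rib, peer_type):
    """Outbound policy: routes carrying NO_EXPORT may be advertised to
    iBGP peers but never to eBGP peers (the external Leaf switches)."""
    out = []
    for route in rib:
        if peer_type == "ebgp" and NO_EXPORT in route["communities"]:
            continue  # suppress inter-SR learned routes toward the Leaves
        out.append(route["prefix"])
    return out

rib = [
    {"prefix": "10.0.0.0/24", "communities": [NO_EXPORT]},  # learned over inter-SR
    {"prefix": "172.16.10.0/24", "communities": []},        # locally originated
]
assert routes_to_advertise(rib, "ebgp") == ["172.16.10.0/24"]
assert routes_to_advertise(rib, "ibgp") == ["10.0.0.0/24", "172.16.10.0/24"]
```

This is exactly what the auto-generated route-map achieves: it prevents a routing loop where a Leaf would learn its own prefixes back through the other SR component.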
Just for additional information, if you want to view or export the running configuration of the service router, you can enable debug mode and issue the command “get service router running-config”.
Egress – Ingress Traffic Flow direction before a T0 Uplink Failure
Let’s revisit the architecture sketch at the beginning of the post. As mentioned, the T0 DR component has two default routes pointing to the SR components on both Edge nodes to achieve load sharing in an Active-Active scenario. This is the forwarding table of the T0 DR component.
Below is the forwarding table of the SR components. We can see that each SR component has two routes for the networks advertised from the external Leaf switches – one via the eBGP neighborship with the Leaf switch and the other via the iBGP relationship with the other SR component. As per the BGP path selection procedure, eBGP paths are preferred over iBGP paths. Hence the iBGP links act as a backup and come into play only in a failure scenario (e.g. an uplink failure). For routes to the T1 gateway, each SR component routes the traffic locally to the T1 SR component. A failure is not expected for these routes, but they are still learned via iBGP as well. The path selection criterion is different here: locally originated routes have priority over iBGP-learned routes.
The RED arrows show the routes advertised from the external Leaf switches. The BLUE arrow shows the routes on the T1 Gateway.
Let’s look at the BGP table of the Dell EMC Networking Leaf Switches. Leaf 1 eBGP peers with SR 1 on Edge node 1 over VLAN 60. Leaf 2 eBGP peers with SR 2 on Edge node 2 over VLAN 70. The leaf switches are in VLT and are iBGP peers with each other.
Each Leaf switch learns about the overlay networks in two ways – one via the eBGP relationship with the respective SR component on the Edges, and the other via the iBGP relationship with its VLT peer. As said before, eBGP routes are preferred over iBGP routes, hence southbound traffic from the Leaf switches is routed directly to the respective T0 SR components.
Egress – Ingress Traffic Flow direction after a T0 Uplink Failure
The below sketch shows the traffic flow direction after invoking an Uplink failure. In this case, the Uplink of SR2 on Edge Node 2 is brought down.
The SR component with the failed uplink will lose its eBGP peering with the external Leaf switch. Any northbound traffic that reaches this SR component will be routed over the iBGP inter-SR link to the SR component on the other Edge node, from where it is routed to the external Leaf switches. This is the output from the SR node with the failed uplink.
Similarly for Ingress traffic, the Leaf that loses the eBGP peering with the SR component will choose the iBGP path to its VLT peer, from where the traffic is routed to the respective SR component.
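The failover described above can be sketched as a simple path re-selection. This is illustrative Python only; the next-hop addresses follow the examples used earlier in the post:

```python
def select_active(paths):
    """Install the best available path: eBGP first, falling back to the
    iBGP inter-SR path when the eBGP path is withdrawn."""
    order = {"ebgp": 0, "ibgp": 1}
    alive = [p for p in paths if p["up"]]
    return min(alive, key=lambda p: order[p["source"]]) if alive else None

paths = [
    {"source": "ebgp", "next_hop": "192.168.70.1", "up": True},  # uplink to Leaf 2
    {"source": "ibgp", "next_hop": "169.254.0.2", "up": True},   # inter-SR to SR1
]
assert select_active(paths)["source"] == "ebgp"  # normal operation
paths[0]["up"] = False                           # uplink / eBGP session fails
assert select_active(paths)["source"] == "ibgp"  # traffic rerouted over inter-SR
```

The same logic applies symmetrically on the Leaf side, where the iBGP fallback path runs over the VLT peer instead of the inter-SR link.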
I hope this article was informative and gave you a good learning experience.
Thanks for reading
7 thoughts on “NSX-T Tier0 Inter-SR Routing Explained”
Thanks for posting. Your article very well explains the concept. Do you know if we can achieve ECMP for the Inter-SR link as well considering the T0 spanning over an Edge cluster with 4 edge nodes?
Hi Daniel, yes this is supported. When we enable ECMP under the BGP process, it sets the “maximum-paths” parameter for both eBGP and iBGP paths to 8. So up to a maximum of 8 iBGP Inter-SR routes are supported. I have an article on the same topic, please see https://vxplanet.com/2019/08/22/nsx-t-tier0-inter-sr-ibgp-ecmp-explained/
Thanks for another great article 🙂
I have a query about the egress traffic flow after uplink failure on SR2 – In “traditional” (non-NSX) routing, if SR2 lost its uplink then it would no longer receive the default route from the leaf and so it would stop advertising it to the DR. In your screenshots, you are not showing the DR forwarding table _after_ the failure, but would it not now only have just one next-hop for 0.0.0.0/0?
You write “Any northbound traffic that reaches the SR component with failed uplink […]”, but if the forwarding table on the DR only had one route (via SR1) then, there would never be any traffic reaching SR2 anyway?
I understand SR and DR are not running a conventional routing protocol between them, but I would have thought that the DR forwarding table would get updated (by CCP/LCP?) in that scenario. Can you clarify this point?
Thanks in advance,
This is a good question. Actually, a T0 SR node is considered dead when either or both of these conditions are met – the TEP interface of that Edge node is down, or all the T0 SR uplinks are down. If an Edge node is marked dead, the T0 DR construct is updated with only one default route – to the Edge node which is alive.
In this case, for demonstration, I just brought the T0 SR uplink adjacency down by shutting down the BGP neighborship. So the T0 DR still has 2 ECMP paths to both T0 SR constructs.
The real use case for this would be a multi-T0 env having a common shared T0 attached to a workload provisioner or orchestrator. In this model , each T0 gateway will have a connection to the shared T0 upstream as well as an uplink towards customer VRF.
First of all thanks for posting this. I started NSX-T 20 days back. As in my new working environment NSXT is there. I was very much confused about T0/T1 SR and DR concepts but the way you explained and the diagram helped me a lot, now these concepts are clear to me. I searched on google from different writers but finally got your POST and thank god you made concepts clear.
Thanks a lot Vikram, glad to hear it helped. Cheers
Brilliantly written Hari. Appreciate your work