Tier-0 Inter-SR Routing changes in NSX-T 3.0


Inter-SR routing is an iBGP peering option between the SR constructs of a Tier-0 Gateway deployed in Active-Active mode, intended to handle asymmetric northbound failures. It does so by routing traffic that doesn’t have a valid northbound route to the peer SR construct on the other Edge node(s) that still has northbound reachability. I wrote two articles on this previously with NSX-T 2.4, and you can read them below:

Inter-SR Routing Explained : https://vxplanet.com/2019/07/12/nsx-t-tier0-inter-sr-routing-explained/

Inter-SR Routing ECMP : https://vxplanet.com/2019/08/22/nsx-t-tier0-inter-sr-ibgp-ecmp-explained/

The same concepts apply to NSX-T 3.0 as well, but with a few changes to the control plane, which we will cover in this article.

Let’s get started.

Pre NSX-T 3.0 versions

Here is a quick summary of Inter-SR iBGP for pre-NSX-T 3.0 versions:

  • Inter-SR iBGP peering is maintained in the BGP default VRF instance.
  • Inter-SR iBGP routes are tagged with the BGP “No-Export” community, which prevents iBGP-learned routes from being advertised to the northbound eBGP Leaf peers. This is to prevent the T0 Edges from serving as a transit path for the Leaf switches under certain scenarios (see the configuration sketch after this list).
  • Inter-SR iBGP backup routes are maintained along with the eBGP routes in the BGP default table.
  • An edge cluster supports up to 8 nodes, which means an edge node (SR) maintains up to 7 additional iBGP ECMP paths in the BGP default table when deployed in Active-Active mode.
  • The default VRF maintains both the BGP control plane and the data plane. Inter-SR routing adds further control-plane state (prefixes and paths) to it, which doesn’t scale well.
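
As a rough sketch of the mechanism (this is not NSX-T’s actual generated configuration; the ASN, peer address and route-map name are illustrative), tagging routes advertised to an inter-SR peer with No-Export looks like this in FRR-style syntax:

    router bgp 65000
     neighbor 169.254.0.131 remote-as 65000
     !
     address-family ipv4 unicast
      neighbor 169.254.0.131 route-map INTER-SR-OUT out
     exit-address-family
    !
    route-map INTER-SR-OUT permit 10
     set community no-export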

Let’s see how that looked:

This is the inter-SR iBGP peering established in the default VRF. The BGP default VRF table maintains both eBGP and iBGP paths; eBGP paths are placed into the default VRF RIB according to the BGP best-path selection process.

The RED arrow shows the prefixes advertised from the Leaf switches, and the BLUE arrow shows the prefixes from the downstream T1 Gateways. Note the iBGP backup routes for each prefix.

Scaling out an Active-Active edge cluster means more iBGP sessions per edge node and subsequently more ECMP prefixes on the BGP default VRF table.

Inter-SR iBGP advertised routes are tagged with a “No-Export” BGP community, hence the prefixes are not advertised northbound to other eBGP neighbors.

In a complete northbound failure scenario that is not L1-adjacent (such as a shutdown eBGP neighbor), the inter-SR routes are placed into the forwarding table. This is driven by the BGP process.

What’s changed with NSX-T 3.0?
  • A new control VRF for Inter-SR iBGP peering (called ‘inter_sr_vrf’). iBGP backup routes are no longer placed in the BGP default VRF table.
  • Northbound eBGP peerings are established in the default VRF, and the eBGP routes are placed in the default BGP table.
  • All prefixes from the default BGP table are leaked into the Inter-SR VRF (via VRF import) with a BGP weight attribute of 32765. These leaked routes are advertised to all other iBGP SR peers; weight is a local attribute and is not carried in advertisements (see the sketch after this list).
  • The ‘No-Export’ community tag used in previous NSX-T versions is not needed here, as the Inter-SR VRF doesn’t have any eBGP peering interfaces.
  • The Inter-SR VRF has northbound access via a route “leaked” from the default VRF.
  • Currently, only the default VRF is leaked into the Inter-SR VRF. Support for custom T0 VRFs is not available.
  • Scaling out Edge clusters won’t add overhead to the default VRF (which maintains the data plane), as all iBGP backup routes are maintained in the Inter-SR control VRF.
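
Conceptually, this maps onto FRR’s BGP VRF route-leaking feature. Below is a minimal sketch of what such a configuration could look like; the ASN, peer address and route-map name are illustrative, and the actual NSX-T-generated configuration may differ (in particular, how the weight of 32765 gets applied is an assumption here):

    router bgp 65000 vrf inter_sr_vrf
     neighbor 169.254.0.131 remote-as 65000
     !
     address-family ipv4 unicast
      ! leak all prefixes from the default VRF, applying a weight on import
      import vrf route-map WEIGHT-IMPORT
      import vrf default
     exit-address-family
    !
    route-map WEIGHT-IMPORT permit 10
     set weight 32765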

Let’s lab this and understand it more clearly.

Enabling Inter-SR iBGP

As in previous versions, Inter-SR iBGP is enabled under the BGP configuration of the T0 Gateway.

However, we won’t see the peering details in the default VRF.

The Inter-SR control VRF is not exposed to us, but we can access it on the backend via the FRRouting docker instance, in an unsupported way.

Accessing the T0 SR Backend (FRRouting docker instance)

Note: The below steps are not officially supported and should not be performed on a production setup.

FRRouting is used as the backend for Tier 0 Gateway SR routing. Each SR construct is backed by a docker containerized instance of FRRouting running on the Edge nodes. FRRouting instances are managed together by the NSX-T manager and the Edge control plane.

Login to an NSX-T Edge node as ‘root’ and connect to the FRRouting container instance.

frr.conf under /etc/frr is the running configuration of the FRRouting instance. Launch the FRR shell ‘vtysh’ to run commands.
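
A minimal sketch of those steps (the container name is deployment-specific, so identify it from the ‘docker ps’ output first):

    # On the Edge node, logged in as root (unsupported):
    docker ps                           # locate the FRRouting container
    docker exec -it <frr-container> bash
    cat /etc/frr/frr.conf               # the running configuration on disk
    vtysh                               # launch the FRR shell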

List the VRFs. We can see ‘inter_sr_vrf’ among the list.
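
For example (the VRF and table IDs below are illustrative):

    frr# show vrf
    vrf default id 0 table 254
    vrf inter_sr_vrf id 4 table 1041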

The running configuration will show additional settings for this VRF. The command is ‘show running-config’.

As mentioned earlier, the Inter-SR VRF has a default route ‘leaked’ from the default VRF, through which it gets northbound access, as highlighted below.

All prefixes from the ‘default’ routing table are imported into the Inter-SR VRF and tagged with a ‘Weight’ attribute of 32765. The weight attribute is local and won’t be advertised along with the prefixes to other iBGP SR peers.
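
Both views can be pulled from vtysh with standard FRR show commands:

    frr# show ip bgp vrf inter_sr_vrf      # BGP table with the leaked prefixes (weight 32765)
    frr# show ip route vrf inter_sr_vrf    # what actually made it into the VRF RIB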

Each prefix in the Inter-SR VRF has an extended community tag automatically applied during the VRF export/import operation. Also, an attribute called ‘Originator’ (carrying the BGP Router ID of the advertising router) is added for loop prevention of VRF-leaked routes.
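
To see these attributes on a single prefix, query it directly; 172.16.20.0/24 is the lab prefix used later in this article:

    frr# show ip bgp vrf inter_sr_vrf 172.16.20.0/24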

Handling northbound failures
  • Complete non-L1 adjacent failure

If all the Edge uplinks have an L1-adjacent failure, it is considered an Edge failover event. All stateful services are failed over to other edge nodes in this case, and inter-SR recovery doesn’t come into play here.

The candidate to test inter-SR iBGP is a non-L1-adjacent failure. If the failure is not L1-adjacent, for example a shutdown of the eBGP neighborships on the Leaf switches, BFD notifies the BGP process about the failed adjacency and the prefixes are removed from the BGP default table. As the Inter-SR VRF does a ‘vrf import’ from the default VRF, it now sees no routes there and instead places the routes learned from its iBGP SR peer into its VRF table. The Edge control plane then imports the routes missing from the default VRF out of the Inter-SR VRF (which were in fact learned from the iBGP SR peer), and the forwarding table is updated with the next hop pointing to the iBGP peers (with ECMP).

This is the BGP default table and the forwarding table on an edge node before the failure. For each prefix, we have eBGP paths northbound to the Leafs.

This is the BGP table and the forwarding table for the Inter-SR VRF before the failure. The prefixes are learned via VRF import from the local BGP default table (with weight 32765) and from the iBGP SR peer. As paths with higher weights are preferred, the VRF-imported routes are placed into the RIB of the Inter-SR VRF.
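
A quick way to capture this ‘before’ state side by side from vtysh:

    frr# show ip bgp                       # BGP table, default VRF (eBGP paths to the Leafs)
    frr# show ip route bgp                 # BGP routes in the default VRF RIB
    frr# show ip bgp vrf inter_sr_vrf      # leaked routes (weight 32765) plus iBGP SR paths
    frr# show ip route vrf inter_sr_vrf    # RIB of the Inter-SR VRF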

Now let’s bring down the eBGP adjacency with the Leaf switches by shutting down the neighborship.
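
Assuming the Leafs run an FRR/IOS-style CLI, the shutdown looks like this (the ASN and neighbor address are illustrative):

    router bgp 65100
     ! administratively shut down the session towards the Edge uplink
     neighbor 10.10.10.2 shutdown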

The BGP table and RIB for the default VRF will not have any prefixes from the Leafs.

The Inter-SR VRF will update its BGP table with the next hop pointing to the peer SR.

Now the Edge Control plane will program the edge forwarding table with prefixes from the Inter-SR VRF.
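
This can also be verified from the supported NSX-T Edge CLI (nsxcli); the VRF number below is illustrative and comes from the ‘get logical-routers’ output, and prompts are abbreviated:

    edge> get logical-routers        # note the VRF number of the Tier-0 SR
    edge> vrf 1                      # enter that VRF context
    edge(vrf)> get route bgp         # BGP routes in the RIB
    edge(vrf)> get forwarding        # forwarding table, next hop now the iBGP SR peer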

  • Partial / asymmetric failure

If the failure is partial or asymmetric, where not all but just a few prefixes are missing, the edge control plane programs the forwarding table with the missing prefixes taken from the Inter-SR VRF BGP table.

To simulate a partial failure, let’s apply a route-map on the BGP neighbor statement on the Leaf to filter out the 172.16.20.0/24 prefix.
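
A sketch of that filter in FRR/IOS-style syntax on the Leaf (the prefix-list and route-map names, ASN and neighbor address are illustrative):

    ip prefix-list FILTER-OUT seq 5 permit 172.16.20.0/24
    !
    route-map EDGE-OUT deny 10
     match ip address prefix-list FILTER-OUT
    route-map EDGE-OUT permit 20
    !
    router bgp 65100
     address-family ipv4 unicast
      neighbor 10.10.10.2 route-map EDGE-OUT out
     exit-address-family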

The prefix won’t appear in the BGP default VRF table, so the edge control plane picks it up from the Inter-SR VRF table and updates the forwarding table.

The route is now accessible via the iBGP SR peer.

I hope the article was informative.

Thanks for reading
