
Welcome back!!! We are at Part 2 of the blog series on NSX Stateful Active-Active Gateways. In Part 1, we dealt with a single-tier routing scenario where the logical segments were attached to a stateful A/A T0 gateway, and discussed edge sub-clusters, interface groups, shadow and peer-shadow interfaces, traffic punting, edge node selection for traffic flows and more. In case you missed it, the link to Part 1 is included at the end of this article.
In this article, we will extend the topology to a stateful A/A two-tier design by attaching stateful Active-Active Tier-1 gateways downstream of the T0 gateway, and walk through the changes in configuration and routing involved. Let's get started.
Two-tier stateful Active-Active gateway
We will use the same T0 configuration (stateful Active-Active) that we used in the previous article, so we won't discuss any T0-specific configuration here. We will attach two stateful Active-Active T1 gateways downstream, to which the workloads will be attached. Below is the new topology we have set up for this article.

This topology has:
- The same T0 configuration we had in Part 1, including interfaces, interface groups, edge sub-clusters, BGP configuration and route redistribution
- The same edge cluster used for the stateful A/A T0 gateway will host the stateful Active-Active T1 gateways. This is currently a requirement to support stateful Active-Active services on T1.
- The following T1 gateways are created in stateful Active-Active mode:
- LR_T1_Tenant_01_DevApps
- LR_T1_Tenant_02_StgApps
- Logical segment LS-DevApps01 is attached to LR_T1_Tenant_01_DevApps
- Logical segment LS-StgApps01 is attached to LR_T1_Tenant_02_StgApps
The configuration is pretty much the same as a traditional Active-Standby gateway configuration, but with a few additional settings (similar to the T0 gateway configuration in Part 1), which we will discuss below.


At this moment, this is how our new topology looks (taken from the NSX topology viewer).

Now, with regards to routing:
- This is two-tier routing, with the logical segments attached to their respective T1 gateways.
- Both T1 gateways are in stateful Active-Active mode and are instantiated in an edge cluster. Hence, we don't have any T0 DR construct on the ESXi transport nodes.
- There is T1 DR to T1 SR ECMP (four paths) on all the ESXi transport nodes. This is a notable change in the routing behaviour at the T1 DR.
- Edge nodes always prefer local forwarding, i.e. the T1 DR next-hops to the T1 SR on the same edge node. The same applies to the T0 DR and SR.
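To picture the ECMP behaviour at the T1 DR, here is a minimal Python sketch. The two-tuple flow hash and the next-hop addresses below are purely illustrative assumptions (the actual ECMP hashing is internal to the NSX data path); the point is simply that the T1 DR on each ESXi host now has four equal-cost T1 SR next hops to choose from.

```python
import zlib

# Hypothetical backplane addresses of the four T1 SR next hops
# seen by the T1 DR on an ESXi transport node.
T1_SR_NEXT_HOPS = ["169.254.0.2", "169.254.0.3", "169.254.0.4", "169.254.0.5"]

def pick_ecmp_next_hop(src_ip: str, dst_ip: str) -> str:
    """Illustrative flow hash: pin a flow onto one of the four equal-cost paths."""
    flow_key = f"{src_ip}-{dst_ip}".encode()
    return T1_SR_NEXT_HOPS[zlib.crc32(flow_key) % len(T1_SR_NEXT_HOPS)]

print(pick_ecmp_next_hop("192.168.1.10", "8.8.8.8"))
```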
Below are the T1 DR next hops (ECMP) from an ESXi transport node. Notice the four ECMP paths northbound.

So far, everything is pretty much the same routing methodology we dealt with previously, except that we now have ECMP at the T1 DR-SR level on the ESXi host transport nodes.
Similar to what we saw in Part 1 for the stateful Active-Active T0 gateway, traffic punting also happens at the T1 SR level to support stateful flows, based on the same hashing algorithm: the hash is calculated on the destination IP address of the packet if the flow is northbound, and on the source IP address if the flow is southbound.
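This selection logic can be summarised in a short Python sketch. The hash function and the edge node names below are hypothetical placeholders, not NSX internals; what matters is the direction-dependent key (destination IP for northbound, source IP for southbound) and the punt decision when the receiving edge node is not the owner of the flow.

```python
import zlib

EDGE_NODES = ["edge-01", "edge-02", "edge-03", "edge-04"]  # hypothetical names

def stateful_owner(src_ip: str, dst_ip: str, direction: str) -> str:
    """Pick the edge node that owns the stateful flow.

    Northbound flows hash on the destination IP; southbound flows hash on the
    source IP, so both directions of a flow resolve to the same owner.
    """
    key = dst_ip if direction == "northbound" else src_ip
    return EDGE_NODES[zlib.crc32(key.encode()) % len(EDGE_NODES)]

def handle_packet(local_edge: str, src_ip: str, dst_ip: str, direction: str) -> str:
    owner = stateful_owner(src_ip, dst_ip, direction)
    if owner == local_edge:
        return f"{local_edge}: process locally"
    return f"{local_edge}: punt to {owner} over the punt logical segment"

print(handle_packet("edge-02", "192.168.1.10", "8.8.8.8", "northbound"))
```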
Now, let's walk through the new terminology and changes:
Edge Sub-clusters
Similar to the edge sub-clusters created for the T0 gateway, we will have edge sub-clusters defined for each stateful Active-Active T1 gateway. Sub-cluster creation and edge node selection are handled automatically by the system. Sub-clusters are defined at the scope of the gateway's SR. That means the sub-clusters of one T1 gateway will have different UUIDs than the sub-clusters of another T1 gateway, and their membership might also vary.
Note that the same edge cluster needs to be shared across the T0 and T1 gateways to support stateful Active-Active services on T1.
Also, as mentioned in Part 1, edge clusters need to be scaled in even numbers to ensure we have redundancy for stateful services within an edge sub-cluster.
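As a conceptual illustration only (sub-cluster creation is fully automated by NSX and the actual grouping logic is internal), pairing an even number of edge nodes into two-node sub-clusters, each with its own UUID, could be pictured like this:

```python
from uuid import uuid4

def build_sub_clusters(edge_nodes: list[str]) -> dict[str, list[str]]:
    """Group edge nodes into two-node sub-clusters (illustrative only).

    An even number of edge nodes ensures that every sub-cluster has a peer
    edge node to provide redundancy for the stateful services it handles.
    """
    assert len(edge_nodes) % 2 == 0, "scale the edge cluster in even numbers"
    return {str(uuid4()): list(pair) for pair in zip(edge_nodes[::2], edge_nodes[1::2])}

print(build_sub_clusters(["edge-01", "edge-02", "edge-03", "edge-04"]))
```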

The high availability status of the edge T1 SRs will now display the sub-clusters of which they are members.

Please check out Part 1 to understand more about edge sub-clusters.
Interface Groups
There are two interface groups (IFGs) that are auto-created by the system for each stateful A/A T1 gateway:
- Interface group (T1 SR uplink) – a system-created interface group for the T1 SR uplink interfaces that attach to the T0 DR (the RouterLink ports). This has four interfaces (one from each edge node).
- Interface group (T1 SR – DR backplane) – a system-created interface group for the T1 SR – DR backplane interfaces. This has four interfaces (one from each edge node).
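One simple way to picture the relationship is with a small data model. This is an illustration only; the dataclass and interface names below are assumptions, not the actual NSX object schema.

```python
from dataclasses import dataclass, field

@dataclass
class InterfaceGroup:
    """One interface group per role, with one member interface per edge node."""
    role: str                                              # e.g. "t1-sr-uplink" or "t1-sr-dr-backplane"
    members: dict[str, str] = field(default_factory=dict)  # edge node -> interface name

# Hypothetical member interfaces for LR_T1_Tenant_01_DevApps.
uplink_ifg = InterfaceGroup(
    "t1-sr-uplink", {f"edge-0{i}": f"t1-uplink-edge0{i}" for i in range(1, 5)})
backplane_ifg = InterfaceGroup(
    "t1-sr-dr-backplane", {f"edge-0{i}": f"t1-backplane-edge0{i}" for i in range(1, 5)})

print(len(uplink_ifg.members), len(backplane_ifg.members))  # 4 and 4
```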
Similar to the T0 gateway, we also have shadow interfaces and peer-shadow interfaces for the member interfaces of each interface group, which we will cover next.
The below sketch shows the interface group created for the T1 SR uplink interfaces.

The below sketch shows the interface group created for the T1 SR – DR backplane interfaces.

Below is the interface group configuration from our topology.
Note: this is auto-created.


Below is a CLI output from one of the edge nodes.

Shadow and peer-shadow interfaces
As we saw in Part 1 for the stateful A/A T0 gateway configuration, shadow and peer-shadow interfaces are created for the T1 SR uplink and T1 SR backplane interfaces, and the same concepts discussed in Part 1 apply here as well. Shadow and peer-shadow interfaces are auto-created by the system and are attached to a system-generated punt logical segment that spans all the edge nodes in the cluster. This punt LS is unique per gateway, which means that for the current topology with one stateful A/A T0 gateway and two stateful A/A T1 gateways, we will have three punt logical segments.
Shadow interfaces are always operationally UP.
Peer-shadow interfaces are operationally DOWN and will come up only if a peer edge node failure is detected.
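That operational behaviour boils down to a few lines of state logic. The sketch below is only an illustration of the described behaviour (the real failure detection and takeover mechanism is internal to the edge data path):

```python
from dataclasses import dataclass

@dataclass
class ShadowPair:
    """Shadow/peer-shadow state for one interface-group member on an edge node."""
    shadow_up: bool = True        # the shadow interface is always operationally UP
    peer_shadow_up: bool = False  # the peer-shadow stays DOWN while the peer is healthy

    def on_peer_edge_failure(self) -> None:
        # When the peer edge node in the sub-cluster fails, the peer-shadow
        # interface comes up so the surviving node can take over the flows.
        self.peer_shadow_up = True

pair = ShadowPair()
pair.on_peer_edge_failure()
print(pair.shadow_up, pair.peer_shadow_up)  # True True
```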
The below sketch shows the shadow and peer-shadow interfaces for an interface (T1 SR uplink or T1 SR backplane) that is part of an interface group. Traffic is punted through the shadow interface over the punt logical segment (shown below) to reach the right edge node for statefulness.

Now let's verify the shadow interfaces, peer-shadow interfaces and shadow MACs using the CLI. We will try this on edge node 1.
We can see that the shadow MACs are in sync across all the edge nodes for the external interface.

We see shadow and peer-shadow interfaces created for the T1 SR uplink interface.

We see shadow and peer-shadow interfaces created for the T1 SR – DR backplane interface.

Punt logical segment
Punt logical segments carry the traffic that is punted between the edge nodes. A punt logical segment is created for every stateful Active-Active gateway. For the current topology with one stateful A/A T0 gateway and two stateful A/A T1 gateways, we have three punt logical segments.

Determining the edge node for traffic flows
Similar to what we discussed in Part 1, the edge node for a specific stateful flow is selected using an IP hash based on the source or destination IP address, depending on the traffic direction.
For northbound traffic (S -> N), the edge node is selected based on a hash of the destination IP address. This IP hashing is performed as soon as the traffic reaches the backplane interface of the T1 SR construct on an edge node, and traffic is punted to the right edge node for further northbound lookups. IP hashing happens again at the northbound tiers (for example, at the T0 SR – DR backplane interface), but the hash of the destination IP address will always select the same edge node.
For southbound traffic (N -> S), the edge node is selected based on a hash of the source IP address instead. This IP hashing is performed as soon as the traffic reaches the uplink interface of the T0 SR or T1 SR construct on an edge node, and traffic is punted to that edge node for further southbound lookups.
Let's test this for two scenarios:
- Traffic flowing northbound to an external destination (eg: 8.8.8.8) from a segment attached to LR_T1_Tenant_01_DevApps
- Traffic flowing east-west to a destination on a different stateful A/A T1 gateway, LR_T1_Tenant_02_StgApps (eg: 192.168.2.10), from a segment attached to LR_T1_Tenant_01_DevApps
Traffic flow northbound to an external destination (eg: 8.8.8.8) from a segment attached to LR_T1_Tenant_01_DevApps
The below sketch shows the traffic flow northbound to an external destination (eg: 8.8.8.8).

The below sketch shows the traffic flow southbound from the external destination (return traffic).

We see that:
- For the northbound flow, traffic initially reaches edge node 2 (through T1 DR to SR ECMP on the ESXi transport nodes) and is punted to edge node 3 (destination IP hash) for further northbound lookups and egress.
- For the southbound flow (return traffic), traffic initially reaches edge node 4 (through T0 SR uplink ECMP on the external fabric) and is punted to edge node 3 (source IP hash) for further southbound lookups. Notice that both directions land on the same edge node, as illustrated below.
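The reason is that both directions of this flow hash on the same address, the external IP: northbound on the destination (8.8.8.8) and southbound on the source (again 8.8.8.8). Here is a small worked illustration, reusing the hypothetical hash from earlier (the specific edge node it picks is not meaningful, only the fact that the two results are equal):

```python
import zlib

EDGE_NODES = ["edge-01", "edge-02", "edge-03", "edge-04"]  # hypothetical names

def owner(key_ip: str) -> str:
    return EDGE_NODES[zlib.crc32(key_ip.encode()) % len(EDGE_NODES)]

external_ip = "8.8.8.8"

northbound_owner = owner(external_ip)  # outbound packet: hash on destination IP
southbound_owner = owner(external_ip)  # return packet: hash on source IP (same address)
assert northbound_owner == southbound_owner  # both directions pin to the same edge node
```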

Let's do an NSX Traceflow and verify.


Traffic flow east-west to a destination on LR_T1_Tenant_02_StgApps (eg: 192.168.2.10) from a segment attached to LR_T1_Tenant_01_DevApps
The below sketch shows the flow from a segment attached to LR_T1_Tenant_01_DevApps to a segment attached to a different stateful A/A T1 gateway, LR_T1_Tenant_02_StgApps.

We see that:
- For the flow from the source (LR_T1_Tenant_01_DevApps) to the destination (LR_T1_Tenant_02_StgApps), traffic initially reaches edge node 4 (through T1 DR to SR ECMP on the ESXi transport nodes) and is then punted to edge node 2 (destination IP hash) for further northbound lookups. The traffic is then routed to the destination T1 SR (LR_T1_Tenant_02_StgApps) by the T0 DR construct.
- As the traffic reaches the T1 SR of the destination gateway (LR_T1_Tenant_02_StgApps), it is treated as a southbound flow, so the edge node is determined using an IP hash of the source IP address. We see that the traffic is punted to edge node 4 for further southbound lookups. The two hash decisions are illustrated below.
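In the east-west case, each stateful A/A T1 gateway makes its own owner decision with a different key: the source-side T1 SR hashes the destination IP (192.168.2.10), while the destination-side T1 SR hashes the source IP of the flow. A tiny sketch of this, again using a hypothetical hash and a hypothetical source workload IP:

```python
import zlib

EDGE_NODES = ["edge-01", "edge-02", "edge-03", "edge-04"]  # hypothetical names

def owner(key_ip: str) -> str:
    return EDGE_NODES[zlib.crc32(key_ip.encode()) % len(EDGE_NODES)]

src_ip, dst_ip = "192.168.1.10", "192.168.2.10"  # source workload IP is hypothetical

devapps_owner = owner(dst_ip)  # at LR_T1_Tenant_01_DevApps: hash on the destination IP
stgapps_owner = owner(src_ip)  # at LR_T1_Tenant_02_StgApps: hash on the source IP

# The two gateways key on different addresses, so the flow may be pinned to a
# different edge node at each T1 SR (edge node 2 and edge node 4 in our lab).
print(devapps_owner, stgapps_owner)
```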


Let's do an NSX Traceflow and verify.




That’s a wrap!!!
We will meet again in Part 3 to discuss stateful Active-Active routing considerations and packet walks. Stay tuned!!!
I hope the article was informative. Thanks for reading.
Continue reading? Here are the other parts of this series:
Part 1 : https://vxplanet.com/2023/01/24/nsx-4-0-1-stateful-active-active-gateway-part-1-single-tier-routing/
