As of NSX-T version 2.5, VMware Enterprise PKS supports two deployment models with respect to T0 and T1 Gateways:
- Active-Passive T0 Gateway with a dedicated T1 Gateway per Kubernetes namespace.
- Active-Active T0 Gateway with a dedicated T1 Gateway per Kubernetes cluster (NSX-T v2.5 onwards)
In the first deployment model, a T1 Gateway is created for every namespace in the Kubernetes cluster. Each namespace gets a unique subnet carved out of a POD block defined in NSX-T Manager. All the stateful services in the deployment (like SNAT/DNAT rules for the Kubernetes PODs) are created on the Tier 0 Gateway, so the T0 has to be deployed in Active-Standby mode. With BGP configured as the routing protocol, we get only a single northbound path because ECMP can’t be leveraged in an Active-Standby T0 deployment. Also, as the Kubernetes clusters grow, the number of T1 objects created in NSX-T Manager grows with them, which clutters the inventory.
The second model is a simplified deployment option, available since NSX-T v2.5, which uses a shared Tier 1 Gateway. Here a dedicated Tier 1 Gateway is assigned to a Kubernetes cluster (rather than to a namespace) and all stateful services for that Kubernetes cluster are pushed down from the T0 to its dedicated T1. This gives us the flexibility to deploy the T0 Gateway in Active-Active mode and leverage ECMP for North-South routing.
In both deployment models, we use BGP (or static routes if necessary) to advertise the NSX-T networks (Kubernetes node network, POD network, management network and other floating networks) based on the adopted topology. To know more about the supported topologies associated with these deployment models, have a look at the Pivotal reference below:
The PKS subnets are defined in NSX-T prior to deployment. Below are the subnets along with a description that I am using for this article.
Node Network – This is the network that will be used by the Kubernetes nodes deployed by the PKS BOSH Director. We define a /16 pool here; PKS carves a /24 block out of it and NSX-T Manager associates that block with a dedicated Tier 1 segment. The node network used in this setup is 172.31.0.0/16. This is a routable subnet.
POD Network – This is the network that will be used by each Kubernetes namespace. We define a /16 pool here; PKS carves a /24 block out of it and NSX-T Manager associates that block with a dedicated Tier 1 segment. All PODs in the same namespace attach to the same logical segment. The POD network used in this setup is 172.30.0.0/16. This is a non-routable subnet, so a NAT instance is needed for the PODs to get external access.
Floating Pool Network – This is a routable block that will be used for SNAT instances and Loadbalancer VIPs. We define a /24 block and NSX-T carves a /32 IP out of it. This is used whenever a Loadbalancer instance is required for Kubernetes (like an LB for KubeAPI, Ingress Controller etc.) as well as for SNAT instances for POD networks. The floating pool network used in this setup is 192.168.105.0/24.
PKS Management Network – This is the management network that hosts the PKS management and control plane VMs. It sits on a dedicated T1 instance attached to the T0 Gateway. The management network is 192.168.101.0/24 and is manually created prior to PKS deployment.
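To make the carving concrete, here is a minimal sketch using Python's standard `ipaddress` module and the pools listed above. It assumes blocks are handed out sequentially from the start of each pool, which is only an illustration; the actual allocation order is decided by PKS and NSX-T. The helper name `carve_blocks` is made up for this example.

```python
import ipaddress
from itertools import islice

# Pools exactly as defined in this article.
NODE_POOL = ipaddress.ip_network("172.31.0.0/16")
POD_POOL = ipaddress.ip_network("172.30.0.0/16")
FLOATING_POOL = ipaddress.ip_network("192.168.105.0/24")


def carve_blocks(pool, new_prefix, count):
    """Return the first `count` sub-blocks of length `new_prefix` from `pool`.

    Sequential allocation is an assumption made for illustration only.
    """
    return list(islice(pool.subnets(new_prefix=new_prefix), count))


node_blocks = carve_blocks(NODE_POOL, 24, 2)  # /24 blocks for the node network
pod_blocks = carve_blocks(POD_POOL, 24, 3)    # one /24 per namespace
# hosts() skips the network and broadcast addresses of the /24 pool.
floating_ips = [
    ipaddress.ip_network(f"{ip}/32") for ip in islice(FLOATING_POOL.hosts(), 3)
]

print("node blocks: ", [str(b) for b in node_blocks])
print("POD blocks:  ", [str(b) for b in pod_blocks])
print("floating IPs:", [str(i) for i in floating_ips])
# Upper bound on the number of /24 blocks a /16 pool can supply.
print("max /24 blocks per /16 pool:", 2 ** (24 - POD_POOL.prefixlen))
```

Every block carved this way becomes its own prefix in NSX-T, which is why the number of advertised routes matters later on.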
This is the architecture of the Active-Standby T0 Gateway PKS deployment model with dedicated T1 Gateways per Kubernetes namespace.
Observations from BGP advertisement
When a Kubernetes cluster is deployed by PKS, all the T1 Gateways for the Kubernetes nodes and PODs are created and configured automatically and attached to the T0 Gateway. The subnets on the T1 Gateways are advertised (auto-plumbed routes) to the T0 Gateway, redistributed into the BGP process and received by the ToR Leaf Switches. The advertised networks include the Kubernetes node network, the POD network and the Floating networks used by the LB and NAT instances. Some observations:
- PKS carves a /24 block out of the 172.30.0.0/16 pool for the POD network, and the POD subnets are advertised in BGP as /24 networks. This creates a lot of unnecessary entries in the BGP routing table as the number of namespaces grows.
- Per the initial requirement, the POD network should not be routable and should use a NAT instance instead. So we need to avoid advertising the POD subnet 172.30.0.0/16.
- PKS carves a /24 block out of the 172.31.0.0/16 pool for the Kubernetes node network, and these blocks are advertised in BGP as /24 networks. This creates a lot of unnecessary entries in the BGP routing table as the number of Kubernetes clusters grows.
- PKS carves a /32 IP out of the 192.168.105.0/24 pool for the Floating network used by LB and NAT instances, and these IPs are advertised in BGP as /32 networks. This again clutters the routing table.
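A back-of-the-envelope sketch of how the number of advertised prefixes grows under these observations. The counting model is an assumption made for illustration: one node /24 per cluster, one POD /24 and one SNAT /32 per namespace, and one /32 per Loadbalancer VIP. The function name and the sample numbers are hypothetical.

```python
def advertised_prefixes(clusters, namespaces_per_cluster, lb_vips_per_cluster):
    """Estimate how many prefixes reach the leaf switches before summarization.

    Assumed model (illustrative only): one node /24 per cluster, one POD /24
    and one SNAT /32 per namespace, one /32 per Loadbalancer VIP.
    """
    namespaces = clusters * namespaces_per_cluster
    return {
        "node /24": clusters,
        "POD /24": namespaces,
        "SNAT /32": namespaces,
        "LB VIP /32": clusters * lb_vips_per_cluster,
    }


counts = advertised_prefixes(clusters=3, namespaces_per_cluster=10, lb_vips_per_cluster=2)
for kind, count in counts.items():
    print(f"{kind:<11}{count}")
print("total:", sum(counts.values()))
```

The point is only that the total scales with the number of namespaces, while the three pools behind these prefixes stay fixed.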
Few Routing Decisions
- We will filter the POD networks out of the BGP advertisements and keep them on a non-routable POD subnet. SNAT rules on the T0 Gateway enable outbound access for the PODs and DNAT rules enable inbound access.
- We should have the flexibility to make specific POD namespaces (custom apps) routable if the requirement arises in the future.
- Each namespace should get a dedicated floating pool IP for SNAT to make granular policy decisions at a later point of time.
- The Kubernetes node network should be routable and summarized into a /16 block to keep the routing table entries short
- The Floating pool networks should be summarized into a /24 block to keep the routing table entries short
- PKS management network should also be on a routable subnet. This is created prior to PKS deployment.
Enterprise PKS Architecture (Active-Standby T0 Gateway)
This is the architecture (same as shown above) that we use for this blog post. This is the deployment model with Active-Standby T0 Gateway. The following procedure also works for Active-Active T0 with Shared T1 deployment model.
The T0 Gateway is deployed in Active-Standby mode with four uplinks over the two Edge nodes. BGP is enabled and neighborship is established with the two Leaf Switches in VLT. We have four eBGP peering sessions in total.
Each Edge node will peer to separate Leaf switches over two VLANs – VLAN 60 & 70. These are the T0 interfaces.
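The count of four peering sessions follows from two Edge nodes with two uplinks each. A tiny sketch that enumerates them; the names `edge-1`, `edge-2`, `leaf-1`, `leaf-2` and the mapping of VLAN 60 to one leaf and VLAN 70 to the other are assumptions for illustration, not taken from the actual setup.

```python
# Hypothetical names; only the VLAN IDs come from the article.
EDGE_NODES = ["edge-1", "edge-2"]
# Assumed mapping: each VLAN carries the uplink towards one leaf switch.
UPLINKS = [(60, "leaf-1"), (70, "leaf-2")]

# One eBGP session per (Edge node, uplink) combination.
sessions = [(edge, vlan, leaf) for edge in EDGE_NODES for vlan, leaf in UPLINKS]

for edge, vlan, leaf in sessions:
    print(f"{edge} <-> {leaf} over VLAN {vlan}")
print("eBGP sessions:", len(sessions))
```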
And the BGP neighborships with Leaf Switches
These are the route redistribution criteria. All the segments, LB VIPs and NAT instances on the T1 Gateways and the T0 Gateway are redistributed into the BGP process and are then seen by the Leaf peers.
This is the route advertisement criteria from a PKS created T1 Gateway for a Kubernetes namespace. The connected subnet is advertised to the T0 Gateway. Even though our POD network was intended to be on a non-routable subnet, the network is still advertised into BGP.
PKS Deployed NSX-T Objects and Subnets
Let’s look at the PKS deployed T1 Gateways. I have deployed 3 Kubernetes clusters with quite a number of namespaces in them, so we should see a lot of T1s, all attached to the same T0 Gateway:
- T1s of Kubernetes nodes
- T1s of Kubernetes namespaces (PODs)
- T1s of Loadbalancers deployed for KubeAPI, Kubernetes Ingress Controllers etc.
- T1 of PKS Management Network (manually created)
These are the SNAT instances for the Kubernetes POD Namespaces that are created on the T0 Gateway. Each POD namespace subnet maps to a /32 IP on the Floating pool 192.168.105.0/24
These are the Load balancer VIPs that will be advertised as /32 subnets
These are the POD blocks which are currently in use and will be advertised as /24 subnets
These are the Kubernetes node blocks which are currently in use and will be advertised as /24 subnets
This is the summary of leases from the Static Floating Pool and Dynamic Floating pools created by NSX-T manager.
A look at the BGP Table on Leaf Switches
This is what we see on the BGP table on the Leaf Switches. The table is flooded with /24 subnets from POD network, Kubernetes node network and /32 subnets from Floating networks.
BGP Route Summarization and Filtering
Let’s set up a Prefix-list “PKS_Out_Whitelist” that permits only the networks we need to advertise (a whitelist). In our case, these are the Kubernetes node network and the Floating pool network.
We will now define a Route-map that matches the Prefix-list we just created. There is an implicit “Deny” at the end of the Route-map, so we don’t need to define a separate deny Prefix-list.
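As a rough mental model of this whitelist, the check below permits a prefix only if it is listed, and everything else falls through to the deny at the end of the Route-map. It assumes exact-match entries (no ge/le ranges) and uses the two pools from this article at their original block boundaries; the function name is made up for the sketch and this is not NSX-T configuration syntax.

```python
import ipaddress

# Whitelist entries: the node pool and the floating pool from this article.
PKS_OUT_WHITELIST = [
    ipaddress.ip_network("172.31.0.0/16"),
    ipaddress.ip_network("192.168.105.0/24"),
]


def route_map_permits(prefix, whitelist=PKS_OUT_WHITELIST):
    """Return True if `prefix` exactly matches a whitelist entry.

    Anything that does not match is dropped, which models the deny at the
    end of the route-map. Exact matching is an assumption of this sketch.
    """
    return ipaddress.ip_network(prefix) in whitelist


for candidate in ["172.31.0.0/16", "192.168.105.0/24", "172.30.0.0/16"]:
    verdict = "permit" if route_map_permits(candidate) else "deny"
    print(f"{candidate:<18} {verdict}")
```

With exact matching, the carved /24 and /32 prefixes would not match either entry on their own, which is one reason to think of the whitelist together with route aggregation.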
Let’s summarize all the networks that PKS carved out of the initially defined blocks for the Kubernetes node, POD and Floating networks at the original block boundaries. We will use BGP Route Aggregation under the T0 BGP routing process and enable the “Summary-Only” flag. With “Summary-Only” enabled, only the aggregated routes are advertised and the more specific routes are suppressed.
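The effect of aggregation with “Summary-Only” can be sketched with Python's `ipaddress` module. This is a simplified model, not how NSX-T implements it: an aggregate is advertised when at least one more specific route falls inside it, those more specific routes are suppressed, and routes outside every aggregate pass through unchanged. BGP attributes are ignored, and the individual /24 and /32 prefixes below are hypothetical samples.

```python
import ipaddress


def summarize(routes, aggregates):
    """Model aggregate routes with summary-only (simplified, no BGP attributes)."""
    nets = [ipaddress.ip_network(r) for r in routes]
    aggs = [ipaddress.ip_network(a) for a in aggregates]
    advertised = set()
    for net in nets:
        # Aggregates that strictly contain this route.
        covering = [a for a in aggs if net.prefixlen > a.prefixlen and net.subnet_of(a)]
        if covering:
            advertised.update(covering)  # the specific route is suppressed
        else:
            advertised.add(net)          # not covered, advertised as-is
    return [str(n) for n in sorted(advertised)]


routes = [
    "172.31.0.0/24", "172.31.1.0/24",           # node blocks (sample values)
    "172.30.0.0/24", "172.30.1.0/24",           # POD blocks (sample values)
    "192.168.105.10/32", "192.168.105.11/32",   # floating IPs (sample values)
    "192.168.101.0/24",                         # PKS management network
]
aggregates = ["172.31.0.0/16", "172.30.0.0/16", "192.168.105.0/24"]

for prefix in summarize(routes, aggregates):
    print(prefix)
```

Seven sample routes collapse to four prefixes here, and the POD aggregate is still among them because nothing has been filtered yet.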
At this point, we should see the aggregated networks on the Leaf Switches. Note that we haven’t filtered the POD networks yet; we have just aggregated them along with the others.
Let’s verify the Aggregator attribute of one of the networks
As we see, both T0 SR constructs perform the route aggregation and advertise it, but only the route to the Active SR construct gets installed in the leaf’s BGP table.
Now let’s apply the Route-map that we defined earlier to the T0 Interfaces as an Outbound filter.
Now the POD networks are filtered out from the BGP route advertisements.
If we need to specifically advertise a POD network (e.g. 172.30.20.0/24), we simply add this network to the Prefix-list “PKS_Out_Whitelist” that we created earlier. We can also remove the Aggregator for the POD network; it is no longer needed because we filtered this network out.
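To see how whitelisting a single POD network interacts with aggregation, here is a small sketch that chains the two steps: summarize first, then apply an exact-match whitelist. It is a simplified model under the same assumptions as before (summary-only suppresses more specific routes, unmatched prefixes are denied), and the sample routes are hypothetical. In this model the /24 only gets out once the POD aggregator is removed, so treat that as a property of the sketch and verify the behaviour on your own setup.

```python
import ipaddress


def advertise(routes, aggregates, whitelist):
    """Summarize with summary-only semantics, then apply an exact-match whitelist."""
    nets = [ipaddress.ip_network(r) for r in routes]
    aggs = [ipaddress.ip_network(a) for a in aggregates]
    allowed = {ipaddress.ip_network(w) for w in whitelist}
    candidates = set()
    for net in nets:
        covering = [a for a in aggs if net.prefixlen > a.prefixlen and net.subnet_of(a)]
        candidates.update(covering or [net])
    # Prefixes that match no whitelist entry are dropped.
    return [str(n) for n in sorted(candidates) if n in allowed]


routes = ["172.31.0.0/24", "172.30.20.0/24", "172.30.21.0/24", "192.168.105.1/32"]
whitelist = ["172.31.0.0/16", "192.168.105.0/24", "172.30.20.0/24"]

with_pod_aggregate = advertise(
    routes, ["172.31.0.0/16", "172.30.0.0/16", "192.168.105.0/24"], whitelist
)
without_pod_aggregate = advertise(
    routes, ["172.31.0.0/16", "192.168.105.0/24"], whitelist
)

print("POD aggregator kept:   ", with_pod_aggregate)
print("POD aggregator removed:", without_pod_aggregate)
```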
Are you still there? If yes, please let me know via comments or Twitter whether the article was useful.
Thanks for reading