In my eternal quest to over-engineer my home network, I decided it was time to rebuild my k3s cluster with the Cilium CNI and dual-stack dynamic routing via BGP.
Network Topology
I’m fortunate in that my ISP offers routed public subnets for a reasonable monthly fee, meaning I have a /29 (8 public IPv4 IPs). However, everything about this setup can be done without it; you’ll just want to put your “public” IPv4 services on your LAN and either forward the necessary ports or run a reverse proxy on your router.
Note that the k3s service network will not be accessible from outside the cluster. Services should be exposed via either the public or internal service network, though the pod subnet will also be routed if you need to interact with pods directly. Make sure you’ve properly secured your cluster with network policies!
| Purpose | IPv4 CIDR | IPv6 CIDR |
|---|---|---|
| Public Service Network | 192.0.2.240/29 | 2001:db8:beef:aa01::/64 |
| Internal Service Network | 172.31.0.0/16 | 2001:db8:beef:aa31::/64 |
| Home Network | 172.16.2.0/24 | 2001:db8:beef:aa02::/64 |
| K3s Node Network | 172.16.10.0/24 | 2001:db8:beef:aa10::/64 |
| K3s Pod Network | 10.42.0.0/16 | 2001:db8:beef:aa42::/64 |
| K3s Service Network | 10.43.0.0/16 | fddd:dead:beef::/112 |
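Since the pod network is reachable directly from the LAN once BGP is up, a “cluster traffic only” default is a sensible starting posture. Here’s a minimal sketch of what that can look like with a CiliumNetworkPolicy – the `myapp` namespace is a hypothetical stand-in for your own workloads:

```
# Hypothetical sketch: only allow in-cluster traffic to reach pods in the
# "myapp" namespace; LAN clients must come in via an exposed service instead.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: cluster-only-ingress
  namespace: myapp
spec:
  endpointSelector: {}    # every pod in the namespace
  ingress:
    - fromEntities:
        - cluster         # traffic originating inside the cluster
```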
BGP with FRR
I use a FreeBSD box as my router, so I’m going to get FRR installed with `pkg install frr10`. There’s not much to the configuration – BGP is a simple protocol that pre-dates modern ideas of security. Here’s my frr.conf:
```
frr defaults traditional
log syslog informational
!
router bgp 64513
 bgp router-id 172.16.10.1
 no bgp ebgp-requires-policy
 bgp default ipv4-unicast
 bgp default ipv6-unicast
 neighbor CILIUM4 peer-group
 neighbor CILIUM4 remote-as 64512
 neighbor CILIUM4 soft-reconfiguration inbound
 neighbor CILIUM6 peer-group
 neighbor CILIUM6 remote-as 64512
 neighbor CILIUM6 soft-reconfiguration inbound
 neighbor 172.16.10.20 peer-group CILIUM4
 neighbor 172.16.10.21 peer-group CILIUM4
 neighbor 172.16.10.22 peer-group CILIUM4
 neighbor 172.16.10.23 peer-group CILIUM4
 neighbor 172.16.10.24 peer-group CILIUM4
 neighbor 172.16.10.25 peer-group CILIUM4
 neighbor 2001:db8:beef:aa10::20 peer-group CILIUM6
 neighbor 2001:db8:beef:aa10::21 peer-group CILIUM6
 neighbor 2001:db8:beef:aa10::22 peer-group CILIUM6
 neighbor 2001:db8:beef:aa10::23 peer-group CILIUM6
 neighbor 2001:db8:beef:aa10::24 peer-group CILIUM6
 neighbor 2001:db8:beef:aa10::25 peer-group CILIUM6
!
line vty
```
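After installing the package, enable bgpd and start the service (on my FreeBSD box that’s roughly `sysrc frr_enable=YES` followed by `service frr start`, though the exact rc knobs and daemons file location depend on the port). The sessions won’t establish until Cilium starts peering later in this post, but these vtysh commands on the router are what I use to check on them:

```
# Run on the router; peers will sit in Idle/Active until Cilium is deployed
vtysh -c 'show bgp summary'
# Once established, these show the routes learned from the cluster
vtysh -c 'show bgp ipv4 unicast'
vtysh -c 'show bgp ipv6 unicast'
```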
Installing K3s Server
Cilium will be taking over several of the standard parts of the k3s stack, so we need to disable those bits and bobs at install time. Also, I have a local domain wired up to my DHCP server, so I’m going to use a fully qualified domain name. Lastly, we need to ensure the bpf filesystem is mounted. This script will install the k3s server:
```
#!/bin/bash
# Install k3s server
export K3S_KUBECONFIG_MODE="644"
export INSTALL_K3S_EXEC=" \
  server \
  --flannel-backend=none \
  --disable-network-policy \
  --disable-kube-proxy \
  --disable servicelb \
  --disable traefik \
  --tls-san k3s-server.k3s.example.com \
  --node-label bgp-enabled="true" \
  --cluster-cidr=10.42.0.0/16,2001:db8:beef:aa42::/64 \
  --service-cidr=10.43.0.0/16,fddd:dead:beef::/112"

curl -sfL https://get.k3s.io | sh -s -
curl -k --resolve k3s-server.k3s.example.com:6443:127.0.0.1 https://k3s-server.k3s.example.com:6443/ping

# Prep bpf filesystem
sudo mount bpffs -t bpf /sys/fs/bpf
sudo bash -c 'cat <<EOF >> /etc/fstab
none /sys/fs/bpf bpf rw,relatime 0 0
EOF'
sudo systemctl daemon-reload
sudo systemctl restart local-fs.target
```
A quick rundown on the options:
- `--flannel-backend=none` – We’ll be installing Cilium; we don’t want flannel.
- `--disable-network-policy` – Cilium has its own network policy enforcement.
- `--disable-kube-proxy` – While you can use kube-proxy with Cilium, that seems kinda pointless.
- `--disable servicelb` – Cilium has its own load balancer implementation (it used to use MetalLB, but no longer).
- `--disable traefik` – This is personal taste. I prefer to use ingress-nginx, but you’re welcome to use Traefik.
- `--tls-san k3s-server.k3s.example.com` – Since I’ve got local DNS resolution, I’m choosing to use it for the TLS cert.
- `--node-label bgp-enabled="true"` – We use this node label to control which nodes will participate in BGP peering.
- `--cluster-cidr=10.42.0.0/16,2001:db8:beef:aa42::/64` – This is the pod network range. This will be announced via BGP.
- `--service-cidr=10.43.0.0/16,fddd:dead:beef::/112` – This is the cluster-internal service network range. This will NOT be announced via BGP.
Once you’ve got your server deployed, you can get your agents deployed with this:
```
#!/bin/bash
# Install k3s agent
export K3S_KUBECONFIG_MODE="644"
export K3S_URL="https://k3s-server.k3s.example.com:6443"
export K3S_TOKEN=$(ssh k3s-server.k3s.example.com "sudo cat /var/lib/rancher/k3s/server/node-token")
export INSTALL_K3S_EXEC='--node-label bgp-enabled="true"'

curl -sfL https://get.k3s.io | sh -

# Prep bpf filesystem
sudo mount bpffs -t bpf /sys/fs/bpf
sudo bash -c 'cat <<EOF >> /etc/fstab
none /sys/fs/bpf bpf rw,relatime 0 0
EOF'
sudo systemctl daemon-reload
sudo systemctl restart local-fs.target
```
Note that it’s expected for your nodes to report as NotReady at this point – they won’t become Ready until we get Cilium deployed.
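If you want to watch them flip over once the CNI is in place, this is enough:

```
# Nodes will report NotReady until Cilium (the CNI) is running on them
kubectl get nodes -o wide --watch
```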
Install Cilium
Because Cilium uses the eBPF capabilities of recent Linux kernels, which allow network programming to run in kernel space, it is much more efficient than many of the standard Kubernetes networking tools, which run in user space. Hence, my goal with Cilium is to leverage it for as much functionality as I can, such as replacing kube-proxy and load balancers. At the same time, I don’t want to make Cilium do unnecessary work, so I’ll lean on the native network to avoid encapsulation and network address translation.
I prefer to use Helm where I can, so install the Cilium Helm repo with `helm repo add cilium https://helm.cilium.io`. Cilium, and thus the Helm chart, has a lot of knobs to twiddle, but here’s what I’m using:
```
cni:
  exclusive: false
operator:
  replicas: 1
kubeProxyReplacement: true
k8sServiceHost: "k3s-server.k3s.example.com"
k8sServicePort: 6443
bgpControlPlane:
  enabled: true
ipv4:
  enabled: true
ipv6:
  enabled: true
ipam:
  mode: "cluster-pool"
  operator:
    clusterPoolIPv4PodCIDRList: "10.42.0.0/16"
    clusterPoolIPv6PodCIDRList: "2001:db8:beef:aa42::/96"
    clusterPoolIPv4MaskSize: 24
    clusterPoolIPv6MaskSize: 112
ipv4NativeRoutingCIDR: "10.42.0.0/16"
ipv6NativeRoutingCIDR: "2001:db8:beef:aa42::/96"
bpf:
  #datapathMode: "netkit"
  vlanBypass:
    - 0
    - 10
    - 20
enableIPv4Masquerade: false
enableIPv6Masquerade: false
externalIPs:
  enabled: true
loadBalancer:
  mode: "dsr"
routingMode: "native"
autoDirectNodeRoutes: true
hubble:
  relay:
    enabled: true
  ui:
    enabled: true
extraConfig:
  enable-ipv6-ndp: "true"
```
The important configuration options are:
- Really the most critical one here; it enables BGP functionality:
  - `bgpControlPlane.enabled: true`
- Cilium can do kube-proxy’s job, and it can do it much faster:
  - `kubeProxyReplacement: true`
- By default, Cilium will encapsulate and NAT traffic as it leaves a node, but the whole premise here is that we’re natively routing the Kubernetes networks:
  - `routingMode: "native"`
  - `enableIPv4Masquerade: false`
  - `enableIPv6Masquerade: false`
  - `autoDirectNodeRoutes: true`
  - `ipv4NativeRoutingCIDR: "10.42.0.0/16"`
  - `ipv6NativeRoutingCIDR: "2001:db8:beef:aa42::/96"`
- Cilium has a variety of IPAM options, but I want to use cluster-scoped direct routing. This is all about the pod networks – service networks will be configured later. We want the CIDRs here to match what we’ve configured for native routing above.
  - `ipam.mode: "cluster-pool"`
  - `ipam.operator.clusterPoolIPv4PodCIDRList: "10.42.0.0/16"`
  - `ipam.operator.clusterPoolIPv4MaskSize: 24` – For IPv4 we’re breaking up the /16 into /24s, which are assigned to nodes. This means 254 pods per node – plenty, considering I’m using Raspberry Pis.
  - `ipam.operator.clusterPoolIPv6PodCIDRList: "2001:db8:beef:aa42::/96"`
  - `ipam.operator.clusterPoolIPv6MaskSize: 112` – For IPv6 we’re breaking up the /96 into /112s. I tried aligning this with the /64 I provided to the k3s installer, but Cilium errored out and wanted a smaller CIDR. I need to dig into this more at some point.
- There are a few other options that are notable, but not directly relevant to this post:
  - `loadBalancer.mode: "dsr"` – Cilium supports several load balancing modes. I recently discovered that I had to adjust the MTU/MSS settings on my router due to issues with some IPv6 traffic, and I intend to test the “hybrid” mode soon to see if that resolves it without MTU changes on my network.
  - `bpf.datapathMode: "netkit"` – This is commented out because I’m intending to test it, but I’m including it here because it sounds interesting. It replaces the usual veth device typically used for pod connectivity with the new netkit driver that lives in kernel space on the host. It should be more performant.
  - `bpf.vlanBypass` – Only necessary if you’re using VLANs.
  - `cni.exclusive` – I also deploy Multus, so I don’t want Cilium assuming I’m committed.
  - `extraConfig.enable-ipv6-ndp: "true"` – It’s always good to be neighbourly. This enables the NDP proxy feature, which exposes pod IPv6 addresses on the LAN.
Deploy Cilium and wait for it to finish installing:
```
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --values cilium-helm.yaml

kubectl wait --namespace kube-system \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/name=cilium-operator \
  --timeout=120s

kubectl wait --namespace kube-system \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/name=hubble-ui \
  --timeout=120s
```
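If you have the cilium CLI installed (not required for anything else in this post), it gives a nicer health summary than watching pods, and can run an end-to-end connectivity test:

```
# Waits until the agent, operator, and Hubble report healthy
cilium status --wait

# Optional, and takes a while: deploys test workloads and exercises the datapath
cilium connectivity test
```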
Configure Cilium
Now it’s time to configure the BGP peering by creating some CRDs. First up is the CiliumBGPClusterConfig, which is the keystone of this operation. Note that it uses the `bgp-enabled: "true"` selector, which is why we labeled our nodes earlier.
```
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
  namespace: cilium
spec:
  nodeSelector:
    matchLabels:
      bgp-enabled: "true"
  bgpInstances:
    - name: "64512"
      localASN: 64512
      peers:
        - name: "peer-64513-ipv4"
          peerASN: 64513
          peerAddress: "172.16.10.1"
          peerConfigRef:
            name: "cilium-peer4"
        - name: "peer-64513-ipv6"
          peerASN: 64513
          peerAddress: "2001:db8:beef:aa10::1"
          peerConfigRef:
            name: "cilium-peer6"
```
Now we need to create a pair of CiliumBGPPeerConfigs, one for IPv4 and one for IPv6:
```
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  namespace: cilium
  name: cilium-peer4
spec:
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  namespace: cilium
  name: cilium-peer6
spec:
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15
  families:
    - afi: ipv6
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"
```
Next up is the CiliumBGPAdvertisement CRD, which tells Cilium what kinds of resources we want to advertise over BGP. In this case, we’re going to advertise both the pods and the services. Note, however, that this won’t advertise standard services, which are deployed in the k3s internal service range (10.43.0.0/16). The odd-looking selector below (a NotIn match against a value that’s never used) is simply a way of matching every service.
```
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  namespace: cilium
  name: bgp-advertisements
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "PodCIDR"
    - advertisementType: "Service"
      service:
        addresses:
          - ExternalIP
          - LoadBalancerIP
      selector:
        matchExpressions:
          - {key: somekey, operator: NotIn, values: ['never-used-value']}
      attributes:
        communities:
          standard: [ "64512:100" ]
```
Lastly, we have the CiliumLoadBalancerIPPool CRDs. These are the IP pools that load balancers or services configured with externalIPs can draw from, and these are the services Cilium will advertise:
```
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: public-pool
spec:
  blocks:
    - cidr: 192.0.2.240/29
    - start: 2001:db8:beef:aa01::240
      stop: 2001:db8:beef:aa01::247
  serviceSelector:
    matchLabels:
      network: public
---
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: internal-pool
spec:
  allowFirstLastIPs: "No"
  blocks:
    - cidr: 172.31.0.0/16
    - cidr: 2001:db8:beef:aa31::/64
  serviceSelector:
    matchLabels:
      network: internal
```
For the public-pool CRD, I’m using my public /29 and a small range of IPv6 addresses. Because Cilium will assign addresses sequentially, this ensures services will (generally) have the same final octet/hextet.
For the internal-pool, I’m setting `allowFirstLastIPs: "No"`, mostly to avoid confusing devices that don’t expect to reach a service on a network or broadcast address [ed note: sometimes the author is one such device].
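With all of the CRDs applied, you can verify the peering from the cluster side (again assuming the cilium CLI is installed). At this point only the PodCIDRs are advertised; service routes will appear once we create LoadBalancer services in the next section. You can cross-check the router side with the vtysh commands from earlier.

```
# Each BGP-enabled node should show established IPv4 and IPv6 sessions
cilium bgp peers

# What each node is advertising to the router
cilium bgp routes advertised ipv4 unicast
cilium bgp routes advertised ipv6 unicast
```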
Deploying Ingress-Nginx
The last step is to deploy an external service or two. Just be sure to label your services with either `network: "public"` or `network: "internal"`; they’ll be assigned an IP from the relevant pool, and Cilium will announce it over BGP.
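For example, a plain dual-stack LoadBalancer service only needs the label – everything else is ordinary Kubernetes. The name, namespace, and selector below are hypothetical:

```
# Hypothetical example: expose "some-app" on the internal pool
apiVersion: v1
kind: Service
metadata:
  name: some-app
  namespace: default
  labels:
    network: "internal"   # matches the internal-pool serviceSelector
spec:
  type: LoadBalancer
  ipFamilyPolicy: PreferDualStack
  selector:
    app: some-app
  ports:
    - name: http
      port: 80
      targetPort: 8080
```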
In my case, I’m primarily using ingress-nginx, so let’s deploy a pair of ’em, starting with the public one. Here’s my ingress-nginx-helm-public.yaml file (note `service.labels.network: "public"`):
```
fullnameOverride: public-nginx
defaultBackend:
  enabled: false
controller:
  ingressClass: public-nginx
  ingressClassResource:
    name: public-nginx
    controllerValue: "k8s.io/public-ingress-nginx"
  publishService:
    enabled: true
  metrics:
    enabled: true
  service:
    labels:
      network: "public"
    ipFamilyPolicy: PreferDualStack
```
And the internal one:
```
fullnameOverride: internal-nginx
defaultBackend:
  enabled: false
controller:
  ingressClass: internal-nginx
  ingressClassResource:
    name: internal-nginx
    controllerValue: "k8s.io/internal-ingress-nginx"
  publishService:
    enabled: true
  metrics:
    enabled: true
  service:
    labels:
      network: "internal"
    ipFamilyPolicy: PreferDualStack
```
And deploy them:
```
helm upgrade --install internal-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace internal-nginx --create-namespace \
  --values ingress-nginx-helm-internal.yaml

kubectl wait --namespace internal-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=120s

helm upgrade --install public-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace public-nginx --create-namespace \
  --values ingress-nginx-helm-public.yaml

kubectl wait --namespace public-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=120s
```
And finally, you can confirm they’re deployed:
```
kubectl get service -A | grep LoadBalancer
internal-nginx   internal-nginx-controller   LoadBalancer   10.43.78.168   172.31.0.1,2001:db8:beef:aa31::1      80:32608/TCP,443:31361/TCP   5d9h
public-nginx     public-nginx-controller     LoadBalancer   10.43.33.98    192.0.2.240,2001:db8:beef:aa01::240   80:31821/TCP,443:31611/TCP   5d9h
```
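As a quick smoke test, hitting the internal controller’s addresses from the LAN should return ingress-nginx’s stock 404 until you create some Ingress resources (the addresses are the ones assigned above):

```
curl -i http://172.31.0.1/
curl -i 'http://[2001:db8:beef:aa31::1]/'
```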