In my eternal quest to over-engineer my home network, I decided it was time to rebuild my k3s cluster with the Cilium CNI and dual-stack dynamic routing via BGP.
Network Topology
I’m fortunate in that my ISP offers routed public subnets for a reasonable monthly fee, meaning I have a /29 (8 public IPv4 IPs). However, everything about this setup can be done without one; you’ll just want to put your “public” IPv4 services on your LAN and either forward the necessary ports or run a reverse proxy on your router.
Note that the k3s service network will not be accessible from outside the cluster. Services should be exposed via either the public or internal service network, though the pod subnet will also be routed if you need to interact with them directly. Make sure you’ve properly secured your cluster with network policies!
| Purpose | IPv4 CIDR | IPv6 CIDR |
|---|---|---|
| Public Service Network | 192.0.2.240/29 | 2001:db8:beef:aa01::/64 |
| Internal Service Network | 172.31.0.0/16 | 2001:db8:beef:aa31::/64 |
| Home Network | 172.16.2.0/24 | 2001:db8:beef:aa02::/64 |
| K3s Node Network | 172.16.10.0/24 | 2001:db8:beef:aa10::/64 |
| K3s Pod Network | 10.42.0.0/16 | 2001:db8:beef:aa42::/64 |
| K3s Service Network | 10.43.0.0/16 | fddd:dead:beef::/112 |
BGP with FRR
I use a FreeBSD box as my router, so I’m going to get FRR installed with `pkg install frr10`. There’s not much to the configuration – BGP is a simple protocol that pre-dates modern ideas of security. Here’s my frr.conf:
```
frr defaults traditional
log syslog informational
!
router bgp 64513
 bgp router-id 172.16.10.1
 no bgp ebgp-requires-policy
 bgp default ipv4-unicast
 bgp default ipv6-unicast
 neighbor CILIUM4 peer-group
 neighbor CILIUM4 remote-as 64512
 neighbor CILIUM4 soft-reconfiguration inbound
 neighbor CILIUM6 peer-group
 neighbor CILIUM6 remote-as 64512
 neighbor CILIUM6 soft-reconfiguration inbound
 neighbor 172.16.10.20 peer-group CILIUM4
 neighbor 172.16.10.21 peer-group CILIUM4
 neighbor 172.16.10.22 peer-group CILIUM4
 neighbor 172.16.10.23 peer-group CILIUM4
 neighbor 172.16.10.24 peer-group CILIUM4
 neighbor 172.16.10.25 peer-group CILIUM4
 neighbor 2001:db8:beef:aa10::20 peer-group CILIUM6
 neighbor 2001:db8:beef:aa10::21 peer-group CILIUM6
 neighbor 2001:db8:beef:aa10::22 peer-group CILIUM6
 neighbor 2001:db8:beef:aa10::23 peer-group CILIUM6
 neighbor 2001:db8:beef:aa10::24 peer-group CILIUM6
 neighbor 2001:db8:beef:aa10::25 peer-group CILIUM6
!
line vty
```
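If FRR isn’t already running on the router, it needs to be enabled and started before anything will peer. A minimal sketch for FreeBSD, assuming the `frr10` package’s stock rc script (and that `bgpd` is enabled in FRR’s daemons file); the sessions will sit in Active/Connect until Cilium comes up later:

```bash
# Enable and start FRR via rc.conf (FreeBSD service name for the frr package)
sudo sysrc frr_enable="YES"
sudo service frr start

# Check the BGP sessions; expect them all down until Cilium starts peering
sudo vtysh -c "show bgp summary"
```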
Installing K3s Server
Cilium will be taking over several of the standard parts of the k3s stack, so we need to disable those bits and bobs at install time. Also, I have a local domain wired up to my DHCP server, so I’m going to use a fully qualified domain name. Lastly, we need to ensure the bpf filesystem is mounted. This script will install the k3s server:
```bash
#!/bin/bash
# Install k3s server
export K3S_KUBECONFIG_MODE="644"
export INSTALL_K3S_EXEC=" \
    server \
    --flannel-backend=none \
    --disable-network-policy \
    --disable-kube-proxy \
    --disable servicelb \
    --disable traefik \
    --tls-san k3s-server.k3s.example.com \
    --node-label bgp-enabled="true" \
    --cluster-cidr=10.42.0.0/16,2001:db8:beef:aa42::/64 \
    --service-cidr=10.43.0.0/16,fddd:dead:beef::/112"

curl -sfL https://get.k3s.io | sh -s -
curl -k --resolve k3s-server.k3s.example.com:6443:127.0.0.1 \
    https://k3s-server.k3s.example.com:6443/ping

# Prep bpf filesystem
sudo mount bpffs -t bpf /sys/fs/bpf
sudo bash -c 'cat <<EOF >> /etc/fstab
none /sys/fs/bpf bpf rw,relatime 0 0
EOF'
sudo systemctl daemon-reload
sudo systemctl restart local-fs.target
```
A quick rundown on the options:

- `--flannel-backend=none`: We’ll be installing Cilium, so we don’t want flannel.
- `--disable-network-policy`: Cilium has its own network policy enforcement.
- `--disable-kube-proxy`: While you can use kube-proxy with Cilium, that seems kinda pointless.
- `--disable servicelb`: Cilium has its own load balancer implementation (it used to use MetalLB, but no longer).
- `--disable traefik`: This is personal taste. I prefer to use ingress-nginx, but you’re welcome to use traefik.
- `--tls-san k3s-server.k3s.example.com`: Since I’ve got local DNS resolution, I’m choosing to use it for the TLS cert.
- `--node-label bgp-enabled="true"`: We use this node label to control which nodes will participate in BGP peering.
- `--cluster-cidr=10.42.0.0/16,2001:db8:beef:aa42::/64`: This is the pod network range. It will be announced via BGP.
- `--service-cidr=10.43.0.0/16,fddd:dead:beef::/112`: This is the cluster-internal service network range. It will NOT be announced via BGP.
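If you’d rather drive the cluster from a workstation instead of running kubectl on the server itself, you can copy the kubeconfig off now. A rough sketch, assuming k3s’s default kubeconfig path and my local FQDN:

```bash
# Copy the kubeconfig from the server and point it at the server's FQDN
# instead of the default 127.0.0.1.
mkdir -p ~/.kube
ssh k3s-server.k3s.example.com "sudo cat /etc/rancher/k3s/k3s.yaml" \
    | sed 's/127.0.0.1/k3s-server.k3s.example.com/' > ~/.kube/config
chmod 600 ~/.kube/config
```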
Once you’ve got your server deployed, you can get your agents deployed with this:
```bash
#!/bin/bash
# Install k3s-agent
export K3S_KUBECONFIG_MODE="644"
export K3S_URL="https://k3s-server.k3s.example.com:6443"
export K3S_TOKEN=$(ssh k3s-server.k3s.example.com "sudo cat /var/lib/rancher/k3s/server/node-token")
export INSTALL_K3S_EXEC='--node-label bgp-enabled="true"'

curl -sfL https://get.k3s.io | sh -

# Prep bpf filesystem
sudo mount bpffs -t bpf /sys/fs/bpf
sudo bash -c 'cat <<EOF >> /etc/fstab
none /sys/fs/bpf bpf rw,relatime 0 0
EOF'
sudo systemctl daemon-reload
sudo systemctl restart local-fs.target
```
Note that your nodes will report as NotReady. That’s expected; they won’t be ready until we get Cilium deployed.
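Since the BGP configuration we create later selects nodes by the `bgp-enabled` label, it’s worth confirming the label actually landed on every node. A quick check (the `-L` flag just adds a column for that label):

```bash
# Every node should show bgp-enabled=true; STATUS stays NotReady until Cilium is installed.
kubectl get nodes -L bgp-enabled
```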
Install Cilium
I prefer to use helm where I can, so install the helm repo with `helm repo add cilium https://helm.cilium.io`. Cilium, and thus the helm chart, has a lot of knobs to twiddle, but here’s what I am using:
```yaml
cni:
  exclusive: false
operator:
  replicas: 1
kubeProxyReplacement: true
k8sServiceHost: "k3s-server.k3s.example.com"
k8sServicePort: 6443
bgpControlPlane:
  enabled: true
ipv4:
  enabled: true
ipv6:
  enabled: true
ipam:
  mode: "cluster-pool"
  operator:
    clusterPoolIPv4PodCIDRList: "10.42.0.0/16"
    clusterPoolIPv6PodCIDRList: "2001:db8:beef:aa42::/96"
    clusterPoolIPv4MaskSize: 24
    clusterPoolIPv6MaskSize: 112
ipv4NativeRoutingCIDR: "10.42.0.0/16"
ipv6NativeRoutingCIDR: "2001:db8:beef:aa42::/96"
bpf:
  #datapathMode: "netkit"
  vlanBypass:
    - 0
    - 10
    - 20
enableIPv4Masquerade: false
enableIPv6Masquerade: false
externalIPs:
  enabled: true
loadBalancer:
  mode: "dsr"
routingMode: "native"
autoDirectNodeRoutes: true
hubble:
  relay:
    enabled: true
  ui:
    enabled: true
extraConfig:
  enable-ipv6-ndp: "true"
  ipv6-mcast-device: "eth0"
```
The important configuration options are:

- `cni.exclusive: false`: I also deploy Multus, so I don’t want Cilium assuming I’m committed.
- `kubeProxyReplacement: true`: Cilium can do kube-proxy’s job, and it can do it much faster.
- `bgpControlPlane.enabled: true`: Required for the whole BGP piece.
- `ipam.mode: "cluster-pool"`: Cilium supports a variety of IPAM options, but this is the one that fits this use case.
- `ipam.operator.clusterPoolIP*`: These options control how IPs are assigned to pods in the cluster. For IPv4 we’re breaking up the /16 into /24s which can be assigned to nodes, giving 254 pods per node. Plenty, considering I’m using Raspberry Pis. For IPv6 we’re breaking up the /96 into /112s. I tried aligning this with the /64 I provided to the k3s installer, but Cilium errored out and wanted a smaller CIDR. I need to dig into this more at some point.
- `ipv4NativeRoutingCIDR: "10.42.0.0/16"` and `ipv6NativeRoutingCIDR: "2001:db8:beef:aa42::/96"`: These need to match what we provided in the IPAM section, and tell Cilium not to encapsulate these networks – the underlying network can route them properly.
- `bpf.vlanBypass`: Only necessary if you’re using VLANs.
- `enableIPv4Masquerade: false` and `enableIPv6Masquerade: false`: Since we’ve got native routing both internally and externally to the cluster, we don’t want Cilium doing any NAT.
- `routingMode: "native"`: We want Cilium using the underlying network for routing.
- `extraConfig.enable-ipv6-ndp: "true"`: Always good to be neighbourly.
Deploy Cilium and wait for it to finish installing:
```bash
helm upgrade --install cilium cilium/cilium \
    --namespace kube-system \
    --values cilium-helm.yaml

kubectl wait --namespace kube-system \
    --for=condition=ready pod \
    --selector=app.kubernetes.io/name=cilium-operator \
    --timeout=120s

kubectl wait --namespace kube-system \
    --for=condition=ready pod \
    --selector=app.kubernetes.io/name=hubble-ui \
    --timeout=120s
```
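At this point the nodes should flip to Ready. If you’ve installed the cilium CLI it gives a nice health summary too, but kubectl alone is enough:

```bash
# Nodes should now report Ready
kubectl get nodes

# Optional: overall Cilium health, if the cilium CLI is installed
cilium status --wait
```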
Configure Cilium
Now it’s time to configure the BGP peering by creating some CRDs. First up is the CiliumBGPClusterConfig, which is the keystone of this operation. Note that it uses the `bgp-enabled: "true"` selector, which is why we labeled our nodes earlier.
```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
  namespace: cilium
spec:
  nodeSelector:
    matchLabels:
      bgp-enabled: "true"
  bgpInstances:
    - name: "64512"
      localASN: 64512
      peers:
        - name: "peer-64513-ipv4"
          peerASN: 64513
          peerAddress: "172.16.10.1"
          peerConfigRef:
            name: "cilium-peer4"
        - name: "peer-64513-ipv6"
          peerASN: 64513
          peerAddress: "2001:db8:beef:aa10::1"
          peerConfigRef:
            name: "cilium-peer6"
```
Now we need to create a pair of CiliumBGPPeerConfigs, one for IPv4 and one for IPv6:
```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  namespace: cilium
  name: cilium-peer4
spec:
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  namespace: cilium
  name: cilium-peer6
spec:
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15
  families:
    - afi: ipv6
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"
```
Next up is the CiliumBGPAdvertisement CRD, which tells Cilium what kinds of resources we want to advertise over BGP. In this case, we’re going to advertise both pods and services. Note, however, that this won’t advertise ordinary ClusterIPs, which live in the k3s internal service range (10.43.0.0/16). The service selector below uses the NotIn/never-used-value trick, which simply matches every service.
```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  namespace: cilium
  name: bgp-advertisements
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "PodCIDR"
    - advertisementType: "Service"
      service:
        addresses:
          - ExternalIP
          - LoadBalancerIP
      selector:
        matchExpressions:
          - {key: somekey, operator: NotIn, values: ['never-used-value']}
      attributes:
        communities:
          standard: [ "64512:100" ]
```
Lastly, we have the CiliumLoadBalancerIPPool CRDs. These are IP pools that load balancers or services configured with externalIPs can use, and these are the services Cilium will advertise:
```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: public-pool
spec:
  blocks:
    - cidr: 192.0.2.240/29
    - start: 2001:db8:beef:aa01::240
      stop: 2001:db8:beef:aa01::247
  serviceSelector:
    matchLabels:
      network: public
---
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: internal-pool
spec:
  allowFirstLastIPs: "No"
  blocks:
    - cidr: 172.31.0.0/16
    - cidr: 2001:db8:beef:aa31::/64
  serviceSelector:
    matchLabels:
      network: internal
```
For the public-pool CRD, I’m using my public /29 and a small range of IPv6 addresses. Because Cilium will assign addresses sequentially, this ensures services will (generally) have the same final octet/hextet.
For the internal-pool, I’m setting `allowFirstLastIPs: "No"`, mostly to avoid confusing other devices that wouldn’t expect to find a service living on a network or broadcast address [ed note: sometimes the author is one such device].
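With the four manifests written, apply them and make sure the router is actually learning routes. The file names here are just whatever I saved the snippets above as; the vtysh commands run on the FreeBSD router:

```bash
# Apply the BGP cluster config, peer configs, advertisement, and IP pools
kubectl apply -f cilium-bgp-cluster-config.yaml
kubectl apply -f cilium-bgp-peer-config.yaml
kubectl apply -f cilium-bgp-advertisements.yaml
kubectl apply -f cilium-lb-ip-pools.yaml

# On the router: sessions should be Established, with the pod /24s (and,
# later, service IPs) showing up in the BGP tables.
sudo vtysh -c "show bgp summary"
sudo vtysh -c "show ip bgp"
sudo vtysh -c "show bgp ipv6 unicast"
```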
Deploying Ingress-Nginx
The last step is to deploy an external service or two. Just be sure to label your services with either `network: "public"` or `network: "internal"`; they’ll be assigned an IP from the relevant pool and Cilium will announce it over BGP.
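For example, here’s a minimal sketch of what that label looks like on an arbitrary LoadBalancer service; the `demo-whoami` name and selector are made up purely for illustration:

```bash
# Hypothetical example: any LoadBalancer service labeled network=internal
# gets an address from internal-pool and is announced via BGP.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: demo-whoami          # made-up name, for illustration only
  labels:
    network: "internal"
spec:
  type: LoadBalancer
  ipFamilyPolicy: PreferDualStack
  selector:
    app: demo-whoami         # assumes a matching deployment exists
  ports:
    - port: 80
      targetPort: 80
EOF
```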
In my case, I’m primarily using ingress-nginx, so let’s deploy a pair of ’em, starting with the public one. Here’s my ingress-nginx-helm-public.yaml file (note `service.labels.network: "public"`):
```yaml
fullnameOverride: public-nginx
defaultBackend:
  enabled: false
controller:
  ingressClass: public-nginx
  ingressClassResource:
    name: public-nginx
    controllerValue: "k8s.io/public-ingress-nginx"
  publishService:
    enabled: true
  metrics:
    enabled: true
  service:
    labels:
      network: "public"
    ipFamilyPolicy: PreferDualStack
```
And the internal one:
```yaml
fullnameOverride: internal-nginx
defaultBackend:
  enabled: false
controller:
  ingressClass: internal-nginx
  ingressClassResource:
    name: internal-nginx
    controllerValue: "k8s.io/internal-ingress-nginx"
  publishService:
    enabled: true
  metrics:
    enabled: true
  service:
    labels:
      network: "internal"
    ipFamilyPolicy: PreferDualStack
```
And deploy them:
```bash
helm upgrade --install internal-nginx ingress-nginx \
    --repo https://kubernetes.github.io/ingress-nginx \
    --namespace internal-nginx --create-namespace \
    --values ingress-nginx-helm-internal.yaml

kubectl wait --namespace internal-nginx \
    --for=condition=ready pod \
    --selector=app.kubernetes.io/component=controller \
    --timeout=120s

helm upgrade --install public-nginx ingress-nginx \
    --repo https://kubernetes.github.io/ingress-nginx \
    --namespace public-nginx --create-namespace \
    --values ingress-nginx-helm-public.yaml

kubectl wait --namespace public-nginx \
    --for=condition=ready pod \
    --selector=app.kubernetes.io/component=controller \
    --timeout=120s
```
And finally, you can confirm they’re deployed:
```
kubectl get service -A | grep LoadBalancer
internal-nginx   internal-nginx-controller   LoadBalancer   10.43.78.168   172.31.0.1,2001:db8:beef:aa31::1      80:32608/TCP,443:31361/TCP   5d9h
public-nginx     public-nginx-controller     LoadBalancer   10.43.33.98    192.0.2.240,2001:db8:beef:aa01::240   80:31821/TCP,443:31611/TCP   5d9h
```
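And, since routed services were the whole point, a quick reachability check from another machine on the LAN, using the addresses from the output above (expect a 404, since nothing is wired up behind nginx yet):

```bash
# Both address families, internal and public pools
curl -4 -I http://172.31.0.1/
curl -6 -I "http://[2001:db8:beef:aa31::1]/"
curl -4 -I http://192.0.2.240/
```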