Dual Stack K3s With Cilium And BGP

[Diagram: rudimentary example network topology for the k3s cluster]

In my eternal quest to over-engineer my home network, I decided it was time to rebuild my k3s cluster with cilium CNI and dual-stack dynamic routing via BGP.

Network Topology

I’m fortunate in that my ISP offers routed public subnets for a reasonable monthly fee, meaning I have a /29 (8 public IPv4 addresses). However, everything about this setup can be done without it; you’ll just want to put your “public” IPv4 services on your LAN and either forward the necessary ports or run a reverse proxy on your router.

Note that the k3s service network will not be accessible from outside the cluster. Services should be exposed via either the public or internal service network, though the pod subnet will also be routed if you need to interact with them directly. Make sure you’ve properly secured your cluster with network policies!

Purpose                    IPv4 CIDR         IPv6 CIDR
Public Service Network     192.0.2.240/29    2001:db8:beef:aa01::/64
Internal Service Network   172.31.0.0/16     2001:db8:beef:aa31::/64
Home Network               172.16.2.0/24     2001:db8:beef:aa02::/64
K3s Node Network           172.16.10.0/24    2001:db8:beef:aa10::/64
K3s Pod Network            10.42.0.0/16      2001:db8:beef:aa42::/64
K3s Service Network        10.43.0.0/16      fddd:dead:beef::/112

BGP with FRR

I use a FreeBSD box as my router, so I’m going to get FRR installed with pkg install frr10. There’s not much to the configuration – BGP is a simple protocol that pre-dates modern ideas of security. Here’s my frr.conf:

frr defaults traditional
log syslog informational
!
router bgp 64513
  bgp router-id 172.16.10.1
  no bgp ebgp-requires-policy
  bgp default ipv4-unicast
  bgp default ipv6-unicast
  neighbor CILIUM4 peer-group
  neighbor CILIUM4 remote-as 64512
  neighbor CILIUM4 soft-reconfiguration inbound
  neighbor CILIUM6 peer-group
  neighbor CILIUM6 remote-as 64512
  neighbor CILIUM6 soft-reconfiguration inbound
  neighbor 172.16.10.20 peer-group CILIUM4
  neighbor 172.16.10.21 peer-group CILIUM4
  neighbor 172.16.10.22 peer-group CILIUM4
  neighbor 172.16.10.23 peer-group CILIUM4
  neighbor 172.16.10.24 peer-group CILIUM4
  neighbor 172.16.10.25 peer-group CILIUM4
  neighbor 2001:db8:beef:aa10::20 peer-group CILIUM6
  neighbor 2001:db8:beef:aa10::21 peer-group CILIUM6
  neighbor 2001:db8:beef:aa10::22 peer-group CILIUM6
  neighbor 2001:db8:beef:aa10::23 peer-group CILIUM6
  neighbor 2001:db8:beef:aa10::24 peer-group CILIUM6
  neighbor 2001:db8:beef:aa10::25 peer-group CILIUM6
!
line vty
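
If FRR isn’t already running, you’ll need to enable bgpd and start the service. This is just a sketch assuming the FreeBSD pkg’s usual defaults (rc script named frr, daemons file under /usr/local/etc/frr) – adjust for your platform:

# Make sure bgpd=yes in FRR's daemons file (usually /usr/local/etc/frr/daemons on FreeBSD)
sudo sysrc frr_enable=YES
sudo service frr start

# Once the Cilium peers are configured later on, check session state with:
sudo vtysh -c 'show bgp summary'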

Installing K3s Server

Cilium will be taking over for several of the standard parts of the k3s stack, so we need to disable those bits and bobs at install time. Also, I have a local domain wired up to my DHCP server, so I’m going to use a fully qualified domain name. Lastly, we need to ensure the bpf filesystem is mounted. This script will install the k3s server:

#!/bin/bash

# Install k3s server
export K3S_KUBECONFIG_MODE="644"
export INSTALL_K3S_EXEC=" \
    server \
    --flannel-backend=none \
    --disable-network-policy \
    --disable-kube-proxy \
    --disable servicelb \
    --disable traefik \
    --tls-san k3s-server.k3s.example.com \
    --node-label bgp-enabled="true" \
    --cluster-cidr=10.42.0.0/16,2001:db8:beef:aa42::/64 \
    --service-cidr=10.43.0.0/16,fddd:dead:beef::/112"
curl -sfL https://get.k3s.io | sh -s -
curl -k --resolve k3s-server.k3s.example.com:6443:127.0.0.1 https://k3s-server.k3s.example.com:6443/ping

# Prep bpf filesystem
sudo mount bpffs -t bpf /sys/fs/bpf
sudo bash -c 'cat <<EOF >> /etc/fstab
none /sys/fs/bpf bpf rw,relatime 0 0
EOF'
sudo systemctl daemon-reload
sudo systemctl restart local-fs.target

A quick rundown on the options:

  • --flannel-backend=none
    We’ll be installing Cilium, so we don’t want flannel
  • --disable-network-policy
    Cilium has its own network policy enforcement
  • --disable-kube-proxy
    While you can use kube-proxy with Cilium, that seems kinda pointless
  • --disable servicelb
    Cilium has its own load balancer implementation (it used to use metallb, but no longer)
  • --disable traefik
    This is personal taste. I prefer to use ingress-nginx, but you’re welcome to use traefik
  • --tls-san k3s-server.k3s.example.com
    Since I’ve got local DNS resolution, I’m choosing to use it for the TLS cert
  • --node-label bgp-enabled="true"
    We use this node label to control which nodes will participate in BGP peering
  • --cluster-cidr=10.42.0.0/16,2001:db8:beef:aa42::/64
    This is the pod network range. It will be announced via BGP (see the quick check after this list).
  • --service-cidr=10.43.0.0/16,fddd:dead:beef::/112
    This is the cluster-internal service network range. It will NOT be announced via BGP.
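
As a quick sanity check (purely illustrative, not part of the install script), you can confirm the server picked up both address families. Note that node.spec.podCIDRs is what k3s allocated; Cilium’s cluster-pool IPAM will hand out its own per-node ranges later:

# On the server; k3s writes its kubeconfig here
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
# Should print one IPv4 and one IPv6 pod CIDR per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDRs}{"\n"}{end}'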

Once you’ve got your server deployed, you can install the agents with this:

#!/bin/bash

# Install k3s-agent
export K3S_KUBECONFIG_MODE="644"
export K3S_URL="https://k3s-server.k3s.example.com:6443"
export K3S_TOKEN=$(ssh k3s-server.k3s.example.com "sudo cat /var/lib/rancher/k3s/server/node-token")
export INSTALL_K3S_EXEC='--node-label bgp-enabled="true"'
curl -sfL https://get.k3s.io | sh -

# Prep bpf filesystem
sudo mount bpffs -t bpf /sys/fs/bpf
sudo bash -c 'cat <<EOF >> /etc/fstab
none /sys/fs/bpf bpf rw,relatime 0 0
EOF'
sudo systemctl daemon-reload
sudo systemctl restart local-fs.target

Note that your nodes will report as NotReady. That’s expected – they won’t be Ready until we get Cilium deployed.
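
If you want to watch that happen, something like this (run from the server, or anywhere with the kubeconfig) will do:

# Nodes will sit in NotReady until Cilium is installed in the next step
kubectl get nodes -o wide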

Install Cilium

I prefer to use helm where I can, so add the helm repo with helm repo add cilium https://helm.cilium.io. Cilium, and thus the helm chart, has a lot of knobs to twiddle, but here’s what I am using:

cni:
  exclusive: false
operator:
  replicas: 1
kubeProxyReplacement: true
k8sServiceHost: "k3s-server.k3s.example.com"
k8sServicePort: 6443
bgpControlPlane:
  enabled: true
ipv4:
  enabled: true
ipv6:
  enabled: true
ipam:
  mode: "cluster-pool"
  operator:
    clusterPoolIPv4PodCIDRList: "10.42.0.0/16"
    clusterPoolIPv6PodCIDRList: "2001:db8:beef:aa42::/96"
    clusterPoolIPv4MaskSize: 24
    clusterPoolIPv6MaskSize: 112
ipv4NativeRoutingCIDR: "10.42.0.0/16"
ipv6NativeRoutingCIDR: "2001:db8:beef:aa42::/96"
bpf:
  #datapathMode: "netkit"
  vlanBypass:
    - 0
    - 10
    - 20
enableIPv4Masquerade: false
enableIPv6Masquerade: false
externalIPs:
  enabled: true
loadBalancer:
  mode: "dsr"
routingMode: "native"
autoDirectNodeRoutes: true
hubble:
  relay:
    enabled: true
  ui:
    enabled: true
extraConfig:
  enable-ipv6-ndp: "true"
  ipv6-mcast-device: "eth0"

The important configuration options are:

  • cni.exclusive
    I also deploy Multus, so I don’t want Cilium assuming I’m committed.
  • kubeProxyReplacement: true
    Cilium can do kube-proxy’s job, and it can do it much faster.
  • bgpControlPlane.enabled: true
    Required for the whole BGP piece
  • ipam.mode: "cluster-pool"
    Cilium supports a variety of IPAM options, but this is the one that fits this use case.
  • ipam.operator.clusterPoolIP*
    These options control how IPs are assigned to pods in the cluster. For IPv4 we’re breaking up the /16 into /24s which can be assigned to nodes. This means 254 pods per node. Plenty considering I’m using raspberry pis. For IPv6 we’re breaking up the /96 into /112s. I tried aligning this with the /64 I provided to the k3s installer, but Cilium errored out and wanted a smaller CIDR. I need to dig into this more at some point.
  • ipv(4|6)NativeRoutingCIDR: ("10.42.0.0/16"|"2001:db8:beef:aa42::/96")
    These need to match what we provided in the IPAM section, and tell Cilium to not encapsulate these networks – that the underlying network can route them properly.
  • bpf.vlanBypass
    Only necessary if you’re using VLANs.
  • enableIPv(4|6)Masquerade: false
    Since we’ve got native routing both internally and externally to the cluster, we don’t want Cilium doing any NAT.
  • routingMode: "native"
    We want cilium using the underlying network for routing.
  • extraConfig.enable-ipv6-ndp: "true"
    Always good to be neighbourly.

Deploy cilium and wait for it to finish installing:

helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --values cilium-helm.yaml
kubectl wait --namespace kube-system \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/name=cilium-operator \
  --timeout=120s
kubectl wait --namespace kube-system \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/name=hubble-ui \
  --timeout=120s
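
At this point the nodes should flip to Ready. If you have the cilium CLI installed (optional, and assumed in this sketch), it gives a nice overall health check, and you can also confirm the per-node pod CIDRs that cluster-pool IPAM carved out:

# Overall Cilium health (requires the cilium CLI)
cilium status --wait

# Per-node pod CIDRs allocated by cluster-pool IPAM (one IPv4 /24 and one IPv6 /112 per node)
kubectl get ciliumnodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ipam.podCIDRs}{"\n"}{end}'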

Configure Cilium

Now it’s time to configure the BGP peering by creating a few custom resources. First up is the CiliumBGPClusterConfig, which is the keystone of this operation. Note that it uses the bgp-enabled: "true" selector, which is why we labeled our nodes earlier.

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
  namespace: cilium
spec:
  nodeSelector:
    matchLabels:
      bgp-enabled: "true"
  bgpInstances:
  - name: "64512"
    localASN: 64512
    peers:
    - name: "peer-64513-ipv4"
      peerASN: 64513
      peerAddress: "172.16.10.1"
      peerConfigRef:
        name: "cilium-peer4"
    - name: "peer-64513-ipv6"
      peerASN: 64513
      peerAddress: "2001:db8:beef:aa10::1"
      peerConfigRef:
        name: "cilium-peer6"

Now we need to create a pair of CiliumBGPPeerConfigs, one for IPv4 and one for IPv6:

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  namespace: cilium
  name: cilium-peer4
spec:
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  namespace: cilium
  name: cilium-peer6
spec:
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15
  families:
    - afi: ipv6
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"

Next up is the CiliumBGPAdvertisement resource, which tells Cilium what kinds of resources we want to advertise over BGP. In this case, we’re going to advertise both pods and services. Note, however, that this won’t advertise the ClusterIPs of ordinary services, which live in the k3s internal service range (10.43.0.0/16). The odd-looking NotIn selector below is just a trick to match every service.

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  namespace: cilium
  name: bgp-advertisements
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "PodCIDR"
    - advertisementType: "Service"
      service:
        addresses:
          - ExternalIP
          - LoadBalancerIP
      selector:
        matchExpressions:
          - {key: somekey, operator: NotIn, values: ['never-used-value']}
      attributes:
        communities:
          standard: [ "64512:100" ]

Lastly, we have the CiliumLoadBalancerIPPool resources. These are the IP pools that LoadBalancer services (or services configured with externalIPs) draw their addresses from, and these are the services Cilium will advertise:

apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: public-pool
spec:
  blocks:
    - cidr: 192.0.2.240/29
    - start: 2001:db8:beef:aa01::240
      stop: 2001:db8:beef:aa01::247
  serviceSelector:
    matchLabels:
      network: public
---
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: internal-pool
spec:
  allowFirstLastIPs: "No"
  blocks:
    - cidr: 172.31.0.0/16
    - cidr: 2001:db8:beef:aa31::/64
  serviceSelector:
    matchLabels:
      network: internal

For the public-pool CRD, I’m using my public /29 and a small range of IPv6 addresses. Because Cilium will assign addresses sequentially, this ensures services will (generally) have the same final octet/hextet.

For the internal-pool, I’m setting allowFirstLastIPs: "No", mostly to avoid tripping up devices that get confused when a service lands on a network or broadcast address [ed note: sometimes the author is one such device].
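
Apply all of the above manifests (I’m assuming here you’ve saved them into a single cilium-bgp.yaml – the filename is arbitrary), then check that the peering came up and the pools have addresses to hand out. The cilium CLI subcommand is optional but handy:

# Apply the BGP cluster config, peer configs, advertisement, and IP pools
kubectl apply -f cilium-bgp.yaml

# BGP session state as seen from each node (requires the cilium CLI)
cilium bgp peers

# Pool status: how many IPs are still available in each pool
kubectl get ciliumloadbalancerippools.cilium.io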

Deploying Ingress-Nginx

The last step is to deploy an external service or two. Just be sure to label your services with either network: "public" or network: "internal" and they’ll each be assigned an IP from the relevant pool, which Cilium will then announce over BGP.
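
For example, a minimal LoadBalancer service pulling a dual-stack address from the internal pool might look like this (the whoami name and selector are just placeholders):

apiVersion: v1
kind: Service
metadata:
  name: whoami               # placeholder name
  labels:
    network: internal        # matches the internal-pool serviceSelector; use "public" for the public pool
spec:
  type: LoadBalancer
  ipFamilyPolicy: PreferDualStack
  selector:
    app: whoami              # placeholder pod selector
  ports:
    - name: http
      port: 80
      targetPort: 8080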

In my case, I’m primarily using ingress-nginx, so let’s deploy a pair of ’em, starting with the public one. Here’s my ingress-nginx-helm-public.yaml file (note controller.service.labels.network: “public”):

fullnameOverride: public-nginx
defaultBackend:
  enabled: false
controller:
  ingressClass: public-nginx
  ingressClassResource:
    name: public-nginx
    controllerValue: "k8s.io/public-ingress-nginx"
  publishService:
    enabled: true
  metrics:
    enabled: true
  service:
    labels:
      network: "public"
    ipFamilyPolicy: PreferDualStack

And the internal one:

fullnameOverride: internal-nginx
defaultBackend:
  enabled: false
controller:
  ingressClass: internal-nginx
  ingressClassResource:
    name: internal-nginx
    controllerValue: "k8s.io/internal-ingress-nginx"
  publishService:
    enabled: true
  metrics:
    enabled: true
  service:
    labels:
      network: "internal"
    ipFamilyPolicy: PreferDualStack

And deploy them:

helm upgrade --install internal-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace internal-nginx --create-namespace \
  --values ingress-nginx-helm-internal.yaml
kubectl wait --namespace internal-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=120s
helm upgrade --install public-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace public-nginx --create-namespace \
  --values ingress-nginx-helm-public.yaml
kubectl wait --namespace public-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=120s

And finally, you can confirm they’re deployed:

kubectl get service -A | grep LoadBalancer
internal-nginx   internal-nginx-controller             LoadBalancer   10.43.78.168    172.31.0.1,2001:db8:beef:aa31::1           80:32608/TCP,443:31361/TCP   5d9h
public-nginx     public-nginx-controller               LoadBalancer   10.43.33.98     192.0.2.240,2001:db8:beef:aa01::240   80:31821/TCP,443:31611/TCP   5d9h
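
And, if you’re curious, you can confirm the routes actually landed on the router. This is just a sketch of what I’d check on the FreeBSD/FRR side:

# BGP-learned routes as seen by FRR
sudo vtysh -c 'show ip route bgp'
sudo vtysh -c 'show ipv6 route bgp'

# And confirm they made it into the FreeBSD kernel routing table
netstat -rn -f inet | grep 172.31
netstat -rn -f inet6 | grep 2001:db8:beef:aa31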