Dual Stack K3s With Cilium And BGP

[Image: rudimentary example network topology diagram for the k3s cluster]

In my eternal quest to over-engineer my home network, I decided it was time to rebuild my k3s cluster with the Cilium CNI and dual-stack dynamic routing via BGP.

Network Topology

I’m fortunate in that my ISP offers routed public subnets for a reasonable monthly fee, meaning I have a /29 (8 public IPv4 addresses). However, everything about this setup can be done without it; you’ll just want to put your “public” IPv4 services on your LAN and either forward the necessary ports or run a reverse proxy on your router.

Note that the k3s service network will not be accessible from outside the cluster. Services should be exposed via either the public or internal service network, though the pod subnet will also be routed if you need to reach pods directly. Make sure you’ve properly secured your cluster with network policies! (A minimal example follows the table below.)

Purpose                   IPv4 CIDR        IPv6 CIDR
Public Service Network    192.0.2.240/29   2001:db8:beef:aa01::/64
Internal Service Network  172.31.0.0/16    2001:db8:beef:aa31::/64
Home Network              172.16.2.0/24    2001:db8:beef:aa02::/64
K3s Node Network          172.16.10.0/24   2001:db8:beef:aa10::/64
K3s Pod Network           10.42.0.0/16     2001:db8:beef:aa42::/64
K3s Service Network       10.43.0.0/16     fddd:dead:beef::/64
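
Speaking of network policies: since the pod subnet is routed from the rest of the network, a per-namespace default-deny is one way to keep routed pods from being reached by anything outside the cluster. A minimal sketch of a CiliumNetworkPolicy along those lines (the namespace name is hypothetical):

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-cluster-only
  namespace: example-app    # hypothetical namespace
spec:
  endpointSelector: {}      # applies to every pod in the namespace
  ingress:
    - fromEntities:
        - cluster           # only traffic originating inside the cluster is let in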

BGP with FRR

I use a FreeBSD box as my router, so I’m going to get FRR installed with pkg install frr10. There’s not much to the configuration – BGP is a simple protocol that pre-dates modern ideas of security. Here’s my frr.conf:

frr defaults traditional
log syslog informational
!
router bgp 64513
  bgp router-id 172.16.10.1
  no bgp ebgp-requires-policy
  bgp default ipv4-unicast
  bgp default ipv6-unicast
  neighbor CILIUM4 peer-group
  neighbor CILIUM4 remote-as 64512
  neighbor CILIUM4 soft-reconfiguration inbound
  neighbor CILIUM6 peer-group
  neighbor CILIUM6 remote-as 64512
  neighbor CILIUM6 soft-reconfiguration inbound
  neighbor 172.16.10.20 peer-group CILIUM4
  neighbor 172.16.10.21 peer-group CILIUM4
  neighbor 172.16.10.22 peer-group CILIUM4
  neighbor 172.16.10.23 peer-group CILIUM4
  neighbor 172.16.10.24 peer-group CILIUM4
  neighbor 172.16.10.25 peer-group CILIUM4
  neighbor 2001:db8:beef:aa10::20 peer-group CILIUM6
  neighbor 2001:db8:beef:aa10::21 peer-group CILIUM6
  neighbor 2001:db8:beef:aa10::22 peer-group CILIUM6
  neighbor 2001:db8:beef:aa10::23 peer-group CILIUM6
  neighbor 2001:db8:beef:aa10::24 peer-group CILIUM6
  neighbor 2001:db8:beef:aa10::25 peer-group CILIUM6
!
line vty
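
One gotcha if you haven’t run FRR before: the individual daemons have to be switched on before frr.conf does anything, and the router has to forward packets for both address families. Roughly, on FreeBSD (the daemons file path and rc script name below are my understanding of the frr10 package layout, so treat this as a sketch):

# Enable bgpd in FRR's daemons file (zebra is enabled by default)
sudo sed -i '' 's/^bgpd=no/bgpd=yes/' /usr/local/etc/frr/daemons

# Make sure the box actually forwards IPv4 and IPv6
sudo sysrc gateway_enable=YES ipv6_gateway_enable=YES
sudo sysctl net.inet.ip.forwarding=1 net.inet6.ip6.forwarding=1

# Start FRR and check on the sessions (they'll sit idle until Cilium starts peering)
sudo sysrc frr_enable=YES
sudo service frr start
vtysh -c "show bgp summary"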

Installing K3s Server

Cilium will be taking over for several of the standard parts of the k3s stack, so we need to disable those bits and bobs at install time. Also, I have a local domain wired up to my DHCP server, so I’m going to use a fully qualified domain name. Lastly, we need to ensure the bpf filesystem is mounted. This script will install the k3s server:

#!/bin/bash

# Install k3s server
export K3S_KUBECONFIG_MODE="644"
export INSTALL_K3S_EXEC=" \
    server \
    --flannel-backend=none \
    --disable-network-policy \
    --disable-kube-proxy \
    --disable servicelb \
    --disable traefik \
    --tls-san k3s-server.k3s.example.com \
    --node-label bgp-enabled="true" \
    --cluster-cidr=10.42.0.0/16,2001:db8:beef:aa42::/64 \
    --service-cidr=10.43.0.0/16,fddd:dead:beef::/112"
curl -sfL https://get.k3s.io | sh -s -
curl -k --resolve k3s-server.k3s.example.com:6443:127.0.0.1 https://k3s-server.k3s.example.com:6443/ping

# Prep bpf filesystem
sudo mount bpffs -t bpf /sys/fs/bpf
sudo bash -c 'cat <<EOF >> /etc/fstab
none /sys/fs/bpf bpf rw,relatime 0 0
EOF'
sudo systemctl daemon-reload
sudo systemctl restart local-fs.target

A quick rundown on the options:

  • --flannel-backend=none
    We’ll be installing Cilium, so we don’t want flannel
  • --disable-network-policy
    Cilium has its own network policy enforcement
  • --disable-kube-proxy
    While you can use kube-proxy with Cilium, that seems kinda pointless
  • --disable servicelb
    Cilium has its own load balancer implementation (it used to use MetalLB, but no longer)
  • --disable traefik
    This is personal taste. I prefer to use ingress-nginx, but you’re welcome to use traefik
  • --tls-san k3s-server.k3s.example.com
    Since I’ve got local DNS resolution I’m choosing to use it for the TLS cert
  • --node-label bgp-enabled="true"
    We use this node label to control which nodes will participate in BGP peering
  • --cluster-cidr=10.42.0.0/16,2001:db8:beef:aa42::/64
    This is the pod network range. This will be announced via BGP.
  • --service-cidr=10.43.0.0/16,fddd:dead:beef::/112
    This is the cluster internal service network range. This will NOT be announced via BGP

Once you’ve got your server deployed, you can get your agents deployed with this:

#!/bin/bash

# Install k3s-agent
export K3S_KUBECONFIG_MODE="644"
export K3S_URL="https://k3s-server.k3s.example.com:6443"
export K3S_TOKEN=$(ssh k3s-server.k3s.example.com "sudo cat /var/lib/rancher/k3s/server/node-token")
export INSTALL_K3S_EXEC='--node-label bgp-enabled="true"'
curl -sfL https://get.k3s.io | sh -

# Prep bpf filesystem
sudo mount bpffs -t bpf /sys/fs/bpf
sudo bash -c 'cat <<EOF >> /etc/fstab
none /sys/fs/bpf bpf rw,relatime 0 0
EOF'
sudo systemctl daemon-reload
sudo systemctl restart local-fs.target

Note that your nodes will report as NotReady at this point. That’s expected – they won’t become Ready until we get Cilium deployed.
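
If you want to see that for yourself from the server node, the kubeconfig path below is k3s’ default (and we made it world-readable with K3S_KUBECONFIG_MODE above):

# On the k3s server: nodes will list as NotReady until a CNI is running
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get nodes -o wide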

Install Cilium

Because Cilium uses the eBPF capabilities of recent Linux kernels, which allow network programming to run in kernel space, it is much more efficient than many of the standard Kubernetes networking tools, which run in user space. Hence, my goal with Cilium is to leverage it for as much functionality as I can, such as replacing kube-proxy and load balancers. At the same time, I don’t want to make Cilium do unnecessary work, so I’d rather leverage the native network to avoid encapsulation and network address translation.

I prefer to use helm where I can, so install the cilium helm repo with helm repo add cilium https://helm.cilium.io. Cilium, and thus the helm chart, has a lot of knobs to twiddle, but here’s what I am using:

cni:
  exclusive: false
operator:
  replicas: 1
kubeProxyReplacement: true
k8sServiceHost: "k3s-server.k3s.example.com"
k8sServicePort: 6443
bgpControlPlane:
  enabled: true
ipv4:
  enabled: true
ipv6:
  enabled: true
ipam:
  mode: "cluster-pool"
  operator:
    clusterPoolIPv4PodCIDRList: "10.42.0.0/16"
    clusterPoolIPv6PodCIDRList: "2001:db8:beef:aa42::/96"
    clusterPoolIPv4MaskSize: 24
    clusterPoolIPv6MaskSize: 112
ipv4NativeRoutingCIDR: "10.42.0.0/16"
ipv6NativeRoutingCIDR: "2001:db8:beef:aa42::/96"
bpf:
  #datapathMode: "netkit"
  vlanBypass:
    - 0
    - 10
    - 20
enableIPv4Masquerade: false
enableIPv6Masquerade: false
externalIPs:
  enabled: true
loadBalancer:
  mode: "dsr"
routingMode: "native"
autoDirectNodeRoutes: true
hubble:
  relay:
    enabled: true
  ui:
    enabled: true
extraConfig:
  enable-ipv6-ndp: "true"

The important configuration options are:

  • Really the most critical one here, as it enables BGP functionality:
    • bgpControlPlane.enabled: true
  • Cilium can do kube-proxy’s job, and it can do it much faster:
    • kubeProxyReplacement: true
  • By default, Cilium will encapsulate and NAT traffic as it leaves a node, but the whole premise here is that we’re natively routing the Kubernetes networks:
    • routingMode: "native"
    • enableIPv4Masquerade: false
    • enableIPv6Masquerade: false
    • autoDirectNodeRoutes: true
    • ipv4NativeRoutingCIDR: "10.42.0.0/16"
    • ipv6NativeRoutingCIDR: "2001:db8:beef:aa42::/96"
  • Cilium has a variety of IPAM options, but I want to use cluster scoped direct routing. This is all about the pod networks – service networks will be configured later. We want the CIDRs here to match what we’ve configured for native routing above.
    • ipam.mode: "cluster-pool"
    • ipam.operator.clusterPoolIPv4PodCIDRList: "10.42.0.0/16"
    • ipam.operator.clusterPoolIPv4MaskSize: 24
      For IPv4 we’re breaking up the /16 into /24s, which can be assigned to nodes. This means up to 254 pods per node, which is plenty considering I’m using Raspberry Pis.
    • ipam.operator.clusterPoolIPv6PodCIDRList: "2001:db8:beef:aa42::/96"
    • ipam.operator.clusterPoolIPv6MaskSize: 112
      For IPv6 we’re breaking up the /96 into /112s. I tried aligning this with the /64 I provided to the k3s installer, but Cilium errored out and wanted a smaller CIDR. I need to dig into this more at some point.
  • There’s a few other options that are notable, but not directly relevant to this post:
    • loadBalancer.mode: "dsr"
      Cilium supports several load balancing modes. I recently discovered that I had to adjust the MTU/MSS setting on my router due to issues with some IPv6 traffic, and I intend to test the “hybrid” mode soon and see if that resolves it without MTU changes on my network.
    • bpf.datapathMode: "netkit"
      This is commented out because I’m intending to test it, but I’m including it here because it sounds interesting. It replaces the veth device typically used for pod connectivity with the new netkit driver, which lives in kernel space on the host. It should be more performant.
    • bpf.vlanBypass
      Only necessary if you’re using VLANs.
    • cni.exclusive
      I also deploy Multus, so I don’t want Cilium assuming I’m committed
    • extraConfig.enable-ipv6-ndp: "true"
      It’s always good to be neighbourly. This enables the NDP proxy feature which exposes pod IPv6 addresses on the LAN.

Deploy cilium and wait for it to finish installing:

helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --values cilium-helm.yaml
kubectl wait --namespace kube-system \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/name=cilium-operator \
  --timeout=120s
kubectl wait --namespace kube-system \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/name=hubble-ui \
  --timeout=120s
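
If you have the cilium CLI installed, this is a good point to let it double-check that the agents, operator, and kube-proxy replacement are healthy before moving on (the commands below assume a reasonably recent cilium CLI):

# Wait for Cilium to report everything healthy
cilium status --wait

# The nodes should flip to Ready now that a CNI is present
kubectl get nodes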

Configure Cilium

Now it’s time to configure the BGP peering by creating a few custom resources. First up is the CiliumBGPClusterConfig, which is the keystone of this operation. Note that it uses the bgp-enabled: "true" selector, which is why we labeled our nodes earlier.

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
spec:
  nodeSelector:
    matchLabels:
      bgp-enabled: "true"
  bgpInstances:
  - name: "64512"
    localASN: 64512
    peers:
    - name: "peer-64513-ipv4"
      peerASN: 64513
      peerAddress: "172.16.10.1"
      peerConfigRef:
        name: "cilium-peer4"
    - name: "peer-64513-ipv6"
      peerASN: 64513
      peerAddress: "2001:db8:beef:aa10::1"
      peerConfigRef:
        name: "cilium-peer6"

Now we need to create a pair of CiliumBGPPeerConfigs, one for IPv4 and one for IPv6:

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer4
spec:
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer6
spec:
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15
  families:
    - afi: ipv6
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"

Next up is the CiliumBGPAdvertisement resource, which tells Cilium what kinds of resources we want to advertise over BGP. In this case, we’re going to advertise both pods and services. Note, however, that this won’t advertise standard ClusterIP services, which live in the k3s internal service range (10.43.0.0/16).

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "PodCIDR"
    - advertisementType: "Service"
      service:
        addresses:
          - ExternalIP
          - LoadBalancerIP
      selector:
        matchExpressions:
          - {key: somekey, operator: NotIn, values: ['never-used-value']}   # matches every service, since none will carry this label value
      attributes:
        communities:
          standard: [ "64512:100" ]

Lastly, we have the CiliumLoadBalancerIPPool resources. These are the IP pools that LoadBalancer services (and services configured with externalIPs) can draw from, and these are the services Cilium will advertise:

apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: public-pool
spec:
  blocks:
    - cidr: 192.0.2.240/29
    - start: 2001:db8:beef:aa01::240
      stop: 2001:db8:beef:aa01::247
  serviceSelector:
    matchLabels:
      network: public
---
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: internal-pool
spec:
  allowFirstLastIPs: "No"
  blocks:
    - cidr: 172.31.0.0/16
    - cidr: 2001:db8:beef:aa31::/64
  serviceSelector:
    matchLabels:
      network: internal

For the public-pool, I’m using my public /29 and a small range of IPv6 addresses. Because Cilium assigns addresses sequentially, this ensures services will (generally) have the same final octet/hextet.

For the internal-pool, I’m setting allowFirstLastIPs: "No", mostly to avoid handing out the network and broadcast addresses, which can confuse other devices trying to reach a service on those IPs [ed note: sometimes the author is one such device].
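
With the pools in place the peerings should come up, and there are two easy places to check: the cluster side (again assuming a recent cilium CLI) and the FRR side on the router.

# From the cluster: peering state and what each node is advertising
cilium bgp peers
cilium bgp routes advertised ipv4 unicast
cilium bgp routes advertised ipv6 unicast

# From the router: session state and learned prefixes
vtysh -c "show bgp summary"
vtysh -c "show bgp ipv4 unicast"
vtysh -c "show bgp ipv6 unicast"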

Deploying Ingress-Nginx

The last step is to deploy an external service or two. Just be sure to label each service with either network: "public" or network: "internal", and it will be assigned an IP from the relevant pool, which Cilium will then announce over BGP.
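
For example, a plain LoadBalancer service might look something like this (the name and selector are hypothetical; only the type and the network label matter for pool assignment):

apiVersion: v1
kind: Service
metadata:
  name: whoami                 # hypothetical example service
  labels:
    network: internal          # matched by the internal-pool serviceSelector
spec:
  type: LoadBalancer
  ipFamilyPolicy: PreferDualStack
  selector:
    app: whoami                # hypothetical pod selector
  ports:
    - port: 80
      targetPort: 8080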

In my case, I’m primarily using ingress-nginx, so let’s deploy a pair of ’em, starting with the public one. Here’s my ingress-nginx-helm-public.yaml file (note service.labels.network: “public”):

fullnameOverride: public-nginx
defaultBackend:
  enabled: false
controller:
  ingressClass: public-nginx
  ingressClassResource:
    name: public-nginx
    controllerValue: "k8s.io/public-ingress-nginx"
  publishService:
    enabled: true
  metrics:
    enabled: true
  service:
    labels:
      network: "public"
    ipFamilyPolicy: PreferDualStack

And the internal one:

fullnameOverride: internal-nginx
defaultBackend:
  enabled: false
controller:
  ingressClass: internal-nginx
  ingressClassResource:
    name: internal-nginx
    controllerValue: "k8s.io/internal-ingress-nginx"
  publishService:
    enabled: true
  metrics:
    enabled: true
  service:
    labels:
      network: "internal"
    ipFamilyPolicy: PreferDualStack

And deploy them:

helm upgrade --install internal-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace internal-nginx --create-namespace \
  --values ingress-nginx-helm-internal.yaml
kubectl wait --namespace internal-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=120s
helm upgrade --install public-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace public-nginx --create-namespace \
  --values ingress-nginx-helm-public.yaml
kubectl wait --namespace public-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=120s

And finally, you can confirm they’re deployed:

kubectl get service -A | grep LoadBalancer
internal-nginx   internal-nginx-controller             LoadBalancer   10.43.78.168    172.31.0.1,2001:db8:beef:aa31::1            80:32608/TCP,443:31361/TCP   5d9h
public-nginx     public-nginx-controller               LoadBalancer   10.43.33.98     192.0.2.240,2001:db8:beef:aa01::240   80:31821/TCP,443:31611/TCP   5d9h
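
And, as one last sanity check from a machine on the home network, both address families should get an answer from nginx (a 404 is expected, since there are no Ingress resources behind it yet):

curl -4 -i http://172.31.0.1/
curl -6 -i http://[2001:db8:beef:aa31::1]/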