    Dual Stack K3s With Cilium And BGP

    In my eternal quest to over-engineer my home network, I decided it was time to rebuild my k3s cluster with cilium CNI and dual-stack dynamic routing via BGP.

    Network Topology

    I’m fortunate in that my ISP offers routed public subnets for a reasonable monthly fee, meaning I have a /29 (8 public IPv4 addresses). However, everything about this setup can be done without it; you’ll just want to put your “public” IPv4 services on your LAN and either forward the necessary ports or run a reverse proxy on your router.

    Note that the k3s service network will not be accessible from outside the cluster. Services should be exposed via either the public or internal service network, though the pod subnet will also be routed if you need to reach pods directly. Make sure you’ve properly secured your cluster with network policies (there’s a minimal example after the table below)!

    Purpose                    IPv4 CIDR         IPv6 CIDR
    Public Service Network     192.0.2.240/29    2001:db8:beef:aa01::/64
    Internal Service Network   172.31.0.0/16     2001:db8:beef:aa31::/64
    Home Network               172.16.2.0/24     2001:db8:beef:aa02::/64
    K3s Node Network           172.16.10.0/24    2001:db8:beef:aa10::/64
    K3s Pod Network            10.42.0.0/16      2001:db8:beef:aa42::/64
    K3s Service Network        10.43.0.0/16      fddd:dead:beef::/112
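
    On that last point: since the pod and service subnets are routed from the rest of the network, a per-namespace default-deny is a sensible baseline before exposing anything. A minimal sketch using a standard Kubernetes NetworkPolicy (which Cilium enforces) might look like this; the my-apps namespace is just a placeholder:

    # Deny all ingress to pods in the my-apps namespace unless another policy allows it
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny-ingress
      namespace: my-apps
    spec:
      podSelector: {}
      policyTypes:
        - Ingress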

    BGP with FRR

    I use a FreeBSD box as my router, so I’m going to get FRR installed with pkg install frr10. There’s not much to the configuration – BGP is a simple protocol that pre-dates modern ideas of security. Here’s my frr.conf:

    frr defaults traditional
    log syslog informational
    !
    router bgp 64513
      bgp router-id 172.16.10.1
      no bgp ebgp-requires-policy
      bgp default ipv4-unicast
      bgp default ipv6-unicast
      neighbor CILIUM4 peer-group
      neighbor CILIUM4 remote-as 64512
      neighbor CILIUM4 soft-reconfiguration inbound
      neighbor CILIUM6 peer-group
      neighbor CILIUM6 remote-as 64512
      neighbor CILIUM6 soft-reconfiguration inbound
      neighbor 172.16.10.20 peer-group CILIUM4
      neighbor 172.16.10.21 peer-group CILIUM4
      neighbor 172.16.10.22 peer-group CILIUM4
      neighbor 172.16.10.23 peer-group CILIUM4
      neighbor 172.16.10.24 peer-group CILIUM4
      neighbor 172.16.10.25 peer-group CILIUM4
      neighbor 2001:db8:beef:aa10::20 peer-group CILIUM6
      neighbor 2001:db8:beef:aa10::21 peer-group CILIUM6
      neighbor 2001:db8:beef:aa10::22 peer-group CILIUM6
      neighbor 2001:db8:beef:aa10::23 peer-group CILIUM6
      neighbor 2001:db8:beef:aa10::24 peer-group CILIUM6
      neighbor 2001:db8:beef:aa10::25 peer-group CILIUM6
    !
    line vty
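
    With the config in place, enable and start FRR, then (once the Cilium side is configured later on) check that the sessions establish. The service name and rc variable below assume the stock rc script shipped with the frr10 package:

    # Enable the daemons you need (bgpd) in /usr/local/etc/frr/daemons, then:
    sysrc frr_enable=YES
    service frr start

    # After Cilium starts peering, the neighbors should show as Established:
    vtysh -c 'show bgp summary'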

    Installing K3s Server

    Cilium will be taking over for several of the standard parts of the k3s stack, so we need to disable those bits and bobs at install time. Also, I have a local domain wired up to my DHCP server, so I’m going to use a fully qualified domain name. Lastly, we need to make sure the bpf filesystem is mounted. This script will install the k3s server:

    #!/bin/bash
    
    # Install k3s server
    export K3S_KUBECONFIG_MODE="644"
    export INSTALL_K3S_EXEC=" \
        server \
        --flannel-backend=none \
        --disable-network-policy \
        --disable-kube-proxy \
        --disable servicelb \
        --disable traefik \
        --tls-san k3s-server.k3s.example.com \
        --node-label bgp-enabled="true" \
        --cluster-cidr=10.42.0.0/16,2001:db8:beef:aa42::/64 \
        --service-cidr=10.43.0.0/16,fddd:dead:beef::/112"
    curl -sfL https://get.k3s.io | sh -s -
    curl -k --resolve k3s-server.k3s.example.com:6443:127.0.0.1 https://k3s-server.k3s.example.com:6443/ping
    
    # Prep bpf filesystem
    sudo mount bpffs -t bpf /sys/fs/bpf
    sudo bash -c 'cat <<EOF >> /etc/fstab
    none /sys/fs/bpf bpf rw,relatime 0 0
    EOF'
    sudo systemctl daemon-reload
    sudo systemctl restart local-fs.target

    A quick rundown on the options:

    • --flannel-backend=none
      We’ll be installing Cilium, so we don’t want flannel
    • --disable-network-policy
      Cilium has its own network policy enforcement
    • --disable-kube-proxy
      While you can use kube-proxy with Cilium, that seems kinda pointless
    • --disable servicelb
      Cilium has its own load balancer implementation (it used to use metallb, but no longer)
    • --disable traefik
      This is personal taste. I prefer to use ingress-nginx, but you’re welcome to use traefik
    • --tls-san k3s-server.k3s.example.com
      Since I’ve got local DNS resolution I’m choosing to use it for the TLS cert
    • --node-label bgp-enabled="true"
      We use this node label to control which nodes will participate in BGP peering
    • --cluster-cidr=10.42.0.0/16,2001:db8:beef:aa42::/64
      This is the pod network range. This will be announced via BGP.
    • --service-cidr=10.43.0.0/16,fddd:dead:beef::/112
      This is the cluster-internal service network range. This will NOT be announced via BGP.

    Once you’ve got your server deployed, you can add your agents with this:

    #!/bin/bash
    
    # Install k3s-agent
    export K3S_KUBECONFIG_MODE="644"
    export K3S_URL="https://k3s-server.k3s.example.com:6443"
    export K3S_TOKEN=$(ssh k3s-server.k3s.example.com "sudo cat /var/lib/rancher/k3s/server/node-token")
    export INSTALL_K3S_EXEC='--node-label bgp-enabled="true"'
    curl -sfL https://get.k3s.io | sh -
    
    # Prep bpf filesystem
    sudo mount bpffs -t bpf /sys/fs/bpf
    sudo bash -c 'cat <<EOF >> /etc/fstab
    none /sys/fs/bpf bpf rw,relatime 0 0
    EOF'
    sudo systemctl daemon-reload
    sudo systemctl restart local-fs.target

    Note that your nodes will report as NotReady at this point. That’s expected – they won’t go Ready until we get Cilium deployed.
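
    A quick check that everything has joined and is labeled correctly (the -L flag adds a column showing the bgp-enabled label):

    kubectl get nodes -o wide
    kubectl get nodes -L bgp-enabled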

    Install Cilium

    Because Cilium uses the eBPF capabilities of recent Linux kernels, which allow network programming to run in kernel space, it is much more efficient than many of the standard kubernetes networking tools, which run in user space. Hence, my goal with Cilium is to leverage it for as much functionality as I can, such as replacing kube-proxy and load balancers. At the same time, I don’t want to make Cilium do unnecessary work, so I’ll leverage the native network to avoid encapsulation and network address translation.

    I prefer to use helm where I can, so install the cilium helm repo with helm repo add cilium https://helm.cilium.io. Cilium, and thus the helm chart, has a lot of knobs to twiddle, but here’s what I am using:

    cni:
      exclusive: false
    operator:
      replicas: 1
    kubeProxyReplacement: true
    k8sServiceHost: "k3s-server.k3s.example.com"
    k8sServicePort: 6443
    bgpControlPlane:
      enabled: true
    ipv4:
      enabled: true
    ipv6:
      enabled: true
    ipam:
      mode: "cluster-pool"
      operator:
        clusterPoolIPv4PodCIDRList: "10.42.0.0/16"
        clusterPoolIPv6PodCIDRList: "2001:db8:beef:aa42::/96"
        clusterPoolIPv4MaskSize: 24
        clusterPoolIPv6MaskSize: 112
    ipv4NativeRoutingCIDR: "10.42.0.0/16"
    ipv6NativeRoutingCIDR: "2001:db8:beef:aa42::/96"
    bpf:
      #datapathMode: "netkit"
      vlanBypass:
        - 0
        - 10
        - 20
    enableIPv4Masquerade: false
    enableIPv6Masquerade: false
    externalIPs:
      enabled: true
    loadBalancer:
      mode: "dsr"
    routingMode: "native"
    autoDirectNodeRoutes: true
    hubble:
      relay:
        enabled: true
      ui:
        enabled: true
    extraConfig:
      enable-ipv6-ndp: "true"

    The important configuration options are:

    • Really the most critical one here, it enables BGP functionality:
      • bgpControlPlane.enabled: true
    • Cilium can do kube-proxy’s job, and it can do it much faster:
      • kubeProxyReplacement: true
    • By default, Cilium will use encapsulation and NAT for traffic as it leaves a node, but the whole premise here is that we’re natively routing the kubernetes networks:
      • routingMode: "native"
      • enableIPv4Masquerade: false
      • enableIPv6Masquerade: false
      • autoDirectNodeRoutes: true
      • ipv4NativeRoutingCIDR: "10.42.0.0/16"
      • ipv6NativeRoutingCIDR: "2001:db8:beef:aa42::/96"
    • Cilium has a variety of IPAM options, but I want to use cluster scoped direct routing. This is all about the pod networks – service networks will be configured later. We want the CIDRs here to match what we’ve configured for native routing above.
      • ipam.mode: "cluster-pool"
      • ipam.operator.clusterPoolIPv4PodCIDRList: "10.42.0.0/16"
      • ipam.operator.clusterPoolIPv4MaskSize: 24
        For IPv4 we’re breaking up the /16 into /24s which can be assigned to nodes. This means 254 pods per node. Plenty considering I’m using raspberry pis.
      • ipam.operator.clusterPoolIPv6PodCIDRList: "2001:db8:beef:aa42::/96"
      • ipam.operator.clusterPoolIPv6MaskSize: 112
        For IPv6 we’re breaking up the /96 into /112s. I tried aligning this with the /64 I provided to the k3s installer, but Cilium errored out and wanted a smaller CIDR. I need to dig into this more at some point.
    • There are a few other options that are notable, but not directly relevant to this post:
      • loadBalancer.mode: "dsr"
        Cilium supports several load balancing modes. I recently discovered that I had to adjust the MTU/MSS setting on my router due to issues with some IPv6 traffic, and I intend to test the “hybrid” mode soon and see if that resolves it without MTU changes on my network.
      • bpf.datapathMode: "netkit"
        This is commented out because I’m intending to test it, but I’m including it here because it sounds interesting. It replaces the veth devices typically used for pod connectivity with the new netkit driver, which lives in kernel space on the host. It should be more performant.
      • bpf.vlanBypass
        Only necessary if you’re using VLANs.
      • cni.exclusive
        I also deploy Multus, so I don’t want Cilium assuming I’m committed
      • extraConfig.enable-ipv6-ndp: "true"
        It’s always good to be neighbourly. This enables the NDP proxy feature which exposes pod IPv6 addresses on the LAN.

    Deploy cilium and wait for it to finish installing:

    helm upgrade --install cilium cilium/cilium \
      --namespace kube-system \
      --values cilium-helm.yaml
    kubectl wait --namespace kube-system \
      --for=condition=ready pod \
      --selector=app.kubernetes.io/name=cilium-operator \
      --timeout=120s
    kubectl wait --namespace kube-system \
      --for=condition=ready pod \
      --selector=app.kubernetes.io/name=hubble-ui \
      --timeout=120s
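
    Once the rollout finishes, it’s worth a quick sanity check that the agents are healthy and the nodes have gone Ready (the last command assumes you have the cilium CLI installed locally, which is optional):

    kubectl get nodes
    kubectl -n kube-system get pods -l k8s-app=cilium
    cilium status --wait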

    Configure Cilium

    Now it’s time to configure the BGP peering by creating some CRDs. First up is the CiliumBGPClusterConfig, which is the keystone of this operation. Note that it uses the bgp-enabled: "true" selector, which is why we labeled our nodes earlier.

    apiVersion: cilium.io/v2alpha1
    kind: CiliumBGPClusterConfig
    metadata:
      name: cilium-bgp
      namespace: cilium
    spec:
      nodeSelector:
        matchLabels:
          bgp-enabled: "true"
      bgpInstances:
      - name: "64512"
        localASN: 64512
        peers:
        - name: "peer-64513-ipv4"
          peerASN: 64513
          peerAddress: "172.16.10.1"
          peerConfigRef:
            name: "cilium-peer4"
        - name: "peer-64513-ipv6"
          peerASN: 64513
          peerAddress: "2001:db8:beef:aa10::1"
          peerConfigRef:
            name: "cilium-peer6"

    Now we need to create a pair of CiliumBGPPeerConfigs, one for IPv4 and one for IPv6:

    apiVersion: cilium.io/v2alpha1
    kind: CiliumBGPPeerConfig
    metadata:
      namespace: cilium
      name: cilium-peer4
    spec:
      gracefulRestart:
        enabled: true
        restartTimeSeconds: 15
      families:
        - afi: ipv4
          safi: unicast
          advertisements:
            matchLabels:
              advertise: "bgp"
    ---
    apiVersion: cilium.io/v2alpha1
    kind: CiliumBGPPeerConfig
    metadata:
      namespace: cilium
      name: cilium-peer6
    spec:
      gracefulRestart:
        enabled: true
        restartTimeSeconds: 15
      families:
        - afi: ipv6
          safi: unicast
          advertisements:
            matchLabels:
              advertise: "bgp"

    Next up is the CiliumBGPAdvertisement CRD, which tells Cilium what kind of resources we want to advertise over BGP. In this case, we’re going to advertise both the pods and services. Note, however, that this won’t advertise ordinary ClusterIP services, which live in the k3s internal service range (10.43.0.0/16) – only external and load balancer IPs get announced.

    apiVersion: cilium.io/v2alpha1
    kind: CiliumBGPAdvertisement
    metadata:
      namespace: cilium
      name: bgp-advertisements
      labels:
        advertise: bgp
    spec:
      advertisements:
        - advertisementType: "PodCIDR"
        - advertisementType: "Service"
          service:
            addresses:
              - ExternalIP
              - LoadBalancerIP
          selector:
            matchExpressions:
              - {key: somekey, operator: NotIn, values: ['never-used-value']}
          attributes:
            communities:
              standard: [ "64512:100" ]

    Lastly, we have the CiliumLoadBalancerIPPool CRDs. These are the IP pools that LoadBalancer services (or services configured with externalIPs) can draw from, and these are the services Cilium will advertise:

    apiVersion: cilium.io/v2alpha1
    kind: CiliumLoadBalancerIPPool
    metadata:
      name: public-pool
    spec:
      blocks:
        - cidr: 192.0.2.240/29
        - start: 2001:db8:beef:aa01::240
          stop: 2001:db8:beef:aa01::247
      serviceSelector:
        matchLabels:
          network: public
    ---
    apiVersion: cilium.io/v2alpha1
    kind: CiliumLoadBalancerIPPool
    metadata:
      name: internal-pool
    spec:
      allowFirstLastIPs: "No"
      blocks:
        - cidr: 172.31.0.0/16
        - cidr: 2001:db8:beef:aa31::/64
      serviceSelector:
        matchLabels:
          network: internal

    For the public-pool CRD, I’m using my public /29 and a small range of IPv6 addresses. Because Cilium will assign addresses sequentially, this ensures services will (generally) have the same final octet/hextet.

    For the internal-pool, I’m setting allowFirstLastIPs: "No", mostly to keep services off the network and broadcast addresses, which tends to confuse other devices [ed note: sometimes the author is one such device].
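
    At this point the BGP sessions should be up and the pod and service routes should be visible from the router. On the FRR side, a quick sanity check looks something like this:

    vtysh -c 'show bgp summary'
    vtysh -c 'show bgp ipv4 unicast'
    vtysh -c 'show bgp ipv6 unicast'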

    Deploying Ingress-Nginx

    The last step is to deploy an external service or two. Just be sure to label your services with either network: "public" or network: "internal", and they’ll be assigned an IP from the relevant pool, which Cilium will then announce over BGP.
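
    For example, a plain LoadBalancer Service drawing from the internal pool might look like this minimal sketch (the demo name, selector, and ports are placeholders):

    apiVersion: v1
    kind: Service
    metadata:
      name: demo
      labels:
        network: internal
    spec:
      type: LoadBalancer
      ipFamilyPolicy: PreferDualStack
      selector:
        app: demo
      ports:
        - name: http
          port: 80
          targetPort: 8080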

    In my case, I’m primarily using ingress-nginx, so let’s deploy a pair of ’em, starting with the public one. Here’s my ingress-nginx-helm-public.yaml file (note service.labels.network: “public”):

    fullnameOverride: public-nginx
    defaultBackend:
      enabled: false
    controller:
      ingressClass: public-nginx
      ingressClassResource:
        name: public-nginx
        controllerValue: "k8s.io/public-ingress-nginx"
      publishService:
        enabled: true
      metrics:
        enabled: true
      service:
        labels:
          network: "public"
        ipFamilyPolicy: PreferDualStack

    And the internal one:

    fullnameOverride: internal-nginx
    defaultBackend:
      enabled: false
    controller:
      ingressClass: internal-nginx
      ingressClassResource:
        name: internal-nginx
        controllerValue: "k8s.io/internal-ingress-nginx"
      publishService:
        enabled: true
      metrics:
        enabled: true
      service:
        labels:
          network: "internal"
        ipFamilyPolicy: PreferDualStack

    And deploy them:

    helm upgrade --install internal-nginx ingress-nginx \
      --repo https://kubernetes.github.io/ingress-nginx \
      --namespace internal-nginx --create-namespace \
      --values ingress-nginx-helm-internal.yaml
    kubectl wait --namespace internal-nginx \
      --for=condition=ready pod \
      --selector=app.kubernetes.io/component=controller \
      --timeout=120s
    helm upgrade --install public-nginx ingress-nginx \
      --repo https://kubernetes.github.io/ingress-nginx \
      --namespace public-nginx --create-namespace \
      --values ingress-nginx-helm-public.yaml
    kubectl wait --namespace public-nginx \
      --for=condition=ready pod \
      --selector=app.kubernetes.io/component=controller \
      --timeout=120s

    And finally, you can confirm they’re deployed:

    kubectl get service -A | grep LoadBalancer
    internal-nginx   internal-nginx-controller   LoadBalancer   10.43.78.168   172.31.0.1,2001:db8:beef:aa31::1      80:32608/TCP,443:31361/TCP   5d9h
    public-nginx     public-nginx-controller     LoadBalancer   10.43.33.98    192.0.2.240,2001:db8:beef:aa01::240   80:31821/TCP,443:31611/TCP   5d9h
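
    And as a final smoke test, hitting the internal VIP from a machine on the LAN should get a response from nginx over both address families (an HTTP 404 is expected, since no Ingress resources have been created yet):

    curl -4 http://172.31.0.1/
    curl -6 'http://[2001:db8:beef:aa31::1]/'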