kube-proxy networking

    kube-proxy explained

    Intro

    kube-proxy is a cluster component responsible for routing network traffic. Because of that, one instance runs on each cluster node (typically as a DaemonSet).

    It is responsible for routing traffic between cluster components, as well as traffic coming in from outside the cluster.

    It essentially implements the rules declared by Service objects: a Service describes how traffic should reach a set of Pods, and kube-proxy turns that description into concrete forwarding rules on every node.
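
    On most clusters kube-proxy runs as a DaemonSet, so the one-instance-per-node setup can be verified with a quick listing (the label k8s-app=kube-proxy is the one used by kubeadm-based clusters; other distributions may label the Pods differently):

    kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide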

    kube-proxy operating modes

    kube-proxy can implement network traffic rules in three different ways:

    • iptables (default)
    • userspace (old, deprecated)
    • IPVS (IP Virtual Server)

    This page focuses on iptables mode.
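
    To double-check which mode a running kube-proxy instance is actually using, you can query its metrics endpoint on a node or, on kubeadm-based clusters, inspect the kube-proxy ConfigMap (both commands below assume default ports and names):

    curl -s http://localhost:10249/proxyMode
    kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"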

    kube-proxy – iptables mode

    In iptables mode, whenever a Service is created, kube-proxy creates the related iptables rules on each node.

    Such rules hang off the PREROUTING chain of the nat table, which means traffic is redirected as soon as it enters the kernel's network stack, before any routing decision is made.

    Listing the rules of the iptables PREROUTING chain

    sudo iptables -t nat -L PREROUTING | column -t

    Example:

    root@test:~# sudo iptables -t nat -L PREROUTING | column -t
    Chain            PREROUTING  (policy  ACCEPT)
    target           prot        opt      source    destination
    cali-PREROUTING  all         --       anywhere  anywhere     /*        cali:6gwbT8clXdHdC1b1  */
    KUBE-SERVICES    all         --       anywhere  anywhere     /*        kubernetes             service   portals  */
    DOCKER           all         --       anywhere  anywhere     ADDRTYPE  match                  dst-type  LOCAL

    Listing all rules that are part of a given chain

    sudo iptables -t nat -L KUBE-SERVICES -n  | column -t

    For a better understanding, let’s consider the following example:

    A new NodePort Service has been created with the following command:

    kubectl expose deployment prometheus-grafana --type=NodePort --name=grafana-example-service -n monitoring

    Executing the command above created a new Service:

    [test@test ~]$ kubectl get svc/grafana-example-service -n monitoring
    NAME                      TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
    grafana-example-service   NodePort   10.111.189.177   <none>        3000:31577/TCP   100m

    We did not specify a node port, so one was automatically assigned from the default range (30000-32767): 31577.
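
    The range itself is controlled by the kube-apiserver flag --service-node-port-range. On kubeadm-based clusters you can check whether it has been customised like this (if the flag does not appear, the default 30000-32767 applies):

    kubectl -n kube-system get pod -l component=kube-apiserver -o yaml | grep service-node-port-range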

    YAML manifest of the Service object created by the command above:

    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app.kubernetes.io/instance: prometheus
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: grafana
        app.kubernetes.io/version: 9.2.4
        helm.sh/chart: grafana-6.43.5
      name: grafana-example-service
      namespace: monitoring
    spec:
      clusterIP: 10.111.189.177
      clusterIPs:
      - 10.111.189.177
      externalTrafficPolicy: Cluster
      internalTrafficPolicy: Cluster
      ipFamilies:
      - IPv4
      ipFamilyPolicy: SingleStack
      ports:
      - nodePort: 31577
        port: 3000
        protocol: TCP
        targetPort: 3000
      selector:
        app.kubernetes.io/instance: prometheus
        app.kubernetes.io/name: grafana
      sessionAffinity: None
      type: NodePort
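
    The manifest above can be retrieved at any time from the live object:

    kubectl get svc/grafana-example-service -n monitoring -o yaml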

    Assigning a custom nodePort

    If you want to expose the Service on a custom node port, patch or edit the Service object and change the value of spec.ports[].nodePort, as shown below. The chosen port must fall within the configured NodePort range (30000-32767 by default).
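
    A minimal sketch of such a patch, assuming we want to move the example Service to node port 31000 (any free port within the allowed range works; the ports entry is matched by its port value):

    kubectl patch svc grafana-example-service -n monitoring -p '{"spec":{"ports":[{"port":3000,"nodePort":31000}]}}'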

    Once the Service was created, we were able to reach Grafana at the following URL: http://<NODE_IP_ADDRESS>:31577

    This is made possible by kube-proxy.

    When the Service grafana-example-service was created, kube-proxy created iptables rules within the KUBE-SERVICES chain (which is referenced from PREROUTING), as well as in the KUBE-NODEPORTS chain, which collects the rules related to all NodePort Services:

    sudo iptables -t nat -L KUBE-SERVICES -n  | column -t
    Chain                      KUBE-SERVICES  (2   references)
    target                     prot           opt  source       destination
    KUBE-SVC-MDD5UT6CKUVXRUP3  tcp            --   0.0.0.0/0    10.98.226.44    /*  loki/loki-write:http-metrics                                      cluster   IP          */     tcp   dpt:3100
    KUBE-SVC-FJOCBQUA67AJTJ4Y  tcp            --   0.0.0.0/0    10.103.120.150  /*  loki/loki-read:grpc                                               cluster   IP          */     tcp   dpt:9095
    KUBE-SVC-GWDJ4KONO5OOHRT4  tcp            --   0.0.0.0/0    10.106.191.67   /*  loki/loki-gateway:http                                            cluster   IP          */     tcp   dpt:80
    KUBE-SVC-XBIRSKPJDNCMT43V  tcp            --   0.0.0.0/0    10.111.129.177  /*  metallb-system/webhook-service                                    cluster   IP          */     tcp   dpt:443
    KUBE-SVC-UZFDVIVO2N6QXLRQ  tcp            --   0.0.0.0/0    10.103.243.43   /*  monitoring/prometheus-kube-prometheus-operator:https              cluster   IP          */     tcp   dpt:443
    KUBE-SVC-L5JLFDCUFDUOSAFE  tcp            --   0.0.0.0/0    10.96.126.22    /*  monitoring/prometheus-grafana:http-web                            cluster   IP          */     tcp   dpt:80
    KUBE-SVC-NPX46M4PTMTKRN6Y  tcp            --   0.0.0.0/0    10.96.0.1       /*  default/kubernetes:https                                          cluster   IP          */     tcp   dpt:443
    KUBE-SVC-OIUYAK75OI4PJHUN  tcp            --   0.0.0.0/0    10.111.189.177  /*  monitoring/grafana-example-service                                cluster   IP          */     tcp   dpt:3000
    KUBE-SVC-FP56U3IB7O2NDDFT  tcp            --   0.0.0.0/0    10.108.50.82    /*  monitoring/prometheus-kube-prometheus-alertmanager:http-web       cluster   IP          */     tcp   dpt:9093
    KUBE-SVC-TCOU7JCQXEZGVUNU  udp            --   0.0.0.0/0    10.96.0.10      /*  kube-system/kube-dns:dns                                          cluster   IP          */     udp   dpt:53
    KUBE-SVC-JD5MR3NA4I4DYORP  tcp            --   0.0.0.0/0    10.96.0.10      /*  kube-system/kube-dns:metrics                                      cluster   IP          */     tcp   dpt:9153
    KUBE-NODEPORTS             all            --   0.0.0.0/0    0.0.0.0/0       /*  kubernetes                                                        service   nodeports;

    iptables evaluates the rules of a chain sequentially, from top to bottom.

    Rules must be interpreted like this:

    • target: What to do whenever a given packet matches all the conditions of the rule (the target can be another chain or an action)
    • prot: The protocol
    • source: Source IP address of packet
    • destination: Destination IP address of packet
    • dpt: Destination port of packet

    Example:

    Consider the following rule:

    target                     prot           opt  source       destination
    KUBE-SVC-OIUYAK75OI4PJHUN  tcp            --   0.0.0.0/0    10.111.189.177  /*  monitoring/grafana-example-service                                cluster   IP          */     tcp   dpt:3000

    Interpreting the rule

    IF transmission protocol = tcp AND
    source IP address is anything (0.0.0.0/0 = ANY) AND
    destination IP address is 10.111.189.177 AND
    destination port is 3000
    THEN
    jump to chain KUBE-SVC-OIUYAK75OI4PJHUN
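
    You can verify that such a rule is actually being hit by listing the chain with packet and byte counters (-v):

    sudo iptables -t nat -L KUBE-SERVICES -n -v | column -t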

    Moving on with our sample Service: when it was created, the two following rules were instantiated by kube-proxy:

    Chain                      KUBE-SERVICES  (2   references)
    target                     prot           opt  source       destination
    KUBE-SVC-OIUYAK75OI4PJHUN  tcp            --   0.0.0.0/0    10.111.189.177  /*  monitoring/grafana-example-service                                cluster   IP          */     tcp   dpt:3000
    KUBE-NODEPORTS             all            --   0.0.0.0/0    0.0.0.0/0       /*  kubernetes                                                        service   nodeports;

    The first rule listed above jumps to the chain KUBE-SVC-OIUYAK75OI4PJHUN, which consists of the following rules:

    [test@test~]$ sudo iptables -t nat -L KUBE-SVC-OIUYAK75OI4PJHUN -n  | column -t
    Chain                      KUBE-SVC-OIUYAK75OI4PJHUN  (2   references)
    target                     prot                       opt  source          destination
    KUBE-MARK-MASQ             tcp                        --   !10.244.0.0/16  10.111.189.177  /*  monitoring/grafana-example-service  cluster  IP                 */  tcp  dpt:3000
    KUBE-SEP-LAT64KIID4KEQMCP  all                        --   0.0.0.0/0       0.0.0.0/0       /*  monitoring/grafana-example-service  ->       10.244.0.115:3000  */

    The first rule (target KUBE-MARK-MASQ) marks the TCP packet as “must go through IP masquerading” whenever the source IP address does NOT belong to 10.244.0.0/16 (in short, whenever the traffic does not originate from the cluster's Pod network) AND the destination address is 10.111.189.177 AND the destination port is 3000.

    Processing then continues with the second rule, which jumps to the chain KUBE-SEP-LAT64KIID4KEQMCP.
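
    The KUBE-MARK-MASQ target does not rewrite anything by itself: it only sets a firewall mark on the packet (0x4000 with the default kube-proxy configuration); the actual source NAT is performed later, in the POSTROUTING stage. Both chains can be inspected directly:

    sudo iptables -t nat -L KUBE-MARK-MASQ -n | column -t
    sudo iptables -t nat -L KUBE-POSTROUTING -n | column -t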

    Chain KUBE-SEP-LAT64KIID4KEQMCP consists of the following rules:

    [test@test ~]$ sudo iptables -t nat -L KUBE-SEP-LAT64KIID4KEQMCP -n  | column -t
    Chain           KUBE-SEP-LAT64KIID4KEQMCP  (1   references)
    target          prot                       opt  source        destination
    KUBE-MARK-MASQ  all                        --   10.244.0.115  0.0.0.0/0    /*  monitoring/grafana-example-service  */
    DNAT            tcp                        --   0.0.0.0/0     0.0.0.0/0    /*  monitoring/grafana-example-service  */  tcp  to:10.244.0.115:3000

    Which means:

    IF the source address is 10.244.0.115, i.e. the packet comes from the backend Pod itself (the hairpin case, where a Pod talks to its own Service and is load-balanced back to itself), mark the packet for IP masquerading, regardless of the destination IP address.
    THEN, in any case, execute DNAT (Destination Network Address Translation) and forward the packet to 10.244.0.115:3000
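
    One way to observe this DNAT in practice is to look at the kernel's connection tracking table while a request is in flight; this assumes the conntrack CLI is installed on the node:

    sudo conntrack -L -p tcp --dst 10.111.189.177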

    Without masquerading, the Pod would reply directly to an external source IP address that does not belong to the cluster's internal network; the client would then receive packets from an address it never contacted (the Pod IP instead of the node IP) and discard them. That is why IP masquerading is required in this case.

    When the first rule does not match, i.e. the source IP address already belongs to the internal cluster network, no IP masquerading is required; either way, processing continues with the second rule, which jumps to chain KUBE-SEP-LAT64KIID4KEQMCP.

    This chain simply forwards the packet to 10.244.0.115:3000, which is the Grafana Pod's IP address and port:

    [test@test ~]$ kubectl describe pods/prometheus-grafana-5f848c4987-btg95 -n monitoring
    Name:             prometheus-grafana-5f848c4987-btg95
    Namespace:        monitoring
    Priority:         0
    Service Account:  prometheus-grafana
    Node:             cc-sauron/172.25.50.60
    Start Time:       Wed, 16 Nov 2022 15:10:34 +0100
    Labels:           app.kubernetes.io/instance=prometheus
                      app.kubernetes.io/name=grafana
                      pod-template-hash=5f848c4987
    Annotations:      checksum/config: b9e953e845ac788d3c1ac8894062e8234ed2fd5b5ca91d5908976c4daf5c4bb8
                      checksum/dashboards-json-config: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
                      checksum/sc-dashboard-provider-config: fbdb192757901cdc4f977c611f5a1dceb959a1aa2df9f92542a0c410ce3be49d
                      checksum/secret: 12768ec288da87f3603cb2ca6c39ebc1ce1c2f42e0cee3d9908ba1463576782a
    Status:           Running
    IP:               10.244.0.115
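
    The same Pod IP can be confirmed with a wide listing, using the labels shown above as a selector:

    kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana,app.kubernetes.io/instance=prometheus -o wide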

    The traffic therefore eventually reaches the Pod, either after IP masquerading (re-mapping of the source IP address) or directly, depending on the original source IP address.

    Because of these rules, whenever a client connects to http://<NODE_IP_ADDRESS>:31577, traffic is forwarded to the Grafana Pod even though no process is listening on that port on the node.
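
    The node-port side of this path is handled by the KUBE-NODEPORTS chain referenced from KUBE-SERVICES: it contains one rule per NodePort Service, matching the node port (31577 here) and jumping to the corresponding service chain. The exact target chain names vary between kube-proxy versions, but the chain can be listed the same way as the others:

    sudo iptables -t nat -L KUBE-NODEPORTS -n | column -t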

    Should any process open a socket and bind it to the same port (31577 in this case), the Pod would still receive all incoming traffic directed to that port, since the iptables rules are applied as soon as the packet enters the kernel, before it could ever reach a local socket.
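
    Whether a socket shows up at all for the node port depends on the kube-proxy version and configuration; you can check on the node with a standard socket listing (no output simply means nothing is bound to that port):

    sudo ss -lntp | grep 31577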

    We can summarise the traffic flow from external systems like this:

    external client → <NODE_IP_ADDRESS>:31577 → PREROUTING → KUBE-SERVICES → KUBE-NODEPORTS → KUBE-SVC-OIUYAK75OI4PJHUN → KUBE-SEP-LAT64KIID4KEQMCP (DNAT, plus masquerading when needed) → Pod 10.244.0.115:3000
