
Bootstrapping a Kubernetes Cluster (The DevOps Journey - 2)

Setting up a K3S cluster.

Throughout this series I'm going to be using K3S, a "lightweight" Kubernetes distribution. It's almost fully compatible with upstream Kubernetes and should be more than enough for this project. It can handle a large number of nodes, and you can choose which backend stores the cluster state rather than being tied to etcd [1]. That might become interesting later but, for now, the embedded etcd will be more than enough.
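For instance, switching the state store is just a flag on the server. A hypothetical example pointing K3S at an external Postgres instead of the embedded etcd (the connection string is made up):

# Assumed example: run the K3S server against an external datastore
k3s server --datastore-endpoint="postgres://user:pass@db-host:5432/k3s"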

Ansible

Obviously I'm not going to be manually setting up K3S in each VM each time. That would speak really poorly of the DevOps label. This is where Ansible comes in. Ansible is basically an SSH-based configuration and deployment tool. It allows running groups of actions, called playbooks, on groups of nodes.
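To give a feel for it, here's a sketch of both modes of use: an ad-hoc module run against every host in an inventory, and a playbook run limited to one group (the file names are the ones used later in this post):

# Ad-hoc: run the ping module against every host in the inventory
ansible all -i inventory.ini -m ping

# Structured: run a playbook, limited to the control group
ansible-playbook -i inventory.ini k3s-setup.yaml --limit control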

The plumbing

So far, I've automated the VM creation with Pulumi. Now, in order to be able to manipulate these machines with Ansible, we need to know how to reach and control them.

Fortunately, we always know the IPs of the nodes in advance since they are statically set, and thanks to cloud-init the SSH keys for accessing the nodes are prebaked. So all we have to do is build the Ansible inventory file inside Pulumi and make it available as an export.
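The consuming side is then trivial. Assuming the export is named ansibleInventory (the name is mine), pulling it down for Ansible is a one-liner:

# Dump the inventory exported by the Pulumi stack into the file Ansible reads
pulumi stack output ansibleInventory > inventory.ini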

As a deployment grows you will start having quite a few secrets. You obviously don't want your secrets exposed in plain text in the repo, but you also don't want to be maintaining a whole bunch of secrets elsewhere. This makes moving across secret managers a pain and makes the whole maintenance of secrets such a chore that you start to compromise in places.

Instead, we're going to use local encryption to store the values we need and decrypt them at runtime. Both Pulumi and Ansible support encrypting values with a passphrase. In Pulumi, variables are cleanly isolated in what they call stacks.
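As a sketch, creating a passphrase-encrypted stack and storing a secret in it looks roughly like this (the stack and key names are made up):

# Create a stack whose secrets are encrypted with a local passphrase
pulumi stack init dev --secrets-provider=passphrase

# Store an encrypted value in the stack config; Pulumi prompts for the
# passphrase or reads it from PULUMI_CONFIG_PASSPHRASE
pulumi config set --secret k3sToken "super-secret"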

In Ansible you have vaults. You can create as many vaults as you like and identify them using vault ids. This way you can keep granular access control inside the repo by creating multiple vault ids, each with its own password.
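A rough example of that workflow, with a made-up vault id called infra:

# Encrypt a single value under the "infra" vault id, prompting for its password
ansible-vault encrypt_string --vault-id infra@prompt --name k3s_token 'super-secret'

# At runtime, point the playbook run at the matching password source
ansible-playbook k3s-setup.yaml --vault-id infra@vaultpk.sh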

While Pulumi does support a bunch of secrets providers, I stick to passphrases. Why, you might ask? Well, for the same reason I'm spawning my own VMs: I don't like being tied to any one provider.

So once you've generated your passphrases with your pseudo-random generator of choice, you'll want to store them in whatever secret manager you're most comfortable with.

I will be using Infisical. I've been using this secret manager for a few years now and it's pretty great. It has all the fine-grained access control you might need (believe it or not, I am my own worst enemy), and the CLI experience for injecting secrets is just phenomenal.

In Pulumi you inject the passphrase via an environment variable, and in Ansible you use vault password files. Now, don't go writing your passphrase into these files in plain text. Ansible will actually execute a password file if it's marked as executable.

This is what one of my vaultpk.sh files looks like.

#!/usr/bin/env bash

set -euo pipefail

# Print the Ansible vault password fetched from Infisical to stdout
infisical secrets --path /ansible get VAULT_PASSWORD --plain
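The file has to be executable for Ansible to run it rather than read it, and the Pulumi side is just an environment variable. Roughly (the /pulumi path and the secret name are assumptions mirroring my Ansible setup):

chmod +x vaultpk.sh

# Inject the Pulumi passphrase the same way, via the environment
export PULUMI_CONFIG_PASSPHRASE="$(infisical secrets --path /pulumi get PULUMI_PASSPHRASE --plain)"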

Any other files I need encrypted that do not belong to one of these tools will be encrypted with age.
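With age that's a symmetric pair of commands. A sketch with made-up file names:

# Encrypt a file to every recipient key listed in recipients.txt
age -e -R recipients.txt -o kubeconfig.age kubeconfig

# Decrypt it back with the matching identity file
age -d -i key.txt -o kubeconfig kubeconfig.age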

But what exactly is the value of this? Well, imagine I wake up tomorrow and decide I no longer want to use Infisical. Had I put all my secrets in there, migrating to a new secrets platform would mean tediously porting potentially hundreds of variables from one provider to another. Instead, all I have to do is move a few passphrases and update the scripts that provide them. Portability and platform independence are an underrated superpower.

Installing K3S

Installing K3S is very straightforward.

Once we're running in production, we don't want any single control node to be a point of failure. This is exactly what etcd is designed to solve, using the Raft consensus algorithm. The minimum for a highly available cluster is 3 nodes, and you want an odd number: otherwise you risk a network partition where both halves end up with the same number of nodes, making it impossible for either to elect a leader. For a Raft cluster to keep accepting writes, a majority of nodes, i.e. at least ⌊N/2⌋+1, must remain reachable. Concretely: with 3 nodes quorum is 2 (tolerating 1 failure) and with 5 it's 3 (tolerating 2), while 4 nodes still only tolerate 1 failure, which is why even counts buy you nothing.

First we need to initialize the cluster on one of the control nodes. We cannot initialize multiple control nodes at the same time, because they would all try to initialize their own etcd databases and we'd run into a split brain from the get-go.

Taking advantage of Ansible groups we can run the cluster initialization only for one of the nodes in the control group and then join the rest of the control nodes to the cluster.

[control]
control-0 ansible_host=10.0.1.1 ansible_user=debian
control-1 ansible_host=10.0.1.2 ansible_user=debian
control-2 ansible_host=10.0.1.3 ansible_user=debian

[worker]
worker-0 ansible_host=10.0.2.1 ansible_user=debian
worker-1 ansible_host=10.0.2.2 ansible_user=debian

### k3s-setup.yaml
---
- name: Bootstrap cluster
  hosts: control[0]
  tasks:
    - name: Create the K3S config directory
      become: true
      ansible.builtin.command: mkdir -p /etc/rancher/k3s
    - name: Write configuration
      become: true
      ansible.builtin.template:
        dest: /etc/rancher/k3s/config.yaml
        src: ../tmpl/rancher/k3s/config.yaml
      notify:
        - Install k3s service

  handlers:
    - name: Install k3s service
      become: true
      environment:
        INSTALL_K3S_EXEC: server
        INSTALL_K3S_FORCE_RESTART: "true"
        INSTALL_K3S_VERSION: "" # This should be a fixed version for real production scenarios
      ansible.builtin.shell: curl -sfL https://get.k3s.io | sh -s - --cluster-init

- name: Join extra control nodes
  hosts: control[1:]
  tasks:
    - name: Create the K3S config directory
      become: true
      ansible.builtin.command: mkdir -p /etc/rancher/k3s
    - name: Write configuration
      become: true
      ansible.builtin.template:
        dest: /etc/rancher/k3s/config.yaml
        src: ../tmpl/rancher/k3s/config.yaml
      notify:
        - Install k3s service

  handlers:
    - name: Install k3s service
      become: true
      environment:
        INSTALL_K3S_EXEC: server
        INSTALL_K3S_FORCE_RESTART: "true"
        K3S_URL: "{{ k3s_url }}"
        K3S_TOKEN: "{{ k3s_token }}"
        INSTALL_K3S_VERSION: "" # This should be a fixed version for real production scenarios
      ansible.builtin.shell: curl -sfL https://get.k3s.io | sh -s -

### tmpl/rancher/k3s/config.yaml
token: "{{ k3s_token }}"
agent-token: "{{ k3s_agent_token }}"
write-kubeconfig-mode: "0644"
bind-address: "{{ ansible_host }}"

Joining the agent nodes is pretty much the same process, just changing a few parameters in the install script.

- name: Join worker nodes
  hosts: worker
  tasks:
    - name: Install k3s service
      become: true
      environment:
        INSTALL_K3S_EXEC: agent
        INSTALL_K3S_FORCE_RESTART: ""
        K3S_URL: "{{ k3s_url }}"
        K3S_TOKEN: "{{ k3s_agent_token }}"
        INSTALL_K3S_VERSION: "" # This should be a fixed version for real production scenarios
      ansible.builtin.shell: curl -sfL https://get.k3s.io | sh -s -

It's important when designing playbooks to make them idempotent. We want to be able to run this playbook any time the node count or the parameters of the cluster change.

During development I won't really need an HA cluster. After checking that Raft recovered correctly when I nuked control nodes, I scaled the count back to 1 to free up some of my juicy RAM.

The kubeconfig of the cluster can be extracted from any of the control nodes. You should now be able to check that your cluster is running and apply deployments to it.

ssh debian@10.0.1.1 cat /etc/rancher/k3s/k3s.yaml > ~/.kube/config
kubectl get nodes
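Depending on your configuration, the copied kubeconfig may still point at 127.0.0.1; if so, rewrite it to one of the control nodes:

# Point the kubeconfig at a reachable control node instead of localhost
sed -i 's/127.0.0.1/10.0.1.1/' ~/.kube/config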

Once you've gotten auth going you can try scheduling stuff on it. One thing that caught me off guard when running the pods was that, by default, K3S will happily schedule workloads on the control plane. The reasoning is solid: K3S is designed to live on edge and low-power platforms where you want to maximize the resource usage of every node. If you don't like this behaviour you can taint the control plane nodes with a NoSchedule [2].

# /etc/rancher/k3s/config.yaml on server
node-taint:
  - "node-role.kubernetes.io/control-plane:NoSchedule"

Conclusions

By the end of this chapter you should be able to reliably destroy all of your VMs and bootstrap the cluster back from zero in 4 steps.

# Destroy all the nodes
pulumi destroy -y

# Recreate the cluster
pulumi up -y

# Start all the nodes
virsh -c qemu:///system list --all | grep k3s- | awk '{print $2}' | xargs -n1 virsh -c qemu:///system start

# Wait a few seconds for the initial image setup
ansible-playbook k3s-setup.yaml

I apologize if my descriptions are not super helpful at getting you through that last mile. After all, this is a journal, not a tutorial, and I'm glossing over all the glue specific to my own setup, like secret injection and inventory generation.

References