Have you ever thought about a fast, reliable and scalable database for your development and staging environment that is easy to deploy and manage? We at convit did. As Cassandra as our primary database and our shift to Kubernetes we wanted to explore how to combine both of these powerful assets. In this article we are going to take a look at how we have realised this for our use case. We hope to give you a concise overview of the caveats and pitfalls we experienced on our way to accomplish combining both technologies.
What is Kubernetes?
Kubernetes or k8s (kubernetes) is an open-source container orchestration system for automating deployment, scaling, and managing of applications, formerly developed by Google and now maintained by the Cloud Native Computing Foundation. It is widely adopted by many large tech companies to ease their deployment and reduce failures in productive operation. It also allows decoupling of hardware and infrastructure specific tasks like server maintenance and provisioning of dependencies (storage, network and computing resources) from other tasks like development of applications and their deployment. It plays very well with Docker and other containerisation technologies and therefore is a valuable addition to all DevOps related tasks.
What is Cassandra?
Cassandra is a distributed NoSql database application which allows a scalable, fault-tolerant and perfomant storage of large amounts of data. We are using its capabilities in lots of projects for fast data retrieval, storage and mining tasks.
In the past the introduction of a Cassandra cluster has been kind of disruptive in common operation scenarios due to its distributed nature and its way of maintaining healthiness and durability of a cluster. A common pattern for recovering a node failure is to kill the node and bootstrap a new one somewhere else. Performing a backup is not done by dumping all data to a huge disk, but more like an upscaling of available nodes or setting up a new logical cluster replicating the existing data.
Apparently setting up a manageable Cassandra cluster sounds like a lot of work. Possibly a task for some orchestration system?
Bear in mind that all upcoming statements, configurations and examples are not directly recommended for production use! This is a case study to show how it can be done. But we are omitting the more complex topics like recovery, failover or scaling for conciseness. So be careful when using this directly in production systems. We recommend this setup only for staging deployments for internal development and testing. To take this into a production ready state more steps have to be taken!
Although we explained k8s and Cassandra above, this case study needs some basic understanding of k8s, Docker and Cassandra. You should be familiar with k8s API objects like pods, deployments and such. You should have built and set up some docker containers on your own and know how to install a Cassandra and check its state to fully understand the next parts of this article.
Cassandra and Docker on k8s – A match „made in heaven“…
Cassandra and Docker on k8s are a good match at first sight. A Cassandra installation is made for redundancy and scalability. Therefore it is splitted into smaller instances – the more the merrier. Each instance is given the same startup parameters, connects to its cluster on its own and handles all election, loadbalancing and partitioning tasks of its data independently. So basically each instance should be started as a container and connects to all other cassandra containers via an overlay network. Each container is therefore eligible to handle incoming connections and serve data from the containerised cassandra cluster to clients. If a container dies, it will be replaced within seconds by a new one due to availability constraints enforced by k8s – with all data for this node being replicated in a bootstrapping process. So why even bother, if it is that easy?
… or better „from hell“?
Because apparently it is not at second glance. Due to its high I/O throughput Cassandra is utilizing the underlying disk to its maximum extent regardless of any other application trying to perform their own I/O. The performance of Cassandra is tightly bound to the disk speed and the amount of time it can use it without interruption.
This can lead to serious problems in a containerised environment if there are more than one Cassandra instances deployed on the same host sharing the same disk underneath. This can even lead to a complete server freeze which you can possibly only resolve by a restart of the machine. Even worse – an orchestration system might be set up to automatically move failing containers to a different node in the network and therefore try to deploy the now missing cassandra instances on the next machine – which will ultimately end in freezing more and more servers. After all there will be many crying customers and developers calling you because everything is in a broken state. Tragically it will take your cluster administrator hours to repair all the mess.
It doesn’t even stop there, there are more pitfalls: To cope with failing instances (or even failing clusters), long bootstrap times and the problem mentioned above, it sometimes makes sense to setup Cassandra to store the database files in a persistent manner on some specific, isolated and non-volatile storage. But unfortunately Cassandra is kind of „picky“ regarding its filesystem and its stored database files. It also stores its network topology inside of its data to handle replicated and original data sets („rows and columns“). If a container fails and your server orchestration system begins to handle this by spawning a new instance with the same filesystem and existing data, it will just crash again with errors because of the old inconsistent data. But why?
Each container has its own network identity in a cluster. This identity is (most of the time) unique for a container and will be used by Cassandra to spin up its node and to store its topology inside the data files. To make things even worse, Cassandra is not using hostnames but ip addresses as its identity. If you recreate the container, its ip address will change and no longer match the topology of its data it has already stored. This will ultimately lead to a corruption of the node on each startup rendering the database in the filesystem as useless.
And where do we go from here?
We just have to respect the above constraints and everything is fine.
- Each Cassandra instance should have its own disk
- Each Cassandra instance should have a fixed ip
What might sound easy is not an easy task using containers. Anyone who works regularly with (Docker) containers knows that the above constraints are the hardest to accomplish and are working against the flexibility we gain from using an orchestration system. Although this is kind of an anti-pattern, it is worth to take a look at this approach using k8s especially if you do not have a full-fletched 1.000-nodes cluster. In such a cluster you are only killing and spawning volatile Cassandra containers due to the small sizes of their data partitions and fast bootstraping times. Our small-scale approach allows easy upgrading of Cassandra versions and maintaining a minimal Cassandra cluster with a non-volatile storage for testing and staging purposes.
Preparing the servers
At convit we have a small development cluster of three servers (server-2, server-7 and server-8). Those three servers are provisioned using a Rancher 2 k8s Cluster setup. All servers have several HDDs, some CPUs (16 – 32) and some RAM (64 – 128 GB). For the setup we assign a separate HDD for each server as our new Cassandra „local storage“. We create a folder on each server (/var/lib/cassandra-dev) and symlink them onto their respecting HDDs.
1. „Each Cassandra instance should have its own disk“
kind: StorageClass apiVersion: storage.k8s.io/v1 metadata: name: cassandra-local-storage provisioner: kubernetes.io/no-provisioner volumeBindingMode: WaitForFirstConsumer
This storage class will be our link between our upcoming volume and container. Be aware, that local binding is a quite new feature in k8s (v1.14). The volumeBindingMode: WaitForFirstConsumer helps us to delay binding of volume claims so all constraints like pod affinities are evaluated beforehand. This way a pod is created before a volume is assigned to its volume claim. Due to this delay the volume claims only bind to volumes living on the corresponding k8s cluster node.
Now we can create the volume pointing to our local host paths. We have to be very specific here and hard-wire our volumes to all servers we want to use. This is the part where we lose some of our flexibility we once gained by using k8s.
Server-2/7/8 local host path volumes:
apiVersion: v1 kind: PersistentVolume metadata: name: cassandra-local-storage-server-2 # (1) spec: capacity: storage: 100Gi # (2) accessModes: - ReadWriteOnce persistentVolumeReclaimPolicy: Retain # (3) storageClassName: cassandra-local-storage # (4) hostPath: path: /var/lib/cassandra-dev # (5) type: DirectoryOrCreate nodeAffinity: required: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server-2 # (6) --- apiVersion: v1 kind: PersistentVolume metadata: name: cassandra-local-storage-server-7 spec: capacity: storage: 100Gi accessModes: - ReadWriteOnce persistentVolumeReclaimPolicy: Retain storageClassName: cassandra-local-storage hostPath: path: /var/lib/cassandra-dev type: DirectoryOrCreate nodeAffinity: required: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server-7 --- apiVersion: v1 kind: PersistentVolume metadata: name: cassandra-local-storage-server-8 spec: capacity: storage: 100Gi accessModes: - ReadWriteOnce persistentVolumeReclaimPolicy: Retain storageClassName: cassandra-local-storage hostPath: path: /var/lib/cassandra-dev type: DirectoryOrCreate nodeAffinity: required: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - server-8
We are creating three separate volumes which are all quite similar. Some remarks:
- Name of our volume, will be used later
- Storage capacity, here 200 gigabytes
- Reclaim policy „retain“ means, we want to keep the data when its unbound.
- Our StorageClass we created earlier
- Our symlinked path on the host
- The k8s cluster node name. This is the most important part! We bind each volume to a fixed server to control where the data is stored. This is part of our „disk <-> network identity“ binding.
2. „Each Cassandra instance should have a fixed (network) identity“
Whenever a container is created, the container engine assigns a container identity, which serves as the hostname of the container and as a general identity within the cluster. This name is volatile and will change each time the container is recreated. To accomplish a fixed network identity we have to tell k8s to use a fixed ip for our containers. The easiest way to achieve a fixed ip is to use the host network of the cluster node. This comes with some caveats:
- The host network is not isolated and not controlled by the cluster but by the underlying server.
- Ports on this network are not controlled by the cluster either and may be occupied by other cluster external resources.
- DNS resolution on the host level is kind of a blackbox for the cluster and you may need to take actions to cope with it.
So you should keep that in mind when deciding which ports you are going to expose and how to set firewall rules on the server.
For the deployment we use a StatefulSet. A StatefulSet, as the name implies, enables us to maintain a deployment state in case of changes. Additionally it allows for an ordered startup and shutdown of containers plus rolling updates of pods. This is especially important for a Cassandra cluster to prevent data loss or downtimes.
To use a StatefulSet we define a headless service to assign fixed identities inside of the cluster to each instance of the set:
apiVersion: v1 kind: Service metadata: annotations: service.alpha.kubernetes.io/tolerate-unready-endpoints: "true" labels: app: cassandra-dev pod: cassandra-svc name: cassandra-dev spec: clusterIP: None ports: - port: 9042 selector: app: cassandra-dev pod: cassandra
We add an annotation to tolerate unready endpoints. Basically this means that all containers are available via the given network name even in a Notready state. It is necessary to enable discovery of cluster members for Cassandra. But now the most interesting part, the StatefulSet.
apiVersion: apps/v1 kind: StatefulSet metadata: name: cassandra-dev labels: app: cassandra-dev pod: cassandra spec: serviceName: cassandra-dev replicas: 3 (1) selector: matchLabels: app: cassandra-dev pod: cassandra template: metadata: labels: app: cassandra-dev pod: cassandra spec: hostNetwork: true (2) hostAliases: - ip: "192.168.0.2" hostnames: - "server-2" - ip: "192.168.0.7" hostnames: - "server-7" - ip: "192.168.0.8" hostnames: - "server-8" dnsPolicy: ClusterFirstWithHostNet affinity: podAntiAffinity: (3) requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: app operator: In values: - cassandra-dev - key: pod operator: In values: - cassandra topologyKey: kubernetes.io/hostname terminationGracePeriodSeconds: 1800 containers: - name: cassandra image: cassandra:3.11 imagePullPolicy: Always ports: - containerPort: 7000 hostPort: 7000 name: intra-node - containerPort: 7001 hostPort: 7001 name: tls-intra-node - containerPort: 7199 hostPort: 7199 name: jmx - containerPort: 9042 hostPort: 9042 name: cql resources: limits: cpu: 2.0 memory: 10Gi securityContext: capabilities: add: - IPC_LOCK lifecycle: (4) preStop: exec: command: - /bin/sh - -c - nodetool drain env: (5) - name: CASSANDRA_SEEDS value: "cassandra-dev-0.cassandra-dev.infra.svc.cluster.local" - name: CASSANDRA_CLUSTER_NAME value: "Test Cluster" - name: CASSANDRA_DC value: "Test DC" - name: CASSANDRA_RACK value: "Test Rack" - name: POD_IP valueFrom: fieldRef: fieldPath: status.podIP readinessProbe: exec: command: - /bin/bash - -c - 'if [[ $(nodetool status | grep $POD_IP) == *"UN"* ]]; then exit 0; else exit 1; fi' initialDelaySeconds: 15 timeoutSeconds: 5 volumeMounts: (6) - mountPath: /var/lib/cassandra name: cassandra-dev-local-volume volumeClaimTemplates: (7) - metadata: name: cassandra-dev-local-volume spec: accessModes: - ReadWriteOnce resources: requests: storage: 100Gi storageClassName: cassandra-local-storage
Again some remarks:
- We have three servers, so we set the replicas to 3. Adding more replicas will actually not result in more instances because of the anti-affinity rules (see 3.).
- Containers should use the host network to get a fixed identity and additional we have added some hostname aliases to help Cassandra finding other nodes.
- We apply anti-affinity to prevent multiple instances on one cluster node.
- An important part in the lifecycle of a Cassandra node is a graceful shutdown, which is controlled by defining a drain-command before the container stops.
- This is a basic configuration of environment variables. Please note, that we use our cluster-wide network identity for the seed node of Cassandra, which has been created with the headless service above. The StatefulSet forces pods to have fixed names with incrementing counters in their names like „cassandra-dev-0.cassandra-dev.infra.svc.cluster.local“. „infra“ is the name of the namespace where our StatefulSet is deployed. „svc.cluster.local“ is a constant name in each k8s cluster.
- We want to mount the Cassandra data directory to a volume which is referenced in (7).
- The PersistentVolumeClaim template is the final part where we tie the volume of the data directory to a PersistentVolume – that we created in advance. The match will be made via our custom StorageClass and the affinities and anti-affinities of the volume and the pods.
We have successfully set up a Cassandra cluster in k8s using StorageClasses, local PersistentVolumes and a StatefulSet using host network which enables us to fixate network identities and storage locations to server identities. Cassandra can be upgraded and restarted in a rolling manner preserving all stored data and its integrity. This is a valid approach for staging and development environment but lacks flexibility for production use and is therefore discouraged for productive systems and environment.
This article is greatly inspired by the original Google Guide for deploying Cassandra as StatefulSet and has been taken further to adapt it to our needs and server topology. We shed some light on the caveats of using Cassandra in k8s and dockerized environments in general and we hope you can take some inspiration for yourself on future projects using Cassandra on k8s.
content by Timm Kißels