
ClickHouse at Scale: Deploying OLAP on Kubernetes (Helm Chart + Best Practices)

2026-02-06
11 min read

Hands-on guide to deploying ClickHouse on Kubernetes—Helm charts, storageClass choices, scaling strategies, and backup/restore patterns for 2026.

Why ClickHouse on Kubernetes still hurts (and how to fix it)

You need sub-second OLAP queries, independent teams rolling analytics features, and a reliable pipeline that recovers fast when nodes or disks fail. But running ClickHouse on Kubernetes can be deceptively brittle: wrong storageClass choices, naive scaling, and poor backup patterns create long restore times, noisy neighbours, and unpredictable performance.

In 2026 the pressure is louder: ClickHouse adoption exploded after major funding and product momentum in 2025, and cloud providers and vendors shipped new integrations for object storage and container-native orchestration. If you're responsible for analytics at scale, this guide gives an opinionated, hands-on deployment path—Helm charts, storageClass recommendations, safe scaling strategies, and backup/restore patterns built for production.

Executive summary — What you'll get

  • Concrete Helm + Kubernetes patterns for ClickHouse clusters using operators and StatefulSets.
  • StorageClass guidance (local SSD, NVMe, networked SSD, object storage) and tradeoffs for MergeTree workloads.
  • Scaling strategies that respect ClickHouse statefulness: node autoscaling, replica management, and stateless query front ends.
  • Backup & restore patterns combining object storage, CSI snapshots, and point-in-time strategies.
  • Observability & operations checklists—Prometheus, SLOs, and common runtime tunables.

Context: What's changed by 2026?

Two important facts shape the recommended approach in 2026:

  1. ClickHouse continues rapid enterprise adoption (notably after late‑2025 funding and investment), and the project itself has improved cluster coordination—ClickHouse Keeper has matured and reduced dependency on external ZooKeeper clusters.
  2. Kubernetes ecosystem features matured: CSI snapshots and volume cloning are broadly available across cloud providers, and Kubernetes autoscaler integrations (node autoscaler, KEDA for stateless workloads) are battle tested. Object-storage backed MergeTree patterns are more common.

“In late 2025 ClickHouse saw significant investment and broader enterprise adoption—expect teams to run larger, multi-shard clusters on Kubernetes in 2026.”

Design principles (short)

  • Separate state from stateless: put the query/proxy layer in Deployments you can autoscale and keep storage-heavy shards in StatefulSets managed by an operator.
  • Use the operator for lifecycle: the ClickHouse operator handles DDL propagation, replica recovery, and rolling updates better than handwritten StatefulSets.
  • Plan for fast rebuilds: pick storage and network options that reduce replica rebuild time (fast IOPS, high bandwidth).
  • Automate backups to object storage: volume snapshots are useful but S3-compatible backups are essential for cross-cluster restore and long-term retention.

Which Helm chart / operator should you use?

Options in 2026:

  • Altinity / clickhouse-operator (Helm chart maintained by Altinity and community forks): production-focused operator with support for ClickHouseInstallation CRDs, sharding, replicas, and easy upgrades. See practical DevOps playbooks for running operators and related infra in production: Building and Hosting Micro‑Apps: A Pragmatic DevOps Playbook.
  • Percona / community charts: some teams use smaller charts for single-node clusters or dev environments.
  • Hand-rolled StatefulSets: only for tiny or experimental deployments—avoid in production.

Recommendation: for clusters beyond a handful of nodes, use the ClickHouse Operator via its Helm chart. It integrates with Kubernetes primitives (PVC templates, PodDisruptionBudget) and supports complex cluster topologies.

Quick Helm install (operator)

Example steps (replace chart repo with your provider's stable source):

helm repo add altinity https://altinity.github.io/ch-operator-helm-chart
helm repo update
helm install clickhouse-operator altinity/clickhouse-operator -n clickhouse --create-namespace \
  --set image.tag=latest   # pin a specific operator version for production

After installation, deploy a ClickHouseInstallation CR that declares shards and replicas. The operator transforms that into StatefulSets and Services.
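
A minimal sketch of such a CR is below, using the Altinity operator's CRD fields. The name, namespace, storage class, and sizing are illustrative, Keeper/ZooKeeper configuration is omitted, and field names can shift between operator versions, so verify against the CRD your chart installs.

apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: analytics
  namespace: clickhouse
spec:
  configuration:
    clusters:
      - name: main
        layout:
          shardsCount: 2      # data is split across 2 shards
          replicasCount: 3    # each shard keeps 3 replicas
  defaults:
    templates:
      dataVolumeClaimTemplate: data-volume
  templates:
    volumeClaimTemplates:
      - name: data-volume
        spec:
          storageClassName: nvme-local-sc
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 1Ti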

StorageClass choices — latency, rebuild time, and cost tradeoffs

ClickHouse MergeTree workloads are I/O bound: heavy writes during merges and reads during queries. Storage choice impacts query latency, replica rebuild time, and cost. Choose per workload.

1) Local NVMe / local-ssd (best performance)

  • Pros: excellent IOPS and latency, fastest replica rebuilds.
  • Cons: no live volume migration; pod scheduling must consider node affinity and anti-affinity. Node failure may force full replica rebuild elsewhere.
  • Use when: sub-10ms OLAP queries required, dataset fits across nodes, and you can tolerate node-level rebuilds.

2) Networked SSD (EBS gp3, GCP PD-SSD, Azure managed disks)

  • Pros: reliable, snapshot support, predictable performance.
  • Cons: network latency higher than local NVMe; rebuilds slightly slower.
  • Use when: you need snapshots and easier migration across nodes.

3) Distributed object storage for cold data (S3/MinIO)

  • Pros: cheap, durable, great for long-term retention and backups; ClickHouse supports S3 tables and external storage integration.
  • Cons: high latency, not suitable for hot MergeTree parts.
  • Use when: you tier cold partitions to object storage using TTL and partitioning — this ties into broader data fabric and tiering patterns.

4) Distributed block stores with CSI (Ceph/Rook, Longhorn)

  • Pros: flexible, supports snapshots and clones, node-failure resilience.
  • Cons: operational complexity; performance depends on cluster health.
  • Use when: you need Kubernetes-native storage with snapshotting and can operate the storage layer.

Storage strategy (recommended):

  1. Hot tier: local NVMe (or fast networked SSD) for active MergeTree parts.
  2. Warm tier: networked SSD or a distributed block store.
  3. Cold tier and backups: object storage (S3).

Values.yaml snippet — storageClass and PVC templates

# values.yaml (excerpt)
clickhouseInstallation:
  # data path uses a specific storageClass
  storage:
    data:
      volumeClaimTemplate:
        spec:
          storageClassName: nvme-local-sc
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 1Ti
  # separate metadata (keeper) uses durable but smaller disk
  keeper:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd-sc
        resources:
          requests:
            storage: 100Gi
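
The nvme-local-sc class referenced above has to exist before any PVC can bind. For local NVMe a common pattern is static local PersistentVolumes with no dynamic provisioner and WaitForFirstConsumer binding, so the scheduler picks a node before the volume binds. A minimal sketch, assuming you provision the local PVs yourself (for example via the local volume static provisioner):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-local-sc
provisioner: kubernetes.io/no-provisioner   # static local PVs; swap in your CSI driver if you use one
volumeBindingMode: WaitForFirstConsumer     # bind only once the pod is scheduled to a node
reclaimPolicy: Retain                       # keep the volume (and data) if the PVC is deleted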

Scaling ClickHouse safely

ClickHouse is stateful. Scaling strategies fall into two categories:

  • Scale storage nodes (shards/replicas) — changes to shards/replicas are heavy operations: you must rebalance or allow replicas to rebuild parts. Use the operator's declarative CR to add replicas and monitor rebuilding.
  • Scale query front ends — add stateless query proxy or shard-router pods (Deployments) behind a Service and autoscale them with HPA or KEDA. This absorbs query spikes without touching data nodes.

Practical guidance:

  1. Keep data nodes in a fixed-size StatefulSet per shard and manage replica count changes during maintenance windows.
  2. Run stateless query front ends (e.g. clickhouse-server configured as a query router, or a lightweight proxy) in a Deployment. Autoscale these with HPA based on CPU, memory, or custom Prometheus metrics such as queries per second (a minimal HPA sketch follows this list).
  3. Use Kubernetes Cluster Autoscaler to add nodes when PVC/Pod scheduling fails due to resource pressure.
  4. Use PodDisruptionBudgets and strict anti-affinity to avoid correlated failures and avoid autoscaler evictions causing rebuild storms.
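
For the query front ends in step 2, a standard HorizontalPodAutoscaler is enough to absorb query spikes. A minimal sketch, assuming a stateless Deployment named clickhouse-query-router (hypothetical) and plain CPU-based scaling; to scale on queries per second you would expose a custom metric through the Prometheus adapter, or use KEDA (a ScaledObject sketch appears later in the checklist section).

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clickhouse-query-router
  namespace: clickhouse
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clickhouse-query-router   # hypothetical stateless query-router Deployment
  minReplicas: 2
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # scale out when average CPU passes 70%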

When to add a shard vs replica

  • If read latency is high and nodes are overloaded: add replicas (improves read throughput).
  • If write throughput or storage capacity is exhausted: add shards (requires resharding/partitioning of data and careful migration).

Backup & restore patterns (practical)

Don't treat backups as an afterthought. Use a two-pronged approach:

  1. Regular object-storage backups using clickhouse-backup or built-in backup tools to S3 (or S3-compatible like MinIO). Keep daily fulls with incremental diffs and lifecycle rules.
  2. Fast recoverability via CSI snapshots for short RPOs: schedule VolumeSnapshot objects for critical PVCs to speed up restores of entire pods (a minimal sketch follows this list).
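
For the snapshot half, one VolumeSnapshot per critical PVC is enough. This requires a CSI driver with snapshot support and a VolumeSnapshotClass, and note that local NVMe volumes generally cannot be snapshotted, so it applies to networked or CSI-backed tiers. A minimal sketch with illustrative names:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ch-data-snap-20260206
  namespace: clickhouse
spec:
  volumeSnapshotClassName: csi-snapclass   # must match a class offered by your CSI driver
  source:
    # illustrative PVC name -- operator-managed PVCs follow the operator's own naming convention
    persistentVolumeClaimName: data-volume-chi-analytics-main-0-0-0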

Using clickhouse-backup (practical steps)

clickhouse-backup (community tool) remains the de facto utility for exporting table parts to S3. Run it as a CronJob for daily and hourly schedules:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: ch-backup-daily
  namespace: clickhouse
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: ghcr.io/altinity/clickhouse-backup:latest
            # clickhouse-backup copies parts from the ClickHouse data directory, so this
            # container needs the data volume mounted (or trigger a clickhouse-backup API
            # sidecar running inside the server pods instead)
            env:
            - name: S3_ENDPOINT
              valueFrom:
                secretKeyRef:
                  name: s3-creds
                  key: endpoint
            # mount the clickhouse config, data volume & certs as needed
          restartPolicy: OnFailure

Key checkpoints:

  • Use bucket lifecycle rules to expire old backups, and cross-region replication for DR.
  • Keep both object backups and a catalog of parts so you can restore consistent sets of tables.
  • Encrypt credentials with Kubernetes Secrets and limit access via RBAC.

Restores and cluster bootstrap

  1. Restore into a new operator-managed cluster by provisioning an empty ClickHouseInstallation and then restoring the S3 backup with clickhouse-backup restore commands (a minimal Job sketch follows this list).
  2. For partial restores, use CREATE TABLE ... AS SELECT from backed-up S3 files or external table engines.
  3. Prefer letting replicas re-replicate parts over full restores where possible; the operator can coordinate this.
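
For step 1, a one-off Job can drive the restore with the clickhouse-backup CLI. A minimal sketch under stated assumptions: the backup name, Secret, and image tag are illustrative, and the container must be able to reach the target ClickHouse data volume (or a clickhouse-backup API sidecar), just like the backup CronJob.

apiVersion: batch/v1
kind: Job
metadata:
  name: ch-restore
  namespace: clickhouse
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: restore
        image: ghcr.io/altinity/clickhouse-backup:latest
        # restore_remote downloads the named backup from object storage and restores it;
        # the backup name is illustrative -- use `clickhouse-backup list remote` to pick one
        command: ["clickhouse-backup", "restore_remote", "daily-2026-02-06"]
        env:
        - name: S3_ENDPOINT
          valueFrom:
            secretKeyRef:
              name: s3-creds
              key: endpoint
        # mount the target ClickHouse data volume, or call a clickhouse-backup API sidecar instead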

Observability & SLOs

A production ClickHouse cluster must be observable at multiple levels:

  • Metrics: deploy clickhouse-exporter / built-in metrics and scrape with Prometheus. Key metrics: query_duration_ms, merges_in_progress, parts_count, insert_rows, select_rows, memory_usage.
  • Logs: centralize clickhouse-server logs to a log store (Loki, ELK) for query failures and merge errors.
  • Tracing & APM: instrument frontend services and proxies. OpenTelemetry adoption has increased in 2025–26; propagate spans from client apps through query proxies. Observability-as-code and lightweight, cache-first dashboards are becoming common — see approaches for front-end tooling and observability automation (Edge-Powered, Cache-First PWAs).

Example Prometheus rules & alerts

  • Alert on sustained high merges_in_progress or long average query duration.
  • Alert on replica rebuilds longer than baseline (indicates disk/network pressure).
  • Alert on low free disk space (use a percentage threshold so alerts fire well before kubelet eviction thresholds). A rule sketch follows this list.
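
The sketch below expresses the first and third alerts as a PrometheusRule (this assumes the Prometheus Operator; the ClickHouse metric name is a placeholder to adapt to whatever your exporter exposes).

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: clickhouse-alerts
  namespace: clickhouse
spec:
  groups:
  - name: clickhouse
    rules:
    - alert: ClickHouseMergeBacklog
      # placeholder metric name -- adapt to your exporter's naming
      expr: avg_over_time(clickhouse_merges_in_progress[15m]) > 20
      for: 15m
      labels:
        severity: warning
    - alert: ClickHouseLowDisk
      # assumes the data volume is a dedicated mount; adjust the mountpoint selector
      expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/clickhouse"} / node_filesystem_size_bytes{mountpoint="/var/lib/clickhouse"}) < 0.15
      for: 10m
      labels:
        severity: critical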

Operational checklist before go-live

  • Enable PodDisruptionBudgets for each shard and replica set (a minimal sketch follows this checklist).
  • Set anti-affinity rules to distribute replicas across AZs/nodes.
  • Provision node pools with matching taints/labels for storage types.
  • Run a capacity test that simulates merges and rebuilds. Measure rebuild time on disk failure.
  • Test full restore from object storage into a new cluster at least quarterly.
  • Limit per-query resources (max_memory_usage, max_threads) to avoid noisy neighbour effects.
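
For the PodDisruptionBudget item above, one PDB per shard keeps voluntary disruptions (node drains, autoscaler scale-downs) from taking out more than one replica of a shard at a time. A minimal sketch; the label keys mirror those the Altinity operator applies to pods, but verify them against your operator version.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ch-shard-0-pdb
  namespace: clickhouse
spec:
  maxUnavailable: 1                      # at most one replica of the shard down at once
  selector:
    matchLabels:
      # illustrative labels -- match whatever your operator sets on shard-0 pods
      clickhouse.altinity.com/chi: analytics
      clickhouse.altinity.com/shard: "0"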

Common pitfalls and how to avoid them

  • Using slow storage for active MergeTree parts: causes high query latency. Move active parts to faster storage or partition/tier data aggressively.
  • Scaling replicas without monitoring rebuild rate: you can saturate the cluster with rebuild traffic. Stagger replica additions and monitor network IO.
  • No backup verification: backups that never get restored are useless. Automate restore tests in a sandbox regularly — tie this into broader pipeline automation and composable capture/restore flows (Composable Capture Pipelines for Micro‑Events).
  • Mixing stateful and stateless roles in the same pods: keep query frontends separate so you can autoscale them freely.

Emerging best practices in 2026:

  • Hybrid object/buffered MergeTree: use tiered storage with local NVMe for active parts and offload older partitions to S3-like stores—this reduces cost while maintaining fast hot queries. These patterns are part of the broader discussion around data fabric and tiered storage.
  • Operator-driven resharding: tools and patterns for online resharding improved in 2025. Expect operator-managed resharding and rebalancing flows to be standard in 2026.
  • Cloud-managed ClickHouse options: many customers use cloud providers' ClickHouse managed services for simplified ops; however, Kubernetes gives full control over storage and cost optimization.
  • Observability as code: pre-built Grafana dashboards and Prometheus recording rules are now common in Helm charts—use them as a baseline and extend with your SLOs. For front-end and visualization patterns consider on-device and edge visualization approaches (On‑Device Data Visualization).

Example: Minimal production checklist and commands

  1. Install operator via Helm (see earlier Helm snippet).
  2. Define a ClickHouseInstallation with 2 shards x 3 replicas, local NVMe storage for data, and a 3-node Keeper ensemble.
  3. Deploy stateless query routers as a Deployment and set an HPA target via the Prometheus adapter, scaling on a queries/sec metric (a KEDA-based alternative is sketched after this list).
  4. Install clickhouse-backup and schedule daily/hourly jobs to S3. Enable lifecycle rules and cross-region replication for DR.
  5. Configure Prometheus + Grafana dashboards, and add alerts for disk usage, merges_in_progress, and replica rebuilds.
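
If you prefer KEDA over the Prometheus adapter for step 3, a ScaledObject can scale the router Deployment on a queries-per-second query. A hedged sketch; the Deployment name, Prometheus address, metric name, and threshold are all illustrative.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: clickhouse-query-router
  namespace: clickhouse
spec:
  scaleTargetRef:
    name: clickhouse-query-router        # the stateless query-router Deployment
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(rate(clickhouse_query_total[2m]))   # placeholder metric -- adapt to your exporter
      threshold: "200"                               # target queries/sec per replica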

Case study (short)

A fintech analytics team migrated a 20TB ClickHouse cluster to Kubernetes in late 2025. Their pattern matched the guidance above: hot partitions on local NVMe, warm tiers on network SSD, daily S3 backups with lifecycle rules, and stateless query proxies autoscaled via KEDA. Result: query tail latency improved roughly 3x, and recovery from node failures dropped from 6 hours to 45 minutes thanks to faster rebuilds and automated restores. For adjacent use cases like storing experimental research and large scientific datasets, see work on when to use ClickHouse-like OLAP (Storing Quantum Experiment Data).

Actionable takeaways

  • Use the ClickHouse operator via Helm for lifecycle, sharding, and replica management—avoid hand-rolling StatefulSets for production. See a pragmatic DevOps playbook: Building and Hosting Micro‑Apps.
  • Choose storage tiers: local NVMe for hot parts, network SSD for general purpose, object storage for cold/backups.
  • Separate stateless query frontends so you can autoscale queries without touching data nodes.
  • Backup to S3 + snapshot — combine clickhouse-backup with CSI snapshots for fast, reliable restores. Consider how this fits into larger pipeline automation and composable backup/restore patterns (composable capture pipelines).
  • Instrument everything: Prometheus metrics, Grafana dashboards, and restore drills are non-negotiable.

Running ClickHouse at scale on Kubernetes in 2026 is a mature, well-understood pattern—when you respect the stateful nature of MergeTree, pick the right storage for each tier, and separate concerns between storage and query layers. The operator + Helm pattern removes much of the manual orchestration pain, and modern CSI/snapshot tooling closes the gap for recoverability and migrations. Be mindful of tool sprawl when you adopt multiple backup, snapshot, and observability tools — rationalize early.

Call to action

If you’re evaluating ClickHouse on Kubernetes for production, start with a small pilot following this guide: install the operator via Helm, deploy a 3-replica test cluster with local NVMe or fast-SSD storageClass, and run a restore drill from S3. Want a ready-to-run values.yaml and observability dashboards tuned for analytics workloads? Get our downloadable Helm values templates, dashboard pack, and a checklist for a 2-hour pilot—request it now and we’ll walk you through a live runbook.
