Storage Choices for Analytics: When to Use Local NVMe vs Cloud Block Storage for ClickHouse
2026-02-05

Decide NVMe, cloud block, or PLC SSD for ClickHouse in 2026—balance latency, throughput, endurance, and cost with practical, actionable guidance.

Storage Choices for Analytics in 2026: stop guessing and optimize for ClickHouse

If your ClickHouse cluster stalls during background merges, your p95 query latency balloons, or your monthly storage bill keeps surprising finance, the storage layer is likely the culprit. Choosing between local NVMe, cloud block storage, and cheaper PLC SSDs isn't just a cost decision: it directly impacts latency, throughput, endurance, and your ability to scale.

Quick answer (read this first)

For ClickHouse-style OLAP in 2026: use local NVMe for hot, merge-heavy workloads and latency-sensitive reads; use cloud block storage when durability, multi-zone availability and flexible sizing matter; consider PLC SSDs for capacity-dominated, read-mostly cold tiers. Mix them with object storage (S3-compatible) for cold data and backups, and use ClickHouse storage policies to tier automatically.

Why this matters in 2026

Several trends changed the storage decision calculus in late 2024–2026:

  • ClickHouse adoption exploded as enterprises chase analytics performance; the company’s rapid funding and product growth (2024–2026) pushed many teams to operate increasingly large, multi-PB clusters.
  • Flash supply and pricing started shifting: SK Hynix and others advanced PLC (penta-level cell, 5 bits per cell) NAND and improved cell designs that promise much lower $/GB. PLC SSDs are becoming viable for cold, capacity-focused tiers by 2026.
  • Cloud providers now offer NVMe-oF and higher-performance block services; local NVMe instance storage remains the lowest latency option but is ephemeral on many clouds.
  • Managed ClickHouse offerings abstract parts of storage decisions — but for self-managed clusters, storage remains a dominant operational cost and risk.

Fundamentals: what ClickHouse workloads demand from storage

Before vendor RFPs, profile your workload. ClickHouse's I/O behavior for MergeTree families is distinctive:

  • Background merges are long-running sequential reads and writes; they benefit from high sustained throughput and consistent write performance (a query for watching merges live follows this list).
  • Ad-hoc and analytical queries are often wide but read-heavy; they rely on throughput for full-scan performance and low tail latencies for interactive dashboards.
  • Small writes and inserts can be bursty (especially with micro-batches) and affect latency when competing with merges.
  • Compactions create large temporary IO spikes, so peak throughput matters almost as much as average.
  • Replication in ReplicatedMergeTree provides durability and availability but means extra network traffic during inserts and recovery.
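
A quick way to see that merge pressure directly is ClickHouse's system.merges table. This is a minimal sketch you would run on any node with clickhouse-client; the columns queried are standard system-table fields:

clickhouse-client --query "
    SELECT
        table,
        elapsed,
        num_parts,
        formatReadableSize(bytes_read_uncompressed)    AS read,
        formatReadableSize(bytes_written_uncompressed) AS written
    FROM system.merges
    ORDER BY elapsed DESC"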

Storage options and their tradeoffs

1) Local NVMe (ephemeral or attached)

What it is: NVMe SSDs physically attached to the host (instance local storage) or locally attached NVMe drives on bare metal.

Pros:

  • Lowest latency and highest IOPS per dollar for hot data.
  • Excellent sustained throughput for merges and full scans.
  • Better tail latency consistency compared to networked block.
  • Lower CPU overhead for IO (NVMe driver efficiencies).

Cons:

  • Often ephemeral: VM host failure = data loss unless you replicate (ClickHouse ReplicatedMergeTree).
  • Scaling requires instance resizing or rebalancing; snapshotting and backups can be harder.
  • Higher effective cost per durable GB than some cloud block tiers, because durability requires keeping replicated copies.

2) Cloud block storage (EBS, GCP PD, Azure Managed Disks)

What it is: Network-attached virtual block devices provided by cloud vendors with durability and snapshot features.

Pros:

  • Durable and multi-zone options; generally survive instance failures.
  • Snapshotting and volume resizing without instance reprovisioning.
  • Predictable billing and delegated operational complexity.

Cons:

  • Higher latency vs local NVMe; tail latency variability from network layers.
  • Per-volume throughput caps and per-instance limits (you must size both).
  • IOPS or throughput often billed separately (cost tradeoffs).

3) PLC SSDs (penta-level cell, high-density NAND)

What it is: SSDs that store five bits per memory cell to increase density and lower $/GB, at the cost of endurance and, often, write performance.

Pros:

  • Substantially lower storage $/GB for capacity tiers — attractive for archival and warm tables.
  • When used behind good controllers and firmware, they can be effective for read-mostly analytics tiers.

Cons:

  • Lower program/erase (P/E) cycles → lower endurance for write-heavy workloads.
  • Write performance and latency under sustained heavy writes can degrade.
  • Require firmware-aware wear-leveling and careful overprovisioning.

4) Networked NVMe (NVMe over Fabrics, NVMe-oF)

What it is: Remote NVMe that exposes NVMe semantics over RDMA or TCP; promises near-local performance over the network.

Pros:

  • Better performance than generic block over TCP; lower latency and higher efficiency at deep queue depths.
  • Enables shared NVMe pools without the durability/ephemeral tradeoffs of local NVMe.

Cons:

  • Requires network architecture investment and provider support.
  • Performance is sensitive to network tuning and can still be worse than local NVMe for tail latency.

Typical latency & throughput expectations (order-of-magnitude guidance)

These are typical ranges in production (2026). Measure in your environment — provider, instance type, and firmware matter.

  • Local NVMe: tens to low hundreds of microseconds for simple IO; excellent sustained throughput (several GB/s per drive sequential).
  • Cloud block: roughly 0.5–5 ms for many managed block services; throughput limited by per-volume and per-instance caps.
  • NVMe-oF: ~200–800 microseconds depending on RDMA/TCP and network — can approach local NVMe in tuned setups. (See also edge NVMe patterns.)
  • PLC SSDs: higher tail latency on sustained writes; read latencies similar to TLC/QLC when controller buffers are healthy.

How to choose: a practical decision matrix

Use this matrix as a baseline. Your SLA, data size, access patterns, and budget will push you towards one mix or another.

  • Hot (real-time dashboards, sub-100ms SLO): Local NVMe; use replication to tolerate host failures. Prefer fast instances with NVMe and large DRAM for caches.
  • Warm (hourly queries, tolerant of ms-level latency): high-performance cloud block volumes (e.g., io2 Block Express or equivalent) or NVMe-oF if available. Snapshots and durability are valuable.
  • Cold (monthly/quarterly analytics, low concurrency): PLC SSD-backed volumes or dense cloud HDD/cold options; prefer object storage tiering for lowest $/GB.
  • Large slices of archival data: Move to S3-compatible object storage via ClickHouse storage policies and use cloud block or NVMe for index/active partitions.
  • Small clusters and dev environments: Block storage is easiest operationally; local NVMe is overkill unless testing production-like loads.

Operational patterns and best practices

Design storage tiers using ClickHouse storage policies

ClickHouse supports multiple disks and storage policies. Typical setup:

  • Disk A (local NVMe): fast, for recent partitions and indices.
  • Disk B (cloud block): for stable medium-age partitions and snapshots.
  • Disk C (S3): cold data, moved via move_partition or automated policies.

Example (simplified) ClickHouse disk/policy snippet:

<storage_configuration>
  <disks>
    <fast>
      <path>/var/lib/clickhouse/fast/</path>
    </fast>
    <bulk>
      <path>/mnt/block/bulk/</path>
    </bulk>
    <s3cold>
      <type>s3</type>
      <endpoint>https://s3.example/clickhouse-cold/</endpoint>
      <use_environment_credentials>true</use_environment_credentials>
    </s3cold>
  </disks>
  <policies>
    <tiered>
      <volumes>
        <hot>
          <disk>fast</disk>
        </hot>
        <warm>
          <disk>bulk</disk>
        </warm>
        <cold>
          <disk>s3cold</disk>
        </cold>
      </volumes>
      <move_factor>0.1</move_factor>
    </tiered>
  </policies>
</storage_configuration>
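
Once the policy exists, attach it to a table and let TTL moves do the tiering. The schema below is illustrative only; the volume names must match the policy above (a sketch, assuming the 'tiered' policy with hot/warm/cold volumes):

clickhouse-client --multiquery <<'SQL'
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time)
TTL event_time + INTERVAL 30 DAY TO VOLUME 'warm',
    event_time + INTERVAL 180 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered';
SQL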

Separate concerns: data, logs, tmp, backups

Put ClickHouse data, system logs, and /tmp on separate disks to avoid IO interference. For example, keep /var/lib/clickhouse on fast NVMe, but write OS logs and container overlays elsewhere. Also, treat backups as first-class citizens in your DR plan when you design recovery runbooks.
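
A minimal sketch of that separation using mount commands; device names and mount points are placeholders, so adjust them to your hosts:

# ClickHouse data on local NVMe; scratch/tmp and logs on other devices
mount -o noatime,nodiratime /dev/nvme0n1 /var/lib/clickhouse       # hot data
mount -o noatime,nodiratime /dev/nvme1n1 /var/lib/clickhouse/tmp   # merge/tmp scratch
mount -o noatime            /dev/xvdf    /var/log                  # OS and server logs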

Tune the OS and filesystem

  • Use XFS or ext4 with large inode/extent settings for big files; XFS often wins for large sequential workloads.
  • Mount with noatime,nodiratime to reduce metadata writes.
  • Tune vm.dirty_ratio and vm.dirty_background_ratio so background writes don't cause sudden IO storms during merges (see the sketch after this list).
  • On NVMe, increase queue_depth and ensure multiqueue (blk-mq) is enabled.
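
A starting-point sketch for the writeback and NVMe queue settings above; the values are assumptions to tune against your own benchmarks, not universal recommendations:

# Keep background writeback small and frequent so merges don't trigger IO storms
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=15

# Confirm the NVMe device is on blk-mq and raise its request queue depth
cat /sys/block/nvme0n1/queue/scheduler            # typically 'none' or 'mq-deadline'
echo 1024 > /sys/block/nvme0n1/queue/nr_requests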

Benchmark before you commit

Run both synthetic IO benchmarks and ClickHouse-style workload tests.

Example fio for sequential write throughput:

fio --name=seqwrite --filename=/mnt/fast/testfile --rw=write --bs=1M --size=20G --numjobs=4 --iodepth=32 --direct=1
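
A complementary random-read run (same placeholder path) approximates concurrent range reads and is usually where networked block storage shows its tail-latency gap against local NVMe:

fio --name=randread --filename=/mnt/fast/testfile --rw=randread --bs=64k --size=20G --numjobs=8 --iodepth=32 --direct=1 --runtime=120 --time_based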

Example ClickHouse micro-benchmark: generate data and run realistic queries with clickhouse-benchmark or ClickHouse's query-log playback. Measure p50/p95/p99 and background merge throughput while queries run.
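
For example, replaying a representative query at realistic concurrency (the table and query are placeholders; use queries captured from your own query_log):

echo "SELECT count() FROM events WHERE event_time > now() - INTERVAL 1 DAY" | \
  clickhouse-benchmark --concurrency 16 --iterations 500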

PLC SSDs: when they make sense (and when they don't)

By 2026, PLC SSDs are attractive for the capacity layer because they reduce storage $/GB — but they change operational rules:

  • Use PLC for cold and read-dominant partitions where background merge write pressure is low.
  • Avoid using PLC for high-turnover partitions or for nodes doing heavy compactions.
  • Overprovision and monitor SMART/wear metrics closely. Schedule migrations before SSD endurance limits are hit.
  • Leverage ClickHouse TTLs and storage policies to migrate data to PLC-backed volumes automatically (a sketch follows this note).

Note: PLC endurance varies dramatically by vendor and firmware design. Validate real-world P/E cycles and controller-level write-amplification numbers before wide use.
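
Retrofitting an existing table is a one-off ALTER; the table, partition value, and volume names below are assumptions matching the earlier 'tiered' policy:

clickhouse-client --multiquery <<'SQL'
-- Move one old partition by hand to the PLC-backed 'cold' volume
ALTER TABLE events MOVE PARTITION 202505 TO VOLUME 'cold';
-- Then let TTL handle future data automatically
ALTER TABLE events MODIFY TTL
    event_time + INTERVAL 30 DAY TO VOLUME 'warm',
    event_time + INTERVAL 180 DAY TO VOLUME 'cold';
SQL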

Resilience and recovery trade-offs

If you pick local NVMe for performance, accept the need for replication. ClickHouse ReplicatedMergeTree and ZooKeeper/ClickHouse Keeper are your friends: they allow you to rely on fast local storage while ensuring durability across host failures. For a broader operational view on service resilience, see SRE beyond uptime.
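
A minimal sketch of a replicated table on fast local storage; the {shard} and {replica} macros must be defined in each server's config, and 'tiered' is the policy from earlier (all names are illustrative):

clickhouse-client --query "
CREATE TABLE events_local
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time)
SETTINGS storage_policy = 'tiered'"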

If you pick cloud block storage for durability, be aware of performance variability and implement multi-volume strategies to avoid per-volume limits. Also plan cross-AZ or cross-region disaster recovery using snapshots and S3 backups.

Cost modeling — practical approach

Don’t rely solely on $/GB. Build a simple model with these axes:

  • Capacity ($/GB/month)
  • IOPS & throughput requirements (cost to reach required IOPS/throughput)
  • Operational costs (rebalancing, snapshots, backups)
  • Failure & recovery costs (downtime, restore time objectives)

Create realistic scenarios: e.g., peak merge throughput of 8 GB/s for the cluster, average scan throughput of 2 GB/s, replication factor 3. For each storage option, calculate the number of nodes, drive types, and expected monthly bill. Run sensitivity analyses for 20% higher merge throughput or 2× data growth.
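
A back-of-the-envelope node count for that scenario (the figures are the example numbers above plus an assumed 2 GB/s sustained write per NVMe node):

PEAK_MERGE_GBPS=8        # cluster-wide peak merge throughput
REPLICATION_FACTOR=3     # every byte is written on three replicas
PER_NODE_WRITE_GBPS=2    # assumed sustained write per NVMe node
echo $(( PEAK_MERGE_GBPS * REPLICATION_FACTOR / PER_NODE_WRITE_GBPS ))   # -> 12 nodes minimum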

Benchmark-driven migration playbook (step-by-step)

  1. Profile: use iostat, sar, and ClickHouse's query_log plus metrics exporters to capture real IO patterns for 2–4 weeks (see the profiling sketch after this list).
  2. Baseline: run fio tests and ClickHouse synthetic workloads against candidate storage types under realistic concurrency.
  3. Plan: decide tiering — hot on NVMe, warm on block, cold on S3/PLC.
  4. Pilot: migrate a non-critical dataset and measure merge rates, query latencies, and failure recovery times.
  5. Automate: implement ClickHouse storage policies, automated TTL moves, and monitoring/alerts for wear and latency.
  6. Iterate: revisit sizing quarterly as PLC pricing and NVMe-oF adoption change economics.
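
A profiling sketch for step 1; the log path is a placeholder, and the columns used come from ClickHouse's standard system.query_log table:

# Sample device-level IO every 30 seconds during a representative period
iostat -xm 30 > /var/log/io_profile.log &

# Aggregate read/write volume per query shape from ClickHouse's query_log
clickhouse-client --query "
    SELECT
        normalizedQueryHash(query) AS query_shape,
        count()                    AS runs,
        sum(read_bytes)            AS total_read_bytes,
        sum(written_bytes)         AS total_written_bytes
    FROM system.query_log
    WHERE type = 'QueryFinish' AND event_time > now() - INTERVAL 7 DAY
    GROUP BY query_shape
    ORDER BY total_read_bytes DESC
    LIMIT 20"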

Short operational checklist

  • Use local NVMe for hot partitions; replicate across hosts.
  • Keep medium-term partitions on resilient cloud block with snapshots.
  • Tier cold data to S3 or PLC-backed volumes with clear TTLs.
  • Benchmark using both fio and ClickHouse traffic patterns.
  • Monitor SSD wear, IO latency, and merge throughput continuously.

Real-world example (anonymized)

A fintech analytics team (multi-tenant dashboards, peak concurrency 600 queries/s) moved from gp3 block storage to a mixed architecture in 2025–2026: hot partitions to local NVMe on fast instances, medium partitions to io2 block volumes (for snapshots and AZ durability), and archival data to S3. As a result they saw:

  • 40% lower p95 read latency on active dashboards.
  • ~18% reduction in monthly storage+IO cost with auto-migration to S3 and denser PLC-based archival nodes for cold data.
  • Faster recovery from node failures due to smaller local NVMe replicas and better merge parallelism.

They accomplished this without sacrificing durability by using ReplicatedMergeTree and automated storage policies.

2026 & beyond: what to watch

  • Greater PLC adoption: expect PLC to push down $/GB for cold tiers; watch firmware and endurance metrics closely.
  • Managed NVMe pools: cloud providers will widen NVMe-oF and regional NVMe services; these may blur the lines between local and remote NVMe.
  • ClickHouse features: expect deeper native tiering and smarter policies in managed ClickHouse offerings that automate much of this work.
  • Cost & carbon optimizations: density-efficient PLC drives can reduce data center footprint — an interesting angle for green engineering teams.

Actionable takeaways

  • Start by profiling: know your merge and read throughput before buying drives.
  • Use local NVMe for hot, high-throughput and low-latency needs; mitigate volatility with ClickHouse replication.
  • Choose cloud block when you need snapshots, resizing, or zone durability without replication complexity.
  • Use PLC SSDs for cold, read-mostly archives; automate moves with ClickHouse storage policies.
  • Benchmark continuously — vendor firmware and cloud network layers evolve fast, especially in 2026.

Final checklist before you commit

  • Have you measured merges under realistic data volumes?
  • Do you know per-volume and per-instance IOPS/throughput limits on your cloud provider?
  • Is your ClickHouse replication and Keeper/ZooKeeper setup ready for local NVMe failures?
  • Have you validated PLC endurance numbers with vendor SLAs or pilot nodes?

Call to action

If you want help turning this into a concrete migration plan for your ClickHouse cluster — including custom benchmarking scripts, storage policy templates, and a cost model — reach out to our team at untied.dev. We'll audit your workload, run a targeted pilot, and deliver a capacity/IO plan tied to your SLAs and budget. Let’s stop guessing and make storage a competitive advantage.
