Storage Choices for Analytics: When to Use Local NVMe vs Cloud Block Storage for ClickHouse
2026-02-05

Decide NVMe, cloud block, or PLC SSD for ClickHouse in 2026—balance latency, throughput, endurance, and cost with practical, actionable guidance.

Storage Choices for Analytics in 2026: stop guessing and optimize for ClickHouse

If your ClickHouse cluster stalls during background merges, your p95 query latency balloons, or your monthly storage bill keeps surprising finance, the storage layer is likely the culprit. Choosing between local NVMe, cloud block storage, and cheaper PLC SSDs isn't just a cost decision: it directly impacts latency, throughput, endurance, and your ability to scale.

Quick answer (read this first)

For ClickHouse-style OLAP in 2026: use local NVMe for hot, merge-heavy workloads and latency-sensitive reads; use cloud block storage when durability, multi-zone availability and flexible sizing matter; consider PLC SSDs for capacity-dominated, read-mostly cold tiers. Mix them with object storage (S3-compatible) for cold data and backups, and use ClickHouse storage policies to tier automatically.

Why this matters in 2026

Several trends changed the storage decision calculus in late 2024–2026:

  • ClickHouse adoption exploded as enterprises chase analytics performance; the company’s rapid funding and product growth (2024–2026) pushed many teams to operate increasingly large, multi-PB clusters.
  • Flash supply and pricing started shifting: SK Hynix and others advanced PLC (penta-level cell, 5 bits per cell) NAND and improved cell designs that promise much lower $/GB. PLC SSDs are becoming viable for cold, capacity-focused tiers by 2026.
  • Cloud providers now offer NVMe-oF and higher-performance block services; local NVMe instance storage remains the lowest latency option but is ephemeral on many clouds.
  • Managed ClickHouse offerings abstract parts of storage decisions — but for self-managed clusters, storage remains a dominant operational cost and risk.

Fundamentals: what ClickHouse workloads demand from storage

Before vendor RFPs, profile your workload. ClickHouse's I/O behavior for MergeTree families is distinctive:

  • Background merges are long-running sequential reads and writes; they benefit from high sustained throughput and consistent write performance (a query for watching merges live follows this list).
  • Ad-hoc and analytical queries are often wide but read-heavy; they rely on throughput for full-scan performance and low tail latencies for interactive dashboards.
  • Small writes and inserts can be bursty (especially with micro-batches) and affect latency when competing with merges.
  • Compactions create large temporary IO spikes, so peak throughput matters almost as much as average.
  • Replication in ReplicatedMergeTree provides durability and availability but means extra network traffic during inserts and recovery.
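
A quick way to see that merge pressure directly is ClickHouse's system.merges table. This is a minimal sketch you would run on any node with clickhouse-client; the columns queried are standard system-table fields:

clickhouse-client --query "
    SELECT
        table,
        elapsed,
        num_parts,
        formatReadableSize(bytes_read_uncompressed)    AS read,
        formatReadableSize(bytes_written_uncompressed) AS written
    FROM system.merges
    ORDER BY elapsed DESC"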

Storage options and their tradeoffs

1) Local NVMe (ephemeral or attached)

What it is: NVMe SSDs physically attached to the host (instance local storage) or locally attached NVMe drives on bare metal.

Pros:

  • Lowest latency and highest IOPS per dollar for hot data.
  • Excellent sustained throughput for merges and full scans.
  • Better tail latency consistency compared to networked block.
  • Lower CPU overhead for IO (NVMe driver efficiencies).

Cons:

  • Often ephemeral: VM host failure = data loss unless you replicate (ClickHouse ReplicatedMergeTree).
  • Scaling requires instance resizing or rebalancing; snapshotting and backups can be harder.
  • Higher effective cost per durable GB than some cloud block tiers, because durability requires keeping replicated copies.

2) Cloud block storage (EBS, GCP PD, Azure Managed Disks)

What it is: Network-attached virtual block devices provided by cloud vendors with durability and snapshot features.

Pros:

  • Durable and multi-zone options; generally survive instance failures.
  • Snapshotting and volume resizing without instance reprovisioning.
  • Predictable billing and delegated operational complexity.

Cons:

  • Higher latency vs local NVMe; tail latency variability from network layers.
  • Per-volume throughput caps and per-instance limits (you must size both).
  • IOPS or throughput often billed separately (cost tradeoffs).

3) PLC SSDs (penta-level cell, high-density NAND)

What it is: SSDs that store five bits per memory cell to increase density and lower $/GB, at the cost of endurance and, often, write performance.

Pros:

  • Substantially lower storage $/GB for capacity tiers — attractive for archival and warm tables.
  • When used behind good controllers and firmware, they can be effective for read-mostly analytics tiers.

Cons:

  • Lower program/erase (P/E) cycles → lower endurance for write-heavy workloads.
  • Write performance and latency under sustained heavy writes can degrade.
  • Require firmware-aware wear-leveling and careful overprovisioning.

4) Networked NVMe (NVMe over Fabrics, NVMe-oF)

What it is: Remote NVMe that exposes NVMe semantics over RDMA or TCP; promises near-local performance over the network.

Pros:

  • Better performance than generic block over TCP; lower latency and higher efficiency at deep queue depths.
  • Enables shared NVMe pools without the durability/ephemeral tradeoffs of local NVMe.

Cons:

  • Requires network architecture investment and provider support.
  • Performance is sensitive to network tuning and can still be worse than local NVMe for tail latency.

Typical latency & throughput expectations (order-of-magnitude guidance)

These are typical ranges in production (2026). Measure in your environment — provider, instance type, and firmware matter.

  • Local NVMe: tens to low hundreds of microseconds for simple IO; excellent sustained throughput (several GB/s per drive sequential).
  • Cloud block: roughly 0.5–5 ms for many managed block services; throughput limited by per-volume and per-instance caps.
  • NVMe-oF: ~200–800 microseconds depending on RDMA/TCP and network — can approach local NVMe in tuned setups. (See also edge NVMe patterns.)
  • PLC SSDs: higher tail latency on sustained writes; read latencies similar to TLC/QLC when controller buffers are healthy.

How to choose: a practical decision matrix

Use this matrix as a baseline. Your SLA, data size, access patterns, and budget will push you towards one mix or another.

  • Hot (real-time dashboards, sub-100ms SLO): Local NVMe; use replication to tolerate host failures. Prefer fast instances with NVMe and large DRAM for caches.
  • Warm (hourly queries, tolerant of ms-level latency): high-performance cloud block volumes (e.g., io2 Block Express or equivalent) or NVMe-oF if available. Snapshots and durability are valuable.
  • Cold (monthly/quarterly analytics, low concurrency): PLC SSD-backed volumes or dense cloud HDD/cold options; prefer object storage tiering for lowest $/GB.
  • Large slices of archival data: Move to S3-compatible object storage via ClickHouse storage policies and use cloud block or NVMe for index/active partitions.
  • Small clusters and dev environments: Block storage is easiest operationally; local NVMe is overkill unless testing production-like loads.

Operational patterns and best practices

Design storage tiers using ClickHouse storage policies

ClickHouse supports multiple disks and storage policies. Typical setup:

  • Disk A (local NVMe): fast, for recent partitions and indices.
  • Disk B (cloud block): for stable medium-age partitions and snapshots.
  • Disk C (S3): cold data, moved via move_partition or automated policies.

Example (simplified) ClickHouse disk/policy snippet:

<storage_configuration>
  <disks>
    <fast>
      <path>/var/lib/clickhouse/fast/</path>
    </fast>
    <bulk>
      <path>/mnt/block/bulk/</path>
    </bulk>
    <s3cold>
      <type>s3</type>
      <endpoint>https://s3.example/clickhouse-cold/</endpoint>
      <use_environment_credentials>true</use_environment_credentials>
    </s3cold>
  </disks>
  <policies>
    <tiered>
      <volumes>
        <hot>
          <disk>fast</disk>
        </hot>
        <warm>
          <disk>bulk</disk>
        </warm>
        <cold>
          <disk>s3cold</disk>
        </cold>
      </volumes>
      <move_factor>0.1</move_factor>
    </tiered>
  </policies>
</storage_configuration>
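
Once the policy exists, attach it to a table and let TTL moves do the tiering. The schema below is illustrative only; the volume names must match the policy above (a sketch, assuming the 'tiered' policy with hot/warm/cold volumes):

clickhouse-client --multiquery <<'SQL'
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time)
TTL event_time + INTERVAL 30 DAY TO VOLUME 'warm',
    event_time + INTERVAL 180 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered';
SQL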

Separate concerns: data, logs, tmp, backups

Put ClickHouse data, system logs, and /tmp on separate disks to avoid IO interference. For example, keep /var/lib/clickhouse on fast NVMe, but write OS logs and container overlays elsewhere. Also, treat backups as first-class citizens in your DR plan when you design recovery runbooks.
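
A minimal sketch of that separation using mount commands; device names and mount points are placeholders, so adjust them to your hosts:

# ClickHouse data on local NVMe; scratch/tmp and logs on other devices
mount -o noatime,nodiratime /dev/nvme0n1 /var/lib/clickhouse       # hot data
mount -o noatime,nodiratime /dev/nvme1n1 /var/lib/clickhouse/tmp   # merge/tmp scratch
mount -o noatime            /dev/xvdf    /var/log                  # OS and server logs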

Tune the OS and filesystem

  • Use XFS or ext4 with large inode/extent settings for big files; XFS often wins for large sequential workloads.
  • Mount with noatime,nodiratime to reduce metadata writes.
  • Tune vm.dirty_ratio and vm.dirty_background_ratio so background writes don't cause sudden IO storms during merges (see the sketch after this list).
  • On NVMe, increase queue_depth and ensure multiqueue (blk-mq) is enabled.
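
A starting-point sketch for the writeback and NVMe queue settings above; the values are assumptions to tune against your own benchmarks, not universal recommendations:

# Keep background writeback small and frequent so merges don't trigger IO storms
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=15

# Confirm the NVMe device is on blk-mq and raise its request queue depth
cat /sys/block/nvme0n1/queue/scheduler            # typically 'none' or 'mq-deadline'
echo 1024 > /sys/block/nvme0n1/queue/nr_requests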

Benchmark before you commit

Run both synthetic IO benchmarks and ClickHouse-style workload tests.

Example fio for sequential write throughput:

fio --name=seqwrite --filename=/mnt/fast/testfile --rw=write --bs=1M --size=20G --numjobs=4 --iodepth=32 --direct=1
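
A complementary random-read run (same placeholder path) approximates concurrent range reads and is usually where networked block storage shows its tail-latency gap against local NVMe:

fio --name=randread --filename=/mnt/fast/testfile --rw=randread --bs=64k --size=20G --numjobs=8 --iodepth=32 --direct=1 --runtime=120 --time_based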

Example ClickHouse micro-benchmark: generate data and run realistic queries with clickhouse-benchmark or ClickHouse's query-log playback. Measure p50/p95/p99 and background merge throughput while queries run.
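
For example, replaying a representative query at realistic concurrency (the table and query are placeholders; use queries captured from your own query_log):

echo "SELECT count() FROM events WHERE event_time > now() - INTERVAL 1 DAY" | \
  clickhouse-benchmark --concurrency 16 --iterations 500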

PLC SSDs: when they make sense (and when they don't)

By 2026, PLC SSDs are attractive for the capacity layer because they reduce storage $/GB — but they change operational rules:

  • Use PLC for cold and read-dominant partitions where background merge write pressure is low.
  • Avoid using PLC for high-turnover partitions or for nodes doing heavy compactions.
  • Overprovision and monitor SMART/wear metrics closely. Schedule migrations before SSD endurance limits are hit.
  • Leverage ClickHouse TTLs and storage policies to migrate data to PLC-backed volumes automatically (a sketch follows this note).

Note: PLC endurance varies dramatically by vendor and firmware design. Validate real-world P/E cycles and controller-level write-amplification numbers before wide use.
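
Retrofitting an existing table is a one-off ALTER; the table, partition value, and volume names below are assumptions matching the earlier 'tiered' policy:

clickhouse-client --multiquery <<'SQL'
-- Move one old partition by hand to the PLC-backed 'cold' volume
ALTER TABLE events MOVE PARTITION 202505 TO VOLUME 'cold';
-- Then let TTL handle future data automatically
ALTER TABLE events MODIFY TTL
    event_time + INTERVAL 30 DAY TO VOLUME 'warm',
    event_time + INTERVAL 180 DAY TO VOLUME 'cold';
SQL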

Resilience and recovery trade-offs

If you pick local NVMe for performance, accept the need for replication. ClickHouse ReplicatedMergeTree and ZooKeeper/ClickHouse Keeper are your friends: they allow you to rely on fast local storage while ensuring durability across host failures. For a broader operational view on service resilience, see SRE beyond uptime.
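
A minimal sketch of a replicated table on fast local storage; the {shard} and {replica} macros must be defined in each server's config, and 'tiered' is the policy from earlier (all names are illustrative):

clickhouse-client --query "
CREATE TABLE events_local
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time)
SETTINGS storage_policy = 'tiered'"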

If you pick cloud block storage for durability, be aware of performance variability and implement multi-volume strategies to avoid per-volume limits. Also plan cross-AZ or cross-region disaster recovery using snapshots and S3 backups.

Cost modeling — practical approach

Don’t rely solely on $/GB. Build a simple model with these axes:

  • Capacity ($/GB/month)
  • IOPS & throughput requirements (cost to reach required IOPS/throughput)
  • Operational costs (rebalancing, snapshots, backups)
  • Failure & recovery costs (downtime, restore time objectives)

Create realistic scenarios: e.g., peak merge throughput of 8 GB/s for the cluster, average scan throughput of 2 GB/s, replication factor 3. For each storage option, calculate the number of nodes, drive types, and expected monthly bill. Run sensitivity analyses for 20% higher merge throughput or 2× data growth.
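
A back-of-the-envelope node count for that scenario (the figures are the example numbers above plus an assumed 2 GB/s sustained write per NVMe node):

PEAK_MERGE_GBPS=8        # cluster-wide peak merge throughput
REPLICATION_FACTOR=3     # every byte is written on three replicas
PER_NODE_WRITE_GBPS=2    # assumed sustained write per NVMe node
echo $(( PEAK_MERGE_GBPS * REPLICATION_FACTOR / PER_NODE_WRITE_GBPS ))   # -> 12 nodes minimum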

Benchmark-driven migration playbook (step-by-step)

  1. Profile: use iostat, sar, and ClickHouse's query_log plus metrics exporters to capture real IO patterns for 2–4 weeks (see the profiling sketch after this list).
  2. Baseline: run fio tests and ClickHouse synthetic workloads against candidate storage types under realistic concurrency.
  3. Plan: decide tiering — hot on NVMe, warm on block, cold on S3/PLC.
  4. Pilot: migrate a non-critical dataset and measure merge rates, query latencies, and failure recovery times.
  5. Automate: implement ClickHouse storage policies, automated TTL moves, and monitoring/alerts for wear and latency.
  6. Iterate: revisit sizing quarterly as PLC pricing and NVMe-oF adoption change economics.
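
A profiling sketch for step 1; the log path is a placeholder, and the columns used come from ClickHouse's standard system.query_log table:

# Sample device-level IO every 30 seconds during a representative period
iostat -xm 30 > /var/log/io_profile.log &

# Aggregate read/write volume per query shape from ClickHouse's query_log
clickhouse-client --query "
    SELECT
        normalizedQueryHash(query) AS query_shape,
        count()                    AS runs,
        sum(read_bytes)            AS total_read_bytes,
        sum(written_bytes)         AS total_written_bytes
    FROM system.query_log
    WHERE type = 'QueryFinish' AND event_time > now() - INTERVAL 7 DAY
    GROUP BY query_shape
    ORDER BY total_read_bytes DESC
    LIMIT 20"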

Short operational checklist

  • Use local NVMe for hot partitions; replicate across hosts.
  • Keep medium-term partitions on resilient cloud block with snapshots.
  • Tier cold data to S3 or PLC-backed volumes with clear TTLs.
  • Benchmark using both fio and ClickHouse traffic patterns.
  • Monitor SSD wear, IO latency, and merge throughput continuously.

Real-world example (anonymized)

A fintech analytics team (multi-tenant dashboards, peak concurrency 600 queries/s) moved from gp3 block storage to a mixed architecture in 2025–2026: hot partitions to local NVMe on fast instances, medium partitions to io2 block volumes (for snapshots and AZ durability), and archival data to S3. As a result they saw:

  • 40% lower p95 read latency on active dashboards.
  • ~18% reduction in monthly storage+IO cost with auto-migration to S3 and denser PLC-based archival nodes for cold data.
  • Faster recovery from node failures due to smaller local NVMe replicas and better merge parallelism.

They accomplished this without sacrificing durability by using ReplicatedMergeTree and automated storage policies.

2026 & beyond: what to watch

  • Greater PLC adoption: expect PLC to push down $/GB for cold tiers; watch firmware and endurance metrics closely.
  • Managed NVMe pools: cloud providers will widen NVMe-oF and regional NVMe services; these may blur the lines between local and remote NVMe.
  • ClickHouse features: expect deeper native tiering and smarter policies in managed ClickHouse offerings that automate much of this work.
  • Cost & carbon optimizations: density-efficient PLC drives can reduce data center footprint — an interesting angle for green engineering teams.

Actionable takeaways

  • Start by profiling: know your merge and read throughput before buying drives.
  • Use local NVMe for hot, high-throughput and low-latency needs; mitigate volatility with ClickHouse replication.
  • Choose cloud block when you need snapshots, resizing, or zone durability without replication complexity.
  • Use PLC SSDs for cold, read-mostly archives; automate moves with ClickHouse storage policies.
  • Benchmark continuously — vendor firmware and cloud network layers evolve fast, especially in 2026.

Final checklist before you commit

  • Have you measured merges under realistic data volumes?
  • Do you know per-volume and per-instance IOPS/throughput limits on your cloud provider?
  • Is your ClickHouse replication and Keeper/ZooKeeper setup ready for local NVMe failures?
  • Have you validated PLC endurance numbers with vendor SLAs or pilot nodes?

Call to action

If you want help turning this into a concrete migration plan for your ClickHouse cluster — including custom benchmarking scripts, storage policy templates, and a cost model — reach out to our team at untied.dev. We'll audit your workload, run a targeted pilot, and deliver a capacity/IO plan tied to your SLAs and budget. Let’s stop guessing and make storage a competitive advantage.
