Scaling Edge Inference: Deploying Tiny LLMs with k3s on Raspberry Pi Clusters
Practical guide to scale tiny LLMs on Raspberry Pi 5 with k3s: orchestration, storage, device plugins, autoscaling, and CI-driven model updates.
Hook: Turn your Raspberry Pi 5 farm into a horizontally scalable inference cluster
Pain point: Development teams struggle to deploy and update generative AI models at the edge because devices are heterogeneous, updates are brittle, and orchestration tools are too heavyweight for tiny devices. In 2026, you can build a repeatable, production-grade edge inference platform by combining k3s, Raspberry Pi 5 nodes (with new NPUs like the AI HAT+ 2), and a GitOps CI workflow for model delivery.
Executive summary — most important first
This guide shows how to cluster Raspberry Pi 5 devices using k3s to horizontally scale small LLM inference workloads. You'll get concrete, actionable advice for: packing tiny LLMs into efficient containers, choosing reliable storage (local SSDs + replicated block storage), scheduling NPUs with device plugins, autoscaling using HPA + KEDA, and implementing a CI -> artifact store -> GitOps pipeline for safe model updates and canary rollouts. The patterns below are based on real-world constraints seen in late 2025 and early 2026: improved ARM-optimized quantized models, the emergence of affordable NPUs for Pi 5 (AI HAT+ 2), and wider adoption of GitOps on the edge.
Why this matters in 2026
Edge inference is moving from proof-of-concept to production: models optimized for ARM and new tiny-NPU boards (AI HAT+ 2 and successors) make on-device generative AI viable. Teams want predictable performance and low-latency inference while avoiding cloud vendor lock-in and network costs. k3s remains the best lightweight Kubernetes for constrained nodes, providing fast bootstrap, ARM container runtime support, and plug-in compatibility for device scheduling and storage.
Topline trends
- Quantized tiny LLMs (2–4 bit) and GGML-style runtimes have matured for ARM CPUs and NPUs (late 2025 improvements). For hands-on Pi 5 LLMs and pocket inference nodes see Run Local LLMs on a Raspberry Pi 5.
- Affordable NPUs on Pi 5 boards unlock local acceleration; device plugins and vendor drivers are now stable enough for production.
- GitOps and model registries are standard practice for managing large artifacts at the edge.
"Edge-first deployments need orchestration that respects constrained hardware — k3s plus containerized tiny LLM runtimes give us that control."
Architecture overview
High-level components you will deploy:
- k3s control plane — one server node for small setups (k3s's embedded SQLite datastore is fine), or an HA control plane using embedded etcd (three servers) or an external datastore; control nodes can be Pi-class or a small x86 host. A minimal server config sketch follows this list.
- Worker pool — Raspberry Pi 5 nodes with attached SSDs and optional AI HAT+ 2 NPUs.
- Model artifact store — S3-compatible (MinIO) or OCI registry with ORAS for model blobs; for edge storage patterns see Edge Storage for Small SaaS.
- Ingress & LB — Traefik (built-in) or MetalLB for external IPs.
- CI/GitOps — CI pipelines that produce quantized model artifacts and GitOps agent (Argo CD or Flux) to deploy model updates. If you want orchestration context beyond k3s, review lightweight orchestrators and automation tooling like FlowWeave.
- Monitoring — Prometheus + Grafana, node-exporter, and custom metrics exporter for inference latency and request queue depth.
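As a concrete starting point, the first k3s server can be bootstrapped from a small config file. This is a minimal sketch using standard k3s options; cluster-init enables the embedded etcd needed for HA and can be dropped for a single-server cluster, and the tls-san hostname is a hypothetical example.
# /etc/rancher/k3s/config.yaml on the first control-plane node (minimal sketch)
cluster-init: true               # start embedded etcd so additional servers can join for HA
write-kubeconfig-mode: "0644"    # let non-root tooling read the generated kubeconfig
tls-san:
  - "k3s.edge.local"             # hypothetical extra SAN if you front the API with a stable name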
Hardware & storage recommendations
Raspberry Pi 5 is a significant step up in CPU and I/O; combine that with SSDs for model storage and you avoid microSD wear-out and I/O bottlenecks.
Node hardware
- Use Raspberry Pi 5 with at least 8GB RAM for heavier quantized models; 4GB is ok for tiny tasks. See the Pi 5 LLM guide: Run Local LLMs on a Raspberry Pi 5.
- Attach a USB3/PCIe NVMe SSD for the OS and model cache — prefer booting from SSD where possible. Edge storage guidance is available at Edge Storage for Small SaaS.
- Consider AI HAT+ 2 (late 2025 hardware) if you need NPU acceleration; ensure vendor drivers are container-compatible.
Storage choices (pluggable based on expected workload)
- Local SSD per node (recommended): Fast and simple. Use the k3s local-path-provisioner for per-node PVs and ensure your inference pods prefer the node-local PV using nodeAffinity (a minimal PVC sketch follows this list). Best for low-latency workloads where model replicas live across nodes.
- Longhorn (replicated block storage): Works on ARM and gives replicated persistent volumes across Pi nodes. Suitable for small clusters (3–5 nodes). Avoid if nodes are highly resource-constrained.
- MinIO (S3-compatible) for large models: Store canonical model artifacts outside the cluster; use initContainers to download models to local SSD on startup. This keeps the k8s datastore small and lets CI push large artifacts without rebuilding images. See edge storage patterns for caching and bandwidth strategies.
- Rook/Ceph: Powerful but heavy. Only for larger edge clusters with nodes that can tolerate the overhead.
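For the node-local option, a minimal PersistentVolumeClaim against the local-path provisioner that k3s bundles might look like this; the claim name matches the Deployment skeleton later in this guide, and the size is illustrative.
# PVC backed by the k3s local-path provisioner (size is illustrative)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 10Gi
Because local-path volumes live on a single node's disk, pods bound to the claim are scheduled back onto that node, which is exactly the per-node model-cache behavior described above.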
Containerizing tiny LLM runtimes for ARM
Best practice is to keep the runtime minimal and separate model artifact downloads from the container image. Use multi-arch builds (linux/arm64) and keep images stateless where possible.
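For the multi-arch build itself, a hedged sketch of CI steps using the standard Docker actions could slot into a GitHub Actions job like the one shown later in this guide; the image name and tag are placeholders.
# CI job steps for a linux/arm64 image build (image name is a placeholder)
steps:
  - name: Set up QEMU
    uses: docker/setup-qemu-action@v3
  - name: Set up Buildx
    uses: docker/setup-buildx-action@v3
  - name: Build and push arm64 image
    uses: docker/build-push-action@v6
    with:
      platforms: linux/arm64
      push: true
      tags: myregistry/tiny-llm:arm64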
Example: container layout
- Base image: ubuntu:22.04 or debian slim with ARM64 binaries.
- Runtime: llama.cpp or GGML-based binary compiled for ARM64+neon + NPU bindings if available (see Pi 5 LLM guide: Run Local LLMs on a Raspberry Pi 5).
- Entrypoint: small server (FastAPI, actix, or tiny HTTP server) that loads the model from /models at start and exposes a gRPC/HTTP API.
- InitContainer: downloads the quantized model from MinIO/OCI to /models before the main container starts.
# Deployment skeleton (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tiny-llm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tiny-llm
  template:
    metadata:
      labels:
        app: tiny-llm
    spec:
      initContainers:
        - name: fetch-model
          image: minio/mc
          # assumes the "s3" alias is already configured for the MinIO endpoint (e.g. via an MC_HOST_s3 env var from a Secret)
          command: ["/bin/sh", "-c", "mc cp s3/models/quantized-ggml.bin /models/quantized.bin"]
          volumeMounts:
            - name: models
              mountPath: /models
      containers:
        - name: server
          image: myregistry/tiny-llm:arm64
          resources:
            requests:
              cpu: "500m"
              memory: "1024Mi"
            limits:
              cpu: "1500m"
              memory: "2048Mi"
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-pvc
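To give Traefik (or any gateway) a stable backend for those replicas, a plain ClusterIP Service is enough; the port numbers below are assumptions and should match whatever your inference server listens on.
# Service fronting the tiny-llm replicas (ports are assumptions)
apiVersion: v1
kind: Service
metadata:
  name: tiny-llm
spec:
  selector:
    app: tiny-llm
  ports:
    - name: http
      port: 80
      targetPort: 8080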
Scheduling NPUs and device plugins
To use AI HAT+ 2 or similar NPUs, expose the device via a Kubernetes device plugin or use node labels and a small wrapper that interfaces with the vendor runtime. Treat the NPU as a scarce resource and schedule pods using resourceRequests and nodeSelector/affinity.
Pattern
- Install the vendor device plugin (or a community plugin) as a DaemonSet so k3s exposes a resource like npu.ai/1.
- Add a resources request for npu.ai/1: 1 to pods that need hardware acceleration (a pod sketch follows this list).
- Label NPU nodes with kubectl label node pi5-01 npu=true and use nodeAffinity to prefer those nodes.
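Put together, a pod that requests the accelerator might look like the sketch below; the npu.ai/1 resource name is just the example used above, so substitute whatever name the vendor's device plugin actually advertises.
# Pod requesting one NPU (resource name and image are illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: tiny-llm-npu
spec:
  nodeSelector:
    npu: "true"              # matches the label applied with kubectl label node pi5-01 npu=true
  containers:
    - name: server
      image: myregistry/tiny-llm:arm64
      resources:
        limits:
          npu.ai/1: "1"      # extended resources are set in limits; requests default to the same value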
Autoscaling and load management
Edge workloads have bursty local traffic. Combine Kubernetes HPA (CPU/memory or custom metrics) with KEDA for event-driven scaling (MQTT, Kafka, Redis queues). HPA handles steady, metric-driven scaling; KEDA adds event triggers and can scale to zero for cost savings when idle.
Recommendations
- Expose a lightweight inference gateway (Envoy or Traefik) to handle connection pooling and rate limiting.
- Use HPA with custom metrics (e.g., inference latency SLO breach rate) — push metrics to Prometheus and use the Prometheus adapter. For advanced latency and observability patterns, see Intraday Edge: Latency & Observability.
- For bursty event-driven loads (e.g., queued audio transcriptions), use KEDA to scale based on queue depth.
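As one sketch of that queue-depth pattern, a KEDA ScaledObject with a Prometheus trigger could look like the following; the inference_queue_depth metric name is an assumption for whatever your gateway or worker exports.
# KEDA ScaledObject scaling the tiny-llm Deployment on queue depth (metric name is an assumption)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: tiny-llm-queue
spec:
  scaleTargetRef:
    name: tiny-llm
  minReplicaCount: 1
  maxReplicaCount: 6
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(inference_queue_depth)
        threshold: "20"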
CI and model deployment workflow
Model artifacts are large — treat them like code but keep them out of container images. Use CI to build quantized models, store artifacts in S3-compatible storage or an OCI artifact repo, and deploy via GitOps with a canary strategy.
Pipeline stages (recommended)
- Model training/fine-tuning (cloud or beefy machine): produce quantized artifact (ggml, ONNX, or .pt QLoRA output).
- CI job: verify the artifact (unit tests, sample inferences), produce checksums, push to MinIO or ORAS-compliant registry.
- Update a Git repository (manifest or Helm values) with the new model version and checksum.
- GitOps agent (Argo CD/Flux) notices the change and applies a canary rollout (Argo Rollouts recommended) to the k3s cluster; a minimal Argo CD Application sketch follows this list. If you want to consider alternative orchestration and automation tooling for CI/CD, look at FlowWeave.
- Canary runs smoke tests (latency, hallucination rate) and promotes rollout if successful; otherwise rollback automatically.
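For the GitOps step, a minimal Argo CD Application that tracks the repo holding the model manifest might look like this sketch; the repository URL, path, and namespaces are placeholders.
# Argo CD Application watching the manifests repo (URL, path, and namespaces are placeholders)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tiny-llm-edge
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/edge-llm-manifests
    targetRevision: main
    path: clusters/pi5-store-01
  destination:
    server: https://kubernetes.default.svc
    namespace: inference
  syncPolicy:
    automated:
      prune: true
      selfHeal: true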
Example GitHub Actions snippet (build & push model to MinIO)
name: Build and Push Quantized Model
on:
  workflow_dispatch:
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: write   # needed so the workflow can push model-version.yaml back to the repo
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Download raw model
        run: ./scripts/download_base_model.sh
      - name: Quantize model
        run: ./scripts/quantize.sh --input base.bin --output quantized-ggml.bin
      - name: Upload to MinIO
        env:
          MINIO_ENDPOINT: http://minio:9000
          MINIO_ACCESS: ${{ secrets.MINIO_ACCESS }}
          MINIO_SECRET: ${{ secrets.MINIO_SECRET }}
        # assumes the mc client is available on the runner (install it in an earlier step if not)
        run: |
          mc alias set s3 "$MINIO_ENDPOINT" "$MINIO_ACCESS" "$MINIO_SECRET"
          mc cp quantized-ggml.bin s3/models/quantized-${{ github.sha }}.bin
      - name: Create model manifest
        run: |
          echo "model: quantized-${{ github.sha }}.bin" > model-version.yaml
          git config user.name "github-actions" && git config user.email "actions@github.com"
          git add model-version.yaml && git commit -m "model: ${GITHUB_SHA}" && git push
Deployment patterns and canary strategies
Canary deployments are essential: a bad model can degrade UX quickly. Use Argo Rollouts or a simple two-phase Deployment and health probes. When updating models via an initContainer, use a sidecar that performs validation on model load and exposes readiness only after validation passes.
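With Argo Rollouts, the canary steps live directly on the Rollout resource. The sketch below shifts 20% of traffic, pauses, runs an analysis, then promotes; the latency-slo-check AnalysisTemplate is hypothetical and would wrap your smoke tests.
# Rollout with a canary strategy (the AnalysisTemplate name is hypothetical)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: tiny-llm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tiny-llm
  template:
    metadata:
      labels:
        app: tiny-llm
    spec:
      containers:
        - name: server
          image: myregistry/tiny-llm:arm64   # same pod spec as the Deployment skeleton above
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: latency-slo-check
        - setWeight: 100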
Safety checklist for model updates
- Checksum verification of model artifacts (an initContainer sketch follows this list).
- Automated smoke tests on a sample dataset.
- Canary rollout with traffic shadowing before full promotion.
- Automated rollback on SLO breach.
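A hedged sketch of the checksum step, written as a drop-in replacement for the Deployment's initContainer above: it assumes the fetch image provides sha256sum (use a busybox-based helper if it does not), and MODEL_SHA256 would be injected from the GitOps-managed model manifest.
# initContainer that refuses to hand off to the server if the artifact checksum drifts
initContainers:
  - name: fetch-and-verify-model
    image: minio/mc
    command:
      - /bin/sh
      - -c
      - |
        mc cp s3/models/quantized-ggml.bin /models/quantized.bin
        echo "${MODEL_SHA256}  /models/quantized.bin" | sha256sum -c -
    env:
      - name: MODEL_SHA256
        value: "<expected-sha256>"   # placeholder — sourced from the model manifest committed by CI
    volumeMounts:
      - name: models
        mountPath: /models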
Monitoring, logging, and debugging
Observability is your safety net. Track system metrics and model-level metrics separately.
Essential metrics
- Node CPU, memory, disk I/O, and temperature (Pi devices can throttle).
- Inference latency p50/p95/p99 and throughput (requests/sec).
- Model load times and model cache hit/miss rates.
- Queue depth (if using async task queues) and NPU utilization.
Tools
- Prometheus + Grafana for metrics and dashboards (an example SLO alert rule follows this list). For deeper latency and observability playbooks see Intraday Edge: Latency & Observability.
- Loki for logs, with structured logs from your inference container.
- Jaeger or OpenTelemetry for tracing requests through gateway -> inference service.
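If you run the Prometheus Operator (e.g. via kube-prometheus-stack), the p95 latency SLO from the metrics list above can be enforced with a PrometheusRule; the inference_latency_seconds_bucket histogram name and the 500ms threshold are assumptions for whatever your server exports and whatever SLO you set.
# Alert on a sustained p95 latency SLO breach (metric name and threshold are assumptions)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tiny-llm-latency
spec:
  groups:
    - name: inference-slo
      rules:
        - alert: InferenceP95LatencyHigh
          expr: histogram_quantile(0.95, sum(rate(inference_latency_seconds_bucket[5m])) by (le)) > 0.5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "p95 inference latency above 500ms for 10 minutes"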
Common pitfalls and how to avoid them
- MicroSD rootfs: Avoid running inference from microSD — use SSDs to prevent I/O bottlenecks and reliability issues. Edge storage recommendations are summarized at Edge Storage for Small SaaS.
- Insufficient swap or oom kills: Set appropriate resource requests/limits and use eviction thresholds to keep control-plane healthy.
- Model transfer times: Use local caching and incremental updates; don't push huge models inside container images. For strategies on artifact distribution and caching see Edge Storage for Small SaaS.
- Overloaded NPUs: Rate-limit requests and use request queues to smooth bursts.
- Network partition: Design for degraded mode — allow nodes to serve from local cache if central MinIO is unreachable. Patterns for offline/edge-first kiosk and field deployments are covered in on-device proctoring & offline-first kiosks and in offline app playbooks like offline-first field service apps.
Real-world case study (compact)
In late 2025 a retail analytics team deployed a 5-node Pi 5 cluster with k3s and AI HAT+ 2 boards to serve real-time product description generation in-store. Using MinIO for artifacts, Longhorn for PVs, and Argo CD for GitOps, they achieved:
- Average inference latency of 120ms for a 3-bit quantized LLM on the NPU.
- Zero downtime model updates with Argo Rollouts canaries and automated rollback.
- Reduced cloud costs by 70% for inference traffic previously routed to cloud GPUs.
Actionable checklist — implement this in your first week
- Set up a minimal k3s cluster (1 control + 3 Pi 5 workers). Boot from SSD on workers. For Pi 5 LLM-specific notes see Run Local LLMs on a Raspberry Pi 5.
- Install device plugin for AI HAT+ 2 or add node labels for NPU nodes.
- Deploy MinIO and set up a CI job that uploads a small quantized model.
- Deploy a containerized llama.cpp inference server using an initContainer to pull the model from MinIO.
- Add Prometheus + Grafana and create a dashboard for inference latency and NPU utilization. See observability playbooks at Intraday Edge: Latency & Observability.
- Automate a canary rollout using Argo Rollouts and validate rollback behavior. Consider automation tool compatibility (e.g., FlowWeave) when designing CI workflows.
Future predictions (2026 and beyond)
Edge inference on small devices will continue to improve due to better quantization methods (3-bit dynamic ranges), standardized device plugin interfaces, and richer model registries supporting OCI artifacts. Expect model delta updates and ORAS-based model distribution to become mainstream, making incremental updates across a fleet fast and bandwidth-efficient.
Key takeaways
- k3s + Pi 5 is a practical, production-capable stack for scaled edge inference of tiny LLMs in 2026.
- Treat models as large artifacts: keep them in S3/OCI registries and download them at pod start rather than baking into images. See edge storage patterns.
- Use device plugins and node labels to schedule NPUs, and prefer SSD-backed storage over microSD.
- Combine HPA and KEDA for responsive scaling, and add GitOps + canary rollouts for safe model updates. Review orchestration and automation tooling like FlowWeave when designing pipelines.
- Invest in observability: model-level metrics and automated rollback are non-negotiable for production readiness. For practical latency and observability techniques, see Intraday Edge.
Next steps (call to action)
Ready to prototype? Start with a 3-node Pi 5 k3s cluster and a single quantized model artifact — follow the checklist above. If you want a jumpstart, download our k3s-for-edge starter repo (includes manifests, Argo CD setup, and a CI workflow for quantized models) and open an issue with your hardware profile — we’ll help you tailor the deployment to your constraints.
Want the starter kit? Visit the project repo, clone the k3s-pi5-edge-llm template, and run the bootstrap script. You’ll have a working inference cluster with GitOps in under an hour.
Related Reading
- Run Local LLMs on a Raspberry Pi 5: Building a Pocket Inference Node
- Edge Storage for Small SaaS in 2026
- FlowWeave 2.1 — Orchestration & Automation Context
- Intraday Edge: Latency & Observability