Debugging Real-World Outages: What Telemetry to Collect to Recreate an X/Cloudflare Incident
Practical telemetry checklist—edge logs, DNS traces, TCP/UDP metrics—and storage/query patterns to reconstruct CDN/DNS outages fast.
Why you need the right telemetry to reconstruct a CDN/DNS outage
Outages at the edge—CDN or DNS failures—are brutal. They stop traffic at scale, scramble teams, and stretch postmortems for days. If your observability is fragmented, you’ll spend the first 12–48 hours hunting for raw data instead of fixing the root cause. The good news: with a small set of targeted telemetry, consistent timestamping, and an ingestion/archival plan, you can reconstruct incidents like the Jan 2026 X/Cloudflare disruption in hours, not days.
Executive summary (most important first)
Collect these four classes of telemetry: edge logs, DNS traces, network/TCP metrics (including packet captures on demand), and control-plane / routing telemetry (BGP, CDN control API events). Store high-cardinality logs hot for 72 hours, roll up metrics for 30+ days, and archive raw captures to compressed object storage with indexed metadata. Instrument logs and traces with a stable trace-id and millisecond UTC timestamps. Build a short incident query playbook that runs across logs, metrics, and DNS traces simultaneously.
Why this matters in 2026
By 2026, CDNs and DNS providers have pushed more compute to the edge (edge workers, edge functions, DoH/DoT at scale), and growing DNS privacy adoption has changed where query traces are visible. At the same time, providers like Cloudflare and AWS expose richer real-time logs, and eBPF-based collection tools have matured. That means teams can, and should, collect higher-fidelity telemetry at the edge and correlate it with traditional metrics to shorten RCAs.
Concrete telemetry to collect (field-by-field)
Below is a checklist you can implement in logging agents, CDN logpush rules, or edge workers. Each item has a brief reason and recommended retention tier.
1) Edge logs (HTTP/HTTPS requests on CDN)
- Timestamp (ISO8601, ms UTC) — baseline for correlation. Retention: hot (72h), warm (30d), cold (1y+).
- Request ID / Trace ID (W3C traceparent) — inject at ingress and propagate to origin and workers (see the propagation sketch after this list).
- Edge PoP / Edge ID / Region / Data center — map failures to PoP-level issues.
- Client IP (hashed or truncated) — for volumetrics and abuse patterns. Anonymize for compliance.
- Client ASN / Geo (country, city) — helps identify regional outages.
- TLS details — SNI, TLS version, cipher, ALPN, handshake errors.
- HTTP protocol and connection state — HTTP/1.1 vs HTTP/2 vs HTTP/3, QUIC details, connection durations.
- Request method, host, path, query — full request context; consider hashing PII paths.
- Response status and upstream status — 5xx/4xx, origin response codes, origin latency.
- Cache status and cache key — HIT/MISS/STALE, cache TTLs.
- Worker/edge function logs — exceptions, runtime durations, memory usage. If you need a short comparison of serverless runtimes for EU-sensitive micro-apps, see this free-tier face-off.
- WAF / rate-limit events — blocked requests and rule IDs.
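The trace-id and timestamp fields above are the backbone of correlation, so it pays to enforce them at the first hop. Below is a minimal sketch of injecting and propagating a W3C traceparent while emitting a structured edge log record, assuming a Workers-style runtime; the LOG_ENDPOINT binding and the cf.colo PoP field are assumptions that vary by platform.

```typescript
// Minimal sketch of trace-id propagation and structured edge logging in a
// Workers-style runtime. LOG_ENDPOINT is a hypothetical log sink binding;
// request.cf?.colo is the PoP identifier on Cloudflare-like platforms.
interface Env { LOG_ENDPOINT: string }

export default {
  async fetch(request: Request, env: Env, ctx: { waitUntil(p: Promise<unknown>): void }): Promise<Response> {
    // Reuse an inbound W3C traceparent if present; otherwise mint one.
    const hex = () => crypto.randomUUID().replace(/-/g, "");
    const traceparent =
      request.headers.get("traceparent") ?? `00-${hex()}-${hex().slice(0, 16)}-01`;

    const headers = new Headers(request.headers);
    headers.set("traceparent", traceparent); // propagate to origin and workers
    const started = Date.now();
    const response = await fetch(new Request(request, { headers }));

    const record = {
      ts: new Date().toISOString(),                     // ms UTC timestamp
      trace_id: traceparent.split("-")[1],
      edge_pop: (request as any).cf?.colo ?? "unknown", // PoP / data center, if exposed
      method: request.method,
      host: new URL(request.url).hostname,
      status: response.status,
      cache_status: response.headers.get("cf-cache-status") ?? "UNKNOWN",
      origin_latency_ms: Date.now() - started,          // edge-to-origin time on a MISS
    };

    // Ship the record asynchronously so logging never blocks the client response.
    ctx.waitUntil(fetch(env.LOG_ENDPOINT, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(record),
    }));
    return response;
  },
};
```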
2) DNS telemetry
- Authoritative server logs — query/response pairs, timestamps, client IPs (resolver IPs), query name (QNAME), QTYPE, RCODE, response size, TTL returned.
- Recursive resolver logs (where possible) — for resolvers you operate; otherwise watch the inbound queries reaching your authoritative servers for volume spikes and unusual query patterns.
- DNS-over-HTTPS / DoT logs — DoH/DoT termination times, TLS errors, client SNI for DoH where applicable (respect privacy).
- EDNS and ECS (EDNS Client Subnet) data — EDNS flags and ECS presence; ECS client subnets (even truncated) can change the answers served to a region.
- Truncation indicators and TCP fallback — when UDP replies set the TC bit, clients switch to TCP; record these events.
- Zone-change events and DNS control-plane events — who pushed a change and when (API audit logs, GitOps commits).
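Normalizing these DNS signals into one record shape makes the later triage queries much simpler. Here is a sketch of a suggested record layout and a basic truncation/TCP-fallback check; the field names and threshold are our own conventions, not a vendor schema.

```typescript
// Sketch of a normalized authoritative DNS log record and a simple
// truncation/TCP-fallback spike check. Field names are a suggested
// normalization, not a vendor schema; the threshold is illustrative.
interface DnsLogRecord {
  ts: string;            // ISO8601, ms UTC
  qname: string;
  qtype: string;         // A, AAAA, TXT, ...
  rcode: string;         // NOERROR, SERVFAIL, NXDOMAIN, ...
  resolver_ip: string;   // source of the query (recursive resolver)
  proto: "udp" | "tcp" | "doh" | "dot";
  tc: boolean;           // truncation bit set on the response
  response_size: number; // bytes; large UDP answers often precede TC/fragmentation
  ttl: number;
  ecs?: string;          // EDNS Client Subnet, if present (truncated for privacy)
}

// Flag a window where truncated UDP responses exceed a ratio threshold,
// which usually shows up alongside a rise in TCP port-53 connections.
function truncationSpike(records: DnsLogRecord[], threshold = 0.05): boolean {
  const udp = records.filter((r) => r.proto === "udp");
  if (udp.length === 0) return false;
  const truncated = udp.filter((r) => r.tc).length;
  return truncated / udp.length > threshold;
}
```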
3) Network & TCP/UDP metrics
- TCP metrics per PoP / backend — SYN, SYN-ACK, RST, retransmits, handshake durations, time to first byte (TTFB).
- UDP metrics — packet loss rates, ICMP errors, UDP packet sizes and fragmentation counts (important for DNS).
- TLS/TCP handshake histograms — latency percentiles (p50/p95/p99) for handshake and full request time.
- Connection concurrency and ephemeral port exhaustion counters — connection queue lengths and accept failures.
- Packet captures (pcap, ephemeral) — ring-buffer captures triggered by alerts; push them to object storage for later analysis (a capture-trigger sketch follows this list). If you need compact capture tooling or edge bundle examples, see this edge bundle review.
- eBPF traces — socket-level latencies, drop counts, flow-level summaries without full pcap (low overhead).
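The ring-buffer capture mentioned above can be wired to an alert with a few lines of glue. Below is a Node.js sketch, assuming tcpdump is installed and the collector has capture privileges; the upload step is left as a placeholder.

```typescript
// Node.js sketch: start a bounded ring-buffer packet capture when an alert
// fires (e.g. a 5xx or SERVFAIL spike). Assumes tcpdump is installed and the
// process has capture privileges; uploading the files is left as a placeholder.
import { spawn } from "node:child_process";

function startRingCapture(iface = "any", dir = "/var/tmp/pcap", seconds = 120) {
  // -C 100 rotates files at ~100 MB, -W 5 keeps a 5-file ring (~500 MB cap).
  const proc = spawn("tcpdump", [
    "-i", iface,
    "-s", "0",
    "-C", "100",
    "-W", "5",
    "-w", `${dir}/capture.pcap`,
  ]);

  proc.stderr.on("data", (d) => console.error(`tcpdump: ${d}`));

  // Stop after a fixed window; SIGINT lets tcpdump flush its buffers.
  setTimeout(() => proc.kill("SIGINT"), seconds * 1000);

  proc.on("exit", (code) => {
    console.log(`capture finished (exit ${code}); push ${dir}/capture.pcap* to object storage`);
    // TODO: upload rotated files and write (timestamp, object_key) metadata
    // into the searchable pcap index described later in this article.
  });
}

startRingCapture();
```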
4) Control-plane & routing telemetry
- CDN control-plane events — service deploys, config changes, purge events, edge rule updates, health-check states.
- BGP updates and RIB snapshots — route changes that could affect reachability to PoPs or your authoritative nameservers.
- Provider status messages and API errors — Cloudflare/AWS/GCP status updates and API error rates.
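To correlate these events with logs and metrics, it helps to normalize them into a single schema before indexing. A suggested shape follows; the field names are our own convention, not any provider's API, and the sample record is hypothetical.

```typescript
// Suggested shape for normalized control-plane events so deploys, zone
// changes, and provider incidents land on the same timeline as logs and
// metrics. Field names are our own convention, not a provider API.
interface ControlPlaneEvent {
  ts: string;                        // ISO8601, ms UTC
  source: "cdn" | "dns" | "bgp" | "provider_status" | "ci";
  event_type: string;                // e.g. "config_deploy", "zone_change", "route_withdraw"
  actor?: string;                    // API token, user, or pipeline that made the change
  object?: string;                   // zone, service, prefix, or rule affected
  detail?: Record<string, unknown>;  // raw payload kept for forensics
}

// Hypothetical example: a GitOps-driven zone change.
const example: ControlPlaneEvent = {
  ts: new Date().toISOString(),
  source: "dns",
  event_type: "zone_change",
  actor: "gitops-pipeline",
  object: "example.com",
  detail: { records_changed: 4 },
};
```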
How to collect without being overwhelmed
High-cardinality logs explode costs fast. Use a tiered approach:
- Hot window (0–72 hours): Full-resolution logs, high-cardinality labels, and short-term pcaps for triggered captures. Keep these easily searchable for fast postmortems.
- Warm window (3–90 days): Aggregate and roll up logs to reduce cardinality: store percentiles and counts, and keep indexes for common dimensions (service, region, status). A rollup sketch follows this list.
- Cold archive (months–years): Store compressed raw logs (Parquet/ORC) and pcap objects in object storage with lightweight indexes for on-demand rehydration. For tooling and marketplace options that help with archive and query, consult a recent tools & marketplaces roundup.
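The warm-window rollup mentioned above is mostly a grouping-and-percentiles job. Here is a minimal sketch that collapses raw request latencies into counts and p50/p95/p99 per (service, edge_pop, status class); dimension names follow the labels used later in this article, and the naive sort-based quantile is fine for batch jobs.

```typescript
// Sketch of a warm-window rollup: collapse raw request latencies into counts
// and percentiles per dimension tuple so high-cardinality logs become cheap
// 30-90 day metrics.
interface RawRequest { service: string; edge_pop: string; status: number; latency_ms: number }
interface Rollup { key: string; count: number; p50: number; p95: number; p99: number }

function quantile(sorted: number[], q: number): number {
  const idx = Math.min(sorted.length - 1, Math.floor(q * sorted.length));
  return sorted[idx];
}

function rollup(requests: RawRequest[]): Rollup[] {
  const groups = new Map<string, number[]>();
  for (const r of requests) {
    const key = `${r.service}|${r.edge_pop}|${Math.floor(r.status / 100)}xx`;
    let latencies = groups.get(key);
    if (!latencies) { latencies = []; groups.set(key, latencies); }
    latencies.push(r.latency_ms);
  }
  return [...groups.entries()].map(([key, latencies]) => {
    const sorted = [...latencies].sort((a, b) => a - b);
    return {
      key,
      count: sorted.length,
      p50: quantile(sorted, 0.5),
      p95: quantile(sorted, 0.95),
      p99: quantile(sorted, 0.99),
    };
  });
}
```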
Storage technology recommendations (2026)
- Hot logs: Grafana Loki (labels-based), Elasticsearch (with curated mappings), or ClickHouse for very high-throughput logs.
- Metrics: Prometheus remote_write + Thanos/Cortex for long-term, or M3DB/InfluxDB depending on scale.
- Trace data: OpenTelemetry with an OTLP collector to a tracing backend (Jaeger/Tempo/Datadog) supporting high-cardinality links.
- Raw captures and archived logs: S3/Blob storage (Parquet for logs, raw pcap objects) with a manifest index; query via Athena/BigQuery/Presto when needed.
- PCAP indexing: Store pcap objects with metadata in a searchable DB (ClickHouse/Elasticsearch) and reference S3 keys for rehydration.
How to tag and index logs for fast queries
Fast postmortems require predictable, consistent fields. Choose a naming standard and stick to it.
- service= (e.g. api, web, auth)
- env= (prod, staging)
- domain= (primary host header)
- edge_pop= (PoP ID or region)
- cache_status= (HIT/MISS/STALE)
- trace_id= and request_id=
- origin_health= (OK, slow, down)
These labels enable sub-second queries like: count of 5xx by edge_pop for the last 15 min, filtered by domain. Pre-create those dashboards and queries for incident playbooks.
Sample queries and playbook for immediate triage
During an outage, have a short, ordered set of queries that your SREs run automatically via a notebook or runbook automation.
Initial triage (first 5 minutes)
- Check provider status (Cloudflare/AWS dashboard & API) and BGP route collectors. If the provider reports a disruption, capture timestamps and link them to your alerts (a scripted status check follows these queries).
- Run an edge-level 5xx heatmap (by PoP):
Kibana: GET /logs/_search { "size": 0, "query": { "bool": { "filter": [ { "range": { "status": { "gte": 500 }}}, { "range": { "@timestamp": { "gte": "now-15m" }}} ]}}, "aggs": { "by_pop": { "terms": { "field": "edge_pop" }}}}
Or Loki: sum by (edge_pop) (count_over_time({job="edge"} | json | status=~"5.." [15m]))
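The provider-status check in the first step is easy to script. Below is a minimal sketch, assuming a Statuspage-style summary endpoint; the URL and response shape are assumptions, so substitute your providers' actual status APIs.

```typescript
// Minimal sketch: poll a Statuspage-style summary endpoint and record the
// timestamp if the provider reports a disruption. The URL and response shape
// are assumptions; substitute your providers' actual status APIs.
async function checkProviderStatus(url = "https://www.cloudflarestatus.com/api/v2/summary.json") {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`status endpoint returned ${res.status}`);
  const summary = (await res.json()) as {
    status?: { indicator?: string; description?: string };
    incidents?: Array<{ name: string; status: string; started_at?: string }>;
  };

  const indicator = summary.status?.indicator ?? "unknown";
  if (indicator !== "none") {
    // Capture the exact time we observed the disruption for the incident timeline.
    console.log(JSON.stringify({
      observed_at: new Date().toISOString(),
      indicator,
      incidents: summary.incidents ?? [],
    }));
  }
  return indicator;
}
```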
DNS triage (5–15 minutes)
- Authoritative server query rate spike: SELECT rcode, qname, resolver_ip, count(*) AS queries FROM dns_logs WHERE time > now() - INTERVAL 10 MINUTE GROUP BY rcode, qname, resolver_ip ORDER BY queries DESC LIMIT 50
- Run live dig traces from multiple vantage points: dig +trace yourdomain.com @8.8.8.8 and repeat the query from hosts near your PoPs (a scripted version follows this list). Check for SERVFAIL, NXDOMAIN, or long response latencies.
- Check for the TC (truncation) bit and TCP fallback: when truncation rises, clients retry over TCP, so look for increased TCP connections to port 53 and for recursion failures.
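The dig step above can be scripted so it runs from several vantage points and flags failures automatically. A Node.js sketch follows, assuming dig is installed; the domain and resolver list are placeholders.

```typescript
// Node.js sketch: run dig against several resolvers and flag SERVFAIL,
// NXDOMAIN, or slow answers. Assumes dig is installed; the domain and
// resolver list are placeholders.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function probe(domain: string, resolvers = ["8.8.8.8", "1.1.1.1", "9.9.9.9"]) {
  for (const resolver of resolvers) {
    const started = Date.now();
    try {
      const { stdout } = await run("dig", [`@${resolver}`, domain, "A", "+time=2", "+tries=1"]);
      const status = /status: (\w+)/.exec(stdout)?.[1] ?? "UNKNOWN";
      const ms = Date.now() - started;
      if (status !== "NOERROR" || ms > 500) {
        // Candidate entry for the incident timeline.
        console.log(`resolver=${resolver} status=${status} latency_ms=${ms}`);
      }
    } catch (err) {
      // Timeout or network error reaching the resolver.
      console.log(`resolver=${resolver} query failed: ${(err as Error).message}`);
    }
  }
}

probe("yourdomain.com");
```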
Network correlation (15–30 minutes)
- Check SYN/SYN-ACK ratio and RST counts by PoP to detect packet drops or firewall issues.
- Trigger an ephemeral pcap on the affected PoP with a 100MB ring buffer and push to object storage for forensic analysis. If you need compact tooling or edge-device examples, consult an edge bundle review for guidance.
Control-plane correlation (30–60 minutes)
- Review recent config deploys and DNS zone changes (GitOps commits) in the 1–2 hours before the incident window.
- Check CDN config invalidations, WAF rules updates, and any sudden increase in rate-limit hits.
Correlation techniques that save hours
- Sticky trace identifiers: ensure the same trace-id is visible in edge logs, origin logs, and worker logs.
- Time-sync rigor: enforce NTP/PTP and compare clock skew; millisecond accuracy matters when correlating TCP handshakes with log events.
- Cross-source joins: keep a minimal columnar index of (timestamp, trace_id, object_key) so you can rehydrate raw logs and pcaps by trace ID quickly (a rehydration sketch follows this list).
- Prebuilt notebooks: store parametric queries (time window, domain, PoP) that SREs can run with one click. For runbook automation patterns and IaC templates to ensure these checks are applied in CI, see IaC templates.
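The cross-source join above is simple once the columnar index exists. Here is a sketch that turns index rows into a rehydration plan for one trace; fetching the archived objects themselves (S3 client, Athena, ClickHouse) is left to your stack.

```typescript
// Sketch of the cross-source join: given rows from the minimal columnar
// index, collect the archived object keys for one trace so raw logs and
// pcaps can be rehydrated together.
interface IndexRow { timestamp: string; trace_id: string; object_key: string }

function rehydrationPlan(rows: IndexRow[], traceId: string): { window: [string, string]; keys: string[] } {
  const hits = rows
    .filter((r) => r.trace_id === traceId)
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  if (hits.length === 0) return { window: ["", ""], keys: [] };
  return {
    // Time window bounding the trace, useful for pulling adjacent pcaps too.
    window: [hits[0].timestamp, hits[hits.length - 1].timestamp],
    keys: [...new Set(hits.map((h) => h.object_key))],
  };
}
```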
Privacy, compliance and cost controls
Be deliberate about what you store. Mask PII (paths with emails, auth tokens). Apply hashing to client IPs where regulations require it. Use rate-based sampling on high-volume endpoints and conditional full-capture for error paths. In 2026, privacy-friendly telemetry (reducing exact client IP retention) is standard practice while still preserving RCA ability.
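Both masking approaches are a few lines of code. Below is a sketch of client-IP truncation and keyed hashing, assuming a Node.js collector; key management is simplified here and should come from your secrets store.

```typescript
// Sketch of the two masking approaches above: truncate the client address or
// apply a keyed hash so volumetrics still work without storing raw IPs.
// Key management is simplified here (read from an environment variable).
import { createHmac } from "node:crypto";

// Drop the last IPv4 octet (or the tail of an IPv6 address) before storage.
function truncateIp(ip: string): string {
  if (ip.includes(":")) return ip.split(":").slice(0, 4).join(":") + "::";
  return ip.split(".").slice(0, 3).join(".") + ".0";
}

// A keyed hash keeps per-client grouping possible without reversible IPs.
function hashIp(ip: string, key = process.env.IP_HASH_KEY ?? "rotate-me"): string {
  return createHmac("sha256", key).update(ip).digest("hex").slice(0, 16);
}
```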
Automation: put this into your CI/CD and runbooks
Observability is code. Add these as part of your platform templates and CI pipelines:
- Infrastructure-as-code definitions for logpush (Cloudflare Logpush or CloudFront real-time logs) and OTLP collectors.
- Automated tests that assert logs include required fields after deploys (a minimal assertion sketch follows this list). If you're building edge-first systems, the architecture patterns in resilient cloud-native architectures are a helpful reference.
- Runbook automation that executes the triage queries and assembles a single incident timeline document.
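The field-assertion test in the list above can be a small script in the deploy pipeline. A minimal sketch follows; fetchRecentLogs is a placeholder for however your pipeline samples fresh edge logs (Logpush destination, Loki API, ClickHouse, and so on).

```typescript
// Minimal post-deploy check: sample fresh edge logs and fail the pipeline if
// required fields are missing. fetchRecentLogs is a placeholder for however
// your pipeline samples recent logs.
import assert from "node:assert/strict";

const REQUIRED_FIELDS = ["ts", "trace_id", "edge_pop", "cache_status", "status", "origin_latency_ms"];

async function assertLogFields(fetchRecentLogs: () => Promise<Array<Record<string, unknown>>>) {
  const sample = await fetchRecentLogs();
  assert.ok(sample.length > 0, "no recent edge logs found after deploy");
  for (const record of sample) {
    for (const field of REQUIRED_FIELDS) {
      assert.ok(field in record && record[field] !== null, `edge log missing required field: ${field}`);
    }
  }
  console.log(`checked ${sample.length} records; all required fields present`);
}
```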
Case study: how richer telemetry short-circuited RCA on the Jan 2026 X/Cloudflare incident
During the Jan 16, 2026 incident, many teams initially saw a global spike in 5xx responses and long-TTL DNS failures. Teams that had implemented the checklist above were able to:
- Quickly separate CDN edge-side cache misses from authoritative DNS SERVFAILs by running simultaneous edge-log and DNS-query heatmaps.
- Correlate a spike of UDP truncation events with larger EDNS responses that exceeded the path MTU in a subset of PoPs, revealing an MTU-related fragmentation issue exacerbated by a config push. Ephemeral pcaps captured the truncated DNS UDP payloads and proved the path MTU problem.
- Use control-plane event logs to show a certificate reload coincident with TLS handshake failures in a specific region.
Without packet captures, EDNS fields, and the CDN control-plane events indexed, these teams would have spent days chasing symptoms.
Practical checklist to implement this week
- Inventory where edge logs, DNS server logs, and BGP feeds currently land.
- Standardize log fields and trace-id propagation across edge workers and origins.
- Set up a 72-hour hot storage window with searchable logs and a 30-day metrics rollup.
- Implement conditional pcap capture with a ring buffer and a trigger (high 5xx rate, WAF anomaly, or DNS SERVFAIL spike).
- Create prebuilt incident queries and automate them into your runbook tool (PagerDuty/Playbooks/GitHub Actions).
Future-proofing: trends to watch in 2026–2028
- Edge-first observability—expect CDNs to expose richer, real-time telemetry APIs and push mechanisms; integrate them directly into your hot window. See edge-first patterns applied to other domains in edge-first trading workflows.
- eBPF at scale—eBPF-based collectors will replace many agents for low-overhead socket and flow telemetry.
- Privacy-focused DNS—DoH/DoT adoption will keep growing; make sure your telemetry strategy accounts for fewer client-level DNS signals and relies more on authoritative and resolver telemetry.
- AI-assisted RCA—by late 2026 automated anomaly correlation will be standard in platforms; structure your metadata so models can reason across sources. For considerations about running models and compliant infra, see running large models on compliant infrastructure and the role of autonomous agents in toolchains.
Collecting the right telemetry isn't about capturing everything—it's about capturing the right signals, richly indexed and queryable when you need them.
Final takeaways
- Prioritize: edge logs, DNS traces, TCP/UDP metrics, and control-plane events.
- Plan storage: hot (72h), warm (30–90d), cold (archive) with Parquet for rehydration.
- Instrument: ensure trace IDs and ms UTC timestamps are present across all sources.
- Automate: prebuilt queries and triggered captures turn chaotic investigations into repeatable RCAs.
Call to action
Start by adding five fields to your edge and DNS logs this week: timestamp, trace_id, edge_pop, cache_status, and origin_latency_ms. Automate a 72-hour hot window and create one incident notebook with the triage queries above. If you want a ready-made checklist and query set for Grafana Loki, ClickHouse, or BigQuery, download our observability runbook (link in your platform) and adapt it to your CDN/DNS stack—your next postmortem will thank you.
Related Reading
- Free-tier face-off: Cloudflare Workers vs AWS Lambda for EU-sensitive micro-apps
- Beyond Serverless: Designing Resilient Cloud‑Native Architectures for 2026
- IaC templates for automated software verification: Terraform/CloudFormation patterns
- Field Review: Affordable Edge Bundles for Indie Devs (2026)
- How to Score Media Partners: Lessons from Disney’s Big-Event Ad Sales
- Optimizing Research Essays for AI-Powered Answers: Formatting, Headings and Source Signals
- When Bystanders Become Protectors: Liability and Insurance Considerations for On-Site Interventions
- Snack Engineering 2026: Micro‑Nutrient Snacks That Boost Focus for Hybrid Workers
- Why Friendlier Social Platforms (Like the New Digg Beta) Matter for Community-First Creators