Data teams often face a false choice: either keep rich Slowly Changing Dimensions (SCDs) and accept a sprawl of duplicate tables, or keep the warehouse lean and give up on audit‑ready history. You don’t have to choose. With a zero unmanaged copy approach, you can keep full history and maintain predictable performance—without littering the lakehouse with ad‑hoc exports and orphaned datasets.
This post explains the idea in plain English, shows how mirrored, CDC, historical (SCD2), and snapshot tables fit together, and lays out four deployment options. We’ll also cover the performance guardrails—partitioning, clustering, idempotent merges, micro‑batches, workload isolation, and observability—so your SLAs stay green as data scales.
What “Zero Unmanaged Copy” Really Means
“Zero unmanaged copy” doesn’t mean “never copy data.” It means: if a physical derivative exists, it is declared (owner, purpose, lineage), reproducible (how to rebuild), and governed (SLA/SLO, retention, refresh). Anything else—CSV dumps, sidecar exports, untracked sandboxes—is unmanaged and should not exist.
This posture reduces storage, tightens governance, and—counterintuitively—improves performance because compute funnels through a few well‑optimized paths.
The Four Table Types (and why each exists)
1) Mirrored Tables — “safe visibility of the source”
A mirror is a near‑real‑time replica of an operational system, ingested with minimal transformation. Its job is to decouple analytics from production and provide low‑latency access to source data. It’s not your system of record for change; it’s your landing zone with strong SLAs and schema control.
2) CDC Tables — “every change, in order”
A Change Data Capture (CDC) table is an append‑only event log that records inserts, updates, and deletes with timestamps (and often a sequence to break ties). This is the single source of truth for downstream state transitions. Treat it like a ledger: immutable, ordered, and easy to replay.
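To make that concrete, here is a minimal sketch of what such a ledger can look like, assuming a Delta Lake table managed from Spark; the table name (cdc_customer), the columns, and the partition choice are illustrative, not a prescription:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with Delta Lake enabled; all names are illustrative.
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS cdc_customer (
        bk          STRING,     -- business key of the changed row
        op          STRING,     -- 'I' insert, 'U' update, 'D' delete
        commit_ts   TIMESTAMP,  -- when the source committed the change
        seq         BIGINT,     -- tie-breaker for same-timestamp changes
        city        STRING,     -- changed attribute(s); widen as needed
        commit_date DATE        -- populated by the writer, used for pruning
    )
    USING DELTA
    PARTITIONED BY (commit_date)
""")
```

Append-only writes plus the (commit_ts, seq) ordering are what make replays and idempotent downstream merges possible.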
3) Historical (SCD2) Tables — “what was true, when”
An SCD2 table converts events into states with effective_from, effective_to, and is_current. That lets you do point‑in‑time joins (“what did we believe on March 5th?”) and full audit trails without re‑parsing raw events on every query.
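A point-in-time join is then just a range lookup against the validity window. Below is a sketch in Spark SQL, assuming a hypothetical fct_orders fact and the dim_customer_scd2 dimension, with a half-open [effective_from, effective_to) convention where open rows carry a NULL effective_to:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "What did we believe about each customer at the moment the order happened?"
as_of_join = spark.sql("""
    SELECT f.order_id, f.event_ts, d.city
    FROM fct_orders f
    JOIN dim_customer_scd2 d
      ON d.bk = f.customer_bk
     AND f.event_ts >= d.effective_from
     AND (d.effective_to IS NULL OR f.event_ts < d.effective_to)
""")
```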
4) Snapshot Tables — “freeze time on purpose”
A snapshot is an immutable as‑of extract of a logical state (often “current rows only”). Snapshots serve finance, compliance, and reproducibility. They should be created by policy (daily, monthly) and tracked in a manifest—not as random exports.
Short version: Mirrors protect the source, CDC captures events, SCD2 models state, Snapshots freeze time.
How They Work Together (one path that makes sense)
- Ingest from the operational DB into a mirror to keep load off production and normalize schema (the full flow is sketched just after this list).
- Generate or consume CDC so you have an ordered, append‑only log of changes.
- Apply an idempotent MERGE from CDC into SCD2 dimensions and relevant facts, opening/closing validity windows.
- Cut snapshots on a schedule for stable reporting and reproducibility (often “current only,” sometimes full state).
- Enforce contracts (SLA/SLO, retention, rebuild steps) for each table so nothing becomes unmanaged.
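Put together, one run of the pipeline is four bounded steps. The skeleton below is only a sketch of that ordering; every function name is a placeholder for the jobs described above, not a real API:

```python
# Illustrative skeleton of one pipeline run; each function is a placeholder
# for a job described in the list above.

def refresh_mirror() -> None:
    """Land the latest source rows in the mirror (keeps analytics off prod)."""

def read_cdc_after(watermark):
    """Return change events strictly after (commit_ts, seq) = watermark, in order."""
    return []

def merge_into_scd2(events) -> None:
    """Idempotent MERGE: close the old validity window, open the new one."""

def cut_snapshot_if_due(run_date) -> None:
    """Materialize an immutable, manifest-tracked snapshot on schedule."""

def run_once(watermark, run_date):
    refresh_mirror()
    events = read_cdc_after(watermark)
    merge_into_scd2(events)
    cut_snapshot_if_due(run_date)
    # Advance the watermark only once the merge has committed, so a failed
    # run simply replays the same bounded batch.
    return max(((e["commit_ts"], e["seq"]) for e in events), default=watermark)
```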
Four Deployment Options (pick the simplest that fits your SLA)
Option A — Mirror‑First, CDC‑Driven SCD (sensible default)
- Use when: You need low latency (minutes), strict source isolation, and clear modeling boundaries.
- How it works: Mirror → derive/ingest CDC → MERGE into SCD2 → schedule snapshots.
- Why it’s good: Explicit separation of concerns; mirrors help exploratory diagnostics; CDC is authoritative for change.
Option B — Native CDC to SCD, Mirror Optional (event‑native shops)
- Use when: You already have reliable CDC (binlog/Kafka/outbox) and don’t need a mirror for analysis.
- How it works: CDC → SCD2 → snapshots; keep a lightweight mirror only for sanity checks if desired.
- Why it’s good: Fewer moving parts; CDC remains the ledger.
Option C — Snapshot‑First with Derived CDC (batchy, legacy sources)
- Use when: No CDC is available and daily/hourly cadence is fine.
- How it works: Build snapshots on a cadence → compute diffs between snapshots to synthesize CDC (the diff is sketched below) → SCD2.
- Why it’s good: Works anywhere; reproducible; you can upgrade to real CDC later.
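Synthesizing CDC from two consecutive snapshots is a full outer join on the business key: keys only in the new snapshot are inserts, keys only in the old one are deletes, keys in both with differing attributes are updates. A PySpark sketch, with hypothetical snapshot table names and a single tracked attribute (city):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

prev = spark.table("snap_customer_prev").alias("p")   # yesterday's snapshot
curr = spark.table("snap_customer_curr").alias("c")   # today's snapshot

joined = prev.join(curr, F.col("p.bk") == F.col("c.bk"), "full_outer")

# Keep only keys that appeared, disappeared, or changed.
# Note: != is null-unsafe; use a null-safe comparison if attributes can be NULL.
changed = joined.where(
    F.col("p.bk").isNull()
    | F.col("c.bk").isNull()
    | (F.col("p.city") != F.col("c.city"))
)

synthetic_cdc = changed.select(
    F.coalesce(F.col("c.bk"), F.col("p.bk")).alias("bk"),
    F.when(F.col("p.bk").isNull(), F.lit("I"))
     .when(F.col("c.bk").isNull(), F.lit("D"))
     .otherwise(F.lit("U")).alias("op"),
    F.col("c.city").alias("city"),
    F.current_timestamp().alias("commit_ts"),  # best available ordering: the diff time
)
```

The synthesized commit_ts is the diff time, so ordering is only as fine-grained as your snapshot cadence; that is the trade-off you accept until real CDC arrives.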
Option D — Temporal/Time‑Travel First (modern warehouses)
- Use when: Your platform supports temporal tables/time‑travel efficiently.
- How it works: Query history via the platform (time-travel sketch below) → materialize SCD2 only where dimensional models demand it → snapshots via metadata/clone features.
- Why it’s good: Minimal physical copies; fast backfills; strong governance.
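On such platforms, an "as of" question needs no SCD2 materialization at all. A sketch using Delta Lake's time-travel syntax (other engines expose equivalent AS OF clauses; the table name, date, and version number are illustrative):

```python
from pyspark.sql import SparkSession

# Assumes Delta Lake time travel on a mirrored table named `customer`.
spark = SparkSession.builder.getOrCreate()

believed_then = spark.sql(
    "SELECT * FROM customer TIMESTAMP AS OF '2025-03-05 00:00:00'"
)
# ...or pin an exact table version for reproducibility:
pinned = spark.sql("SELECT * FROM customer VERSION AS OF 412")
```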
Performance Guarantees by Design (not by wishful thinking)
Freshness: bounded work you can finish
Process micro‑batches (by time or row count) and track a watermark—typically (commit_ts, seq)—so every run picks up exactly where the last one left off. This keeps latency predictable and retries safe.
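Here is a sketch of that loop in PySpark, assuming a small bookkeeping table (ops.watermarks, one row per pipeline, seeded when the pipeline is created) and the illustrative cdc_customer table from earlier; the UPDATE at the end assumes an ACID table format such as Delta:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Last processed (commit_ts, seq); assumes the bookkeeping row already exists.
wm = (spark.table("ops.watermarks")
           .where(F.col("pipeline") == "customer_master")
           .first())

batch = (spark.table("cdc_customer")
              # Strictly after the watermark, so reruns never reprocess old events.
              .where((F.col("commit_ts") > wm["commit_ts"])
                     | ((F.col("commit_ts") == wm["commit_ts"]) & (F.col("seq") > wm["seq"])))
              .orderBy("commit_ts", "seq")
              .limit(5_000_000))  # bounded work per run

# ... apply the idempotent MERGE into SCD2 here (sketched later in this post) ...

# Advance the watermark only after the merge commits.
new_wm = batch.orderBy(F.desc("commit_ts"), F.desc("seq")).first()
if new_wm is not None:
    spark.sql(f"""
        UPDATE ops.watermarks
        SET commit_ts = TIMESTAMP '{new_wm['commit_ts']}', seq = {new_wm['seq']}
        WHERE pipeline = 'customer_master'
    """)
```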
Speed: prune I/O before you touch it
Lay out big CDC/SCD tables for pruning (a layout sketch follows this list):
- Partition by ingest or commit date to avoid scanning cold data.
- Cluster/sort by business key (and effective_from for SCD2) so point lookups and interval joins skip most files.
- Compact periodically to prevent small‑file drag and keep scans efficient.
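Under the same Delta-on-Spark assumption, that layout boils down to a partition column on the CDC table (shown earlier), clustering on the SCD2 table, and a periodic maintenance job; OPTIMIZE/ZORDER availability and syntax vary by platform and Delta version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# SCD2 dimension: modest row count per key, so clustering matters more than
# partitioning. Names and columns are illustrative.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer_scd2 (
        bk             STRING,
        city           STRING,
        effective_from TIMESTAMP,
        effective_to   TIMESTAMP,
        is_current     BOOLEAN
    )
    USING DELTA
""")

# Periodic maintenance: compact small files and co-locate rows read together.
spark.sql("OPTIMIZE cdc_customer")                                    # compaction only
spark.sql("OPTIMIZE dim_customer_scd2 ZORDER BY (bk, effective_from)")
```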
Reliability: isolate noisy neighbors and measure everything
- Workload isolation: Put ingest/mirror jobs, CDC→SCD merges, and BI/ad‑hoc queries on separate compute pools.
- Data quality gates: no duplicate (bk, effective_from) pairs, no overlapping validity windows per key, and no future‑dated commit_ts beyond a small tolerance (checks sketched below).
- Observability: publish p95 merge durations, rows processed, partitions pruned, and compaction debt. Alert on breaches, don’t just log them.
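Each of those quality gates is a single query. A sketch against the illustrative tables used throughout this post; in practice these run right after every merge and fail the run (and alert) on any non-zero count:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1) No duplicate (bk, effective_from) pairs.
dup_count = spark.sql("""
    SELECT bk, effective_from FROM dim_customer_scd2
    GROUP BY bk, effective_from HAVING COUNT(*) > 1
""").count()

# 2) No overlapping validity windows for the same key.
overlap_count = spark.sql("""
    SELECT a.bk
    FROM dim_customer_scd2 a
    JOIN dim_customer_scd2 b
      ON a.bk = b.bk
     AND a.effective_from < b.effective_from
     AND (a.effective_to IS NULL OR a.effective_to > b.effective_from)
""").count()

# 3) No commit_ts more than a small tolerance in the future.
future_count = spark.sql("""
    SELECT * FROM cdc_customer
    WHERE commit_ts > current_timestamp() + INTERVAL 5 MINUTES
""").count()

assert dup_count == 0 and overlap_count == 0 and future_count == 0, "DQ gate failed"
```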
How SCD2 Feels in Practice (text, not theory)
When a customer’s city changes from LA to SF on July 10, 2025, your SCD2 table closes the LA row by setting effective_to = 2025‑07‑10 and opens a new row for SF with effective_from = 2025‑07‑10. Queries that join a fact’s event_ts into [effective_from, effective_to) will always land on the correct historical state—even years later.
This “close then open” pattern is triggered by events in CDC but expressed as state in SCD2. That’s why we keep CDC and SCD distinct: events make history replayable, state makes history queryable.
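Expressed against the illustrative tables above, "close then open" is one MERGE that closes the current row plus one guarded INSERT that opens the new one. The sketch assumes Delta Lake MERGE, at most one change per business key per micro-batch, and that `changes` is the watermarked CDC batch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In practice `changes` is the watermarked micro-batch; the full CDC table is
# used here only to keep the sketch self-contained.
spark.table("cdc_customer").createOrReplaceTempView("changes")

# 1) Close: end the currently-open row for keys that changed or were deleted.
#    Assumes at most one change per key per batch (dedupe upstream otherwise).
spark.sql("""
    MERGE INTO dim_customer_scd2 AS d
    USING changes AS c
      ON d.bk = c.bk AND d.is_current = true
    WHEN MATCHED AND (c.op = 'D' OR d.city <> c.city) THEN
      UPDATE SET d.effective_to = c.commit_ts, d.is_current = false
""")

# 2) Open: insert the new version, skipping versions already present so reruns
#    of the same batch are no-ops.
spark.sql("""
    INSERT INTO dim_customer_scd2
    SELECT c.bk, c.city,
           c.commit_ts             AS effective_from,
           CAST(NULL AS TIMESTAMP) AS effective_to,
           true                    AS is_current
    FROM changes c
    LEFT ANTI JOIN dim_customer_scd2 d
      ON d.bk = c.bk AND d.effective_from = c.commit_ts
    WHERE c.op IN ('I', 'U')
""")
```

The anti-join guard is what makes the pair idempotent: replaying the same batch closes nothing new and inserts nothing twice.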
Snapshots Without the Copy Explosion
Finance asks for “numbers as of month‑end.” Don’t export CSVs. Instead, materialize a snapshot that captures the state at a point in time—ideally via clone/time‑travel if your platform supports it, or as an immutable table with a manifest entry (date, source version, row count, checksum). You get perfect reproducibility and auditable lineage—still within the managed, zero‑unmanaged‑copy posture.
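Sticking with the Delta-on-Spark sketch: a month-end snapshot can be a CTAS of the current rows (or a CLONE / time-travel read where the platform offers one), followed by a row in a manifest table. The ops.snapshot_manifest table, its columns, and the naming scheme are illustrative; a checksum column would live in the same place:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

as_of = "2025-07-31"
snap_table = f"snap_dim_customer_{as_of.replace('-', '_')}"

# Materialize an immutable snapshot of "current rows only".
spark.sql(f"""
    CREATE TABLE {snap_table}
    USING DELTA AS
    SELECT * FROM dim_customer_scd2 WHERE is_current = true
""")

# Record it in the manifest so the copy is declared, reproducible, and governed.
spark.sql(f"""
    INSERT INTO ops.snapshot_manifest
    SELECT 'dim_customer_scd2'                  AS source,
           DATE '{as_of}'                       AS as_of,
           current_timestamp()                  AS created_at,
           (SELECT COUNT(*) FROM {snap_table})  AS row_count
""")
```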
Common Pitfalls (and the fix, in words)
- Full‑table MERGE every run: Filter by watermark and cluster keys so the engine prunes 95% of files before reading.
- Small‑file explosion: Batch writes and run compaction on a cadence; set a minimum target file size.
- Skewed “hot” keys: Use bucketing/salting or enable adaptive execution to prevent single‑partition hotspots.
- Unbounded late arrivals: Declare a lateness horizon (e.g., 7 days). Process older stragglers in a separate backfill lane with its own SLA (see the sketch after this list).
- Analyst “temporary” copies: Offer clones with TTL and tags; forbid unmanaged exports by policy.
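The lateness-horizon fix is a single split on commit_ts, applied to the newly arrived events rather than the whole table. A sketch with a 7-day horizon, assuming the CDC table also records an ingestion timestamp (_ingested_at here, an illustrative column):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Newly arrived events since the last run (arrival time, not source commit time).
new_events = spark.table("cdc_customer").where(
    F.col("_ingested_at") > F.expr("TIMESTAMP '2025-07-10 06:00:00'")  # last run, illustrative
)

horizon = F.current_timestamp() - F.expr("INTERVAL 7 DAYS")
on_time    = new_events.where(F.col("commit_ts") >= horizon)  # normal merge lane
stragglers = new_events.where(F.col("commit_ts") < horizon)   # backfill lane, own SLA

# Park stragglers for the backfill job instead of letting them blow the main
# merge's latency budget.
stragglers.write.mode("append").saveAsTable("cdc_customer_backfill")
```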
A Lightweight Contract You Can Adopt Tomorrow
Define a single, human‑readable contract per pipeline. Example:
pipeline: customer_master
truth: cdc_customer
derived: dim_customer_scd2
slas:
  ingest_p95: 5m
  merge_p95: 10m
layout:
  cdc_partition: by_day(commit_ts)
  scd_cluster: [bk, effective_from]
ops:
  batch_rows_max: 5_000_000
  parallel_merges: 4
dq:
  unique_interval: "bk, effective_from"
  no_overlap: true
  no_future_commit: true
observability:
  metrics: [rows_in, rows_changed, merge_ms_p95, partitions_pruned, compaction_mb]
  lineage: enabled
retention:
  cdc: 180d hot, 3y warm
  snapshots: 7y immutable
This is the difference between “a pile of tables” and a managed system with guarantees.
Conclusion
You can maintain rich SCD history and rock‑solid performance without drowning in copies. The pattern is simple:
- Mirror to shield production and normalize ingress.
- CDC to capture every change, once, in order.
- SCD2 to express what was true, when.
- Snapshots to freeze time on purpose.
Wrap those pieces in contracts (layout, micro‑batches, idempotent merges, isolation, and observability) and you’ll deliver fast, auditable analytics with a minimal, managed physical footprint. That’s zero unmanaged copy in action—clean, reproducible, and ready for scale.