Slowly Changing Dimensions (SCDs): A Practical Guide for Your Star Schema

Star schemas shine when your facts (events) are analyzed through dimensions (who/what/where/when). But in real life, dimension attributes change—customers move, products rebrand, sales territories realign. Slowly changing dimensions (SCDs) are the modeling patterns that preserve analytic correctness as those attributes evolve.

What is a slowly changing dimension?

A slowly changing dimension is a special kind of dimension, used in star schema (and related) modeling, that lets you track how a dimension’s attributes change over time. A classic example: if you are building a data platform for the DMV, you might want to know what color a car was at a specific point in time. How do you do that without collapsing everything into “one big table”? You use a slowly changing dimension. Slowly changing dimensions come in several types.

  • Type 0 — Retain original: Value is immutable after first set; later differences are treated as data‑quality (DQ) issues, not history.
  • Type 1 — Overwrite: Keep only the latest value; no history.
  • Type 2 — Row versioning: New row per change with effective dates and a “current” flag; full history.
  • Type 3 — Limited history: Keep a small number of previous values in extra columns.
  • Type 4 — History table: Current values in the main dimension; full history in a separate table.
  • Type 6 — Hybrid (1+2+3): Historical rows and a few “current/previous” columns for convenience.

If your analytics ask “what was true at the time?” (e.g., revenue by the customer’s region as of the sale), you almost certainly want Type 2 for that attribute.


Why SCDs matter

Suppose you ask:

“How did last year’s revenue break down by customer region?”

If you overwrite a customer’s region when they move (Type 1), last year’s sales will suddenly appear under the new region. With Type 2, the older facts keep pointing to the “old” region row and newer facts to the “new” one—your trend lines remain trustworthy.
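
As a minimal sketch, assuming a hypothetical fact_sales table that stores the customer_sk captured when each sale was loaded (dim_customer is the Type 2 dimension sketched later in this post), the “as of the sale” breakdown is just a surrogate‑key join:

SELECT d.region, SUM(f.revenue) AS revenue
FROM fact_sales f                    -- hypothetical fact table
JOIN dim_customer d                  -- Type 2 dimension; one row per version
  ON d.customer_sk = f.customer_sk   -- surrogate key frozen at load time
WHERE f.order_date >= DATE '2024-01-01'
  AND f.order_date <  DATE '2025-01-01'
GROUP BY d.region;

Because each fact row keeps the surrogate key of the version that was current at sale time, a later move never rewrites last year’s numbers.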


Core building blocks

  • Natural key (customer_id_nk): Business identifier from source (e.g., CRM CustomerID).
  • Surrogate key (customer_sk): Warehouse‑generated key; facts join to the specific dimension version via this key.
  • Change metadata (for Type 2/6):
    • valid_from (date/timestamp)
    • valid_to (date/timestamp; often 9999-12-31 for open‑ended)
    • is_current (boolean)
    • hash_diff (optional but recommended—hash of tracked attributes to detect changes)

Consistent naming is your friend. Decide conventions early and document them.


The SCD types—what to use when

Type 0 — Retain Original

Definition. Set once, never updated in the dimensional model. Later “changes” are treated as errors/corrections, not business history.

Good candidates (truly immutable):

  • Date of birth / incorporation date
  • First‑seen timestamp (when the entity entered your systems)
  • Original acquisition channel (if defined strictly as “first touch”)
  • System‑assigned immutable identifiers

Anti‑examples (often misclassified as Type 0):

  • Legal name (can change) → often Type 2 (or Type 1 if only current matters)
  • Address/region → Type 2
  • Email/phone → usually Type 1; sometimes Type 2 if used for historical slicing

Modeling patterns:

  • Exclude Type 0 columns from hash_diff so they never trigger new versions.
  • Optionally keep a pair: *_original (Type 0) and *_curated (Type 1) to allow safe corrections without losing lineage.
  • Enforce with DQ tests (e.g., flag if a staged value for a Type 0 column differs from the current dimension value); a sketch follows this list.
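
A minimal DQ sketch, assuming a hypothetical Type 0 column date_of_birth exists on both stg_customer and dim_customer:

SELECT s.customer_id_nk,
       s.date_of_birth AS staged_value,
       d.date_of_birth AS stored_value
FROM stg_customer s
JOIN dim_customer d
  ON d.customer_id_nk = s.customer_id_nk
 AND d.is_current = TRUE
WHERE s.date_of_birth <> d.date_of_birth;  -- any rows returned are DQ issues, not history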

Type 1 — Overwrite

Update the existing row; no history.
Use when: Latest value is all that matters (e.g., standardized casing, email fix), or the attribute isn’t analytically relevant over time.
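
A sketch of the overwrite, assuming a hypothetical Type 1 email column (UPDATE ... FROM syntax varies by engine; this is the Postgres/Redshift flavor):

UPDATE dim_customer d
SET    email = s.email
FROM   stg_customer s
WHERE  d.customer_id_nk = s.customer_id_nk
  AND  d.is_current = TRUE
  AND  d.email IS DISTINCT FROM s.email;  -- touch only rows that actually changed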

Type 2 — Row Versioning with Dates

Insert a new row on change; close the old row (set valid_to and flip is_current to false). Facts keep the surrogate key captured at load time, preserving “as of” accuracy.
Use when: You ask temporal questions or require auditability.

Minimal schema sketch:

CREATE TABLE dim_customer (
  customer_sk    BIGINT GENERATED ALWAYS AS IDENTITY,
  customer_id_nk STRING NOT NULL,
  customer_name  STRING,
  region         STRING,
  -- SCD metadata
  valid_from     TIMESTAMP NOT NULL,
  valid_to       TIMESTAMP NOT NULL,
  is_current     BOOLEAN   NOT NULL,
  hash_diff      STRING, -- hash over tracked attrs (exclude Type 0)
  PRIMARY KEY (customer_sk)
);
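
To make the metadata concrete, two versions of one customer might look like this after a region change (values invented for illustration):

-- customer_sk | customer_id_nk | region | valid_from          | valid_to            | is_current
-- 101         | C-042          | East   | 2022-03-01 00:00:00 | 2024-06-15 08:30:00 | false
-- 102         | C-042          | West   | 2024-06-15 08:30:00 | 9999-12-31 00:00:00 | true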

Type 3 — Limited History by Columns

Store a small set of previous values (e.g., region_previous).
Use when: Analysts only need “current vs previous,” not full history.
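
A sketch, assuming a region_previous column and, again, engine‑varying UPDATE ... FROM syntax:

ALTER TABLE dim_customer ADD COLUMN region_previous STRING;

UPDATE dim_customer d
SET    region_previous = d.region,  -- shift the current value into "previous"
       region          = s.region   -- overwrite with the new value
FROM   stg_customer s
WHERE  d.customer_id_nk = s.customer_id_nk
  AND  d.region <> s.region;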

Type 4 — History Table

Keep a slim “current” dimension for performance; move full history into a separate table (e.g., dim_customer_history).
Use when: You want fast daily joins but occasional deep historical analysis.
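
A structural sketch, reusing the columns from the Type 2 example; the main dim_customer keeps only current values:

CREATE TABLE dim_customer_history (
  customer_sk    BIGINT,
  customer_id_nk STRING NOT NULL,
  customer_name  STRING,
  region         STRING,
  valid_from     TIMESTAMP NOT NULL,
  valid_to       TIMESTAMP NOT NULL
);

-- On each change: copy the outgoing row into dim_customer_history,
-- then overwrite the main dimension row in place (Type 1 style).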

Type 6 — Hybrid (1+2+3)

A Type 2 base with some Type 3 “previous_*” columns and Type 1 semantics on non‑critical attributes.
Use when: You need full history and simple “current vs previous” reporting without extra self‑joins.
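
One way to sketch the hybrid: keep the Type 2 rows, then maintain an assumed region_current column with Type 1 semantics across every version of the customer:

ALTER TABLE dim_customer ADD COLUMN region_current STRING;

-- After inserting the new Type 2 version, refresh region_current on ALL versions
UPDATE dim_customer d
SET    region_current = s.region
FROM   stg_customer s
WHERE  d.customer_id_nk = s.customer_id_nk;  -- no is_current filter: history rows get it too

Historical rows then answer “as of” questions via region, while region_current supports “restate everything to today” reporting without a self‑join.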

Note: You’ll see references to other numbered types (5, 7, etc.). In practice, 0/1/2/3/4/6 cover the vast majority of needs and are least ambiguous across teams.


Choosing the right type (quick decision guide)

  • Does this attribute affect historical analysis?
    • Yes → Type 2 (or Type 6).
    • No → Type 1.
  • Only need “current vs previous”?
    • Yes → Type 3 (or Type 6).
  • Keep main dimension lean; archive deep history elsewhere?
    • Yes → Type 4.
  • Is it truly immutable?
    • Yes → Type 0.
  • High‑churn attribute causing row explosion?
    • Consider a mini‑dimension (separate small dimension for volatile attributes) or keep that attribute Type 1.

Implementing SCD Type 2 (warehouse‑agnostic pattern)

Assumptions:

  • Staging view/table stg_customer with today’s snapshot/changes.
  • Tracked (Type 2) attributes include customer_name and region.
  • Type 0 attributes (e.g., date_of_birth) are excluded from the change hash.

1) Compute a hash on tracked attributes in staging (in practice, materialize this as a view or table so the later steps can reference it)

WITH stg AS (
  SELECT
    customer_id_nk,
    customer_name,
    region,
    TO_HEX(MD5(CONCAT_WS('|',
      COALESCE(customer_name, ''),
      COALESCE(region, '')
    ))) AS hash_diff
  FROM stg_customer
)

2) Upsert logic with MERGE (conceptual)

MERGE INTO dim_customer d
USING (SELECT stg.*, CURRENT_TIMESTAMP AS load_ts FROM stg) s  -- stamp load_ts once per batch; step 3 must reuse the same value
  ON d.customer_id_nk = s.customer_id_nk AND d.is_current = TRUE

WHEN MATCHED AND d.hash_diff <> s.hash_diff THEN
  UPDATE SET d.valid_to = s.load_ts, d.is_current = FALSE

WHEN NOT MATCHED THEN
  INSERT (customer_id_nk, customer_name, region, valid_from, valid_to, is_current, hash_diff)
  VALUES (s.customer_id_nk, s.customer_name, s.region,
          s.load_ts, TIMESTAMP '9999-12-31 00:00:00', TRUE, s.hash_diff);

3) Insert the new version for changed rows
(Some engines support a second MERGE clause; others require a follow‑up insert. Either way, reuse the same batch load_ts in both steps so the join below matches exactly the rows just expired.)

INSERT INTO dim_customer (customer_id_nk, customer_name, region,
                          valid_from, valid_to, is_current, hash_diff)
SELECT s.customer_id_nk, s.customer_name, s.region,
       s.load_ts, TIMESTAMP '9999-12-31 00:00:00', TRUE, s.hash_diff
FROM stg s  -- stg must carry the same load_ts stamped in step 2
JOIN dim_customer d
  ON d.customer_id_nk = s.customer_id_nk
WHERE d.valid_to = s.load_ts  -- only the rows expired by this batch
  AND d.is_current = FALSE;

4) As‑of lookup for facts

-- Join each incoming fact to the dimension version valid at event time
SELECT f.*, d.customer_sk
FROM stg_fact f  -- staging table of incoming facts (name illustrative)
JOIN dim_customer d
  ON d.customer_id_nk = f.customer_id_nk
 AND d.valid_from <= f.event_ts
 AND d.valid_to   >  f.event_ts;


Special cases

Late‑arriving dimensions

When a fact arrives before its dimension row/version exists:

  • Create an inferred member (minimal attributes, is_current=TRUE); a sketch follows this list.
  • Backfill: either update in place for Type 1 attributes or close & insert a new row for Type 2 attributes.
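
A sketch of inferred‑member creation, assuming a hypothetical stg_fact staging table of incoming facts:

INSERT INTO dim_customer (customer_id_nk, customer_name, region,
                          valid_from, valid_to, is_current, hash_diff)
SELECT DISTINCT
       f.customer_id_nk, 'UNKNOWN', 'UNKNOWN',
       CURRENT_TIMESTAMP, TIMESTAMP '9999-12-31 00:00:00', TRUE, NULL
FROM stg_fact f
LEFT JOIN dim_customer d
  ON d.customer_id_nk = f.customer_id_nk
WHERE d.customer_id_nk IS NULL;  -- no version exists yet for this natural key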

Soft deletes

On deactivation:

  • Expire the current row (valid_to = deactivation_ts, is_current = false), and optionally set is_deleted = true.
  • Facts remain historically valid.

High‑change attributes

Avoid Type 2 row explosion:

  • Mini‑dimension: Move volatile, high‑cardinality attributes (preferences, flags) to their own dimension and reference it via a foreign key; a sketch follows this list.
  • Mixed strategy: Track critical attributes as Type 2; keep non‑critical ones Type 1.
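
A mini‑dimension sketch with invented volatile attributes; each fact row carries both keys:

CREATE TABLE dim_customer_profile (
  profile_sk       BIGINT GENERATED ALWAYS AS IDENTITY,
  marketing_opt_in BOOLEAN,
  loyalty_tier     STRING,
  PRIMARY KEY (profile_sk)
);

-- Facts reference customer_sk AND profile_sk, so preference churn creates
-- at most one row per distinct combination here, never new customer versions.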

Testing & data quality

  • Idempotency: Re‑running the same load must not create duplicate versions. A hash_diff plus a deterministic valid_from helps.
  • Exactly one current row per natural key:
    SELECT customer_id_nk FROM dim_customer WHERE is_current GROUP BY 1 HAVING COUNT(*) > 1;
  • No temporal gaps/overlaps: For each natural key, valid_to of version n should equal valid_from of version n+1; a check is sketched after this list.
  • Type 0 enforcement: Alert if a proposed change touches a Type 0 column; consider “original vs curated” pairs when legitimate corrections are needed.
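
A sketch of the gap/overlap check using a window function:

WITH ordered AS (
  SELECT customer_id_nk, valid_from, valid_to,
         LEAD(valid_from) OVER (PARTITION BY customer_id_nk
                                ORDER BY valid_from) AS next_valid_from
  FROM dim_customer
)
SELECT *  -- any rows returned violate the no-gap/no-overlap rule
FROM ordered
WHERE next_valid_from IS NOT NULL
  AND valid_to <> next_valid_from;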

Performance tips

  • Indexing/Clustering: Cluster/partition on (customer_id_nk, valid_from DESC) or (customer_id_nk, valid_to DESC) to speed “as‑of” lookups.
  • Compression: Type 2 duplicates many values; columnar engines compress this well.
  • CDC where possible: Source‑side Change Data Capture reduces heavy diffing against full snapshots.

Checklist: before you ship

  •  Attribute‑by‑attribute SCD type decided and documented (including Type 0 candidates).
  •  Surrogate key, natural key, and effective‑dating columns chosen.
  •  hash_diff defined over only the tracked (non‑Type 0) attributes.
  •  Idempotent MERGE / expire / insert logic implemented.
  •  Fact loader performs as‑of dimension lookup.
  •  Tests for one current row, no temporal gaps, late‑arrival handling, and Type 0 DQ checks.
  •  Data dictionary updated with definitions and DQ rules.

Wrap‑up

SCDs are how you keep your star schema honest as business reality changes. Start by classifying each attribute—Type 0 if truly immutable, Type 1 for “latest only,” Type 2 (or Type 6) for historical accuracy—then back it with effective‑dating, a robust change‑detection strategy, and clear DQ rules. Your analysts (and auditors) will thank you.

Author: Jason Miles

A solution-focused developer, engineer, and data specialist serving diverse industries. He has led data products and citizen data initiatives for almost twenty years and is an expert in enabling organizations to turn data into insight, and then into action. He holds an MS in Analytics from Texas A&M, along with DAMA CDMP Master and INFORMS CAP-Expert credentials.