Star schemas shine when your facts (events) are analyzed through dimensions (who/what/where/when). But in real life, dimension attributes change—customers move, products rebrand, sales territories realign. Slowly changing dimensions (SCDs) are the modeling patterns that preserve analytic correctness as those attributes evolve.
What is a slowly changing dimension?
A slowly changing dimension is a special kind of dimension used on in star schema (and related) modeling that allows you to track things that change about the dimension. One of the classic examples is that if you are working on a data platform for the DMV, you might want to track what color a car is at a specific time. How do you do that without making everything “one big table?” You use a slowly changing dimension. Slowly changing dimensions come in several types.
- Type 0 — Retain original: Value is immutable after first set; later differences are treated as data‑quality (DQ) issues, not history.
- Type 1 — Overwrite: Keep only the latest value; no history.
- Type 2 — Row versioning: New row per change with effective dates and a “current” flag; full history.
- Type 3 — Limited history: Keep a small number of previous values in extra columns.
- Type 4 — History table: Current values in the main dimension; full history in a separate table.
- Type 6 — Hybrid (1+2+3): Historical rows and a few “current/previous” columns for convenience.
If your analytics ask “what was true at the time?” (e.g., revenue by the customer’s region as of the sale), you almost certainly want Type 2 for that attribute.
Why SCDs matter
Suppose you ask:
“How did last year’s revenue break down by customer region?”
If you overwrite a customer’s region when they move (Type 1), last year’s sales will suddenly appear under the newregion. With Type 2, the older facts keep pointing to the “old” region row and newer facts to the “new” one—your trend lines remain trustworthy.
Core building blocks
- Natural key (
customer_id_nk): Business identifier from source (e.g., CRMCustomerID). - Surrogate key (
customer_sk): Warehouse‑generated key; facts join to the specific dimension version via this key. - Change metadata (for Type 2/6):
valid_from(date/timestamp)valid_to(date/timestamp; often9999-12-31for open‑ended)is_current(boolean)hash_diff(optional but recommended—hash of tracked attributes to detect changes)
Consistent naming is your friend. Decide conventions early and document them.
The SCD types—what to use when
Type 0 — Retain Original
Definition. Set once, never updated in the dimensional model. Later “changes” are treated as errors/corrections, not business history.
Good candidates (truly immutable):
- Date of birth / incorporation date
- First‑seen timestamp (when the entity entered your systems)
- Original acquisition channel (if defined strictly as “first touch”)
- System‑assigned immutable identifiers
Anti‑examples (often misclassified as Type 0):
- Legal name (can change) → often Type 2 (or Type 1 if only current matters)
- Address/region → Type 2
- Email/phone → usually Type 1; sometimes Type 2 if used for historical slicing
Modeling patterns:
- Exclude Type 0 columns from
hash_diffso they never trigger new versions. - Optionally keep a pair:
*_original(Type 0) and*_curated(Type 1) to allow safe corrections without losing lineage. - Enforce with DQ tests (e.g., flag if a stage value for a Type 0 column differs from the current dimension value).
Type 1 — Overwrite
Update the existing row; no history.
Use when: Latest value is all that matters (e.g., standardized casing, email fix), or the attribute isn’t analytically relevant over time.
Type 2 — Row Versioning with Dates
Insert a new row on change; close the old row (valid_to, is_current=false). Facts keep the surrogate key captured at load time, preserving “as of” accuracy.
Use when: You ask temporal questions or require auditability.
Minimal schema sketch:
CREATE TABLE dim_customer (
customer_sk BIGINT GENERATED ALWAYS AS IDENTITY,
customer_id_nk STRING NOT NULL,
customer_name STRING,
region STRING,
-- SCD metadata
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP NOT NULL,
is_current BOOLEAN NOT NULL,
hash_diff STRING, -- hash over tracked attrs (exclude Type 0)
PRIMARY KEY (customer_sk)
);
Type 3 — Limited History by Columns
Store a small set of previous values (e.g., region_previous).
Use when: Analysts only need “current vs previous,” not full history.
Type 4 — History Table
Keep a slim “current” dimension for performance; move full history into a separate table (e.g., dim_customer_history).
Use when: You want fast daily joins but occasional deep historical analysis.
Type 6 — Hybrid (1+2+3)
A Type 2 base with some Type 3 “previous_*” columns and Type 1 semantics on non‑critical attributes.
Use when: You need full history and simple “current vs previous” reporting without extra self‑joins.
Note: You’ll see references to other numbered types (5, 7, etc.). In practice, 0/1/2/3/4/6 cover the vast majority of needs and are least ambiguous across teams.
Choosing the right type (quick decision guide)
- Does this attribute affect historical analysis?
- Yes → Type 2 (or Type 6).
- No → Type 1.
- Only need “current vs previous”?
- Yes → Type 3 (or Type 6).
- Keep main dimension lean; archive deep history elsewhere?
- Yes → Type 4.
- Is it truly immutable?
- Yes → Type 0.
- High‑churn attribute causing row explosion?
- Consider a mini‑dimension (separate small dimension for volatile attributes) or keep that attribute Type 1.
Implementing SCD Type 2 (warehouse‑agnostic pattern)
Assumptions:
- Staging view/table
stg_customerwith today’s snapshot/changes. - Tracked (Type 2) attributes include
customer_name,region. - Type 0 attributes (e.g.,
date_of_birth) are excluded from the change hash.
1) Compute a hash on tracked attributes in staging
WITH stg AS (
SELECT
customer_id_nk,
customer_name,
region,
TO_HEX(MD5(CONCAT_WS('|',
COALESCE(customer_name, ''),
COALESCE(region, '')
))) AS hash_diff
FROM stg_customer
)
2) Upsert logic with MERGE (conceptual)
MERGE INTO dim_customer d
USING (SELECT s.*, CURRENT_TIMESTAMP AS load_ts FROM stg s) s
ON d.customer_id_nk = s.customer_id_nk AND d.is_current = TRUE
WHEN MATCHED AND d.hash_diff <> s.hash_diff THEN
UPDATE SET d.valid_to = s.load_ts, d.is_current = FALSE
WHEN NOT MATCHED BY TARGET THEN
INSERT (customer_id_nk, customer_name, region, valid_from, valid_to, is_current, hash_diff)
VALUES (s.customer_id_nk, s.customer_name, s.region,
s.load_ts, TIMESTAMP '9999-12-31 00:00:00', TRUE, s.hash_diff);
3) Insert the new version for changed rows
(Some engines support a second MERGE clause; others require a follow‑up insert.)
INSERT INTO dim_customer (customer_id_nk, customer_name, region,
valid_from, valid_to, is_current, hash_diff)
SELECT s.customer_id_nk, s.customer_name, s.region,
s.load_ts, TIMESTAMP '9999-12-31 00:00:00', TRUE, s.hash_diff
FROM stg s
JOIN dim_customer d
ON d.customer_id_nk = s.customer_id_nk
WHERE d.valid_to = s.load_ts
AND d.is_current = FALSE;
4) As‑of lookup for facts
SELECT d.customer_sk
FROM dim_customer d
WHERE d.customer_id_nk = fact.customer_id_nk
AND d.valid_from <= fact.event_ts
AND d.valid_to > fact.event_ts;
Special cases
Late‑arriving dimensions
When a fact arrives before its dimension row/version exists:
- Create an inferred member (minimal attributes,
is_current=TRUE). - Backfill: either update in place for Type 1 attributes or close & insert a new row for Type 2 attributes.
Soft deletes
On deactivation:
- Expire the current row (
valid_to = deactivation_ts,is_current=false), optionally setis_deleted=true. - Facts remain historically valid.
High‑change attributes
Avoid Type 2 row explosion:
- Mini‑dimension: Move volatile, high‑cardinality attributes (preferences, flags) to their own dimension and reference via a foreign key.
- Mixed strategy: Track critical attributes as Type 2; keep non‑critical ones Type 1.
Testing & data quality
- Idempotency: Re‑running the same load must not create duplicate versions. Hash‑diff + deterministic
valid_fromhelps. - Exactly one current row per natural key:
SELECT customer_id_nk FROM dim_customer WHERE is_current GROUP BY 1 HAVING COUNT(*) > 1; - No temporal gaps/overlaps: For each natural key,
valid_toof version n should equalvalid_fromof version n+1. - Type 0 enforcement: Alert if a proposed change touches a Type 0 column; consider “original vs curated” pairs when legitimate corrections are needed.
Performance tips
- Indexing/Clustering: Cluster/partition on
(customer_id_nk, valid_from DESC)or(customer_id_nk, valid_to DESC)to speed “as‑of” lookups. - Compression: Type 2 duplicates many values; columnar engines compress this well.
- CDC where possible: Source‑side Change Data Capture reduces heavy diffing against full snapshots.
Checklist: before you ship
- Attribute‑by‑attribute SCD type decided and documented (including Type 0 candidates).
- Surrogate key, natural key, and effective‑dating columns chosen.
-
hash_diffdefined over only the tracked (non‑Type 0) attributes. - Idempotent MERGE / expire / insert logic implemented.
- Fact loader performs as‑of dimension lookup.
- Tests for one current row, no temporal gaps, late‑arrival handling, and Type 0 DQ checks.
- Data dictionary updated with definitions and DQ rules.
Wrap‑up
SCDs are how you keep your star schema honest as business reality changes. Start by classifying each attribute—Type 0 if truly immutable, Type 1 for “latest only,” Type 2 (or Type 6) for historical accuracy—then back it with effective‑dating, a robust change‑detection strategy, and clear DQ rules. Your analysts (and auditors) will thank you.