If you’ve built data systems long enough, you’ve lived through at least three architectural moods: the tidy certainty of Kimball and Inmon, the anarchic freedom of “throw everything in the data lake to ingest quickly,” and today’s lakehouse, which tries to keep our speed without losing our sanity. I’ve always cared less about labels and more about baselines—clear, durable expectations that make change safe. This piece traces how those baselines shifted, what we gained and lost, and how to rebuild them for modern work, including real‑time, very large, and unstructured data.
What Kimball and Inmon Actually Gave Us: A Baseline
The classic data warehouse patterns were never just about schemas. They were about social contracts.
- Kimball prioritized usability: conformed dimensions, star schemas, slowly changing dimensions. Analysts received stable, well-labeled facts and dimensions whose meanings didn’t drift week to week.
- Inmon prioritized integration: a centralized, normalized enterprise warehouse fed downstream marts. It set a baseline that said, “Definitions live here. We reconcile conflicts here.”
Both schools created a semantic baseline: shared business language enforced by design. That foundation slowed initial delivery, yes—but it protected meaning. The cost of change was explicit and budgeted.
The Data Lake Era: Fast Ingest, Fuzzy Meaning
Then object storage and “schema-on-read” arrived. We could land data cheaply, in any shape, at any velocity. The new baseline became: “land everything, now.” Speed won.
But with speed came drift. Without a contract at ingest, we often traded semantic clarity for throughput. Names meant different things in different folders. Quality checks became ad hoc. Lineage was tribal knowledge. The lake delivered raw information but eroded confidence, and teams filled the gap with informal conventions, brittle notebooks, and heroic individuals.
The lesson wasn’t that lakes are bad. It was that speed without a baseline degrades trust. And when trust goes, so does adoption.
The Lakehouse: Reclaiming the Baseline Without Losing Speed
The lakehouse is not a single product; it’s a pattern: open, low-cost storage combined with table abstractions that restore guarantees we relied on in warehouses—atomic writes, versioning, schema evolution, time travel, and reliable indexing/partitioning—right atop the lake.
The baseline returns in three ways:
- Table Contracts on the Lake. Instead of dumping files, we publish tables with expectations: primary keys or unique constraints where possible, compatible schema evolution rules, and predictable partitioning. “Schema-on-read” becomes “schema-on-read with contracts.” (A minimal contract sketch follows below.)
- Layered Surfaces. Whether you call it bronze/silver/gold or raw/validated/curated, the pattern is the same: escalate guarantees as data moves. Raw is durable and minimally altered. Validated checks types, nullability, and referential rules. Curated reintroduces semantic models: marts, feature sets, and subject-area views.
- Unified Engines for Batch and Stream. The same compute fabric runs micro-batch and streaming jobs, letting you apply the same contracts to moving and at-rest data. That kills off most “dual pipeline” drift.
This is how we keep “ingest quickly” while making it safe to build something durable on top.
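To make the table-contract idea concrete, here is a minimal sketch in plain Python. It is engine-agnostic on purpose: the table name, columns, key, and the “additive, nullable-only” evolution rule are illustrative assumptions, not any particular table format’s API.

```python
from dataclasses import dataclass, field

# A minimal, illustrative table contract: expected columns, a business key,
# and a compatibility rule for schema evolution (additions only).
@dataclass
class TableContract:
    name: str
    columns: dict[str, str]          # column -> logical type
    primary_key: list[str] = field(default_factory=list)

    def evolution_is_compatible(self, proposed: dict[str, str]) -> bool:
        # Existing columns must keep their types; new columns are allowed
        # only as additions (treated as nullable by downstream readers).
        for col, typ in self.columns.items():
            if proposed.get(col) != typ:
                return False
        return True

# Hypothetical contract for an "orders" table in the validated layer.
orders = TableContract(
    name="validated.orders",
    columns={"order_id": "string", "customer_id": "string",
             "amount": "decimal(18,2)", "order_ts": "timestamp"},
    primary_key=["order_id"],
)

# Adding a column is compatible; silently retyping an existing one is not.
assert orders.evolution_is_compatible({**orders.columns, "channel": "string"})
assert not orders.evolution_is_compatible({**orders.columns, "amount": "double"})
```

In practice a check like this runs in CI whenever a producer proposes a schema change, so incompatible evolution is caught before it ever reaches the lake.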
The Shift from ETL to ELT: Why It Matters for Baselines
In classic ETL, slow and expensive staging systems forced us to transform first, load after. In ELT, storage is cheap and compute elastic, so we load raw data first and transform where it lives. That’s more than a tooling change; it’s a contract change.
- In ETL, the transformation is the gatekeeper; the staging area is a side effect. The baseline is enforced before data is visible.
- In ELT, the raw zone is the baseline of record: append-only, auditable, versioned. Transformations are published artifacts layered above raw data, with their own contracts and tests.
ELT aligns with scientific practice: preserve the original observation (raw), keep your code and transformations reproducible, and publish derived datasets with explicit claims. It also shortens feedback loops: analysts can explore raw data early, while curated models mature at their own pace.
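Here is what that ELT shape looks like as a sketch, assuming PySpark and plain Parquet paths for simplicity; the bucket layout and column names are hypothetical, and the same pattern applies to Delta, Iceberg, or Hudi tables.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt_sketch").getOrCreate()

# 1) Load first: land the source payload as-is in an append-only raw zone.
raw = spark.read.json("s3://lake/landing/orders/2024-06-01/")  # hypothetical path
(raw
 .withColumn("_ingested_at", F.current_timestamp())
 .write.mode("append")
 .parquet("s3://lake/raw/orders/"))

# 2) Transform where the data lives: publish a validated table with explicit
#    types, de-duplication on the business key, and basic integrity rules.
validated = (spark.read.parquet("s3://lake/raw/orders/")
    .select(
        F.col("order_id").cast("string"),
        F.col("customer_id").cast("string"),
        F.col("amount").cast("decimal(18,2)"),
        F.to_timestamp("order_ts").alias("order_ts"),
    )
    .where(F.col("order_id").isNotNull())
    .dropDuplicates(["order_id"]))

validated.write.mode("overwrite").parquet("s3://lake/validated/orders/")
```

The raw write is append-only and preserves the original observation; the validated table can be rebuilt from it at any time, which is what makes iteration at the curated edge safe.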
What We Keep from Kimball and Inmon
We keep the semantics. Conformed dimensions and well-defined facts still matter; they just live in the curated layer of the lakehouse rather than a single monolithic warehouse. Metrics belong to a semantic layer or contract—versioned, reviewed, and testable—not scattered across dashboards. The integration bias from Inmon and the usability bias from Kimball are complementary; the lakehouse hosts both.
(For teams that favor a source‑of‑truth integration approach, a Data Vault–style pattern often works well in the validated layer: business keys in hubs, relationships in links, context in satellites. It preserves raw fidelity while enabling repeatable, governed derivations upstream of curated marts.)
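As a small illustration of “metrics as contracts,” here is one way a definition might live in version control rather than in a dashboard; the fields and names are illustrative assumptions, not a specific semantic-layer tool.

```python
from dataclasses import dataclass

# An illustrative metric definition that is versioned, code-reviewed,
# and testable like any other artifact.
@dataclass(frozen=True)
class MetricDefinition:
    name: str
    version: int
    description: str
    expression: str          # the single, reviewed expression for this metric
    owners: tuple[str, ...]

net_revenue = MetricDefinition(
    name="net_revenue",
    version=3,
    description="Gross order amount minus refunds, in reporting currency.",
    expression="SUM(amount) - SUM(refund_amount)",
    owners=("finance-analytics",),
)
```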
Governance and Observability: Baselines People Can See
A modern baseline is visible, testable, and automated:
- Visible: Data contracts, schemas, and expectations live with the code as declarative artifacts. Lineage shows how a curated table depends on validated sources.
- Testable: You treat data quality like software quality: null checks, uniqueness, referential integrity, schema compatibility, and freshness SLAs, all run as part of CI for pipelines (sketched after this list).
- Automated: Policies for PII, retention, and access are attached to tables, not buried in docs. Compaction, clustering, and vacuuming are routine, not oral tradition.
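As a sketch of what “testable” can mean in practice, here is a minimal set of expectations run against a hypothetical validated table with PySpark; a real pipeline would likely use a dedicated expectations framework, but the shape is the same.

```python
from datetime import datetime, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_checks").getOrCreate()
orders = spark.read.parquet("s3://lake/validated/orders/")  # hypothetical table

failures = []

# Null check on the business key.
if orders.where(F.col("order_id").isNull()).count() > 0:
    failures.append("order_id contains nulls")

# Uniqueness check on the business key.
if orders.count() != orders.select("order_id").distinct().count():
    failures.append("order_id is not unique")

# Freshness SLA: the newest record must be less than 24 hours old
# (assumes order_ts and the session clock share a timezone).
latest = orders.agg(F.max("order_ts")).first()[0]
if latest is None or latest < datetime.now() - timedelta(hours=24):
    failures.append("orders is stale (older than 24h)")

# Surface failures to CI by exiting non-zero.
if failures:
    raise SystemExit("Data quality checks failed: " + "; ".join(failures))
```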
Baselines that are social (not just technical) are the ones that hold up under change.
Unusual Scenarios, Reframed with Baselines
1) Real‑Time Data: The Baseline for Moving Targets
Real-time used to force teams into a split-brain “Lambda” architecture: one system for streaming, another for batch, plus glue code. The result was drift—two truths that never matched.
A lakehouse-era baseline for real-time looks different:
- One Pipeline, Two Tempos. Treat streams as continuously arriving micro-batches appended to raw tables. The same transformation code that builds daily aggregates can build minute-level views, just with different triggers.
- Event-Time Discipline. Base windows and joins on event time, not processing time, with watermarks to bound lateness. This avoids double counting and keeps late-arriving corrections consistent. (See the streaming sketch below.)
- Contracts for Events. Use a schema registry or table contract for each event topic. If a producer wants to evolve shape, it negotiates compatibility. That’s the social contract that keeps downstream consumers sane.
- CDC as a Log. For databases, change data capture publishes an immutable record of inserts/updates/deletes into raw tables. Curated state is recomputed idempotently from that log, so replays are safe.
The outcome: real-time feels like faster batch, not a different universe.
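Here is a minimal Structured Streaming sketch of that baseline, assuming PySpark and hypothetical paths: a stream read as continuously arriving micro-batches, event-time windows with a watermark to bound lateness, and output that lands like any other curated table.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("stream_sketch").getOrCreate()

# Hypothetical event schema; in practice it comes from a schema registry
# or the event topic's table contract.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("event_time", TimestampType()),
])

# Treat the stream as continuously arriving micro-batches over raw files.
events = spark.readStream.schema(schema).json("s3://lake/raw/order_events/")

# Event-time windows with a watermark to bound how late data may arrive.
per_minute = (events
    .withWatermark("event_time", "15 minutes")
    .groupBy(F.window("event_time", "1 minute"), "status")
    .count())

# The same contract and code shape as batch, just with a trigger.
query = (per_minute.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "s3://lake/curated/order_status_by_minute/")
    .option("checkpointLocation", "s3://lake/_checkpoints/order_status_by_minute/")
    .trigger(processingTime="1 minute")
    .start())
```

Swapping the trigger changes the tempo, not the logic, which is exactly the “one pipeline, two tempos” idea.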
2) Very Large Datasets: When Size Redefines “Reasonable”
Scale turns small frictions into budget items. Baselines help by making the expensive parts boring:
- File and Partition Hygiene. Large tables behave when files are right-sized, partitions reflect query predicates, and small-file explosions are repaired with regular compaction. This isn’t glamour; it is the difference between minutes and hours. (A compaction sketch follows below.)
- Predictable Layouts. Sorting or clustering on high-selectivity columns reduces work. Even without fancy indexing, consistent layout enables predicate and partition pruning to do their job.
- Versioned Backfills. Backfills become new versions of the same tables, not overwrite roulette. Consumers can move when ready; lineage shows exactly what changed.
- Cold/Hot Economics. The baseline includes retention and tiering expectations. Raw stays long; curated gets trimmed to what’s needed; aggregates are cheap to rebuild.
When the boring parts are codified, scale becomes a cost curve you can manage, not an emergency.
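As one illustration of making compaction boring, here is a minimal sketch using pyarrow with hypothetical paths and an illustrative file-size target; table formats such as Delta and Iceberg ship their own compaction commands, and this is simply the engine-agnostic version of the same idea.

```python
import pyarrow.dataset as ds

# Read a partition that has accumulated many small files...
small_files = ds.dataset("/lake/raw/orders/ds=2024-06-01/", format="parquet")

# ...and rewrite it as a handful of right-sized files. Roughly one million
# rows per file is an illustrative target; tune it to your engine.
ds.write_dataset(
    small_files.to_table(),
    "/lake/raw_compacted/orders/ds=2024-06-01/",
    format="parquet",
    max_rows_per_file=1_000_000,
    max_rows_per_group=128_000,
)
```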
3) Unstructured & Semi‑Structured Data: Baselines at the Edge
Modern analytics increasingly leans on text, images, audio, and logs. The lakehouse can host them, but the baseline must expand:
- Objects + Rich Metadata. Store the bytes in the lake; store structured metadata, provenance, and checksums in tables. The table, not the object path, is the unit of governance and discovery (sketched below).
- Repeatable Extraction. OCR, speech‑to‑text, entity extraction, and embedding generation should be deterministic, versioned, and idempotent. Changes to the extraction model create new derived tables rather than mutating history.
- Vector Search With Provenance. If you build a vector index for retrieval or RAG, link every vector back to the source object and extraction version. Baseline means you can trace a model prediction to the exact inputs and code that produced it.
- Privacy by Construction. PII classification and redaction run in the validated layer. Unstructured does not mean ungoverned.
The baseline principle is the same: preserve the original observation, publish derived structure with contracts.
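Here is a sketch of the metadata side of that baseline in plain Python; the field names, the extraction version label, and the file paths are illustrative assumptions.

```python
import hashlib
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# One metadata row per stored object: the table, not the object path,
# is the unit of governance and discovery. Field names are illustrative.
@dataclass
class ObjectRecord:
    object_uri: str
    sha256: str
    source_system: str
    ingested_at: str
    extraction_version: str  # ties derived text/embeddings back to the code that produced them

def register(path: str, uri: str, source: str, extraction_version: str) -> ObjectRecord:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return ObjectRecord(
        object_uri=uri,
        sha256=digest,
        source_system=source,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        extraction_version=extraction_version,
    )

# Placeholder object so the sketch runs end to end.
with open("call_recording.wav", "wb") as f:
    f.write(b"fake audio bytes")

# The resulting rows land in a governed metadata table; any vector or entity
# derived later carries (object_uri, extraction_version), so a prediction can
# be traced back to the exact inputs and code.
row = asdict(register(
    "call_recording.wav",
    "s3://lake/raw/audio/call_recording.wav",
    source="telephony",
    extraction_version="stt-v2.1",
))
```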
Putting It Together: A Modern Baseline That Survives Change
A healthy lakehouse isn’t a diagram; it’s a set of habits:
- Raw is immutable and comprehensible. You can answer, “Where did this come from?” in one click.
- Validated is predictable. Types, keys, and basic integrity are enforced. Schemas evolve with clear compatibility rules.
- Curated is meaningful. Business definitions, conformed dimensions, and metrics are reviewed, versioned, and observable.
- Transformations are code. They are idempotent, testable, documented by lineage, and executed by engines that handle both batch and streaming modes.
- Costs are visible. Compaction, retention, and tiering are routine and tuned per data product, not handled as emergencies.
None of this slows you down. It lets you keep moving without the slow bleed of ambiguity and rework.
Why This Beats “Just Dump It In”
“Just dump it in” optimizes the first week and taxes every week after. Baselines reverse that curve. When ingestion is fast and principled, exploration starts early, quality improves over time, and trust grows instead of decaying. ELT makes this practical by preserving raw truth and letting you iterate at the curated edge without fear.
Kimball and Inmon taught us to value shared meaning. The data lake taught us to value speed and openness. The lakehouse gives us the chance to keep both—if we make baselines the center of the conversation again.
That’s not a fad; it’s just good engineering.