Fabric data ingestion: what to use when

Data platforms usually fail in two predictable ways: they drown in shadow copies nobody owns, or they calcify around a single ingestion pattern that does not fit every source. Microsoft Fabric offers a broader palette. You can read data where it already lives, replicate operational systems into a governed lake, and run high‑throughput batch and low‑latency streams without wiring a dozen services together. The hard part isn’t finding a capable tool; it’s choosing among several deliberately, so your estate stays fast, testable, and governable as it grows.

This guide treats Zero Unmanaged Copies (ZUC) as a strong—but not exclusive—operating model. ZUC constrains where bytes land and keeps lineage simple: if data persists, it is inside OneLake under policy and catalog; if it does not need to persist, you read it in place. Many teams will also continue to run a traditional lakehouse with raw/bronze landings, curated silver, and published gold. Fabric supports both because everything converges on OneLake (the boundary) and Delta (the table format). We evaluate each option with consistent criteria: performance (bulk throughput and end‑to‑end latency), operational surface (how much you must run and monitor), governance posture (where data persists and how it is secured), team ergonomics (SQL, Spark, or low‑code), and table health (file sizes, partitioning, Delta logs).

For clarity: zero‑copy means reading in place. Managed copy means materializing inside OneLake with lineage. Unmanaged copy is anything persisted outside governance—temporary blobs, stray CSV drops, buckets with unclear ownership. ZUC eliminates that last category; a traditional lakehouse allows governed staging and raw landings as part of the pipeline.

Two operating modes to mix, not choose between

  • Zero‑Unmanaged‑Copy (ZUC). Avoid shadow storage. If data moves, it lands in OneLake; if not, read it in place. This favors Shortcuts, Mirroring, and Eventstream to reduce copies and simplify lineage.
  • Traditional Lakehouse (TLH). Maintain a bronze landing, curate to silver, and publish gold. Allow governed staging when needed. This model is familiar, debuggable, and strong for heavy batch and complex transforms.

Use both: ZUC where sources are already analytics‑friendly; TLH where reshaping, merging, or re‑partitioning at scale is the job.


The ingestion toolbox in Fabric

| Mechanism | What it is | Performance posture | Fit for ZUC | Fit for TLH | Advantages | Watch‑outs |
| --- | --- | --- | --- | --- | --- | --- |
| OneLake Shortcuts | Virtualize external storage (ADLS, S3, other OneLake) with no copy | No ingest time; speed tied to upstream layout and network | Excellent | Good for discovery; materialize later | Instant access; single namespace and auth | Egress/latency depend on source; upstream schema drift is immediate; some features assume local Delta |
| Mirroring | Managed, near‑real‑time replica of OLTP into OneLake | Low‑latency CDC without DIY plumbing | Strong | Strong (treat as bronze input) | Minimal setup; governed freshness | Source coverage is opinionated; still a copy (though governed) |
| Pipelines — Copy → Lakehouse | Data Factory‑style Copy landing into Delta | High‑throughput, parallelized columnar writes; V‑Order‑friendly | Strong | Excellent | Rich connectors; schedules, triggers, retries, monitoring | If endpoints sit on‑prem behind different gateways, plan a governed interim hop |
| Eventstream | Managed streaming to Delta or KQL (Kafka‑friendly) | Low‑latency ingestion; scalable fan‑in | Strong | Strong (append to bronze, then MERGE to silver) | No broker babysitting; clean upsert path | Validate ordering/latency; control steady‑state costs |
| Warehouse COPY INTO | T‑SQL bulk loader into Fabric Warehouse (Delta under the hood) | Very high bandwidth on large files | Strong | Strong (SQL‑first marts) | SQL‑native; great ergonomics | Keep sources inside governance; curate file sizes |
| Dataflow Gen2 | Low‑code Power Query (M) with scheduled refresh | Easy to start; slower at very large scale | Good | Good (ingest + light shaping) | Huge SaaS connector surface; parameters, guardrails | Not for massive merges/heavy ETL; hand off to Spark at scale |
| Notebooks & Spark Jobs | Code‑first batch/stream transforms and merges | Scales with Spark; you own OPTIMIZE/VACUUM | Strong | Excellent (medallion curation) | Full control; CI/CD‑friendly; testable | Requires discipline; table health drives BI speed (e.g., Direct Lake) |

Table health first. Fewer, larger files, sensible partitioning, periodic compaction/OPTIMIZE, and routine VACUUM will improve performance more than any single tool choice.
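
To make that housekeeping concrete, here is a minimal maintenance sketch for a Fabric notebook. The table name sales_silver is a hypothetical placeholder, and `spark` is the SparkSession a Fabric notebook already provides; the same statements apply to any Delta table you own.

```python
# Minimal table-maintenance sketch (PySpark in a Fabric notebook).
# "sales_silver" is a hypothetical table name -- substitute your own.
# `spark` is the ambient SparkSession provided by the notebook runtime.

from delta.tables import DeltaTable

table_name = "sales_silver"

# Compact many small files into fewer, larger ones.
spark.sql(f"OPTIMIZE {table_name}")

# Remove files no longer referenced by the Delta log, keeping 7 days (168 hours)
# of history so time travel and concurrent readers keep working.
spark.sql(f"VACUUM {table_name} RETAIN 168 HOURS")

# Equivalent maintenance through the Delta Lake Python API:
dt = DeltaTable.forName(spark, table_name)
dt.optimize().executeCompaction()
dt.vacuum(168)  # retention in hours
```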


Traditional lakehouse on Fabric (tight, reliable, debuggable)

  1. Land to bronze via Copy, COPY INTO, Eventstream, or Dataflow Gen2. Favor columnar, well‑sized files and predictable folders.
  2. Curate to silver with Spark or T‑SQL: enforce schema, deduplicate, apply SCDs, and reshape for conformance; a PySpark upsert is sketched below.
  3. Publish gold for BI and data science; consider Direct Lake where Power BI reads Delta directly.
  4. Orchestrate with Pipelines for multi‑step flows; use notebook/job schedules for focused tasks.
  5. Govern end‑to‑end inside OneLake, even for “temporary” staging.

This path excels when upserts, schema evolution, and predictable batch SLAs dominate.
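
To ground step 2, here is a minimal PySpark sketch of a bronze‑to‑silver upsert. The table names (orders_bronze, orders_silver), key column (order_id), and change timestamp (last_modified) are hypothetical; your keys, dedup rules, and SCD handling will differ.

```python
# Bronze-to-silver curation sketch (PySpark in a Fabric notebook).
# Table and column names are hypothetical -- adjust to your schema.
# `spark` is the notebook's SparkSession.

from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1. Read the latest bronze landing and keep only the newest record per key.
bronze = spark.read.table("orders_bronze")
latest_per_key = Window.partitionBy("order_id").orderBy(F.col("last_modified").desc())
deduped = (
    bronze
    .withColumn("_rn", F.row_number().over(latest_per_key))
    .filter("_rn = 1")
    .drop("_rn")
)

# 2. Upsert into silver: update changed rows, insert new ones.
silver = DeltaTable.forName(spark, "orders_silver")
(
    silver.alias("s")
    .merge(deduped.alias("b"), "s.order_id = b.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

Note that whenMatchedUpdateAll and whenNotMatchedInsertAll assume bronze and silver share a schema; with schema drift, map columns explicitly instead.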


Zero‑unmanaged‑copy on Fabric (minimal movement, governed by default)

  • Shortcuts for analytics‑ready external Parquet/Delta; explore and join in place.
  • Mirroring for relational CDC without building it yourself—treat mirrored tables as bronze input.
  • Eventstream for events without broker ops; land to Delta, then MERGE to silver.
  • Materialize intentionally (and only inside OneLake) when locality, features, or latency demand it.

This path reduces movement and operational surface while keeping lineage clean.
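
As a hedged illustration of reading in place, the sketch below assumes a Lakehouse that already holds two shortcuts: a file shortcut named s3_sales under Files and a table shortcut named ext_customers under Tables (both names hypothetical), with the notebook attached to that Lakehouse as its default.

```python
# Zero-copy reads through OneLake shortcuts (PySpark in a Fabric notebook).
# Shortcut names ("s3_sales", "ext_customers") and columns are hypothetical;
# the notebook is assumed to have the owning Lakehouse as its default.

# A file shortcut to external Parquet: read it in place, no ingestion step.
sales = spark.read.parquet("Files/s3_sales/2024/")

# A table shortcut to an external Delta folder surfaces like a local table.
customers = spark.read.table("ext_customers")

# Join and aggregate without materializing anything outside OneLake.
summary = (
    sales.join(customers, "customer_id")
         .groupBy("region")
         .agg({"amount": "sum"})
)
summary.show()
```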


Orchestration patterns that work for both

  • Pipelines as enterprise glue. Multi‑step flows, dependencies, retries, branching, parameters, and alerts. Run Copy, notebooks, SQL, and semantic model refresh. Support both time‑based and event‑based triggers (e.g., file arrival).
  • Notebook/Spark Job schedulers for focused jobs. “This job at 02:00 UTC and on file arrival” without orchestration overhead.
  • Dataflow Gen2 scheduled refresh for self‑service. Keep small SaaS workloads self‑contained, or call them from Pipelines for unified monitoring.
  • Data Activator for reactive patterns. Launch Fabric items on data signals without writing a listener service.

Rule of thumb: Pipelines = flows of many things. Schedulers = one thing on a clock. Activator = one thing on a signal.
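
When a full Pipeline is more than you need, a driver notebook can chain a couple of child notebooks itself. The sketch below is a minimal example of that pattern; the child notebook names and parameters are hypothetical, and Fabric exposes these helpers as notebookutils (with mssparkutils retained as a compatible alias), so check which name your runtime provides.

```python
# Lightweight "one driver notebook runs a few children" orchestration sketch.
# Child notebook names and parameters are hypothetical. notebookutils is
# provided by the Fabric notebook runtime (legacy alias: mssparkutils).

run_date = "2024-06-01"  # in practice, derive from the schedule or a pipeline parameter

# Run ingestion first; run() blocks until the child notebook finishes or times out.
notebookutils.notebook.run("ingest_orders_bronze", 1800, {"run_date": run_date})

# Then curation, passing the same parameter.
notebookutils.notebook.run("curate_orders_silver", 1800, {"run_date": run_date})
```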


Performance & cost: what actually moves the needle

  • Throughput loves big files. Target tens–hundreds of MB per file (or larger, within engine sweet spots). Avoid small‑file explosions.
  • Partition by how you filter. Dates, regions, and a few selective keys—avoid high‑cardinality IDs as partitions (see the write sketch after this list).
  • Plan merges. Batch upserts, cluster around merge keys, and separate append/merge phases where possible.
  • Mind the network. With Shortcuts across clouds, egress and locality cap performance. Materialize to OneLake when the math works.
  • Keep Delta logs healthy. Compact frequently updated tables, checkpoint streams, and run VACUUM. Direct Lake read speed reflects this housekeeping.
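
Here is the write sketch referenced above: a minimal example of partitioning by a date column and consolidating each date’s rows before the write. The DataFrame, column names, and table names are hypothetical; tune the approach to your own volumes and filter patterns.

```python
# Partitioning and file-size sketch (PySpark). Table and column names are
# hypothetical -- adjust to your data volume and filter patterns.

from pyspark.sql import functions as F

events = spark.read.table("events_bronze")  # hypothetical bronze table

(
    events
    .withColumn("event_date", F.to_date("event_ts"))
    # Repartition by the same column the table is partitioned by, so each date's
    # rows land in a small number of large files instead of many tiny ones.
    .repartition("event_date")
    .write
    .format("delta")
    .mode("overwrite")
    .partitionBy("event_date")        # partition by how you filter, not by high-cardinality IDs
    .saveAsTable("events_silver")     # hypothetical target table
)
```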

Choosing guide (fast heuristics)

  • External Parquet/Delta and read‑heavy? Start with Shortcuts; materialize only if you need locality or features.
  • Fresh OLTP without CDC plumbing? Mirror and curate downstream.
  • Large scheduled batches from files/DBs? Copy → Lakehouse; SQL‑first teams: Warehouse COPY INTO.
  • Clickstreams, IoT, CDC topics as events? Eventstream → Delta → MERGE to silver.
  • SaaS sources + guardrails? Dataflow Gen2; graduate heavy transforms to Spark.
  • Complex merges, schema evolution, performance‑sensitive curation? Notebooks & Spark Jobs—and own table hygiene.

Conclusion

Fabric supports multiple valid ingestion styles. If you want tight governance and minimal movement, a ZUC approach built on Shortcuts, Mirroring, and Eventstream trims operational surface while keeping data in OneLake. If your reality is a traditional lakehouse, lean on Copy, Warehouse COPY INTO, Dataflow Gen2, and Spark to land bronze, curate silver, and publish gold with predictable SLAs. In both cases, treat OneLake as the boundary, Delta as the common format, and table health as a first‑class responsibility. Choose the tool that matches your latency and scale, keep every persisted byte governed, and you’ll ship faster without a graveyard of shadow storage.


Author: Jason Miles

A solution-focused developer, engineer, and data specialist working across diverse industries. He has led data products and citizen data initiatives for almost twenty years and is an expert in helping organizations turn data into insight, and insight into action. He holds an MS in Analytics from Texas A&M, along with DAMA CDMP Master and INFORMS CAP-Expert credentials.