Fabric data ingestion: what to use when

Data platforms usually fail in two predictable ways: they drown in shadow copies nobody owns, or they calcify around a single ingestion pattern that does not fit every source. Microsoft Fabric offers a broader palette. You can read data where it already lives, replicate operational systems into a governed lake, and run high‑throughput batch and low‑latency streams without wiring a dozen services together. The hard part isn’t finding a capable tool; it’s choosing among several deliberately, so your estate stays fast, testable, and governable as it grows.

This guide treats Zero Unmanaged Copies (ZUC) as a strong—but not exclusive—operating model. ZUC constrains where bytes land and keeps lineage simple: if data persists, it is inside OneLake under policy and catalog; if it does not need to persist, you read it in place. Many teams will also continue to run a traditional lakehouse with raw/bronze landings, curated silver, and published gold. Fabric supports both because everything converges on OneLake (the boundary) and Delta (the table format). We evaluate each option with consistent criteria: performance (bulk throughput and end‑to‑end latency), operational surface (how much you must run and monitor), governance posture (where data persists and how it is secured), team ergonomics (SQL, Spark, or low‑code), and table health (file sizes, partitioning, Delta logs).

For clarity: zero‑copy means reading in place. Managed copy means materializing inside OneLake with lineage. Unmanaged copy is anything persisted outside governance—temporary blobs, stray CSV drops, buckets with unclear ownership. ZUC eliminates that last category; a traditional lakehouse allows governed staging and raw landings as part of the pipeline.

Two operating modes to mix, not choose between

  • Zero‑Unmanaged‑Copy (ZUC). Avoid shadow storage. If data moves, it lands in OneLake; if not, read it in place. This favors Shortcuts, Mirroring, and Eventstream to reduce copies and simplify lineage.
  • Traditional Lakehouse (TLH). Maintain a bronze landing, curate to silver, and publish gold. Allow governed staging when needed. This model is familiar, debuggable, and strong for heavy batch and complex transforms.

Use both: ZUC where sources are already analytics‑friendly; TLH where reshaping, merging, or re‑partitioning at scale is the job.


The ingestion toolbox in Fabric

| Mechanism | What it is | Performance posture | Fit for ZUC | Fit for TLH | Advantages | Watch‑outs |
| --- | --- | --- | --- | --- | --- | --- |
| OneLake Shortcuts | Virtualize external storage (ADLS, S3, other OneLake) with no copy | No ingest time; speed tied to upstream layout and network | Excellent | Good for discovery; materialize later | Instant access; single namespace and auth | Egress/latency depend on source; upstream schema drift is immediate; some features assume local Delta |
| Mirroring | Managed, near‑real‑time replica of OLTP into OneLake | Low‑latency CDC without DIY plumbing | Strong | Strong (treat as bronze input) | Minimal setup; governed freshness | Source coverage is opinionated; still a copy (though governed) |
| Pipelines — Copy → Lakehouse | Data Factory‑style Copy landing into Delta | High‑throughput, parallelized columnar writes; V‑Order‑friendly | Strong | Excellent | Rich connectors; schedules, triggers, retries, monitoring | If endpoints sit on‑prem behind different gateways, plan a governed interim hop |
| Eventstream | Managed streaming to Delta or KQL (Kafka‑friendly) | Low‑latency ingestion; scalable fan‑in | Strong | Strong (append to bronze, then MERGE to silver) | No broker babysitting; clean upsert path | Validate ordering/latency; control steady‑state costs |
| Warehouse COPY INTO | T‑SQL bulk loader into Fabric Warehouse (Delta under the hood) | Very high bandwidth on large files | Strong | Strong (SQL‑first marts) | SQL‑native; great ergonomics | Keep sources inside governance; curate file sizes |
| Dataflow Gen2 | Low‑code Power Query (M) with scheduled refresh | Easy to start; slower at very large scale | Good | Good (ingest + light shaping) | Huge SaaS connector surface; parameters, guardrails | Not for massive merges/heavy ETL; hand off to Spark at scale |
| Notebooks & Spark Jobs | Code‑first batch/stream transforms and merges | Scales with Spark; you own OPTIMIZE/VACUUM | Strong | Excellent (medallion curation) | Full control; CI/CD‑friendly; testable | Requires discipline; table health drives BI speed (e.g., Direct Lake) |

Table health first. Fewer, larger files, sensible partitioning, periodic compaction/OPTIMIZE, and routine VACUUM will improve performance more than any single tool choice.
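
To make that housekeeping concrete, here is a minimal maintenance sketch for a Fabric notebook. The table name sales_silver is a hypothetical placeholder, and `spark` is the SparkSession a Fabric notebook already provides; the same statements apply to any Delta table you own.

```python
# Minimal table-maintenance sketch (PySpark in a Fabric notebook).
# "sales_silver" is a hypothetical table name -- substitute your own.
# `spark` is the ambient SparkSession provided by the notebook runtime.

from delta.tables import DeltaTable

table_name = "sales_silver"

# Compact many small files into fewer, larger ones.
spark.sql(f"OPTIMIZE {table_name}")

# Remove files no longer referenced by the Delta log, keeping 7 days (168 hours)
# of history so time travel and concurrent readers keep working.
spark.sql(f"VACUUM {table_name} RETAIN 168 HOURS")

# Equivalent maintenance through the Delta Lake Python API:
dt = DeltaTable.forName(spark, table_name)
dt.optimize().executeCompaction()
dt.vacuum(168)  # retention in hours
```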


Traditional lakehouse on Fabric (tight, reliable, debuggable)

  1. Land to bronze via Copy, COPY INTO, Eventstream, or Dataflow Gen2. Favor columnar, well‑sized files and predictable folders.
  2. Curate to silver with Spark or T‑SQL: enforce schema, deduplicate, apply SCDs, and reshape for conformance; a PySpark upsert is sketched below.
  3. Publish gold for BI and data science; consider Direct Lake where Power BI reads Delta directly.
  4. Orchestrate with Pipelines for multi‑step flows; use notebook/job schedules for focused tasks.
  5. Govern end‑to‑end inside OneLake, even for “temporary” staging.

This path excels when upserts, schema evolution, and predictable batch SLAs dominate.
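
To ground step 2, here is a minimal PySpark sketch of a bronze‑to‑silver upsert. The table names (orders_bronze, orders_silver), key column (order_id), and change timestamp (last_modified) are hypothetical; your keys, dedup rules, and SCD handling will differ.

```python
# Bronze-to-silver curation sketch (PySpark in a Fabric notebook).
# Table and column names are hypothetical -- adjust to your schema.
# `spark` is the notebook's SparkSession.

from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1. Read the latest bronze landing and keep only the newest record per key.
bronze = spark.read.table("orders_bronze")
latest_per_key = Window.partitionBy("order_id").orderBy(F.col("last_modified").desc())
deduped = (
    bronze
    .withColumn("_rn", F.row_number().over(latest_per_key))
    .filter("_rn = 1")
    .drop("_rn")
)

# 2. Upsert into silver: update changed rows, insert new ones.
silver = DeltaTable.forName(spark, "orders_silver")
(
    silver.alias("s")
    .merge(deduped.alias("b"), "s.order_id = b.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

Note that whenMatchedUpdateAll and whenNotMatchedInsertAll assume bronze and silver share a schema; with schema drift, map columns explicitly instead.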


Zero‑unmanaged‑copy on Fabric (minimal movement, governed by default)

  • Shortcuts for analytics‑ready external Parquet/Delta; explore and join in place.
  • Mirroring for relational CDC without building it yourself—treat mirrored tables as bronze input.
  • Eventstream for events without broker ops; land to Delta, then MERGE to silver.
  • Materialize intentionally (and only inside OneLake) when locality, features, or latency demand it.

This path reduces movement and operational surface while keeping lineage clean.
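
As a hedged illustration of reading in place, the sketch below assumes a Lakehouse that already holds two shortcuts: a file shortcut named s3_sales under Files and a table shortcut named ext_customers under Tables (both names hypothetical), with the notebook attached to that Lakehouse as its default.

```python
# Zero-copy reads through OneLake shortcuts (PySpark in a Fabric notebook).
# Shortcut names ("s3_sales", "ext_customers") and columns are hypothetical;
# the notebook is assumed to have the owning Lakehouse as its default.

# A file shortcut to external Parquet: read it in place, no ingestion step.
sales = spark.read.parquet("Files/s3_sales/2024/")

# A table shortcut to an external Delta folder surfaces like a local table.
customers = spark.read.table("ext_customers")

# Join and aggregate without materializing anything outside OneLake.
summary = (
    sales.join(customers, "customer_id")
         .groupBy("region")
         .agg({"amount": "sum"})
)
summary.show()
```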


Orchestration patterns that work for both

  • Pipelines as enterprise glue. Multi‑step flows, dependencies, retries, branching, parameters, and alerts. Run Copy, notebooks, SQL, and semantic model refresh. Support both time‑based and event‑based triggers (e.g., file arrival).
  • Notebook/Spark Job schedulers for focused jobs. “This job at 02:00 UTC and on file arrival” without orchestration overhead.
  • Dataflow Gen2 scheduled refresh for self‑service. Keep small SaaS workloads self‑contained, or call them from Pipelines for unified monitoring.
  • Data Activator for reactive patterns. Launch Fabric items on data signals without writing a listener service.

Rule of thumb: Pipelines = flows of many things. Schedulers = one thing on a clock. Activator = one thing on a signal.
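
When a full Pipeline is more than you need, a driver notebook can chain a couple of child notebooks itself. The sketch below is a minimal example of that pattern; the child notebook names and parameters are hypothetical, and Fabric exposes these helpers as notebookutils (with mssparkutils retained as a compatible alias), so check which name your runtime provides.

```python
# Lightweight "one driver notebook runs a few children" orchestration sketch.
# Child notebook names and parameters are hypothetical. notebookutils is
# provided by the Fabric notebook runtime (legacy alias: mssparkutils).

run_date = "2024-06-01"  # in practice, derive from the schedule or a pipeline parameter

# Run ingestion first; run() blocks until the child notebook finishes or times out.
notebookutils.notebook.run("ingest_orders_bronze", 1800, {"run_date": run_date})

# Then curation, passing the same parameter.
notebookutils.notebook.run("curate_orders_silver", 1800, {"run_date": run_date})
```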


Performance & cost: what actually moves the needle

  • Throughput loves big files. Target tens–hundreds of MB per file (or larger, within engine sweet spots). Avoid small‑file explosions.
  • Partition by how you filter. Dates, regions, and a few selective keys—avoid high‑cardinality IDs as partitions (see the write sketch after this list).
  • Plan merges. Batch upserts, cluster around merge keys, and separate append/merge phases where possible.
  • Mind the network. With Shortcuts across clouds, egress and locality cap performance. Materialize to OneLake when the math works.
  • Keep Delta logs healthy. Compact frequently updated tables, checkpoint streams, and run VACUUM. Direct Lake read speed reflects this housekeeping.
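
Here is the write sketch referenced above: a minimal example of partitioning by a date column and consolidating each date’s rows before the write. The DataFrame, column names, and table names are hypothetical; tune the approach to your own volumes and filter patterns.

```python
# Partitioning and file-size sketch (PySpark). Table and column names are
# hypothetical -- adjust to your data volume and filter patterns.

from pyspark.sql import functions as F

events = spark.read.table("events_bronze")  # hypothetical bronze table

(
    events
    .withColumn("event_date", F.to_date("event_ts"))
    # Repartition by the same column the table is partitioned by, so each date's
    # rows land in a small number of large files instead of many tiny ones.
    .repartition("event_date")
    .write
    .format("delta")
    .mode("overwrite")
    .partitionBy("event_date")        # partition by how you filter, not by high-cardinality IDs
    .saveAsTable("events_silver")     # hypothetical target table
)
```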

Choosing guide (fast heuristics)

  • External Parquet/Delta and read‑heavy? Start with Shortcuts; materialize only if you need locality or features.
  • Fresh OLTP without CDC plumbing? Mirror and curate downstream.
  • Large scheduled batches from files/DBs? Copy → Lakehouse; SQL‑first teams: Warehouse COPY INTO.
  • Clickstreams, IoT, CDC topics as events? Eventstream → Delta → MERGE to silver.
  • SaaS sources + guardrails? Dataflow Gen2; graduate heavy transforms to Spark.
  • Complex merges, schema evolution, performance‑sensitive curation? Notebooks & Spark Jobs—and own table hygiene.

Conclusion

Fabric supports multiple valid ingestion styles. If you want tight governance and minimal movement, a ZUC approach built on Shortcuts, Mirroring, and Eventstream trims operational surface while keeping data in OneLake. If your reality is a traditional lakehouse, lean on Copy, Warehouse COPY INTO, Dataflow Gen2, and Spark to land bronze, curate silver, and publish gold with predictable SLAs. In both cases, treat OneLake as the boundary, Delta as the common format, and table health as a first‑class responsibility. Choose the tool that matches your latency and scale, keep every persisted byte governed, and you’ll ship faster without a graveyard of shadow storage.


Author: Jason Miles

A solution-focused developer, engineer, and data specialist working across diverse industries. He has led data products and citizen data initiatives for almost twenty years and is an expert in helping organizations turn data into insight, and insight into action. He holds an MS in Analytics from Texas A&M, along with DAMA CDMP Master and INFORMS CAP-Expert credentials.