Data Development, DP-700, Microsoft Certification, Microsoft Fabric June 3, 2026

Monitor data ingestion (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Monitor and optimize an analytics solution (30–35%)
   --> Monitor Fabric items
      --> Monitor data ingestion

Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Overview

Data ingestion is one of the most critical processes in any data engineering solution. Regardless of whether data is ingested through pipelines, Dataflows Gen2, Eventstreams, Spark notebooks, mirroring, shortcuts, or streaming solutions, engineers must ensure that ingestion processes are running successfully, efficiently, and reliably.

In Microsoft Fabric, monitoring data ingestion involves tracking data movement activities, identifying failures, measuring performance, validating data completeness, troubleshooting bottlenecks, and ensuring data arrives in the correct destination on schedule.

For the DP-700 exam, you should understand:

How ingestion monitoring works across Fabric workloads
Monitoring pipelines and Dataflows Gen2
Monitoring Spark jobs and notebooks
Monitoring streaming ingestion
Using monitoring hubs and run history
Detecting ingestion failures
Investigating performance issues
Monitoring data quality and completeness
Best practices for operational monitoring

Why Data Ingestion Monitoring Matters

A data engineering solution is only valuable if data arrives correctly and on time.

Poorly monitored ingestion processes can result in:

Missing data
Incomplete reports
Delayed analytics
Data quality issues
Failed downstream transformations
Business decision errors

Consider an hourly sales ingestion process that fails at 2:00 AM. If no monitoring is in place, the issue is not discovered until business users report incorrect dashboards.

Proper monitoring helps detect and resolve problems before they impact users.

Data Ingestion Components in Microsoft Fabric

Several Fabric services perform data ingestion:

Data Pipelines

Used for:

Copy activities
Data movement
Workflow orchestration
ETL/ELT execution

Pipelines often serve as the primary ingestion mechanism for batch data.

Dataflows Gen2

Used for:

Low-code data ingestion
Power Query transformations
ETL development

Dataflows commonly ingest data from SaaS applications, databases, and files.

Spark Notebooks

Used for:

Large-scale ingestion
Custom transformations
Lakehouse loading

Spark jobs frequently handle enterprise-scale ingestion workloads.

Eventstreams

Used for:

Streaming ingestion
Event processing
Real-time data pipelines

Mirroring

Used for:

Near real-time replication
Continuous synchronization
Operational system integration

Monitoring Hub

The Monitoring Hub is the central monitoring experience within Microsoft Fabric.

It allows administrators and engineers to monitor:

Pipeline executions
Dataflow refreshes
Notebook runs
Spark jobs
Warehouse activities
Real-Time Intelligence workloads

The Monitoring Hub provides:

Run status
Start time
End time
Duration
Error messages
Historical execution information

For DP-700, expect questions regarding how to investigate failures and review execution history.

Monitoring Pipeline Executions

Pipelines provide detailed execution tracking.

Each pipeline run includes:

Status
Activity-level details
Runtime metrics
Input/output information
Error details

Typical statuses include:

Status	Meaning
Succeeded	Completed successfully
Failed	One or more activities failed
In Progress	Currently executing
Cancelled	Stopped before completion

Activity-Level Monitoring

Pipeline monitoring drills into individual activities.

Examples:

Copy Data activity
Notebook activity
Dataflow activity
Stored Procedure activity

If a pipeline fails, reviewing activity-level details is often the fastest way to identify the root cause.
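
As a rough illustration of activity-level triage, the sketch below walks a list of activity-run records and returns the first failed one. The record shape (name, type, status, error) is hypothetical and is not the schema Fabric returns; it only mirrors the fields described above.

```python
from typing import Optional

# Hypothetical activity-run records; field names are illustrative only.
activity_runs = [
    {"name": "LookupWatermark", "type": "Lookup", "status": "Succeeded", "error": None},
    {"name": "CopySales", "type": "Copy data", "status": "Failed",
     "error": "Login failed for user 'etl_reader'"},
    {"name": "TransformSales", "type": "Notebook", "status": "Cancelled", "error": None},
]


def first_failed_activity(runs: list[dict]) -> Optional[dict]:
    """Return the first activity whose status is 'Failed', or None."""
    for run in runs:
        if run["status"] == "Failed":
            return run
    return None


failed = first_failed_activity(activity_runs)
if failed is not None:
    print(f"{failed['name']} ({failed['type']}): {failed['error']}")
```

Activities downstream of the failure usually do not run at all, so the first failed activity is normally the one worth reading in detail.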

Common Pipeline Failures

Authentication Errors

Examples:

Expired credentials
Missing permissions
Invalid service principal access

Network Issues

Examples:

Source unavailable
Connectivity interruptions

Schema Changes

Examples:

Missing columns
Data type mismatches

Capacity Constraints

Examples:

Resource contention
Capacity throttling

Monitoring Dataflows Gen2

Dataflows Gen2 provide refresh history information.

Engineers can monitor:

Refresh success
Refresh failures
Execution duration
Row processing counts

Monitoring refresh history helps identify:

Slow transformations
Source system issues
Data quality problems

Dataflow Refresh History

Common metrics include:

Start time
End time
Duration
Refresh status
Error details

If refresh duration increases significantly over time, it may indicate:

Growing data volumes
Source performance degradation
Inefficient transformations

Monitoring Spark Ingestion Jobs

Spark workloads often support large-scale ingestion processes.

Monitoring includes:

Job execution status
Spark application logs
Resource utilization
Stage execution metrics

Spark Monitoring Metrics

Important metrics include:

Job Duration

Tracks overall execution time.

Executor Usage

Indicates cluster resource consumption.

Task Failures

Shows processing errors.

Data Skew

Identifies uneven partition distribution.

Shuffle Operations

Helps diagnose performance bottlenecks.
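
Data skew can be quantified with a simple ratio of the largest partition to the median partition. The sketch below works on a plain list of per-partition row counts; the sample numbers are invented. In PySpark, such counts could be gathered by grouping on spark_partition_id(), which is not shown here so the example stays runnable without a cluster.

```python
from statistics import median


def skew_ratio(partition_rows: list[int]) -> float:
    """Ratio of the largest partition to the median partition size."""
    if not partition_rows:
        raise ValueError("partition_rows must not be empty")
    mid = median(partition_rows)
    if mid == 0:
        # Avoid dividing by zero when most partitions are empty.
        return float("inf") if max(partition_rows) > 0 else 1.0
    return max(partition_rows) / mid


balanced = [1000, 1100, 950, 1050]
skewed = [1000, 1100, 950, 48000]
print(round(skew_ratio(balanced), 2))
print(round(skew_ratio(skewed), 2))
```

A ratio close to 1 suggests evenly sized partitions; a large ratio means one task is doing most of the work while the others wait.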

Monitoring Streaming Ingestion

Streaming solutions require continuous monitoring.

Common streaming workloads include:

Eventstreams
KQL databases
Real-Time Intelligence
Spark Structured Streaming

Key Streaming Metrics

Events Ingested

Measures throughput.

Example:

50,000 events per minute

Ingestion Latency

Measures the delay between when an event is created and when it becomes available for querying.

Lower latency generally indicates healthier streaming systems.
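
Ingestion latency is just the difference between two timestamps per event. The sketch below computes the average and worst-case latency for a few hypothetical events; the field names (created, available) are assumptions, not an Eventstream schema.

```python
from datetime import datetime, timezone

# Hypothetical events: creation time at the source and the time each
# event became available in the destination.
events = [
    {"created": datetime(2026, 6, 3, 2, 0, 0, tzinfo=timezone.utc),
     "available": datetime(2026, 6, 3, 2, 0, 4, tzinfo=timezone.utc)},
    {"created": datetime(2026, 6, 3, 2, 0, 10, tzinfo=timezone.utc),
     "available": datetime(2026, 6, 3, 2, 0, 16, tzinfo=timezone.utc)},
    {"created": datetime(2026, 6, 3, 2, 0, 20, tzinfo=timezone.utc),
     "available": datetime(2026, 6, 3, 2, 0, 40, tzinfo=timezone.utc)},
]


def latencies_seconds(events: list[dict]) -> list[float]:
    """Per-event delay between creation and availability, in seconds."""
    return [(e["available"] - e["created"]).total_seconds() for e in events]


delays = latencies_seconds(events)
average_latency = sum(delays) / len(delays)
worst_latency = max(delays)
print(f"average={average_latency:.1f}s worst={worst_latency:.1f}s")
```

Tracking the worst case alongside the average helps, because a single slow path can hide behind a healthy-looking mean.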

Failed Events

Tracks records that could not be processed.

Backlog Size

Measures unprocessed events waiting for ingestion.

Large backlogs may indicate:

Capacity issues
Slow downstream processing
Configuration problems
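
The arithmetic behind a growing backlog is simple: whenever events arrive faster than they are processed, the difference accumulates. The sketch below simulates this minute by minute, using the 50,000 events per minute figure from above and an assumed processing capacity of 45,000 events per minute.

```python
def backlog_over_time(arrivals: list[int], processed_per_interval: int,
                      start: int = 0) -> list[int]:
    """Backlog after each interval, given arrivals and a fixed processing rate."""
    backlog = start
    history = []
    for arrived in arrivals:
        # The backlog can never drop below zero.
        backlog = max(0, backlog + arrived - processed_per_interval)
        history.append(backlog)
    return history


# 50,000 events/minute arriving, 45,000 events/minute processed (assumed).
history = backlog_over_time([50_000] * 5, processed_per_interval=45_000)
print(history)  # [5000, 10000, 15000, 20000, 25000]
```

A backlog that rises steadily like this will not recover on its own; either processing capacity has to increase or the incoming rate has to fall.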

Monitoring Eventstreams

Eventstreams provide operational monitoring capabilities.

You can monitor:

Incoming event volume
Processing status
Transformation performance
Output destinations

Common issues include:

Source connectivity failures
Event schema mismatches
Destination write failures

Monitoring Mirroring

Mirroring continuously replicates source data into Fabric.

Monitoring focuses on:

Replication status
Synchronization delays
Replication failures
Data freshness

Important concepts include:

Replication Latency

Time between source changes and destination availability.

Synchronization Health

Indicates whether replication remains current.
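
A freshness check for mirrored tables can be reduced to comparing the last successful sync time against an allowed lag. The following is a minimal sketch; the table names, timestamps, and 15-minute threshold are all hypothetical, and how the last-sync times are obtained is left out.

```python
from datetime import datetime, timedelta, timezone


def stale_tables(last_synced: dict[str, datetime], now: datetime,
                 max_lag: timedelta) -> list[str]:
    """Return table names whose last sync is older than max_lag."""
    return sorted(
        table for table, synced_at in last_synced.items()
        if now - synced_at > max_lag
    )


now = datetime(2026, 6, 3, 12, 0, tzinfo=timezone.utc)
last_synced = {
    "Customers": now - timedelta(minutes=2),
    "Orders": now - timedelta(minutes=45),
}
print(stale_tables(last_synced, now, max_lag=timedelta(minutes=15)))
```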

Monitoring Data Completeness

Successful execution does not always mean successful ingestion.

Data engineers should validate:

Expected row counts
File counts
Event counts
Record completeness

Example:

A pipeline succeeds but only loads 70% of expected records.

Technical execution succeeded, but business requirements were not met.

Common Validation Checks

Row Count Validation

Compare source and destination record counts.

File Validation

Verify expected files arrived.

Timestamp Validation

Confirm recent records are present.

Duplicate Detection

Identify accidental duplicate ingestion.
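
The checks above translate into very small functions. The sketch below shows row count, duplicate, and file validation on in-memory values; the counts, keys, and file names are made up, and in practice they would come from queries against the source and destination.

```python
from collections import Counter


def completeness(source_rows: int, destination_rows: int) -> float:
    """Fraction of source rows that arrived in the destination."""
    if source_rows == 0:
        return 1.0
    return destination_rows / source_rows


def duplicate_keys(keys: list) -> list:
    """Keys that appear more than once in the destination."""
    return sorted(key for key, count in Counter(keys).items() if count > 1)


def missing_files(expected: set[str], arrived: set[str]) -> list[str]:
    """Expected files that did not arrive."""
    return sorted(expected - arrived)


print(completeness(10_000, 7_000))
print(duplicate_keys([101, 102, 102, 103, 101]))
print(missing_files({"a.csv", "b.csv", "c.csv"}, {"a.csv", "c.csv"}))
```

The first call reproduces the 70% scenario: the run reports success, yet the completeness ratio shows that 30% of the records are missing.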

Monitoring Data Quality During Ingestion

Data quality monitoring often includes:

Null value detection
Invalid data type identification
Duplicate record detection
Referential integrity checks

Monitoring quality issues early prevents downstream reporting problems.
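
A few of these quality checks can be sketched in one pass over the incoming rows. The example below flags null keys, values that do not parse as numbers, and references to unknown customers. The column names and sample rows are hypothetical.

```python
# Hypothetical incoming rows and reference data.
rows = [
    {"order_id": 1, "customer_id": 10, "amount": "19.99"},
    {"order_id": 2, "customer_id": None, "amount": "5.00"},
    {"order_id": 3, "customer_id": 99, "amount": "abc"},
]
known_customers = {10, 11, 12}


def quality_report(rows: list[dict], known_customers: set[int]) -> dict[str, list[int]]:
    """Collect the order_ids that fail each quality check."""
    report = {"null_customer": [], "orphan_customer": [], "invalid_amount": []}
    for row in rows:
        if row["customer_id"] is None:
            report["null_customer"].append(row["order_id"])
        elif row["customer_id"] not in known_customers:
            # Referential integrity: the customer must exist upstream.
            report["orphan_customer"].append(row["order_id"])
        try:
            float(row["amount"])
        except (TypeError, ValueError):
            report["invalid_amount"].append(row["order_id"])
    return report


print(quality_report(rows, known_customers))
```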

Alerts and Notifications

Monitoring becomes significantly more effective when alerts are configured.

Common alert scenarios include:

Pipeline failures
Dataflow refresh failures
Long-running jobs
Excessive ingestion latency
Capacity utilization thresholds

Alerts allow engineers to respond before business users notice issues.
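
Conceptually, each alert scenario is a threshold rule evaluated against current metrics. The sketch below shows only that logic; the metric names and thresholds are placeholders, and it does not represent how Fabric's own alerting features are configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str       # key to look up in the metrics snapshot
    threshold: float  # alert when the metric exceeds this value
    message: str


# Thresholds are placeholders, not recommended values.
RULES = [
    AlertRule("failed_runs", 0, "Pipeline or dataflow run failed"),
    AlertRule("run_minutes", 60, "Job is running longer than expected"),
    AlertRule("ingestion_latency_seconds", 300, "Ingestion latency is too high"),
    AlertRule("capacity_utilization_pct", 80, "Capacity utilization is high"),
]


def triggered_alerts(metrics: dict[str, float], rules: list[AlertRule]) -> list[str]:
    """Return the message of every rule whose threshold is exceeded."""
    return [
        rule.message
        for rule in rules
        if metrics.get(rule.metric, 0) > rule.threshold
    ]


snapshot = {"failed_runs": 1, "run_minutes": 12, "capacity_utilization_pct": 91}
for message in triggered_alerts(snapshot, RULES):
    print(message)
```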

Troubleshooting Ingestion Failures

A common troubleshooting workflow includes:

Step 1

Review the Monitoring Hub status.

Step 2

Identify the failed workload.

Step 3

Inspect the detailed error message.

Step 4

Validate source connectivity.

Step 5

Verify credentials and permissions.

Step 6

Review recent schema changes.

Step 7

Rerun the ingestion process if appropriate.

Best Practices

Establish Baselines

Track normal:

Runtime duration
Throughput
Latency
Data volume

Baseline measurements make anomalies easier to identify.
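
One common way to use a baseline is to compare the latest measurement with the mean and standard deviation of recent runs. The following sketch assumes run durations in seconds and a z-score threshold of 3; both the sample history and the threshold are illustrative choices.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float,
                 z_threshold: float = 3.0) -> bool:
    """True when latest deviates from the baseline by more than z_threshold sigmas."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != baseline
    return abs(latest - baseline) / spread > z_threshold


durations = [300, 310, 295, 305, 290]  # seconds, hypothetical past runs
print(is_anomalous(durations, 302))
print(is_anomalous(durations, 600))
```

The same function works for throughput, latency, or data volume, as long as the history reflects normal operation.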

Monitor Data Quality

Do not rely solely on execution success.

Validate:

Completeness
Accuracy
Timeliness

Use Alerts

Configure proactive notifications for:

Failures
Delays
Performance degradation

Retain Historical Monitoring Data

Historical execution information helps identify:

Trends
Capacity growth
Recurring failures

Investigate Long-Running Jobs

Increasing execution times often indicate:

Growing data volumes
Inefficient queries
Capacity limitations

DP-700 Exam Tips

Know the Monitoring Hub

The Monitoring Hub is the primary location for monitoring Fabric workloads.

Understand Pipeline Monitoring

Be familiar with:

Run history
Activity runs
Error messages
Execution duration

Understand Streaming Metrics

Know the importance of:

Throughput
Latency
Backlogs
Failed events

Monitor More Than Success Status

Successful execution does not guarantee complete or accurate data ingestion.

Understand Data Validation

Exam questions often focus on verifying:

Row counts
Data completeness
Freshness
Data quality

Practice Exam Questions

Question 1

Which Microsoft Fabric feature serves as the central location for monitoring pipelines, notebooks, Spark jobs, and dataflows?

A. Data Activator

B. OneLake Explorer

C. Monitoring Hub

D. Eventhouse

Answer: C

Explanation: The Monitoring Hub provides centralized monitoring across Fabric workloads and is the primary tool for reviewing execution history and failures.

Question 2

A pipeline execution completed successfully, but only half the expected records were loaded.

What should you verify first?

A. Workspace permissions

B. Data completeness and row counts

C. Capacity SKU

D. Sensitivity labels

Answer: B

Explanation: Successful execution does not guarantee successful business outcomes. Row count validation helps confirm complete ingestion.

Question 3

Which metric measures the delay between event creation and event availability in a streaming solution?

A. Throughput

B. Replication count

C. Ingestion latency

D. Refresh frequency

Answer: C

Explanation: Ingestion latency measures how quickly streaming data becomes available after being generated.

Question 4

Which issue is most likely if streaming event backlogs continue growing over time?

A. Processing cannot keep up with incoming events

B. Missing endorsement settings

C. Too many workspace roles

D. Excessive sensitivity labels

Answer: A

Explanation: Growing backlogs typically indicate that event processing is slower than event arrival rates.

Question 5

When troubleshooting a failed pipeline, what should typically be examined first?

A. Lakehouse shortcuts

B. Activity-level execution details

C. Workspace endorsements

D. Semantic model refresh schedules

Answer: B

Explanation: Activity-level details usually identify the exact source of a pipeline failure.

Question 6

Which metric is most useful for determining whether a Dataflow Gen2 refresh is becoming slower over time?

A. Sensitivity label

B. Number of workspaces

C. Refresh duration

D. Dataset owner

Answer: C

Explanation: Refresh duration directly measures execution performance and helps identify degradation trends.

Question 7

A data engineer wants to verify that every expected source file was loaded during ingestion.

Which validation approach should be used?

A. Capacity monitoring

B. File count validation

C. Role assignment review

D. Workspace auditing

Answer: B

Explanation: File count validation confirms that all expected files were ingested.

Question 8

Which Spark monitoring metric can help identify uneven partition distribution during ingestion?

A. Activity retry count

B. Replication latency

C. Refresh history

D. Data skew

Answer: D

Explanation: Data skew occurs when partitions contain significantly different amounts of data, creating processing bottlenecks.

Question 9

What is the primary purpose of configuring alerts for ingestion workloads?

A. To reduce storage costs

B. To automatically increase capacity

C. To proactively notify administrators of issues

D. To encrypt incoming data

Answer: C

Explanation: Alerts help identify failures, delays, and performance issues before they impact users.

Question 10

Which monitoring focus is most important for mirrored databases?

A. Report visual refresh time

B. Synchronization health and replication latency

C. Notebook parameter values

D. Semantic model relationships

Answer: B

Explanation: Mirroring depends on keeping source and destination systems synchronized, making replication latency and synchronization health critical monitoring metrics.

Go to the DP-700 Exam Prep Hub main page.

DP-700, Microsoft Certification, Microsoft Fabric June 3, 2026

Ingest data by using pipelines (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
   --> Ingest and transform batch data
      --> Ingest data by using pipelines

Introduction

Microsoft Fabric Data Pipelines are one of the primary tools used by data engineers to ingest, move, and orchestrate data across various sources and destinations. Pipelines provide a low-code orchestration framework that allows organizations to build scalable, repeatable, and maintainable data ingestion solutions.

For the DP-700 exam, it is important to understand:

What pipelines are
Pipeline architecture and components
Common ingestion patterns
Copy Data activity
Data source and destination connectivity
Pipeline orchestration
Parameters and dynamic content
Scheduling and triggering
Monitoring and troubleshooting
Best practices for pipeline-based ingestion

What Is a Microsoft Fabric Data Pipeline?

A Data Pipeline is a workflow orchestration service within Microsoft Fabric that enables data engineers to:

Move data between systems
Schedule data ingestion
Execute transformation activities
Coordinate multiple processes
Automate data workflows

Pipelines are derived from the same core concepts used in Azure Data Factory and Azure Synapse Analytics, making them familiar to many data professionals.

A pipeline is essentially a container that holds one or more activities that execute in a defined sequence.

Why Use Pipelines for Data Ingestion?

Organizations often need to ingest data from:

SQL Server
Azure SQL Database
Azure Blob Storage
Amazon S3
REST APIs
CSV files
Excel files
On-premises systems
Data warehouses
SaaS applications

Pipelines provide a centralized and scalable way to move this data into Fabric.

Benefits include:

Automation

No manual intervention required once configured.

Scalability

Handles large volumes of data efficiently.

Reusability

Pipelines can be reused across multiple ingestion scenarios.

Monitoring

Built-in execution tracking and logging.

Integration

Works with many Fabric workloads and external systems.

Pipeline Architecture

A pipeline consists of several components:

Pipeline

The overall workflow container.

Activities

Tasks performed within the pipeline.

Examples:

Copy Data
Notebook execution
Stored procedure execution
Dataflow execution
Variable assignment

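Datasets

In Azure Data Factory, datasets represent source or destination data structures. Fabric pipelines have no separate dataset item; source and destination settings are defined directly on each activity.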

Connections

Define how the pipeline connects to external systems.

Parameters

Provide runtime flexibility.

Triggers

Determine when pipelines execute.

Common Pipeline Activities

For DP-700, understanding activities is essential.

Copy Data Activity

The most commonly used ingestion activity.

Used to:

Copy files
Move tables
Transfer structured data
Load data into Fabric destinations

Examples:

SQL Server → Lakehouse
Azure SQL → Warehouse
CSV → OneLake
Blob Storage → Lakehouse

Notebook Activity

Executes Spark notebooks.

Common uses:

Data transformation
Data cleansing
Machine learning processing

Dataflow Activity

Runs Dataflow Gen2 processes.

Used when:

Low-code transformations are preferred
Business users participate in data preparation

Stored Procedure Activity

Executes SQL stored procedures.

Useful for:

Database maintenance
Incremental processing
Metadata updates

Using the Copy Data Activity

The Copy Data activity is heavily emphasized on the DP-700 exam.

Source

Defines where data originates.

Examples:

SQL Database
Oracle
REST API
CSV File
Blob Storage

Destination

Defines where data is written.

Examples:

Lakehouse
Data Warehouse
OneLake files
Azure SQL Database

Mapping

Maps source columns to destination columns.

Example:

Source	Destination
CustomerID	CustomerKey
Name	CustomerName
City	CustomerCity
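
The effect of this mapping on a single record can be sketched as a dictionary rename. This is only a conceptual model of what column mapping does, not how the Copy Data activity is implemented; the sample row is invented.

```python
# Source column -> destination column, taken from the table above.
COLUMN_MAPPING = {
    "CustomerID": "CustomerKey",
    "Name": "CustomerName",
    "City": "CustomerCity",
}


def apply_mapping(row: dict, mapping: dict[str, str]) -> dict:
    """Rename the mapped source columns; fail loudly if one is missing."""
    missing = [column for column in mapping if column not in row]
    if missing:
        raise KeyError(f"source row is missing columns: {missing}")
    return {destination: row[source] for source, destination in mapping.items()}


print(apply_mapping({"CustomerID": 7, "Name": "Ada", "City": "Oslo"}, COLUMN_MAPPING))
```

The explicit missing-column check corresponds to the schema-change failures that surface when a source column is renamed or dropped.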

Data Sources Supported by Pipelines

Fabric pipelines support numerous source systems.

Common examples include:

Relational Databases

SQL Server
Azure SQL Database
Oracle
PostgreSQL
MySQL

File-Based Sources

CSV
JSON
Parquet
Excel

Cloud Storage

Azure Blob Storage
Azure Data Lake Storage
Amazon S3

Web-Based Sources

REST APIs
HTTP endpoints

Pipeline Destinations

Common destinations include:

Lakehouse

Frequently used for raw and curated data storage.

Benefits:

Delta format
Open storage
Spark compatibility

Data Warehouse

Ideal for structured analytical workloads.

Benefits:

SQL support
Relational design
High-performance reporting

OneLake Files

Used for raw file storage.

Batch Data Ingestion Patterns

The DP-700 exam focuses heavily on batch ingestion.

Full Load Pattern

Every execution loads the entire dataset.

Example:

Daily import of a 5,000-row lookup table.

Advantages:

Simple implementation

Disadvantages:

Higher processing costs
Longer runtimes

Incremental Load Pattern

Only new or changed records are loaded.

Example:

Import orders created since the last execution.

Advantages:

Faster
Lower costs
Reduced data movement

Disadvantages:

More complex configuration
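
A common way to implement incremental loading is the high-watermark pattern: remember the latest timestamp already loaded and copy only rows newer than it. The sketch below demonstrates the idea with an in-memory SQLite database so that it runs anywhere; the table and column names are hypothetical, and in a pipeline the same steps would typically be spread across a lookup of the watermark and a filtered copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (order_id INTEGER PRIMARY KEY, created_at TEXT);
    CREATE TABLE dest_orders (order_id INTEGER PRIMARY KEY, created_at TEXT);
    INSERT INTO source_orders VALUES
        (1, '2026-06-01T08:00:00'), (2, '2026-06-02T09:30:00');
""")


def incremental_load(conn: sqlite3.Connection, watermark: str) -> tuple[int, str]:
    """Copy rows newer than the watermark; return (rows_copied, new_watermark)."""
    rows = conn.execute(
        "SELECT order_id, created_at FROM source_orders "
        "WHERE created_at > ? ORDER BY created_at",
        (watermark,),
    ).fetchall()
    conn.executemany("INSERT INTO dest_orders VALUES (?, ?)", rows)
    conn.commit()
    # Keep the old watermark when nothing new arrived.
    new_watermark = rows[-1][1] if rows else watermark
    return len(rows), new_watermark


# First run loads everything; the second run loads only the new order.
copied, watermark = incremental_load(conn, "1900-01-01T00:00:00")
conn.execute("INSERT INTO source_orders VALUES (3, '2026-06-03T10:15:00')")
copied_again, watermark = incremental_load(conn, watermark)
print(copied, copied_again, watermark)
```

The extra configuration listed as a disadvantage is visible here: the watermark has to be stored somewhere and updated only after a successful load.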

Parameterized Pipelines

Parameters make pipelines reusable.

Example parameter:

SourceTable

Each pipeline run can pass a different value for this parameter, such as:

Customers
Orders
Products
Invoices

This allows one pipeline design to ingest many tables.

Benefits:

Reduced development effort
Easier maintenance
Consistent ingestion processes
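
The idea of one design serving many tables can be sketched as a function that derives every per-table setting from a single parameter. The function name, the dbo schema default, and the destination naming rule below are assumptions made for illustration.

```python
def build_copy_settings(source_table: str, schema: str = "dbo") -> dict[str, str]:
    """Derive per-table copy settings from a single SourceTable parameter."""
    return {
        "source_query": f"SELECT * FROM [{schema}].[{source_table}]",
        "destination_table": source_table.lower(),
    }


for table in ["Customers", "Orders", "Products", "Invoices"]:
    settings = build_copy_settings(table)
    print(settings["source_query"], "->", settings["destination_table"])
```

Adding a fifth table then means adding one value to the list rather than building another pipeline.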

Dynamic Content

Dynamic expressions enable runtime flexibility.

Examples:

Generate file names:

Sales_@{formatDateTime(utcnow(),'yyyyMMdd')}.csv

Generate folders:

Raw/@{formatDateTime(utcnow(),'yyyy/MM/dd')}

Use parameter values:

@pipeline().parameters.TableName

Dynamic content is commonly tested on DP-700.
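
To make the evaluated results concrete, the sketch below reproduces the folder expression and a date-stamped file name in Python for a fixed run time. It is only an analogy for reasoning about what the expressions produce; the helper names are hypothetical and the pipeline itself evaluates its own expression language.

```python
from datetime import datetime, timezone


def raw_folder(now: datetime) -> str:
    """Mirror of Raw/@{formatDateTime(utcnow(),'yyyy/MM/dd')}."""
    return "Raw/" + now.strftime("%Y/%m/%d")


def sales_file_name(now: datetime) -> str:
    """A date-stamped file name such as Sales_20260603.csv."""
    return f"Sales_{now.strftime('%Y%m%d')}.csv"


run_time = datetime(2026, 6, 3, 2, 0, tzinfo=timezone.utc)
print(raw_folder(run_time))       # Raw/2026/06/03
print(sales_file_name(run_time))  # Sales_20260603.csv
```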

Control Flow Activities

Pipelines can include logic and branching.

If Condition

Executes different paths depending on conditions.

Example:

File exists → Continue
File missing → Send notification

Switch Activity

Handles multiple execution paths.

Example:

Process data differently based on source type.

ForEach Activity

Loops through collections.

Example:

Load 100 source tables using one pipeline.

Until Activity

Repeats execution until a condition becomes true.

Scheduling Pipelines

Pipelines commonly run on schedules.

Examples:

Hourly
Daily
Weekly
Monthly

Typical workloads:

Workload	Schedule
Sales Data	Hourly
ERP Data	Daily
Financial Data	Nightly
Master Data	Weekly

Event-Based Triggers

Instead of schedules, pipelines can run when events occur.

Examples:

New file arrives
Data source updated
Upstream process completed

Benefits:

Reduced latency
Faster processing
More responsive architecture

Monitoring Pipeline Executions

Fabric provides execution monitoring.

Data engineers can review:

Run Status

Succeeded
Failed
In Progress
Cancelled

Duration

How long the run took.

Activity-Level Results

Identify which step failed.

Error Messages

Useful for troubleshooting.

Common issues include:

Authentication failures
Missing files
Schema mismatches
Permission problems

Error Handling

Reliable ingestion solutions require proper error handling.

Common approaches:

Retry Policies

Automatically rerun failed activities.

Logging

Record execution details.

Validation

Check data quality before loading.

Notifications

Alert administrators when failures occur.
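
The retry idea can be sketched in a few lines. The example below retries a step that raises ConnectionError and doubles the wait between attempts. The doubling delay is a choice made for this illustration; pipeline activities expose their own retry settings, and the flaky_copy function is a hypothetical stand-in for a copy step.

```python
import time


def run_with_retry(action, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call action(); on ConnectionError retry with a doubling delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_copy():
    """Hypothetical copy step that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network interruption")
    return "copied"


# Record the delays instead of sleeping so the demo finishes instantly.
delays = []
result = run_with_retry(flaky_copy, sleep=delays.append)
print(result, delays)  # copied [1.0, 2.0]
```

Retries suit transient problems such as network interruptions. Failures caused by bad credentials or schema changes will fail again on every attempt and are better routed to logging and notifications.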

Security Considerations

Pipeline ingestion must follow security best practices.

Secure Credentials

Use managed identities and secure connections whenever possible.

Least Privilege

Grant only required permissions.

Workspace Security

Control who can modify pipelines.

Data Governance

Apply sensitivity labels and auditing where appropriate.

Pipeline Best Practices

Use Parameterization

Avoid hardcoding values.

Build Reusable Components

Create generic ingestion pipelines.

Use Incremental Loads

When possible, reduce data movement.

Monitor Executions

Review failures proactively.

Implement Error Handling

Design for operational resilience.

Separate Environments

Maintain Dev, Test, and Production pipelines.

Pipeline vs Dataflow Gen2 vs Notebook

Understanding when to use each tool is a common exam objective.

Feature	Pipeline	Dataflow Gen2	Notebook
Orchestration	Excellent	Limited	Limited
Data Movement	Excellent	Good	Good
Low-Code	Yes	Yes	No
Spark Processing	No	No	Yes
Complex Programming	No	No	Yes
Scheduling	Excellent	Good	Good

Use Pipelines When:

Moving data between systems
Orchestrating workflows
Scheduling processes
Managing dependencies

Use Dataflow Gen2 When:

Low-code transformations are required

Use Notebooks When:

Spark processing is needed
Custom Python or Scala logic is required

DP-700 Exam Tips

Remember these key points:

✓ Pipelines are primarily orchestration and data movement tools.

✓ The Copy Data activity is the most common ingestion activity.

✓ Pipelines support both scheduled and event-based execution.

✓ Parameters and dynamic expressions improve reusability.

✓ Incremental loads are preferred for large datasets.

✓ Pipelines can execute notebooks and dataflows.

✓ Monitoring and troubleshooting pipeline runs are important operational responsibilities.

✓ Control flow activities such as ForEach and If Condition are frequently used in enterprise solutions.

✓ Pipelines are generally the preferred Fabric tool for orchestrating end-to-end ingestion workflows.

Practice Exam Questions

Question 1

A data engineer needs to copy data nightly from Azure SQL Database into a Fabric Lakehouse. Which Fabric component is most appropriate?

A. Semantic Model
B. Data Pipeline
C. Dashboard
D. KQL Queryset

Correct Answer: B

Explanation:
Data Pipelines are designed for orchestrating and executing data movement activities such as copying data from Azure SQL Database into a Lakehouse.

Question 2

Which pipeline activity is primarily used to move data from a source system to a destination?

A. Notebook Activity
B. Copy Data Activity
C. If Condition Activity
D. Switch Activity

Correct Answer: B

Explanation:
The Copy Data activity is specifically designed for ingesting and transferring data between sources and destinations.

Question 3

A company wants a pipeline to process 50 tables using a single reusable workflow. Which feature should be implemented?

A. Data Warehouse
B. OneLake Shortcut
C. Parameters
D. Mirroring

Correct Answer: C

Explanation:
Parameters allow a pipeline to accept table names and other runtime values, making the solution reusable.

Question 4

Which control flow activity is used to repeatedly process a collection of items?

A. ForEach
B. Wait
C. Lookup
D. If Condition

Correct Answer: A

Explanation:
The ForEach activity iterates through collections and executes activities for each item.

Question 5

A data engineer wants a pipeline to run automatically every night at midnight. What should be configured?

A. Sensitivity Label
B. Scheduled Trigger
C. Dataflow Refresh Policy
D. Lakehouse Shortcut

Correct Answer: B

Explanation:
Scheduled triggers are used to execute pipelines at predefined times.

Question 6

Which Fabric destination is most commonly used for storing raw and curated Delta tables?

A. Lakehouse
B. Dashboard
C. Workspace Role
D. Semantic Model

Correct Answer: A

Explanation:
Lakehouses provide Delta Lake storage and are commonly used as ingestion targets.

Question 7

A pipeline should execute only when a new file arrives in storage. What should be used?

A. Manual Execution
B. Incremental Refresh
C. Event-Based Trigger
D. Full Load

Correct Answer: C

Explanation:
Event-based triggers allow pipelines to start when specific events occur, such as file creation.

Question 8

Which statement about incremental loading is correct?

A. It reloads all records every execution.
B. It loads only new or changed records.
C. It requires deleting the destination table first.
D. It cannot be implemented in pipelines.

Correct Answer: B

Explanation:
Incremental loading minimizes processing by transferring only new or modified data.

Question 9

A data engineer needs to execute custom PySpark transformation logic as part of a pipeline. Which activity should be used?

A. Copy Data Activity
B. If Condition Activity
C. Stored Procedure Activity
D. Notebook Activity

Correct Answer: D

Explanation:
Notebook activities allow execution of Spark notebooks containing custom Python, Scala, SQL, or Spark code.

Question 10

A pipeline execution fails due to a temporary network interruption. Which design practice can help improve reliability?

A. Use dashboard subscriptions
B. Apply endorsement labels
C. Configure retry policies
D. Disable monitoring

Correct Answer: C

Explanation:
Retry policies automatically reattempt failed activities and are a key best practice for building resilient ingestion pipelines.
