
Monitor data ingestion (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Monitor and optimize an analytics solution (30–35%)
   --> Monitor Fabric items
      --> Monitor data ingestion


Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Overview

Data ingestion is one of the most critical processes in any data engineering solution. Regardless of whether data is ingested through pipelines, Dataflows Gen2, Eventstreams, Spark notebooks, mirroring, shortcuts, or streaming solutions, engineers must ensure that ingestion processes are running successfully, efficiently, and reliably.

In Microsoft Fabric, monitoring data ingestion involves tracking data movement activities, identifying failures, measuring performance, validating data completeness, troubleshooting bottlenecks, and ensuring data arrives in the correct destination on schedule.

For the DP-700 exam, you should understand:

  • How ingestion monitoring works across Fabric workloads
  • Monitoring pipelines and Dataflows Gen2
  • Monitoring Spark jobs and notebooks
  • Monitoring streaming ingestion
  • Using monitoring hubs and run history
  • Detecting ingestion failures
  • Investigating performance issues
  • Monitoring data quality and completeness
  • Best practices for operational monitoring

Why Data Ingestion Monitoring Matters

A data engineering solution is only valuable if data arrives correctly and on time.

Poorly monitored ingestion processes can result in:

  • Missing data
  • Incomplete reports
  • Delayed analytics
  • Data quality issues
  • Failed downstream transformations
  • Business decision errors

Consider an hourly sales ingestion process with no monitoring in place:

  • The process fails at 2:00 AM
  • Nobody is notified of the failure
  • The issue is not discovered until business users report incorrect dashboards

Proper monitoring helps detect and resolve problems before they impact users.


Data Ingestion Components in Microsoft Fabric

Several Fabric services perform data ingestion:

Data Pipelines

Used for:

  • Copy activities
  • Data movement
  • Workflow orchestration
  • ETL/ELT execution

Pipelines often serve as the primary ingestion mechanism for batch data.


Dataflows Gen2

Used for:

  • Low-code data ingestion
  • Power Query transformations
  • ETL development

Dataflows commonly ingest data from SaaS applications, databases, and files.


Spark Notebooks

Used for:

  • Large-scale ingestion
  • Custom transformations
  • Lakehouse loading

Spark jobs frequently handle enterprise-scale ingestion workloads.


Eventstreams

Used for:

  • Streaming ingestion
  • Event processing
  • Real-time data pipelines

Mirroring

Used for:

  • Near real-time replication
  • Continuous synchronization
  • Operational system integration

Monitoring Hub

The Monitoring Hub is the central monitoring experience within Microsoft Fabric.

It allows administrators and engineers to monitor:

  • Pipeline executions
  • Dataflow refreshes
  • Notebook runs
  • Spark jobs
  • Warehouse activities
  • Real-Time Intelligence workloads

The Monitoring Hub provides:

  • Run status
  • Start time
  • End time
  • Duration
  • Error messages
  • Historical execution information

For DP-700, expect questions regarding how to investigate failures and review execution history.


Monitoring Pipeline Executions

Pipelines provide detailed execution tracking.

Each pipeline run includes:

  • Status
  • Activity-level details
  • Runtime metrics
  • Input/output information
  • Error details

Typical statuses include:

Status      | Meaning
Succeeded   | Completed successfully
Failed      | One or more activities failed
In Progress | Currently executing
Cancelled   | Stopped before completion

Activity-Level Monitoring

Pipeline monitoring drills into individual activities.

Examples:

  • Copy Data activity
  • Notebook activity
  • Dataflow activity
  • Stored Procedure activity

If a pipeline fails, reviewing activity-level details is often the fastest way to identify the root cause.
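
As a rough illustration of drilling from a run into its activities, the sketch below scans a pipeline run record for the first failed activity. The dictionary layout (activities, name, status, error) is a hypothetical shape chosen for this example, not the format Fabric returns.

```python
from typing import Optional


def first_failed_activity(run: dict) -> Optional[dict]:
    """Return the first activity whose status is 'Failed', or None."""
    for activity in run.get("activities", []):
        if activity.get("status") == "Failed":
            return activity
    return None


# Hypothetical run record; field names are assumptions for illustration.
sample_run = {
    "pipeline": "LoadSales",
    "status": "Failed",
    "activities": [
        {"name": "CopySalesData", "status": "Succeeded", "error": None},
        {"name": "TransformNotebook", "status": "Failed",
         "error": "Column 'Region' not found"},
        {"name": "UpdateMetadata", "status": "Skipped", "error": None},
    ],
}

failed = first_failed_activity(sample_run)
if failed is not None:
    print(f"{failed['name']}: {failed['error']}")
```

Reading the error on the failed activity, rather than the run-level status, is what points to the root cause.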


Common Pipeline Failures

Authentication Errors

Examples:

  • Expired credentials
  • Missing permissions
  • Invalid service principal access

Network Issues

Examples:

  • Source unavailable
  • Connectivity interruptions

Schema Changes

Examples:

  • Missing columns
  • Data type mismatches

Capacity Constraints

Examples:

  • Resource contention
  • Capacity throttling
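
When many runs fail, it can help to sort error messages into the four failure categories above before investigating. The following is a minimal sketch; the keyword lists are illustrative assumptions, not an official Fabric error taxonomy, and real error text will vary.

```python
# Keyword lists are assumptions made for this example only.
FAILURE_KEYWORDS = {
    "authentication": ["unauthorized", "expired", "permission", "forbidden"],
    "network": ["timeout", "unreachable", "connection"],
    "schema": ["column", "data type", "schema"],
    "capacity": ["throttl", "capacity"],
}


def classify_failure(message: str) -> str:
    """Map an error message to a failure category by keyword match."""
    text = message.lower()
    for category, keywords in FAILURE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "unknown"


print(classify_failure("Login failed: credentials expired"))
print(classify_failure("Request was throttled"))
```

Categories are checked in the order listed, so a message matching several lists is assigned to the first match.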

Monitoring Dataflows Gen2

Dataflows Gen2 provide refresh history information.

Engineers can monitor:

  • Refresh success
  • Refresh failures
  • Execution duration
  • Row processing counts

Monitoring refresh history helps identify:

  • Slow transformations
  • Source system issues
  • Data quality problems

Dataflow Refresh History

Common metrics include:

  • Start time
  • End time
  • Duration
  • Refresh status
  • Error details

If refresh duration increases significantly over time, it may indicate:

  • Growing data volumes
  • Source performance degradation
  • Inefficient transformations

Monitoring Spark Ingestion Jobs

Spark workloads often support large-scale ingestion processes.

Monitoring includes:

  • Job execution status
  • Spark application logs
  • Resource utilization
  • Stage execution metrics

Spark Monitoring Metrics

Important metrics include:

Job Duration

Tracks overall execution time.

Executor Usage

Indicates cluster resource consumption.

Task Failures

Shows processing errors.

Data Skew

Identifies uneven partition distribution.

Shuffle Operations

Helps diagnose performance bottlenecks.
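
To make the data skew metric concrete, one simple way to quantify it is to compare the largest partition with the median partition. This is a sketch under that assumption; the row counts are made-up numbers, and the ratio is only one of several possible skew measures.

```python
from statistics import median
from typing import Sequence


def skew_ratio(partition_rows: Sequence[int]) -> float:
    """Ratio of the largest partition to the median partition size."""
    if not partition_rows:
        raise ValueError("partition_rows must not be empty")
    mid = median(partition_rows)
    if mid == 0:
        # Avoid division by zero when most partitions are empty.
        return float("inf") if max(partition_rows) > 0 else 1.0
    return max(partition_rows) / mid


balanced = [100, 100, 100, 100]
skewed = [100, 110, 90, 1000]
print(skew_ratio(balanced), skew_ratio(skewed))
```

A ratio near 1 suggests evenly sized partitions; a much larger value means one task is doing most of the work.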


Monitoring Streaming Ingestion

Streaming solutions require continuous monitoring.

Common streaming workloads include:

  • Eventstreams
  • KQL databases
  • Real-Time Intelligence
  • Spark Structured Streaming

Key Streaming Metrics

Events Ingested

Measures throughput.

Example:

  • 50,000 events per minute

Ingestion Latency

Measures delay between event creation and availability.

Lower latency generally indicates healthier streaming systems.

Failed Events

Tracks records that could not be processed.

Backlog Size

Measures unprocessed events waiting for ingestion.

Large backlogs may indicate:

  • Capacity issues
  • Slow downstream processing
  • Configuration problems
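
The latency and backlog metrics above can be expressed with a little arithmetic. The sketch below is illustrative only: the function names and the per-minute rates are assumptions, and it reuses the 50,000 events per minute figure from the throughput example.

```python
from datetime import datetime


def ingestion_latency_seconds(created: datetime, available: datetime) -> float:
    """Delay between event creation and availability for querying."""
    return (available - created).total_seconds()


def backlog_after(initial_backlog: int, arrival_per_min: int,
                  processed_per_min: int, minutes: int) -> int:
    """Project the backlog size; it never drops below zero."""
    backlog = initial_backlog
    for _ in range(minutes):
        backlog = max(0, backlog + arrival_per_min - processed_per_min)
    return backlog


# Events arrive faster than they are processed, so the backlog grows.
print(backlog_after(0, arrival_per_min=50_000,
                    processed_per_min=45_000, minutes=10))
```

Whenever the arrival rate stays above the processing rate, the backlog grows without bound, which is why a steadily rising backlog is a warning sign.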

Monitoring Eventstreams

Eventstreams provide operational monitoring capabilities.

You can monitor:

  • Incoming event volume
  • Processing status
  • Transformation performance
  • Output destinations

Common issues include:

  • Source connectivity failures
  • Event schema mismatches
  • Destination write failures

Monitoring Mirroring

Mirroring continuously replicates source data into Fabric.

Monitoring focuses on:

  • Replication status
  • Synchronization delays
  • Replication failures
  • Data freshness

Important concepts include:

Replication Latency

Time between source changes and destination availability.

Synchronization Health

Indicates whether replication remains current.


Monitoring Data Completeness

Successful execution does not always mean successful ingestion.

Data engineers should validate:

  • Expected row counts
  • File counts
  • Event counts
  • Record completeness

Example:

A pipeline succeeds but only loads 70% of expected records.

Technical execution succeeded, but business requirements were not met.


Common Validation Checks

Row Count Validation

Compare source and destination record counts.

File Validation

Verify expected files arrived.

Timestamp Validation

Confirm recent records are present.

Duplicate Detection

Identify accidental duplicate ingestion.
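
The four checks above can be sketched as small helper functions. This is a minimal example in plain Python, assuming the counts, file names, keys, and timestamps have already been collected from the source and destination; all function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def completeness_ratio(expected_rows: int, loaded_rows: int) -> float:
    """Row count validation: share of expected rows that arrived."""
    if expected_rows <= 0:
        raise ValueError("expected_rows must be positive")
    return loaded_rows / expected_rows


def missing_files(expected: list, arrived: list) -> list:
    """File validation: expected files that did not arrive."""
    return sorted(set(expected) - set(arrived))


def is_fresh(latest_record: datetime, now: datetime,
             max_age: timedelta) -> bool:
    """Timestamp validation: the newest record is recent enough."""
    return now - latest_record <= max_age


def duplicate_keys(keys: list) -> list:
    """Duplicate detection: keys that appear more than once."""
    counts = Counter(keys)
    return sorted(key for key, count in counts.items() if count > 1)


# A run can succeed technically and still load only 70% of the rows.
print(completeness_ratio(expected_rows=1000, loaded_rows=700))
```

Running checks like these after each load catches cases where the run status is Succeeded but the data is incomplete.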


Monitoring Data Quality During Ingestion

Data quality monitoring often includes:

  • Null value detection
  • Invalid data type identification
  • Duplicate record detection
  • Referential integrity checks

Monitoring quality issues early prevents downstream reporting problems.


Alerts and Notifications

Monitoring becomes significantly more effective when alerts are configured.

Common alert scenarios include:

  • Pipeline failures
  • Dataflow refresh failures
  • Long-running jobs
  • Excessive ingestion latency
  • Capacity utilization thresholds

Alerts allow engineers to respond before business users notice issues.
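
As a sketch of how such alert rules might be evaluated, the function below checks one run summary against simple thresholds. The field names (name, status, duration_s, latency_s) and the threshold values are assumptions for illustration; in Fabric, alerting is configured in the service rather than written by hand like this.

```python
def evaluate_alerts(run: dict, max_duration_s: float,
                    max_latency_s: float) -> list:
    """Return alert messages for a single run summary."""
    alerts = []
    if run.get("status") == "Failed":
        alerts.append(f"{run['name']}: run failed")
    if run.get("duration_s", 0) > max_duration_s:
        alerts.append(f"{run['name']}: long-running job")
    if run.get("latency_s", 0) > max_latency_s:
        alerts.append(f"{run['name']}: excessive ingestion latency")
    return alerts


# Hypothetical run summary that breaches two of the three rules.
run = {"name": "LoadSales", "status": "Failed",
       "duration_s": 4000, "latency_s": 10}
for message in evaluate_alerts(run, max_duration_s=3600, max_latency_s=60):
    print(message)
```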


Troubleshooting Ingestion Failures

A common troubleshooting workflow includes:

Step 1

Review Monitoring Hub status.

Step 2

Identify the failed workload.

Step 3

Inspect the detailed error message.

Step 4

Validate source connectivity.

Step 5

Verify credentials and permissions.

Step 6

Review recent schema changes.

Step 7

Rerun the ingestion process if appropriate.


Best Practices

Establish Baselines

Track normal:

  • Runtime duration
  • Throughput
  • Latency
  • Data volume

Baseline measurements make anomalies easier to identify.
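
One simple way to use a baseline is to flag values that fall far from the historical mean. The sketch below assumes a list of past run durations in seconds and a threshold of three standard deviations; both the sample data and the threshold are illustrative choices, not a Fabric feature.

```python
from statistics import mean, stdev


def is_anomalous(history: list, value: float, k: float = 3.0) -> bool:
    """True if value is more than k standard deviations from the mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    baseline = mean(history)
    spread = stdev(history)
    return abs(value - baseline) > k * spread


# Past run durations in seconds (made-up numbers).
durations = [60, 62, 58, 61, 59]
print(is_anomalous(durations, 120))  # far outside the usual range
print(is_anomalous(durations, 61))   # within the usual range
```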


Monitor Data Quality

Do not rely solely on execution success.

Validate:

  • Completeness
  • Accuracy
  • Timeliness

Use Alerts

Configure proactive notifications for:

  • Failures
  • Delays
  • Performance degradation

Retain Historical Monitoring Data

Historical execution information helps identify:

  • Trends
  • Capacity growth
  • Recurring failures

Investigate Long-Running Jobs

Increasing execution times often indicate:

  • Growing data volumes
  • Inefficient queries
  • Capacity limitations

DP-700 Exam Tips

Know the Monitoring Hub

The Monitoring Hub is the primary location for monitoring Fabric workloads.


Understand Pipeline Monitoring

Be familiar with:

  • Run history
  • Activity runs
  • Error messages
  • Execution duration

Understand Streaming Metrics

Know the importance of:

  • Throughput
  • Latency
  • Backlogs
  • Failed events

Monitor More Than Success Status

Successful execution does not guarantee complete or accurate data ingestion.


Understand Data Validation

Exam questions often focus on verifying:

  • Row counts
  • Data completeness
  • Freshness
  • Data quality

Practice Exam Questions

Question 1

Which Microsoft Fabric feature serves as the central location for monitoring pipelines, notebooks, Spark jobs, and dataflows?

A. Data Activator

B. OneLake Explorer

C. Monitoring Hub

D. Eventhouse

Answer: C

Explanation: The Monitoring Hub provides centralized monitoring across Fabric workloads and is the primary tool for reviewing execution history and failures.


Question 2

A pipeline execution completed successfully, but only half the expected records were loaded.

What should you verify first?

A. Workspace permissions

B. Data completeness and row counts

C. Capacity SKU

D. Sensitivity labels

Answer: B

Explanation: Successful execution does not guarantee successful business outcomes. Row count validation helps confirm complete ingestion.


Question 3

Which metric measures the delay between event creation and event availability in a streaming solution?

A. Throughput

B. Replication count

C. Ingestion latency

D. Refresh frequency

Answer: C

Explanation: Ingestion latency measures how quickly streaming data becomes available after being generated.


Question 4

Which issue is most likely if streaming event backlogs continue growing over time?

A. Processing cannot keep up with incoming events

B. Missing endorsement settings

C. Too many workspace roles

D. Excessive sensitivity labels

Answer: A

Explanation: Growing backlogs typically indicate that event processing is slower than event arrival rates.


Question 5

When troubleshooting a failed pipeline, what should typically be examined first?

A. Lakehouse shortcuts

B. Activity-level execution details

C. Workspace endorsements

D. Semantic model refresh schedules

Answer: B

Explanation: Activity-level details usually identify the exact source of a pipeline failure.


Question 6

Which metric is most useful for determining whether a Dataflow Gen2 refresh is becoming slower over time?

A. Sensitivity label

B. Number of workspaces

C. Refresh duration

D. Dataset owner

Answer: C

Explanation: Refresh duration directly measures execution performance and helps identify degradation trends.


Question 7

A data engineer wants to verify that every expected source file was loaded during ingestion.

Which validation approach should be used?

A. Capacity monitoring

B. File count validation

C. Role assignment review

D. Workspace auditing

Answer: B

Explanation: File count validation confirms that all expected files were ingested.


Question 8

Which Spark monitoring metric can help identify uneven partition distribution during ingestion?

A. Activity retry count

B. Replication latency

C. Refresh history

D. Data skew

Answer: D

Explanation: Data skew occurs when partitions contain significantly different amounts of data, creating processing bottlenecks.


Question 9

What is the primary purpose of configuring alerts for ingestion workloads?

A. To reduce storage costs

B. To automatically increase capacity

C. To proactively notify administrators of issues

D. To encrypt incoming data

Answer: C

Explanation: Alerts help identify failures, delays, and performance issues before they impact users.


Question 10

Which monitoring focus is most important for mirrored databases?

A. Report visual refresh time

B. Synchronization health and replication latency

C. Notebook parameter values

D. Semantic model relationships

Answer: B

Explanation: Mirroring depends on keeping source and destination systems synchronized, making replication latency and synchronization health critical monitoring metrics.



Ingest data by using pipelines (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
   --> Ingest and transform batch data
      --> Ingest data by using pipelines



Introduction

Microsoft Fabric Data Pipelines are one of the primary tools used by data engineers to ingest, move, and orchestrate data across various sources and destinations. Pipelines provide a low-code orchestration framework that allows organizations to build scalable, repeatable, and maintainable data ingestion solutions.

For the DP-700 exam, it is important to understand:

  • What pipelines are
  • Pipeline architecture and components
  • Common ingestion patterns
  • Copy Data activity
  • Data source and destination connectivity
  • Pipeline orchestration
  • Parameters and dynamic content
  • Scheduling and triggering
  • Monitoring and troubleshooting
  • Best practices for pipeline-based ingestion

What Is a Microsoft Fabric Data Pipeline?

A Data Pipeline is a workflow orchestration service within Microsoft Fabric that enables data engineers to:

  • Move data between systems
  • Schedule data ingestion
  • Execute transformation activities
  • Coordinate multiple processes
  • Automate data workflows

Pipelines are derived from the same core concepts used in Azure Data Factory and Azure Synapse Analytics, making them familiar to many data professionals.

A pipeline is essentially a container that holds one or more activities, which run in sequence or in parallel according to the dependencies defined between them.


Why Use Pipelines for Data Ingestion?

Organizations often need to ingest data from:

  • SQL Server
  • Azure SQL Database
  • Azure Blob Storage
  • Amazon S3
  • REST APIs
  • CSV files
  • Excel files
  • On-premises systems
  • Data warehouses
  • SaaS applications

Pipelines provide a centralized and scalable way to move this data into Fabric.

Benefits include:

Automation

No manual intervention required once configured.

Scalability

Handles large volumes of data efficiently.

Reusability

Pipelines can be reused across multiple ingestion scenarios.

Monitoring

Built-in execution tracking and logging.

Integration

Works with many Fabric workloads and external systems.


Pipeline Architecture

A pipeline consists of several components:

Pipeline

The overall workflow container.

Activities

Tasks performed within the pipeline.

Examples:

  • Copy Data
  • Notebook execution
  • Stored procedure execution
  • Dataflow execution
  • Variable assignment

Datasets

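Represent source or destination data structures. Note that Fabric pipelines do not use separate dataset objects as Azure Data Factory does; the source and destination details are configured directly in each activity through a connection.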

Connections

Define how the pipeline connects to external systems.

Parameters

Provide runtime flexibility.

Triggers

Determine when pipelines execute.


Common Pipeline Activities

For DP-700, understanding activities is essential.

Copy Data Activity

The most commonly used ingestion activity.

Used to:

  • Copy files
  • Move tables
  • Transfer structured data
  • Load data into Fabric destinations

Examples:

  • SQL Server → Lakehouse
  • Azure SQL → Warehouse
  • CSV → OneLake
  • Blob Storage → Lakehouse

Notebook Activity

Executes Spark notebooks.

Common uses:

  • Data transformation
  • Data cleansing
  • Machine learning processing

Dataflow Activity

Runs Dataflow Gen2 processes.

Used when:

  • Low-code transformations are preferred
  • Business users participate in data preparation

Stored Procedure Activity

Executes SQL stored procedures.

Useful for:

  • Database maintenance
  • Incremental processing
  • Metadata updates

Using the Copy Data Activity

The Copy Data activity is heavily emphasized on the DP-700 exam.

Source

Defines where data originates.

Examples:

  • SQL Database
  • Oracle
  • REST API
  • CSV File
  • Blob Storage

Destination

Defines where data is written.

Examples:

  • Lakehouse
  • Data Warehouse
  • OneLake files
  • SQL database (the Lakehouse SQL analytics endpoint is read-only, so it cannot be a copy destination)

Mapping

Maps source columns to destination columns.

Example:

Source     | Destination
CustomerID | CustomerKey
Name       | CustomerName
City       | CustomerCity
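
To show what this mapping does to a single row, here is a small sketch in plain Python. The sample row and the choice to drop unmapped columns are assumptions for illustration; the Copy Data activity applies the mapping for you.

```python
# Source-to-destination column names from the example mapping.
COLUMN_MAPPING = {
    "CustomerID": "CustomerKey",
    "Name": "CustomerName",
    "City": "CustomerCity",
}


def apply_mapping(row: dict, mapping: dict) -> dict:
    """Rename mapped columns; unmapped source columns are dropped here."""
    return {mapping[source]: value
            for source, value in row.items() if source in mapping}


source_row = {"CustomerID": 1, "Name": "Ada", "City": "Oslo", "Extra": 5}
print(apply_mapping(source_row, COLUMN_MAPPING))
```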

Data Sources Supported by Pipelines

Fabric pipelines support numerous source systems.

Common examples include:

Relational Databases

  • SQL Server
  • Azure SQL Database
  • Oracle
  • PostgreSQL
  • MySQL

File-Based Sources

  • CSV
  • JSON
  • Parquet
  • Excel

Cloud Storage

  • Azure Blob Storage
  • Azure Data Lake Storage
  • Amazon S3

Web-Based Sources

  • REST APIs
  • HTTP endpoints

Pipeline Destinations

Common destinations include:

Lakehouse

Frequently used for raw and curated data storage.

Benefits:

  • Delta format
  • Open storage
  • Spark compatibility

Data Warehouse

Ideal for structured analytical workloads.

Benefits:

  • SQL support
  • Relational design
  • High-performance reporting

OneLake Files

Used for raw file storage.


Batch Data Ingestion Patterns

The DP-700 exam focuses heavily on batch ingestion.

Full Load Pattern

Every execution loads the entire dataset.

Example:

Daily import of a 5,000-row lookup table.

Advantages:

  • Simple implementation

Disadvantages:

  • Higher processing costs
  • Longer runtimes

Incremental Load Pattern

Only new or changed records are loaded.

Example:

Import orders created since the last execution.

Advantages:

  • Faster
  • Lower costs
  • Reduced data movement

Disadvantages:

  • More complex configuration
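
A common way to implement the incremental pattern is a high-watermark: remember the latest modification time already loaded and fetch only rows newer than it. The sketch below uses in-memory lists as stand-ins for the source and destination; the modified field and the function name are assumptions for illustration.

```python
from datetime import datetime


def incremental_load(source_rows: list, destination: list,
                     watermark: datetime) -> datetime:
    """Append rows modified after the watermark; return the new watermark."""
    new_rows = [row for row in source_rows if row["modified"] > watermark]
    destination.extend(new_rows)
    if new_rows:
        return max(row["modified"] for row in new_rows)
    return watermark


source = [
    {"id": 1, "modified": datetime(2024, 1, 1)},
    {"id": 2, "modified": datetime(2024, 1, 2)},
]
destination = []

# First run loads both rows; a second run with no changes loads nothing.
watermark = incremental_load(source, destination, datetime(2023, 12, 31))
watermark = incremental_load(source, destination, watermark)
print(len(destination), watermark)
```

The extra complexity noted above comes from having to store the watermark between runs and from relying on a trustworthy modification column in the source.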

Parameterized Pipelines

Parameters make pipelines reusable.

Example parameter:

SourceTable

Pipeline executions can specify:

Customers
Orders
Products
Invoices

This allows one pipeline design to ingest many tables.

Benefits:

  • Reduced development effort
  • Easier maintenance
  • Consistent ingestion processes
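
The idea is the same as writing one function and calling it with different arguments. In the sketch below, ingest_table is a hypothetical stand-in for a single parameterized pipeline run, and the Raw destination folder is an assumed default.

```python
def ingest_table(source_table: str, destination_folder: str = "Raw") -> str:
    """Stand-in for one parameterized pipeline run; returns the target path."""
    return f"{destination_folder}/{source_table}"


# One design, many tables: only the parameter value changes per run.
tables = ["Customers", "Orders", "Products", "Invoices"]
targets = [ingest_table(table) for table in tables]
print(targets)
```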

Dynamic Content

Dynamic expressions enable runtime flexibility.

Examples:

Generate file names:

Sales_@{formatDateTime(utcnow(),'yyyyMMdd')}.csv

Generate folders:

Raw/@{formatDateTime(utcnow(),'yyyy/MM/dd')}

Use parameter values:

@pipeline().parameters.TableName

Dynamic content is commonly tested on DP-700.
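
To see what a date-based expression evaluates to, it can help to reproduce it in Python. The sketch below mirrors the folder expression above using strftime; the date-stamped file name format is an assumption chosen for this example, and the fixed timestamp stands in for utcnow().

```python
from datetime import datetime, timezone


def raw_folder(now: datetime) -> str:
    """Python equivalent of Raw/@{formatDateTime(utcnow(),'yyyy/MM/dd')}."""
    return "Raw/" + now.strftime("%Y/%m/%d")


def sales_file_name(now: datetime) -> str:
    """Date-stamped file name; the yyyyMMdd format is an assumed choice."""
    return f"Sales_{now.strftime('%Y%m%d')}.csv"


# Fixed timestamp so the result is repeatable.
run_time = datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)
print(raw_folder(run_time))
print(sales_file_name(run_time))
```

Note that the expression format yyyy/MM/dd corresponds to %Y/%m/%d in Python: uppercase MM is the month in the pipeline expression language.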


Control Flow Activities

Pipelines can include logic and branching.

If Condition

Executes different paths depending on conditions.

Example:

  • File exists → Continue
  • File missing → Send notification

Switch Activity

Handles multiple execution paths.

Example:

Process data differently based on source type.


ForEach Activity

Loops through collections.

Example:

Load 100 source tables using one pipeline.


Until Activity

Repeats execution until a condition becomes true.


Scheduling Pipelines

Pipelines commonly run on schedules.

Examples:

  • Hourly
  • Daily
  • Weekly
  • Monthly

Typical workloads:

Workload       | Schedule
Sales Data     | Hourly
ERP Data       | Daily
Financial Data | Nightly
Master Data    | Weekly

Event-Based Triggers

Instead of schedules, pipelines can run when events occur.

Examples:

  • New file arrives
  • Data source updated
  • Upstream process completed

Benefits:

  • Reduced latency
  • Faster processing
  • More responsive architecture

Monitoring Pipeline Executions

Fabric provides execution monitoring.

Data engineers can review:

Run Status

  • Succeeded
  • Failed
  • In Progress
  • Cancelled

Duration

How long the execution took.


Activity-Level Results

Identify which step failed.


Error Messages

Useful for troubleshooting.

Common issues include:

  • Authentication failures
  • Missing files
  • Schema mismatches
  • Permission problems

Error Handling

Reliable ingestion solutions require proper error handling.

Common approaches:

Retry Policies

Automatically rerun failed activities.

Logging

Record execution details.

Validation

Check data quality before loading.

Notifications

Alert administrators when failures occur.
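
The retry idea can be sketched in a few lines of Python. This is an illustration of the concept only: run_with_retry, its parameters, and the fixed wait between attempts are assumptions for this example, whereas in a pipeline you set the retry behavior in the activity's settings instead of writing code.

```python
import time


def run_with_retry(action, max_attempts: int = 3,
                   interval_s: float = 30, sleep=time.sleep):
    """Call action(); on failure wait interval_s and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(interval_s)


# Simulated activity that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_copy():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary network interruption")
    return "copied"


# Pass a no-op sleep so the demo does not actually wait.
print(run_with_retry(flaky_copy, sleep=lambda seconds: None))
```

Retries help with transient faults such as brief network interruptions; they do not fix persistent problems like expired credentials, which should raise an alert instead.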


Security Considerations

Pipeline ingestion must follow security best practices.

Secure Credentials

Use managed identities and secure connections whenever possible.

Least Privilege

Grant only required permissions.

Workspace Security

Control who can modify pipelines.

Data Governance

Apply sensitivity labels and auditing where appropriate.


Pipeline Best Practices

Use Parameterization

Avoid hardcoding values.

Build Reusable Components

Create generic ingestion pipelines.

Use Incremental Loads

When possible, reduce data movement.

Monitor Executions

Review failures proactively.

Implement Error Handling

Design for operational resilience.

Separate Environments

Maintain Dev, Test, and Production pipelines.


Pipeline vs Dataflow Gen2 vs Notebook

Understanding when to use each tool is a common exam objective.

Feature             | Pipeline  | Dataflow Gen2 | Notebook
Orchestration       | Excellent | Limited       | Limited
Data Movement       | Excellent | Good          | Good
Low-Code            | Yes       | Yes           | No
Spark Processing    | No        | No            | Yes
Complex Programming | No        | No            | Yes
Scheduling          | Excellent | Good          | Good

Use Pipelines When:

  • Moving data between systems
  • Orchestrating workflows
  • Scheduling processes
  • Managing dependencies

Use Dataflow Gen2 When:

  • Low-code transformations are required

Use Notebooks When:

  • Spark processing is needed
  • Custom Python or Scala logic is required

DP-700 Exam Tips

Remember these key points:

✓ Pipelines are primarily orchestration and data movement tools.

✓ The Copy Data activity is the most common ingestion activity.

✓ Pipelines support both scheduled and event-based execution.

✓ Parameters and dynamic expressions improve reusability.

✓ Incremental loads are preferred for large datasets.

✓ Pipelines can execute notebooks and dataflows.

✓ Monitoring and troubleshooting pipeline runs are important operational responsibilities.

✓ Control flow activities such as ForEach and If Condition are frequently used in enterprise solutions.

✓ Pipelines are generally the preferred Fabric tool for orchestrating end-to-end ingestion workflows.


Practice Exam Questions

Question 1

A data engineer needs to copy data nightly from Azure SQL Database into a Fabric Lakehouse. Which Fabric component is most appropriate?

A. Semantic Model
B. Data Pipeline
C. Dashboard
D. KQL Queryset

Correct Answer: B

Explanation:
Data Pipelines are designed for orchestrating and executing data movement activities such as copying data from Azure SQL Database into a Lakehouse.


Question 2

Which pipeline activity is primarily used to move data from a source system to a destination?

A. Notebook Activity
B. Copy Data Activity
C. If Condition Activity
D. Switch Activity

Correct Answer: B

Explanation:
The Copy Data activity is specifically designed for ingesting and transferring data between sources and destinations.


Question 3

A company wants a pipeline to process 50 tables using a single reusable workflow. Which feature should be implemented?

A. Data Warehouse
B. OneLake Shortcut
C. Parameters
D. Mirroring

Correct Answer: C

Explanation:
Parameters allow a pipeline to accept table names and other runtime values, making the solution reusable.


Question 4

Which control flow activity is used to repeatedly process a collection of items?

A. ForEach
B. Wait
C. Lookup
D. If Condition

Correct Answer: A

Explanation:
The ForEach activity iterates through collections and executes activities for each item.


Question 5

A data engineer wants a pipeline to run automatically every night at midnight. What should be configured?

A. Sensitivity Label
B. Scheduled Trigger
C. Dataflow Refresh Policy
D. Lakehouse Shortcut

Correct Answer: B

Explanation:
Scheduled triggers are used to execute pipelines at predefined times.


Question 6

Which Fabric destination is most commonly used for storing raw and curated Delta tables?

A. Lakehouse
B. Dashboard
C. Workspace Role
D. Semantic Model

Correct Answer: A

Explanation:
Lakehouses provide Delta Lake storage and are commonly used as ingestion targets.


Question 7

A pipeline should execute only when a new file arrives in storage. What should be used?

A. Manual Execution
B. Incremental Refresh
C. Event-Based Trigger
D. Full Load

Correct Answer: C

Explanation:
Event-based triggers allow pipelines to start when specific events occur, such as file creation.


Question 8

Which statement about incremental loading is correct?

A. It reloads all records every execution.
B. It loads only new or changed records.
C. It requires deleting the destination table first.
D. It cannot be implemented in pipelines.

Correct Answer: B

Explanation:
Incremental loading minimizes processing by transferring only new or modified data.


Question 9

A data engineer needs to execute custom PySpark transformation logic as part of a pipeline. Which activity should be used?

A. Copy Data Activity
B. If Condition Activity
C. Stored Procedure Activity
D. Notebook Activity

Correct Answer: D

Explanation:
Notebook activities allow execution of Spark notebooks containing custom Python, Scala, SQL, or Spark code.


Question 10

A pipeline execution fails due to a temporary network interruption. Which design practice can help improve reliability?

A. Use dashboard subscriptions
B. Apply endorsement labels
C. Configure retry policies
D. Disable monitoring

Correct Answer: C

Explanation:
Retry policies automatically reattempt failed activities and are a key best practice for building resilient ingestion pipelines.

