This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
   --> Ingest and transform batch data
      --> Handle duplicate, missing, and late-arriving data

Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

One of the most important responsibilities of a data engineer is ensuring data quality. In real-world data environments, source systems frequently produce data issues that can negatively impact reporting, analytics, machine learning, and business decision-making.

Three of the most common data quality challenges are:

Duplicate data
Missing data
Late-arriving data

Microsoft Fabric provides multiple tools for addressing these challenges, including:

Dataflows Gen2
Data Pipelines
Lakehouses
Data Warehouses
PySpark Notebooks
SQL
Delta Lake tables

For the DP-700 exam, you should understand:

How duplicate data occurs
How to identify and remove duplicates
How to handle missing values
Strategies for managing late-arriving facts and dimensions
Data quality best practices in Fabric
Tools and techniques available across Fabric workloads

Understanding Duplicate Data

Duplicate data occurs when the same logical record appears more than once.

Example:

CustomerID	Name	Email
1001	John Smith	john@email.com
1001	John Smith	john@email.com

Duplicates can occur due to:

Multiple source system loads
Pipeline reruns
Source application errors
Data integration mistakes
Incomplete change tracking

Duplicate records can lead to:

Inflated sales totals
Incorrect customer counts
Poor analytics results
Data integrity issues

Types of Duplicates

Exact Duplicates

Every field is identical.

Example:

OrderID	Amount
5001	250
5001	250

These are usually the easiest to detect.

Partial Duplicates

Records represent the same entity but differ slightly.

Example:

CustomerID	Name
1001	John Smith
1001	Jon Smith

These may require business rules to resolve.

Detecting Duplicate Records Using SQL

A common approach is grouping records and counting occurrences.

Example:

			
SELECT
    CustomerID,
    COUNT(*) AS RecordCount
FROM Customers
GROUP BY CustomerID
HAVING COUNT(*) > 1;

		

This identifies customers appearing multiple times.

Removing Duplicates Using SQL

A common technique uses ROW_NUMBER().

Example:

			
WITH RankedRows AS
(
    SELECT *,
           ROW_NUMBER() OVER
           (
             PARTITION BY CustomerID
             ORDER BY LastModifiedDate DESC
           ) AS RowNum
    FROM Customers
)
SELECT *
FROM RankedRows
WHERE RowNum = 1;

		

This keeps only the most recent record for each customer.

Handling Duplicates Using PySpark

PySpark provides the dropDuplicates() method.

Example:

customers_df = customers_df.dropDuplicates(["CustomerID"])

This removes duplicate records based on CustomerID.

For multiple columns:

			
customers_df = customers_df.dropDuplicates(
    ["CustomerID","Email"]
)

This is frequently used during Lakehouse ingestion processes.

Delta Lake and Duplicate Prevention

Delta Lake tables support MERGE operations that help prevent duplicates.

Example:

			
MERGE INTO Customers AS Target
USING StagingCustomers AS Source
ON Target.CustomerID = Source.CustomerID
WHEN MATCHED THEN
    UPDATE SET *
WHEN NOT MATCHED THEN
    INSERT *

		

Benefits include:

Upserts
Incremental loads
Duplicate prevention
Simplified ETL logic

This is a common Fabric pattern.

Understanding Missing Data

Missing data occurs when expected values are absent.

Example:

CustomerID	Name	Email
1001	John Smith	NULL

Missing values may result from:

Source system limitations
User entry errors
Integration failures
Incomplete records

Why Missing Data Matters

Missing values can cause:

Reporting inaccuracies
Failed transformations
Machine learning issues
Data quality concerns

Example:

Calculating average salary becomes unreliable if many salary values are missing.

Identifying Missing Values

SQL example:

			
SELECT *
FROM Customers
WHERE Email IS NULL;

PySpark example:

			
customers_df.filter(
    customers_df.Email.isNull()
)

Handling Missing Data

Several approaches exist.

Option 1: Replace Missing Values

Example:

			
SELECT
    CustomerID,
    COALESCE(Email,'Unknown')
FROM Customers;

PySpark:

df.fillna("Unknown")

Useful when business rules permit default values.

Option 2: Exclude Records

Example:

			
SELECT *
FROM Customers
WHERE Email IS NOT NULL;

Useful when incomplete records cannot be trusted.

Option 3: Use Business Defaults

Example:

Missing Value	Replacement
Country	Unknown
Discount	0
Status	Pending

Common in dimensional modeling.

Dataflows Gen2 and Missing Data

Dataflows Gen2 provides low-code capabilities for:

Replacing null values
Filtering null rows
Conditional column creation
Data validation

These transformations are frequently used during ingestion and cleansing processes.

Understanding Late-Arriving Data

Late-arriving data refers to records that arrive after the expected processing window.

Example:

An order occurs on January 1st but is not received until January 5th.

Causes of Late-Arriving Data

Common causes include:

Network interruptions
Application outages
Batch processing delays
Replication latency
Manual data entry

Late-arriving data is extremely common in enterprise systems.

Late-Arriving Facts

A late-arriving fact occurs when a transaction arrives after related reporting periods have already been processed.

Example:

OrderDate	Arrival Date
Jan 1	Jan 5

The order belongs to January 1 but arrives later.

Challenges include:

Incorrect historical totals
Incomplete reporting
Fact table inconsistencies

Handling Late-Arriving Facts

Common approaches include:

Reprocessing Historical Periods

Reload affected partitions.

Example:

Reload the previous 7 days during each load.

Benefits:

Simple implementation
Common Fabric approach

Watermark-Based Processing

Track the latest processed timestamp.

Example:

			
WHERE LastModifiedDate >
      @LastProcessedTimestamp

This supports efficient incremental loads.

Sliding Window Loads

Instead of loading only today’s data:

Load the last:

3 days
7 days
30 days

This helps capture delayed records.

Late-Arriving Dimensions

A late-arriving dimension occurs when a fact record arrives before its related dimension record.

Example:

Fact arrives:

CustomerID	Sales
1001	500

Dimension record for CustomerID 1001 has not yet arrived.

Handling Late-Arriving Dimensions

Create an Unknown Member

Dimension table:

CustomerKey	CustomerName
-1	Unknown

Fact records temporarily reference the unknown member.

When the actual dimension arrives:

Update dimension
Correct foreign key relationships

This is a common data warehouse design pattern.

Retry Processing

Pause loading until dimensions arrive.

Useful when:

Data quality requirements are strict
Fact loads depend on dimension availability

Data Quality Monitoring

Fabric solutions should include monitoring for:

Duplicate counts
Null counts
Data freshness
Late-arriving records
Invalid keys

Typical monitoring techniques include:

SQL validation queries
Notebook checks
Pipeline activities
Custom logging tables

Common Fabric Tools for Data Quality

Tool	Use Case
Dataflows Gen2	Data cleansing
Pipelines	Orchestration and validation
PySpark	Large-scale deduplication
SQL	Validation and quality checks
Delta Tables	Upserts and merge operations
Warehouse	Data quality reporting

Best Practices

Prevent Duplicates Early

Validate source data before loading.

Use Incremental Loading

Avoid repeatedly processing entire datasets.

Implement MERGE Operations

Upserts help prevent duplicate records.

Monitor Null Values

Track trends in missing data over time.

Create Unknown Dimension Members

Support late-arriving dimensions.

Use Sliding Window Loads

Capture delayed records efficiently.

Maintain Audit Columns

Examples:

			
CreatedDate
ModifiedDate
LoadDate
SourceSystem
BatchID

		

These assist with troubleshooting and recovery.

DP-700 Exam Tips

Remember these key concepts:

Duplicates can be identified using GROUP BY and COUNT().
ROW_NUMBER() is commonly used for deduplication.
PySpark supports deduplication through dropDuplicates().
Delta Lake MERGE operations support upserts and duplicate prevention.
Missing values can be replaced, filtered, or assigned defaults.
COALESCE() is commonly used in SQL to handle NULL values.
Late-arriving facts often require partition reprocessing or sliding windows.
Late-arriving dimensions are commonly handled using Unknown members.
Incremental loading and watermarking help manage delayed data efficiently.
Data quality monitoring is an essential responsibility of data engineers.

Practice Exam Questions

Question 1

A data engineer wants to identify duplicate customer records in a Fabric Warehouse.

Which SQL technique should be used?

A. GROUP BY with COUNT()

B. ORDER BY

C. DISTINCT

D. UNION

Correct Answer: A

Explanation: GROUP BY combined with COUNT() can identify keys that appear multiple times, making it a common duplicate detection technique.

Question 2

Which PySpark method is specifically designed to remove duplicate records?

A. filter()

B. collect()

C. sort()

D. dropDuplicates()

Correct Answer: D

Explanation: dropDuplicates() removes duplicate records based on one or more specified columns.

Question 3

A data engineer wants to keep only the most recent customer record when duplicates exist.

Which SQL function is most appropriate?

A. SUM()

B. MAX()

C. ROW_NUMBER()

D. COUNT()

Correct Answer: C

Explanation: ROW_NUMBER() allows ranking records within groups so the preferred record can be retained.

Question 4

Which SQL function is commonly used to replace NULL values with a default value?

A. CAST()

B. CONCAT()

C. GROUP BY

D. COALESCE()

Correct Answer: D

Explanation: COALESCE() returns the first non-NULL value and is frequently used to replace missing data.

Question 5

A sales transaction occurred on January 1 but was not received until January 5.

What type of issue is this?

A. Duplicate fact

B. Missing dimension

C. Late-arriving fact

D. Slowly changing dimension

Correct Answer: C

Explanation: The transaction belongs to an earlier business date but arrives after the expected processing window.

Question 6

Which loading strategy helps capture records that arrive several days late?

A. Full reload every hour

B. Sliding window load

C. Static partitioning

D. Manual refresh

Correct Answer: B

Explanation: Sliding window loads reprocess recent periods, allowing delayed records to be captured.

Question 7

A fact record arrives before its related dimension record.

What is this situation called?

A. Incremental loading

B. Duplicate key processing

C. Data drift

D. Late-arriving dimension

Correct Answer: D

Explanation: A late-arriving dimension scenario occurs when fact data is received before the associated dimension data.

Question 8

What is the most common solution for handling late-arriving dimensions in a dimensional model?

A. Delete the fact record

B. Skip the load permanently

C. Create an Unknown member in the dimension table

D. Disable incremental processing

Correct Answer: C

Explanation: An Unknown member allows fact records to load while preserving referential integrity until the dimension arrives.

Question 9

Which Delta Lake operation is commonly used to perform inserts and updates while minimizing duplicate records?

A. DELETE

B. MERGE

C. TRUNCATE

D. CREATE VIEW

Correct Answer: B

Explanation: MERGE supports upsert operations and is widely used for incremental loading and deduplication.

Question 10

Which column is most useful when implementing watermark-based incremental loads?

A. LastModifiedDate

B. Region

C. Category

D. ProductName

Correct Answer: A

Explanation: Watermarking relies on timestamps such as LastModifiedDate to identify new or changed records since the previous load.

Go to the DP-700 Exam Prep Hub main page.

Introduction

Understanding Duplicate Data

Types of Duplicates

Exact Duplicates

Partial Duplicates

Detecting Duplicate Records Using SQL

Removing Duplicates Using SQL

Handling Duplicates Using PySpark

Delta Lake and Duplicate Prevention

Understanding Missing Data

Why Missing Data Matters

Identifying Missing Values

Handling Missing Data

Option 1: Replace Missing Values

Option 2: Exclude Records

Option 3: Use Business Defaults

Dataflows Gen2 and Missing Data

Understanding Late-Arriving Data

Causes of Late-Arriving Data

Late-Arriving Facts

Handling Late-Arriving Facts

Reprocessing Historical Periods

Watermark-Based Processing

Sliding Window Loads

Late-Arriving Dimensions

Handling Late-Arriving Dimensions

Create an Unknown Member

Retry Processing

Data Quality Monitoring

Common Fabric Tools for Data Quality

Best Practices

Prevent Duplicates Early

Use Incremental Loading

Implement MERGE Operations

Monitor Null Values

Create Unknown Dimension Members

Use Sliding Window Loads

Maintain Audit Columns

DP-700 Exam Tips

Practice Exam Questions

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8

Question 9

Question 10

Share this:

Related

Leave a comment Cancel reply

Information and resources for the data professionals' community