Handle duplicate, missing, and late-arriving data (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
   --> Ingest and transform batch data
      --> Handle duplicate, missing, and late-arriving data


Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

One of the most important responsibilities of a data engineer is ensuring data quality. In real-world data environments, source systems frequently produce data issues that can negatively impact reporting, analytics, machine learning, and business decision-making.

Three of the most common data quality challenges are:

  • Duplicate data
  • Missing data
  • Late-arriving data

Microsoft Fabric provides multiple tools for addressing these challenges, including:

  • Dataflows Gen2
  • Data Pipelines
  • Lakehouses
  • Data Warehouses
  • PySpark Notebooks
  • SQL
  • Delta Lake tables

For the DP-700 exam, you should understand:

  • How duplicate data occurs
  • How to identify and remove duplicates
  • How to handle missing values
  • Strategies for managing late-arriving facts and dimensions
  • Data quality best practices in Fabric
  • Tools and techniques available across Fabric workloads

Understanding Duplicate Data

Duplicate data occurs when the same logical record appears more than once.

Example:

CustomerIDNameEmail
1001John Smithjohn@email.com
1001John Smithjohn@email.com

Duplicates can occur due to:

  • Multiple source system loads
  • Pipeline reruns
  • Source application errors
  • Data integration mistakes
  • Incomplete change tracking

Duplicate records can lead to:

  • Inflated sales totals
  • Incorrect customer counts
  • Poor analytics results
  • Data integrity issues

Types of Duplicates

Exact Duplicates

Every field is identical.

Example:

OrderIDAmount
5001250
5001250

These are usually the easiest to detect.


Partial Duplicates

Records represent the same entity but differ slightly.

Example:

CustomerIDName
1001John Smith
1001Jon Smith

These may require business rules to resolve.


Detecting Duplicate Records Using SQL

A common approach is grouping records and counting occurrences.

Example:

SELECT
CustomerID,
COUNT(*) AS RecordCount
FROM Customers
GROUP BY CustomerID
HAVING COUNT(*) > 1;

This identifies customers appearing multiple times.


Removing Duplicates Using SQL

A common technique uses ROW_NUMBER().

Example:

WITH RankedRows AS
(
SELECT *,
ROW_NUMBER() OVER
(
PARTITION BY CustomerID
ORDER BY LastModifiedDate DESC
) AS RowNum
FROM Customers
)
SELECT *
FROM RankedRows
WHERE RowNum = 1;

This keeps only the most recent record for each customer.


Handling Duplicates Using PySpark

PySpark provides the dropDuplicates() method.

Example:

customers_df = customers_df.dropDuplicates(["CustomerID"])

This removes duplicate records based on CustomerID.

For multiple columns:

customers_df = customers_df.dropDuplicates(
["CustomerID","Email"]
)

This is frequently used during Lakehouse ingestion processes.


Delta Lake and Duplicate Prevention

Delta Lake tables support MERGE operations that help prevent duplicates.

Example:

MERGE INTO Customers AS Target
USING StagingCustomers AS Source
ON Target.CustomerID = Source.CustomerID
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *

Benefits include:

  • Upserts
  • Incremental loads
  • Duplicate prevention
  • Simplified ETL logic

This is a common Fabric pattern.


Understanding Missing Data

Missing data occurs when expected values are absent.

Example:

CustomerIDNameEmail
1001John SmithNULL

Missing values may result from:

  • Source system limitations
  • User entry errors
  • Integration failures
  • Incomplete records

Why Missing Data Matters

Missing values can cause:

  • Reporting inaccuracies
  • Failed transformations
  • Machine learning issues
  • Data quality concerns

Example:

Calculating average salary becomes unreliable if many salary values are missing.


Identifying Missing Values

SQL example:

SELECT *
FROM Customers
WHERE Email IS NULL;

PySpark example:

customers_df.filter(
customers_df.Email.isNull()
)

Handling Missing Data

Several approaches exist.

Option 1: Replace Missing Values

Example:

SELECT
CustomerID,
COALESCE(Email,'Unknown')
FROM Customers;

PySpark:

df.fillna("Unknown")

Useful when business rules permit default values.


Option 2: Exclude Records

Example:

SELECT *
FROM Customers
WHERE Email IS NOT NULL;

Useful when incomplete records cannot be trusted.


Option 3: Use Business Defaults

Example:

Missing ValueReplacement
CountryUnknown
Discount0
StatusPending

Common in dimensional modeling.


Dataflows Gen2 and Missing Data

Dataflows Gen2 provides low-code capabilities for:

  • Replacing null values
  • Filtering null rows
  • Conditional column creation
  • Data validation

These transformations are frequently used during ingestion and cleansing processes.


Understanding Late-Arriving Data

Late-arriving data refers to records that arrive after the expected processing window.

Example:

An order occurs on January 1st but is not received until January 5th.


Causes of Late-Arriving Data

Common causes include:

  • Network interruptions
  • Application outages
  • Batch processing delays
  • Replication latency
  • Manual data entry

Late-arriving data is extremely common in enterprise systems.


Late-Arriving Facts

A late-arriving fact occurs when a transaction arrives after related reporting periods have already been processed.

Example:

OrderDateArrival Date
Jan 1Jan 5

The order belongs to January 1 but arrives later.

Challenges include:

  • Incorrect historical totals
  • Incomplete reporting
  • Fact table inconsistencies

Handling Late-Arriving Facts

Common approaches include:

Reprocessing Historical Periods

Reload affected partitions.

Example:

Reload the previous 7 days during each load.

Benefits:

  • Simple implementation
  • Common Fabric approach

Watermark-Based Processing

Track the latest processed timestamp.

Example:

WHERE LastModifiedDate >
@LastProcessedTimestamp

This supports efficient incremental loads.


Sliding Window Loads

Instead of loading only today’s data:

Load the last:

  • 3 days
  • 7 days
  • 30 days

This helps capture delayed records.


Late-Arriving Dimensions

A late-arriving dimension occurs when a fact record arrives before its related dimension record.

Example:

Fact arrives:

CustomerIDSales
1001500

Dimension record for CustomerID 1001 has not yet arrived.


Handling Late-Arriving Dimensions

Create an Unknown Member

Dimension table:

CustomerKeyCustomerName
-1Unknown

Fact records temporarily reference the unknown member.

When the actual dimension arrives:

  • Update dimension
  • Correct foreign key relationships

This is a common data warehouse design pattern.


Retry Processing

Pause loading until dimensions arrive.

Useful when:

  • Data quality requirements are strict
  • Fact loads depend on dimension availability

Data Quality Monitoring

Fabric solutions should include monitoring for:

  • Duplicate counts
  • Null counts
  • Data freshness
  • Late-arriving records
  • Invalid keys

Typical monitoring techniques include:

  • SQL validation queries
  • Notebook checks
  • Pipeline activities
  • Custom logging tables

Common Fabric Tools for Data Quality

ToolUse Case
Dataflows Gen2Data cleansing
PipelinesOrchestration and validation
PySparkLarge-scale deduplication
SQLValidation and quality checks
Delta TablesUpserts and merge operations
WarehouseData quality reporting

Best Practices

Prevent Duplicates Early

Validate source data before loading.


Use Incremental Loading

Avoid repeatedly processing entire datasets.


Implement MERGE Operations

Upserts help prevent duplicate records.


Monitor Null Values

Track trends in missing data over time.


Create Unknown Dimension Members

Support late-arriving dimensions.


Use Sliding Window Loads

Capture delayed records efficiently.


Maintain Audit Columns

Examples:

CreatedDate
ModifiedDate
LoadDate
SourceSystem
BatchID

These assist with troubleshooting and recovery.


DP-700 Exam Tips

Remember these key concepts:

  • Duplicates can be identified using GROUP BY and COUNT().
  • ROW_NUMBER() is commonly used for deduplication.
  • PySpark supports deduplication through dropDuplicates().
  • Delta Lake MERGE operations support upserts and duplicate prevention.
  • Missing values can be replaced, filtered, or assigned defaults.
  • COALESCE() is commonly used in SQL to handle NULL values.
  • Late-arriving facts often require partition reprocessing or sliding windows.
  • Late-arriving dimensions are commonly handled using Unknown members.
  • Incremental loading and watermarking help manage delayed data efficiently.
  • Data quality monitoring is an essential responsibility of data engineers.

Practice Exam Questions

Question 1

A data engineer wants to identify duplicate customer records in a Fabric Warehouse.

Which SQL technique should be used?

A. GROUP BY with COUNT()

B. ORDER BY

C. DISTINCT

D. UNION

Correct Answer: A

Explanation: GROUP BY combined with COUNT() can identify keys that appear multiple times, making it a common duplicate detection technique.


Question 2

Which PySpark method is specifically designed to remove duplicate records?

A. filter()

B. collect()

C. sort()

D. dropDuplicates()

Correct Answer: D

Explanation: dropDuplicates() removes duplicate records based on one or more specified columns.


Question 3

A data engineer wants to keep only the most recent customer record when duplicates exist.

Which SQL function is most appropriate?

A. SUM()

B. MAX()

C. ROW_NUMBER()

D. COUNT()

Correct Answer: C

Explanation: ROW_NUMBER() allows ranking records within groups so the preferred record can be retained.


Question 4

Which SQL function is commonly used to replace NULL values with a default value?

A. CAST()

B. CONCAT()

C. GROUP BY

D. COALESCE()

Correct Answer: D

Explanation: COALESCE() returns the first non-NULL value and is frequently used to replace missing data.


Question 5

A sales transaction occurred on January 1 but was not received until January 5.

What type of issue is this?

A. Duplicate fact

B. Missing dimension

C. Late-arriving fact

D. Slowly changing dimension

Correct Answer: C

Explanation: The transaction belongs to an earlier business date but arrives after the expected processing window.


Question 6

Which loading strategy helps capture records that arrive several days late?

A. Full reload every hour

B. Sliding window load

C. Static partitioning

D. Manual refresh

Correct Answer: B

Explanation: Sliding window loads reprocess recent periods, allowing delayed records to be captured.


Question 7

A fact record arrives before its related dimension record.

What is this situation called?

A. Incremental loading

B. Duplicate key processing

C. Data drift

D. Late-arriving dimension

Correct Answer: D

Explanation: A late-arriving dimension scenario occurs when fact data is received before the associated dimension data.


Question 8

What is the most common solution for handling late-arriving dimensions in a dimensional model?

A. Delete the fact record

B. Skip the load permanently

C. Create an Unknown member in the dimension table

D. Disable incremental processing

Correct Answer: C

Explanation: An Unknown member allows fact records to load while preserving referential integrity until the dimension arrives.


Question 9

Which Delta Lake operation is commonly used to perform inserts and updates while minimizing duplicate records?

A. DELETE

B. MERGE

C. TRUNCATE

D. CREATE VIEW

Correct Answer: B

Explanation: MERGE supports upsert operations and is widely used for incremental loading and deduplication.


Question 10

Which column is most useful when implementing watermark-based incremental loads?

A. LastModifiedDate

B. Region

C. Category

D. ProductName

Correct Answer: A

Explanation: Watermarking relies on timestamps such as LastModifiedDate to identify new or changed records since the previous load.


Go to the DP-700 Exam Prep Hub main page.

Leave a comment