This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
--> Ingest and transform batch data
--> Handle duplicate, missing, and late-arriving data
Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
One of the most important responsibilities of a data engineer is ensuring data quality. In real-world data environments, source systems frequently produce data issues that can negatively impact reporting, analytics, machine learning, and business decision-making.
Three of the most common data quality challenges are:
- Duplicate data
- Missing data
- Late-arriving data
Microsoft Fabric provides multiple tools for addressing these challenges, including:
- Dataflows Gen2
- Data Pipelines
- Lakehouses
- Data Warehouses
- PySpark Notebooks
- SQL
- Delta Lake tables
For the DP-700 exam, you should understand:
- How duplicate data occurs
- How to identify and remove duplicates
- How to handle missing values
- Strategies for managing late-arriving facts and dimensions
- Data quality best practices in Fabric
- Tools and techniques available across Fabric workloads
Understanding Duplicate Data
Duplicate data occurs when the same logical record appears more than once.
Example:
| CustomerID | Name | |
|---|---|---|
| 1001 | John Smith | john@email.com |
| 1001 | John Smith | john@email.com |
Duplicates can occur due to:
- Multiple source system loads
- Pipeline reruns
- Source application errors
- Data integration mistakes
- Incomplete change tracking
Duplicate records can lead to:
- Inflated sales totals
- Incorrect customer counts
- Poor analytics results
- Data integrity issues
Types of Duplicates
Exact Duplicates
Every field is identical.
Example:
| OrderID | Amount |
|---|---|
| 5001 | 250 |
| 5001 | 250 |
These are usually the easiest to detect.
Partial Duplicates
Records represent the same entity but differ slightly.
Example:
| CustomerID | Name |
|---|---|
| 1001 | John Smith |
| 1001 | Jon Smith |
These may require business rules to resolve.
Detecting Duplicate Records Using SQL
A common approach is grouping records and counting occurrences.
Example:
SELECT CustomerID, COUNT(*) AS RecordCountFROM CustomersGROUP BY CustomerIDHAVING COUNT(*) > 1;
This identifies customers appearing multiple times.
Removing Duplicates Using SQL
A common technique uses ROW_NUMBER().
Example:
WITH RankedRows AS( SELECT *, ROW_NUMBER() OVER ( PARTITION BY CustomerID ORDER BY LastModifiedDate DESC ) AS RowNum FROM Customers)SELECT *FROM RankedRowsWHERE RowNum = 1;
This keeps only the most recent record for each customer.
Handling Duplicates Using PySpark
PySpark provides the dropDuplicates() method.
Example:
customers_df = customers_df.dropDuplicates(["CustomerID"])
This removes duplicate records based on CustomerID.
For multiple columns:
customers_df = customers_df.dropDuplicates( ["CustomerID","Email"])
This is frequently used during Lakehouse ingestion processes.
Delta Lake and Duplicate Prevention
Delta Lake tables support MERGE operations that help prevent duplicates.
Example:
MERGE INTO Customers AS TargetUSING StagingCustomers AS SourceON Target.CustomerID = Source.CustomerIDWHEN MATCHED THEN UPDATE SET *WHEN NOT MATCHED THEN INSERT *
Benefits include:
- Upserts
- Incremental loads
- Duplicate prevention
- Simplified ETL logic
This is a common Fabric pattern.
Understanding Missing Data
Missing data occurs when expected values are absent.
Example:
| CustomerID | Name | |
|---|---|---|
| 1001 | John Smith | NULL |
Missing values may result from:
- Source system limitations
- User entry errors
- Integration failures
- Incomplete records
Why Missing Data Matters
Missing values can cause:
- Reporting inaccuracies
- Failed transformations
- Machine learning issues
- Data quality concerns
Example:
Calculating average salary becomes unreliable if many salary values are missing.
Identifying Missing Values
SQL example:
SELECT *FROM CustomersWHERE Email IS NULL;
PySpark example:
customers_df.filter( customers_df.Email.isNull())
Handling Missing Data
Several approaches exist.
Option 1: Replace Missing Values
Example:
SELECT CustomerID, COALESCE(Email,'Unknown')FROM Customers;
PySpark:
df.fillna("Unknown")
Useful when business rules permit default values.
Option 2: Exclude Records
Example:
SELECT *FROM CustomersWHERE Email IS NOT NULL;
Useful when incomplete records cannot be trusted.
Option 3: Use Business Defaults
Example:
| Missing Value | Replacement |
|---|---|
| Country | Unknown |
| Discount | 0 |
| Status | Pending |
Common in dimensional modeling.
Dataflows Gen2 and Missing Data
Dataflows Gen2 provides low-code capabilities for:
- Replacing null values
- Filtering null rows
- Conditional column creation
- Data validation
These transformations are frequently used during ingestion and cleansing processes.
Understanding Late-Arriving Data
Late-arriving data refers to records that arrive after the expected processing window.
Example:
An order occurs on January 1st but is not received until January 5th.
Causes of Late-Arriving Data
Common causes include:
- Network interruptions
- Application outages
- Batch processing delays
- Replication latency
- Manual data entry
Late-arriving data is extremely common in enterprise systems.
Late-Arriving Facts
A late-arriving fact occurs when a transaction arrives after related reporting periods have already been processed.
Example:
| OrderDate | Arrival Date |
|---|---|
| Jan 1 | Jan 5 |
The order belongs to January 1 but arrives later.
Challenges include:
- Incorrect historical totals
- Incomplete reporting
- Fact table inconsistencies
Handling Late-Arriving Facts
Common approaches include:
Reprocessing Historical Periods
Reload affected partitions.
Example:
Reload the previous 7 days during each load.
Benefits:
- Simple implementation
- Common Fabric approach
Watermark-Based Processing
Track the latest processed timestamp.
Example:
WHERE LastModifiedDate > @LastProcessedTimestamp
This supports efficient incremental loads.
Sliding Window Loads
Instead of loading only today’s data:
Load the last:
- 3 days
- 7 days
- 30 days
This helps capture delayed records.
Late-Arriving Dimensions
A late-arriving dimension occurs when a fact record arrives before its related dimension record.
Example:
Fact arrives:
| CustomerID | Sales |
|---|---|
| 1001 | 500 |
Dimension record for CustomerID 1001 has not yet arrived.
Handling Late-Arriving Dimensions
Create an Unknown Member
Dimension table:
| CustomerKey | CustomerName |
|---|---|
| -1 | Unknown |
Fact records temporarily reference the unknown member.
When the actual dimension arrives:
- Update dimension
- Correct foreign key relationships
This is a common data warehouse design pattern.
Retry Processing
Pause loading until dimensions arrive.
Useful when:
- Data quality requirements are strict
- Fact loads depend on dimension availability
Data Quality Monitoring
Fabric solutions should include monitoring for:
- Duplicate counts
- Null counts
- Data freshness
- Late-arriving records
- Invalid keys
Typical monitoring techniques include:
- SQL validation queries
- Notebook checks
- Pipeline activities
- Custom logging tables
Common Fabric Tools for Data Quality
| Tool | Use Case |
|---|---|
| Dataflows Gen2 | Data cleansing |
| Pipelines | Orchestration and validation |
| PySpark | Large-scale deduplication |
| SQL | Validation and quality checks |
| Delta Tables | Upserts and merge operations |
| Warehouse | Data quality reporting |
Best Practices
Prevent Duplicates Early
Validate source data before loading.
Use Incremental Loading
Avoid repeatedly processing entire datasets.
Implement MERGE Operations
Upserts help prevent duplicate records.
Monitor Null Values
Track trends in missing data over time.
Create Unknown Dimension Members
Support late-arriving dimensions.
Use Sliding Window Loads
Capture delayed records efficiently.
Maintain Audit Columns
Examples:
CreatedDateModifiedDateLoadDateSourceSystemBatchID
These assist with troubleshooting and recovery.
DP-700 Exam Tips
Remember these key concepts:
- Duplicates can be identified using GROUP BY and COUNT().
- ROW_NUMBER() is commonly used for deduplication.
- PySpark supports deduplication through dropDuplicates().
- Delta Lake MERGE operations support upserts and duplicate prevention.
- Missing values can be replaced, filtered, or assigned defaults.
- COALESCE() is commonly used in SQL to handle NULL values.
- Late-arriving facts often require partition reprocessing or sliding windows.
- Late-arriving dimensions are commonly handled using Unknown members.
- Incremental loading and watermarking help manage delayed data efficiently.
- Data quality monitoring is an essential responsibility of data engineers.
Practice Exam Questions
Question 1
A data engineer wants to identify duplicate customer records in a Fabric Warehouse.
Which SQL technique should be used?
A. GROUP BY with COUNT()
B. ORDER BY
C. DISTINCT
D. UNION
Correct Answer: A
Explanation: GROUP BY combined with COUNT() can identify keys that appear multiple times, making it a common duplicate detection technique.
Question 2
Which PySpark method is specifically designed to remove duplicate records?
A. filter()
B. collect()
C. sort()
D. dropDuplicates()
Correct Answer: D
Explanation: dropDuplicates() removes duplicate records based on one or more specified columns.
Question 3
A data engineer wants to keep only the most recent customer record when duplicates exist.
Which SQL function is most appropriate?
A. SUM()
B. MAX()
C. ROW_NUMBER()
D. COUNT()
Correct Answer: C
Explanation: ROW_NUMBER() allows ranking records within groups so the preferred record can be retained.
Question 4
Which SQL function is commonly used to replace NULL values with a default value?
A. CAST()
B. CONCAT()
C. GROUP BY
D. COALESCE()
Correct Answer: D
Explanation: COALESCE() returns the first non-NULL value and is frequently used to replace missing data.
Question 5
A sales transaction occurred on January 1 but was not received until January 5.
What type of issue is this?
A. Duplicate fact
B. Missing dimension
C. Late-arriving fact
D. Slowly changing dimension
Correct Answer: C
Explanation: The transaction belongs to an earlier business date but arrives after the expected processing window.
Question 6
Which loading strategy helps capture records that arrive several days late?
A. Full reload every hour
B. Sliding window load
C. Static partitioning
D. Manual refresh
Correct Answer: B
Explanation: Sliding window loads reprocess recent periods, allowing delayed records to be captured.
Question 7
A fact record arrives before its related dimension record.
What is this situation called?
A. Incremental loading
B. Duplicate key processing
C. Data drift
D. Late-arriving dimension
Correct Answer: D
Explanation: A late-arriving dimension scenario occurs when fact data is received before the associated dimension data.
Question 8
What is the most common solution for handling late-arriving dimensions in a dimensional model?
A. Delete the fact record
B. Skip the load permanently
C. Create an Unknown member in the dimension table
D. Disable incremental processing
Correct Answer: C
Explanation: An Unknown member allows fact records to load while preserving referential integrity until the dimension arrives.
Question 9
Which Delta Lake operation is commonly used to perform inserts and updates while minimizing duplicate records?
A. DELETE
B. MERGE
C. TRUNCATE
D. CREATE VIEW
Correct Answer: B
Explanation: MERGE supports upsert operations and is widely used for incremental loading and deduplication.
Question 10
Which column is most useful when implementing watermark-based incremental loads?
A. LastModifiedDate
B. Region
C. Category
D. ProductName
Correct Answer: A
Explanation: Watermarking relies on timestamps such as LastModifiedDate to identify new or changed records since the previous load.
Go to the DP-700 Exam Prep Hub main page.
