Tag: Denormalization

Denormalize data (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
   --> Ingest and transform batch data
      --> Denormalize data


Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Denormalization is an important data engineering concept that appears frequently in modern analytics platforms, including Microsoft Fabric. While normalized data models are ideal for transactional systems, analytics workloads often benefit from denormalized structures that improve query performance, simplify reporting, and reduce the number of joins required during analysis.

For the DP-700 exam, you should understand:

  • What denormalization is
  • Why denormalization is used
  • When denormalization should be applied
  • How denormalization is implemented in Microsoft Fabric
  • The tradeoffs between normalized and denormalized models
  • How denormalization relates to Lakehouses, Warehouses, and dimensional modeling

What Is Denormalization?

Denormalization is the process of combining data from multiple related tables into fewer tables to reduce the need for joins during querying.

In a normalized model, information is separated into multiple related tables to eliminate redundancy.

Example normalized structure:

Customer Table

CustomerIDCustomerName
1Contoso
2Fabrikam

Product Table

ProductIDProductName
10Laptop
20Monitor

Sales Table

SaleIDCustomerIDProductIDAmount
10011101200

To generate a report, multiple joins are required.


Denormalized Version

SaleIDCustomerNameProductNameAmount
1001ContosoLaptop1200

The descriptive attributes are stored directly in the fact record.

This eliminates the need for joins when querying.


Why Denormalize Data?

Analytics systems are optimized differently than transactional systems.

The primary goals of denormalization are:

  • Faster query performance
  • Reduced join complexity
  • Improved reporting efficiency
  • Simplified data models
  • Better user experience for analysts

In Microsoft Fabric, analytical workloads often prioritize read performance over minimizing storage consumption.


Normalization vs Denormalization

CharacteristicNormalizedDenormalized
Data redundancyLowHigher
Storage efficiencyBetterLower
Query complexityHigherLower
Join requirementsManyFew
Reporting performanceSlowerFaster
Transaction systemsPreferredRarely used
Analytics systemsSometimesCommon

Denormalization in Microsoft Fabric

Denormalization is commonly performed during:

  • Data ingestion
  • Data transformation
  • Data warehouse loading
  • Lakehouse processing
  • ETL/ELT workflows

Fabric engineers frequently use:

  • Dataflows Gen2
  • Notebooks (PySpark)
  • SQL transformations
  • Data Pipelines

to create denormalized analytical datasets.


Common Denormalization Techniques

Flattening Multiple Tables

The most common approach is combining related tables into a single analytical table.

Example:

Before

Customer

CustomerID
CustomerName
Region

Sales

SaleID
CustomerID
Amount

After

SaleID
CustomerID
CustomerName
Region
Amount

This reduces reporting complexity.


Creating Wide Tables

A wide table contains many descriptive columns.

Example:

OrderIDCustomerNameRegionProductNameCategorySalespersonAmount

Analysts can query one table instead of several.


Materializing Joins

Rather than joining tables every time a report executes, the join is performed once during data loading.

Example:

SELECT
s.SaleID,
c.CustomerName,
p.ProductName,
s.Amount
FROM Sales s
JOIN Customer c
ON s.CustomerID = c.CustomerID
JOIN Product p
ON s.ProductID = p.ProductID;

The resulting dataset is stored as a new table.


Aggregated Tables

Sometimes denormalization includes pre-computing aggregates.

Example:

SELECT
Region,
SUM(Amount) AS TotalSales
FROM Sales
GROUP BY Region;

This produces a summarized table optimized for reporting.


Denormalization and Star Schemas

A common DP-700 topic is dimensional modeling.

Star schemas are partially denormalized models.

Example:

Fact Table

FactSales

| DateKey | ProductKey | CustomerKey | SalesAmount |

Dimension Tables

DimCustomer

DimProduct

DimDate

Star schemas intentionally denormalize dimension information while maintaining separate fact and dimension tables.

This approach balances:

  • Performance
  • Simplicity
  • Storage efficiency

Denormalization in Lakehouses

Within Fabric Lakehouses, denormalization is often used to create:

  • Curated Silver tables
  • Gold analytical tables
  • Reporting datasets

Typical flow:

Bronze Layer

Raw source data

Silver Layer

Cleaned and standardized data

Gold Layer

Denormalized business-ready tables

Example:

Bronze Customer
Bronze Orders
Bronze Products
↓ Transform
Gold SalesAnalytics

The Gold table contains business-friendly denormalized data.


Denormalization Using PySpark

Example:

sales_df = spark.table("Sales")
customer_df = spark.table("Customer")
result_df = sales_df.join(
customer_df,
sales_df.CustomerID == customer_df.CustomerID
)
result_df.write.mode("overwrite").saveAsTable("SalesAnalytics")

The resulting table is denormalized.


Denormalization Using SQL

Example:

CREATE TABLE SalesAnalytics AS
SELECT
s.SaleID,
c.CustomerName,
p.ProductName,
s.Amount
FROM Sales s
JOIN Customer c
ON s.CustomerID = c.CustomerID
JOIN Product p
ON s.ProductID = p.ProductID;

This is one of the most common techniques used in Fabric Warehouses.


Benefits of Denormalization

Faster Query Performance

Fewer joins mean faster report execution.


Simpler Queries

Analysts write simpler SQL.

Instead of:

SELECT ...
FROM FactSales
JOIN DimCustomer
JOIN DimProduct
JOIN DimDate

They may query a single table.


Improved Reporting

Power BI reports often perform better against denormalized structures.


Better User Experience

Business users can easily understand a flatter model.


Drawbacks of Denormalization

Increased Storage

Data duplication increases storage consumption.


More Complex Updates

Changes may need to be applied in multiple locations.


Potential Data Inconsistency

Improper ETL processes can create mismatched values.


Longer Load Times

More transformations occur during ingestion.


When to Denormalize

Denormalization is typically appropriate when:

✅ Data is primarily read rather than updated

✅ Reporting performance is important

✅ Large numbers of joins slow queries

✅ Business users require simplified models

✅ Building Gold-layer analytical datasets


When Not to Denormalize

Avoid denormalization when:

❌ Supporting transactional applications

❌ Frequent updates occur

❌ Data consistency is critical

❌ Storage costs are a major concern

❌ Redundancy must be minimized


DP-700 Exam Tips

Remember the Main Goal

Denormalization primarily improves analytical query performance.


Understand Lakehouse Layers

Expect questions about:

  • Bronze = Raw
  • Silver = Refined
  • Gold = Business-ready and often denormalized

Know the Tradeoff

The exam often tests:

Better performance ↔ More data redundancy


Star Schemas Matter

Star schemas are intentionally denormalized and are frequently used in analytics solutions.


SQL and PySpark Are Common Tools

Both SQL and PySpark are commonly used to implement denormalization transformations in Fabric.


Practice Exam Questions

Question 1

What is the primary purpose of denormalization?

A. Minimize storage usage

B. Eliminate duplicate data completely

C. Increase referential integrity

D. Reduce query complexity and improve performance

Answer: D

Explanation: Denormalization reduces the number of joins required and improves query performance for analytical workloads.


Question 2

Which characteristic is typically associated with a denormalized model?

A. Reduced redundancy

B. Increased data redundancy

C. More foreign keys

D. Higher normalization levels

Answer: B

Explanation: Denormalized models intentionally duplicate some data to improve read performance.


Question 3

Which Fabric layer commonly contains denormalized business-ready datasets?

A. Bronze

B. Landing

C. Gold

D. Raw

Answer: C

Explanation: Gold-layer datasets are typically optimized for reporting and analytics and are often denormalized.


Question 4

A report currently joins ten tables to answer business questions. What is the primary benefit of denormalizing the model?

A. Reduced ETL processing

B. Reduced storage requirements

C. Increased normalization

D. Faster query execution

Answer: D

Explanation: Reducing the number of joins generally improves query performance.


Question 5

Which operation is most commonly used when denormalizing data?

A. Joining related tables

B. Index rebuilding

C. Partitioning

D. Creating primary keys

Answer: A

Explanation: Denormalization commonly involves combining related tables through joins.


Question 6

Which workload benefits most from denormalized structures?

A. Transaction processing

B. Online order entry

C. Inventory updates

D. Business intelligence reporting

Answer: D

Explanation: Reporting and analytics workloads benefit from simplified structures and faster query performance.


Question 7

What is a potential disadvantage of denormalization?

A. Slower reads

B. Increased storage consumption

C. Reduced reporting performance

D. Fewer columns available

Answer: B

Explanation: Because data may be duplicated, denormalized models often consume more storage.


Question 8

A Fabric engineer creates a table that combines customer, product, and sales information into a single reporting table. What technique is being used?

A. Data partitioning

B. Schema evolution

C. Streaming ingestion

D. Denormalization

Answer: D

Explanation: Combining multiple related tables into a single analytical table is a classic denormalization technique.


Question 9

Which schema design is considered partially denormalized and commonly used in analytics?

A. Snowflake schema

B. Third Normal Form

C. Star schema

D. Operational schema

Answer: C

Explanation: Star schemas intentionally denormalize dimension data to improve analytical performance.


Question 10

Why might a data engineer choose a denormalized Gold-layer table instead of querying multiple Silver-layer tables?

A. To simplify reporting and improve performance

B. To eliminate all ETL processing

C. To reduce data quality requirements

D. To increase normalization

Answer: A

Explanation: Denormalized Gold-layer tables provide a business-friendly structure that supports faster reporting and easier querying.


Go to the DP-700 Exam Prep Hub main page.