Optimize a Lakehouse table (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Monitor and optimize an analytics solution (30–35%)
   --> Optimize performance
      --> Optimize a Lakehouse table


Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Optimizing Lakehouse tables is a critical skill for the DP-700 certification exam and for real-world Microsoft Fabric data engineering solutions. As data volumes grow, poorly optimized Lakehouse tables can lead to slow query performance, increased compute consumption, longer notebook execution times, delayed report refreshes, and higher operational costs.

Microsoft Fabric Lakehouses use the Delta Lake format as their storage foundation. Delta Lake provides ACID transactions, schema enforcement, versioning, and performance optimization features that enable scalable analytics workloads. However, data engineers must actively manage and optimize Lakehouse tables to maintain high performance.

For the DP-700 exam, you should understand:

  • How Delta tables are stored
  • Causes of poor Lakehouse performance
  • File management and compaction
  • Table optimization techniques
  • Partitioning strategies
  • Data skipping
  • Z-Ordering concepts
  • VACUUM operations
  • Query optimization best practices
  • Monitoring table performance

Understanding Lakehouse Tables

A Lakehouse table in Microsoft Fabric is typically stored as a Delta table within OneLake.

A Delta table consists of:

  • Data files (typically Parquet)
  • Delta transaction logs
  • Metadata
  • Version history

This architecture provides:

  • ACID transactions
  • Time travel
  • Reliable updates and deletes
  • Scalable analytics

Although Delta Lake automatically handles many storage operations, performance can degrade over time if tables are not maintained properly.


Why Lakehouse Tables Require Optimization

Over time, data ingestion processes create:

  • Large numbers of files
  • Small files
  • Fragmented storage
  • Uneven data distribution

Common symptoms include:

  • Slow SQL queries
  • Long Spark job runtimes
  • Delayed report refreshes
  • Increased resource consumption
  • Poor filtering performance

Optimization activities help maintain efficient storage and query execution.


The Small File Problem

One of the most common performance issues is excessive small files.

Consider a streaming ingestion process that writes:

  • Thousands of files per hour
  • Each file only a few kilobytes

Eventually, the table may contain millions of small files.

Why Small Files Hurt Performance

Every query must:

  • Read file metadata
  • Open file handles
  • Scan numerous files

The overhead often becomes greater than the actual data processing.

Example:

ScenarioFile Count
Optimized table100 files
Fragmented table50,000 files

The optimized table will generally perform significantly better.


Table Compaction

Compaction combines many small files into fewer larger files.

Benefits include:

  • Faster query execution
  • Reduced metadata overhead
  • Improved scan efficiency
  • Better Spark performance

Compaction is one of the most important optimization tasks for Delta tables.

Example

Before compaction:

10,000 files × 5 MB

After compaction:

100 files × 500 MB

The total data size remains similar, but query performance often improves substantially.


Using OPTIMIZE

The OPTIMIZE command is commonly used to compact Delta files.

Example:

OPTIMIZE Sales

The command:

  • Consolidates small files
  • Improves storage efficiency
  • Enhances query performance

For the DP-700 exam, understand that OPTIMIZE primarily addresses file fragmentation and small file issues.


Data Skipping

Delta Lake stores statistics about data files.

These statistics help Fabric eliminate unnecessary file scans.

This capability is known as data skipping.

Example:

A query requests:

WHERE OrderDate >= '2026-01-01'

If a file only contains data from 2024, Fabric can skip reading that file entirely.

Benefits include:

  • Reduced I/O
  • Faster query performance
  • Lower compute consumption

Z-Ordering

Z-Ordering improves data locality by physically organizing related values together.

This is particularly useful when queries repeatedly filter on specific columns.

Example:

OPTIMIZE Sales
ZORDER BY (CustomerID)

Benefits:

  • Better file pruning
  • Faster filtering
  • Improved query performance

Good Candidates for Z-Ordering

Columns frequently used in:

  • WHERE clauses
  • JOIN operations
  • Report filters
  • Dashboard slicers

Examples:

  • CustomerID
  • ProductID
  • OrderDate
  • Region

Poor Candidates

Columns with:

  • Extremely high cardinality and random access patterns
  • Rarely used filters
  • Constantly changing query patterns

Partitioning Strategies

Partitioning physically separates data into directories.

Example:

Sales
├── Year=2024
├── Year=2025
└── Year=2026

Queries targeting a specific year can read only the relevant partition.

Benefits:

  • Reduced data scanning
  • Faster query execution
  • Improved scalability

Choosing Partition Columns

Good partition columns typically:

  • Appear frequently in filters
  • Have moderate cardinality
  • Create balanced partitions

Examples:

  • Year
  • Month
  • Region
  • BusinessUnit

Over-Partitioning Risks

Too many partitions can create performance problems.

Poor example:

Partition by CustomerID

If there are millions of customers:

  • Millions of folders
  • Small files
  • Metadata overhead

This often performs worse than a non-partitioned table.

Rule of Thumb

Partition only when:

  • Query patterns justify it
  • Data volumes are large
  • Cardinality is manageable

VACUUM Operations

Delta tables retain historical files to support:

  • Transactions
  • Rollbacks
  • Time travel

Over time, these files consume storage.

VACUUM removes obsolete files.

Example:

VACUUM Sales

Benefits:

  • Reduces storage consumption
  • Removes unneeded files
  • Improves storage efficiency

Important Exam Point

VACUUM does not improve query performance directly.

Its primary purpose is storage cleanup.


Optimizing Data Types

Using appropriate data types improves efficiency.

Examples:

Better ChoiceAvoid
INTSTRING for numeric values
DATESTRING dates
SMALLINTOversized numeric types

Benefits:

  • Smaller storage footprint
  • Faster filtering
  • Improved joins
  • Better compression

Query Optimization Techniques

Sometimes the table is not the problem—the query is.

Use Predicate Filtering

Good:

SELECT *
FROM Sales
WHERE Year = 2026

Avoid:

SELECT *
FROM Sales

Filtering reduces scanned data.


Select Required Columns

Good:

SELECT CustomerID, SalesAmount

Avoid:

SELECT *

Reading fewer columns improves performance.


Reduce Unnecessary Joins

Complex joins increase execution time.

Use only required tables and columns.


Monitoring Lakehouse Performance

Several Fabric tools help identify optimization opportunities.

SQL Query Monitoring

Review:

  • Query duration
  • Resource usage
  • Execution plans

Notebook Monitoring

Identify:

  • Long-running Spark jobs
  • Excessive shuffles
  • Skewed workloads

Capacity Metrics

Monitor:

  • CPU utilization
  • Memory usage
  • Workload concurrency

Workspace Monitoring

Look for:

  • Refresh delays
  • Pipeline bottlenecks
  • Query slowdowns

Common Optimization Workflow

A typical optimization process might include:

Step 1

Identify slow queries.

Step 2

Determine whether excessive file counts exist.

Step 3

Run OPTIMIZE.

Step 4

Evaluate partitioning strategy.

Step 5

Consider Z-Ordering.

Step 6

Review query design.

Step 7

Run VACUUM when appropriate.


Best Practices

Use OPTIMIZE Regularly

Especially after:

  • Large batch loads
  • Frequent incremental loads
  • Streaming ingestion

Avoid Excessive Small Files

Batch writes when possible.

Partition Carefully

Avoid high-cardinality partition columns.

Use Z-Ordering Selectively

Apply to heavily filtered columns.

Monitor Query Performance

Optimization should be driven by workload patterns.

Schedule Maintenance

Automate optimization processes where possible.


DP-700 Exam Tips

Remember these key points:

  • Delta tables are the foundation of Fabric Lakehouse storage.
  • Small files are a major cause of poor performance.
  • OPTIMIZE primarily addresses file compaction.
  • Z-Ordering improves filtering performance.
  • Partitioning reduces scanned data but must be used carefully.
  • Over-partitioning can degrade performance.
  • VACUUM removes obsolete files and reduces storage consumption.
  • Data skipping helps eliminate unnecessary file reads.
  • Query optimization and table optimization work together.
  • Monitoring tools help identify performance bottlenecks.

Practice Exam Questions

Question 1

A Lakehouse table contains hundreds of thousands of very small Delta files after months of incremental loads. Which action should you take first?

A. Run OPTIMIZE on the table
B. Run VACUUM on the table
C. Create a new semantic model
D. Increase workspace permissions

Correct Answer: A

Explanation:
OPTIMIZE compacts small files into larger files, reducing metadata overhead and improving query performance. VACUUM removes obsolete files but does not address file fragmentation.


Question 2

What is the primary purpose of the VACUUM command?

A. Improve filtering performance
B. Create partitions automatically
C. Remove obsolete files no longer needed by Delta Lake
D. Rebuild semantic models

Correct Answer: C

Explanation:
VACUUM removes old files that are no longer required for Delta transaction history and time travel, helping reduce storage consumption.


Question 3

Which column is generally the best candidate for partitioning a large sales table?

A. OrderID with millions of unique values
B. TransactionGUID with millions of unique values
C. ProductDescription
D. SalesYear

Correct Answer: D

Explanation:
SalesYear is commonly used in filtering and has manageable cardinality, making it an effective partition column.


Question 4

What problem does data skipping help solve?

A. Excessive security permissions
B. Reading files that cannot possibly contain matching data
C. Semantic model refresh failures
D. Notebook authentication errors

Correct Answer: B

Explanation:
Data skipping uses file statistics to eliminate unnecessary file reads during query execution.


Question 5

A table is frequently filtered using CustomerID. Which optimization technique is most likely to improve performance?

A. Z-Ordering on CustomerID
B. Deleting transaction logs
C. Removing partitions entirely
D. Disabling Delta Lake features

Correct Answer: A

Explanation:
Z-Ordering organizes data based on frequently filtered columns, improving file pruning and query performance.


Question 6

What is a common risk of over-partitioning?

A. Increased data skipping efficiency
B. Reduced storage consumption
C. Excessive numbers of small partitions and files
D. Automatic query acceleration

Correct Answer: C

Explanation:
Over-partitioning can create many small directories and files, leading to metadata overhead and degraded performance.


Question 7

Which query pattern is generally most efficient?

A. SELECT * FROM Sales
B. SELECT CustomerID, SalesAmount FROM Sales WHERE Year = 2026
C. SELECT * FROM Sales CROSS JOIN Products
D. SELECT DISTINCT * FROM Sales

Correct Answer: B

Explanation:
Filtering rows and selecting only required columns minimizes data scanning and improves query efficiency.


Question 8

Which statement about OPTIMIZE is correct?

A. It removes all Delta transaction logs
B. It creates semantic model aggregations
C. It converts Delta tables to Parquet-only tables
D. It compacts many small files into fewer larger files

Correct Answer: D

Explanation:
OPTIMIZE primarily improves performance through file compaction and reduction of small-file fragmentation.


Question 9

A data engineer partitions a table by CustomerID containing 20 million unique customers. What is the most likely result?

A. Improved performance in all scenarios
B. Automatic Z-Ordering
C. Poor performance due to excessive partition cardinality
D. Elimination of Delta logs

Correct Answer: C

Explanation:
Partitioning by extremely high-cardinality columns creates excessive partitions and often harms performance.


Question 10

Which statement best describes Z-Ordering?

A. It removes deleted records permanently
B. It physically organizes related values together to improve query filtering
C. It automatically creates partitions for every column
D. It converts Delta tables into warehouse tables

Correct Answer: B

Explanation:
Z-Ordering improves data locality, helping Fabric skip more files and accelerate queries that filter on selected columns.


Go to the DP-700 Exam Prep Hub main page.

Leave a comment