This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Monitor and optimize an analytics solution (30–35%)
--> Optimize performance
--> Optimize a Lakehouse table
Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
Optimizing Lakehouse tables is a critical skill for the DP-700 certification exam and for real-world Microsoft Fabric data engineering solutions. As data volumes grow, poorly optimized Lakehouse tables can lead to slow query performance, increased compute consumption, longer notebook execution times, delayed report refreshes, and higher operational costs.
Microsoft Fabric Lakehouses use the Delta Lake format as their storage foundation. Delta Lake provides ACID transactions, schema enforcement, versioning, and performance optimization features that enable scalable analytics workloads. However, data engineers must actively manage and optimize Lakehouse tables to maintain high performance.
For the DP-700 exam, you should understand:
- How Delta tables are stored
- Causes of poor Lakehouse performance
- File management and compaction
- Table optimization techniques
- Partitioning strategies
- Data skipping
- Z-Ordering concepts
- VACUUM operations
- Query optimization best practices
- Monitoring table performance
Understanding Lakehouse Tables
A Lakehouse table in Microsoft Fabric is typically stored as a Delta table within OneLake.
A Delta table consists of:
- Data files (typically Parquet)
- Delta transaction logs
- Metadata
- Version history
This architecture provides:
- ACID transactions
- Time travel
- Reliable updates and deletes
- Scalable analytics
Although Delta Lake automatically handles many storage operations, performance can degrade over time if tables are not maintained properly.
Why Lakehouse Tables Require Optimization
Over time, data ingestion processes create:
- Large numbers of files
- Small files
- Fragmented storage
- Uneven data distribution
Common symptoms include:
- Slow SQL queries
- Long Spark job runtimes
- Delayed report refreshes
- Increased resource consumption
- Poor filtering performance
Optimization activities help maintain efficient storage and query execution.
The Small File Problem
One of the most common performance issues is excessive small files.
Consider a streaming ingestion process that writes:
- Thousands of files per hour
- Each file only a few kilobytes
Eventually, the table may contain millions of small files.
Why Small Files Hurt Performance
Every query must:
- Read file metadata
- Open file handles
- Scan numerous files
The overhead often becomes greater than the actual data processing.
Example:
| Scenario | File Count |
|---|---|
| Optimized table | 100 files |
| Fragmented table | 50,000 files |
The optimized table will generally perform significantly better.
Table Compaction
Compaction combines many small files into fewer larger files.
Benefits include:
- Faster query execution
- Reduced metadata overhead
- Improved scan efficiency
- Better Spark performance
Compaction is one of the most important optimization tasks for Delta tables.
Example
Before compaction:
10,000 files × 5 MB
After compaction:
100 files × 500 MB
The total data size remains similar, but query performance often improves substantially.
Using OPTIMIZE
The OPTIMIZE command is commonly used to compact Delta files.
Example:
OPTIMIZE Sales
The command:
- Consolidates small files
- Improves storage efficiency
- Enhances query performance
For the DP-700 exam, understand that OPTIMIZE primarily addresses file fragmentation and small file issues.
Data Skipping
Delta Lake stores statistics about data files.
These statistics help Fabric eliminate unnecessary file scans.
This capability is known as data skipping.
Example:
A query requests:
WHERE OrderDate >= '2026-01-01'
If a file only contains data from 2024, Fabric can skip reading that file entirely.
Benefits include:
- Reduced I/O
- Faster query performance
- Lower compute consumption
Z-Ordering
Z-Ordering improves data locality by physically organizing related values together.
This is particularly useful when queries repeatedly filter on specific columns.
Example:
OPTIMIZE SalesZORDER BY (CustomerID)
Benefits:
- Better file pruning
- Faster filtering
- Improved query performance
Good Candidates for Z-Ordering
Columns frequently used in:
- WHERE clauses
- JOIN operations
- Report filters
- Dashboard slicers
Examples:
- CustomerID
- ProductID
- OrderDate
- Region
Poor Candidates
Columns with:
- Extremely high cardinality and random access patterns
- Rarely used filters
- Constantly changing query patterns
Partitioning Strategies
Partitioning physically separates data into directories.
Example:
Sales ├── Year=2024 ├── Year=2025 └── Year=2026
Queries targeting a specific year can read only the relevant partition.
Benefits:
- Reduced data scanning
- Faster query execution
- Improved scalability
Choosing Partition Columns
Good partition columns typically:
- Appear frequently in filters
- Have moderate cardinality
- Create balanced partitions
Examples:
- Year
- Month
- Region
- BusinessUnit
Over-Partitioning Risks
Too many partitions can create performance problems.
Poor example:
Partition by CustomerID
If there are millions of customers:
- Millions of folders
- Small files
- Metadata overhead
This often performs worse than a non-partitioned table.
Rule of Thumb
Partition only when:
- Query patterns justify it
- Data volumes are large
- Cardinality is manageable
VACUUM Operations
Delta tables retain historical files to support:
- Transactions
- Rollbacks
- Time travel
Over time, these files consume storage.
VACUUM removes obsolete files.
Example:
VACUUM Sales
Benefits:
- Reduces storage consumption
- Removes unneeded files
- Improves storage efficiency
Important Exam Point
VACUUM does not improve query performance directly.
Its primary purpose is storage cleanup.
Optimizing Data Types
Using appropriate data types improves efficiency.
Examples:
| Better Choice | Avoid |
|---|---|
| INT | STRING for numeric values |
| DATE | STRING dates |
| SMALLINT | Oversized numeric types |
Benefits:
- Smaller storage footprint
- Faster filtering
- Improved joins
- Better compression
Query Optimization Techniques
Sometimes the table is not the problem—the query is.
Use Predicate Filtering
Good:
SELECT *FROM SalesWHERE Year = 2026
Avoid:
SELECT *FROM Sales
Filtering reduces scanned data.
Select Required Columns
Good:
SELECT CustomerID, SalesAmount
Avoid:
SELECT *
Reading fewer columns improves performance.
Reduce Unnecessary Joins
Complex joins increase execution time.
Use only required tables and columns.
Monitoring Lakehouse Performance
Several Fabric tools help identify optimization opportunities.
SQL Query Monitoring
Review:
- Query duration
- Resource usage
- Execution plans
Notebook Monitoring
Identify:
- Long-running Spark jobs
- Excessive shuffles
- Skewed workloads
Capacity Metrics
Monitor:
- CPU utilization
- Memory usage
- Workload concurrency
Workspace Monitoring
Look for:
- Refresh delays
- Pipeline bottlenecks
- Query slowdowns
Common Optimization Workflow
A typical optimization process might include:
Step 1
Identify slow queries.
Step 2
Determine whether excessive file counts exist.
Step 3
Run OPTIMIZE.
Step 4
Evaluate partitioning strategy.
Step 5
Consider Z-Ordering.
Step 6
Review query design.
Step 7
Run VACUUM when appropriate.
Best Practices
Use OPTIMIZE Regularly
Especially after:
- Large batch loads
- Frequent incremental loads
- Streaming ingestion
Avoid Excessive Small Files
Batch writes when possible.
Partition Carefully
Avoid high-cardinality partition columns.
Use Z-Ordering Selectively
Apply to heavily filtered columns.
Monitor Query Performance
Optimization should be driven by workload patterns.
Schedule Maintenance
Automate optimization processes where possible.
DP-700 Exam Tips
Remember these key points:
- Delta tables are the foundation of Fabric Lakehouse storage.
- Small files are a major cause of poor performance.
- OPTIMIZE primarily addresses file compaction.
- Z-Ordering improves filtering performance.
- Partitioning reduces scanned data but must be used carefully.
- Over-partitioning can degrade performance.
- VACUUM removes obsolete files and reduces storage consumption.
- Data skipping helps eliminate unnecessary file reads.
- Query optimization and table optimization work together.
- Monitoring tools help identify performance bottlenecks.
Practice Exam Questions
Question 1
A Lakehouse table contains hundreds of thousands of very small Delta files after months of incremental loads. Which action should you take first?
A. Run OPTIMIZE on the table
B. Run VACUUM on the table
C. Create a new semantic model
D. Increase workspace permissions
Correct Answer: A
Explanation:
OPTIMIZE compacts small files into larger files, reducing metadata overhead and improving query performance. VACUUM removes obsolete files but does not address file fragmentation.
Question 2
What is the primary purpose of the VACUUM command?
A. Improve filtering performance
B. Create partitions automatically
C. Remove obsolete files no longer needed by Delta Lake
D. Rebuild semantic models
Correct Answer: C
Explanation:
VACUUM removes old files that are no longer required for Delta transaction history and time travel, helping reduce storage consumption.
Question 3
Which column is generally the best candidate for partitioning a large sales table?
A. OrderID with millions of unique values
B. TransactionGUID with millions of unique values
C. ProductDescription
D. SalesYear
Correct Answer: D
Explanation:
SalesYear is commonly used in filtering and has manageable cardinality, making it an effective partition column.
Question 4
What problem does data skipping help solve?
A. Excessive security permissions
B. Reading files that cannot possibly contain matching data
C. Semantic model refresh failures
D. Notebook authentication errors
Correct Answer: B
Explanation:
Data skipping uses file statistics to eliminate unnecessary file reads during query execution.
Question 5
A table is frequently filtered using CustomerID. Which optimization technique is most likely to improve performance?
A. Z-Ordering on CustomerID
B. Deleting transaction logs
C. Removing partitions entirely
D. Disabling Delta Lake features
Correct Answer: A
Explanation:
Z-Ordering organizes data based on frequently filtered columns, improving file pruning and query performance.
Question 6
What is a common risk of over-partitioning?
A. Increased data skipping efficiency
B. Reduced storage consumption
C. Excessive numbers of small partitions and files
D. Automatic query acceleration
Correct Answer: C
Explanation:
Over-partitioning can create many small directories and files, leading to metadata overhead and degraded performance.
Question 7
Which query pattern is generally most efficient?
A. SELECT * FROM Sales
B. SELECT CustomerID, SalesAmount FROM Sales WHERE Year = 2026
C. SELECT * FROM Sales CROSS JOIN Products
D. SELECT DISTINCT * FROM Sales
Correct Answer: B
Explanation:
Filtering rows and selecting only required columns minimizes data scanning and improves query efficiency.
Question 8
Which statement about OPTIMIZE is correct?
A. It removes all Delta transaction logs
B. It creates semantic model aggregations
C. It converts Delta tables to Parquet-only tables
D. It compacts many small files into fewer larger files
Correct Answer: D
Explanation:
OPTIMIZE primarily improves performance through file compaction and reduction of small-file fragmentation.
Question 9
A data engineer partitions a table by CustomerID containing 20 million unique customers. What is the most likely result?
A. Improved performance in all scenarios
B. Automatic Z-Ordering
C. Poor performance due to excessive partition cardinality
D. Elimination of Delta logs
Correct Answer: C
Explanation:
Partitioning by extremely high-cardinality columns creates excessive partitions and often harms performance.
Question 10
Which statement best describes Z-Ordering?
A. It removes deleted records permanently
B. It physically organizes related values together to improve query filtering
C. It automatically creates partitions for every column
D. It converts Delta tables into warehouse tables
Correct Answer: B
Explanation:
Z-Ordering improves data locality, helping Fabric skip more files and accelerate queries that filter on selected columns.
Go to the DP-700 Exam Prep Hub main page.
