Working with SQL to select, filter, and aggregate data is a core skill for analytics engineers using Microsoft Fabric. Whether querying data in a warehouse, lakehouse SQL analytics endpoint, or semantic model via DirectQuery, SQL enables precise data retrieval and summarization for reporting, dashboards, and analytics solutions.
For DP-600, you should understand how to construct SQL queries that:
Select specific data columns
Filter rows based on conditions
Aggregate values with grouping and summary functions
SQL Data Selection
Selecting data refers to using the SELECT clause to choose which columns or expressions to return.
Example:
SELECT
CustomerID,
OrderDate,
SalesAmount
FROM Sales;
Use * to return all columns: SELECT * FROM Sales;
Use expressions to compute derived values: SELECT OrderDate, SalesAmount, SalesAmount * 1.1 AS AdjustedRevenue FROM Sales;
Exam Tip: Be purposeful in selecting only needed columns to improve performance.
SQL Data Filtering
Filtering data determines which rows are returned based on conditions using the WHERE clause.
Basic Filtering:
SELECT *
FROM Sales
WHERE OrderDate >= '2025-01-01';
Combined Conditions:
AND: WHERE Country = 'USA' AND SalesAmount > 1000
OR: WHERE Region = 'East' OR Region = 'West'
Null and Missing Value Filters:
WHERE SalesAmount IS NOT NULL
Exam Tip: Understand how WHERE filters reduce dataset size before aggregation.
SQL Aggregation
Aggregation summarizes grouped rows using functions like SUM, COUNT, AVG, MIN, and MAX.
Basic Aggregation:
SELECT
SUM(SalesAmount) AS TotalSales
FROM Sales;
Grouped Aggregation:
SELECT
Country,
SUM(SalesAmount) AS TotalSales,
COUNT(*) AS OrderCount
FROM Sales
GROUP BY Country;
Filtering After Aggregation:
Use HAVING instead of WHERE to filter aggregated results:
SELECT
Country,
SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY Country
HAVING SUM(SalesAmount) > 100000;
Exam Tip:
Use WHERE for row-level filters before grouping.
Use HAVING to filter group-level aggregates.
Combining Select, Filter, and Aggregate
A complete SQL query often blends all three:
SELECT
ProductCategory,
COUNT(*) AS Orders,
SUM(SalesAmount) AS TotalSales,
AVG(SalesAmount) AS AvgSale
FROM Sales
WHERE OrderDate BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY ProductCategory
ORDER BY TotalSales DESC;
This example:
Selects specific columns and expressions
Filters by date range
Aggregates by product category
Orders results by summary metric
SQL in Different Fabric Workloads
Warehouse: Standard T-SQL for BI queries
Lakehouse SQL Analytics: SQL against Delta tables
Semantic Models via DirectQuery: SQL pushed to the source where supported
Dataflows/Power Query: SQL-like operations through M (not direct SQL)
Performance and Pushdown
When using SQL in Fabric:
Engines push filters and aggregations down to the data source for performance.
Select only needed columns early to limit data movement.
Avoid SELECT * in production queries unless necessary.
Key SQL Concepts for the Exam
SELECT: Defines what data to retrieve
WHERE: Filters data before aggregation
GROUP BY: Organizes rows into groups
HAVING: Filters after aggregation
Aggregate functions: Summarize numeric data
Understanding how these work together is essential for creating analytics-ready datasets.
Common Exam Scenarios
You may be asked to:
Write SQL to filter data based on conditions
Summarize data across groups
Decide whether to use WHERE or HAVING
Identify the correct SQL pattern for a reporting requirement
Example exam prompt:
“Which SQL query correctly returns the total sales per region, only for regions with more than 1,000 orders?”
Understanding aggregate filters (HAVING) and groupings will be key.
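As a hedged sketch (assuming a Sales table with Region and SalesAmount columns, as used throughout this article), the answer would look something like:
SELECT
    Region,
    SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY Region
HAVING COUNT(*) > 1000;
Here COUNT(*) enforces the "more than 1,000 orders" condition in HAVING, while SUM(SalesAmount) supplies the per-region total.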
Final Exam Tips
If a question mentions:
“Return summary metrics”
“Only include rows that meet conditions”
“Group results by category”
…you’re looking at combining SELECT, WHERE, and GROUP BY in SQL.
WHERE filters rows before aggregation
HAVING filters after aggregation
GROUP BY is required for per-group metrics
Use aggregate functions intentionally
Performance matters — avoid unnecessary columns
Practice Questions:
Here are 10 questions to test and help solidify your learning and knowledge. As you review these and other questions in your preparation, make sure to …
Identify and understand why an option is correct (or incorrect), not just which one
Look for and understand the usage scenario of keywords in exam questions to guide you
Expect scenario-based questions rather than direct definitions
1. Which SQL clause is used to filter rows before aggregation occurs?
A. HAVING B. GROUP BY C. WHERE D. ORDER BY
✅ Correct Answer: C
Explanation: The WHERE clause filters individual rows before any aggregation or grouping takes place. HAVING filters results after aggregation.
2. You need to calculate total sales per product category. Which clause is required?
A. WHERE B. GROUP BY C. ORDER BY D. HAVING
✅ Correct Answer: B
Explanation: GROUP BY groups rows so aggregate functions (such as SUM) can be calculated per category.
3. Which function returns the number of rows in each group?
A. SUM() B. COUNT() C. AVG() D. MAX()
✅ Correct Answer: B
Explanation: COUNT() counts the number of rows in a group. It is commonly used to count records or transactions.
4. Which query correctly filters aggregated results?
A.
WHERE SUM(SalesAmount) > 10000
B.
HAVING SUM(SalesAmount) > 10000
C.
GROUP BY SUM(SalesAmount) > 10000
D.
ORDER BY SUM(SalesAmount) > 10000
✅ Correct Answer: B
Explanation: HAVING is used to filter aggregated values. WHERE cannot reference aggregate functions.
5. Which SQL statement returns the total number of orders?
A.
SELECT COUNT(*) FROM Orders;
B.
SELECT SUM(*) FROM Orders;
C.
SELECT TOTAL(Orders) FROM Orders;
D.
SELECT COUNT(Orders) FROM Orders;
✅ Correct Answer: A
Explanation: COUNT(*) counts all rows in a table, making it the correct way to return total order count.
6. Which clause is used to sort aggregated query results?
A. GROUP BY B. WHERE C. ORDER BY D. HAVING
✅ Correct Answer: C
Explanation: ORDER BY sorts the final result set, including aggregated columns.
7. What happens if a column in the SELECT statement is not included in the GROUP BY clause or an aggregate function?
A. The query runs but returns incorrect results B. SQL automatically groups it C. The query fails D. The column is ignored
✅ Correct Answer: C
Explanation: In SQL, any column in SELECT must either be aggregated or included in GROUP BY.
8. Which query returns average sales amount per country?
A.
SELECT Country, AVG(SalesAmount)
FROM Sales;
B.
SELECT Country, AVG(SalesAmount)
FROM Sales
GROUP BY Country;
C.
SELECT Country, SUM(SalesAmount)
GROUP BY Country;
D.
SELECT AVG(SalesAmount)
FROM Sales
GROUP BY Country;
✅ Correct Answer: B
Explanation: Grouping by Country allows AVG(SalesAmount) to be calculated per country.
9. Which filter removes rows with NULL values in a column?
A.
WHERE SalesAmount = NULL
B.
WHERE SalesAmount <> NULL
C.
WHERE SalesAmount IS NOT NULL
D.
WHERE NOT NULL SalesAmount
✅ Correct Answer: C
Explanation: SQL uses IS NULL and IS NOT NULL to check for null values.
10. Which SQL pattern is most efficient for analytics queries in Microsoft Fabric?
A. Selecting all columns and filtering later B. Using SELECT * for simplicity C. Filtering early and selecting only needed columns D. Aggregating without grouping
✅ Correct Answer: C
Explanation: Filtering early and selecting only required columns improves performance by reducing data movement—an important Fabric best practice.
In Microsoft Fabric, the Visual Query Editor (VQE) provides a low-code, graphical experience for querying data across lakehouses, warehouses, and semantic models. It allows analytics engineers to explore, shape, and summarize data without writing SQL or KQL, while still generating optimized queries behind the scenes.
For the DP-600 exam, you should understand what the Visual Query Editor is, where it’s used, and how to perform common data analysis tasks such as selecting columns, filtering rows, and aggregating data.
What Is the Visual Query Editor?
The Visual Query Editor is a graphical query-building interface available in multiple Fabric experiences, including:
Lakehouse SQL analytics endpoint
Warehouse
Power BI (Direct Lake and DirectQuery scenarios)
Data exploration within Fabric items
Instead of writing queries manually, you interact with:
Tables and columns
Drag-and-drop operations
Menus for filters, grouping, and aggregations
Fabric then translates these actions into optimized SQL or engine-specific queries.
Selecting Data
Selecting data in the Visual Query Editor focuses on choosing the right columns and datasets for analysis.
Key Capabilities
Select or deselect columns from one or more tables
Rename columns for readability
Reorder columns for analysis or reporting
Combine columns from related tables (via existing relationships)
Exam Tips
Selecting fewer columns improves performance and reduces data transfer.
Column renaming in VQE affects the query result, not the underlying table schema.
The Visual Query Editor respects relationships defined in semantic models and warehouses.
Filtering Data
Filtering allows you to limit rows based on conditions, ensuring only relevant data is included.
Common Filter Types
Equality filters (e.g., Status = "Active")
Range filters (e.g., dates, numeric thresholds)
Text filters (contains, starts with, ends with)
Null / non-null filters
Relative date filters (last 7 days, current month)
Where Filtering Is Applied
At the query level, not permanently in the data source
Before aggregation (important for correct results)
Exam Tips
Filters applied in the Visual Query Editor are executed at the data source when possible (query folding).
Filtering early improves performance and reduces memory usage.
Be aware of how filters interact with aggregations.
Aggregating Data
Aggregation summarizes data by grouping rows and applying calculations.
Common Aggregations
Sum
Count / Count Distinct
Average
Min / Max
Grouping Data
Select one or more columns as group-by fields
Apply aggregations to numeric or date columns
Results return one row per group
Examples
Total sales by product category
Count of orders per customer
Average response time by day
Exam Tips
Aggregations in the Visual Query Editor are conceptually similar to GROUP BY in SQL.
Aggregated queries reduce dataset size and improve performance.
Understand the difference between row-level data and aggregated results.
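To make the GROUP BY analogy from the tips above concrete, grouping by one column and summing another in the editor behaves much like the following SQL sketch (table and column names are illustrative, not the editor's actual generated output):
SELECT
    ProductCategory,
    SUM(SalesAmount) AS TotalSales,
    COUNT(*) AS OrderCount
FROM Sales
GROUP BY ProductCategory;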
Behind the Scenes: Generated Queries
Although the Visual Query Editor is low-code, Fabric generates:
SQL queries for warehouses and lakehouse SQL endpoints
Optimized engine-specific queries for semantic models
This ensures:
Efficient execution
Compatibility with Direct Lake and DirectQuery
Consistent results across Fabric experiences
Exam Tip
You are not required to read or write the generated SQL, but you should understand that the Visual Query Editor does not bypass query optimization.
When to Use the Visual Query Editor
Use the Visual Query Editor when:
Quickly exploring unfamiliar datasets
Building queries without writing code
Creating reusable query logic for reports
Teaching or collaborating with less SQL-focused users
Avoid it when:
Complex transformations are required (use SQL, Spark, or Dataflows)
Highly customized logic is needed beyond supported operations
Key Exam Takeaways
For the DP-600 exam, remember:
The Visual Query Editor is a graphical query-building tool in Microsoft Fabric.
It supports selecting columns, filtering rows, and aggregating data.
Operations are translated into optimized queries executed at the data source.
Filtering occurs before aggregation, affecting results and performance.
It is commonly used with lakehouses, warehouses, and semantic models.
Practice Questions:
Here are 10 questions to test and help solidify your learning and knowledge. As you review these and other questions in your preparation, make sure to …
Identify and understand why an option is correct (or incorrect), not just which one
Look for and understand the usage scenario of keywords in exam questions to guide you
Expect scenario-based questions rather than direct definitions
Know the purpose and scope of the Visual Query Editor
Know how to select, filter, and aggregate data
Understand execution order and performance implications
Know when to use (and not use) the Visual Query Editor
Question 1
What is the primary purpose of the Visual Query Editor in Microsoft Fabric?
A. To permanently modify table schemas B. To build queries visually without writing SQL C. To replace semantic models D. To manage workspace permissions
Correct Answer: B
Explanation: The Visual Query Editor provides a low-code, graphical interface for building queries. It does not modify schemas, replace models, or manage security.
Question 2
When you deselect a column in the Visual Query Editor, what happens?
A. The column is deleted from the source table B. The column is hidden permanently for all users C. The column is excluded only from the query results D. The column data type is changed
Correct Answer: C
Explanation: Column selection affects only the query output, not the underlying data or schema.
Question 3
Why is it considered a best practice to select only required columns in a query?
A. It enforces data security B. It reduces query complexity and improves performance C. It enables Direct Lake mode D. It prevents duplicate rows
Correct Answer: B
Explanation: Selecting fewer columns reduces data movement and memory usage, leading to better performance.
Question 4
Which type of filter is commonly used to restrict data to a recent time period?
A. Equality filter B. Text filter C. Relative date filter D. Aggregate filter
Correct Answer: C
Explanation: Relative date filters (e.g., “Last 30 days”) dynamically adjust based on the current date and are commonly used in analytics.
Question 5
At what stage of query execution are filters applied in the Visual Query Editor?
A. After aggregation B. After the query result is returned C. Before aggregation D. Only in the Power BI report layer
Correct Answer: C
Explanation: Filters are applied before aggregation, ensuring accurate summary results and better performance.
Question 6
Which query element defines the groups over which aggregations such as SUM or COUNT are calculated?
A. SUM B. COUNT C. GROUP BY D. MIN
Correct Answer: C
Explanation: Grouping defines how rows are summarized. Aggregations like SUM or COUNT rely on GROUP BY logic to produce per-group results.
Question 7
You want to see total sales by product category. Which Visual Query Editor actions are required?
A. Filter Product Category and sort by Sales B. Group by Product Category and apply SUM to Sales C. Count Product Category and filter Sales D. Rename Product Category and aggregate rows
Correct Answer: B
Explanation: This scenario requires grouping on Product Category and applying a SUM aggregation to the Sales column.
Question 8
What happens behind the scenes when you build a query using the Visual Query Editor?
A. Fabric stores a cached dataset only B. Fabric generates optimized SQL or engine-specific queries C. Fabric converts the query into DAX D. Fabric disables query folding
Correct Answer: B
Explanation: The Visual Query Editor translates visual actions into optimized queries (such as SQL) that execute at the data source.
Question 9
Which Fabric items commonly support querying through the Visual Query Editor?
A. Pipelines and notebooks only B. Dashboards only C. Lakehouses, warehouses, and semantic models D. Eventhouses only
Correct Answer: C
Explanation: The Visual Query Editor is widely used across lakehouses, warehouses, and semantic models in Fabric.
Question 10
When should you avoid using the Visual Query Editor?
A. When exploring new datasets B. When building quick aggregations C. When complex transformation logic is required D. When filtering data
Correct Answer: C
Explanation: For advanced or complex transformations, SQL, Spark, or Dataflows are more appropriate than the Visual Query Editor.
Converting data types is a fundamental transformation task in data preparation. It helps ensure data consistency, accurate calculations, filter behavior, sorting, joins, and overall query correctness. In Microsoft Fabric, data type conversion can happen in Power Query, SQL, or Spark depending on the workload and where you are in your data pipeline.
This article explains why, where, and how you convert data types in Fabric, with an emphasis on real-world scenarios and exam relevance.
Why Convert Data Types?
Data type mismatches can lead to:
Erroneous joins (e.g., joining text to numeric)
Incorrect aggregations (e.g., sums treating numbers as text)
Filtering issues (e.g., date strings not filtering as dates)
Unexpected sort order (e.g., text sorts differently from numbers)
In analytics, getting data types right is critical for both the correctness of results and query performance.
Common Data Types in Analytics
Here are some common data types you’ll work with:
Numeric: INT, BIGINT, DECIMAL, FLOAT
Text: STRING, VARCHAR
Date/Time: DATE, TIME, DATETIME, TIMESTAMP
Boolean: TRUE / FALSE
Where Data Type Conversion Occurs in Fabric
Depending on workload and tool, you may convert data types in:
Power Query (Dataflows Gen2 & Lakehouses)
Visual change type steps (Menu → Transform → Data Type)
Applied steps stored in the query
Useful for low-code transformation
SQL (Warehouse & Lakehouse SQL Analytics)
CAST, CONVERT, or TRY_CAST in SQL
Applies at query time or when persisting transformed data
Spark (Lakehouse Notebooks)
Explicit schema definitions
Transformation commands like withColumn() with type conversion functions
Each environment has trade-offs. For example, Power Query is user-friendly but may not scale like SQL or Spark for very large datasets.
How to Convert Data Types
In Power Query
Select the column
Go to Transform → Data Type
Choose the correct type (e.g., Whole Number, Decimal Number, Date)
Power Query generates a Change Type step that applies at refresh.
In SQL
SELECT
CAST(order_amount AS DECIMAL(18,2)) AS order_amount,
CONVERT(DATE, order_date) AS order_date
FROM Sales;
CAST() and CONVERT() are standard.
Some engines support TRY_CAST() to avoid errors on incompatible values.
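For example, a hedged T-SQL sketch using TRY_CAST (the staging table and column names are hypothetical); values that cannot be converted become NULL instead of failing the query:
SELECT
    TRY_CAST(order_amount AS DECIMAL(18,2)) AS order_amount_numeric,
    TRY_CAST(order_date AS DATE) AS order_date_converted
FROM StagingSales;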
When creating semantic models (Power BI datasets):
Data types determine field behavior (e.g., date hierarchies)
Incorrect types can cause:
Incorrect aggregations
Misleading visuals
DAX errors
Always validate types before importing data into the model.
Best Practices
Always validate data values before conversion
Use schema enforcement where possible (e.g., Spark schema)
Avoid implicit type conversions during joins
Keep logs or steps of transformations for reproducibility
Key Takeaways for the DP-600 Exam
Know why data type conversion matters for analytics
Be able to choose the right tool (Power Query / SQL / Spark) for the context
Understand common conversions (text→numeric, text→date, boolean conversion)
Recognize when conversion must occur in the pipeline for correctness and performance
Practice Questions:
Here are 10 questions to test and help solidify your learning and knowledge. As you review these and other questions in your preparation, make sure to …
Expect scenario-based questions rather than direct definitions
Identify and understand why an option is correct (or incorrect), not just which one
Look for and understand the usage scenario of keywords in exam questions to guide you
Keep in mind that if a question mentions unexpected calculations, broken joins, or filtering issues, always consider data type mismatches as a possible root cause.
Question 1
Why is converting column data types important in an analytics solution?
A. It reduces storage costs B. It ensures accurate calculations, joins, and filtering C. It improves report visuals automatically D. It encrypts sensitive data
✅ Correct Answer: B
Explanation: Correct data types ensure accurate aggregations, proper join behavior, correct filtering, and predictable sorting.
Question 2
Which Fabric tool provides a visual, low-code interface for changing column data types?
A. SQL Analytics endpoint B. Spark notebooks C. Power Query D. Eventhouse
✅ Correct Answer: C
Explanation: Power Query allows users to change data types through a graphical interface and automatically records the steps.
Question 3
What is a common risk when converting text values to numeric data types?
A. Increased storage usage B. Duplicate rows C. Conversion errors or null values D. Slower report rendering
✅ Correct Answer: C
Explanation: Text values that are not valid numbers may cause conversion failures or be converted to nulls, depending on the method used.
Question 4
Which SQL function safely attempts to convert a value and returns NULL if conversion fails?
A. CAST B. CONVERT C. TRY_CAST D. FORMAT
✅ Correct Answer: C
Explanation: TRY_CAST avoids query failures by returning NULL when a value cannot be converted.
Question 5
When should data types ideally be converted in a Fabric analytics pipeline?
A. At report query time B. After publishing reports C. Early in the transformation process D. Only in the semantic model
✅ Correct Answer: C
Explanation: Converting data types early prevents downstream issues in joins, aggregations, and semantic models.
Question 6
Which data type is most appropriate for calendar-based filtering and time intelligence?
A. Text B. Integer C. Date or DateTime D. Boolean
✅ Correct Answer: C
Explanation: Date and DateTime types enable proper time-based filtering, hierarchies, and time intelligence calculations.
Question 7
Which Spark operation converts a column’s data type?
A. changeType() B. convert() C. cast() D. toType()
✅ Correct Answer: C
Explanation: The cast() method is used in Spark to convert a column’s data type.
Question 8
Why can implicit data type conversion during joins be problematic?
A. It improves performance B. It hides data lineage C. It may cause incorrect matches or slow performance D. It automatically removes duplicates
✅ Correct Answer: C
Explanation: Implicit conversions can prevent index usage and lead to incorrect or inefficient joins.
Question 9
A numeric column is stored as text and sorts incorrectly (e.g., 1, 10, 2). What is the cause?
A. Incorrect aggregation B. Missing values C. Wrong data type D. Duplicate rows
✅ Correct Answer: C
Explanation: Text sorting is lexicographical, not numeric, leading to incorrect ordering.
Question 10
What is the impact of incorrect data types in a Power BI semantic model?
A. Only visuals are affected B. Aggregations, filters, and DAX behavior may be incorrect C. Reports fail to load D. Sensitivity labels are removed
✅ Correct Answer: B
Explanation: Data types influence how fields behave in calculations, visuals, and DAX expressions.
Ensuring data quality is foundational for reliable analytics. Duplicate records, missing values, and nulls can lead to inaccurate aggregations, misleading insights, and broken joins. Microsoft Fabric provides multiple tools and techniques to identify, investigate, and resolve these issues during data preparation.
Why Data Quality Matters
Poor data quality can cause:
Incorrect business metrics (e.g., inflated counts)
Failed joins or mismatches
Incorrect aggregates or KPIs
Discrepancies across reports
The DP-600 exam expects you to know how to detect and fix these issues using Fabric’s transformation tools — without degrading performance or losing important data.
Key Data Quality Issues
1. Duplicate Data
Duplicates occur when the same record appears multiple times. Common causes:
Repeated ingestion jobs
Incorrect joins
Source system errors
Impact of duplicates:
Inflated metrics
Misleading counts
Distorted analytics
2. Missing Data
Missing data refers to the complete absence of expected rows for certain categories or time periods.
Examples:
No sales records for a specific store in a date range
Missing customer segments
Impact:
Bias in analysis
Understated performance
3. Null Values
Nulls represent unknown or undefined values in a dataset.
Common cases:
Missing customer name
Missing numeric values
Unpopulated fields in incomplete records
Consequences:
SQL functions may ignore nulls
Aggregations may be skewed
Joins may fail or produce incorrect results
Tools and Techniques in Microsoft Fabric
1. Power Query (Dataflows Gen2 / Lakehouse)
Power Query provides a visual and programmatic interface to clean data:
Remove duplicates: Home → Remove Rows → Remove Duplicates
Replace or fill nulls: Transform → Replace Values, or use Fill Up / Fill Down
Filter nulls: Filter rows where column is null or not null
Benefits:
No-code/low-code
Reusable transformation steps
Easy preview and validation
2. SQL (Warehouses / Lakehouse SQL Analytics)
Using SQL, you can identify and fix issues:
Detect duplicates:
SELECT Col1, Col2, COUNT(*) AS Cnt
FROM table
GROUP BY Col1, Col2
HAVING COUNT(*) > 1;
Remove duplicates (example pattern):
WITH RankedRows AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY keycol ORDER BY keycol) AS rn
FROM table
)
SELECT * FROM RankedRows WHERE rn = 1;
Replace nulls:
SELECT COALESCE(column, 0) AS column_fixed
FROM table;
3. Spark (Lakehouses via Notebooks)
Identify nulls:
df.filter(df["column"].isNull()).show()
Drop duplicates:
df.dropDuplicates(["keycol"])
Fill nulls:
df.na.fill({"column": "Unknown"})
Best Practices for Resolution
Addressing Duplicates
Use business keys (unique identifiers) to define duplicates
Validate whether duplicates are true duplicates or legitimate repeats
Document deduplication logic
Handling Nulls
Use domain knowledge to decide substitute values
Zero for numeric
“Unknown” or “Not Provided” for text
Preserve nulls when they carry meaning (e.g., missing responses)
Handling Missing Data
Understand the business meaning
Is absence valid?
Should data be imputed?
Or should missing rows be generated via reference tables?
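One common pattern for the reference-table approach is to anti join a complete date (or entity) list against the fact data to surface the gaps. A hedged sketch, assuming a DimDate reference table and a Sales table:
SELECT d.DateKey
FROM DimDate AS d
LEFT JOIN Sales AS s
    ON s.OrderDate = d.DateKey
WHERE s.OrderDate IS NULL
    AND d.DateKey BETWEEN '2025-01-01' AND '2025-12-31';
Each returned DateKey is a date with no sales rows, which you can then investigate or impute.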
Data Profiling
Use profiling to understand distributions and quality:
Column completeness
Unique value distribution
Null frequency
Data profiling helps you decide which cleaning steps are required.
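A lightweight SQL profiling sketch that approximates these checks (table and column names are assumptions):
SELECT
    COUNT(*) AS TotalRows,
    COUNT(CustomerName) AS NonNullNames,   -- COUNT(column) ignores NULLs
    COUNT(DISTINCT CustomerID) AS DistinctCustomers,
    SUM(CASE WHEN SalesAmount IS NULL THEN 1 ELSE 0 END) AS NullSalesAmounts
FROM Sales;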
When to Clean Data in Fabric
Data quality transformations should be performed:
Early in the pipeline (at the ingestion or transformation layer)
Before building semantic models
Before aggregations or joins
Before publishing curated datasets
Early cleaning prevents issues from propagating into semantic models and reports.
Exam Scenarios
In DP-600 exam questions, you might see scenarios like:
Metrics appear inflated due to duplicate records
Reports show missing date ranges
Joins fail due to null key values
Aggregations ignore null values
Your job is to choose the correct transformation action — e.g., filtering nulls, deduplicating, replacing values, or imputing missing data — and the best tool (Power Query vs SQL vs Spark).
Key Takeaways
Duplicate rows inflate counts and distort analytics.
Missing rows can bias time-series or segment analysis.
Null values can break joins and cause incorrect aggregation results.
Effective resolution relies on understanding business context and using the right Fabric tools.
Clean data early for better downstream performance and governance.
Final Exam Tip: If a metric doesn’t look right, think data quality first. Missing or null values and duplicates are among the most common real-world issues covered in DP-600 scenarios.
Practice Questions:
Here are 10 questions to test and help solidify your learning and knowledge. As you review these and other questions in your preparation, make sure to …
Identify and understand why an option is correct (or incorrect), not just which one
Look for and understand the usage scenario of keywords in exam questions to guide you
Expect scenario-based questions rather than direct definitions
General Exam Tips for this section: If something looks wrong in a report:
Too high? → Check for duplicates
Blank or missing? → Check for nulls or missing rows
Not joining correctly? → Check nulls and key integrity
Question 1
Which issue is most likely to cause inflated totals in aggregated metrics?
A. Null values in numeric columns B. Missing rows for a time period C. Duplicate records D. Incorrect column data types
✅ Correct Answer: C
Explanation: Duplicate records result in the same data being counted more than once, which inflates sums, counts, and averages.
Question 2
In Power Query, which action is used to remove duplicate rows?
A. Filter Rows B. Group By C. Remove Duplicates D. Replace Values
✅ Correct Answer: C
Explanation: The Remove Duplicates operation removes repeated rows based on selected columns.
Question 3
Which SQL function is commonly used to replace null values with a default value?
A. NULLIF B. ISNULL or COALESCE C. COUNT D. CAST
✅ Correct Answer: B
Explanation: ISNULL() and COALESCE() return a specified value when a column contains NULL.
Question 4
Why can null values cause problems in joins?
A. Nulls increase query runtime B. Nulls are treated as zero C. Nulls never match other values D. Nulls are automatically filtered
✅ Correct Answer: C
Explanation: NULL values do not match any value (including other NULLs), which can cause rows to be excluded from join results.
Question 5
Which scenario best justifies keeping null values rather than replacing them?
A. The column is used in joins B. The null indicates “unknown” or “not applicable” C. The column is numeric D. The column has duplicates
✅ Correct Answer: B
Explanation: Nulls may carry important business meaning and should be preserved when they accurately represent missing or unknown information.
Question 6
Which Fabric tool is most appropriate for visual data profiling to identify missing and null values?
A. Power BI visuals B. Power Query C. Semantic models D. Eventhouse
✅ Correct Answer: B
Explanation: Power Query provides built-in data profiling features such as column distribution, column quality, and column profile.
Question 7
What is the purpose of using an anti join when checking data quality?
A. To merge tables B. To append data C. To identify unmatched records D. To replace null values
✅ Correct Answer: C
Explanation: Anti joins return rows that do not have a match in another table, making them ideal for identifying missing or orphaned records.
Question 8
Which approach is considered a best practice for handling data quality issues?
A. Fix issues only in reports B. Clean data as late as possible C. Resolve issues early in the pipeline D. Ignore null values
✅ Correct Answer: C
Explanation: Resolving data quality issues early prevents them from propagating into semantic models and reports.
Question 9
Which Spark operation removes duplicate rows from a DataFrame?
A. filter() B. groupBy() C. dropDuplicates() D. distinctColumns()
✅ Correct Answer: C
Explanation: dropDuplicates() removes duplicate rows based on one or more columns.
Question 10
A report is missing values for several dates. What is the most likely cause?
A. Duplicate rows B. Incorrect aggregation logic C. Missing source data D. Incorrect data type conversion
✅ Correct Answer: C
Explanation: Missing dates usually indicate that source records are absent rather than null or duplicated.
Merging or joining data is a fundamental transformation task in Microsoft Fabric. It enables you to combine related data from multiple tables or sources into a single dataset for analysis, modeling, or reporting. This skill is essential for preparing clean, well-structured data in lakehouses, warehouses, dataflows, and Power BI semantic models.
For the DP-600 exam, you are expected to understand when, where, and how to merge or join data using the appropriate Fabric tools, as well as the implications for performance, data quality, and modeling.
Merge vs. Join: Key Distinction
Although often used interchangeably, the terms have slightly different meanings depending on the tool:
Merge
Commonly used in Power Query
Combines tables by matching rows based on one or more key columns
Produces a new column that can be expanded
Join
Commonly used in SQL and Spark
Combines tables using explicit join logic (JOIN clauses)
Output schema is defined directly in the query
Where Merging and Joining Occur in Fabric
Power Query (Dataflows Gen2, Lakehouse): Merge Queries UI
Warehouse: SQL JOIN statements
Lakehouse (Spark notebooks): DataFrame joins
Power BI Desktop: Power Query merges
Common Join Types (Exam-Critical)
Understanding join types is heavily tested:
Inner Join
Returns only matching rows from both tables
Left Outer Join
Returns all rows from the left table and matching rows from the right
Right Outer Join
Returns all rows from the right table and matching rows from the left
Full Outer Join
Returns all rows from both tables
Left Anti / Right Anti Join
Returns rows with no match in the other table
👉 Exam tip: Anti joins are commonly used to identify missing or unmatched data.
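Most SQL engines have no explicit ANTI JOIN keyword; a common equivalent is a left join that keeps only the unmatched rows. A hedged sketch, assuming Orders and Customers tables:
SELECT o.OrderID, o.CustomerID
FROM Orders AS o
LEFT JOIN Customers AS c
    ON o.CustomerID = c.CustomerID
WHERE c.CustomerID IS NULL;   -- orders whose customer key has no match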
Join Keys and Data Quality Considerations
Before merging or joining data, it’s critical to ensure:
Join columns:
Have matching data types
Are cleaned and standardized
Represent the same business entity
Duplicate values in join keys can:
Create unexpected row multiplication
Impact aggregations and performance
Performance and Design Considerations
Prefer SQL joins or Spark joins for large datasets rather than Power Query
Filter and clean data before joining to reduce data volume
In dimensional modeling:
Fact tables typically join to dimension tables using left joins
Avoid unnecessary joins in the semantic layer when they can be handled upstream
Common Use Cases
Combining fact data with descriptive attributes
Enriching transactional data with reference or lookup tables
Building dimension tables for star schema models
Validating data completeness using anti joins
Exam Tips and Pitfalls
Don’t confuse merge vs. append (append stacks rows vertically)
Know which tool to use based on:
Data size
Refresh frequency
Complexity
Expect scenario questions asking:
Which join type to use
Where the join should occur in the architecture
Key Takeaways
Merging and joining data is essential for data preparation in Fabric
Different Fabric experiences offer different ways to join data
Correct join type and clean join keys are critical for accuracy
Performance and modeling best practices matter for the DP-600 exam
Practice Questions:
Here are 10 questions to test and help solidify your learning and knowledge. As you review these and other questions in your preparation, make sure to …
Identify and understand why an option is correct (or incorrect), not just which one
Look for and understand the usage scenario of keywords in exam questions to guide you
Expect scenario-based questions rather than direct definitions
Question 1
What is the primary purpose of merging or joining data in Microsoft Fabric?
A. To reduce storage costs B. To vertically stack tables C. To combine related data based on a common key D. To encrypt sensitive columns
✅ Correct Answer: C
Explanation: Merging or joining data combines related datasets horizontally using shared key columns so that related attributes appear in a single dataset.
Question 2
In Power Query, what is the result of a Merge Queries operation?
A. Rows from both tables are appended B. A new table is automatically created C. A new column containing related table data is added D. A relationship is created in the semantic model
✅ Correct Answer: C
Explanation: Power Query merges add a column that contains matching rows from the second table, which can then be expanded.
Question 3
Which join type returns only rows that exist in both tables?
A. Left outer join B. Right outer join C. Full outer join D. Inner join
✅ Correct Answer: D
Explanation: An inner join returns only rows with matching keys in both tables.
Question 4
You want to keep all rows from a fact table and bring in matching dimension attributes. Which join type should you use?
A. Inner join B. Left outer join C. Right outer join D. Full outer join
✅ Correct Answer: B
Explanation: A left outer join preserves all rows from the left (fact) table while bringing in matching rows from the dimension table.
Question 5
Which join type is most useful for identifying records that do not have a match in another table?
A. Inner join B. Full outer join C. Left anti join D. Right outer join
✅ Correct Answer: C
Explanation: A left anti join returns rows from the left table that do not have matching rows in the right table, making it ideal for data quality checks.
Question 6
What issue can occur when joining tables that contain duplicate values in the join key?
A. Data type conversion errors B. Row multiplication C. Data loss D. Query failure
✅ Correct Answer: B
Explanation: Duplicate keys can cause one-to-many or many-to-many matches, resulting in more rows than expected after the join.
Question 7
Which Fabric experience is best suited for performing joins on very large datasets?
A. Power BI Desktop B. Power Query C. Warehouse using SQL D. Excel
✅ Correct Answer: C
Explanation: SQL joins in a warehouse are optimized for large-scale data processing and typically outperform Power Query for large datasets.
Question 8
Which operation should not be confused with merging or joining data?
A. Append B. Inner join C. Left join D. Anti join
✅ Correct Answer: A
Explanation: Append stacks tables vertically (row-wise), while merges and joins combine tables horizontally (column-wise).
Question 9
What should you verify before merging two tables?
A. That both tables have the same number of rows B. That join columns use compatible data types C. That all columns are indexed D. That the tables are in the same workspace
✅ Correct Answer: B
Explanation: Join columns must have compatible data types and clean values; otherwise, matches may fail or produce incorrect results.
Question 10
From a modeling best-practice perspective, where should complex joins ideally be performed?
A. In Power BI visuals B. In DAX measures C. Upstream in lakehouse or warehouse transformations D. At query time in reports
✅ Correct Answer: C
Explanation: Performing joins upstream simplifies semantic models, improves performance, and ensures consistency across reports.
Data denormalization is a transformation strategy that restructures data to improve query performance and simplify analytics—especially in analytical workloads such as reporting, dashboarding, and BI. In Microsoft Fabric, denormalization plays a key role when preparing data for efficient consumption in lakehouses, warehouses, and semantic models.
This article explains what denormalization means, why it’s important for analytics, how to implement it in Fabric, and when to use it versus normalized structures.
What Is Denormalization?
Denormalization is the process of combining data from multiple tables or sources into a single, flattened structure. The goal is to reduce the number of joins and simplify querying at the expense of some redundancy.
In contrast:
Normalized data avoids redundancy by splitting data into many related tables.
Denormalized data often duplicates data intentionally to speed up analytical queries.
Why Denormalize Data for Analytics?
Denormalization is widely used in analytics because it:
Improves query performance: Fewer joins mean faster queries—especially for BI tools like Power BI.
Simplifies report logic: Flattened tables make it easier for report authors to understand and use data.
Reduces semantic model complexity: Fewer relationships and tables can improve both model performance and maintainability.
Denormalization is typically a good fit when:
Preparing data for semantic models that are consumed by BI tools
Performance is critical for dashboards and reports.
Data rarely changes (or changes can be managed with refresh logic).
Users require self-service analytics with minimal SQL complexity.
Where to Denormalize in Microsoft Fabric
Denormalization can be implemented in different Fabric components depending on workload and transformation needs:
1. Dataflows Gen2
Use Power Query to merge tables and create flattened structures
Ideal for low-code scenarios targeting OneLake
Great for building reusable tables
2. Lakehouses
Use Spark SQL or T-SQL to perform joins and build denormalized tables
Useful for large-scale ELT transformations
3. Warehouse
Use SQL to create flattened analytic tables optimized for BI
Supports indexing and performance tuning
4. Notebooks
Use PySpark or Spark SQL for complex or iterative denormalization logic
How to Denormalize Data
Typical Techniques
Merge or Join tables: Combine fact and dimension tables into a single analytic table
Pre-compute derived values: Compute metrics or concatenated fields ahead of time
Flatten hierarchies: Add attributes from parent tables directly into child records
Pivot or unpivot: Adjust layout to match analytics needs
Example (Conceptual Join)
Instead of joining these tables at query time:
SELECT
s.SalesID,
d.CustomerName,
p.ProductName
FROM FactSales s
JOIN DimCustomer d ON s.CustomerID = d.CustomerID
JOIN DimProduct p ON s.ProductID = p.ProductID;
Create a denormalized “SalesAnalytics” table:
SELECT
s.SalesID,
s.SalesDate,
d.CustomerName,
p.ProductName,
s.SalesAmount
INTO DenormSalesAnalytics
FROM FactSales s
JOIN DimCustomer d ON s.CustomerID = d.CustomerID
JOIN DimProduct p ON s.ProductID = p.ProductID;
This single table can then be queried directly by BI tools without joins.
Trade-Offs of Denormalization
While denormalization improves performance and simplicity, it also introduces trade-offs:
Pros
Faster, simpler queries
Better analytics experience
Easier semantic model design
Cons
Data redundancy
Larger storage footprint
More complex refresh and update logic
Higher maintenance if source schemas change
Integrating Denormalization with Semantic Models
Denormalized tables are often used as sources for Power BI semantic models to:
Reduce row-level relationships
Improve report refresh times
Simplify model structure
Support consistent business metrics
Because semantic models work best with wide tables and straightforward relationships, denormalized sources are ideal.
Best Practices for Denormalization
Denormalize only where it delivers clear performance or usability benefits
Document transformation logic for future maintainability
Use pipelines or Dataflows Gen2 for repeatable and auditable ELT flows
Monitor refresh performance and adjust partitions or indexes
When Not to Denormalize
Avoid denormalization when:
Data integrity rules are strict and must avoid redundancy
Source systems change frequently
You are performing OLTP-style operations (transactional systems)
Storage and refresh cost outweigh performance gains
What to Know for the DP-600 Exam
You should be comfortable with:
The definition and purpose of denormalization
Recognizing when it’s appropriate in analytics workloads
How to implement denormalization in Fabric components
The trade-offs involved in denormalizing data
How denormalized structures optimize semantic models and BI
Final Exam Tip: If a question emphasizes reducing joins, improving query performance, and simplifying reporting, you’re likely dealing with denormalization. If it emphasizes transactional integrity and normalized structures, that’s not the scenario for denormalization.
Practice Questions:
Here are 10 questions to test and help solidify your learning and knowledge. As you review these and other questions in your preparation, make sure to …
Identify and understand why an option is correct (or incorrect), not just which one
Look for and understand the usage scenario of keywords in exam questions to guide you
Expect scenario-based questions rather than direct definitions
1. What is the primary purpose of denormalizing data for analytics workloads?
A. Reduce data duplication B. Improve transactional integrity C. Improve query performance and simplify analytics D. Enforce strict normalization rules
Correct Answer: C
Explanation: Denormalization intentionally introduces redundancy to reduce joins, simplify queries, and improve performance—key requirements for analytics and BI workloads.
2. Which type of workload benefits most from denormalized data?
A. OLTP transaction processing B. Real-time device telemetry ingestion C. BI reporting and dashboarding D. Application logging
Correct Answer: C
Explanation: BI reporting and analytics benefit from flattened, denormalized structures because they reduce query complexity and improve performance.
3. What is a common technique used to denormalize data?
A. Normalizing dimension tables B. Splitting wide tables into smaller ones C. Merging multiple related tables into one D. Removing foreign keys
Correct Answer: C
Explanation: Denormalization commonly involves merging fact and dimension data into a single table to reduce the need for joins during querying.
4. Which Microsoft Fabric component is best suited for low-code denormalization?
A. Notebooks B. SQL analytics endpoint C. Dataflows Gen2 D. Eventhouse
Correct Answer: C
Explanation: Dataflows Gen2 use Power Query to perform low-code transformations such as merging tables and creating flattened datasets.
5. What is a key trade-off introduced by denormalization?
A. Reduced query performance B. Increased data redundancy C. Reduced storage reliability D. Loss of query flexibility
Correct Answer: B
Explanation: Denormalization duplicates data across rows or tables, which increases redundancy and can complicate updates and refresh processes.
6. Why is denormalized data often used as a source for Power BI semantic models?
A. Power BI cannot handle relationships B. Denormalized tables simplify models and improve performance C. Semantic models require flattened data only D. Denormalized data reduces licensing costs
Correct Answer: B
Explanation: Flattened tables reduce the number of relationships and joins, improving performance and making semantic models easier to design and maintain.
7. In which scenario should denormalization generally be avoided?
A. Preparing a reporting data mart B. Building a self-service analytics dataset C. Supporting frequent transactional updates D. Optimizing dashboard query speed
Correct Answer: C
Explanation: Denormalization is not ideal for transactional systems where frequent updates and strict data integrity are required.
8. Where is denormalization commonly implemented in Microsoft Fabric?
A. User interface settings B. Workspace-level permissions C. Lakehouses, warehouses, and Dataflows Gen2 D. Real-Time hub only
Correct Answer: C
Explanation: Denormalization is a data transformation task typically implemented in Fabric lakehouses, warehouses, notebooks, or Dataflows Gen2.
9. What is a common benefit of denormalizing data earlier in the data pipeline?
A. Reduced need for data validation B. Improved consistency across analytics assets C. Automatic enforcement of row-level security D. Lower data ingestion costs
Correct Answer: B
Explanation: Denormalizing upstream ensures that all downstream analytics assets consume the same enriched and flattened datasets, improving consistency.
10. Which phrase best indicates that denormalization is an appropriate solution?
A. “Strict transactional consistency is required” B. “Data must be updated in real time per record” C. “Queries require many joins and are slow” D. “Source systems change frequently”
Correct Answer: C
Explanation: Denormalization is commonly applied when complex joins cause performance issues and simplified querying is required.
Designing and implementing an effective schema is foundational to efficient analytics. In Microsoft Fabric, structuring your data into a star schema dramatically improves query performance, simplifies reporting, and aligns with best practices for BI workloads.
This article explains what a star schema is, why it matters in Fabric, and how to implement it in a lakehouse or data warehouse.
What Is a Star Schema?
A star schema is a relational modeling technique that organizes data into two primary types of tables:
Fact tables: Contain measurable, quantitative data (metrics, transactions, events).
Dimension tables: Describe contextual attributes. Common characteristics:
Contain descriptive fields
Usually smaller than fact tables
Often used for filtering/grouping
Examples: Customer, product, date, geography
Implementing a Star Schema in a Lakehouse
Lakehouses in Fabric support Delta format tables and both Spark SQL and T-SQL analytics endpoints.
Steps to Implement:
Ingest raw data into your lakehouse (as files or staging tables).
Transform data:
Cleanse and conform fields
Derive business keys
Create dimension tables:
Deduplicate
Add descriptive attributes
Create fact tables:
Join transactional data to dimension keys
Store numeric measures
Optimize:
Partition and Z-ORDER for performance
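As a minimal Spark SQL sketch of the dimension and fact steps above (table and column names are assumptions, not a prescribed pattern):
-- Dimension: deduplicated descriptive attributes
CREATE TABLE dim_customer AS
SELECT DISTINCT CustomerID, CustomerName, Country
FROM raw_sales;
-- Fact: keys and measures at the chosen grain
CREATE TABLE fact_sales AS
SELECT SalesID, CustomerID, ProductID, OrderDate, SalesAmount, Quantity
FROM raw_sales;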
Tools You Might Use:
Notebooks (PySpark)
Lakehouse SQL
Data pipelines
Exam Tip: Lakehouses are ideal when you need flexibility, schema evolution, or combined batch + exploratory analytics.
Implementing a Star Schema in a Warehouse
Data warehouses in Fabric provide a SQL-optimized store designed for BI workloads.
Steps to Implement:
Stage raw data in warehouse tables
Build conforming dimension tables
Build fact tables with proper keys
Add constraints and indexes (as appropriate)
Optimize with materialized views or aggregations
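In a warehouse, the same shapes can be built with T-SQL; a hedged sketch of creating a fact table that looks up a surrogate key from a dimension (object names are illustrative):
CREATE TABLE dbo.FactSales AS
SELECT
    s.SalesID,
    d.CustomerKey,          -- surrogate key from the dimension
    s.OrderDate,
    s.SalesAmount
FROM staging.Sales AS s
JOIN dbo.DimCustomer AS d
    ON s.CustomerID = d.CustomerID;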
Warehouse advantages:
Strong query performance for BI
Native SQL analytics
Excellent integration with Power BI and semantic models
Exam Tip: Choose a warehouse when your priority is high-performance BI analytics with well-defined dimensional models.
Common Star Schema Patterns
Conformed Dimensions
Dimensions shared across multiple fact tables
Ensures consistent filtering and reporting across business processes
Slowly Changing Dimensions (SCD)
Maintain historical attribute changes
Types include Type 1 (overwrite) and Type 2 (versioning)
Fact Table Grain
Define the “grain” (level of detail) clearly—for example, “one row per sales transaction.”
Star Schema and Power BI Semantic Models
Semantic models often sit on top of star schemas:
Fact tables become measure containers
Dimensions become filtering hierarchies
Reduces DAX complexity
Improves performance
Best Practice: Structure your lakehouse or warehouse into a star schema before building the semantic model.
Star Schema in Lakehouse vs Warehouse
Query engines: Spark & SQL (lakehouse) vs. SQL only (warehouse)
Best for: Mixed workloads combining big data and SQL (lakehouse) vs. BI & reporting (warehouse)
Optimization: Partitioning/Z-ORDER (lakehouse) vs. indexing and statistics (warehouse)
Tooling: Notebooks and pipelines (lakehouse) vs. SQL scripts and BI artifacts (warehouse)
Schema complexity: Flexible (lakehouse) vs. rigid (warehouse)
Governance and Performance Considerations
Use consistent keys across facts and dimensions
Validate referential integrity where possible
Avoid wide, unindexed tables for BI queries
Apply sensitivity labels on schemas for governance
Document schema and business logic
What to Know for the DP-600 Exam
Be prepared to:
Explain the purpose of star schema components
Identify when to implement star schema in lakehouses vs warehouses
Recognize patterns like conformed dimensions and SCDs
Understand performance implications of schema design
Relate star schema design to Power BI and semantic models
Final Exam Tip: If the question emphasizes high-performance reporting, simple joins, and predictable filtering, think star schema. If it mentions big data exploration or flexible schema evolution, star schema in a lakehouse may be part of the answer.
Practice Questions:
Here are 10 questions to test and help solidify your learning and knowledge. As you review these and other questions in your preparation, make sure to …
Identify and understand why an option is correct (or incorrect), not just which one
Look for and understand the usage scenario of keywords in exam questions to guide you
Expect scenario-based questions rather than direct definitions
1. What is the defining characteristic of a star schema?
A. Multiple fact tables connected through bridge tables B. A central fact table connected directly to dimension tables C. Fully normalized transactional tables D. A schema optimized for OLTP workloads
Correct Answer: B
Explanation: A star schema consists of a central fact table directly linked to surrounding dimension tables, forming a star-like structure optimized for analytics.
2. Which type of data is stored in a fact table?
A. Descriptive attributes such as names and categories B. Hierarchical metadata for navigation C. Quantitative, measurable values D. User access permissions
Correct Answer: C
Explanation: Fact tables store numeric measures (e.g., sales amount, quantity) that are aggregated during analytical queries.
3. Which table type is typically smaller and used for filtering and grouping?
A. Fact table B. Dimension table C. Bridge table D. Staging table
Correct Answer: B
Explanation: Dimension tables store descriptive attributes and are commonly used for filtering, grouping, and slicing fact data in reports.
4. Why are star schemas preferred for Power BI semantic models?
A. They eliminate the need for relationships B. They align naturally with BI tools and optimize query performance C. They reduce OneLake storage usage D. They replace DAX calculations
Correct Answer: B
Explanation: Power BI and other BI tools are optimized for star schemas, which simplify joins, reduce model complexity, and improve performance.
5. When implementing a star schema in a Fabric lakehouse, which storage format is typically used?
A. CSV B. JSON C. Parquet D. Delta
Correct Answer: D
Explanation: Fabric lakehouses store tables in Delta format, which supports ACID transactions and efficient analytical querying.
6. Which scenario most strongly suggests using a warehouse instead of a lakehouse for a star schema?
A. Schema evolution and exploratory data science B. High-performance, SQL-based BI reporting C. Streaming ingestion of real-time events D. Semi-structured data exploration
Correct Answer: B
Explanation: Fabric warehouses are optimized for SQL-based analytics and BI workloads, making them ideal for star schemas supporting reporting scenarios.
7. What does the “grain” of a fact table describe?
A. The number of dimensions in the table B. The level of detail represented by each row C. The size of the table in storage D. The indexing strategy
Correct Answer: B
Explanation: The grain defines the level of detail for each row in the fact table (e.g., one row per transaction or per day).
8. What is a conformed dimension?
A. A dimension used by only one fact table B. A dimension that contains only numeric values C. A shared dimension used consistently across multiple fact tables D. A dimension generated dynamically at query time
Correct Answer: C
Explanation: Conformed dimensions are shared across multiple fact tables, enabling consistent filtering and reporting across different business processes.
9. Which design choice improves performance when querying star schemas?
A. Highly normalized dimension tables B. Complex many-to-many relationships C. Simple joins between fact and dimension tables D. Storing dimensions inside the fact table
Correct Answer: C
Explanation: Star schemas minimize join complexity by using simple, direct relationships between facts and dimensions, improving query performance.
10. Which statement best describes how star schemas fit into the Fabric analytics lifecycle?
A. They replace semantic models entirely B. They are used only for real-time analytics C. They provide an analytics-ready structure for reporting and modeling D. They are required only for data ingestion
Correct Answer: C
Explanation: Star schemas organize data into an analytics-ready structure that supports semantic models, reporting, and scalable BI workloads.
Creating views, functions, and stored procedures is a core data transformation and modeling skill for analytics engineers working in Microsoft Fabric. These objects help abstract complexity, improve reusability, enforce business logic, and optimize downstream analytics and reporting.
This section of the DP-600 exam focuses on when, where, and how to use these objects effectively across Fabric components such as Lakehouses, Warehouses, and SQL analytics endpoints.
Views
What are Views?
A view is a virtual table defined by a SQL query. It does not store data itself but presents data dynamically from underlying tables.
Where Views Are Used in Fabric
Fabric Data Warehouse
Lakehouse SQL analytics endpoint
Exposed to Power BI semantic models and other consumers
Common Use Cases
Simplify complex joins and transformations
Present curated, analytics-ready datasets
Enforce column-level or row-level filtering logic
Provide a stable schema over evolving raw data
Key Characteristics
Always reflect the latest data
Can be used like tables in SELECT statements
Improve maintainability and readability
Can support security patterns when combined with permissions
Exam Tip
Know that views are ideal for logical transformations, not heavy compute or data persistence.
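For instance, a hedged sketch of a curated view over a sales table (object names are illustrative):
CREATE VIEW dbo.vw_SalesSummary AS
SELECT
    ProductCategory,
    OrderDate,
    SUM(SalesAmount) AS TotalSales
FROM dbo.Sales
GROUP BY ProductCategory, OrderDate;
Consumers can then query vw_SalesSummary like a table while the shaping logic stays in one place.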
Functions
What are Functions?
Functions encapsulate reusable logic and return a value or a table. They help standardize calculations and transformations across queries.
Types of Functions (SQL)
Scalar functions: Return a single value (e.g., formatted date, calculated metric)
Table-valued functions (TVFs): Return a result set that behaves like a table
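The sketches below show both forms in standard T-SQL. The object names, data types, and logic are hypothetical, and support for specific function types can vary by Fabric workload, so treat these as illustrative patterns rather than a prescribed implementation.
Example (illustrative):
-- Scalar function: returns a single value per call
CREATE FUNCTION dbo.fn_AdjustedRevenue (@SalesAmount DECIMAL(18,2))
RETURNS DECIMAL(18,2)
AS
BEGIN
    -- Hypothetical uplift factor applied consistently wherever the function is used
    RETURN @SalesAmount * 1.1;
END;

-- Inline table-valued function: returns a result set usable in a FROM clause
CREATE FUNCTION dbo.fn_SalesByCountry (@Country VARCHAR(50))
RETURNS TABLE
AS
RETURN
(
    SELECT OrderDate, SalesAmount
    FROM dbo.Sales
    WHERE Country = @Country
);

Usage:
SELECT dbo.fn_AdjustedRevenue(SalesAmount) AS AdjustedRevenue FROM dbo.Sales;
SELECT OrderDate, SalesAmount FROM dbo.fn_SalesByCountry('USA');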
Where Functions Are Used in Fabric
Fabric Warehouses
SQL analytics endpoints for Lakehouses
Common Use Cases
Standardized business calculations
Reusable transformation logic
Parameterized filtering or calculations
Cleaner and more modular SQL code
Key Characteristics
Improve consistency across queries
Can be referenced in views and stored procedures
May impact performance if overused in large queries
Exam Tip
Functions promote reuse and consistency, but should be used thoughtfully to avoid performance overhead.
Stored Procedures
What are Stored Procedures?
Stored procedures are precompiled SQL code blocks that can accept parameters and perform multiple operations.
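A minimal sketch of the pattern follows, assuming hypothetical Sales and SalesSummary tables; the procedure accepts a parameter and runs several statements as one unit.
Example (illustrative):
CREATE PROCEDURE dbo.sp_LoadDailySales
    @LoadDate DATE
AS
BEGIN
    -- Remove any rows already loaded for the target date (makes the load repeatable)
    DELETE FROM dbo.SalesSummary
    WHERE SummaryDate = @LoadDate;

    -- Insert aggregated results for the target date
    INSERT INTO dbo.SalesSummary (SummaryDate, Country, TotalSales)
    SELECT @LoadDate, Country, SUM(SalesAmount)
    FROM dbo.Sales
    WHERE OrderDate = @LoadDate
    GROUP BY Country;
END;
It can then be executed, for example as EXEC dbo.sp_LoadDailySales @LoadDate = '2025-01-31'; from a pipeline or scheduled job.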
Where Stored Procedures Are Used in Fabric
Fabric Data Warehouses
SQL endpoints that support procedural logic
Common Use Cases
Complex transformation workflows
Batch processing logic
Conditional logic and control-of-flow (IF/ELSE, loops)
Data loading, validation, and orchestration steps
Key Characteristics
Can perform multiple SQL statements
Can accept input and output parameters
Improve performance by reducing repeated compilation
Support automation and operational workflows
Exam Tip
Stored procedures are best for procedural logic and orchestration, not ad-hoc analytics queries.
Choosing Between Views, Functions, and Stored Procedures
Object               Best Used For
Views                Simplifying data access and shaping datasets
Functions            Reusable calculations and logic
Stored Procedures    Complex, parameter-driven workflows
Understanding why you would choose one over another is frequently tested on the DP-600 exam.
Integration with Power BI and Analytics
Views are commonly consumed by Power BI semantic models
Functions help ensure consistent calculations across reports
Stored procedures are typically part of data preparation or orchestration, not directly consumed by reports
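As a rough sketch of how these objects compose (using the hypothetical table-valued function from the Functions section), a view can wrap a parameterized function so that a semantic model connected through the SQL analytics endpoint consumes one stable, governed object:
CREATE VIEW dbo.vw_USASales AS
SELECT OrderDate, SalesAmount
FROM dbo.fn_SalesByCountry('USA');  -- hypothetical TVF defined earlier
The stored procedure pattern shown earlier would instead run as a preparation step before reports read the data.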
Governance and Best Practices
Use clear naming conventions (e.g., vw_, fn_, sp_)
Document business logic embedded in SQL objects
Minimize logic duplication across objects
Apply permissions carefully to control access
Balance reusability with performance considerations
What to Know for the DP-600 Exam
You should be comfortable with:
When to use views vs. functions vs. stored procedures
How these objects support data transformation
Their role in analytics-ready data preparation
How they integrate with Lakehouses, Warehouses, and Power BI
Performance and governance implications
Practice Questions:
Here are 10 questions to test and solidify your knowledge. As you review these and other questions in your preparation, make sure to:
Identify and understand why an option is correct (or incorrect), not just which one it is
Look for keywords in exam questions and understand the usage scenario they signal
Expect scenario-based questions rather than direct definitions
1. What is the primary purpose of creating a view in a Fabric lakehouse or warehouse?
A. To permanently store transformed data B. To execute procedural logic with parameters C. To provide a virtual, query-based representation of data D. To orchestrate batch data loads
Correct Answer: C
Explanation: A view is a virtual table defined by a SQL query. It does not store data but dynamically presents data from underlying tables, making it ideal for simplifying access and shaping analytics-ready datasets.
2. Which Fabric component commonly exposes views directly to Power BI semantic models?
A. Eventhouse B. SQL analytics endpoint C. Dataflow Gen2 D. Real-Time hub
Correct Answer: B
Explanation: The SQL analytics endpoint (for lakehouses and warehouses) exposes tables and views that Power BI semantic models can consume using SQL-based connectivity.
3. When should you use a scalar function instead of a view?
A. When you need to return a dataset with multiple rows B. When you need to encapsulate reusable calculation logic C. When you need to perform batch updates D. When you want to persist transformed data
Correct Answer: B
Explanation: Scalar functions are designed to return a single value and are ideal for reusable calculations such as formatting, conditional logic, or standardized metrics.
4. Which object type can return a result set that behaves like a table?
A. Scalar function B. Stored procedure C. Table-valued function D. View index
Correct Answer: C
Explanation: A table-valued function (TVF) returns a table and can be used in FROM clauses, similar to a view but with parameterization support.
5. Which scenario is the best use case for a stored procedure?
A. Creating a simplified reporting dataset B. Applying row-level filters for security C. Running conditional logic with multiple SQL steps D. Exposing data to Power BI reports
Correct Answer: C
Explanation: Stored procedures are best suited for procedural logic, including conditional branching, looping, and executing multiple SQL statements as part of a workflow.
6. Why are views commonly preferred over duplicating transformation logic in reports?
A. Views improve report rendering speed automatically B. Views centralize and standardize transformation logic C. Views permanently store transformed data D. Views replace semantic models
Correct Answer: B
Explanation: Views allow transformation logic to be defined once and reused consistently across multiple reports and consumers, improving maintainability and governance.
7. What is a potential downside of overusing functions in large SQL queries?
A. Increased storage costs B. Reduced data freshness C. Potential performance degradation D. Loss of security enforcement
Correct Answer: C
Explanation: Functions, especially scalar functions, can negatively impact query performance when used extensively on large datasets due to repeated execution per row.
8. Which object is most appropriate for parameter-driven data preparation steps in a warehouse?
A. View B. Scalar function C. Table D. Stored procedure
Correct Answer: D
Explanation: Stored procedures support parameters, control-of-flow logic, and multiple statements, making them ideal for complex, repeatable data preparation tasks.
9. How do views support governance and security in Microsoft Fabric?
A. By encrypting data at rest B. By defining workspace-level permissions C. By exposing only selected columns or filtered rows D. By controlling OneLake storage access
Correct Answer: C
Explanation: Views can limit the columns and rows exposed to users, helping implement logical data access patterns when combined with permissions and security models.
10. Which statement best describes how these objects fit into Fabric’s analytics lifecycle?
A. They replace Power BI semantic models B. They are primarily used for real-time streaming C. They prepare and standardize data for downstream analytics D. They manage infrastructure-level security
Correct Answer: C
Explanation: Views, functions, and stored procedures play a key role in transforming, standardizing, and preparing data for consumption by semantic models, reports, and analytics tools.
One of the most important architectural decisions a Microsoft Fabric Analytics Engineer must make is selecting the right analytical store for a given workload. For the DP-600 exam, this topic tests your ability to choose between a Lakehouse, Warehouse, or Eventhouse based on data type, query patterns, latency requirements, and user personas.
Overview of the Three Options
Microsoft Fabric provides three primary analytics storage and query experiences:
Option       Primary Purpose
Lakehouse    Flexible analytics on files and tables using Spark and SQL
Warehouse    Enterprise-grade SQL analytics and BI reporting
Eventhouse   Real-time and near-real-time analytics on streaming data
Understanding why and when to use each is critical for DP-600 success.
Lakehouse
What Is a Lakehouse?
A Lakehouse combines the flexibility of a data lake with the structure of a data warehouse. Data is stored in Delta Lake format in OneLake and can be accessed using both Spark and SQL.
When to Choose a Lakehouse
Choose a Lakehouse when you need:
Flexible schema (schema-on-read or schema-on-write)
Both Spark and SQL access to the same data
Raw and curated data stored together in OneLake
Structured tables alongside semi-structured or unstructured files
Data engineering, data science, and experimentation workloads
Exam signal words: flexible, raw data, Spark, data science, experimentation
Warehouse
What Is a Warehouse?
A Warehouse is a fully managed, SQL-first analytical store optimized for business intelligence and reporting. It enforces schema-on-write and provides a traditional relational experience.
When to Choose a Warehouse
Choose a Warehouse when you need:
Strong SQL-based analytics
High-performance reporting
Well-defined schemas and governance
Centralized enterprise BI
Compatibility with Power BI Import or DirectQuery
Key Characteristics
T-SQL only (no Spark)
Optimized for structured data
Best for star/snowflake schemas
Familiar experience for SQL developers
Exam signal words: enterprise BI, reporting, structured, governed, SQL-first
Eventhouse
What Is an Eventhouse?
An Eventhouse is optimized for real-time and streaming analytics, built on KQL (Kusto Query Language). It is designed to handle high-velocity event data.
When to Choose an Eventhouse
Choose an Eventhouse when you need:
Near-real-time or real-time analytics
Streaming data ingestion
Operational or telemetry analytics
Event-based dashboards and alerts
Key Characteristics
Uses KQL for querying
Integrates with Eventstreams
Handles massive ingestion rates
Optimized for time-series data
Exam signal words: streaming, telemetry, IoT, real-time, events
Choosing the Right Option (Exam-Critical)
The DP-600 exam often presents scenarios where multiple options could work, but only one best fits the requirements.
Decision Matrix
Requirement                      Best Choice
Raw + curated data               Lakehouse
Complex Spark transformations    Lakehouse
Enterprise BI reporting          Warehouse
Strong governance and schemas    Warehouse
Streaming or telemetry data      Eventhouse
Near-real-time dashboards        Eventhouse
SQL-only users                   Warehouse
Data science workloads           Lakehouse
Common Exam Scenarios
You may be asked to:
Choose a storage type for a new analytics solution
Migrate from traditional systems to Fabric
Support both engineers and analysts
Enable real-time monitoring
Balance governance with flexibility
Always identify:
Data type (batch vs streaming)
Latency requirements
User personas
Query language
Governance needs
Best Practices to Remember
Use Lakehouse as a flexible foundation for analytics
Use Warehouse for polished, governed BI solutions
Use Eventhouse for real-time operational insights
Avoid forcing one option to handle all workloads
Let business requirements—not familiarity—drive the choice
Key Takeaway
For the DP-600 exam, choosing between a Lakehouse, Warehouse, or Eventhouse is about aligning data characteristics and access patterns with the right Fabric experience. Lakehouses provide flexibility, Warehouses deliver enterprise BI performance, and Eventhouses enable real-time analytics. The correct answer is almost always the one that best fits the scenario constraints.
Practice Questions:
Here are 10 questions to test and solidify your knowledge. As you review these and other questions in your preparation, make sure to:
Identify and understand why an option is correct (or incorrect), not just which one it is
Look for keywords in exam questions and associate them with the exam signal words highlighted in this section
Expect scenario-based questions rather than direct definitions
1. Which Microsoft Fabric component is BEST suited for flexible analytics on both files and tables using Spark and SQL?
A. Warehouse B. Eventhouse C. Lakehouse D. Semantic model
Correct Answer: C
Explanation: A Lakehouse stores data in Delta format in OneLake and supports both Spark and SQL, making it ideal for flexible analytics across files and tables.
2. A team of data scientists needs to experiment with raw and curated data using notebooks. Which option should they choose?
A. Warehouse B. Eventhouse C. Semantic model D. Lakehouse
Correct Answer: D
Explanation: Lakehouses are designed for data engineering and data science workloads, offering Spark-based notebooks and flexible schema handling.
3. Which option is MOST appropriate for enterprise BI reporting with well-defined schemas and strong governance?
A. Lakehouse B. Warehouse C. Eventhouse D. OneLake
Correct Answer: B
Explanation: Warehouses are SQL-first, schema-on-write systems optimized for structured data, governance, and high-performance BI reporting.
4. A solution must support near-real-time analytics on streaming IoT telemetry data. Which Fabric component should be used?
A. Lakehouse B. Warehouse C. Eventhouse D. Dataflow Gen2
Correct Answer: C
Explanation: Eventhouses are optimized for high-velocity streaming data and real-time analytics using KQL.
5. Which query language is primarily used to analyze data in an Eventhouse?
A. T-SQL B. Spark SQL C. DAX D. KQL
Correct Answer: D
Explanation: Eventhouses are built on KQL (Kusto Query Language), which is optimized for querying event and time-series data.
6. A business analytics team requires fast dashboard performance and is familiar only with SQL. Which option best meets this requirement?
A. Lakehouse B. Warehouse C. Eventhouse D. Spark notebook
Correct Answer: B
Explanation: Warehouses provide a traditional SQL experience optimized for BI dashboards and reporting performance.
7. Which characteristic BEST distinguishes a Lakehouse from a Warehouse?
A. Lakehouses support Power BI B. Warehouses store data in OneLake C. Lakehouses support Spark-based processing D. Warehouses cannot be governed
Correct Answer: C
Explanation: Lakehouses uniquely support Spark-based processing, enabling advanced transformations and data science workloads.
8. A solution must store structured batch data and unstructured files in the same analytical store. Which option should be selected?
A. Warehouse B. Eventhouse C. Semantic model D. Lakehouse
Correct Answer: D
Explanation: Lakehouses support both structured tables and unstructured or semi-structured files within the same environment.
9. Which scenario MOST strongly indicates the need for an Eventhouse?
A. Monthly financial reporting B. Slowly changing dimension modeling C. Real-time operational monitoring D. Ad hoc SQL analysis
Correct Answer: C
Explanation: Eventhouses are designed for real-time analytics on streaming data, making them ideal for operational monitoring scenarios.
10. When choosing between a Lakehouse, Warehouse, or Eventhouse on the DP-600 exam, which factor is MOST important?
A. Personal familiarity with the tool B. The default Fabric option C. Data characteristics and latency requirements D. Workspace size
Correct Answer: C
Explanation: DP-600 emphasizes selecting the correct component based on data type (batch vs streaming), latency needs, user personas, and governance—not personal preference.