This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
--> Ingest and transform batch data
--> Group and aggregate data
Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
Grouping and aggregating data are among the most common and important data transformation operations performed by data engineers. Organizations rarely analyze raw transactional data directly. Instead, they summarize, categorize, and aggregate data to create meaningful business metrics such as total sales, average order value, monthly revenue, customer counts, inventory levels, and operational KPIs.
In Microsoft Fabric, grouping and aggregation can be performed using several technologies, including:
- SQL in Fabric Data Warehouses and Lakehouses
- PySpark notebooks
- Dataflows Gen2
- KQL (Kusto Query Language)
- Data pipelines as part of larger ETL/ELT processes
For the DP-700 exam, you should understand:
- Why grouping and aggregation are important
- When to aggregate data
- How to implement aggregations using SQL, PySpark, KQL, and Dataflows Gen2
- Common aggregation functions
- Performance considerations
- Aggregations in dimensional modeling and analytics solutions
Why Group and Aggregate Data?
Raw data often contains millions or billions of records.
For example:
| OrderID | CustomerID | OrderDate | Amount |
|---|---|---|---|
| 1001 | C001 | 2026-01-01 | 250 |
| 1002 | C001 | 2026-01-02 | 150 |
| 1003 | C002 | 2026-01-02 | 300 |
Business users usually want summarized information such as:
| CustomerID | TotalSales |
|---|---|
| C001 | 400 |
| C002 | 300 |
This transformation is accomplished through grouping and aggregation.
Benefits include:
- Faster analytics
- Reduced storage requirements
- Easier reporting
- Improved dashboard performance
- Simplified business intelligence models
Understanding Grouping
Grouping combines records that share common values.
Examples:
Group by:
- Customer
- Product
- Region
- Department
- Date
- Month
- Year
Example:
| Region | Sales |
|---|---|
| East | 100 |
| East | 200 |
| West | 300 |
Grouped by Region:
| Region | TotalSales |
|---|---|
| East | 300 |
| West | 300 |
The GROUP BY operation creates logical categories before aggregation calculations occur.
Common Aggregation Functions
Data engineers should be familiar with the most common aggregation functions.
SUM()
Calculates totals.
Example:
SELECT Region, SUM(SalesAmount) AS TotalSalesFROM SalesGROUP BY Region;
Result:
| Region | TotalSales |
|---|---|
| East | 500000 |
| West | 750000 |
COUNT()
Counts rows.
SELECT Region, COUNT(*) AS OrderCountFROM SalesGROUP BY Region;
Used for:
- Number of customers
- Number of transactions
- Number of products
AVG()
Calculates averages.
SELECT ProductCategory, AVG(SalesAmount) AS AverageSaleFROM SalesGROUP BY ProductCategory;
Used for:
- Average order value
- Average response time
- Average inventory level
MIN()
Returns the smallest value.
SELECT MIN(OrderDate)FROM Orders;
Used for:
- Earliest order
- Lowest temperature
- Minimum cost
MAX()
Returns the largest value.
SELECT MAX(OrderDate)FROM Orders;
Used for:
- Latest transaction
- Highest sales amount
- Maximum inventory quantity
Grouping and Aggregation Using SQL
SQL is the most common approach for aggregation in Fabric.
Example:
SELECT ProductCategory, SUM(SalesAmount) AS TotalSales, COUNT(*) AS OrderCount, AVG(SalesAmount) AS AverageSaleFROM SalesGROUP BY ProductCategory;
This query:
- Groups records by category
- Calculates total sales
- Counts orders
- Calculates average sales
Multi-Column Grouping
You can group by multiple columns.
Example:
SELECT Year, Region, SUM(SalesAmount) AS TotalSalesFROM SalesGROUP BY Year, Region;
Result:
| Year | Region | TotalSales |
|---|---|---|
| 2025 | East | 500000 |
| 2025 | West | 700000 |
| 2026 | East | 550000 |
This provides more granular analysis.
Filtering Aggregated Results with HAVING
WHERE filters rows before aggregation.
HAVING filters after aggregation.
Example:
SELECT Region, SUM(SalesAmount) AS TotalSalesFROM SalesGROUP BY RegionHAVING SUM(SalesAmount) > 1000000;
Only regions exceeding $1 million in sales are returned.
Aggregation Using PySpark
PySpark is commonly used for large-scale aggregation operations in Lakehouses.
Example:
from pyspark.sql.functions import sumsales_df.groupBy("Region") \ .agg(sum("SalesAmount").alias("TotalSales")) \ .show()
Result:
| Region | TotalSales |
|---|---|
| East | 500000 |
| West | 750000 |
Multiple Aggregations in PySpark
from pyspark.sql.functions import sum, avg, countsales_df.groupBy("Region").agg( sum("SalesAmount").alias("TotalSales"), avg("SalesAmount").alias("AvgSales"), count("*").alias("OrderCount"))
This performs several calculations in a single operation.
Aggregation Using KQL
KQL is commonly used in Real-Time Intelligence workloads.
Example:
Sales| summarize TotalSales=sum(SalesAmount) by Region
Result:
| Region | TotalSales |
|---|---|
| East | 500000 |
| West | 750000 |
Multiple Aggregations in KQL
Sales| summarize TotalSales=sum(SalesAmount), AvgSales=avg(SalesAmount), OrderCount=count() by Region
This pattern is common in real-time analytics.
Aggregation in Dataflows Gen2
Dataflows Gen2 provides a low-code interface.
Using the Group By transformation, users can:
- Sum values
- Count rows
- Calculate averages
- Find minimum values
- Find maximum values
Typical steps:
- Connect to source
- Select Group By
- Choose grouping columns
- Define aggregation functions
- Load results
This approach is useful for citizen developers and low-code ETL scenarios.
Aggregation in Dimensional Models
Aggregations are commonly used before loading data into fact and dimension tables.
Example:
Raw transactions:
| OrderID | Customer | Amount |
|---|---|---|
| 1 | A | 100 |
| 2 | A | 200 |
| 3 | B | 300 |
Aggregated customer sales:
| Customer | TotalSales |
|---|---|
| A | 300 |
| B | 300 |
This summary table can support reporting and dashboards.
Fact Table Aggregations
Fact tables often store:
- Transaction-level facts
- Daily summaries
- Monthly summaries
Examples:
Transaction Fact
| OrderID | Amount |
|---|---|
| 1001 | 50 |
Daily Aggregate Fact
| Date | TotalSales |
|---|---|
| 2026-01-01 | 50000 |
Aggregated fact tables improve query performance.
Window Aggregations vs Group Aggregations
Data engineers should understand the difference.
Group Aggregation
Returns one row per group.
SELECT Region, SUM(SalesAmount)FROM SalesGROUP BY Region;
Window Aggregation
Preserves detail rows.
SELECT OrderID, Region, SalesAmount, SUM(SalesAmount) OVER(PARTITION BY Region) AS RegionTotalFROM Sales;
Useful for:
- Running totals
- Rankings
- Percentages
- Advanced analytics
Performance Considerations
Grouping and aggregation can be expensive.
Best practices include:
Filter Early
Reduce data before aggregation.
WHERE OrderDate >= '2026-01-01'
Aggregate Close to the Source
Avoid moving unnecessary detailed records.
Use Partitioning
Partitioning helps Spark process data efficiently.
Examples:
- Date
- Region
- Customer segment
Use Delta Tables
Delta tables improve performance through:
- Data skipping
- File optimization
- Efficient query execution
Avoid Excessive Cardinality
Grouping on highly unique columns can reduce efficiency.
Bad example:
GROUP BY TransactionID
Good example:
GROUP BY Region
DP-700 Exam Tips
Remember the following:
- GROUP BY creates logical groups before aggregation.
- HAVING filters aggregated results.
- SQL uses GROUP BY.
- PySpark uses groupBy() and agg().
- KQL uses summarize.
- Dataflows Gen2 provides Group By transformations.
- Aggregated fact tables improve reporting performance.
- Window functions preserve detailed rows while performing calculations.
- Aggregations are frequently used when preparing dimensional models.
- Filtering before aggregation improves performance.
Practice Exam Questions
Question 1
A data engineer needs to calculate total sales by region in a Fabric Warehouse.
Which SQL function should be used?
A. AVG()
B. COUNT()
C. SUM()
D. MAX()
Correct Answer: C
Explanation: SUM() calculates the total of numeric values. AVG() calculates averages, COUNT() counts rows, and MAX() returns the largest value.
Question 2
A Fabric notebook must calculate the number of orders per customer.
Which aggregation function should be used?
A. COUNT()
B. AVG()
C. MIN()
D. MAX()
Correct Answer: A
Explanation: COUNT() returns the number of rows in each group, making it ideal for counting orders.
Question 3
You need to remove regions with total sales less than $500,000 after aggregation.
Which SQL clause should you use?
A. ORDER BY
B. WHERE
C. DISTINCT
D. HAVING
Correct Answer: D
Explanation: HAVING filters aggregated results after the GROUP BY operation is completed.
Question 4
Which KQL operator is primarily used for aggregation?
A. project
B. summarize
C. extend
D. join
Correct Answer: B
Explanation: The summarize operator performs grouping and aggregation in KQL.
Question 5
A Fabric Dataflow Gen2 developer wants to calculate average sales by product category.
Which transformation should be used?
A. Merge
B. Append
C. Group By
D. Split Column
Correct Answer: C
Explanation: The Group By transformation supports aggregation operations such as averages, sums, counts, minimums, and maximums.
Question 6
What is the primary purpose of a GROUP BY clause?
A. Sort rows
B. Remove duplicates
C. Filter rows
D. Create logical groups for aggregation
Correct Answer: D
Explanation: GROUP BY organizes rows into groups before aggregate calculations are performed.
Question 7
Which PySpark operation performs grouping before aggregation?
A. select()
B. filter()
C. groupBy()
D. orderBy()
Correct Answer: C
Explanation: groupBy() defines the grouping columns that will be used by aggregation functions.
Question 8
Which scenario is most appropriate for a window aggregation?
A. Total sales by region only
B. Average salary by department only
C. Customer counts by state only
D. Display each transaction along with the total sales for its region
Correct Answer: D
Explanation: Window functions preserve detail rows while calculating aggregates across a defined partition.
Question 9
A data engineer groups a dataset by TransactionID, where every TransactionID is unique.
What is the likely result?
A. Improved aggregation performance
B. Reduced cardinality
C. Limited performance benefits because each group contains one row
D. Automatic partition optimization
Correct Answer: C
Explanation: Grouping by a highly unique column creates many groups and often provides little analytical value.
Question 10
When preparing data for a dimensional model, why are aggregated tables often created?
A. To increase data duplication
B. To improve reporting and query performance
C. To eliminate dimension tables
D. To replace fact tables entirely
Correct Answer: B
Explanation: Pre-aggregated tables reduce the amount of data that must be processed during reporting, improving performance and user experience.
Go to the DP-700 Exam Prep Hub main page.
