This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
   --> Ingest and transform batch data
      --> Group and aggregate data

Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Grouping and aggregating data are among the most common and important data transformation operations performed by data engineers. Organizations rarely analyze raw transactional data directly. Instead, they summarize, categorize, and aggregate data to create meaningful business metrics such as total sales, average order value, monthly revenue, customer counts, inventory levels, and operational KPIs.

In Microsoft Fabric, grouping and aggregation can be performed using several technologies, including:

SQL in Fabric Data Warehouses and Lakehouses
PySpark notebooks
Dataflows Gen2
KQL (Kusto Query Language)
Data pipelines as part of larger ETL/ELT processes

For the DP-700 exam, you should understand:

Why grouping and aggregation are important
When to aggregate data
How to implement aggregations using SQL, PySpark, KQL, and Dataflows Gen2
Common aggregation functions
Performance considerations
Aggregations in dimensional modeling and analytics solutions

Why Group and Aggregate Data?

Raw data often contains millions or billions of records.

For example:

OrderID	CustomerID	OrderDate	Amount
1001	C001	2026-01-01	250
1002	C001	2026-01-02	150
1003	C002	2026-01-02	300

Business users usually want summarized information such as:

CustomerID	TotalSales
C001	400
C002	300

This transformation is accomplished through grouping and aggregation.

Benefits include:

Faster analytics
Reduced storage requirements
Easier reporting
Improved dashboard performance
Simplified business intelligence models

Understanding Grouping

Grouping combines records that share common values.

Examples:

Group by:

Customer
Product
Region
Department
Date
Month
Year

Example:

Region	Sales
East	100
East	200
West	300

Grouped by Region:

Region	TotalSales
East	300
West	300

The GROUP BY operation creates logical categories before aggregation calculations occur.

Common Aggregation Functions

Data engineers should be familiar with the most common aggregation functions.

SUM()

Calculates totals.

Example:

			
SELECT Region,
       SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY Region;

Result:

Region	TotalSales
East	500000
West	750000

COUNT()

Counts rows.

			
SELECT Region,
       COUNT(*) AS OrderCount
FROM Sales
GROUP BY Region;

Used for:

Number of customers
Number of transactions
Number of products

AVG()

Calculates averages.

			
SELECT ProductCategory,
       AVG(SalesAmount) AS AverageSale
FROM Sales
GROUP BY ProductCategory;

Used for:

Average order value
Average response time
Average inventory level

MIN()

Returns the smallest value.

			
SELECT MIN(OrderDate)
FROM Orders;

Used for:

Earliest order
Lowest temperature
Minimum cost

MAX()

Returns the largest value.

			
SELECT MAX(OrderDate)
FROM Orders;

Used for:

Latest transaction
Highest sales amount
Maximum inventory quantity

Grouping and Aggregation Using SQL

SQL is the most common approach for aggregation in Fabric.

Example:

			
SELECT
    ProductCategory,
    SUM(SalesAmount) AS TotalSales,
    COUNT(*) AS OrderCount,
    AVG(SalesAmount) AS AverageSale
FROM Sales
GROUP BY ProductCategory;

		

This query:

Groups records by category
Calculates total sales
Counts orders
Calculates average sales

Multi-Column Grouping

You can group by multiple columns.

Example:

			
SELECT
    Year,
    Region,
    SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY Year, Region;

		

Result:

Year	Region	TotalSales
2025	East	500000
2025	West	700000
2026	East	550000

This provides more granular analysis.

Filtering Aggregated Results with HAVING

WHERE filters rows before aggregation.

HAVING filters after aggregation.

Example:

			
SELECT Region,
       SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY Region
HAVING SUM(SalesAmount) > 1000000;

		

Only regions exceeding $1 million in sales are returned.

Aggregation Using PySpark

PySpark is commonly used for large-scale aggregation operations in Lakehouses.

Example:

			
from pyspark.sql.functions import sum
sales_df.groupBy("Region") \
        .agg(sum("SalesAmount").alias("TotalSales")) \
        .show()

Result:

Region	TotalSales
East	500000
West	750000

Multiple Aggregations in PySpark

			
from pyspark.sql.functions import sum, avg, count
sales_df.groupBy("Region").agg(
    sum("SalesAmount").alias("TotalSales"),
    avg("SalesAmount").alias("AvgSales"),
    count("*").alias("OrderCount")
)

		

This performs several calculations in a single operation.

Aggregation Using KQL

KQL is commonly used in Real-Time Intelligence workloads.

Example:

			
Sales
| summarize TotalSales=sum(SalesAmount)
    by Region

Result:

Region	TotalSales
East	500000
West	750000

Multiple Aggregations in KQL

			
Sales
| summarize
    TotalSales=sum(SalesAmount),
    AvgSales=avg(SalesAmount),
    OrderCount=count()
    by Region

		

This pattern is common in real-time analytics.

Aggregation in Dataflows Gen2

Dataflows Gen2 provides a low-code interface.

Using the Group By transformation, users can:

Sum values
Count rows
Calculate averages
Find minimum values
Find maximum values

Typical steps:

Connect to source
Select Group By
Choose grouping columns
Define aggregation functions
Load results

This approach is useful for citizen developers and low-code ETL scenarios.

Aggregation in Dimensional Models

Aggregations are commonly used before loading data into fact and dimension tables.

Example:

Raw transactions:

OrderID	Customer	Amount
1	A	100
2	A	200
3	B	300

Aggregated customer sales:

Customer	TotalSales
A	300
B	300

This summary table can support reporting and dashboards.

Fact Table Aggregations

Fact tables often store:

Transaction-level facts
Daily summaries
Monthly summaries

Examples:

Transaction Fact

OrderID	Amount
1001	50

Daily Aggregate Fact

Date	TotalSales
2026-01-01	50000

Aggregated fact tables improve query performance.

Window Aggregations vs Group Aggregations

Data engineers should understand the difference.

Group Aggregation

Returns one row per group.

			
SELECT Region,
       SUM(SalesAmount)
FROM Sales
GROUP BY Region;

Window Aggregation

Preserves detail rows.

			
SELECT
    OrderID,
    Region,
    SalesAmount,
    SUM(SalesAmount)
        OVER(PARTITION BY Region)
        AS RegionTotal
FROM Sales;

		

Useful for:

Running totals
Rankings
Percentages
Advanced analytics

Performance Considerations

Grouping and aggregation can be expensive.

Best practices include:

Filter Early

Reduce data before aggregation.

WHERE OrderDate >= '2026-01-01'

Aggregate Close to the Source

Avoid moving unnecessary detailed records.

Use Partitioning

Partitioning helps Spark process data efficiently.

Examples:

Date
Region
Customer segment

Use Delta Tables

Delta tables improve performance through:

Data skipping
File optimization
Efficient query execution

Avoid Excessive Cardinality

Grouping on highly unique columns can reduce efficiency.

Bad example:

GROUP BY TransactionID

Good example:

GROUP BY Region

DP-700 Exam Tips

Remember the following:

GROUP BY creates logical groups before aggregation.
HAVING filters aggregated results.
SQL uses GROUP BY.
PySpark uses groupBy() and agg().
KQL uses summarize.
Dataflows Gen2 provides Group By transformations.
Aggregated fact tables improve reporting performance.
Window functions preserve detailed rows while performing calculations.
Aggregations are frequently used when preparing dimensional models.
Filtering before aggregation improves performance.

Practice Exam Questions

Question 1

A data engineer needs to calculate total sales by region in a Fabric Warehouse.

Which SQL function should be used?

A. AVG()

B. COUNT()

C. SUM()

D. MAX()

Correct Answer: C

Explanation: SUM() calculates the total of numeric values. AVG() calculates averages, COUNT() counts rows, and MAX() returns the largest value.

Question 2

A Fabric notebook must calculate the number of orders per customer.

Which aggregation function should be used?

A. COUNT()

B. AVG()

C. MIN()

D. MAX()

Correct Answer: A

Explanation: COUNT() returns the number of rows in each group, making it ideal for counting orders.

Question 3

You need to remove regions with total sales less than $500,000 after aggregation.

Which SQL clause should you use?

A. ORDER BY

B. WHERE

C. DISTINCT

D. HAVING

Correct Answer: D

Explanation: HAVING filters aggregated results after the GROUP BY operation is completed.

Question 4

Which KQL operator is primarily used for aggregation?

A. project

B. summarize

C. extend

D. join

Correct Answer: B

Explanation: The summarize operator performs grouping and aggregation in KQL.

Question 5

A Fabric Dataflow Gen2 developer wants to calculate average sales by product category.

Which transformation should be used?

A. Merge

B. Append

C. Group By

D. Split Column

Correct Answer: C

Explanation: The Group By transformation supports aggregation operations such as averages, sums, counts, minimums, and maximums.

Question 6

What is the primary purpose of a GROUP BY clause?

A. Sort rows

B. Remove duplicates

C. Filter rows

D. Create logical groups for aggregation

Correct Answer: D

Explanation: GROUP BY organizes rows into groups before aggregate calculations are performed.

Question 7

Which PySpark operation performs grouping before aggregation?

A. select()

B. filter()

C. groupBy()

D. orderBy()

Correct Answer: C

Explanation: groupBy() defines the grouping columns that will be used by aggregation functions.

Question 8

Which scenario is most appropriate for a window aggregation?

A. Total sales by region only

B. Average salary by department only

C. Customer counts by state only

D. Display each transaction along with the total sales for its region

Correct Answer: D

Explanation: Window functions preserve detail rows while calculating aggregates across a defined partition.

Question 9

A data engineer groups a dataset by TransactionID, where every TransactionID is unique.

What is the likely result?

A. Improved aggregation performance

B. Reduced cardinality

C. Limited performance benefits because each group contains one row

D. Automatic partition optimization

Correct Answer: C

Explanation: Grouping by a highly unique column creates many groups and often provides little analytical value.

Question 10

When preparing data for a dimensional model, why are aggregated tables often created?

A. To increase data duplication

B. To improve reporting and query performance

C. To eliminate dimension tables

D. To replace fact tables entirely

Correct Answer: B

Explanation: Pre-aggregated tables reduce the amount of data that must be processed during reporting, improving performance and user experience.

Go to the DP-700 Exam Prep Hub main page.

Introduction

Why Group and Aggregate Data?

Understanding Grouping

Common Aggregation Functions

SUM()

COUNT()

AVG()

MIN()

MAX()

Grouping and Aggregation Using SQL

Multi-Column Grouping

Filtering Aggregated Results with HAVING

Aggregation Using PySpark

Multiple Aggregations in PySpark

Aggregation Using KQL

Multiple Aggregations in KQL

Aggregation in Dataflows Gen2

Aggregation in Dimensional Models

Fact Table Aggregations

Transaction Fact

Daily Aggregate Fact

Window Aggregations vs Group Aggregations

Group Aggregation

Window Aggregation

Performance Considerations

Filter Early

Aggregate Close to the Source

Use Partitioning

Use Delta Tables

Avoid Excessive Cardinality

DP-700 Exam Tips

Practice Exam Questions

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8

Question 9

Question 10

Share this:

Related

Leave a comment Cancel reply

Information and resources for the data professionals' community