Group and aggregate data (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
   --> Ingest and transform batch data
      --> Group and aggregate data


Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Grouping and aggregating data are among the most common and important data transformation operations performed by data engineers. Organizations rarely analyze raw transactional data directly. Instead, they summarize, categorize, and aggregate data to create meaningful business metrics such as total sales, average order value, monthly revenue, customer counts, inventory levels, and operational KPIs.

In Microsoft Fabric, grouping and aggregation can be performed using several technologies, including:

  • SQL in Fabric Data Warehouses and Lakehouses
  • PySpark notebooks
  • Dataflows Gen2
  • KQL (Kusto Query Language)
  • Data pipelines as part of larger ETL/ELT processes

For the DP-700 exam, you should understand:

  • Why grouping and aggregation are important
  • When to aggregate data
  • How to implement aggregations using SQL, PySpark, KQL, and Dataflows Gen2
  • Common aggregation functions
  • Performance considerations
  • Aggregations in dimensional modeling and analytics solutions

Why Group and Aggregate Data?

Raw data often contains millions or billions of records.

For example:

OrderIDCustomerIDOrderDateAmount
1001C0012026-01-01250
1002C0012026-01-02150
1003C0022026-01-02300

Business users usually want summarized information such as:

CustomerIDTotalSales
C001400
C002300

This transformation is accomplished through grouping and aggregation.

Benefits include:

  • Faster analytics
  • Reduced storage requirements
  • Easier reporting
  • Improved dashboard performance
  • Simplified business intelligence models

Understanding Grouping

Grouping combines records that share common values.

Examples:

Group by:

  • Customer
  • Product
  • Region
  • Department
  • Date
  • Month
  • Year

Example:

RegionSales
East100
East200
West300

Grouped by Region:

RegionTotalSales
East300
West300

The GROUP BY operation creates logical categories before aggregation calculations occur.


Common Aggregation Functions

Data engineers should be familiar with the most common aggregation functions.

SUM()

Calculates totals.

Example:

SELECT Region,
SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY Region;

Result:

RegionTotalSales
East500000
West750000

COUNT()

Counts rows.

SELECT Region,
COUNT(*) AS OrderCount
FROM Sales
GROUP BY Region;

Used for:

  • Number of customers
  • Number of transactions
  • Number of products

AVG()

Calculates averages.

SELECT ProductCategory,
AVG(SalesAmount) AS AverageSale
FROM Sales
GROUP BY ProductCategory;

Used for:

  • Average order value
  • Average response time
  • Average inventory level

MIN()

Returns the smallest value.

SELECT MIN(OrderDate)
FROM Orders;

Used for:

  • Earliest order
  • Lowest temperature
  • Minimum cost

MAX()

Returns the largest value.

SELECT MAX(OrderDate)
FROM Orders;

Used for:

  • Latest transaction
  • Highest sales amount
  • Maximum inventory quantity

Grouping and Aggregation Using SQL

SQL is the most common approach for aggregation in Fabric.

Example:

SELECT
ProductCategory,
SUM(SalesAmount) AS TotalSales,
COUNT(*) AS OrderCount,
AVG(SalesAmount) AS AverageSale
FROM Sales
GROUP BY ProductCategory;

This query:

  1. Groups records by category
  2. Calculates total sales
  3. Counts orders
  4. Calculates average sales

Multi-Column Grouping

You can group by multiple columns.

Example:

SELECT
Year,
Region,
SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY Year, Region;

Result:

YearRegionTotalSales
2025East500000
2025West700000
2026East550000

This provides more granular analysis.


Filtering Aggregated Results with HAVING

WHERE filters rows before aggregation.

HAVING filters after aggregation.

Example:

SELECT Region,
SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY Region
HAVING SUM(SalesAmount) > 1000000;

Only regions exceeding $1 million in sales are returned.


Aggregation Using PySpark

PySpark is commonly used for large-scale aggregation operations in Lakehouses.

Example:

from pyspark.sql.functions import sum
sales_df.groupBy("Region") \
.agg(sum("SalesAmount").alias("TotalSales")) \
.show()

Result:

RegionTotalSales
East500000
West750000

Multiple Aggregations in PySpark

from pyspark.sql.functions import sum, avg, count
sales_df.groupBy("Region").agg(
sum("SalesAmount").alias("TotalSales"),
avg("SalesAmount").alias("AvgSales"),
count("*").alias("OrderCount")
)

This performs several calculations in a single operation.


Aggregation Using KQL

KQL is commonly used in Real-Time Intelligence workloads.

Example:

Sales
| summarize TotalSales=sum(SalesAmount)
by Region

Result:

RegionTotalSales
East500000
West750000

Multiple Aggregations in KQL

Sales
| summarize
TotalSales=sum(SalesAmount),
AvgSales=avg(SalesAmount),
OrderCount=count()
by Region

This pattern is common in real-time analytics.


Aggregation in Dataflows Gen2

Dataflows Gen2 provides a low-code interface.

Using the Group By transformation, users can:

  • Sum values
  • Count rows
  • Calculate averages
  • Find minimum values
  • Find maximum values

Typical steps:

  1. Connect to source
  2. Select Group By
  3. Choose grouping columns
  4. Define aggregation functions
  5. Load results

This approach is useful for citizen developers and low-code ETL scenarios.


Aggregation in Dimensional Models

Aggregations are commonly used before loading data into fact and dimension tables.

Example:

Raw transactions:

OrderIDCustomerAmount
1A100
2A200
3B300

Aggregated customer sales:

CustomerTotalSales
A300
B300

This summary table can support reporting and dashboards.


Fact Table Aggregations

Fact tables often store:

  • Transaction-level facts
  • Daily summaries
  • Monthly summaries

Examples:

Transaction Fact

OrderIDAmount
100150

Daily Aggregate Fact

DateTotalSales
2026-01-0150000

Aggregated fact tables improve query performance.


Window Aggregations vs Group Aggregations

Data engineers should understand the difference.

Group Aggregation

Returns one row per group.

SELECT Region,
SUM(SalesAmount)
FROM Sales
GROUP BY Region;

Window Aggregation

Preserves detail rows.

SELECT
OrderID,
Region,
SalesAmount,
SUM(SalesAmount)
OVER(PARTITION BY Region)
AS RegionTotal
FROM Sales;

Useful for:

  • Running totals
  • Rankings
  • Percentages
  • Advanced analytics

Performance Considerations

Grouping and aggregation can be expensive.

Best practices include:

Filter Early

Reduce data before aggregation.

WHERE OrderDate >= '2026-01-01'

Aggregate Close to the Source

Avoid moving unnecessary detailed records.


Use Partitioning

Partitioning helps Spark process data efficiently.

Examples:

  • Date
  • Region
  • Customer segment

Use Delta Tables

Delta tables improve performance through:

  • Data skipping
  • File optimization
  • Efficient query execution

Avoid Excessive Cardinality

Grouping on highly unique columns can reduce efficiency.

Bad example:

GROUP BY TransactionID

Good example:

GROUP BY Region

DP-700 Exam Tips

Remember the following:

  • GROUP BY creates logical groups before aggregation.
  • HAVING filters aggregated results.
  • SQL uses GROUP BY.
  • PySpark uses groupBy() and agg().
  • KQL uses summarize.
  • Dataflows Gen2 provides Group By transformations.
  • Aggregated fact tables improve reporting performance.
  • Window functions preserve detailed rows while performing calculations.
  • Aggregations are frequently used when preparing dimensional models.
  • Filtering before aggregation improves performance.

Practice Exam Questions

Question 1

A data engineer needs to calculate total sales by region in a Fabric Warehouse.

Which SQL function should be used?

A. AVG()

B. COUNT()

C. SUM()

D. MAX()

Correct Answer: C

Explanation: SUM() calculates the total of numeric values. AVG() calculates averages, COUNT() counts rows, and MAX() returns the largest value.


Question 2

A Fabric notebook must calculate the number of orders per customer.

Which aggregation function should be used?

A. COUNT()

B. AVG()

C. MIN()

D. MAX()

Correct Answer: A

Explanation: COUNT() returns the number of rows in each group, making it ideal for counting orders.


Question 3

You need to remove regions with total sales less than $500,000 after aggregation.

Which SQL clause should you use?

A. ORDER BY

B. WHERE

C. DISTINCT

D. HAVING

Correct Answer: D

Explanation: HAVING filters aggregated results after the GROUP BY operation is completed.


Question 4

Which KQL operator is primarily used for aggregation?

A. project

B. summarize

C. extend

D. join

Correct Answer: B

Explanation: The summarize operator performs grouping and aggregation in KQL.


Question 5

A Fabric Dataflow Gen2 developer wants to calculate average sales by product category.

Which transformation should be used?

A. Merge

B. Append

C. Group By

D. Split Column

Correct Answer: C

Explanation: The Group By transformation supports aggregation operations such as averages, sums, counts, minimums, and maximums.


Question 6

What is the primary purpose of a GROUP BY clause?

A. Sort rows

B. Remove duplicates

C. Filter rows

D. Create logical groups for aggregation

Correct Answer: D

Explanation: GROUP BY organizes rows into groups before aggregate calculations are performed.


Question 7

Which PySpark operation performs grouping before aggregation?

A. select()

B. filter()

C. groupBy()

D. orderBy()

Correct Answer: C

Explanation: groupBy() defines the grouping columns that will be used by aggregation functions.


Question 8

Which scenario is most appropriate for a window aggregation?

A. Total sales by region only

B. Average salary by department only

C. Customer counts by state only

D. Display each transaction along with the total sales for its region

Correct Answer: D

Explanation: Window functions preserve detail rows while calculating aggregates across a defined partition.


Question 9

A data engineer groups a dataset by TransactionID, where every TransactionID is unique.

What is the likely result?

A. Improved aggregation performance

B. Reduced cardinality

C. Limited performance benefits because each group contains one row

D. Automatic partition optimization

Correct Answer: C

Explanation: Grouping by a highly unique column creates many groups and often provides little analytical value.


Question 10

When preparing data for a dimensional model, why are aggregated tables often created?

A. To increase data duplication

B. To improve reporting and query performance

C. To eliminate dimension tables

D. To replace fact tables entirely

Correct Answer: B

Explanation: Pre-aggregated tables reduce the amount of data that must be processed during reporting, improving performance and user experience.


Go to the DP-700 Exam Prep Hub main page.

Leave a comment