Transform data by using PySpark, SQL, and KQL (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
   --> Ingest and transform batch data
      --> Transform data by using PySpark, SQL, and KQL


Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

One of the most important skills for the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric certification exam is knowing how to transform data using the appropriate technology. Microsoft Fabric provides multiple transformation engines, each optimized for specific workloads:

  • PySpark for large-scale distributed data engineering and advanced transformations
  • SQL for relational data manipulation, warehousing, and analytics
  • KQL (Kusto Query Language) for high-volume log, telemetry, event, and time-series data analysis

A successful Fabric Data Engineer must understand not only how each technology works, but also when to choose one over another.


Understanding the Transformation Options in Microsoft Fabric

Microsoft Fabric supports several data processing experiences:

TechnologyPrimary Use CaseCommon Fabric Components
PySparkBig data processing and engineeringLakehouse, Notebooks
SQLRelational transformations and analyticsWarehouse, SQL Endpoint
KQLStreaming, telemetry, logs, event analyticsEventhouse, Real-Time Intelligence

While all three can transform data, they are designed for different scenarios.


Transforming Data with PySpark

What is PySpark?

PySpark is the Python API for Apache Spark.

Spark is a distributed processing engine that allows data engineers to process extremely large datasets across multiple nodes simultaneously.

Within Microsoft Fabric, PySpark is typically used in:

  • Notebooks
  • Lakehouses
  • Spark Job Definitions

When to Use PySpark

PySpark is ideal when:

  • Working with large-scale datasets
  • Performing complex transformations
  • Processing semi-structured data
  • Building data engineering pipelines
  • Performing machine learning preparation
  • Handling Delta Lake tables

Examples include:

  • Cleaning raw data
  • Parsing JSON files
  • Aggregating billions of records
  • Creating dimensional model tables
  • Performing data quality checks

Reading Data with PySpark

Example:

df = spark.read.format("delta").load("Tables/Sales")

Filtering Data

filtered_df = df.filter(df.Amount > 1000)

Creating New Columns

from pyspark.sql.functions import col
new_df = df.withColumn(
"TaxAmount",
col("Amount") * 0.07
)

Aggregating Data

from pyspark.sql.functions import sum
summary_df = (
df.groupBy("Region")
.agg(sum("Amount").alias("TotalSales"))
)

Writing Results

summary_df.write.mode("overwrite").saveAsTable("SalesSummary")

PySpark Advantages

Scalability

Handles terabytes and petabytes of data.

Distributed Processing

Automatically parallelizes workloads.

Flexibility

Supports:

  • Structured data
  • Semi-structured data
  • Unstructured data

Data Engineering Focus

Excellent for ETL and ELT processes.


PySpark Limitations

  • More complex than SQL
  • Requires programming skills
  • Less familiar to business analysts
  • Higher resource consumption for small workloads

Transforming Data with SQL

What is SQL in Fabric?

SQL remains one of the most commonly used languages in Fabric.

You can use SQL within:

  • Fabric Data Warehouse
  • Lakehouse SQL Endpoint
  • SQL Query Editor
  • Stored Procedures
  • Data Pipelines

When to Use SQL

SQL is ideal for:

  • Relational transformations
  • Data warehouse development
  • Reporting datasets
  • Aggregations
  • Joins
  • Dimensional modeling

Examples:

  • Creating fact tables
  • Loading dimensions
  • Building reporting views
  • Data validation

Filtering Records

SELECT *
FROM Sales
WHERE Amount > 1000;

Aggregations

SELECT
Region,
SUM(Amount) AS TotalSales
FROM Sales
GROUP BY Region;

Joining Tables

SELECT
s.SaleID,
c.CustomerName
FROM Sales s
INNER JOIN Customer c
ON s.CustomerID = c.CustomerID;

Creating Transformation Tables

CREATE TABLE SalesSummary AS
SELECT
Region,
SUM(Amount) AS TotalSales
FROM Sales
GROUP BY Region;

SQL Advantages

Familiarity

Most data professionals know SQL.

Readability

Easy to understand and maintain.

Relational Optimization

Optimized for joins and aggregations.

Warehousing Support

Ideal for star schemas and dimensional models.


SQL Limitations

  • Less effective for complex data engineering workflows
  • Not ideal for large-scale semi-structured data processing
  • Limited flexibility compared to PySpark

Transforming Data with KQL

What is KQL?

Kusto Query Language (KQL) is a read-optimized query language designed for:

  • Telemetry
  • Log analytics
  • Event processing
  • Streaming data
  • Time-series analysis

KQL is commonly used in:

  • Eventhouse
  • Real-Time Intelligence
  • KQL Databases

When to Use KQL

Use KQL when working with:

  • Sensor data
  • IoT events
  • Application logs
  • Security monitoring
  • Streaming datasets
  • Time-series analytics

Examples:

  • Monitoring manufacturing equipment
  • Detecting anomalies
  • Security event analysis
  • Operational dashboards

Filtering Data

Events
| where Temperature > 100

Summarization

Events
| summarize AvgTemp = avg(Temperature)
by DeviceID

Time-Series Analysis

Events
| summarize Count=count()
by bin(Timestamp, 1h)

Detecting Trends

Events
| make-series AvgTemp=avg(Temperature)
on Timestamp
step 1h

KQL Advantages

High Performance

Optimized for large event datasets.

Time-Series Analytics

Excellent for temporal analysis.

Streaming Support

Designed for real-time workloads.

Fast Query Execution

Ideal for operational dashboards.


KQL Limitations

  • Not intended for traditional data warehousing
  • Less suitable for dimensional modeling
  • Not commonly used for batch ETL

Comparing PySpark, SQL, and KQL

RequirementBest Choice
Large-scale ETLPySpark
Data warehouse transformationsSQL
Star schema creationSQL
Streaming analyticsKQL
Time-series analysisKQL
Semi-structured JSON processingPySpark
Machine learning preparationPySpark
Business reporting datasetsSQL
Eventhouse analyticsKQL
Massive Delta Lake processingPySpark

Choosing the Right Transformation Tool

Choose PySpark When

  • Processing very large datasets
  • Working with Data Lake data
  • Building engineering pipelines
  • Handling JSON or Parquet files
  • Performing advanced transformations

Choose SQL When

  • Building warehouses
  • Creating dimensional models
  • Developing reporting datasets
  • Performing relational transformations
  • Creating views and stored procedures

Choose KQL When

  • Working with event streams
  • Analyzing telemetry
  • Investigating logs
  • Performing time-series analysis
  • Monitoring operational systems

Exam Tips

Know the Primary Use Cases

A common DP-700 exam question asks which technology is most appropriate for a scenario.

Remember:

  • PySpark = Big Data Engineering
  • SQL = Relational Analytics and Warehousing
  • KQL = Real-Time and Time-Series Analytics

Understand Fabric Components

Know where each technology is primarily used:

TechnologyFabric Experience
PySparkLakehouse, Notebook
SQLWarehouse, SQL Endpoint
KQLEventhouse

Focus on Scenario-Based Questions

The exam frequently describes a business requirement and asks which technology should be used.

For example:

  • IoT sensors → KQL
  • Warehouse dimension tables → SQL
  • Processing billions of JSON records → PySpark

Practice Exam Questions

Question 1

A data engineer must transform 20 TB of semi-structured JSON data stored in OneLake. Which technology is the best choice?

A. SQL

B. PySpark

C. KQL

D. Power Query

Answer: B

Explanation: PySpark is designed for distributed processing of massive datasets and handles semi-structured formats such as JSON efficiently.


Question 2

A Fabric solution requires creation of a star schema consisting of fact and dimension tables. Which technology is most appropriate?

A. SQL

B. KQL

C. Power BI DAX

D. Data Activator

Answer: A

Explanation: SQL is optimized for relational transformations and dimensional modeling commonly used in data warehouses.


Question 3

A company wants to analyze millions of IoT events arriving continuously from factory equipment. Which technology should be used?

A. KQL

B. Power Query

C. SQL

D. Excel

Answer: A

Explanation: KQL is designed specifically for high-volume event, telemetry, and time-series analysis workloads.


Question 4

Which Fabric component is most closely associated with KQL transformations?

A. Warehouse

B. Notebook

C. SQL Endpoint

D. Eventhouse

Answer: D

Explanation: Eventhouse is the primary Fabric experience for KQL-based analytics and real-time intelligence workloads.


Question 5

A data engineer needs to process Delta Lake tables using distributed compute. Which technology should be selected?

A. KQL

B. SQL

C. PySpark

D. Power BI

Answer: C

Explanation: PySpark integrates directly with Delta Lake and supports scalable distributed processing.


Question 6

Which language is specifically optimized for time-series analysis?

A. SQL

B. KQL

C. Python

D. DAX

Answer: B

Explanation: KQL includes built-in capabilities for temporal aggregation, anomaly detection, and time-series analytics.


Question 7

A Fabric Warehouse team needs to build a reusable transformation layer consisting of joins, aggregations, and views. Which technology should they use?

A. SQL

B. KQL

C. Dataflows Gen2

D. Spark ML

Answer: A

Explanation: SQL is the preferred language for relational transformations and warehouse development.


Question 8

Which technology is generally the best choice for preparing large datasets for machine learning?

A. KQL

B. SQL

C. DAX

D. PySpark

Answer: D

Explanation: PySpark provides scalable data preparation capabilities and integrates well with machine learning workflows.


Question 9

An engineer needs to summarize application log events by hour and identify usage trends. Which technology is most appropriate?

A. PySpark

B. Power Query

C. KQL

D. SQL

Answer: C

Explanation: KQL excels at log analytics, event monitoring, and time-based aggregations.


Question 10

A team needs a transformation language that is familiar to most database developers and optimized for relational joins. Which should they choose?

A. PySpark

B. KQL

C. Power Query

D. SQL

Answer: D

Explanation: SQL remains the standard language for relational querying, joins, aggregations, and warehouse transformations.


Go to the DP-700 Exam Prep Hub main page.

Leave a comment