This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
--> Ingest and transform batch data
--> Transform data by using PySpark, SQL, and KQL
Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
One of the most important skills for the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric certification exam is knowing how to transform data using the appropriate technology. Microsoft Fabric provides multiple transformation engines, each optimized for specific workloads:
- PySpark for large-scale distributed data engineering and advanced transformations
- SQL for relational data manipulation, warehousing, and analytics
- KQL (Kusto Query Language) for high-volume log, telemetry, event, and time-series data analysis
A successful Fabric Data Engineer must understand not only how each technology works, but also when to choose one over another.
Understanding the Transformation Options in Microsoft Fabric
Microsoft Fabric supports several data processing experiences:
| Technology | Primary Use Case | Common Fabric Components |
|---|---|---|
| PySpark | Big data processing and engineering | Lakehouse, Notebooks |
| SQL | Relational transformations and analytics | Warehouse, SQL Endpoint |
| KQL | Streaming, telemetry, logs, event analytics | Eventhouse, Real-Time Intelligence |
While all three can transform data, they are designed for different scenarios.
Transforming Data with PySpark
What is PySpark?
PySpark is the Python API for Apache Spark.
Spark is a distributed processing engine that allows data engineers to process extremely large datasets across multiple nodes simultaneously.
Within Microsoft Fabric, PySpark is typically used in:
- Notebooks
- Lakehouses
- Spark Job Definitions
When to Use PySpark
PySpark is ideal when:
- Working with large-scale datasets
- Performing complex transformations
- Processing semi-structured data
- Building data engineering pipelines
- Performing machine learning preparation
- Handling Delta Lake tables
Examples include:
- Cleaning raw data
- Parsing JSON files
- Aggregating billions of records
- Creating dimensional model tables
- Performing data quality checks
Reading Data with PySpark
Example:
df = spark.read.format("delta").load("Tables/Sales")
Filtering Data
filtered_df = df.filter(df.Amount > 1000)
Creating New Columns
from pyspark.sql.functions import colnew_df = df.withColumn( "TaxAmount", col("Amount") * 0.07)
Aggregating Data
from pyspark.sql.functions import sumsummary_df = ( df.groupBy("Region") .agg(sum("Amount").alias("TotalSales")))
Writing Results
summary_df.write.mode("overwrite").saveAsTable("SalesSummary")
PySpark Advantages
Scalability
Handles terabytes and petabytes of data.
Distributed Processing
Automatically parallelizes workloads.
Flexibility
Supports:
- Structured data
- Semi-structured data
- Unstructured data
Data Engineering Focus
Excellent for ETL and ELT processes.
PySpark Limitations
- More complex than SQL
- Requires programming skills
- Less familiar to business analysts
- Higher resource consumption for small workloads
Transforming Data with SQL
What is SQL in Fabric?
SQL remains one of the most commonly used languages in Fabric.
You can use SQL within:
- Fabric Data Warehouse
- Lakehouse SQL Endpoint
- SQL Query Editor
- Stored Procedures
- Data Pipelines
When to Use SQL
SQL is ideal for:
- Relational transformations
- Data warehouse development
- Reporting datasets
- Aggregations
- Joins
- Dimensional modeling
Examples:
- Creating fact tables
- Loading dimensions
- Building reporting views
- Data validation
Filtering Records
SELECT *FROM SalesWHERE Amount > 1000;
Aggregations
SELECT Region, SUM(Amount) AS TotalSalesFROM SalesGROUP BY Region;
Joining Tables
SELECT s.SaleID, c.CustomerNameFROM Sales sINNER JOIN Customer c ON s.CustomerID = c.CustomerID;
Creating Transformation Tables
CREATE TABLE SalesSummary ASSELECT Region, SUM(Amount) AS TotalSalesFROM SalesGROUP BY Region;
SQL Advantages
Familiarity
Most data professionals know SQL.
Readability
Easy to understand and maintain.
Relational Optimization
Optimized for joins and aggregations.
Warehousing Support
Ideal for star schemas and dimensional models.
SQL Limitations
- Less effective for complex data engineering workflows
- Not ideal for large-scale semi-structured data processing
- Limited flexibility compared to PySpark
Transforming Data with KQL
What is KQL?
Kusto Query Language (KQL) is a read-optimized query language designed for:
- Telemetry
- Log analytics
- Event processing
- Streaming data
- Time-series analysis
KQL is commonly used in:
- Eventhouse
- Real-Time Intelligence
- KQL Databases
When to Use KQL
Use KQL when working with:
- Sensor data
- IoT events
- Application logs
- Security monitoring
- Streaming datasets
- Time-series analytics
Examples:
- Monitoring manufacturing equipment
- Detecting anomalies
- Security event analysis
- Operational dashboards
Filtering Data
Events| where Temperature > 100
Summarization
Events| summarize AvgTemp = avg(Temperature) by DeviceID
Time-Series Analysis
Events| summarize Count=count() by bin(Timestamp, 1h)
Detecting Trends
Events| make-series AvgTemp=avg(Temperature) on Timestamp step 1h
KQL Advantages
High Performance
Optimized for large event datasets.
Time-Series Analytics
Excellent for temporal analysis.
Streaming Support
Designed for real-time workloads.
Fast Query Execution
Ideal for operational dashboards.
KQL Limitations
- Not intended for traditional data warehousing
- Less suitable for dimensional modeling
- Not commonly used for batch ETL
Comparing PySpark, SQL, and KQL
| Requirement | Best Choice |
|---|---|
| Large-scale ETL | PySpark |
| Data warehouse transformations | SQL |
| Star schema creation | SQL |
| Streaming analytics | KQL |
| Time-series analysis | KQL |
| Semi-structured JSON processing | PySpark |
| Machine learning preparation | PySpark |
| Business reporting datasets | SQL |
| Eventhouse analytics | KQL |
| Massive Delta Lake processing | PySpark |
Choosing the Right Transformation Tool
Choose PySpark When
- Processing very large datasets
- Working with Data Lake data
- Building engineering pipelines
- Handling JSON or Parquet files
- Performing advanced transformations
Choose SQL When
- Building warehouses
- Creating dimensional models
- Developing reporting datasets
- Performing relational transformations
- Creating views and stored procedures
Choose KQL When
- Working with event streams
- Analyzing telemetry
- Investigating logs
- Performing time-series analysis
- Monitoring operational systems
Exam Tips
Know the Primary Use Cases
A common DP-700 exam question asks which technology is most appropriate for a scenario.
Remember:
- PySpark = Big Data Engineering
- SQL = Relational Analytics and Warehousing
- KQL = Real-Time and Time-Series Analytics
Understand Fabric Components
Know where each technology is primarily used:
| Technology | Fabric Experience |
|---|---|
| PySpark | Lakehouse, Notebook |
| SQL | Warehouse, SQL Endpoint |
| KQL | Eventhouse |
Focus on Scenario-Based Questions
The exam frequently describes a business requirement and asks which technology should be used.
For example:
- IoT sensors → KQL
- Warehouse dimension tables → SQL
- Processing billions of JSON records → PySpark
Practice Exam Questions
Question 1
A data engineer must transform 20 TB of semi-structured JSON data stored in OneLake. Which technology is the best choice?
A. SQL
B. PySpark
C. KQL
D. Power Query
Answer: B
Explanation: PySpark is designed for distributed processing of massive datasets and handles semi-structured formats such as JSON efficiently.
Question 2
A Fabric solution requires creation of a star schema consisting of fact and dimension tables. Which technology is most appropriate?
A. SQL
B. KQL
C. Power BI DAX
D. Data Activator
Answer: A
Explanation: SQL is optimized for relational transformations and dimensional modeling commonly used in data warehouses.
Question 3
A company wants to analyze millions of IoT events arriving continuously from factory equipment. Which technology should be used?
A. KQL
B. Power Query
C. SQL
D. Excel
Answer: A
Explanation: KQL is designed specifically for high-volume event, telemetry, and time-series analysis workloads.
Question 4
Which Fabric component is most closely associated with KQL transformations?
A. Warehouse
B. Notebook
C. SQL Endpoint
D. Eventhouse
Answer: D
Explanation: Eventhouse is the primary Fabric experience for KQL-based analytics and real-time intelligence workloads.
Question 5
A data engineer needs to process Delta Lake tables using distributed compute. Which technology should be selected?
A. KQL
B. SQL
C. PySpark
D. Power BI
Answer: C
Explanation: PySpark integrates directly with Delta Lake and supports scalable distributed processing.
Question 6
Which language is specifically optimized for time-series analysis?
A. SQL
B. KQL
C. Python
D. DAX
Answer: B
Explanation: KQL includes built-in capabilities for temporal aggregation, anomaly detection, and time-series analytics.
Question 7
A Fabric Warehouse team needs to build a reusable transformation layer consisting of joins, aggregations, and views. Which technology should they use?
A. SQL
B. KQL
C. Dataflows Gen2
D. Spark ML
Answer: A
Explanation: SQL is the preferred language for relational transformations and warehouse development.
Question 8
Which technology is generally the best choice for preparing large datasets for machine learning?
A. KQL
B. SQL
C. DAX
D. PySpark
Answer: D
Explanation: PySpark provides scalable data preparation capabilities and integrates well with machine learning workflows.
Question 9
An engineer needs to summarize application log events by hour and identify usage trends. Which technology is most appropriate?
A. PySpark
B. Power Query
C. KQL
D. SQL
Answer: C
Explanation: KQL excels at log analytics, event monitoring, and time-based aggregations.
Question 10
A team needs a transformation language that is familiar to most database developers and optimized for relational joins. Which should they choose?
A. PySpark
B. KQL
C. Power Query
D. SQL
Answer: D
Explanation: SQL remains the standard language for relational querying, joins, aggregations, and warehouse transformations.
Go to the DP-700 Exam Prep Hub main page.
