This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
   --> Ingest and transform streaming data
      --> Process data by using Spark structured streaming

Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Modern analytics platforms increasingly require the ability to process data continuously as it arrives rather than waiting for scheduled batch loads. Microsoft Fabric supports this requirement through Spark Structured Streaming, a scalable and fault-tolerant stream processing engine built on Apache Spark.

For the DP-700 exam, you should understand when and how to use Spark Structured Streaming, how it differs from other real-time processing options such as Eventstreams and KQL Querysets, and how to design streaming solutions that write data into OneLake and Delta tables.

Spark Structured Streaming is commonly used when data engineers need to process streaming data with complex transformations, enrichments, joins, aggregations, and machine learning workloads while leveraging the scalability of Spark. (Microsoft Learn)

What Is Spark Structured Streaming?

Spark Structured Streaming is a stream-processing framework built on top of Apache Spark. It treats a continuous stream of incoming data as an unbounded table to which new rows are constantly appended. Developers write code using familiar DataFrame and Spark SQL operations while Spark handles the continuous execution behind the scenes. (Microsoft Learn)

Key characteristics include:

Near real-time processing
Fault tolerance
Automatic recovery
Horizontal scalability
Support for complex transformations
Integration with Delta Lake
Exactly-once processing capabilities through checkpointing and transaction logs (Microsoft Learn)

How Structured Streaming Works

The processing flow typically follows these steps:

Read data from a streaming source.
Apply transformations.
Write results to a destination.
Store checkpoints to track processing progress.
Continue processing new data as it arrives.

Common Sources

Spark Structured Streaming supports sources such as:

Azure Event Hubs
Apache Kafka
JSON files
CSV files
Parquet files
Delta tables
Eventstreams outputs (Microsoft Learn)

Common Destinations

Results can be written to:

Lakehouse Delta tables
OneLake storage
Eventhouses
Memory sinks for testing
Other supported storage locations (Microsoft Learn)

Structured Streaming in Microsoft Fabric

Within Microsoft Fabric, Structured Streaming is most commonly implemented through:

Notebooks
Spark Job Definitions
Lakehouses
Delta tables
OneLake storage

A typical architecture looks like this:

			
Azure Event Hub
       |
       v
Spark Structured Streaming
       |
       v
Delta Table in Lakehouse
       |
       v
SQL Analytics Endpoint
       |
       v
Power BI

		

This pattern enables streaming data to become queryable almost immediately after arrival. (Microsoft Learn)

Structured Streaming vs Batch Processing

Feature	Batch Processing	Structured Streaming
Data arrival	Periodic	Continuous
Processing latency	Minutes or hours	Seconds or minutes
Resource usage	Scheduled	Continuous
Typical use case	Historical reporting	Real-time analytics
Data availability	After load completes	Near real time

Use Batch Processing When:

Data changes infrequently
Overnight processing is acceptable
Real-time insights are unnecessary

Use Structured Streaming When:

IoT devices generate events continuously
Fraud detection requires immediate action
Operational dashboards need live updates
Telemetry data must be analyzed continuously

Micro-Batch Processing

A common DP-700 exam topic is understanding that Structured Streaming typically uses a micro-batch architecture.

Instead of processing every individual event separately, Spark groups events into small batches and processes them continuously.

Example:

			
Incoming Events
-----------------
10:00:00 - Event A
10:00:01 - Event B
10:00:02 - Event C
Micro-batch executes
Process A+B+C together

		

This approach provides:

Better performance
Higher throughput
Easier fault recovery
Familiar Spark execution model (Microsoft Learn)

Reading Streaming Data

Streaming ingestion begins with:

			
df = (
    spark.readStream
         .format("eventhubs")
         .load()
)

		

Important points:

readStream creates a streaming DataFrame.
Data remains continuously available.
Spark automatically detects new events.
The query remains active until stopped. (Microsoft Learn)

Writing Streaming Data

Streaming results are written using writeStream.

Example:

			
query = (
    df.writeStream
      .format("delta")
      .outputMode("append")
      .toTable("SalesEvents")
)

		

Common output modes include:

Mode	Description
Append	Only new rows written
Complete	Entire result rewritten
Update	Only changed rows written

For Fabric data engineering scenarios, Append mode is most common. (Microsoft Learn)

Delta Lake Integration

One of the most important DP-700 concepts is integrating Structured Streaming with Delta Lake.

Benefits include:

ACID transactions
Schema evolution
Time travel
Data versioning
Reliable streaming ingestion

Streaming data can be written directly into Delta tables:

			
.writeStream
.format("delta")
.toTable("Orders")

This creates a continuously updated Delta table within the Lakehouse. (Microsoft Learn)

Checkpointing

Checkpointing is critical for fault tolerance.

Example:

			
.option(
    "checkpointLocation",
    "Files/checkpoints/orders"
)

Checkpoints store:

Processed offsets
Query progress
State information

Benefits:

Prevents duplicate processing
Enables recovery after failures
Supports exactly-once processing semantics

A frequent exam scenario involves identifying missing checkpoint configurations as the root cause of duplicate or reprocessed data. (mindmeshacademy.com)

Triggers

Triggers control how often Spark processes incoming data.

Example:

			
.trigger(
    processingTime="1 minute"
)

Possible trigger strategies:

Trigger Type	Purpose
Continuous processing	Lowest latency
Processing time	Fixed intervals
Available Now	Process all available data and stop

Larger trigger intervals often improve throughput because more events are processed together. (Microsoft Learn)

Stateful vs Stateless Processing

Stateless Processing

Each event is processed independently.

Examples:

Filtering
Column selection
Simple transformations

stream.filter("temperature > 100")

Stateful Processing

Spark maintains information between batches.

Examples:

Running totals
Session windows
Stream aggregations
Deduplication

Stateful processing is more powerful but consumes additional memory and storage resources. (jumpstart.fabric.microsoft.com)

Stream Aggregations

Streaming aggregations allow continuous calculations.

Examples:

Sales totals
Device counts
Average temperatures
Transaction volumes

Example:

			
stream.groupBy("DeviceID") \
      .count()

This continuously updates counts as new events arrive.

Common Streaming Scenarios in Fabric

IoT Monitoring

Sensors continuously send readings.

Process:

			
IoT Devices
     |
Event Hub
     |
Spark Structured Streaming
     |
Lakehouse
     |
Power BI Dashboard

		

Application Telemetry

Applications send logs and metrics continuously.

Use cases:

Performance monitoring
Error tracking
Operational dashboards

Real-Time Business Analytics

Examples include:

Online sales monitoring
Inventory tracking
Customer activity analysis
Fraud detection

Structured Streaming vs Eventstreams

DP-700 often tests when to use each technology.

Requirement	Eventstreams	Structured Streaming
No-code ingestion	Yes	No
Visual design	Yes	No
Complex transformations	Limited	Excellent
Custom code	No	Yes
Machine learning integration	Limited	Excellent
Advanced Spark operations	No	Yes

Use Eventstreams for simple routing and ingestion.

Use Structured Streaming for advanced engineering workloads. (Microsoft Learn)

Production Best Practices

Use Spark Job Definitions

For production workloads, Microsoft recommends Spark Job Definitions rather than leaving notebooks running continuously. They provide better reliability and restart capabilities. (Microsoft Learn)

Configure Retry Policies

Retry policies allow automatic recovery from infrastructure failures. (Microsoft Learn)

Always Use Checkpoints

Never deploy production streaming jobs without checkpoint locations. (mindmeshacademy.com)

Optimize Partitioning

Appropriate partitioning improves throughput and downstream query performance. (Microsoft Learn)

Monitor Streaming Jobs

Use Fabric Monitoring Hub to monitor:

Input rate
Processing rate
Batch duration
Streaming query health (Microsoft Learn)

DP-700 Exam Tips

Remember these frequently tested concepts:

Structured Streaming treats streams as continuously growing tables.
readStream reads streaming data.
writeStream writes streaming data.
Delta tables are common streaming destinations.
Checkpointing enables fault tolerance.
Spark Job Definitions are preferred for production streaming workloads.
Event Hubs is a common streaming source.
Micro-batch processing is the default execution model.
Structured Streaming is preferred when complex transformations are required.
Eventstreams are often preferred for simpler ingestion scenarios.

Practice Exam Questions

Question 1

A company needs to process telemetry data from thousands of IoT devices as soon as it arrives. The solution must perform complex transformations before storing data in a Lakehouse.

Which technology should you choose?

A. Dataflow Gen2
B. Warehouse Stored Procedures
C. Spark Structured Streaming
D. Copy Activity

Correct Answer: C

Explanation: Spark Structured Streaming is designed for continuous data processing and complex transformations on streaming data.

Question 2

What is the primary purpose of a checkpoint location in Structured Streaming?

A. Increase Spark cluster size
B. Store temporary query results
C. Track processing progress and support recovery
D. Compress Delta files

Correct Answer: C

Explanation: Checkpoints store offsets and state information that allow recovery without reprocessing all data.

Question 3

Which method is used to create a streaming DataFrame?

A. readStream()
B. streamRead()
C. loadStreaming()
D. readDelta()

Correct Answer: A

Explanation: readStream() is the Spark API used to create streaming DataFrames.

Question 4

Which destination is most commonly used for Spark Structured Streaming in Microsoft Fabric?

A. Delta table in a Lakehouse
B. Excel workbook
C. Dataflow Gen2
D. Semantic model

Correct Answer: A

Explanation: Delta tables in Lakehouses are the primary streaming storage destination in Fabric.

Question 5

What execution model does Spark Structured Streaming primarily use?

A. Row-by-row execution
B. Continuous SQL polling
C. Micro-batch processing
D. Manual scheduling

Correct Answer: C

Explanation: Structured Streaming processes incoming data as small batches at regular intervals.

Question 6

Which Fabric component is recommended for running production Structured Streaming workloads?

A. Notebook only
B. Dataflow Gen2
C. Pipeline activity
D. Spark Job Definition

Correct Answer: D

Explanation: Spark Job Definitions provide improved reliability, retry policies, and production-grade execution.

Question 7

A streaming job must continuously calculate running totals by customer.

What type of processing is required?

A. Stateless processing
B. Stateful processing
C. Batch processing
D. Snapshot processing

Correct Answer: B

Explanation: Running totals require maintaining state across multiple batches.

Question 8

Which statement about Eventstreams and Structured Streaming is correct?

A. Eventstreams supports more advanced Spark transformations.
B. Structured Streaming is a no-code solution.
C. Structured Streaming supports complex custom code transformations.
D. Eventstreams requires Spark coding.

Correct Answer: C

Explanation: Structured Streaming provides full Spark capabilities and custom coding flexibility.

Question 9

What is the benefit of writing streaming data to Delta tables?

A. Eliminates storage costs
B. Prevents all schema changes
C. Converts data to CSV automatically
D. Provides ACID transactions and reliability

Correct Answer: D

Explanation: Delta Lake provides transactional consistency, schema evolution, and reliable streaming ingestion.

Question 10

A data engineer wants to process incoming events every 60 seconds instead of immediately.

Which feature should be configured?

A. Checkpointing
B. Consumer groups
C. Trigger interval
D. Data partitioning

Correct Answer: C

Explanation: Trigger intervals control how frequently Spark processes incoming streaming data.

Go to the DP-700 Exam Prep Hub main page.

Introduction

What Is Spark Structured Streaming?

How Structured Streaming Works

Common Sources

Common Destinations

Structured Streaming in Microsoft Fabric

Structured Streaming vs Batch Processing

Use Batch Processing When:

Use Structured Streaming When:

Micro-Batch Processing

Reading Streaming Data

Writing Streaming Data

Delta Lake Integration

Checkpointing

Triggers

Stateful vs Stateless Processing

Stateless Processing

Stateful Processing

Stream Aggregations

Common Streaming Scenarios in Fabric

IoT Monitoring

Application Telemetry

Real-Time Business Analytics

Structured Streaming vs Eventstreams

Production Best Practices

Use Spark Job Definitions

Configure Retry Policies

Always Use Checkpoints

Optimize Partitioning

Monitor Streaming Jobs

DP-700 Exam Tips

Practice Exam Questions

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8

Question 9

Question 10

Share this:

Related

Leave a comment Cancel reply

Information and resources for the data professionals' community