This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
--> Ingest and transform streaming data
--> Process data by using Spark structured streaming
Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
Modern analytics platforms increasingly require the ability to process data continuously as it arrives rather than waiting for scheduled batch loads. Microsoft Fabric supports this requirement through Spark Structured Streaming, a scalable and fault-tolerant stream processing engine built on Apache Spark.
For the DP-700 exam, you should understand when and how to use Spark Structured Streaming, how it differs from other real-time processing options such as Eventstreams and KQL Querysets, and how to design streaming solutions that write data into OneLake and Delta tables.
Spark Structured Streaming is commonly used when data engineers need to process streaming data with complex transformations, enrichments, joins, aggregations, and machine learning workloads while leveraging the scalability of Spark. (Microsoft Learn)
What Is Spark Structured Streaming?
Spark Structured Streaming is a stream-processing framework built on top of Apache Spark. It treats a continuous stream of incoming data as an unbounded table to which new rows are constantly appended. Developers write code using familiar DataFrame and Spark SQL operations while Spark handles the continuous execution behind the scenes. (Microsoft Learn)
Key characteristics include:
- Near real-time processing
- Fault tolerance
- Automatic recovery
- Horizontal scalability
- Support for complex transformations
- Integration with Delta Lake
- Exactly-once processing capabilities through checkpointing and transaction logs (Microsoft Learn)
How Structured Streaming Works
The processing flow typically follows these steps:
- Read data from a streaming source.
- Apply transformations.
- Write results to a destination.
- Store checkpoints to track processing progress.
- Continue processing new data as it arrives.
Common Sources
Spark Structured Streaming supports sources such as:
- Azure Event Hubs
- Apache Kafka
- JSON files
- CSV files
- Parquet files
- Delta tables
- Eventstreams outputs (Microsoft Learn)
Common Destinations
Results can be written to:
- Lakehouse Delta tables
- OneLake storage
- Eventhouses
- Memory sinks for testing
- Other supported storage locations (Microsoft Learn)
Structured Streaming in Microsoft Fabric
Within Microsoft Fabric, Structured Streaming is most commonly implemented through:
- Notebooks
- Spark Job Definitions
- Lakehouses
- Delta tables
- OneLake storage
A typical architecture looks like this:
Azure Event Hub | vSpark Structured Streaming | vDelta Table in Lakehouse | vSQL Analytics Endpoint | vPower BI
This pattern enables streaming data to become queryable almost immediately after arrival. (Microsoft Learn)
Structured Streaming vs Batch Processing
| Feature | Batch Processing | Structured Streaming |
|---|---|---|
| Data arrival | Periodic | Continuous |
| Processing latency | Minutes or hours | Seconds or minutes |
| Resource usage | Scheduled | Continuous |
| Typical use case | Historical reporting | Real-time analytics |
| Data availability | After load completes | Near real time |
Use Batch Processing When:
- Data changes infrequently
- Overnight processing is acceptable
- Real-time insights are unnecessary
Use Structured Streaming When:
- IoT devices generate events continuously
- Fraud detection requires immediate action
- Operational dashboards need live updates
- Telemetry data must be analyzed continuously
Micro-Batch Processing
A common DP-700 exam topic is understanding that Structured Streaming typically uses a micro-batch architecture.
Instead of processing every individual event separately, Spark groups events into small batches and processes them continuously.
Example:
Incoming Events-----------------10:00:00 - Event A10:00:01 - Event B10:00:02 - Event CMicro-batch executesProcess A+B+C together
This approach provides:
- Better performance
- Higher throughput
- Easier fault recovery
- Familiar Spark execution model (Microsoft Learn)
Reading Streaming Data
Streaming ingestion begins with:
df = ( spark.readStream .format("eventhubs") .load())
Important points:
readStreamcreates a streaming DataFrame.- Data remains continuously available.
- Spark automatically detects new events.
- The query remains active until stopped. (Microsoft Learn)
Writing Streaming Data
Streaming results are written using writeStream.
Example:
query = ( df.writeStream .format("delta") .outputMode("append") .toTable("SalesEvents"))
Common output modes include:
| Mode | Description |
|---|---|
| Append | Only new rows written |
| Complete | Entire result rewritten |
| Update | Only changed rows written |
For Fabric data engineering scenarios, Append mode is most common. (Microsoft Learn)
Delta Lake Integration
One of the most important DP-700 concepts is integrating Structured Streaming with Delta Lake.
Benefits include:
- ACID transactions
- Schema evolution
- Time travel
- Data versioning
- Reliable streaming ingestion
Streaming data can be written directly into Delta tables:
.writeStream.format("delta").toTable("Orders")
This creates a continuously updated Delta table within the Lakehouse. (Microsoft Learn)
Checkpointing
Checkpointing is critical for fault tolerance.
Example:
.option( "checkpointLocation", "Files/checkpoints/orders")
Checkpoints store:
- Processed offsets
- Query progress
- State information
Benefits:
- Prevents duplicate processing
- Enables recovery after failures
- Supports exactly-once processing semantics
A frequent exam scenario involves identifying missing checkpoint configurations as the root cause of duplicate or reprocessed data. (mindmeshacademy.com)
Triggers
Triggers control how often Spark processes incoming data.
Example:
.trigger( processingTime="1 minute")
Possible trigger strategies:
| Trigger Type | Purpose |
|---|---|
| Continuous processing | Lowest latency |
| Processing time | Fixed intervals |
| Available Now | Process all available data and stop |
Larger trigger intervals often improve throughput because more events are processed together. (Microsoft Learn)
Stateful vs Stateless Processing
Stateless Processing
Each event is processed independently.
Examples:
- Filtering
- Column selection
- Simple transformations
stream.filter("temperature > 100")
Stateful Processing
Spark maintains information between batches.
Examples:
- Running totals
- Session windows
- Stream aggregations
- Deduplication
Stateful processing is more powerful but consumes additional memory and storage resources. (jumpstart.fabric.microsoft.com)
Stream Aggregations
Streaming aggregations allow continuous calculations.
Examples:
- Sales totals
- Device counts
- Average temperatures
- Transaction volumes
Example:
stream.groupBy("DeviceID") \ .count()
This continuously updates counts as new events arrive.
Common Streaming Scenarios in Fabric
IoT Monitoring
Sensors continuously send readings.
Process:
IoT Devices |Event Hub |Spark Structured Streaming |Lakehouse |Power BI Dashboard
Application Telemetry
Applications send logs and metrics continuously.
Use cases:
- Performance monitoring
- Error tracking
- Operational dashboards
Real-Time Business Analytics
Examples include:
- Online sales monitoring
- Inventory tracking
- Customer activity analysis
- Fraud detection
Structured Streaming vs Eventstreams
DP-700 often tests when to use each technology.
| Requirement | Eventstreams | Structured Streaming |
|---|---|---|
| No-code ingestion | Yes | No |
| Visual design | Yes | No |
| Complex transformations | Limited | Excellent |
| Custom code | No | Yes |
| Machine learning integration | Limited | Excellent |
| Advanced Spark operations | No | Yes |
Use Eventstreams for simple routing and ingestion.
Use Structured Streaming for advanced engineering workloads. (Microsoft Learn)
Production Best Practices
Use Spark Job Definitions
For production workloads, Microsoft recommends Spark Job Definitions rather than leaving notebooks running continuously. They provide better reliability and restart capabilities. (Microsoft Learn)
Configure Retry Policies
Retry policies allow automatic recovery from infrastructure failures. (Microsoft Learn)
Always Use Checkpoints
Never deploy production streaming jobs without checkpoint locations. (mindmeshacademy.com)
Optimize Partitioning
Appropriate partitioning improves throughput and downstream query performance. (Microsoft Learn)
Monitor Streaming Jobs
Use Fabric Monitoring Hub to monitor:
- Input rate
- Processing rate
- Batch duration
- Streaming query health (Microsoft Learn)
DP-700 Exam Tips
Remember these frequently tested concepts:
- Structured Streaming treats streams as continuously growing tables.
readStreamreads streaming data.writeStreamwrites streaming data.- Delta tables are common streaming destinations.
- Checkpointing enables fault tolerance.
- Spark Job Definitions are preferred for production streaming workloads.
- Event Hubs is a common streaming source.
- Micro-batch processing is the default execution model.
- Structured Streaming is preferred when complex transformations are required.
- Eventstreams are often preferred for simpler ingestion scenarios.
Practice Exam Questions
Question 1
A company needs to process telemetry data from thousands of IoT devices as soon as it arrives. The solution must perform complex transformations before storing data in a Lakehouse.
Which technology should you choose?
A. Dataflow Gen2
B. Warehouse Stored Procedures
C. Spark Structured Streaming
D. Copy Activity
Correct Answer: C
Explanation: Spark Structured Streaming is designed for continuous data processing and complex transformations on streaming data.
Question 2
What is the primary purpose of a checkpoint location in Structured Streaming?
A. Increase Spark cluster size
B. Store temporary query results
C. Track processing progress and support recovery
D. Compress Delta files
Correct Answer: C
Explanation: Checkpoints store offsets and state information that allow recovery without reprocessing all data.
Question 3
Which method is used to create a streaming DataFrame?
A. readStream()
B. streamRead()
C. loadStreaming()
D. readDelta()
Correct Answer: A
Explanation: readStream() is the Spark API used to create streaming DataFrames.
Question 4
Which destination is most commonly used for Spark Structured Streaming in Microsoft Fabric?
A. Delta table in a Lakehouse
B. Excel workbook
C. Dataflow Gen2
D. Semantic model
Correct Answer: A
Explanation: Delta tables in Lakehouses are the primary streaming storage destination in Fabric.
Question 5
What execution model does Spark Structured Streaming primarily use?
A. Row-by-row execution
B. Continuous SQL polling
C. Micro-batch processing
D. Manual scheduling
Correct Answer: C
Explanation: Structured Streaming processes incoming data as small batches at regular intervals.
Question 6
Which Fabric component is recommended for running production Structured Streaming workloads?
A. Notebook only
B. Dataflow Gen2
C. Pipeline activity
D. Spark Job Definition
Correct Answer: D
Explanation: Spark Job Definitions provide improved reliability, retry policies, and production-grade execution.
Question 7
A streaming job must continuously calculate running totals by customer.
What type of processing is required?
A. Stateless processing
B. Stateful processing
C. Batch processing
D. Snapshot processing
Correct Answer: B
Explanation: Running totals require maintaining state across multiple batches.
Question 8
Which statement about Eventstreams and Structured Streaming is correct?
A. Eventstreams supports more advanced Spark transformations.
B. Structured Streaming is a no-code solution.
C. Structured Streaming supports complex custom code transformations.
D. Eventstreams requires Spark coding.
Correct Answer: C
Explanation: Structured Streaming provides full Spark capabilities and custom coding flexibility.
Question 9
What is the benefit of writing streaming data to Delta tables?
A. Eliminates storage costs
B. Prevents all schema changes
C. Converts data to CSV automatically
D. Provides ACID transactions and reliability
Correct Answer: D
Explanation: Delta Lake provides transactional consistency, schema evolution, and reliable streaming ingestion.
Question 10
A data engineer wants to process incoming events every 60 seconds instead of immediately.
Which feature should be configured?
A. Checkpointing
B. Consumer groups
C. Trigger interval
D. Data partitioning
Correct Answer: C
Explanation: Trigger intervals control how frequently Spark processes incoming streaming data.
Go to the DP-700 Exam Prep Hub main page.
