Below are the free Exam Prep Hubs currently available on The Data Community. Bookmark the hubs you are interested in and use them to ensure you are fully prepared for the respective exam.
Each hub contains:
Topic-by-topic coverage of the material, following the official study guide, so you can confirm that you are covering every aspect of the exam.
Practice exam questions for each section.
Bonus material to help you prepare.
Two (2) Practice Exams with 60 questions each, or Four (4) Practice Exams with 30 questions each – along with answer keys.
Links to useful resources, such as Microsoft Learn content, YouTube video series, and more.
Welcome to the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub!
Welcome to the one-stop hub with information for preparing for the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric certification exam. The exam measures your ability to ingest and transform data, secure and manage an analytics solution, and monitor and optimize an analytics solution. Upon successful completion of the exam, you earn the Microsoft Certified: Fabric Data Engineer Associate certification.
This hub provides information directly here (topic-by-topic as outlined in the official study guide), links to a number of external resources, tips for preparing for the exam, practice tests, and section questions to help you prepare. Bookmark this page and use it as a guide to ensure that you are fully covering all relevant topics for the DP-700 exam and making use of as many of the resources available as possible.
Audience profile (from Microsoft’s site)
As a candidate for this exam, you should have subject matter expertise with data loading patterns, data architectures, and orchestration processes. Your responsibilities for this role include:
Ingesting and transforming data.
Securing and managing an analytics solution.
Monitoring and optimizing an analytics solution.
You work closely with analytics engineers, architects, analysts, and administrators to design and deploy data engineering solutions for analytics.
You should be skilled at manipulating and transforming data by using Structured Query Language (SQL), PySpark, and Kusto Query Language (KQL).
Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Overview
Data transformation is a core component of data engineering solutions in Microsoft Fabric. After data is ingested, it is often cleaned, enriched, standardized, aggregated, joined, filtered, and reshaped before being loaded into analytical storage systems such as Lakehouses, Warehouses, or Real-Time Intelligence solutions.
Monitoring data transformations is critical because transformation failures can introduce incorrect data, reduce performance, impact downstream analytics, and create operational issues that may not be immediately visible to end users.
For the DP-700 exam, you should understand:
How transformations are performed in Microsoft Fabric
Monitoring Dataflows Gen2 transformations
Monitoring Spark notebooks and jobs
Monitoring SQL transformations
Monitoring KQL transformations
Using Monitoring Hub
Tracking execution performance
Detecting transformation failures
Monitoring data quality during transformations
Troubleshooting transformation bottlenecks
Why Transformation Monitoring Matters
A successful data ingestion process does not guarantee successful analytics.
Transformation logic can introduce issues such as:
Missing records
Duplicate records
Incorrect aggregations
Failed joins
Null values
Schema mismatches
Performance bottlenecks
Consider a sales pipeline:
Data is successfully ingested.
A transformation joins sales records to customer data.
The customer table schema changes.
The join fails.
Although ingestion succeeds, reporting becomes inaccurate because transformation processing failed.
Monitoring helps identify these problems quickly.
Common Transformation Technologies in Fabric
Several Fabric workloads perform transformations.
Dataflows Gen2
Dataflows Gen2 provide low-code transformation capabilities using Power Query.
Common operations include:
Filtering rows
Removing columns
Merging queries
Appending datasets
Data type conversions
Aggregations
Spark Notebooks
Spark notebooks support large-scale transformations using:
PySpark
Spark SQL
Scala
R
Spark is commonly used for enterprise-scale transformation workloads.
Warehouses
Fabric Warehouses perform transformations using T-SQL.
Examples include:
Data cleansing
Joins
Aggregations
MERGE operations
Dimensional model loading
KQL Databases and Eventhouses
KQL transformations are frequently used for:
Streaming analytics
Event processing
Real-time aggregations
Time-series analysis
Monitoring Hub
The Monitoring Hub serves as the primary monitoring interface for Fabric workloads.
It provides visibility into:
Dataflows
Notebooks
Pipelines
Spark jobs
Warehouse operations
Real-Time Intelligence workloads
Key information includes:
Status
Start time
End time
Duration
Error messages
Historical executions
For DP-700, understanding Monitoring Hub capabilities is important.
Monitoring Dataflow Gen2 Transformations
Dataflows Gen2 provide execution history and refresh monitoring.
You can monitor:
Refresh success
Refresh failures
Refresh duration
Processing status
Common Dataflow Monitoring Scenarios
Transformation Failures
Examples:
Invalid data types
Missing columns
Unsupported operations
Slow Refreshes
Examples:
Large source volumes
Complex joins
Multiple merge operations
Source Connectivity Problems
Examples:
Authentication failures
Source unavailability
Monitoring Spark Transformations
Spark workloads are frequently used for large-scale ETL and ELT processing.
Monitoring focuses on:
Job status
Stage execution
Resource utilization
Task failures
Query execution performance
Spark Monitoring Metrics
Job Duration
Measures total runtime.
Long runtimes may indicate:
Large data volumes
Inefficient code
Resource limitations
Executor Utilization
Shows how effectively cluster resources are being used.
Shuffle Operations
Large shuffles can significantly impact performance.
Excessive shuffling often occurs after:
Large joins
Repartition operations
Aggregations
Task Failures
Task failures often indicate:
Data issues
Memory pressure
Coding errors
Monitoring SQL Transformations
Data engineers frequently use T-SQL in Warehouses and, through the SQL analytics endpoint, to query Lakehouse tables.
Common monitoring activities include:
Query duration
Execution plans
Resource consumption
Blocking issues
SQL Performance Indicators
Long-Running Queries
May indicate:
Missing optimization
Poor filtering
Large joins
Excessive Scanning
Occurs when large tables are repeatedly scanned.
Resource Consumption
High CPU or memory usage can reduce overall system performance.
Monitoring KQL Transformations
KQL is heavily used within Real-Time Intelligence workloads.
Monitoring focuses on:
Query execution time
Data processing rates
Aggregation performance
Windowing performance
Common KQL Monitoring Scenarios
Slow Aggregations
Large datasets may require optimization.
High Latency
Streaming transformations should maintain low latency.
Resource Bottlenecks
Large event volumes can increase processing requirements.
Monitoring Data Quality During Transformation
One of the most important responsibilities of a data engineer is ensuring transformed data remains accurate.
Transformation monitoring should include quality validation.
Null Value Monitoring
Unexpected null values often indicate:
Source issues
Failed joins
Transformation errors
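To illustrate how a failed join surfaces as unexpected nulls, here is a minimal sketch in plain Python. The rows, column names, and the helper names `left_join` and `null_rate` are hypothetical; in Fabric the same check would more likely run in a notebook or as a query against the transformed table.

```python
def left_join(sales, customers_by_id):
    """Left-join sales rows to customers; unmatched rows get customer_name=None."""
    joined = []
    for row in sales:
        customer = customers_by_id.get(row["customer_id"])
        joined.append({**row, "customer_name": customer["name"] if customer else None})
    return joined


def null_rate(rows, column):
    """Fraction of rows whose value in `column` is None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r[column] is None) / len(rows)


# Hypothetical data: customer C9 does not exist in the customer table.
sales = [
    {"sale_id": 1, "customer_id": "C1"},
    {"sale_id": 2, "customer_id": "C2"},
    {"sale_id": 3, "customer_id": "C9"},
    {"sale_id": 4, "customer_id": "C1"},
]
customers = {"C1": {"name": "Contoso"}, "C2": {"name": "Fabrikam"}}

joined = left_join(sales, customers)
rate = null_rate(joined, "customer_name")  # 1 of 4 rows has no match
```

Tracking this rate per run, rather than checking it once, makes a sudden jump after a source or schema change easier to spot.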
Duplicate Detection
Duplicates may result from:
Reprocessing
Faulty joins
Improper incremental loading
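A duplicate check usually comes down to counting rows per business key. The sketch below assumes a hypothetical key of `order_id` plus `line`; the function name and sample rows are illustrative only.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return {key_tuple: count} for every business key that appears more than once."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Order 1002 appears twice, as it might after a batch is reprocessed.
orders = [
    {"order_id": 1001, "line": 1, "amount": 20.0},
    {"order_id": 1002, "line": 1, "amount": 35.5},
    {"order_id": 1002, "line": 1, "amount": 35.5},
    {"order_id": 1003, "line": 1, "amount": 12.0},
]

duplicates = find_duplicate_keys(orders, ["order_id", "line"])
```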
Row Count Validation
Compare row counts between stages.
Example:
Stage | Row Count
Raw | 1,000,000
Cleansed | 998,000
A small reduction may be expected.
A reduction to 500,000 would require investigation.
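The comparison above can be sketched in a few lines of plain Python. The function names and the 1% tolerance are assumptions chosen for illustration; in practice the counts would come from the tables at each stage, and the acceptable drop depends on the cleansing rules.

```python
def row_count_drop(source_rows: int, target_rows: int) -> float:
    """Return the fraction of rows lost between two stages (0.0 means none lost)."""
    if source_rows <= 0:
        raise ValueError("source_rows must be positive")
    return (source_rows - target_rows) / source_rows


def needs_investigation(source_rows: int, target_rows: int, tolerance: float = 0.01) -> bool:
    """Flag the stage when the drop exceeds the tolerance (an assumed 1% here)."""
    return row_count_drop(source_rows, target_rows) > tolerance


# Figures from the example: Raw 1,000,000 to Cleansed 998,000 is a 0.2% drop.
small_drop = needs_investigation(1_000_000, 998_000)
large_drop = needs_investigation(1_000_000, 500_000)
```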
Data Type Validation
Common issues include:
Numeric values stored as text
Invalid dates
Truncation errors
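The three issues listed above can each be tested row by row. This is a minimal sketch with hypothetical column names (`amount`, `order_date`, `name`), an assumed ISO date format, and an assumed target column width of 10 characters.

```python
from datetime import datetime


def validate_row(row, max_name_length=10):
    """Return a list of type problems found in one row (empty list means clean)."""
    problems = []
    try:
        float(row["amount"])  # numeric value stored as text must still parse
    except (TypeError, ValueError):
        problems.append("amount is not numeric")
    try:
        datetime.strptime(row["order_date"], "%Y-%m-%d")
    except (TypeError, ValueError):
        problems.append("order_date is not a valid date")
    if len(row["name"]) > max_name_length:
        problems.append("name would be truncated")
    return problems


clean = validate_row({"amount": "19.99", "order_date": "2024-02-29", "name": "Contoso"})
bad_date = validate_row({"amount": "19.99", "order_date": "2024-02-30", "name": "Contoso"})
bad_amount = validate_row({"amount": "N/A", "order_date": "2024-01-15", "name": "Contoso"})
too_long = validate_row({"amount": "5", "order_date": "2024-01-15", "name": "Northwind Traders"})
```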
Monitoring Transformations in Pipelines
Many transformation activities are orchestrated through Fabric pipelines.
Examples include:
Notebook activities
Dataflow activities
SQL script activities
Pipeline monitoring provides:
Activity-level status
Execution duration
Failure details
Retry history
Identifying Performance Bottlenecks
Transformation monitoring often focuses on performance optimization.
Common bottlenecks include:
Large Joins
Joining large datasets can create expensive operations.
Excessive Data Movement
Moving large volumes unnecessarily increases runtime.
Best Practices
Establish Performance Baselines
Compare current performance against historical performance.
Monitor at Multiple Levels
Monitor:
Pipeline
Activity
Job
Query
Data quality
Configure Alerts
Create alerts for:
Failed executions
Long-running jobs
High latency
Resource utilization issues
DP-700 Exam Tips
Know Where Monitoring Occurs
The Monitoring Hub is the primary monitoring interface across Fabric workloads.
Understand Spark Monitoring
Expect questions about:
Job duration
Task failures
Shuffle operations
Resource usage
Understand Data Quality Monitoring
Transformation monitoring includes more than execution status.
Validate:
Row counts
Null values
Duplicates
Data types
Understand Pipeline Activity Monitoring
Pipeline activity runs often provide the fastest path to diagnosing transformation failures.
Focus on Root Cause Analysis
Many exam questions present failed transformations and ask which monitoring information should be reviewed first.
Practice Exam Questions
Question 1
A data engineer wants to monitor the execution status of Dataflows Gen2, Spark notebooks, and pipelines from a single location.
Which Fabric feature should be used?
A. OneLake Explorer
B. Monitoring Hub
C. Eventhouse
D. Data Activator
Answer: B
Explanation: The Monitoring Hub provides centralized visibility into Fabric workloads, including dataflows, notebooks, Spark jobs, and pipelines.
Question 2
A Spark transformation job suddenly takes twice as long as normal. Which metric should be examined first?
A. Workspace role assignments
B. Sensitivity labels
C. Job duration and execution details
D. Endorsement settings
Answer: C
Explanation: Job duration and execution metrics help identify performance degradation and processing bottlenecks.
Question 3
A transformation process successfully completes, but analysts report missing records.
Which monitoring activity should be performed first?
A. Row count validation
B. Capacity scaling
C. Sensitivity label review
D. Workspace auditing
Answer: A
Explanation: Row count validation helps determine whether records were lost during transformation.
Question 4
Which Spark operation commonly introduces significant performance overhead due to data movement?
A. Filtering
B. Projection
C. Sorting a small dataset
D. Large shuffle operations
Answer: D
Explanation: Shuffle operations move data between partitions and can significantly impact performance.
Question 5
A transformation begins failing after a source system adds a new column and changes a data type.
What is the most likely root cause?
A. Capacity throttling
B. Schema change
C. Workspace permissions
D. Query acceleration
Answer: B
Explanation: Schema changes frequently cause transformation failures when downstream processes expect a different structure.
Question 6
Which data quality issue is most likely caused by a faulty join operation?
A. High CPU usage
B. Increased capacity consumption
C. Unexpected null values
D. Workspace permission errors
Answer: C
Explanation: Failed or incomplete joins often introduce null values into transformed datasets.
Question 7
A data engineer wants to verify that an incremental transformation only processed newly changed records.
What should be monitored?
A. Endorsement level
B. Watermark or change-tracking values
C. Sensitivity labels
D. Workspace membership
Answer: B
Explanation: Watermarks and change-tracking mechanisms determine which records are processed incrementally.
Question 8
Which monitoring metric is most important for streaming transformation workloads?
A. Query folder structure
B. Workspace endorsement
C. Semantic model refresh ownership
D. Processing latency
Answer: D
Explanation: Streaming solutions depend on low latency to deliver near real-time results.
Question 9
A Dataflow Gen2 refresh begins failing due to authentication problems connecting to a source system.
What type of issue is this?
A. Source connectivity issue
B. Query optimization issue
C. Data skew issue
D. Aggregation issue
Answer: A
Explanation: Authentication failures prevent successful communication with the source system.
Question 10
Which practice helps identify transformation performance degradation before users are affected?
A. Creating additional workspaces
B. Removing monitoring logs
C. Establishing performance baselines and monitoring trends
D. Increasing report refresh frequency
Answer: C
Explanation: Performance baselines make it easier to detect unusual runtimes, resource consumption, and throughput changes before they become major problems.
Overview
Data ingestion is one of the most critical processes in any data engineering solution. Regardless of whether data is ingested through pipelines, Dataflows Gen2, Eventstreams, Spark notebooks, mirroring, shortcuts, or streaming solutions, engineers must ensure that ingestion processes are running successfully, efficiently, and reliably.
In Microsoft Fabric, monitoring data ingestion involves tracking data movement activities, identifying failures, measuring performance, validating data completeness, troubleshooting bottlenecks, and ensuring data arrives in the correct destination on schedule.
For the DP-700 exam, you should understand:
How ingestion monitoring works across Fabric workloads
Monitoring pipelines and Dataflows Gen2
Monitoring Spark jobs and notebooks
Monitoring streaming ingestion
Using monitoring hubs and run history
Detecting ingestion failures
Investigating performance issues
Monitoring data quality and completeness
Best practices for operational monitoring
Why Data Ingestion Monitoring Matters
A data engineering solution is only valuable if data arrives correctly and on time.
Poorly monitored ingestion processes can result in:
Missing data
Incomplete reports
Delayed analytics
Data quality issues
Failed downstream transformations
Business decision errors
Consider an hourly sales ingestion process:
If the process fails at 2:00 AM and no monitoring is in place, the issue is not discovered until business users report incorrect dashboards.
Proper monitoring helps detect and resolve problems before they impact users.
Data Ingestion Components in Microsoft Fabric
Several Fabric services perform data ingestion:
Data Pipelines
Used for:
Copy activities
Data movement
Workflow orchestration
ETL/ELT execution
Pipelines often serve as the primary ingestion mechanism for batch data.
Dataflows Gen2
Used for:
Low-code data ingestion
Power Query transformations
ETL development
Dataflows commonly ingest data from SaaS applications, databases, and files.
Spark Notebooks
Used for:
Large-scale ingestion
Custom transformations
Lakehouse loading
Spark jobs frequently handle enterprise-scale ingestion workloads.
Eventstreams
Used for:
Streaming ingestion
Event processing
Real-time data pipelines
Mirroring
Used for:
Near real-time replication
Continuous synchronization
Operational system integration
Monitoring Hub
The Monitoring Hub is the central monitoring experience within Microsoft Fabric.
It allows administrators and engineers to monitor:
Pipeline executions
Dataflow refreshes
Notebook runs
Spark jobs
Warehouse activities
Real-Time Intelligence workloads
The Monitoring Hub provides:
Run status
Start time
End time
Duration
Error messages
Historical execution information
For DP-700, expect questions regarding how to investigate failures and review execution history.
Monitoring Pipeline Executions
Pipelines provide detailed execution tracking.
Each pipeline run includes:
Status
Activity-level details
Runtime metrics
Input/output information
Error details
Typical statuses include:
Status | Meaning
Succeeded | Completed successfully
Failed | One or more activities failed
In Progress | Currently executing
Cancelled | Stopped before completion
Activity-Level Monitoring
Pipeline monitoring drills into individual activities.
Examples:
Copy Data activity
Notebook activity
Dataflow activity
Stored Procedure activity
If a pipeline fails, reviewing activity-level details is often the fastest way to identify the root cause.
Common Pipeline Failures
Authentication Errors
Examples:
Expired credentials
Missing permissions
Invalid service principal access
Network Issues
Examples:
Source unavailable
Connectivity interruptions
Schema Changes
Examples:
Missing columns
Data type mismatches
Capacity Constraints
Examples:
Resource contention
Capacity throttling
Monitoring Dataflows Gen2
Dataflows Gen2 provide refresh history information.
Engineers can monitor:
Refresh success
Refresh failures
Execution duration
Row processing counts
Monitoring refresh history helps identify:
Slow transformations
Source system issues
Data quality problems
Dataflow Refresh History
Common metrics include:
Start time
End time
Duration
Refresh status
Error details
If refresh duration increases significantly over time, it may indicate:
Growing data volumes
Source performance degradation
Inefficient transformations
Monitoring Spark Ingestion Jobs
Spark workloads often support large-scale ingestion processes.
Ingestion Latency
Measures the delay between event creation and availability.
Lower latency generally indicates healthier streaming systems.
Failed Events
Tracks records that could not be processed.
Backlog Size
Measures unprocessed events waiting for ingestion.
Large backlogs may indicate:
Capacity issues
Slow downstream processing
Configuration problems
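Latency and backlog are both simple to compute once the timestamps and queue depths are available. The sketch below uses hypothetical function names and sample values; how the timestamps and backlog samples are collected is outside its scope.

```python
from datetime import datetime, timedelta


def ingestion_latency(created_at: datetime, available_at: datetime) -> timedelta:
    """Delay between event creation and availability in the destination."""
    return available_at - created_at


def backlog_is_growing(backlog_samples, min_samples=3):
    """True when every consecutive backlog sample is larger than the previous one."""
    if len(backlog_samples) < min_samples:
        return False
    return all(later > earlier
               for earlier, later in zip(backlog_samples, backlog_samples[1:]))


latency = ingestion_latency(datetime(2024, 1, 1, 2, 0, 0), datetime(2024, 1, 1, 2, 0, 4))
growing = backlog_is_growing([100, 250, 900, 4000])   # steadily increasing
stable = backlog_is_growing([100, 80, 120, 90])       # fluctuates, not growing
```

A backlog that fluctuates is normal under bursty load; one that rises on every sample suggests processing cannot keep up with arrivals.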
Monitoring Eventstreams
Eventstreams provide operational monitoring capabilities.
You can monitor:
Incoming event volume
Processing status
Transformation performance
Output destinations
Common issues include:
Source connectivity failures
Event schema mismatches
Destination write failures
Monitoring Mirroring
Mirroring continuously replicates source data into Fabric.
Monitoring focuses on:
Replication status
Synchronization delays
Replication failures
Data freshness
Important concepts include:
Replication Latency
Time between source changes and destination availability.
Synchronization Health
Indicates whether replication remains current.
Monitoring Data Completeness
Successful execution does not always mean successful ingestion.
Data engineers should validate:
Expected row counts
File counts
Event counts
Record completeness
Example:
A pipeline succeeds but only loads 70% of expected records.
Technical execution succeeded, but business requirements were not met.
Common Validation Checks
Row Count Validation
Compare source and destination record counts.
File Validation
Verify expected files arrived.
Timestamp Validation
Confirm recent records are present.
Duplicate Detection
Identify accidental duplicate ingestion.
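File and timestamp validation can be sketched as a set difference and a freshness test. The file names, the one-hour freshness limit, and the helper names are assumptions for illustration.

```python
from datetime import datetime, timedelta


def missing_files(expected, arrived):
    """Return expected file names that did not arrive, sorted for stable output."""
    return sorted(set(expected) - set(arrived))


def is_fresh(latest_record_time, now, max_age):
    """True when the newest loaded record is no older than max_age."""
    return now - latest_record_time <= max_age


expected = ["sales_01.csv", "sales_02.csv", "sales_03.csv"]
arrived = ["sales_01.csv", "sales_03.csv"]
not_loaded = missing_files(expected, arrived)

fresh = is_fresh(
    datetime(2024, 1, 1, 1, 30),
    now=datetime(2024, 1, 1, 2, 0),
    max_age=timedelta(hours=1),
)
```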
Monitoring Data Quality During Ingestion
Data quality monitoring often includes:
Null value detection
Invalid data type identification
Duplicate record detection
Referential integrity checks
Monitoring quality issues early prevents downstream reporting problems.
Alerts and Notifications
Monitoring becomes significantly more effective when alerts are configured.
Common alert scenarios include:
Pipeline failures
Dataflow refresh failures
Long-running jobs
Excessive ingestion latency
Capacity utilization thresholds
Alerts allow engineers to respond before business users notice issues.
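The scenarios above amount to threshold rules evaluated against each run. The following sketch shows only that rule logic in plain Python; it is not how Fabric alerts are configured, and the run-record fields and thresholds are hypothetical.

```python
def evaluate_alerts(run, max_duration_minutes=30.0, max_latency_seconds=60.0):
    """Return the alert messages triggered by one run record (a plain dict)."""
    alerts = []
    if run["status"] == "Failed":
        alerts.append("execution failed")
    if run["duration_minutes"] > max_duration_minutes:
        alerts.append("long-running job")
    if run.get("latency_seconds", 0.0) > max_latency_seconds:
        alerts.append("excessive ingestion latency")
    return alerts


healthy = evaluate_alerts({"status": "Succeeded", "duration_minutes": 12.0, "latency_seconds": 5.0})
slow = evaluate_alerts({"status": "Succeeded", "duration_minutes": 45.0})
failed = evaluate_alerts({"status": "Failed", "duration_minutes": 2.0, "latency_seconds": 120.0})
```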
Troubleshooting Ingestion Failures
A common troubleshooting workflow includes:
Step 1: Review the Monitoring Hub status.
Step 2: Identify the failed workload.
Step 3: Inspect the detailed error message.
Step 4: Validate source connectivity.
Step 5: Verify credentials and permissions.
Step 6: Review recent schema changes.
Step 7: Rerun the ingestion process if appropriate.
Best Practices
Establish Baselines
Track normal:
Runtime duration
Throughput
Latency
Data volume
Baseline measurements make anomalies easier to identify.
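One simple way to use a baseline is to flag a run whose duration exceeds the historical mean by several standard deviations. This is a sketch under that assumption; the factor of 3 and the sample durations are illustrative, and other baselining methods (percentiles, rolling medians) are equally valid.

```python
from statistics import mean, stdev


def is_anomalous(history, current, k=3.0):
    """True when `current` exceeds the historical mean by more than k standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two historical runs")
    threshold = mean(history) + k * stdev(history)
    return current > threshold


# Hypothetical durations (minutes) of the last seven runs of one pipeline.
durations = [12.0, 11.5, 12.5, 12.0, 11.8, 12.2, 12.0]

normal_run = is_anomalous(durations, 12.4)
slow_run = is_anomalous(durations, 24.0)
```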
Monitor Data Quality
Do not rely solely on execution success.
Validate:
Completeness
Accuracy
Timeliness
Use Alerts
Configure proactive notifications for:
Failures
Delays
Performance degradation
Retain Historical Monitoring Data
Historical execution information helps identify:
Trends
Capacity growth
Recurring failures
Investigate Long-Running Jobs
Increasing execution times often indicate:
Growing data volumes
Inefficient queries
Capacity limitations
DP-700 Exam Tips
Know the Monitoring Hub
The Monitoring Hub is the primary location for monitoring Fabric workloads.
Understand Pipeline Monitoring
Be familiar with:
Run history
Activity runs
Error messages
Execution duration
Understand Streaming Metrics
Know the importance of:
Throughput
Latency
Backlogs
Failed events
Monitor More Than Success Status
Successful execution does not guarantee complete or accurate data ingestion.
Understand Data Validation
Exam questions often focus on verifying:
Row counts
Data completeness
Freshness
Data quality
Practice Exam Questions
Question 1
Which Microsoft Fabric feature serves as the central location for monitoring pipelines, notebooks, Spark jobs, and dataflows?
A. Data Activator
B. OneLake Explorer
C. Monitoring Hub
D. Eventhouse
Answer: C
Explanation: The Monitoring Hub provides centralized monitoring across Fabric workloads and is the primary tool for reviewing execution history and failures.
Question 2
A pipeline execution completed successfully, but only half the expected records were loaded.
What should you verify first?
A. Workspace permissions
B. Data completeness and row counts
C. Capacity SKU
D. Sensitivity labels
Answer: B
Explanation: Successful execution does not guarantee successful business outcomes. Row count validation helps confirm complete ingestion.
Question 3
Which metric measures the delay between event creation and event availability in a streaming solution?
A. Throughput
B. Replication count
C. Ingestion latency
D. Refresh frequency
Answer: C
Explanation: Ingestion latency measures how quickly streaming data becomes available after being generated.
Question 4
Which issue is most likely if streaming event backlogs continue growing over time?
A. Processing cannot keep up with incoming events
B. Missing endorsement settings
C. Too many workspace roles
D. Excessive sensitivity labels
Answer: A
Explanation: Growing backlogs typically indicate that event processing is slower than event arrival rates.
Question 5
When troubleshooting a failed pipeline, what should typically be examined first?
A. Lakehouse shortcuts
B. Activity-level execution details
C. Workspace endorsements
D. Semantic model refresh schedules
Answer: B
Explanation: Activity-level details usually identify the exact source of a pipeline failure.
Question 6
Which metric is most useful for determining whether a Dataflow Gen2 refresh is becoming slower over time?
Question 7
A data engineer wants to verify that every expected source file was loaded during ingestion.
Which validation approach should be used?
A. Capacity monitoring
B. File count validation
C. Role assignment review
D. Workspace auditing
Answer: B
Explanation: File count validation confirms that all expected files were ingested.
Question 8
Which Spark monitoring metric can help identify uneven partition distribution during ingestion?
A. Activity retry count
B. Replication latency
C. Refresh history
D. Data skew
Answer: D
Explanation: Data skew occurs when partitions contain significantly different amounts of data, creating processing bottlenecks.
Question 9
What is the primary purpose of configuring alerts for ingestion workloads?
A. To reduce storage costs
B. To automatically increase capacity
C. To proactively notify administrators of issues
D. To encrypt incoming data
Answer: C
Explanation: Alerts help identify failures, delays, and performance issues before they impact users.
Question 10
Which monitoring focus is most important for mirrored databases?
A. Report visual refresh time
B. Synchronization health and replication latency
C. Notebook parameter values
D. Semantic model relationships
Answer: B
Explanation: Mirroring depends on keeping source and destination systems synchronized, making replication latency and synchronization health critical monitoring metrics.
Overview
Windowing functions are a fundamental concept in stream processing and real-time analytics. In Microsoft Fabric, windowing functions enable you to group continuous streams of events into logical segments called windows, allowing aggregations and calculations to be performed on streaming data as it arrives. Windowing is heavily used in Eventstreams, Real-Time Intelligence, KQL queries, and stream processing scenarios. (Reitse’s blog)
Unlike batch processing, where all data is available before processing begins, streaming systems deal with potentially infinite streams of incoming events. Windowing functions provide a mechanism to divide this endless stream into manageable chunks for analysis. (MindMesh Academy)
For the DP-700 exam, you should understand:
Why windowing functions are required
The different window types
When each window type should be used
How windowing applies in Eventstreams and KQL
The differences between tumbling, hopping, sliding, session, and snapshot windows
Common real-world scenarios
Why Windowing Functions Are Needed
Imagine a sensor generating thousands of temperature readings every second.
Without windows:
Data arrives continuously.
Aggregations never complete.
Calculating averages, counts, or sums becomes difficult.
Windowing functions solve this problem by grouping events into defined time intervals where calculations can be performed. (MindMesh Academy)
Examples include:
Count website visits every 5 minutes
Calculate average temperature every minute
Measure sales totals every hour
Detect unusual activity within a rolling 10-minute period
Analyze user sessions based on inactivity
Windowing in Microsoft Fabric
Windowing is primarily encountered in:
Eventstreams
Real-Time Intelligence
Eventhouse queries
KQL transformations
Streaming analytics solutions
Fabric supports several window types, each designed for different business requirements. (Reitse’s blog)
Tumbling Windows
Definition
A tumbling window divides a stream into fixed, non-overlapping time intervals. Each event belongs to exactly one window. (MindMesh Academy)
Example
Five-minute windows:
Window
09:00–09:05
09:05–09:10
09:10–09:15
Events are assigned to one and only one window.
Characteristics
Fixed size
No overlap
Continuous
Predictable results
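Assigning an event to a tumbling window is a matter of flooring its timestamp to the window size. The sketch below does this in plain Python; the helper names are hypothetical, and treating the window start as inclusive is an assumption of this example, since engines differ on boundary handling.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumbling_window_start(ts, size):
    """Floor a timestamp to the start of its fixed, non-overlapping window."""
    return ts - (ts - EPOCH) % size


def count_per_window(timestamps, size):
    """Count events per tumbling window, keyed by window start."""
    return Counter(tumbling_window_start(ts, size) for ts in timestamps)


events = [
    datetime(2024, 1, 1, 9, 1),
    datetime(2024, 1, 1, 9, 4),
    datetime(2024, 1, 1, 9, 5),   # falls in the 09:05 window, not 09:00
    datetime(2024, 1, 1, 9, 12),
]
counts = count_per_window(events, timedelta(minutes=5))
```

Each event contributes to exactly one key, which is what distinguishes tumbling windows from the overlapping types.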
Use Cases
Website Traffic
Count visitors every five minutes.
Sensor Monitoring
Calculate average temperature every minute.
Sales Reporting
Generate hourly revenue summaries.
Exam Tip
If a question mentions:
Fixed intervals
Non-overlapping periods
Each event belongs to one window
The answer is almost always Tumbling Window.
Hopping Windows
Definition
A hopping window uses fixed-length windows that overlap. New windows start at specified intervals called the hop size. (Reitse’s blog)
Example
Window Size = 10 minutes
Hop Interval = 5 minutes
Windows:
Window
09:00–09:10
09:05–09:15
09:10–09:20
An event may appear in multiple windows.
Characteristics
Fixed size
Overlapping
Events can belong to multiple windows
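To make the overlap concrete, the sketch below lists every hopping window that contains a given event. The function name is hypothetical, and windows are treated as start-inclusive and end-exclusive, which is an assumption of this example rather than a statement about any particular engine.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def hopping_window_starts(ts, size, hop):
    """Return the start of every window of length `size`, started every `hop`, that contains ts."""
    if hop <= timedelta(0) or size <= timedelta(0):
        raise ValueError("size and hop must be positive")
    start = ts - (ts - EPOCH) % hop   # latest window start at or before ts
    starts = []
    while start > ts - size:          # window [start, start + size) still covers ts
        starts.append(start)
        start -= hop
    return sorted(starts)


# 10-minute windows starting every 5 minutes, as in the example above.
starts = hopping_window_starts(
    datetime(2024, 1, 1, 9, 7), timedelta(minutes=10), timedelta(minutes=5)
)
```

An event at 09:07 lands in both the 09:00 and 09:05 windows. When the hop equals the size, the function returns a single window, which is the tumbling case.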
Use Cases
Rolling Analytics
Monitor sales over the previous 10 minutes every 5 minutes.
Performance Monitoring
Analyze server utilization trends.
Operational Dashboards
Create smoother trend analysis.
Exam Tip
If a question describes:
Overlapping windows
Fixed intervals
Repeated calculations over rolling periods
Choose Hopping Window.
Sliding Windows
Definition
Sliding windows continuously evaluate data over a moving time range. Unlike tumbling windows, calculations are updated whenever new events arrive. (Reitse’s blog)
Example
Monitor failed logins within the previous 10 minutes.
As each new event arrives:
Old events leave the window
New events enter the window
Results update continuously
Characteristics
Continuous evaluation
Overlapping by nature
Event-driven processing
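The failed-login example can be sketched with a queue that drops events as they age out. This is a plain-Python illustration with a hypothetical function name; it assumes events arrive in time order and that the window covers the interval ending at, and including, the current event.

```python
from collections import deque
from datetime import datetime, timedelta


def sliding_counts(timestamps, window):
    """For each event, in time order, count the events within the preceding `window`."""
    recent = deque()
    counts = []
    for ts in timestamps:
        recent.append(ts)
        # Drop events that are at least `window` older than the current one.
        while recent[0] <= ts - window:
            recent.popleft()
        counts.append(len(recent))
    return counts


failed_logins = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 2),
    datetime(2024, 1, 1, 10, 3),
    datetime(2024, 1, 1, 10, 11),
    datetime(2024, 1, 1, 10, 12),
]
counts = sliding_counts(failed_logins, timedelta(minutes=10))
```

The result is recalculated on every arrival, so an alert rule such as "three or more failures in ten minutes" can fire as soon as the third event appears.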
Use Cases
Fraud Detection
Detect suspicious transaction patterns.
Security Monitoring
Identify repeated failed logins.
IoT Alerts
Trigger warnings when sensor thresholds are exceeded.
Exam Tip
If the question mentions:
Real-time rolling calculations
Continuous updates
Last X minutes of activity
The correct answer is usually Sliding Window.
Session Windows
Definition
A session window groups events based on periods of activity separated by inactivity gaps. (Reitse’s blog)
Instead of fixed times, session windows are defined by user behavior.
Example
User activity:
Event Time
10:00
10:03
10:05
10:25
If timeout = 10 minutes:
Session 1:
10:00
10:03
10:05
Session 2:
10:25
The 20-minute gap creates a new session.
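The session example above can be reproduced with a short sketch: a gap longer than the timeout starts a new session. The function name is hypothetical, and real engines typically add further settings, such as a maximum session duration, that are omitted here.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, timeout):
    """Split events into sessions; a gap longer than `timeout` starts a new session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Event times from the example: 10:00, 10:03, 10:05, 10:25 with a 10-minute timeout.
events = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 3),
    datetime(2024, 1, 1, 10, 5),
    datetime(2024, 1, 1, 10, 25),
]
sessions = sessionize(events, timedelta(minutes=10))
```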
Characteristics
Activity-based
Dynamic duration
Defined by inactivity timeout
Use Cases
Website User Sessions
Track user visits.
Application Usage
Measure active engagement periods.
Customer Behavior Analytics
Group interactions into sessions.
Exam Tip
Look for keywords:
User sessions
Inactivity timeout
Activity periods
These indicate Session Window.
Snapshot Windows
Definition
A snapshot window groups events that share the same timestamp, so it captures data at a specific point in time rather than over a duration. (TechTacoFriday)
Think of it as taking a picture of the stream at a particular instant.
Use Cases
Point-in-Time Metrics
Current active users.
Device Status Monitoring
Current state of equipment.
Operational Dashboards
Real-time snapshots of system health.
Comparing Window Types
Window Type | Overlap | Fixed Duration | Based on Inactivity
Tumbling | No | Yes | No
Hopping | Yes | Yes | No
Sliding | Yes | Dynamic | No
Session | Dynamic | No | Yes
Snapshot | No | Instant | No
Windowing in Eventstreams
In Microsoft Fabric Eventstreams, windowing is commonly implemented using the Group By transformation. After selecting a window type, you can apply aggregations such as:
Count
Sum
Average
Minimum
Maximum
These aggregations help convert raw event streams into meaningful business metrics. (Reitse’s blog)
Windowing in KQL
KQL supports time-based aggregations by combining the summarize operator with functions such as bin():
SalesEvents
| summarize TotalSales=sum(Amount)
by bin(Timestamp, 5m)
The bin() function creates fixed time buckets similar to tumbling windows. (A Guide to Cloud & AI)
Common KQL windowing scenarios include:
Time-series analytics
Streaming dashboards
Real-time monitoring
Trend analysis
Windowing and Streaming Analytics
Windowing is critical because streaming data never stops arriving.
Without windows:
Aggregations would never complete.
Metrics could not be calculated efficiently.
Real-time dashboards would be difficult to build.
Windows provide structure and enable:
Aggregation
Alerting
Trend detection
Session analysis
Operational monitoring
DP-700 Exam Tips
Know the Window Types
Microsoft frequently tests differences between:
Tumbling
Hopping
Sliding
Session
Remember Tumbling
If:
Windows are fixed
Windows do not overlap
Events belong to exactly one window
Choose Tumbling.
Remember Session
If:
User behavior is involved
There is an inactivity timeout
Windows vary in length
Choose Session.
Remember Hopping
If:
Windows overlap
Windows have fixed sizes
Events can appear multiple times
Choose Hopping.
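The "events can appear multiple times" property can be checked with a small Python sketch. It assumes windows start at multiples of the hop and cover the half-open interval [start, start + size); the function name `hopping_windows` is hypothetical, and real engines may treat window boundaries differently.

```python
def hopping_windows(t, size, hop):
    """Return the [start, end) windows of length `size`, starting every `hop`, that contain t."""
    windows = []
    start = (t // hop) * hop  # latest window start at or before t
    while start > t - size and start >= 0:
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)


# 10-minute windows that start every 5 minutes (values in minutes).
print(hopping_windows(12, size=10, hop=5))
# With hop equal to size, the windows no longer overlap (tumbling behavior).
print(hopping_windows(12, size=5, hop=5))
```

An event at minute 12 falls into two 10-minute windows when the hop is 5 minutes, but only one window when the hop equals the window size.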
Remember Sliding
If:
Continuous recalculation occurs
Rolling analysis is needed
Alerts depend on recent activity
Choose Sliding.
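Continuous recalculation can be sketched in Python with a queue of recent events. This is a simplified model that assumes events arrive in time order; the function name `sliding_counts` and the sample timestamps are hypothetical.

```python
from collections import deque


def sliding_counts(timestamps, window):
    """For each arriving event, count the events within the preceding `window` time units."""
    recent = deque()
    counts = []
    for t in timestamps:  # assumed to arrive in time order
        recent.append(t)
        # Drop events that have aged out of the window ending at t.
        while recent[0] <= t - window:
            recent.popleft()
        counts.append(len(recent))
    return counts


# Hypothetical arrival times in minutes with a 5-minute window.
print(sliding_counts([0, 1, 2, 6, 7], window=5))
```

A new result is produced for every arriving event, which is why this pattern suits alerts that depend on recent activity.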
Practice Exam Questions
Question 1
A streaming solution must calculate the average temperature every minute. Each reading should belong to exactly one aggregation period.
What should you use?
A. Sliding window
B. Session window
C. Tumbling window
D. Hopping window
Answer: C
Explanation: Tumbling windows use fixed, non-overlapping intervals and each event belongs to only one window. (Scribd)
Question 2
You need to analyze sales from the previous 10 minutes every 5 minutes.
Which window type should you use?
A. Hopping window
B. Session window
C. Snapshot window
D. Tumbling window
Answer: A
Explanation: A 10-minute window that advances every 5 minutes is a hopping window; consecutive windows overlap by 5 minutes.
Question 3
A website analytics solution must group user activity until no activity occurs for 15 minutes.
Which window type is most appropriate?
A. Tumbling window
B. Snapshot window
C. Sliding window
D. Session window
Answer: D
Explanation: Session windows are based on inactivity periods and user behavior.
Question 4
You need a fraud detection solution that continuously evaluates transactions from the last five minutes whenever a new transaction arrives.
Which window type should be used?
A. Snapshot window
B. Session window
C. Tumbling window
D. Sliding window
Answer: D
Explanation: Sliding windows continuously recalculate results as new events arrive.
Question 5
Which window type allows an event to appear in multiple windows?
A. Tumbling window
B. Snapshot window
C. Hopping window
D. Session window
Answer: C
Explanation: Hopping windows overlap, allowing events to participate in multiple aggregations.
Question 6
What is the primary purpose of windowing functions in streaming systems?
A. Encrypt streaming data
B. Divide continuous streams into manageable groups for processing
C. Compress incoming events
D. Eliminate duplicate records
Answer: B
Explanation: Windowing organizes continuous streams into finite chunks that can be aggregated and analyzed. (MindMesh Academy)
Question 7
Which window type is most suitable for calculating hourly sales totals where no overlap is desired?
A. Sliding window
B. Hopping window
C. Session window
D. Tumbling window
Answer: D
Explanation: Tumbling windows create fixed, non-overlapping intervals.
Question 8
A streaming query groups events whenever there is activity and closes the group after ten minutes of inactivity.
What is being used?
A. Snapshot window
B. Hopping window
C. Session window
D. Tumbling window
Answer: C
Explanation: Session windows are based on inactivity timeouts.
Question 9
Which statement accurately describes a sliding window?
A. Events belong to only one interval
B. Results are calculated only after the window closes
C. Windows are based on inactivity gaps
D. Results are continuously updated as events arrive
Answer: D
Explanation: Sliding windows continuously recalculate as new events enter and old events leave the window.
Question 10
In Microsoft Fabric Eventstreams, windowing is commonly configured through which transformation?
A. Group By
B. Expand
C. Join
D. Union
Answer: A
Explanation: Eventstreams typically implement windowing through the Group By transformation, where window type and aggregations are defined. (Reitse’s blog)
Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
As organizations increasingly rely on real-time analytics, the ability to query, transform, and analyze streaming data efficiently has become a critical skill for data engineers. Within Microsoft Fabric, one of the most important technologies for real-time data processing is Kusto Query Language (KQL).
KQL is the primary query language used in Real-Time Intelligence, Eventhouses, KQL Databases, and many streaming analytics scenarios within Microsoft Fabric. It is specifically optimized for high-performance analysis of large volumes of telemetry, log, event, and time-series data.
For the DP-700 exam, candidates should understand how KQL is used to process streaming data, when it should be selected over Spark or SQL, common KQL operators, ingestion concepts, aggregation techniques, windowing functions, and real-time analytics patterns.
What Is KQL?
Kusto Query Language (KQL) is a read-optimized query language developed by Microsoft for exploring, analyzing, and transforming large volumes of structured, semi-structured, and streaming data.
KQL is the primary language used in:
Microsoft Fabric Real-Time Intelligence
Eventhouses
KQL Databases
Azure Data Explorer
Microsoft Sentinel
Azure Monitor Logs
KQL is designed for:
Fast interactive analytics
Log analysis
Telemetry processing
Streaming data analytics
Time-series analysis
Monitoring solutions
Unlike traditional T-SQL, KQL uses a pipeline-style syntax that makes analytical queries easier to read and maintain.
Why Use KQL for Streaming Data?
KQL is optimized for scenarios involving:
High ingestion rates
Near real-time querying
Large event volumes
Time-series analysis
Operational monitoring
IoT telemetry
Application logs
Security analytics
A major advantage is that newly ingested streaming data can often be queried within seconds of arrival.
KQL in Microsoft Fabric
Within Microsoft Fabric, KQL is primarily used in:
Eventhouses
Eventhouses provide scalable storage and analytics for real-time data.
Capabilities include:
High-speed ingestion
KQL querying
Streaming analytics
Time-series analysis
Dashboard integration
Eventhouses are commonly used as the central repository for streaming event data.
KQL Databases
A KQL Database is a database inside an Eventhouse.
It stores:
Tables
Functions
Materialized views
Policies
KQL queries execute against these databases.
KQL Processing Workflow
A typical streaming architecture looks like:
Event Source
|
v
Eventstream
|
v
Eventhouse
|
v
KQL Database
|
v
KQL Queries
|
v
Reports / Dashboards
Data arrives continuously and becomes available for KQL analysis almost immediately.
Understanding KQL Query Structure
A basic KQL query:
Sales
| where Region == "East"
| summarize TotalSales = sum(Amount)
The pipe symbol (|) passes results from one operation to the next.
This pipeline approach is a key exam topic.
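As a rough analogy, the pipe can be modeled in Python as a chain of functions where each step receives the previous step's rows. The helpers `where`, `summarize_sum`, and `pipe` are hypothetical names written for this sketch; they are not part of any KQL client library.

```python
from functools import reduce


def where(predicate):
    """Build a step that keeps only rows matching the predicate."""
    return lambda rows: [r for r in rows if predicate(r)]


def summarize_sum(column, alias):
    """Build a step that collapses all rows into one total."""
    return lambda rows: [{alias: sum(r[column] for r in rows)}]


def pipe(rows, *steps):
    """Feed the output of each step into the next, like the | symbol in KQL."""
    return reduce(lambda data, step: step(data), steps, rows)


sales = [
    {"Region": "East", "Amount": 100},
    {"Region": "West", "Amount": 250},
    {"Region": "East", "Amount": 50},
]
result = pipe(
    sales,
    where(lambda r: r["Region"] == "East"),
    summarize_sum("Amount", "TotalSales"),
)
print(result)
```

Reading the steps top to bottom mirrors how a KQL query is read: start with a table, filter it, then aggregate what remains.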
Filtering Streaming Data
The where operator filters records.
Example:
DeviceReadings
| where Temperature > 100
Common uses:
Error events
High temperatures
Security incidents
Suspicious transactions
Filtering early in a query improves performance.
Selecting Columns
The project operator selects specific columns.
Example:
Orders
| project OrderID, CustomerID, Amount
Benefits:
Reduced memory usage
Faster query execution
Cleaner output
Sorting Results
The sort operator orders data.
Example:
Orders
| sort by OrderDate desc
This is frequently used in monitoring and dashboard scenarios.
Aggregating Data with Summarize
The summarize operator is one of the most important KQL operators.
Example:
Sales
| summarize TotalSales = sum(Amount)
Common aggregation functions:
Function | Purpose
sum() | Total values
avg() | Average
count() | Row count
min() | Minimum value
max() | Maximum value
dcount() | Distinct count
Grouping Data
Grouping is accomplished with summarize and a grouping column.
Example:
Sales
| summarize TotalSales=sum(Amount)
by Region
Output:
Region | TotalSales
East | 250000
West | 300000
This pattern is heavily used in analytics solutions.
Time-Based Analysis
Streaming data is frequently analyzed by time.
Example:
Events
| summarize Count=count()
by bin(Timestamp, 1h)
The bin() function groups records into fixed time windows.
Common windows:
1 minute
5 minutes
15 minutes
1 hour
1 day
Working with Time-Series Data
Time-series analysis is one of KQL’s strengths.
Example:
SensorData
| summarize AvgTemp=avg(Temperature)
by bin(Timestamp, 5m)
This creates temperature averages every five minutes.
Typical use cases:
IoT monitoring
Server performance
Manufacturing systems
Financial transactions
Parsing Semi-Structured Data
Streaming data often arrives as JSON.
Example:
Events
| extend DeviceID = tostring(Event.DeviceID)
Common functions:
Function | Purpose
tostring() | Convert to string
toint() | Convert to integer
todouble() | Convert to a real (double-precision floating-point) number
parse_json() | Parse JSON object
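For readers more familiar with Python, the same parse-then-convert pattern looks roughly like this. The JSON payload and its field names are made up for illustration, and the mapping to KQL functions in the comments is only an analogy.

```python
import json

# Hypothetical raw event as it might arrive from a device.
raw_event = '{"DeviceID": 100, "Reading": {"Temperature": "72.5"}}'

event = json.loads(raw_event)                         # comparable to parse_json()
device_id = str(event["DeviceID"])                    # comparable to tostring()
temperature = float(event["Reading"]["Temperature"])  # comparable to todouble()
print(device_id, temperature)
```

Converting values to explicit types right after parsing keeps later filters and aggregations from comparing strings to numbers by accident.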
Creating Calculated Columns
The extend operator adds calculated values.
Example:
Sales
| extend Tax = Amount * 0.07
Common uses:
Calculations
Data enrichment
Derived metrics
Joining Streaming Data
KQL supports joins between datasets.
Example:
Orders
| join Customers
on CustomerID
Common scenarios:
Customer enrichment
Product lookups
Reference data joins
However, excessive joins can impact performance on very large streaming datasets.
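A reference-data join can be pictured in Python as a lookup against a small dictionary. This sketch models a simple inner join only; the customer names and order rows are hypothetical, and KQL offers several join kinds that behave differently.

```python
# Hypothetical reference data: CustomerID -> customer name.
customers = {1: "Contoso", 2: "Fabrikam"}
orders = [
    {"OrderID": 10, "CustomerID": 1, "Amount": 40.0},
    {"OrderID": 11, "CustomerID": 3, "Amount": 15.0},  # no matching customer
    {"OrderID": 12, "CustomerID": 2, "Amount": 99.0},
]

# Keep only orders with a matching customer and attach the customer name.
joined = [
    {**order, "CustomerName": customers[order["CustomerID"]]}
    for order in orders
    if order["CustomerID"] in customers
]
print([row["OrderID"] for row in joined])
```

The lookup side stays small while the event side keeps growing, which is the usual shape of an enrichment join on streaming data.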
Materialized Views
Materialized views precompute query results.
Benefits include:
Faster analytics
Reduced query costs
Improved dashboard performance
Example scenario:
A dashboard continuously displays hourly sales totals.
Instead of recalculating every query, a materialized view stores precomputed results.
This is a frequently tested DP-700 optimization topic.
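The idea of precomputing can be sketched in Python as an aggregate that is updated once per ingested row and then read many times. The class `HourlySalesView` is a hypothetical, heavily simplified model; it does not reflect how Eventhouse actually maintains materialized views.

```python
from collections import defaultdict


class HourlySalesView:
    """Keeps hourly totals up to date as rows are ingested."""

    def __init__(self):
        self.totals = defaultdict(float)

    def ingest(self, hour, amount):
        # Update the stored aggregate once, at ingestion time.
        self.totals[hour] += amount

    def query(self, hour):
        # Reads return the stored total without rescanning raw rows.
        return self.totals[hour]


view = HourlySalesView()
for hour, amount in [(9, 10.0), (9, 5.0), (10, 2.5)]:
    view.ingest(hour, amount)
print(view.query(9))
```

The dashboard query becomes a cheap lookup, and the aggregation cost is paid once per row instead of once per refresh.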
Update Policies
Update policies automatically transform data during ingestion.
Example:
RawEvents Table
|
Update Policy
|
ProcessedEvents Table
Benefits:
Automatic transformation
Consistent processing
Reduced query complexity
Common use cases:
JSON parsing
Data enrichment
Data normalization
Streaming Ingestion
Fabric supports streaming ingestion into Eventhouses.
Characteristics:
Low latency
High throughput
Near real-time availability
Common sources include:
Eventstreams
Azure Event Hubs
IoT devices
Application telemetry
Custom applications
KQL vs Spark Structured Streaming
DP-700 commonly tests when to choose each technology.
Requirement | KQL | Spark Structured Streaming
Real-time analytics | Excellent | Good
Data science workloads | Limited | Excellent
Machine learning | Limited | Excellent
Interactive querying | Excellent | Moderate
Time-series analysis | Excellent | Good
Large-scale transformations | Moderate | Excellent
SQL-like querying | Excellent | Moderate
Use KQL When:
Analyzing event data
Monitoring telemetry
Building operational dashboards
Performing log analytics
Working with Eventhouses
Use Spark When:
Complex transformations are required
Machine learning workloads exist
Advanced ETL processing is needed
Large-scale data engineering pipelines are required
KQL vs T-SQL
Feature | KQL | T-SQL
Streaming analytics | Excellent | Limited
Time-series analysis | Excellent | Moderate
OLTP operations | Poor | Excellent
Real-time dashboards | Excellent | Moderate
Log analytics | Excellent | Poor
For streaming analytics scenarios in Fabric, KQL is often the preferred option.
Performance Best Practices
Filter Early
Good:
Events
| where EventType == "Error"
| summarize count()
Poor:
Events
| summarize count() by EventType
| where EventType == "Error"
Filtering early reduces processing volume.
Project Only Required Columns
Avoid retrieving unnecessary data.
Events
| project Timestamp, DeviceID
Use Materialized Views
For frequently executed analytical queries, materialized views improve performance significantly.
Use Appropriate Time Bins
Choose bin sizes carefully:
Smaller bins = more detailed analysis
Larger bins = better performance
Common DP-700 Exam Scenarios
Scenario 1
You need near real-time analysis of millions of IoT events.
Best choice: Eventhouse + KQL
Scenario 2
You need complex machine learning transformations on streaming data.
Best choice: Spark Structured Streaming
Scenario 3
You need a dashboard showing transaction counts for each hour.
Best choice: KQL summarize with bin() function
Scenario 4
You need automatic transformation of incoming JSON data.
Best choice: Update policies
DP-700 Exam Tips
Remember these key points:
KQL is optimized for real-time analytics and event data.
Eventhouses are the primary storage and analytics engine for KQL workloads.
KQL is generally preferred over Spark for interactive streaming analytics.
Practice Exam Questions
Question 1
You need to analyze streaming telemetry data arriving from thousands of IoT devices and provide near real-time dashboards. Which technology should you primarily use?
A. Warehouse stored procedures
B. Dataflow Gen2
C. KQL in an Eventhouse
D. Power Query
Correct Answer: C
Explanation: KQL and Eventhouses are optimized for real-time analytics, telemetry processing, and interactive querying of streaming data.
Question 2
Which KQL operator is used to filter rows from a dataset?
A. summarize
B. where
C. project
D. extend
Correct Answer: B
Explanation: The where operator filters records based on specified conditions.
Question 3
A query needs to calculate total sales by region. Which KQL operator should be used?
A. project
B. where
C. summarize
D. extend
Correct Answer: C
Explanation: The summarize operator performs aggregations such as sums, averages, and counts.
Question 4
Which operator is used to create a calculated column?
A. join
B. where
C. summarize
D. extend
Correct Answer: D
Explanation: The extend operator creates new calculated columns within a query.
Question 5
You need to display the number of events generated every hour. Which function should be used?
A. bin()
B. tostring()
C. parse_json()
D. countif()
Correct Answer: A
Explanation: The bin() function groups data into fixed time intervals for time-series analysis.
Question 6
Which Fabric component serves as the primary analytics engine for KQL workloads?
A. Lakehouse
B. Warehouse
C. Eventhouse
D. Dataflow Gen2
Correct Answer: C
Explanation: Eventhouses are designed for high-scale event ingestion and KQL-based analytics.
Question 7
What is the primary benefit of a materialized view?
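A. Data encryption
B. Faster query performance through precomputed results
C. Reduced storage requirements
D. Automatic schema detection
Correct Answer: B
Explanation: Materialized views store precomputed query results, so frequently repeated analytical queries run faster.
Question 8
A data engineer must automatically transform incoming JSON data during ingestion. Which feature should be used?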
A. Spark checkpointing
B. Eventstream routing
C. Data Activator
D. Update policies
Correct Answer: D
Explanation: Update policies automatically transform data as it is ingested into KQL tables.
Question 9
Which scenario is best suited for KQL instead of Spark Structured Streaming?
A. Large-scale machine learning pipeline
B. Deep learning model training
C. Interactive analysis of streaming telemetry data
D. Complex ETL involving hundreds of joins
Correct Answer: C
Explanation: KQL excels at real-time querying and analytics of telemetry, log, and event data.
Question 10
Which KQL operator is used to select specific columns from a dataset?
A. project
B. summarize
C. extend
D. where
Correct Answer: A
Explanation: The project operator returns only the specified columns, improving efficiency and readability.
Introduction
Modern analytics platforms increasingly require the ability to process data continuously as it arrives rather than waiting for scheduled batch loads. Microsoft Fabric supports this requirement through Spark Structured Streaming, a scalable and fault-tolerant stream processing engine built on Apache Spark.
For the DP-700 exam, you should understand when and how to use Spark Structured Streaming, how it differs from other real-time processing options such as Eventstreams and KQL Querysets, and how to design streaming solutions that write data into OneLake and Delta tables.
Spark Structured Streaming is commonly used when data engineers need to process streaming data with complex transformations, enrichments, joins, aggregations, and machine learning workloads while leveraging the scalability of Spark. (Microsoft Learn)
What Is Spark Structured Streaming?
Spark Structured Streaming is a stream-processing framework built on top of Apache Spark. It treats a continuous stream of incoming data as an unbounded table to which new rows are constantly appended. Developers write code using familiar DataFrame and Spark SQL operations while Spark handles the continuous execution behind the scenes. (Microsoft Learn)
Key characteristics include:
Near real-time processing
Fault tolerance
Automatic recovery
Horizontal scalability
Support for complex transformations
Integration with Delta Lake
Exactly-once processing capabilities through checkpointing and transaction logs (Microsoft Learn)
How Structured Streaming Works
The processing flow typically follows these steps:
Read data from a streaming source.
Apply transformations.
Write results to a destination.
Store checkpoints to track processing progress.
Continue processing new data as it arrives.
Common Sources
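Spark Structured Streaming supports sources such as Azure Event Hubs, Apache Kafka, and Delta tables.
When results are written, the output mode determines which rows are sent to the destination. For Fabric data engineering scenarios, Append mode, which writes only new rows, is most common. (Microsoft Learn)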
Delta Lake Integration
One of the most important DP-700 concepts is integrating Structured Streaming with Delta Lake.
Benefits include:
ACID transactions
Schema evolution
Time travel
Data versioning
Reliable streaming ingestion
Streaming data can be written directly into Delta tables:
query = (
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", "Files/checkpoints/orders")  # needed for recovery
    .toTable("Orders")
)
This creates a continuously updated Delta table within the Lakehouse. (Microsoft Learn)
Checkpointing
Checkpointing is critical for fault tolerance.
Example:
.option("checkpointLocation", "Files/checkpoints/orders")
Checkpoints store:
Processed offsets
Query progress
State information
Benefits:
Prevents duplicate processing
Enables recovery after failures
Supports exactly-once processing semantics
A frequent exam scenario involves identifying missing checkpoint configurations as the root cause of duplicate or reprocessed data. (mindmeshacademy.com)
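The role of a checkpoint can be illustrated without Spark. The sketch below stores the last processed offset in a JSON file so that a restarted run skips events it has already handled. The function `process_available`, the file layout, and the list-based source and sink are all hypothetical; Spark's real checkpoint format is different and also tracks state.

```python
import json
import os
import tempfile


def process_available(events, checkpoint_path, sink):
    """Process events after the last checkpointed offset, then save the new offset."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    for event in events[offset:]:
        sink.append(event)
    with open(checkpoint_path, "w") as f:
        json.dump({"offset": len(events)}, f)


checkpoint = os.path.join(tempfile.mkdtemp(), "orders.json")
source, sink = ["e1", "e2"], []
process_available(source, checkpoint, sink)  # first run processes e1 and e2
source.append("e3")
process_available(source, checkpoint, sink)  # simulated restart: only e3 is new
print(sink)
```

Without the saved offset, the second run would start from the beginning and write e1 and e2 again, which is the duplicate-data symptom described above.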
Triggers
Triggers control how often Spark processes incoming data.
Example:
.trigger(processingTime="1 minute")
Possible trigger strategies:
Trigger Type | Purpose
Continuous processing | Lowest latency
Processing time | Fixed intervals
Available Now | Process all available data and stop
Larger trigger intervals often improve throughput because more events are processed together. (Microsoft Learn)
Stateful vs Stateless Processing
Stateless Processing
Each event is processed independently.
Examples:
Filtering
Column selection
Simple transformations
stream.filter("temperature > 100")
Stateful Processing
Spark maintains information between batches.
Examples:
Running totals
Session windows
Stream aggregations
Deduplication
Stateful processing is more powerful but consumes additional memory and storage resources. (jumpstart.fabric.microsoft.com)
For example, a streaming aggregation that counts events per key continuously updates the counts as new events arrive.
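The difference between stateless and stateful processing can be shown with a small Python sketch in which a dictionary plays the role of the state that survives between micro-batches. The names `state` and `process_batch` and the sample customers are hypothetical; Spark manages this state for you in a state store.

```python
from collections import defaultdict

# State kept between batches: customer -> running total.
state = defaultdict(float)


def process_batch(batch):
    """Update and return running totals for the customers seen in this batch."""
    for customer, amount in batch:
        state[customer] += amount
    return {customer: state[customer] for customer, _ in batch}


print(process_batch([("alice", 10.0), ("bob", 5.0)]))
print(process_batch([("alice", 2.5)]))
```

The second batch can only report the correct running total for alice because the first batch's result was retained, which is also why stateful queries need more memory and storage.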
Common Streaming Scenarios in Fabric
IoT Monitoring
Sensors continuously send readings.
Process:
IoT Devices
|
Event Hub
|
Spark Structured Streaming
|
Lakehouse
|
Power BI Dashboard
Application Telemetry
Applications send logs and metrics continuously.
Use cases:
Performance monitoring
Error tracking
Operational dashboards
Real-Time Business Analytics
Examples include:
Online sales monitoring
Inventory tracking
Customer activity analysis
Fraud detection
Structured Streaming vs Eventstreams
DP-700 often tests when to use each technology.
Requirement | Eventstreams | Structured Streaming
No-code ingestion | Yes | No
Visual design | Yes | No
Complex transformations | Limited | Excellent
Custom code | No | Yes
Machine learning integration | Limited | Excellent
Advanced Spark operations | No | Yes
Use Eventstreams for simple routing and ingestion.
Use Structured Streaming for advanced engineering workloads. (Microsoft Learn)
Production Best Practices
Use Spark Job Definitions
For production workloads, Microsoft recommends Spark Job Definitions rather than leaving notebooks running continuously. They provide better reliability and restart capabilities. (Microsoft Learn)
Configure Retry Policies
Retry policies allow automatic recovery from infrastructure failures. (Microsoft Learn)
Always Use Checkpoints
Never deploy production streaming jobs without checkpoint locations. (mindmeshacademy.com)
Optimize Partitioning
Appropriate partitioning improves throughput and downstream query performance. (Microsoft Learn)
DP-700 Exam Tips
Remember the following:
Structured Streaming treats streams as continuously growing tables.
readStream reads streaming data.
writeStream writes streaming data.
Delta tables are common streaming destinations.
Checkpointing enables fault tolerance.
Spark Job Definitions are preferred for production streaming workloads.
Event Hubs is a common streaming source.
Micro-batch processing is the default execution model.
Structured Streaming is preferred when complex transformations are required.
Eventstreams are often preferred for simpler ingestion scenarios.
Practice Exam Questions
Question 1
A company needs to process telemetry data from thousands of IoT devices as soon as it arrives. The solution must perform complex transformations before storing data in a Lakehouse.
Which technology should you choose?
A. Dataflow Gen2
B. Warehouse Stored Procedures
C. Spark Structured Streaming
D. Copy Activity
Correct Answer: C
Explanation: Spark Structured Streaming is designed for continuous data processing and complex transformations on streaming data.
Question 2
What is the primary purpose of a checkpoint location in Structured Streaming?
A. Increase Spark cluster size
B. Store temporary query results
C. Track processing progress and support recovery
D. Compress Delta files
Correct Answer: C
Explanation: Checkpoints store offsets and state information that allow recovery without reprocessing all data.
Question 3
Which method is used to create a streaming DataFrame?
A. readStream()
B. streamRead()
C. loadStreaming()
D. readDelta()
Correct Answer: A
Explanation: readStream (accessed as spark.readStream in PySpark) is the Spark API used to create streaming DataFrames.
Question 4
Which destination is most commonly used for Spark Structured Streaming in Microsoft Fabric?
A. Delta table in a Lakehouse
B. Excel workbook
C. Dataflow Gen2
D. Semantic model
Correct Answer: A
Explanation: Delta tables in Lakehouses are the primary streaming storage destination in Fabric.
Question 5
What execution model does Spark Structured Streaming primarily use?
A. Row-by-row execution
B. Continuous SQL polling
C. Micro-batch processing
D. Manual scheduling
Correct Answer: C
Explanation: Structured Streaming processes incoming data as small batches at regular intervals.
Question 6
Which Fabric component is recommended for running production Structured Streaming workloads?
A. Notebook only
B. Dataflow Gen2
C. Pipeline activity
D. Spark Job Definition
Correct Answer: D
Explanation: Spark Job Definitions provide improved reliability, retry policies, and production-grade execution.
Question 7
A streaming job must continuously calculate running totals by customer.
What type of processing is required?
A. Stateless processing
B. Stateful processing
C. Batch processing
D. Snapshot processing
Correct Answer: B
Explanation: Running totals require maintaining state across multiple batches.
Question 8
Which statement about Eventstreams and Structured Streaming is correct?
A. Eventstreams supports more advanced Spark transformations.
B. Structured Streaming is a no-code solution.
C. Structured Streaming supports complex custom code transformations.
D. Eventstreams requires Spark coding.
Correct Answer: C
Explanation: Structured Streaming provides full Spark capabilities and custom coding flexibility.
Question 9
What is the benefit of writing streaming data to Delta tables?
A. Eliminates storage costs
B. Prevents all schema changes
C. Converts data to CSV automatically
D. Provides ACID transactions and reliability
Correct Answer: D
Explanation: Delta Lake provides transactional consistency, schema evolution, and reliable streaming ingestion.
Question 10
A data engineer wants to process incoming events every 60 seconds instead of immediately.
Which feature should be configured?
A. Checkpointing
B. Consumer groups
C. Trigger interval
D. Data partitioning
Correct Answer: C
Explanation: Trigger intervals control how frequently Spark processes incoming streaming data.
Introduction
As organizations increasingly rely on real-time analytics, the ability to ingest, process, route, and analyze streaming data has become a critical skill for data engineers. Microsoft Fabric provides Eventstreams as a low-code, scalable solution for processing streaming data within the Real-Time Intelligence workload.
For the DP-700 exam, you should understand how Eventstreams work, how they integrate with other Fabric components, how to perform basic stream processing, and when to use Eventstreams instead of alternatives such as notebooks, pipelines, or KQL databases.
What Are Eventstreams?
An Eventstream is a real-time data processing service within Microsoft Fabric that enables users to:
Ingest streaming data from various sources
Process and transform events in motion
Route data to multiple destinations
Monitor streaming pipelines visually
Build real-time analytics solutions
Eventstreams serve as the ingestion and routing layer of many Real-Time Intelligence solutions.
Conceptually:
Data Sources
↓
Eventstream
↓
Processing & Routing
↓
Destinations
Eventstreams allow organizations to handle millions of events while maintaining low latency and high scalability.
Why Use Eventstreams?
Traditional batch processing waits for data to accumulate before processing.
Streaming scenarios require:
Immediate processing
Low-latency analytics
Real-time alerts
Continuous monitoring
Examples include:
IoT sensor monitoring
Website clickstream analysis
Application telemetry
Manufacturing equipment monitoring
Financial transaction processing
Security event monitoring
Eventstreams provide a managed platform for handling these requirements.
Eventstream Architecture
An Eventstream consists of three major components:
1. Sources
Sources provide incoming event data.
Common sources include:
Event Hubs
Fabric Eventhouses
Azure IoT Hub
Fabric Real-Time Hub
Custom applications
Sample streaming data
Example:
IoT Devices
↓
Azure Event Hubs
↓
Eventstream
2. Processing
After ingestion, Eventstreams can perform lightweight transformations.
Examples include:
Filtering records
Selecting columns
Enriching events
Basic data transformations
Event routing
Processing occurs while data is flowing through the stream.
3. Destinations
Processed events can be delivered to one or more destinations.
Common destinations include:
Eventhouse
KQL Database
Lakehouse
Activator
Custom endpoints
Example:
Eventstream
↓
┌─────────┬──────────┬─────────┐
│Lakehouse│Eventhouse│Activator│
└─────────┴──────────┴─────────┘
One incoming stream can be delivered to multiple destinations simultaneously.
Eventstreams and Real-Time Intelligence
Eventstreams are a foundational component of Fabric Real-Time Intelligence.
A typical architecture may include:
IoT Devices
↓
Eventstream
↓
Eventhouse
↓
KQL Queries
↓
Dashboards
In this architecture:
Eventstream ingests data.
Eventhouse stores data.
KQL analyzes data.
Dashboards visualize results.
Common Eventstream Sources
Azure Event Hubs
One of the most common production sources.
Use when:
High-volume streaming data exists
Enterprise-scale ingestion is required
External systems already publish events
Azure IoT Hub
Designed specifically for IoT devices.
Examples:
Manufacturing sensors
Smart buildings
Connected vehicles
Real-Time Hub
Fabric Real-Time Hub provides a centralized location for discovering and connecting streaming data sources.
Benefits include:
Simplified discovery
Easy integration
Centralized event management
Eventstream Processing Capabilities
Eventstreams support several lightweight transformation capabilities.
Filtering
Filter unwanted records before storage.
Example:
Only process temperatures above 80°F.
Input:
Device A: 75
Device B: 84
Device C: 81
Output:
Device B: 84
Device C: 81
Filtering reduces storage and processing costs.
Column Selection
Keep only required fields.
Input:
DeviceID
Temperature
Location
BatteryLevel
Timestamp
Output:
DeviceID
Temperature
Timestamp
This reduces data volume.
Data Enrichment
Additional information can be added to streaming events.
Example:
Incoming Event:
DeviceID = 100
Enriched Event:
DeviceID = 100
Region = East
Facility = Orlando
Enrichment improves downstream analytics.
Routing Events
One of the most important Eventstream features is routing.
A single incoming stream can be sent to multiple destinations.
Example:
Telemetry Stream
↓
Eventstream
↓
┌─────────┬──────────┬─────────┐
│Lakehouse│Eventhouse│Activator│
└─────────┴──────────┴─────────┘
This enables:
Historical storage
Real-time analytics
Automated actions
from the same stream.
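The filter, select, and route steps described in this section can be strung together in a short Python sketch. Eventstreams are configured visually rather than coded, so this is only a mental model; the function `run_stream`, the list-based destinations, and the `Location` values are hypothetical.

```python
def run_stream(events, destinations):
    """Filter, trim, and deliver each event to every destination."""
    for event in events:
        if event["Temperature"] <= 80:  # filtering: keep readings above 80
            continue
        # Column selection: keep only the required fields.
        trimmed = {key: event[key] for key in ("DeviceID", "Temperature")}
        for sink in destinations.values():  # routing: fan out to all destinations
            sink.append(trimmed)


destinations = {"Lakehouse": [], "Eventhouse": [], "Activator": []}
readings = [
    {"DeviceID": "A", "Temperature": 75, "Location": "L1"},
    {"DeviceID": "B", "Temperature": 84, "Location": "L2"},
    {"DeviceID": "C", "Temperature": 81, "Location": "L3"},
]
run_stream(readings, destinations)
print(destinations["Eventhouse"])
```

Each destination receives the same two filtered events, so the stream is ingested once and consumed three times.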
Eventstream Destinations
Eventhouse
Best for:
KQL analytics
Real-time dashboards
Time-series analysis
Often the primary destination in Real-Time Intelligence solutions.
Lakehouse
Best for:
Historical retention
Data science
Long-term storage
Delta table analytics
Commonly used alongside Eventhouse.
Activator
Used to trigger actions based on conditions.
Examples:
Send alerts
Trigger workflows
Notify users
Example:
Temperature > 100°F
↓
Send Alert
Eventstream Monitoring
Fabric provides monitoring capabilities for Eventstreams.
Metrics include:
Throughput
Incoming events
Failed events
Processing latency
Destination status
Monitoring helps identify:
Bottlenecks
Connection issues
Data quality problems
Eventstreams vs Pipelines
This comparison is important for the DP-700 exam.
Feature | Eventstream | Pipeline
Real-time processing | Yes | No
Streaming data | Yes | No
Batch processing | Limited | Yes
Continuous execution | Yes | No
Scheduling | No | Yes
Data movement | Yes | Yes
Use Eventstreams When
Data arrives continuously
Low latency is required
Real-time monitoring is needed
Use Pipelines When
Batch processing is required
Scheduled execution is needed
ETL orchestration is required
Eventstreams vs Notebooks
Feature | Eventstream | Notebook
Low-code | Yes | No
Streaming ingestion | Yes | Possible
Complex transformations | Limited | Extensive
Spark processing | No | Yes
Machine learning | No | Yes
Use Eventstreams
For simple streaming ingestion and routing.
Use Notebooks
For advanced Spark transformations and machine learning workloads.
Eventstreams vs Eventhouse
Candidates often confuse these services.
Eventstream
Focuses on:
Ingestion
Processing
Routing
Eventhouse
Focuses on:
Storage
Querying
Analytics
A common architecture uses both together.
Eventstream
↓
Eventhouse
↓
KQL Queries
Best Practices
Filter Early
Remove unnecessary events before storage.
Benefits:
Lower storage costs
Faster queries
Reduced processing requirements
Route Once, Consume Many
Instead of duplicating ingestion pipelines, use one Eventstream and multiple destinations.
Benefits:
Simpler architecture
Lower maintenance effort
Monitor Throughput
Regularly review:
Event ingestion rates
Failed events
Processing latency
Separate Real-Time and Historical Analytics
A common architecture is:
Eventstream
↓
┌──────────┬──────────┐
│Eventhouse│Lakehouse │
└──────────┴──────────┘
Eventhouse supports operational analytics while Lakehouse supports historical analysis.
DP-700 Exam Tips
Remember the following:
Eventstreams are designed for real-time data ingestion and routing.
Eventstreams consist of sources, processing, and destinations.
Eventstreams commonly feed Eventhouses.
Multiple destinations can receive the same stream.
Eventstreams support filtering, selection, and enrichment.
Eventstreams are not replacements for notebooks.
Pipelines are primarily for batch orchestration.
Eventhouse stores and analyzes streaming data.
Activator can trigger actions from streaming events.
Eventstreams are a key component of Fabric Real-Time Intelligence architectures.
Practice Exam Questions
Question 1
A company receives telemetry from thousands of IoT devices every second. The data must be processed immediately and sent to an Eventhouse.
Which Fabric component should be used?
A. Eventstream B. Dataflow Gen2 C. Warehouse D. Deployment Pipeline
Correct Answer: A
Explanation: Eventstreams are designed specifically for real-time ingestion, processing, and routing of streaming data.
Question 2
Which component of an Eventstream receives incoming events?
A. Destination B. Source C. Activator D. Eventhouse
Correct Answer: B
Explanation: Sources are responsible for providing incoming streaming data to the Eventstream.
Question 3
A data engineer wants to remove all records where temperature is below 70°F before storing the data.
Which Eventstream capability should be used?
A. Mirroring B. Aggregation C. Filtering D. Scheduling
Correct Answer: C
Explanation: Filtering removes unwanted records before they reach downstream destinations.
Question 4
Which destination is best suited for real-time KQL analytics?
A. Warehouse B. Notebook C. Dataflow Gen2 D. Eventhouse
Correct Answer: D
Explanation: Eventhouse is optimized for real-time analytics and KQL querying.
Question 5
A company wants the same streaming data to be stored historically and analyzed in real time.
What should be done?
A. Create two separate Eventstreams B. Route the Eventstream to both a Lakehouse and an Eventhouse C. Export the data twice D. Use Dataflow Gen2
Correct Answer: B
Explanation: Eventstreams can send data to multiple destinations simultaneously.
Question 6
Which Fabric service can trigger alerts based on conditions detected in streaming data?
A. Pipeline B. Activator C. Warehouse D. Notebook
Correct Answer: B
Explanation: Activator can generate notifications and actions based on event conditions.
Question 7
Which statement best describes Eventstreams?
A. Primarily used for batch ETL scheduling B. Primarily used for dashboard creation C. Primarily used for real-time ingestion and routing D. Primarily used for SQL warehousing
Correct Answer: C
Explanation: Eventstreams specialize in streaming ingestion, lightweight processing, and routing.
Question 8
Which service is generally preferred for complex Spark-based transformations?
A. Eventstream B. Activator C. Eventhouse D. Notebook
Correct Answer: D
Explanation: Notebooks provide extensive Spark and PySpark transformation capabilities that exceed Eventstream processing functionality.
Question 9
What is a major benefit of routing a stream to multiple destinations?
A. Eliminates all storage costs B. Allows different workloads to consume the same stream simultaneously C. Removes the need for Eventhouse D. Prevents data retention
Correct Answer: B
Explanation: Multiple destinations allow operational analytics, historical storage, and alerting from the same data stream.
Question 10
Which statement accurately compares Eventstreams and pipelines?
A. Pipelines are optimized for continuous streaming ingestion. B. Eventstreams are primarily used for batch scheduling. C. Both services are identical. D. Eventstreams are optimized for real-time processing, while pipelines are optimized for batch orchestration.
Correct Answer: D
Explanation: Eventstreams handle continuously arriving data, while pipelines are designed for orchestrated batch processing and scheduled workflows.
This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub. This topic falls under these sections: Ingest and transform data (30–35%) --> Ingest and transform streaming data --> Choose between Query Acceleration for OneLake shortcuts and standard OneLake shortcuts in Real-Time Intelligence
Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
Microsoft Fabric provides multiple ways to access data stored in OneLake from Real-Time Intelligence workloads such as Eventhouses and KQL databases. One of the most important design decisions for data engineers is determining whether to use:
Standard OneLake shortcuts
Query-accelerated OneLake shortcuts
Understanding the differences between these options is essential for the DP-700 exam because they directly affect performance, cost, latency, storage consumption, and analytics architecture.
This article explains how each option works, when to use them, their limitations, and the decision-making criteria you should understand for the exam.
Understanding OneLake Shortcuts
A OneLake shortcut is a virtual reference to data stored elsewhere. Instead of copying data, the shortcut points to an existing data source. This allows multiple Fabric experiences to access the same data without creating duplicate copies. (Microsoft Learn)
For example:
A Lakehouse contains sales data.
An Eventhouse creates a shortcut to that data.
Queries can access the data through the shortcut.
The original data remains in its source location.
Benefits include:
No data duplication
Reduced storage costs
Single source of truth
Simplified data management
Faster implementation
Standard OneLake Shortcuts
A standard OneLake shortcut allows Real-Time Intelligence workloads to query external data directly from OneLake without ingesting it into the Eventhouse. (Microsoft Learn)
How It Works
When a query executes:
Eventhouse accesses the shortcut.
Data is retrieved from the source Delta table.
Results are returned to the query.
No additional indexing or caching is performed.
Advantages
Minimal setup effort
No duplicated storage
Lower cost
Immediate access to existing data
Suitable for infrequent queries
Disadvantages
Slower query performance
Higher query latency
External storage access required during execution
Limited optimization opportunities
Query Acceleration for OneLake Shortcuts
Query Acceleration is a feature in Real-Time Intelligence that improves query performance against OneLake shortcut data by automatically caching and indexing selected data. (Video2 Skills Academy)
Instead of repeatedly reading Delta files from storage, Fabric creates optimized structures that significantly improve performance.
How It Works
When acceleration is enabled:
A shortcut is created.
Fabric indexes the data.
Fabric caches data based on the configured retention period.
Queries use optimized structures instead of repeatedly scanning raw files. (Microsoft Learn)
The experience becomes similar to querying native Eventhouse data.
Query Acceleration Architecture
Without acceleration:
Delta Table
↓
OneLake Shortcut
↓
Query Reads Files Directly
With acceleration:
Delta Table
↓
OneLake Shortcut
↓
Indexing and Caching
↓
High-Performance Queries
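The difference between the two flows can be illustrated with a toy model. This is not how Fabric implements acceleration; the classes below are hypothetical and only contrast re-reading source files on every query with reusing a cached copy.

```python
class StandardShortcut:
    """Toy model: every query reads the source files again."""
    def __init__(self, source_files):
        self.source_files = source_files
        self.file_reads = 0

    def query(self, predicate):
        self.file_reads += len(self.source_files)
        return [row for f in self.source_files for row in f if predicate(row)]

class AcceleratedShortcut(StandardShortcut):
    """Toy model: rows are loaded once into a cache and reused afterwards."""
    def __init__(self, source_files):
        super().__init__(source_files)
        self._cache = None

    def query(self, predicate):
        if self._cache is None:
            # First query pays the cost of reading the files once.
            self.file_reads += len(self.source_files)
            self._cache = [row for f in self.source_files for row in f]
        return [row for row in self._cache if predicate(row)]

files = [[1, 2, 3], [4, 5, 6]]
standard, accelerated = StandardShortcut(files), AcceleratedShortcut(files)
for _ in range(3):
    standard_result = standard.query(lambda r: r > 4)
    accelerated_result = accelerated.query(lambda r: r > 4)
```

Both models return the same rows; only the number of file reads differs, which is why acceleration pays off mainly when queries repeat.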
Performance Comparison
Characteristic | Standard Shortcut | Query Accelerated Shortcut
Data duplication | No | No
Caching | No | Yes
Indexing | No | Yes
Query latency | Higher | Lower
Large-scale analytics | Moderate | Excellent
Cost | Lower | Higher
Setup complexity | Low | Moderate
When to Use Standard OneLake Shortcuts
Choose standard shortcuts when:
Query Frequency is Low
If users only occasionally access the data, acceleration may not provide sufficient value.
Examples:
Monthly compliance reports
Ad hoc investigations
Occasional auditing
Cost Optimization is Critical
Since acceleration introduces caching and indexing costs, standard shortcuts are often preferred for budget-sensitive workloads.
Data Volumes are Small
Smaller datasets generally perform well enough without acceleration.
When to Use Query Acceleration
Choose query acceleration when:
High Query Volume Exists
Examples:
Interactive dashboards
Continuous monitoring
Frequent analytics workloads
Large Delta Tables Are Queried
Large historical datasets often benefit significantly from acceleration.
Real-Time and Historical Data Must Be Combined
A common Real-Time Intelligence pattern involves:
Streaming data arriving in Eventhouse
Historical data stored in OneLake
Query acceleration enables efficient joins between both datasets. (Video2 Skills Academy)
Example:
Live Sensor Stream
+
Historical Equipment Data
=
Real-Time Analytics
Dimension Data Must Be Joined Frequently
Organizations often mirror dimension data into OneLake and then use accelerated shortcuts for enrichment and lookup operations. (Video2 Skills Academy)
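The enrichment pattern amounts to a lookup join between live events and dimension rows. In an Eventhouse this would be written in KQL (for example with a join or lookup operator); the plain-Python sketch below only illustrates the logic, and all table contents and field names are hypothetical.

```python
# Hypothetical dimension data, standing in for a table reached through a shortcut.
equipment = {
    "pump-1": {"site": "Plant A", "model": "PX-100"},
    "pump-2": {"site": "Plant B", "model": "PX-200"},
}

# Hypothetical live events, standing in for rows arriving in an Eventhouse table.
live_events = [
    {"equipment_id": "pump-1", "pressure": 4.2},
    {"equipment_id": "pump-2", "pressure": 5.1},
    {"equipment_id": "pump-9", "pressure": 3.3},  # no matching dimension row
]

def enrich(events, dimension):
    """Left-join each event to its dimension row by equipment_id."""
    return [{**e, **dimension.get(e["equipment_id"], {})} for e in events]

enriched = enrich(live_events, equipment)
```

Every incoming event triggers a lookup, so the dimension side is read far more often than it changes, which is the access pattern that benefits from caching and indexing.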
Configuring Query Acceleration
Acceleration can be enabled:
During shortcut creation
After shortcut creation through Data Policies settings (Microsoft Learn)
Administrators can also define:
Number of cached days
Retention period
Acceleration policies
The caching period determines how much data remains optimized for high-performance access. (Microsoft Learn)
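As a sketch only: shortcuts in a KQL database are exposed as external tables, and Kusto documents a `query_acceleration` policy for external tables with `IsEnabled` and `Hot` properties. Assuming that command applies to your shortcut, a helper that builds the command text might look like the following. The table name `SalesHistory` is hypothetical, and the exact syntax should be verified against current Microsoft documentation before use.

```python
import json

def build_acceleration_command(table_name: str, cached_days: int, enabled: bool = True) -> str:
    """Build the text of a query acceleration policy command (syntax assumed, verify before use)."""
    if cached_days < 1:
        raise ValueError("cached_days must be at least 1")
    # "Hot" is expressed as a timespan in d.hh:mm:ss form.
    policy = json.dumps({"IsEnabled": enabled, "Hot": f"{cached_days}.00:00:00"})
    return f".alter external table {table_name} policy query_acceleration '{policy}'"

command = build_acceleration_command("SalesHistory", cached_days=7)
```

The helper only produces a string; running it against a database is left out on purpose.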
Caching Period Considerations
The caching period directly impacts:
Query performance
Storage consumption
Cost
Example:
Cached Period | Typical Use Case
7 days | Operational monitoring
30 days | Business analytics
90 days | Historical trend analysis
Longer periods improve performance across larger time ranges but increase storage costs.
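A back-of-the-envelope estimate shows how the caching period scales the amount of cached data. The 20 GB per day figure is an assumption chosen for illustration, and the calculation ignores compression and index overhead.

```python
def cached_volume_gb(daily_volume_gb: float, cached_days: int) -> float:
    """Estimate how much data falls inside the caching period."""
    return daily_volume_gb * cached_days

# Assumed ingestion of 20 GB per day; the figure is illustrative only.
estimates = {days: cached_volume_gb(20, days) for days in (7, 30, 90)}
```

Under this assumption, moving from a 7-day to a 90-day period raises the cached volume from 140 GB to 1,800 GB, which is why the period should match the time range that queries actually cover.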
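Cost Considerations
This topic frequently appears in architecture-based exam questions. Acceleration adds caching and indexing costs, and longer caching periods increase storage consumption, so the performance gain must justify the added cost.
Limitations
Accelerated shortcuts behave like external tables and inherit some external table limitations:
External tables with extremely large file counts may experience reduced effectiveness.
Certain Delta table schema changes may require reacceleration.
Some advanced Delta features may require disabling and re-enabling acceleration.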
Decision Framework for the Exam
A useful exam strategy:
Choose Standard Shortcuts When
Cost is the highest priority.
Data is queried infrequently.
Data volume is small to moderate.
Performance requirements are relaxed.
Choose Query Acceleration When
Performance is critical.
Queries occur frequently.
Large datasets are analyzed.
Historical and streaming data are combined.
Interactive analytics workloads exist.
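The two checklists above can be encoded as a simple rule of thumb. The function and its parameter names are hypothetical, and the precedence (cost wins only when queries are infrequent) is an assumption, not an official rule.

```python
def choose_shortcut(frequent_queries: bool, large_dataset: bool,
                    joins_streaming_data: bool, cost_is_top_priority: bool) -> str:
    """Pick a shortcut type from the exam-style requirements."""
    needs_performance = frequent_queries or large_dataset or joins_streaming_data
    # Assumed precedence: cost wins when the data is rarely queried.
    if cost_is_top_priority and not frequent_queries:
        return "standard shortcut"
    return "query-accelerated shortcut" if needs_performance else "standard shortcut"

monthly_report = choose_shortcut(False, False, False, True)
dashboard = choose_shortcut(True, True, False, False)
```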
DP-700 Exam Tips
Remember These Key Points
OneLake shortcuts avoid data duplication.
Standard shortcuts read data directly from the source at query time, with no indexing or caching.
Query acceleration adds indexing and caching.
Query acceleration improves performance but increases cost.
Accelerated shortcuts are ideal for frequent analytical queries.
Standard shortcuts are ideal for occasional access scenarios.
Query acceleration is especially valuable when combining streaming and historical datasets.
Cached retention periods directly affect cost and performance.
Accelerated shortcuts behave like external tables and inherit some external table limitations.
The exam often focuses on choosing the most cost-effective versus highest-performance solution.
Practice Exam Questions
Question 1
A company uses Eventhouse to analyze telemetry data. Historical data resides in OneLake and is queried thousands of times per day. Query performance is poor.
What should you implement?
A. Dataflows Gen2 B. Query acceleration on the OneLake shortcut C. Warehouse mirroring D. Notebook scheduling
Correct Answer: B
Explanation: Query acceleration adds indexing and caching that significantly improves query performance for frequently accessed shortcut data. (Video2 Skills Academy)
Question 2
What is the primary benefit of a standard OneLake shortcut?
A. Eliminates all query latency B. Automatically indexes data C. Provides access to data without duplication D. Creates materialized views
Correct Answer: C
Explanation: Shortcuts reference existing data rather than copying it, allowing a single source of truth. (Microsoft Learn)
Question 3
A solution prioritizes the lowest possible storage and acceleration costs. Data is queried only once per month.
Which option should be selected?
A. Query-accelerated shortcut B. Materialized view C. Standard OneLake shortcut D. Native Eventhouse ingestion
Correct Answer: C
Explanation: When query frequency is very low, the additional acceleration costs are generally not justified.
Question 4
What additional capability does query acceleration provide?
A. Encryption B. Data mirroring C. Row-level security D. Caching and indexing
Correct Answer: D
Explanation: Query acceleration improves performance through indexing and caching. (Video2 Skills Academy)
Question 5
Which scenario most strongly justifies query acceleration?
A. Small dataset queried quarterly B. Development environment testing C. Large historical dataset used in interactive dashboards D. One-time data migration
Correct Answer: C
Explanation: Interactive dashboards require low latency and frequent queries, making acceleration highly beneficial.
Question 6
What happens to the source data when a OneLake shortcut is created?
A. It is copied into Eventhouse B. It is archived C. It is compressed D. It remains in its original location
Correct Answer: D
Explanation: A shortcut is only a reference to the original data source. (Microsoft Learn)
Question 7
An engineer wants to join streaming Eventhouse data with historical OneLake data while maintaining high query performance.
Which approach should be recommended?
A. Query-accelerated shortcut B. Dataflow Gen2 C. Warehouse endpoint D. Manual exports
Correct Answer: A
Explanation: One of the primary use cases for query acceleration is combining streaming and historical data efficiently. (Video2 Skills Academy)
Question 8
What configuration primarily controls how much accelerated data remains cached?
A. Workspace role assignments B. Retention and caching period settings C. Lakehouse schema definitions D. Fabric tenant settings
Correct Answer: B
Explanation: Administrators specify how many days of data are retained in the acceleration cache. (Microsoft Learn)
Question 9
Which statement about accelerated shortcuts is true?
A. They always cost less than standard shortcuts. B. They require data duplication. C. They can improve performance through cached and indexed data. D. They eliminate storage requirements.
Correct Answer: C
Explanation: Acceleration works by indexing and caching data while still avoiding data duplication. (Video2 Skills Academy)
Question 10
A company needs the fastest possible query performance against frequently accessed OneLake data and is willing to accept additional cost.
Which option should be chosen?
A. Standard OneLake shortcut B. Manual exports to CSV C. Dataflow Gen2 D. Query-accelerated OneLake shortcut
Correct Answer: D
Explanation: Query acceleration is specifically designed to maximize query performance by using caching and indexing mechanisms. (Video2 Skills Academy)
This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub. This topic falls under these sections: Ingest and transform data (30–35%) --> Ingest and transform streaming data --> Choose between native tables and OneLake shortcuts in Real-Time Intelligence
Introduction
One of the key design decisions when building real-time analytics solutions in Microsoft Fabric is determining where data should reside and how it should be accessed. Within Real-Time Intelligence, data engineers frequently encounter scenarios where they must choose between:
Native Tables in Eventhouse/KQL databases
OneLake Shortcuts to data stored elsewhere
Understanding the differences between these approaches is important for the DP-700 exam because the choice impacts:
Query performance
Data latency
Storage costs
Data governance
Data duplication
Maintenance complexity
A successful data engineer must understand when to ingest data directly into Real-Time Intelligence and when to reference existing data through shortcuts.
Understanding Real-Time Intelligence
Real-Time Intelligence is Microsoft Fabric’s solution for ingesting, analyzing, and acting upon streaming and operational data.
Key components include:
Eventstream
Eventhouse
KQL Databases
Data Activator
Real-Time Dashboards
Data stored within Eventhouse and KQL databases can come from multiple sources:
Direct streaming ingestion
Batch ingestion
External storage systems
OneLake data sources
This is where the choice between native tables and OneLake shortcuts becomes important.
What Are Native Tables?
Native tables are physical tables stored directly inside a KQL database or Eventhouse.
When data is ingested into Real-Time Intelligence, it is written into these tables and becomes part of the Eventhouse storage engine.
Characteristics of Native Tables
Native tables:
Physically store data
Support extremely fast query performance
Are optimized for time-series analytics
Support continuous streaming ingestion
Provide low-latency access
Support update policies and materialized views
Enable advanced KQL analytics
Native Table Architecture
Event Source
↓
Eventstream
↓
Native Table
↓
KQL Queries
↓
Dashboards / Analytics
Data resides directly within the Eventhouse environment.
Advantages of Native Tables
Highest Query Performance
Because data is physically stored in the Eventhouse engine, query execution is highly optimized.
Benefits include:
Faster aggregations
Faster filtering
Lower latency
Better concurrency
Optimized for Streaming Workloads
Native tables are specifically designed for:
High ingestion rates
Continuous event streams
Telemetry data
Operational analytics
Support for Advanced Features
Native tables support:
Materialized views
Update policies
Data retention policies
Cached query execution
Time-series functions
Lower Query Latency
Real-time dashboards often require results within seconds.
Native tables generally provide the lowest latency.
Disadvantages of Native Tables
Data Duplication
The same data may already exist elsewhere:
Lakehouse
Warehouse
ADLS Gen2
Other databases
Ingesting into native tables creates another copy.
Increased Storage Costs
More copies of data mean:
More storage consumption
Additional retention management
Additional Ingestion Processing
Data must be moved, loaded, and managed before it becomes available.
What Are OneLake Shortcuts?
A OneLake shortcut provides a virtual reference to data stored elsewhere.
Rather than copying data into Eventhouse, Real-Time Intelligence accesses the existing data through the shortcut.
Shortcut Concept
Instead of:
Source → Copy → Eventhouse
You get:
Source → OneLake Shortcut → Query
No physical duplication occurs.
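A loose analogy in Python: ingestion behaves like a deep copy, while a shortcut behaves like a second name for the same object. The table contents are made up, and the analogy says nothing about how OneLake stores data.

```python
import copy

source_table = [{"order_id": 1, "amount": 100}]

ingested_copy = copy.deepcopy(source_table)  # Source -> Copy -> Eventhouse
shortcut_view = source_table                 # Source -> OneLake Shortcut -> Query

# A new row lands in the source after both were created.
source_table.append({"order_id": 2, "amount": 250})
```

The copy still holds one row and must be refreshed by another ingestion, while the reference sees both rows immediately.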
Supported Sources
Shortcuts can reference:
Fabric Lakehouses
Fabric Warehouses
Azure Data Lake Storage Gen2
Amazon S3
Other supported storage locations
Characteristics of OneLake Shortcuts
Shortcuts:
Avoid copying data
Provide a single source of truth
Reduce storage costs
Simplify governance
Enable data reuse
Advantages of OneLake Shortcuts
Eliminate Data Duplication
Eliminating duplication is one of the biggest advantages of shortcuts. Instead of storing multiple copies, a single source serves every consumer:
One Source
↓
Multiple Consumers
All consumers access the same data.
Lower Storage Costs
Since data is not duplicated:
Less storage consumption
Lower management overhead
Faster Data Availability
No ingestion process is required.
Data becomes accessible immediately after the shortcut is created.
Improved Governance
Governance becomes easier because:
Data remains in one location
Policies remain centralized
Data lineage remains clearer
Supports the One Copy Vision
OneLake is built around the principle of:
“One copy of data for the entire organization.”
Shortcuts are a key enabler of this strategy.
Disadvantages of OneLake Shortcuts
Potentially Higher Query Latency
Because data is not stored locally:
Queries may require additional access steps
Performance can be slower than native tables
Limited Optimization
Some advanced Eventhouse optimization capabilities are most effective with native data.
Examples include:
Materialized views
Update policies
Streaming ingestion optimizations
Dependency on Source Availability
If the source becomes unavailable:
Queries may fail
Performance may degrade
Native tables do not have this dependency.
When to Choose Native Tables
Choose native tables when:
Real-Time Performance Is Critical
Examples:
Monitoring dashboards
Security analytics
Fraud detection
Manufacturing telemetry
Continuous Streaming Ingestion Exists
Examples:
IoT sensors
Application logs
Device telemetry
High Query Volumes Are Expected
Examples:
Enterprise dashboards
Operational reporting
Advanced KQL Features Are Required
Examples:
Materialized views
Update policies
Retention policies
When to Choose OneLake Shortcuts
Choose shortcuts when:
Data Already Exists in OneLake
Avoid creating unnecessary copies.
Storage Costs Must Be Minimized
Shortcuts reduce storage requirements.
Data Sharing Is Important
Multiple teams can access the same dataset.
Data Is Primarily Historical
Examples:
Historical archives
Reference datasets
Slowly changing datasets
Governance Is a Priority
Maintaining a single source of truth simplifies compliance and governance efforts.
Comparing Native Tables and OneLake Shortcuts
Feature | Native Tables | OneLake Shortcuts
Physical storage | Yes | No
Data duplication | Yes | No
Storage cost | Higher | Lower
Query performance | Highest | Good
Streaming ingestion | Excellent | Not primary purpose
Advanced KQL features | Full support | Limited scenarios
Data governance | More complex | Simpler
Single source of truth | No | Yes
Real-time analytics | Best choice | Suitable in some cases
Historical data access | Good | Excellent
Common DP-700 Exam Scenarios
Scenario 1
A manufacturing company ingests millions of telemetry events every minute and requires dashboards that refresh within seconds.
Best Choice: Native Tables
Reason:
Maximum ingestion performance
Lowest query latency
Scenario 2
An organization already stores enterprise sales data in a Fabric Lakehouse and wants Eventhouse users to analyze it without creating another copy.
Best Choice: OneLake Shortcut
Reason:
Eliminates duplication
Supports centralized governance
Scenario 3
A security operations center performs continuous threat monitoring using KQL.
Best Choice: Native Tables
Reason:
Optimized for streaming analytics
Fast query response times
Scenario 4
A data engineering team needs occasional access to historical archive data stored in ADLS Gen2.
Best Choice: OneLake Shortcut
Reason:
No need to ingest large historical datasets
Lower storage costs
Decision Framework
Ask the following questions:
Is the data arriving continuously?
If yes → Native Tables.
Is ultra-low latency required?
If yes → Native Tables.
Does the data already exist in OneLake?
If yes → Consider OneLake Shortcuts.
Is avoiding duplication important?
If yes → OneLake Shortcuts.
Are advanced KQL optimization features required?
If yes → Native Tables.
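The questions above can be walked through in code. This is a hypothetical helper, and it assumes that performance-driven requirements take precedence over storage-driven ones when both apply.

```python
def choose_storage(continuous_arrival: bool, ultra_low_latency: bool,
                   advanced_kql_features: bool, already_in_onelake: bool,
                   avoid_duplication: bool) -> str:
    """Answer the decision questions; performance-driven answers win first (assumed)."""
    if continuous_arrival or ultra_low_latency or advanced_kql_features:
        return "native table"
    if already_in_onelake or avoid_duplication:
        return "OneLake shortcut"
    return "either (no deciding requirement)"

# Streaming telemetry with dashboards that refresh within seconds.
telemetry = choose_storage(True, True, False, False, False)
# Sales data already in a Lakehouse, no extra copy wanted.
sales_history = choose_storage(False, False, False, True, True)
```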
DP-700 Exam Tips
Remember these key distinctions:
Native tables physically store data inside Eventhouse.
Native tables provide the highest performance.
Native tables are ideal for streaming ingestion.
OneLake shortcuts reference data without copying it.
Shortcuts support the One Copy vision of OneLake.
Shortcuts reduce storage costs.
Native tables are preferred when low-latency analytics is critical.
Shortcuts are preferred when data already exists elsewhere and duplication should be avoided.
Exam questions often focus on balancing performance versus storage and governance.
Practice Exam Questions
Question 1
A company requires sub-second analytics on continuously arriving IoT telemetry data in Eventhouse.
Which storage approach should be selected?
A. OneLake shortcut to a Lakehouse B. OneLake shortcut to ADLS Gen2 C. Native table D. Dataflow Gen2
Answer: C
Explanation: Native tables provide the lowest latency and are optimized for continuous streaming ingestion and real-time analytics.
Question 2
An organization already stores customer history in a Fabric Lakehouse and wants Eventhouse users to analyze the data without creating additional copies.
Which option should be used?
A. Native table B. OneLake shortcut C. Eventstream ingestion D. Data Activator
Answer: B
Explanation: OneLake shortcuts allow access to existing data without physically copying it into Eventhouse.
Question 3
What is the primary advantage of using OneLake shortcuts?
A. Faster ingestion speeds B. Automatic materialized views C. Lower query latency D. Elimination of data duplication
Answer: D
Explanation: Shortcuts provide virtual access to data and eliminate the need to create additional copies.
Question 4
Which feature is most strongly associated with native tables?
A. Single source of truth B. External data access C. Physical storage within Eventhouse D. Reduced storage costs
Answer: C
Explanation: Native tables physically store data within Eventhouse and are optimized for real-time analytics.
Question 5
A team wants to minimize storage costs while analyzing historical datasets already stored in OneLake.
Which option is best?
A. Native tables B. OneLake shortcuts C. Spark cache tables D. Temporary KQL tables
Answer: B
Explanation: Shortcuts allow direct access to existing data without storing another copy.
Question 6
Which scenario most strongly favors native tables?
A. Historical archive access B. Shared enterprise data reuse C. High-volume streaming telemetry analytics D. Storage cost reduction
Answer: C
Explanation: Native tables are designed for continuous ingestion and high-performance real-time analytics.
Question 7
A data engineer wants to support the OneLake principle of maintaining a single copy of organizational data.
Which option best aligns with this goal?
A. Native tables B. Materialized views C. Streaming ingestion D. OneLake shortcuts
Answer: D
Explanation: Shortcuts are specifically designed to support OneLake’s single-copy architecture.
Question 8
Which statement about native tables is true?
A. They never store data physically. B. They generally provide better query performance than shortcuts. C. They require external storage systems. D. They cannot be queried with KQL.
Answer: B
Explanation: Because the data is stored directly inside Eventhouse, native tables typically deliver the highest performance.
Question 9
A company wants to use advanced KQL features such as update policies and materialized views on streaming data.
Which approach should be selected?
A. OneLake shortcut B. Warehouse shortcut C. Native table D. Dataflow Gen2
Answer: C
Explanation: Advanced Eventhouse optimization features are most commonly associated with native tables.
Question 10
Which factor most commonly drives the decision to use a OneLake shortcut instead of a native table?
A. Requirement for lowest latency analytics B. Requirement for continuous event ingestion C. Requirement for materialized views D. Requirement to avoid storing duplicate copies of data
Answer: D
Explanation: The primary benefit of OneLake shortcuts is enabling data access without physically duplicating data, reducing storage costs and simplifying governance.
Overview
Modern analytics solutions increasingly rely on the ability to process data as it is generated rather than waiting for scheduled batch loads. Streaming data enables organizations to react to events in near real time, support operational analytics, monitor systems, detect anomalies, and power intelligent applications.
In Microsoft Fabric, selecting the appropriate streaming engine is a critical design decision. The DP-700 exam expects candidates to understand the strengths, limitations, and ideal use cases of the various streaming technologies available in Fabric and to choose the most appropriate option based on business requirements.
This article explores the major streaming engines and technologies within Microsoft Fabric, how they compare, and when to use each one.
What Is Streaming Data?
Streaming data is data that arrives continuously from sources such as:
IoT devices
Sensors
Application logs
Clickstream events
Social media feeds
Financial transactions
Manufacturing equipment
Website activity
Real-time telemetry
Unlike batch processing, where data is collected and processed periodically, streaming systems process data as events arrive.
Common requirements include:
Low-latency processing
Real-time dashboards
Event detection
Alert generation
Continuous data ingestion
Streaming analytics
Streaming Technologies in Microsoft Fabric
The primary streaming technologies that data engineers encounter in Fabric include:
Technology | Primary Purpose
Eventstream | Real-time event ingestion and routing
Eventhouse | Real-time analytics using KQL
KQL Database | High-performance streaming analytics
Real-Time Intelligence | End-to-end real-time analytics platform
Spark Structured Streaming | Large-scale streaming transformations
Data Activator | Event-driven actions and alerts
Pipelines | Scheduled orchestration (not true streaming)
Understanding when to use each is essential for the exam.
Eventstream
What Is Eventstream?
Eventstream is Fabric’s low-code real-time ingestion service.
It captures, transforms, filters, and routes streaming events from multiple sources to multiple destinations.
Think of Eventstream as the ingestion layer of a streaming architecture.
Common Sources
Eventstream can ingest data from:
Azure Event Hubs
Kafka endpoints
Fabric events
IoT sources
Real-time telemetry systems
Custom event producers
Common Destinations
Eventstream can send data to:
Eventhouse
KQL Databases
Lakehouses
Custom destinations
Activator
Best Use Cases
Choose Eventstream when:
Events must be continuously ingested
Minimal coding is desired
Data routing is required
Multiple downstream consumers need the same events
Real-time analytics solutions are being built
Exam Tip
If a scenario focuses on ingesting and routing real-time events, Eventstream is usually the best answer.
Eventhouse
What Is Eventhouse?
Eventhouse is a Real-Time Intelligence component optimized for storing and analyzing streaming data.
It is built on Kusto technology and provides:
High ingestion rates
Near real-time analytics
Time-series analysis
Log analytics
Event exploration
Key Characteristics
Optimized for append-only data
Supports KQL
Fast query performance
Near real-time visibility
Massive scalability
Best Use Cases
Use Eventhouse when:
Large volumes of events arrive continuously
Log analytics is required
Telemetry analysis is needed
Operational dashboards require low latency
Examples:
Website activity monitoring
Application diagnostics
Manufacturing telemetry
Security monitoring
KQL Databases
What Is a KQL Database?
A KQL database is the storage and query engine behind many real-time solutions.
It uses Kusto Query Language (KQL) and is highly optimized for:
Streaming ingestion
Log analytics
Time-series data
Event correlation
Advantages
Extremely fast analytical queries
Handles high ingestion volumes
Rich time-series functions
Powerful aggregation capabilities
Best Use Cases
Choose KQL databases when:
Event analysis is the primary objective
Massive event volumes exist
Time-based analysis is required
Operational monitoring is needed
Spark Structured Streaming
What Is Structured Streaming?
Spark Structured Streaming enables continuous processing using Apache Spark.
Unlike Eventstream and Eventhouse, Spark streaming is developer-focused and code-driven.
Supported languages include:
PySpark
Scala
Spark SQL
Capabilities
Spark Structured Streaming supports:
Complex transformations
Data enrichment
Machine learning integration
Streaming joins
Stateful processing
Advanced business logic
Best Use Cases
Choose Spark Structured Streaming when:
Complex transformations are required
Large-scale processing is needed
Machine learning must be integrated
Events must be joined with reference datasets
Custom code is acceptable
Examples:
Fraud detection
Customer behavior analytics
Streaming feature engineering
Predictive maintenance
Exam Tip
If a scenario requires advanced coding and transformation logic, Spark Structured Streaming is often the correct answer.
Real-Time Intelligence
What Is Real-Time Intelligence?
Real-Time Intelligence is Fabric’s complete platform for handling real-time data workloads.
It combines:
Eventstream
Eventhouse
KQL Databases
Data Activator
Real-time dashboards
Benefits
Real-Time Intelligence provides:
End-to-end streaming architecture
Real-time monitoring
Event processing
Alerting
Operational analytics
Best Use Cases
Use Real-Time Intelligence when an organization needs:
Comprehensive streaming analytics
Operational dashboards
Real-time monitoring
Event-driven insights
Data Activator
What Is Data Activator?
Data Activator (named Activator in current Fabric documentation) monitors events and automatically takes actions when specified conditions occur.
Examples include:
Sending emails
Triggering workflows
Generating notifications
Creating alerts
Example
If machine temperature exceeds 90°C:
Generate an alert
Notify engineers
Open a support ticket
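The temperature example above follows a simple "condition, then actions" pattern, which can be sketched in plain Python. This is only an illustration of the rule logic, not how Data Activator is configured (that is done through its low-code interface). The `Rule` class, the action strings, and the machine name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Rule:
    """A condition plus the actions to run when the condition is met."""

    name: str
    condition: Callable[[dict], bool]
    actions: list[Callable[[dict], str]] = field(default_factory=list)

    def evaluate(self, event: dict) -> list[str]:
        # No actions fire unless the condition holds for this event.
        if not self.condition(event):
            return []
        return [action(event) for action in self.actions]


overheat = Rule(
    name="overheat",
    condition=lambda e: e["temp_c"] > 90,
    actions=[
        lambda e: f"ALERT: {e['machine']} at {e['temp_c']}°C",
        lambda e: f"NOTIFY engineers about {e['machine']}",
        lambda e: f"TICKET opened for {e['machine']}",
    ],
)

normal = overheat.evaluate({"machine": "press-1", "temp_c": 85})
hot = overheat.evaluate({"machine": "press-1", "temp_c": 93})
```

A reading of 85°C triggers nothing, while a reading of 93°C triggers all three actions in order.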
Best Use Cases
Choose Data Activator when:
Business users need alerts
Event-driven automation is required
Low-code monitoring is desired
Pipelines Are Not Streaming Engines
A common DP-700 exam trap is confusing pipelines with streaming solutions.
Pipelines:
Execute scheduled workloads
Orchestrate activities
Handle batch data movement
Pipelines do NOT provide continuous event processing.
Many exam questions focus on matching business requirements to the correct streaming technology.
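As a study aid, the matching exercise can be summarized as a lookup from requirement to component. The mapping below restates the guidance in this section. The requirement keys and the `recommend` function are hypothetical names chosen for this sketch, and real exam scenarios will need more judgment than a dictionary lookup.

```python
# Requirement keys are invented labels; the values restate this section's guidance.
RECOMMENDATIONS = {
    "ingest_and_route_events": "Eventstream",
    "store_and_query_events_with_kql": "Eventhouse / KQL database",
    "complex_code_driven_transformations": "Spark Structured Streaming",
    "alerts_and_automated_actions": "Data Activator",
    "end_to_end_streaming_platform": "Real-Time Intelligence",
    "scheduled_batch_movement": "Pipeline",
}


def recommend(requirement: str) -> str:
    """Return the component this section suggests for a given requirement."""
    try:
        return RECOMMENDATIONS[requirement]
    except KeyError:
        raise ValueError(f"Unknown requirement: {requirement!r}") from None


nightly_load = recommend("scheduled_batch_movement")
iot_routing = recommend("ingest_and_route_events")
```

Note that the only batch-oriented entry maps to a pipeline, and every continuous-processing requirement maps to a streaming component.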
Practice Exam Questions
Question 1
A company needs to ingest streaming telemetry from thousands of IoT devices and route the data to multiple downstream consumers.
Which Fabric component should be used?
A. Data Activator
B. Eventstream
C. Pipeline
D. Notebook
Answer: B
Explanation: Eventstream is specifically designed for real-time event ingestion and routing. Data Activator triggers actions when conditions are met, pipelines handle batch orchestration, and notebooks run code-driven processing rather than acting as a managed ingestion and routing layer.
Question 2
A solution requires advanced stream processing with custom Python code, joins against reference datasets, and machine learning inference.
Which technology should be selected?
A. Eventhouse
B. Spark Structured Streaming
C. KQL Database
D. Data Activator
Answer: B
Explanation: Spark Structured Streaming supports complex transformations, enrichment, stateful processing, and machine learning integration through PySpark.
Question 3
A team needs extremely fast analytics over continuously arriving log data and plans to use KQL.
Which storage engine is most appropriate?
A. KQL Database
B. Dataflow Gen2
C. Warehouse
D. Pipeline
Answer: A
Explanation: KQL databases are optimized for streaming ingestion, time-series analysis, and log analytics.
Question 4
A business user wants automatic notifications whenever inventory levels fall below a threshold.
Which Fabric component is best suited?
A. Eventstream
B. Notebook
C. Data Activator
D. Pipeline
Answer: C
Explanation: Data Activator monitors data conditions and triggers automated actions such as alerts and notifications.
Question 5
Which Fabric component is primarily responsible for routing real-time events to destinations?
A. Warehouse
B. Eventstream
C. Dataflow Gen2
D. Notebook
Answer: B
Explanation: Eventstream serves as the ingestion and routing layer for streaming architectures.
Question 6
A company requires an end-to-end platform for ingesting, storing, analyzing, and monitoring streaming events.
Which solution should be recommended?
A. Real-Time Intelligence
B. Dataflow Gen2
C. Warehouse
D. SQL Endpoint
Answer: A
Explanation: Real-Time Intelligence combines ingestion, analytics, monitoring, alerting, and visualization capabilities into a unified platform.
Question 7
Which technology is best suited for analyzing application logs with time-series queries and low-latency reporting?
A. Notebook
B. Warehouse
C. Eventhouse
D. Pipeline
Answer: C
Explanation: Eventhouse is optimized for streaming analytics, log analysis, and time-series workloads.
Question 8
A solution requires nightly ingestion of source data into a lakehouse.
Which option is most appropriate?
A. Eventstream
B. Data Activator
C. Eventhouse
D. Pipeline
Answer: D
Explanation: Nightly ingestion is a batch process and is best handled through scheduled pipeline execution.
Question 9
A data engineer needs to continuously enrich streaming events using lookup data and perform custom business-rule calculations.
Which technology should be selected?
A. Spark Structured Streaming
B. Data Activator
C. Eventstream
D. Dashboard
Answer: A
Explanation: Spark Structured Streaming provides advanced transformation capabilities including joins, aggregations, and custom code execution.
Question 10
Which statement best describes Eventhouse?
A. A workflow orchestration service for ETL processes
B. A low-code data preparation tool
C. A real-time analytics store optimized for event and telemetry data
D. A machine learning training environment
Answer: C
Explanation: Eventhouse is designed for high-scale event ingestion, real-time analytics, log analytics, and KQL-based querying of streaming data.