Tag: Dataflows Gen2

Identify and resolve Dataflow Gen2 errors (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Monitor and optimize an analytics solution (30–35%)
   --> Identify and resolve errors
      --> Identify and resolve Dataflow Gen2 errors


Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Overview

Dataflow Gen2 is a powerful data ingestion and transformation service in Microsoft Fabric that enables data engineers and analysts to perform Extract, Transform, and Load (ETL) operations using a low-code, visual interface based on Power Query. Dataflow Gen2 supports hundreds of data sources and can load data into destinations such as Lakehouses, Warehouses, KQL Databases, and other Fabric items.

Because Dataflow Gen2 is often used to prepare and transform data before it reaches analytical solutions, failures can have significant downstream impacts. For the DP-700 exam, candidates must understand how to identify, troubleshoot, and resolve Dataflow Gen2 errors, interpret refresh results, analyze execution details, and implement practices that reduce operational issues.


Understanding Dataflow Gen2 Execution

A Dataflow Gen2 execution consists of several stages:

  1. Source connection
  2. Data extraction
  3. Query transformation
  4. Data validation
  5. Data loading
  6. Refresh completion

Errors can occur at any stage of this process.

Unlike pipelines, where multiple activities execute sequentially, Dataflow Gen2 refreshes are generally focused on executing Power Query transformations and loading results into destination systems.

Successful troubleshooting requires identifying which stage failed.


Common Categories of Dataflow Gen2 Errors

Connection Errors

Connection failures occur when Dataflow Gen2 cannot access the source system.

Common causes include:

  • Invalid credentials
  • Expired passwords
  • Revoked access
  • Incorrect connection strings
  • Network issues
  • Unsupported authentication methods

Example:

A Dataflow Gen2 refresh connects to Azure SQL Database using a username whose password has expired.

Result:

The refresh fails before any data is retrieved.

Typical troubleshooting steps:

  • Verify credentials.
  • Test connectivity.
  • Reauthenticate the connection.
  • Confirm source availability.

Authentication and Authorization Errors

Authentication confirms identity.

Authorization confirms permissions.

Common examples:

  • Missing database permissions
  • Insufficient workspace permissions
  • Revoked service account access
  • Missing OneLake permissions

Example error:

“Access denied while attempting to access source.”

Resolution:

Verify user permissions and security roles on both source and destination systems.


Source Schema Changes

Schema drift occurs when source structures change unexpectedly.

Examples include:

  • Columns removed
  • Columns renamed
  • Data types modified
  • New columns added

Example:

A transformation references a column named CustomerStatus.

A source application update renames the column to Status.

Result:

The transformation step fails.

Resolution:

Update transformation logic to reflect the new schema.


Power Query Transformation Errors

Many Dataflow Gen2 failures occur during transformation processing.

Missing Column Errors

Example:

A step attempts to select a column that no longer exists.

Error:

“Column not found.”

Resolution:

Review source schema and update transformation steps.


Data Type Conversion Errors

Example:

A text value such as “ABC123” is converted to a Whole Number data type.

Result:

Conversion failure.

Resolution:

  • Validate source data.
  • Clean data before conversion.
  • Use error handling logic.

Invalid Formula Errors

Power Query transformations use M language behind the scenes.

Example:

A custom column contains an incorrect expression.

Result:

Refresh failure.

Resolution:

Review and correct the transformation formula.


Reference Errors

Queries may reference:

  • Other queries
  • Parameters
  • Functions

If referenced objects are deleted or renamed, failures occur.

Resolution:

Validate dependencies within the dataflow.


Destination Errors

Errors may occur after transformations complete successfully.

Lakehouse Write Failures

Examples:

  • Missing destination table
  • Permission issues
  • Storage limitations
  • Schema mismatch

Resolution:

Verify table structure and permissions.


Warehouse Loading Errors

Examples:

  • Unsupported data types
  • Primary key violations
  • Schema conflicts

Resolution:

Validate compatibility between transformed data and destination schema.


KQL Database Loading Errors

Examples:

  • Incorrect mappings
  • Unsupported formats
  • Data ingestion policy issues

Resolution:

Review destination configuration and ingestion mappings.


Refresh Failures

Dataflow Gen2 refresh operations generate execution information that should be examined whenever failures occur.

Refresh details often provide:

  • Failure stage
  • Error messages
  • Execution duration
  • Rows processed
  • Destination information

For troubleshooting, refresh history is usually the first place to investigate.


Monitoring Dataflow Gen2 Refreshes

Refresh History

Refresh history provides information about:

  • Successes
  • Failures
  • Start times
  • End times
  • Refresh duration

Engineers should review failed refreshes immediately after errors occur.


Detailed Error Messages

Refresh details often contain:

  • Error codes
  • Source system messages
  • Transformation failures
  • Destination loading issues

Always review the detailed error rather than relying solely on the refresh status.

Example:

Generic message:

“Refresh failed.”

Detailed message:

“Cannot convert value ‘N/A’ to Whole Number.”

The detailed error immediately identifies the issue.


Dataflow Monitoring in Fabric

Fabric monitoring tools can help identify:

  • Failed refreshes
  • Long-running refreshes
  • Capacity-related issues
  • Destination write failures

Monitoring trends over time can reveal recurring problems.


Troubleshooting Common Dataflow Gen2 Errors

Scenario 1: Source Authentication Failure

Symptoms:

  • Refresh fails immediately.
  • No records processed.

Investigation:

  1. Verify credentials.
  2. Test source access.
  3. Reauthenticate the connection.
  4. Confirm account permissions.

Resolution:

Update credentials or restore permissions.


Scenario 2: Missing Column Error

Symptoms:

  • Refresh fails during transformation.
  • Error references a missing field.

Investigation:

  1. Review source schema.
  2. Compare against transformation steps.
  3. Identify renamed or deleted columns.

Resolution:

Modify transformation logic.


Scenario 3: Data Type Conversion Failure

Symptoms:

  • Refresh stops during transformation.

Example:

A column contains:

  • 100
  • 200
  • ABC

The query attempts to convert the column to numeric values.

Resolution:

  • Clean invalid values.
  • Replace errors.
  • Filter problematic records.

Scenario 4: Destination Table Failure

Symptoms:

  • Transformations succeed.
  • Loading fails.

Investigation:

  1. Verify destination exists.
  2. Validate permissions.
  3. Review destination schema.

Resolution:

Correct schema or permission issues.


Scenario 5: Long-Running Refresh

Symptoms:

  • Refresh takes significantly longer than expected.

Possible causes:

  • Large data volume
  • Complex transformations
  • Source system bottlenecks
  • Capacity constraints

Resolution:

Optimize transformations and reduce unnecessary processing.


Using Query Diagnostics

Power Query provides diagnostic capabilities that can help identify:

  • Expensive transformation steps
  • Slow source queries
  • Bottlenecks during execution

Query diagnostics are particularly useful when refreshes succeed but perform poorly.

Areas to investigate include:

  • Excessive row operations
  • Repeated transformations
  • Non-folding queries

Query Folding and Error Prevention

What is Query Folding?

Query folding occurs when transformations are pushed back to the source system.

Instead of processing data inside Fabric:

  • The source executes filtering.
  • The source performs aggregations.
  • The source reduces result sets.

Benefits:

  • Faster refreshes
  • Reduced resource consumption
  • Lower failure risk

How Query Folding Affects Troubleshooting

Poor query folding can lead to:

  • Excessive processing
  • Memory consumption
  • Long refresh durations

When troubleshooting performance-related refresh issues, query folding should be evaluated.


Capacity-Related Errors

Dataflow Gen2 consumes Fabric compute resources.

Potential issues include:

  • High concurrency
  • Capacity throttling
  • Resource contention

Symptoms:

  • Slow refreshes
  • Intermittent failures
  • Unexpected cancellations

Monitoring Fabric capacity metrics can help identify these issues.


Best Practices for Preventing Dataflow Gen2 Errors

Validate Source Schemas

Regularly review source structures.

This helps detect schema drift before failures occur.


Use Defensive Transformations

Handle unexpected values through:

  • Null handling
  • Error replacement
  • Data validation

This improves refresh reliability.


Minimize Complex Transformations

Perform only necessary transformations.

Simpler dataflows are easier to maintain and troubleshoot.


Monitor Refresh History

Review failures and performance trends regularly.

Early detection reduces operational impact.


Test After Source Changes

Whenever source applications are modified:

  • Validate schemas.
  • Test refreshes.
  • Confirm transformation logic.

Optimize Query Folding

Push processing to source systems whenever possible.

This reduces execution times and resource consumption.


Document Dependencies

Track:

  • Source systems
  • Queries
  • Parameters
  • Destination tables

Documentation simplifies troubleshooting.


DP-700 Exam Tips

For the exam, remember:

  • Most Dataflow Gen2 troubleshooting begins with refresh history.
  • Source schema changes are a common cause of refresh failures.
  • Data type conversion errors frequently occur during transformations.
  • Destination errors can occur even when transformations succeed.
  • Query folding significantly affects performance and reliability.
  • Detailed error messages provide more value than high-level failure notifications.
  • Authentication and authorization issues are common root causes.
  • Capacity constraints can impact refresh performance.
  • Missing columns and renamed fields often cause transformation failures.
  • Monitoring refresh history is a core operational responsibility for Fabric data engineers.

Practice Exam Questions

Question 1

A Dataflow Gen2 refresh fails with the error:

“Column ‘CustomerType’ was not found.”

What is the most likely cause?

A. Destination table permissions were removed.

B. Fabric capacity is overloaded.

C. A source schema change occurred.

D. Query folding is disabled.

Correct Answer: C

Explanation:

The transformation references a column that no longer exists or has been renamed.

  • A would generate authorization errors.
  • C would typically cause performance or resource issues.
  • D affects performance rather than column existence.

Question 2

A Dataflow Gen2 refresh fails immediately after starting and processes zero rows.

Which issue is most likely?

A. Authentication failure

B. Query folding problem

C. Aggregation error

D. Destination schema mismatch

Correct Answer: A

Explanation:

Authentication issues generally prevent data retrieval from beginning.

  • B affects execution efficiency.
  • C occurs during transformation.
  • D typically appears later during loading.

Question 3

A data engineer wants to determine exactly why a Dataflow Gen2 refresh failed.

What should they examine first?

A. Fabric capacity metrics

B. Lakehouse storage statistics

C. Refresh history details

D. Workspace role assignments

Correct Answer: C

Explanation:

Refresh history contains detailed execution information and error messages.

  • A and B may be useful later.
  • D should only be investigated if permissions are suspected.

Question 4

A Dataflow Gen2 transformation attempts to convert a text value of “Unknown” into a Whole Number.

What type of error will occur?

A. Data type conversion error

B. Capacity error

C. Query dependency error

D. Authentication error

Correct Answer: A

Explanation:

Text values that cannot be converted to numeric formats generate conversion failures.

  • A, B, and C are unrelated to data conversion.

Question 5

Which capability pushes transformation processing back to the source system whenever possible?

A. Data validation

B. Query folding

C. Incremental refresh

D. Parameterization

Correct Answer: B

Explanation:

Query folding allows supported transformations to execute on the source system.

  • A validates data.
  • C limits refresh scope.
  • D provides dynamic values.

Question 6

Transformations complete successfully, but data cannot be written to the destination Warehouse.

Which category of issue is most likely?

A. Destination loading error

B. Source connectivity failure

C. Missing source table

D. Query folding issue

Correct Answer: A

Explanation:

If transformations finish successfully, the failure likely occurs during the loading phase.

  • B and C would occur earlier.
  • D typically affects performance.

Question 7

A data engineer wants to reduce failures caused by unexpected values in source data.

Which approach is best?

A. Increase capacity size

B. Disable query folding

C. Use defensive transformations and error handling

D. Reduce refresh frequency

Correct Answer: C

Explanation:

Handling nulls, invalid values, and conversion errors proactively improves reliability.

  • A may help performance but not data quality.
  • B often reduces efficiency.
  • D does not address the root cause.

Question 8

Which issue commonly results from source schema drift?

A. Missing or renamed columns

B. Capacity throttling

C. Refresh scheduling conflicts

D. Workspace role inheritance

Correct Answer: A

Explanation:

Schema drift occurs when source structures change unexpectedly.

  • B, C, and D are unrelated.

Question 9

A refresh suddenly begins taking twice as long as usual without failing.

Which tool would be most useful for identifying expensive transformation steps?

A. Workspace permissions

B. Query diagnostics

C. Tenant settings

D. Dataflow ownership settings

Correct Answer: B

Explanation:

Query diagnostics help identify bottlenecks and inefficient transformations.

  • A, C, and D do not provide execution analysis.

Question 10

Which best practice helps prevent Dataflow Gen2 failures after application updates modify source tables?

A. Disable refresh schedules temporarily

B. Increase concurrency limits

C. Recreate all dataflows monthly

D. Validate source schemas and test refreshes after changes

Correct Answer: D

Explanation:

Testing after source changes helps identify schema drift and compatibility issues before production failures occur.

  • A is reactive rather than preventive.
  • B does not address schema changes.
  • C is unnecessary and inefficient.

Go to the DP-700 Exam Prep Hub main page.

Configure Dataflows Gen2 workspace settings (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Implement and manage an analytics solution (30–35%)
--> Configure Microsoft Fabric workspace settings
--> Configure Dataflows Gen2 workspace settings


Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Dataflows Gen2 are a core component of Microsoft Fabric’s data ingestion and transformation capabilities. They provide a low-code/no-code method for extracting, transforming, and loading (ETL) data into Fabric destinations such as Lakehouses, Warehouses, and other analytics assets.

For the DP-700 exam, it is important to understand not only how to create Dataflows Gen2, but also how workspace settings affect their operation, governance, security, performance, and administration.

Workspace-level settings help administrators establish standards and controls for how Dataflows Gen2 are used within a Fabric environment. Understanding these settings enables data engineers to create scalable, maintainable, and governed data integration solutions.


What Are Dataflows Gen2?

Dataflows Gen2 are cloud-based data transformation solutions built on Power Query technology.

They allow users to:

  • Connect to data sources
  • Clean and transform data
  • Combine multiple datasets
  • Perform data quality operations
  • Load data into Fabric destinations

Unlike notebooks or Spark jobs that require coding skills, Dataflows Gen2 provide a graphical interface for data preparation.

Common use cases include:

  • Data ingestion
  • Data cleansing
  • Dimension table creation
  • Data enrichment
  • ETL and ELT workflows
  • Self-service data preparation

Dataflows Gen2 Architecture

A typical Dataflow Gen2 process consists of:

Data Source
Power Query Transformations
Dataflow Gen2
Destination

Possible destinations include:

  • Lakehouse Tables
  • Warehouse Tables
  • Azure SQL Database
  • Other supported Fabric destinations

Why Workspace Settings Matter

In small environments, Dataflows Gen2 can be managed individually.

However, in enterprise environments, administrators need centralized control over:

  • Dataflow creation
  • Dataflow execution
  • Compute usage
  • Security
  • Data destinations
  • Governance

Workspace settings help establish consistent behavior across all Dataflows Gen2 within a workspace.


Dataflows Gen2 Workspace Administration

Workspace administrators control who can:

  • Create Dataflows Gen2
  • Modify Dataflows Gen2
  • Schedule refreshes
  • Access source data
  • Access destinations

These permissions are governed through Fabric workspace roles.

Workspace RoleDataflow Capability
AdminFull control
MemberCreate and manage
ContributorCreate and edit
ViewerRead-only

DP-700 Exam Tip

Remember that Dataflows Gen2 do not have a separate security model.

They inherit Fabric workspace permissions.


Configure Dataflow Creation Permissions

Organizations often restrict who can create Dataflows Gen2.

Reasons include:

  • Governance requirements
  • Cost management
  • Data quality controls
  • Standardization

A common enterprise pattern is:

  • Contributors create Dataflows
  • Members manage Dataflows
  • Admins govern Dataflows

This prevents uncontrolled proliferation of ETL processes.


Configure Data Destinations

One of the most important Dataflows Gen2 settings involves destination configuration.

Supported destinations include:

Lakehouse

The most common destination.

Benefits:

  • Delta table storage
  • Integration with Spark
  • Medallion architecture support

Common usage:

  • Bronze layer ingestion
  • Silver layer transformation

Warehouse

Dataflows can load directly into Fabric Warehouses.

Benefits:

  • Structured analytics
  • SQL querying
  • Dimensional modeling support

Multiple Destinations

Dataflows Gen2 support loading data into multiple destinations from a single transformation pipeline.

Benefits include:

  • Reduced duplication of transformation logic
  • Improved maintainability
  • Consistent outputs

Configure Refresh Settings

Refresh configuration is one of the most frequently tested Dataflow topics.

Refresh settings determine:

  • When Dataflows execute
  • How often they run
  • How data is updated

Options include:

Manual Refresh

Execution occurs only when initiated by a user.

Best for:

  • Testing
  • Development
  • Small workloads

Scheduled Refresh

Execution occurs automatically based on a defined schedule.

Examples:

  • Hourly
  • Daily
  • Weekly

Most production Dataflows use scheduled refresh.


Pipeline-Orchestrated Refresh

Dataflows can be executed through Fabric Data Factory pipelines.

Benefits:

  • End-to-end orchestration
  • Dependency management
  • Complex workflow support

This is commonly used in enterprise ETL solutions.


Refresh Failure Notifications

Administrators can configure monitoring and notifications for refresh failures.

Benefits:

  • Faster troubleshooting
  • Improved reliability
  • Reduced downtime

Monitoring is particularly important when Dataflows support business-critical reporting systems.


Configure Data Source Credentials

Dataflows require access credentials for source systems.

Supported authentication methods vary by connector and may include:

  • Organizational account
  • OAuth
  • Basic authentication
  • Service principals
  • API keys

Workspace administrators often establish governance policies around credential management.

Best Practice

Use service accounts or service principals whenever possible for production workloads.

This avoids refresh failures caused by employee account changes.


Configure Gateway Usage

Some data sources reside inside private corporate networks.

Examples:

  • On-premises SQL Server
  • Oracle databases
  • File shares

In these scenarios, Dataflows Gen2 may require an On-Premises Data Gateway.

Gateway settings determine:

  • Connectivity
  • Authentication
  • Data access paths

A common DP-700 scenario involves selecting a gateway for on-premises data access.


Dataflow Compute and Performance Considerations

Dataflows Gen2 execute within Fabric-managed infrastructure.

Administrators should understand factors that impact performance:

Data Volume

Larger datasets increase:

  • Refresh duration
  • Resource consumption

Transformation Complexity

Operations such as:

  • Merges
  • Joins
  • Group By
  • Aggregations

increase processing requirements.


Number of Refreshes

Frequent refresh schedules can consume additional capacity resources.

Administrators should balance:

  • Data freshness
  • Capacity utilization

Dataflow Lineage and Impact Analysis

Fabric automatically captures lineage information.

Administrators can view:

Source
Dataflow Gen2
Lakehouse
Semantic Model
Report

Benefits include:

  • Impact analysis
  • Dependency tracking
  • Governance visibility

Lineage is an important governance feature frequently associated with Dataflows.


Dataflow Monitoring

Workspace administrators can monitor:

  • Refresh history
  • Success rates
  • Failure messages
  • Duration metrics

Monitoring tools include:

  • Refresh history
  • Monitoring Hub
  • Fabric capacity metrics

Common Troubleshooting Areas

  • Credential failures
  • Gateway connectivity issues
  • Schema changes
  • Destination write failures
  • Capacity limitations

Dataflow Governance Best Practices

Standardize Naming Conventions

Example:

DF_Bronze_Customer_Ingestion
DF_Silver_Sales_Transform
DF_Gold_Product_Aggregation

Consistent naming improves maintainability.


Use Scheduled Refresh Sparingly

Avoid unnecessary refresh frequency.

Example:

Do not refresh every 15 minutes if daily updates are sufficient.


Implement Service Principals

Reduce dependency on individual user accounts.


Leverage Lineage Views

Monitor downstream dependencies before making changes.


Align with Medallion Architecture

Use Dataflows strategically within:

  • Bronze Layer
  • Silver Layer
  • Gold Layer

Common DP-700 Exam Scenarios

Scenario 1

A Dataflow must load data from an on-premises SQL Server.

Solution:

Configure an On-Premises Data Gateway.


Scenario 2

A Dataflow should execute only after a source ingestion process completes.

Solution:

Use a Data Factory pipeline to orchestrate execution.


Scenario 3

A Dataflow should load transformed data into a Lakehouse for downstream Spark processing.

Solution:

Configure the Lakehouse as the destination.


DP-700 Exam Focus Areas

You should understand:

✓ Dataflows Gen2 architecture

✓ Workspace permissions

✓ Dataflow creation governance

✓ Data destinations

✓ Refresh scheduling

✓ Pipeline orchestration

✓ Credential management

✓ Gateway configuration

✓ Monitoring and troubleshooting

✓ Lineage and impact analysis

✓ Performance considerations


10 Practice Exam Questions

Question 1

Which technology provides the transformation engine used by Dataflows Gen2?

A. Power Query

B. Apache Spark

C. Kusto Query Language (KQL)

D. T-SQL

Answer: A

Explanation

Dataflows Gen2 use Power Query as their transformation engine, providing a low-code interface for data preparation.


Question 2

A Dataflow Gen2 needs to access an on-premises SQL Server database.

What must be configured?

A. Eventstream

B. Data Activator

C. On-Premises Data Gateway

D. OneLake Shortcut

Answer: C

Explanation

An On-Premises Data Gateway enables Fabric services to securely access data sources located inside private networks.


Question 3

Which destination is most commonly used for storing Dataflow Gen2 outputs within a medallion architecture?

A. Semantic Model

B. Dashboard

C. Notebook

D. Lakehouse

Answer: D

Explanation

Lakehouses are commonly used as Bronze, Silver, and Gold layers within Fabric medallion architectures.


Question 4

What is the primary advantage of scheduled refresh?

A. Eliminates authentication requirements

B. Automatically updates data without manual intervention

C. Increases storage capacity

D. Creates backup copies of source systems

Answer: B

Explanation

Scheduled refresh ensures that data remains current without requiring users to manually run the Dataflow.


Question 5

Which Fabric feature can orchestrate Dataflow Gen2 execution as part of a larger workflow?

A. Data Factory Pipeline

B. Lakehouse Explorer

C. Monitoring Hub

D. OneLake File Explorer

Answer: A

Explanation

Data Factory pipelines provide orchestration, dependency management, and scheduling capabilities.


Question 6

What information can lineage views provide?

A. Network bandwidth consumption

B. Spark executor utilization

C. Upstream and downstream dependencies

D. Gateway installation logs

Answer: C

Explanation

Lineage views show how data moves between sources, Dataflows, Lakehouses, semantic models, and reports.


Question 7

Which workspace role has full administrative control over Dataflows Gen2?

A. Viewer

B. Contributor

C. Member

D. Admin

Answer: D

Explanation

Workspace Admins have complete control over all workspace items, including Dataflows Gen2.


Question 8

A company wants to minimize production refresh failures caused by employee account changes.

What is the recommended approach?

A. Increase refresh frequency

B. Use service principals or service accounts

C. Disable scheduled refresh

D. Use Viewer permissions

Answer: B

Explanation

Service principals provide stable authentication that is not tied to individual users.


Question 9

Which factor is most likely to increase Dataflow refresh duration?

A. Smaller datasets

B. Reduced transformations

C. Complex joins and aggregations

D. Fewer destination tables

Answer: C

Explanation

Complex transformation logic increases processing requirements and refresh times.


Question 10

What is the primary purpose of Dataflow monitoring?

A. Create semantic models

B. Manage workspace domains

C. Configure Spark runtimes

D. Identify refresh failures and performance issues

Answer: D

Explanation

Monitoring helps administrators detect failures, troubleshoot issues, and optimize performance.


Final Exam Tip

For DP-700, Dataflows Gen2 questions typically focus on data ingestion, destinations, refresh management, gateways, orchestration, and governance. When evaluating exam scenarios, remember that Dataflows Gen2 are designed to provide a low-code ETL experience using Power Query, while Fabric Pipelines provide orchestration and Lakehouses commonly serve as the destination within modern medallion architectures.


Go to the DP-700 Exam Prep Hub main page.

Choose Between Dataflows Gen2, Notebooks, KQL, and T-SQL for data transformation (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Ingest and transform data (30–35%)
   --> Ingest and transform batch data
      --> Choose Between Dataflows Gen2, Notebooks, KQL, and T-SQL for data transformation


Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Microsoft Fabric provides multiple technologies for transforming data. One of the most common challenges for a Data Engineer is determining which transformation tool is best suited for a specific business requirement.

The DP-700 exam frequently tests your ability to select the appropriate transformation technology based on:

  • Data volume
  • Data complexity
  • Required programming skills
  • Data source type
  • Performance requirements
  • Real-time versus batch processing needs
  • User expertise
  • Maintainability

The four most important transformation technologies covered in the exam are:

  • Dataflows Gen2
  • Notebooks
  • KQL
  • T-SQL

Although all four can transform data, they are optimized for different workloads and use cases.

Understanding their strengths, limitations, and ideal scenarios is critical for success on the DP-700 exam.


Overview of Transformation Technologies

TechnologyPrimary PurposeBest For
Dataflows Gen2Low-code ETLBusiness-friendly transformations
NotebooksAdvanced engineering and Spark processingLarge-scale data engineering
T-SQLRelational transformationsWarehouses and SQL workloads
KQLReal-time analytics and telemetry processingLogs and streaming data

Dataflows Gen2

What Are Dataflows Gen2?

Dataflows Gen2 are low-code data transformation tools within Microsoft Fabric that use Power Query.

They allow users to:

  • Connect to data sources
  • Clean data
  • Transform data
  • Load data into Fabric destinations

without writing significant amounts of code.


Transformation Engine

Dataflows Gen2 use:

  • Power Query
  • M Language (behind the scenes)

Most transformations are performed through a graphical interface.


Typical Transformations

Examples include:

  • Renaming columns
  • Removing duplicates
  • Filtering rows
  • Merging datasets
  • Splitting columns
  • Data type conversions
  • Calculated columns

When to Use Dataflows Gen2

Choose Dataflows Gen2 when:

  • Low-code development is desired
  • Data volumes are moderate
  • Business analysts participate in development
  • Transformations are relatively straightforward
  • Self-service data preparation is required

Examples:

  • Preparing Excel data
  • Cleaning CSV files
  • Combining multiple business datasets
  • Standard ETL processes

Advantages

Low-Code Experience

Minimal coding required.

Large Connector Library

Supports numerous source systems.

Easy Maintenance

Visual transformation steps are easier to understand.

Integration with Fabric

Loads directly into:

  • Lakehouses
  • Warehouses
  • Other Fabric destinations

Limitations

Less Flexible

Complex logic may become difficult.

Not Ideal for Very Large Data Volumes

Spark-based solutions often scale better.

Limited Advanced Programming

Compared to notebooks.


Notebooks

What Are Notebooks?

Notebooks are code-based development environments that support:

  • PySpark
  • Python
  • Scala
  • Spark SQL
  • R

within Microsoft Fabric.


Transformation Engine

Notebooks execute on Spark clusters.

This enables:

  • Distributed processing
  • Parallel execution
  • Large-scale transformations

Typical Transformations

Examples:

  • Complex joins
  • Data enrichment
  • Machine learning preparation
  • Feature engineering
  • Data quality validation
  • Custom business logic

When to Use Notebooks

Choose notebooks when:

  • Large data volumes exist
  • Spark processing is required
  • Advanced transformations are needed
  • Custom programming is necessary
  • Machine learning integration is planned

Examples:

  • Processing billions of records
  • Data science workflows
  • Medallion architecture pipelines
  • Complex transformations

Advantages

Massive Scalability

Handles large datasets efficiently.

Flexible Programming

Supports multiple languages.

Machine Learning Integration

Works with Spark ML libraries.

Advanced Data Engineering

Ideal for enterprise-scale pipelines.


Limitations

Requires Coding Skills

Less accessible for business users.

More Complex Development

Compared to Dataflows Gen2.


T-SQL

What Is T-SQL?

T-SQL (Transact-SQL) is Microsoft’s extension of SQL.

Fabric Warehouses and SQL endpoints support T-SQL for:

  • Querying
  • Transforming
  • Managing relational data

Transformation Techniques

Common operations include:

SELECT
JOIN
GROUP BY
CASE
CTE
MERGE
WINDOW FUNCTIONS

When to Use T-SQL

Choose T-SQL when:

  • Data resides in a Warehouse
  • Relational transformations are required
  • SQL expertise already exists
  • Dimensional models are being built

Examples:

  • Fact table loading
  • Dimension updates
  • Data warehouse ETL
  • Reporting data preparation

Advantages

Familiar Language

Widely used by data professionals.

Excellent Relational Processing

Optimized for structured data.

Strong Performance

Particularly for warehouse workloads.

Easy Integration

Works naturally with BI tools.


Limitations

Less Suitable for Unstructured Data

Not ideal for files and raw data.

Limited Distributed Processing

Compared to Spark.


KQL

What Is KQL?

Kusto Query Language (KQL) is designed for:

  • Log analytics
  • Telemetry analysis
  • Real-time data processing
  • Event analytics

KQL is commonly used in:

  • KQL Databases
  • Eventhouse
  • Real-Time Intelligence

Typical Transformations

Examples include:

  • Filtering events
  • Aggregations
  • Pattern detection
  • Time-series analysis
  • Stream transformations

When to Use KQL

Choose KQL when:

  • Working with telemetry data
  • Processing logs
  • Analyzing streaming events
  • Building real-time dashboards

Examples:

  • Sensor monitoring
  • Application logs
  • Security analytics
  • Operational monitoring

Advantages

Optimized for Time-Series Data

Excellent for event-driven workloads.

Fast Query Performance

Handles large event volumes efficiently.

Real-Time Analytics

Supports low-latency analysis.


Limitations

Not a General ETL Tool

Less suitable for traditional batch ETL.

Not Designed for Dimensional Modeling

Warehouses are generally better for reporting models.


Comparing Transformation Technologies

RequirementDataflows Gen2NotebooksT-SQLKQL
Low-Code DevelopmentExcellentPoorModerateModerate
Large-Scale ProcessingModerateExcellentGoodExcellent
Relational TransformationsModerateGoodExcellentLimited
Streaming AnalyticsLimitedModeratePoorExcellent
Machine Learning SupportPoorExcellentPoorLimited
Telemetry AnalyticsPoorModeratePoorExcellent
Business User FriendlyExcellentPoorModerateModerate
Advanced ProgrammingLimitedExcellentModerateLimited

Decision Framework

Choose Dataflows Gen2 When:

  • Low-code ETL is preferred
  • Business users are involved
  • Data volumes are moderate
  • Transformations are straightforward

Choose Notebooks When:

  • Spark processing is required
  • Data volumes are large
  • Complex transformations exist
  • Machine learning is involved

Choose T-SQL When:

  • Working with a Warehouse
  • Building dimensional models
  • SQL skills are available
  • Data is highly structured

Choose KQL When:

  • Processing logs
  • Analyzing telemetry
  • Supporting streaming analytics
  • Building operational monitoring solutions

Common DP-700 Scenario Questions

Scenario 1

A business analyst needs to combine Excel spreadsheets and remove duplicate rows using a visual interface.

Best choice:

Dataflows Gen2


Scenario 2

A data engineer must transform billions of records stored in a Lakehouse.

Best choice:

Notebook


Scenario 3

A warehouse team must populate fact and dimension tables.

Best choice:

T-SQL


Scenario 4

An operations team analyzes millions of application log events each hour.

Best choice:

KQL


Scenario 5

A machine learning team requires custom Python transformations.

Best choice:

Notebook


Exam Tips

Many DP-700 questions are not asking what can perform a transformation, but what should perform the transformation.

Remember these associations:

RequirementBest Choice
Visual ETLDataflows Gen2
Spark processingNotebook
Data warehouse transformationsT-SQL
Telemetry and logsKQL
Machine learning preparationNotebook
Self-service data preparationDataflows Gen2
Streaming analyticsKQL

Practice Exam Questions

Question 1

A business analyst needs to cleanse CSV files using a graphical interface with minimal coding. Which transformation technology should be used?

A. T-SQL

B. Notebook

C. KQL

D. Dataflows Gen2

Answer: D

Explanation

Dataflows Gen2 provide a low-code, visual interface that is ideal for business users and simple ETL processes.


Question 2

A data engineer must process several billion records stored in a Lakehouse using distributed computing.

Which option should be selected?

A. Notebook

B. Dataflows Gen2

C. T-SQL

D. KQL

Answer: A

Explanation

Notebooks leverage Spark for distributed processing and are designed for large-scale data transformations.


Question 3

Which technology is specifically optimized for transforming and analyzing telemetry and log data?

A. Dataflows Gen2

B. Notebook

C. KQL

D. T-SQL

Answer: C

Explanation

KQL is designed for log analytics, telemetry processing, and real-time operational analytics.


Question 4

A team is loading dimension and fact tables within a Fabric Warehouse.

Which transformation technology is most appropriate?

A. Notebook

B. Dataflows Gen2

C. KQL

D. T-SQL

Answer: D

Explanation

T-SQL is the preferred technology for relational transformations in Fabric Warehouses.


Question 5

A company requires machine learning feature engineering using Python libraries.

Which technology should be selected?

A. Notebook

B. Dataflows Gen2

C. T-SQL

D. KQL

Answer: A

Explanation

Notebooks support Python, Spark, and machine learning frameworks, making them ideal for feature engineering.


Question 6

Which technology relies primarily on Power Query transformations?

A. Notebook

B. Dataflows Gen2

C. T-SQL

D. KQL

Answer: B

Explanation

Dataflows Gen2 use Power Query and the M language behind the scenes for data transformations.


Question 7

An operations team needs to perform real-time aggregations on streaming sensor data.

Which option should be used?

A. Dataflows Gen2

B. Notebook

C. KQL

D. T-SQL

Answer: C

Explanation

KQL is optimized for real-time event processing and telemetry analysis.


Question 8

A data engineer needs maximum flexibility to implement custom business logic across multiple data sources.

Which technology is most appropriate?

A. KQL

B. Dataflows Gen2

C. T-SQL

D. Notebook

Answer: D

Explanation

Notebooks provide the highest degree of customization through programming languages such as Python and PySpark.


Question 9

A team already has extensive SQL expertise and needs to transform highly structured relational data in a Warehouse.

Which option is best?

A. Notebook

B. T-SQL

C. Dataflows Gen2

D. KQL

Answer: B

Explanation

T-SQL is optimized for relational transformations and leverages existing SQL skills.


Question 10

Which technology is generally the most business-user-friendly option for creating batch data transformation processes?

A. Notebook

B. KQL

C. T-SQL

D. Dataflows Gen2

Answer: D

Explanation

Dataflows Gen2 provide a visual, low-code experience that is easier for business users and citizen developers than code-based solutions.


DP-700 Exam Summary

When deciding between transformation technologies, focus on the primary workload:

  • Dataflows Gen2 → Low-code ETL and self-service data preparation
  • Notebooks → Spark, large-scale processing, advanced engineering, and machine learning
  • T-SQL → Relational transformations and warehouse development
  • KQL → Telemetry, logs, time-series analytics, and real-time event processing

A common DP-700 exam strategy is to identify the keywords in the scenario:

  • Visual interface → Dataflows Gen2
  • Billions of rows / Spark → Notebook
  • Warehouse / dimensional model → T-SQL
  • Logs / telemetry / real-time analytics → KQL

These keywords often point directly to the correct answer.


Go to the DP-700 Exam Prep Hub main page.

Choose between Dataflow Gen2, a pipeline and a notebook (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Implement and manage an analytics solution (30–35%)
   --> Orchestrate processes
      --> Choose between Dataflow Gen2, a pipeline and a notebook


Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

One of the most important skills for a Microsoft Fabric Data Engineer is selecting the appropriate tool for a particular task. Microsoft Fabric provides several powerful technologies for data ingestion, transformation, orchestration, and automation. Three of the most commonly used are:

  • Dataflow Gen2
  • Data Pipelines
  • Notebooks

Although these tools often work together, they serve different purposes. Choosing the wrong tool can lead to unnecessary complexity, reduced maintainability, and increased development effort.

For the DP-700 exam, you should understand:

  • The primary purpose of each tool
  • When to use each tool
  • Strengths and limitations
  • Common design patterns
  • How these tools interact with one another

A significant number of DP-700 scenario questions are likely to test your ability to determine which Fabric component best fits a given business requirement.


Understanding the Three Tools

Before comparing them, it is important to understand their primary functions.

ToolPrimary Purpose
Dataflow Gen2Low-code data ingestion and transformation
Data PipelineWorkflow orchestration and automation
NotebookCode-based data processing and advanced analytics

A useful way to remember this is:

Dataflow Gen2 = Transform Data
Pipeline = Orchestrate Processes
Notebook = Execute Code

What Is Dataflow Gen2?

Dataflow Gen2 is a low-code/no-code data integration and transformation tool built on Power Query technology.

It allows users to:

  • Connect to data sources
  • Clean data
  • Transform data
  • Merge datasets
  • Filter records
  • Load data into Fabric destinations

Dataflow Gen2 is designed for users who prefer visual development rather than coding.


Dataflow Gen2 Architecture

Data Source
Power Query Transformations
Dataflow Gen2
Lakehouse / Warehouse

The transformation logic is built using a graphical interface.


Common Dataflow Gen2 Tasks

Examples include:

  • Removing duplicates
  • Filtering rows
  • Renaming columns
  • Data cleansing
  • Combining files
  • Joining datasets
  • Type conversions
  • Data standardization

These activities require little or no programming.


Advantages of Dataflow Gen2

Low-Code Development

Business analysts and citizen developers can build transformations without extensive coding knowledge.

Reusable Transformations

Transformations can be reused across multiple projects.

Familiar Power Query Experience

Users familiar with Power BI often adapt quickly.

Large Connector Library

Supports many cloud and on-premises data sources.


Limitations of Dataflow Gen2

Dataflow Gen2 is not ideal for:

  • Complex machine learning workloads
  • Advanced Spark processing
  • Custom Python development
  • Large-scale distributed programming

For those scenarios, notebooks are often more appropriate.


What Is a Data Pipeline?

A Data Pipeline is an orchestration tool.

Its primary purpose is not data transformation but rather coordinating and automating activities.

Think of a pipeline as a workflow engine.


Pipeline Architecture

Activity 1
Activity 2
Activity 3
Activity 4

Pipelines determine:

  • What runs
  • When it runs
  • In what order it runs
  • Under what conditions it runs

Common Pipeline Activities

Examples include:

  • Copy Data
  • Execute Notebook
  • Execute Dataflow
  • Stored Procedures
  • Web Activities
  • Conditional Logic
  • Scheduling Jobs

Pipelines coordinate these activities into a complete workflow.


Advantages of Pipelines

Workflow Automation

Automates complex end-to-end processes.

Scheduling

Supports recurring execution schedules.

Dependency Management

Controls execution order.

Error Handling

Supports retries and failure paths.

Integration

Can orchestrate multiple Fabric components.


Limitations of Pipelines

Pipelines are not intended for:

  • Complex data transformations
  • Interactive analysis
  • Advanced programming

Pipelines orchestrate work; they do not replace transformation tools.


What Is a Notebook?

A notebook is a code-based environment that allows developers and data engineers to execute code directly against Fabric data.

Notebooks commonly use:

  • Python
  • PySpark
  • Spark SQL
  • Scala (where supported)

They run on Spark compute engines.


Notebook Architecture

Data Source
Spark Processing
Notebook
Lakehouse / Warehouse

Notebooks provide maximum flexibility and control.


Common Notebook Tasks

Examples include:

  • PySpark transformations
  • Data engineering workflows
  • Machine learning preparation
  • Advanced data cleansing
  • Streaming data processing
  • Delta table optimization
  • Custom business logic

Advantages of Notebooks

Full Programming Flexibility

Developers can implement virtually any logic.

Spark Integration

Supports distributed processing.

Advanced Transformations

Suitable for highly complex data engineering workloads.

Machine Learning Support

Works well with AI and ML frameworks.

Scalability

Can process very large datasets.


Limitations of Notebooks

Coding Required

Requires programming knowledge.

Higher Complexity

Can be more difficult to maintain.

Less Accessible

Business users typically prefer Dataflow Gen2.


Side-by-Side Comparison

FeatureDataflow Gen2PipelineNotebook
Primary PurposeData TransformationOrchestrationAdvanced Processing
Coding RequiredMinimalMinimalExtensive
SchedulingLimitedYesUsually via Pipeline
Spark SupportNo Direct CodingNoYes
Visual InterfaceYesYesNo
Advanced LogicLimitedLimitedExtensive
Best for ETLYesCoordinates ETLYes
Machine LearningNoNoYes

When to Choose Dataflow Gen2

Choose Dataflow Gen2 when:

  • Data cleansing is required
  • Users prefer visual tools
  • Power Query transformations are sufficient
  • Business analysts are building solutions
  • Coding should be minimized

Example

Requirement:

Import CSV files
Remove duplicates
Rename columns
Load into Lakehouse

Best Choice:

Dataflow Gen2


When to Choose a Pipeline

Choose a Pipeline when:

  • Multiple tasks must be coordinated
  • Processes require scheduling
  • Activities depend on one another
  • Workflows need monitoring
  • Automation is required

Example

Requirement:

Run Dataflow
Run Notebook
Load Warehouse
Send Notification

Best Choice:

Pipeline


When to Choose a Notebook

Choose a Notebook when:

  • Complex transformations are required
  • PySpark processing is needed
  • Machine learning is involved
  • Custom code is necessary
  • Large-scale distributed processing is required

Example

Requirement:

Apply custom PySpark transformation
Process 10 TB dataset
Optimize Delta tables

Best Choice:

Notebook


Common Real-World Pattern

In many Fabric environments, all three tools are used together.

Example:

Dataflow Gen2
Pipeline
Notebook
Warehouse

Workflow:

  1. Dataflow Gen2 cleans source files.
  2. Pipeline orchestrates execution.
  3. Notebook performs advanced transformations.
  4. Results load into a Warehouse.

This layered approach is common in enterprise solutions.


Decision Framework for DP-700

When reading exam questions, ask:

Is the requirement primarily data transformation?

Choose:

Dataflow Gen2


Is the requirement workflow orchestration?

Choose:

Pipeline


Is the requirement advanced coding or Spark processing?

Choose:

Notebook


Common Exam Traps

Trap #1

Question mentions:

  • Scheduling
  • Dependencies
  • Automation

Correct answer:

Pipeline

Even if transformations are involved.


Trap #2

Question mentions:

  • PySpark
  • Python
  • Machine Learning
  • Spark

Correct answer:

Notebook


Trap #3

Question mentions:

  • Power Query
  • Visual transformation
  • No-code development

Correct answer:

Dataflow Gen2


DP-700 Exam Focus Areas

You should understand:

✓ Purpose of Dataflow Gen2

✓ Purpose of Data Pipelines

✓ Purpose of Notebooks

✓ Visual versus code-based development

✓ Workflow orchestration

✓ Spark processing

✓ Power Query transformations

✓ Scheduling and automation

✓ Common integration patterns

✓ Appropriate tool selection for business scenarios


Practice Exam Questions

Question 1

A business analyst needs to import CSV files, remove duplicate rows, and standardize column names using a visual interface with minimal coding.

Which Fabric component should be used?

A. Notebook

B. Data Pipeline

C. Dataflow Gen2

D. Deployment Pipeline

Answer: C

Explanation

Dataflow Gen2 is designed for low-code data ingestion and transformation using Power Query.


Question 2

A data engineering solution must execute the following sequence:

  1. Run a Dataflow Gen2 process
  2. Execute a Notebook
  3. Load a Warehouse
  4. Send a notification email

Which Fabric component should coordinate this workflow?

A. Lakehouse

B. Data Pipeline

C. Notebook

D. Semantic Model

Answer: B

Explanation

Pipelines are designed to orchestrate and automate multiple activities and dependencies.


Question 3

A team needs to perform complex PySpark transformations against several terabytes of data.

Which Fabric component is most appropriate?

A. Dataflow Gen2

B. Pipeline

C. Dashboard

D. Notebook

Answer: D

Explanation

Notebooks provide Spark-based programming environments suitable for large-scale transformations.


Question 4

Which Fabric component is primarily responsible for workflow orchestration?

A. Dataflow Gen2

B. Lakehouse

C. Warehouse

D. Data Pipeline

Answer: D

Explanation

Data Pipelines coordinate and automate execution of multiple activities.


Question 5

A solution requires users with no programming experience to create reusable data cleansing transformations.

Which component should be selected?

A. Notebook

B. Dataflow Gen2

C. Pipeline

D. Spark Job Definition

Answer: B

Explanation

Dataflow Gen2 provides a low-code visual environment for data preparation.


Question 6

Which Fabric component offers the greatest flexibility for implementing custom business logic?

A. Dataflow Gen2

B. Warehouse

C. Notebook

D. Data Pipeline

Answer: C

Explanation

Notebooks support Python, PySpark, and Spark SQL, allowing virtually unlimited customization.


Question 7

A company wants to schedule nightly execution of several notebooks and monitor failures.

Which Fabric component should be used?

A. Dataflow Gen2

B. Notebook

C. Lakehouse

D. Data Pipeline

Answer: D

Explanation

Pipelines provide scheduling, monitoring, dependencies, and failure handling.


Question 8

Which statement best describes Dataflow Gen2?

A. It is primarily a workflow orchestration tool.

B. It is a low-code data transformation solution based on Power Query.

C. It is designed for machine learning development.

D. It replaces Spark notebooks.

Answer: B

Explanation

Dataflow Gen2 is optimized for low-code ETL and data transformation workloads.


Question 9

A data engineer must optimize Delta tables using Spark commands and Python code.

Which Fabric component should be used?

A. Notebook

B. Data Pipeline

C. Dataflow Gen2

D. Warehouse

Answer: A

Explanation

Notebook environments provide direct access to Spark capabilities and custom code execution.


Question 10

Which scenario is the best fit for a Data Pipeline?

A. Creating Power Query transformations

B. Applying machine learning algorithms

C. Coordinating multiple Fabric activities into an automated workflow

D. Writing custom PySpark code

Answer: C

Explanation

Pipelines are specifically designed for orchestration, automation, scheduling, dependency management, and monitoring.


Exam Tip

A useful DP-700 memory aid is:

RequirementBest Tool
Visual ETL and data preparationDataflow Gen2
Scheduling and orchestrationData Pipeline
Spark, Python, and advanced processingNotebook

When a scenario focuses on automation and coordinating activities, think Pipeline.

When it focuses on Power Query transformations, think Dataflow Gen2.

When it focuses on PySpark, Spark SQL, machine learning, or custom code, think Notebook.


Go to the DP-700 Exam Prep Hub main page.