
Implement a Star Schema for a Lakehouse or Warehouse

This post is part of the DP-600: Implementing Analytics Solutions Using Microsoft Fabric Exam Prep Hub. This topic falls under these sections:
Prepare data
--> Transform data
--> Implement a star schema for a lakehouse or warehouse

Designing and implementing an effective schema is foundational to efficient analytics. In Microsoft Fabric, structuring your data into a star schema dramatically improves query performance, simplifies reporting, and aligns with best practices for BI workloads.

This article explains what a star schema is, why it matters in Fabric, and how to implement it in a lakehouse or data warehouse.

What Is a Star Schema?

A star schema is a dimensional modeling technique that organizes data into two primary types of tables:

  • Fact tables: Contain measurable, quantitative data (metrics, transactions, events).
  • Dimension tables: Contain descriptive attributes (e.g., customer info, product details, dates).

Star schemas get their name because the design resembles a star—a central fact table linked to multiple dimension tables.

Why Use a Star Schema?

A star schema offers multiple advantages for analytical workloads:

  • Improved query performance: Queries are simplified and optimized due to straightforward joins.
  • Simpler reporting: BI tools like Power BI map naturally to star schemas.
  • Aggregations and drill-downs: Dimension tables support filtering and hierarchy reporting.
  • Better scalability: Optimized for large datasets and parallel processing.

In Fabric, both lakehouses and warehouses support star schema implementations, depending on workload and user needs.

Core Components of a Star Schema

1. Fact Tables

Fact tables store the numeric measurements of business processes.
Common characteristics:

  • Contain foreign keys linking to dimensions
  • Often very large (many rows) but relatively narrow (keys and measures)
  • Used for aggregations (SUM, COUNT, AVG, etc.)

Examples:
Sales transactions, inventory movement, website events

2. Dimension Tables

Dimension tables describe contextual attributes.
Common characteristics:

  • Contain descriptive fields
  • Usually smaller than fact tables
  • Often used for filtering/grouping

Examples:
Customer, product, date, geography

Implementing a Star Schema in a Lakehouse

Lakehouses in Fabric store tables in Delta format and can be queried with Spark (notebooks and Spark SQL) as well as through the T-SQL-based SQL analytics endpoint.

Steps to Implement:

  1. Ingest raw data into your lakehouse (as files or staging tables).
  2. Transform data:
    • Cleanse and conform fields
    • Derive business keys
  3. Create dimension tables:
    • Deduplicate
    • Add descriptive attributes
  4. Create fact tables:
    • Join transactional data to dimension keys
    • Store numeric measures
  5. Optimize:
    • Partition and Z-ORDER for performance
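
To make steps 3 and 4 more concrete, here is a minimal sketch of the deduplicate-and-key pattern expressed in Power Query (M), as it might appear in a Dataflows Gen2 query whose output lands in a lakehouse. The tools listed below (notebooks, Lakehouse SQL, pipelines) are the more common choices; this is simply one low-code illustration, and the query and column names (SalesStaging, CustomerID, CustomerName, Region) are hypothetical:

    let
        // Staging query holding raw transactional rows (hypothetical name)
        Source = SalesStaging,
        // Keep only the descriptive customer attributes
        CustomerColumns = Table.SelectColumns(Source, {"CustomerID", "CustomerName", "Region"}),
        // Deduplicate so each customer appears exactly once in the dimension
        Deduplicated = Table.Distinct(CustomerColumns),
        // Add a surrogate key that fact rows will reference
        DimCustomer = Table.AddIndexColumn(Deduplicated, "CustomerKey", 1, 1, Int64.Type)
    in
        DimCustomer

The fact query would then merge the staging rows to DimCustomer on CustomerID to pick up CustomerKey, keeping only the keys and the numeric measures.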

Tools You Might Use:

  • Notebooks (PySpark)
  • Lakehouse SQL
  • Data pipelines

Exam Tip:
Lakehouses are ideal when you need flexibility, schema evolution, or combined batch + exploratory analytics.

Implementing a Star Schema in a Warehouse

Data warehouses in Fabric provide a SQL-optimized store designed for BI workloads.

Steps to Implement:

  1. Stage raw data in warehouse tables
  2. Build conforming dimension tables
  3. Build fact tables with proper keys
  4. Add constraints and indexes (as appropriate)
  5. Optimize with materialized views or aggregations

Warehouse advantages:

  • Strong query performance for BI
  • Native SQL analytics
  • Excellent integration with Power BI and semantic models

Exam Tip:
Choose a warehouse when your priority is high-performance BI analytics with well-defined dimensional models.

Common Star Schema Patterns

Conformed Dimensions

  • Dimensions shared across multiple fact tables
  • Ensures consistent filtering and reporting across business processes

Slowly Changing Dimensions (SCD)

  • Maintain historical attribute changes
  • Types include Type 1 (overwrite) and Type 2 (versioning)

Fact Table Grain

  • Define the “grain” (level of detail) clearly—for example, “one row per sales transaction.”

Star Schema and Power BI Semantic Models

Semantic models often sit on top of star schemas:

  • Fact tables become measure containers
  • Dimensions become filtering hierarchies
  • Reduces DAX complexity
  • Improves performance

Best Practice: Structure your lakehouse or warehouse into a star schema before building the semantic model.

Star Schema in Lakehouse vs Warehouse

Feature | Lakehouse | Warehouse
Query engines | Spark & SQL | SQL only
Best for | Mixed workloads (big data + SQL) | BI & reporting
Optimization | Partition/Z-ORDER | Indexing and statistics
Tooling | Notebooks, pipelines | SQL scripts, BI artifacts
Schema complexity | Flexible | Rigid

Governance and Performance Considerations

  • Use consistent keys across facts and dimensions
  • Validate referential integrity where possible
  • Avoid wide, unindexed tables for BI queries
  • Apply sensitivity labels on schemas for governance
  • Document schema and business logic

What to Know for the DP-600 Exam

Be prepared to:

  • Explain the purpose of star schema components
  • Identify when to implement star schema in lakehouses vs warehouses
  • Recognize patterns like conformed dimensions and SCDs
  • Understand performance implications of schema design
  • Relate star schema design to Power BI and semantic models

Final Exam Tip
If the question emphasizes high-performance reporting, simple joins, and predictable filtering, think star schema.
If it mentions big data exploration or flexible schema evolution, star schema in a lakehouse may be part of the answer.

Practice Questions:

Here are 10 questions to test and help solidify your learning and knowledge. As you review these and other questions in your preparation, make sure to …

  • Identify and understand why an option is correct (or incorrect), not just which one
  • Look for keywords in exam questions and understand the scenario they signal
  • Expect scenario-based questions rather than direct definitions

1. What is the defining characteristic of a star schema?

A. Multiple fact tables connected through bridge tables
B. A central fact table connected directly to dimension tables
C. Fully normalized transactional tables
D. A schema optimized for OLTP workloads

Correct Answer: B

Explanation:
A star schema consists of a central fact table directly linked to surrounding dimension tables, forming a star-like structure optimized for analytics.

2. Which type of data is stored in a fact table?

A. Descriptive attributes such as names and categories
B. Hierarchical metadata for navigation
C. Quantitative, measurable values
D. User access permissions

Correct Answer: C

Explanation:
Fact tables store numeric measures (e.g., sales amount, quantity) that are aggregated during analytical queries.

3. Which table type is typically smaller and used for filtering and grouping?

A. Fact table
B. Dimension table
C. Bridge table
D. Staging table

Correct Answer: B

Explanation:
Dimension tables store descriptive attributes and are commonly used for filtering, grouping, and slicing fact data in reports.

4. Why are star schemas preferred for Power BI semantic models?

A. They eliminate the need for relationships
B. They align naturally with BI tools and optimize query performance
C. They reduce OneLake storage usage
D. They replace DAX calculations

Correct Answer: B

Explanation:
Power BI and other BI tools are optimized for star schemas, which simplify joins, reduce model complexity, and improve performance.

5. When implementing a star schema in a Fabric lakehouse, which storage format is typically used?

A. CSV
B. JSON
C. Parquet
D. Delta

Correct Answer: D

Explanation:
Fabric lakehouses store tables in Delta format, which supports ACID transactions and efficient analytical querying.

6. Which scenario most strongly suggests using a warehouse instead of a lakehouse for a star schema?

A. Schema evolution and exploratory data science
B. High-performance, SQL-based BI reporting
C. Streaming ingestion of real-time events
D. Semi-structured data exploration

Correct Answer: B

Explanation:
Fabric warehouses are optimized for SQL-based analytics and BI workloads, making them ideal for star schemas supporting reporting scenarios.

7. What does the “grain” of a fact table describe?

A. The number of dimensions in the table
B. The level of detail represented by each row
C. The size of the table in storage
D. The indexing strategy

Correct Answer: B

Explanation:
The grain defines the level of detail for each row in the fact table (e.g., one row per transaction or per day).

8. What is a conformed dimension?

A. A dimension used by only one fact table
B. A dimension that contains only numeric values
C. A shared dimension used consistently across multiple fact tables
D. A dimension generated dynamically at query time

Correct Answer: C

Explanation:
Conformed dimensions are shared across multiple fact tables, enabling consistent filtering and reporting across different business processes.

9. Which design choice improves performance when querying star schemas?

A. Highly normalized dimension tables
B. Complex many-to-many relationships
C. Simple joins between fact and dimension tables
D. Storing dimensions inside the fact table

Correct Answer: C

Explanation:
Star schemas minimize join complexity by using simple, direct relationships between facts and dimensions, improving query performance.

10. Which statement best describes how star schemas fit into the Fabric analytics lifecycle?

A. They replace semantic models entirely
B. They are used only for real-time analytics
C. They provide an analytics-ready structure for reporting and modeling
D. They are required only for data ingestion

Correct Answer: C

Explanation:
Star schemas organize data into an analytics-ready structure that supports semantic models, reporting, and scalable BI workloads.

Enrich data by adding new columns and tables

This post is part of the DP-600: Implementing Analytics Solutions Using Microsoft Fabric Exam Prep Hub. This topic falls under these sections:
Prepare data
--> Transform data
--> Enrich data by adding new columns and tables

Data enrichment is a critical step in preparing analytics-ready datasets in Microsoft Fabric. This section of the DP-600 exam focuses on how analytics engineers enhance existing data by adding derived columns, augmenting datasets with new tables, and combining multiple data sources to provide more business context and analytical value.

What Does Data Enrichment Mean in Fabric?

Data enrichment involves extending raw or curated datasets with additional attributes, calculations, or related entities so that the data is more useful for reporting, analytics, and downstream consumption.

In Microsoft Fabric, enrichment can occur across:

  • Lakehouses
  • Data Warehouses
  • Dataflows Gen2
  • Power BI semantic models
  • SQL analytics endpoints

Adding New Columns

Common Ways to Add Columns

  • Calculated columns using SQL expressions
  • Derived columns in Dataflows Gen2
  • Computed columns in warehouses or lakehouses
  • Calculated columns in semantic models (DAX)

Typical Use Cases

  • Creating business-friendly attributes (e.g., full name, year-month)
  • Applying business rules (e.g., customer segment, status flags)
  • Performing type conversions or formatting
  • Adding derived metrics (e.g., profit, margin, age)
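
For example, a business-rule column such as the "customer segment" mentioned above could be added in a Dataflows Gen2 query with a few lines of Power Query (M). This is only a sketch; the query name (Sales), the column names (TotalSpend, CustomerSegment), and the thresholds are all hypothetical:

    let
        Source = Sales,
        // Apply a business rule: segment customers by their total spend
        AddSegment = Table.AddColumn(
            Source,
            "CustomerSegment",
            each if [TotalSpend] >= 10000 then "Premium"
                 else if [TotalSpend] >= 1000 then "Standard"
                 else "Basic",
            type text
        )
    in
        AddSegment

The same rule could equally be written as a SQL expression in a lakehouse or warehouse, or as a DAX calculated column, which is exactly the placement decision discussed next.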

Where This Is Done

  • Lakehouse / Warehouse (SQL): Persistent, reusable transformations
  • Dataflows Gen2: Low-code, ETL-style transformations
  • Semantic models (DAX): Report-specific or analytical calculations

Exam Tip

Know where to add a column based on reuse, performance, and governance. Foundational logic belongs earlier in the data pipeline.

Adding New Tables

How New Tables Are Created

  • Creating tables from queries or transformations
  • Materializing enriched datasets
  • Joining multiple source tables into curated outputs
  • Creating dimension or fact tables for analytics

Common Use Cases

  • Creating lookup or reference tables
  • Building star schema components
  • Storing aggregated or summarized data
  • Supporting reuse across multiple reports and models
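
As one illustration of the "aggregated or summarized data" use case above, a summary table can be materialized from a Dataflows Gen2 query with a Group By step. The query and column names below (Sales, Region, YearMonth, Amount) are hypothetical:

    let
        Source = Sales,
        // Collapse detail rows into one row per region and month
        Summary = Table.Group(
            Source,
            {"Region", "YearMonth"},
            {
                {"TotalAmount", each List.Sum([Amount]), type number},
                {"OrderCount", each Table.RowCount(_), Int64.Type}
            }
        )
    in
        Summary

Loading this query as a new lakehouse or warehouse table keeps the summarized output reusable across reports instead of recalculating it in each one.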

Fabric Components Involved

  • Lakehouse tables (Delta format)
  • Warehouse tables
  • Dataflows Gen2 outputs
  • Shared semantic models

Enrichment Through Joins and Relationships

Enrichment Patterns

  • Joining transactional data with reference data
  • Adding descriptive attributes from lookup tables
  • Combining data from multiple domains (e.g., sales + geography)
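
As a minimal sketch of the first two patterns, the following Power Query (M) snippet enriches a transactional query with descriptive attributes from a lookup query using a left outer join. The query and column names (Sales, Geography, GeographyID, Country, Region) are hypothetical:

    let
        Source = Sales,
        // Left outer join keeps every sales row and adds geography attributes where a match exists
        Joined = Table.NestedJoin(
            Source, {"GeographyID"},
            Geography, {"GeographyID"},
            "Geo", JoinKind.LeftOuter
        ),
        // Expand only the descriptive columns needed downstream
        Enriched = Table.ExpandTableColumn(Joined, "Geo", {"Country", "Region"})
    in
        Enriched

Comparing row counts before and after the join is a quick way to confirm the merge did not unexpectedly duplicate or drop rows, which ties into the best practices below.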

Best Practices

  • Use appropriate join types (inner, left, etc.)
  • Ensure consistent data types and keys
  • Validate row counts and data integrity
  • Avoid unnecessary denormalization

Exam Tip

Understand the impact of joins on data volume, performance, and data correctness.

Enrichment Using Dataflows Gen2

Dataflows Gen2 are a key enrichment tool in Fabric:

  • Low-code Power Query transformations
  • Combine, append, and merge datasets
  • Add derived and conditional columns
  • Output enriched tables to OneLake

Ideal for:

  • Source-level enrichment
  • Repeatable, governed transformations
  • Non-SQL-based data preparation

Enrichment in Semantic Models

Some enrichment happens at the modeling layer:

  • Calculated columns (DAX)
  • Calculated tables
  • Role-playing dimensions

However:

  • Semantic-layer enrichment is best for analysis, not heavy transformation
  • Overuse can impact model performance and complexity

Governance and Performance Considerations

  • Prefer enriching data upstream when logic is reusable
  • Document derived columns and tables
  • Apply consistent naming conventions
  • Avoid duplicating enrichment logic across layers
  • Balance flexibility with maintainability

What to Know for the DP-600 Exam

You should be comfortable with:

  • When to add columns vs. when to add tables
  • Choosing the right Fabric component for enrichment
  • SQL vs. Power Query vs. DAX enrichment
  • Performance and governance trade-offs
  • Supporting analytics-ready and reusable datasets

Practice Questions:

Here are 10 questions to test and help solidify your learning and knowledge. As you review these and other questions in your preparation, make sure to …

  • Identify and understand why an option is correct (or incorrect), not just which one
  • Look for keywords in exam questions and understand the scenario they signal
  • Expect scenario-based questions rather than direct definitions

1. What is the primary goal of data enrichment in Microsoft Fabric?

A. Reduce data storage costs
B. Improve infrastructure security
C. Increase analytical value by adding context
D. Replace raw data sources

Correct Answer: C

Explanation:
Data enrichment enhances datasets by adding derived attributes or related data so that the data becomes more meaningful and useful for analytics and reporting.

2. Where should reusable business logic for derived columns ideally be implemented?

A. Power BI report visuals
B. Semantic model calculated columns
C. Lakehouse or warehouse SQL transformations
D. Ad-hoc DAX measures

Correct Answer: C

Explanation:
Reusable business logic should be implemented upstream (lakehouse or warehouse) to promote consistency, reuse, and better governance across multiple reports and models.

3. Which Fabric feature is best suited for low-code enrichment such as merging datasets and adding conditional columns?

A. SQL analytics endpoint
B. Dataflows Gen2
C. Eventhouse
D. Real-Time hub

Correct Answer: B

Explanation:
Dataflows Gen2 use Power Query to perform low-code transformations, including merges, derived columns, and conditional logic, making them ideal for enrichment scenarios.

4. When enriching data by joining tables, which join type preserves all rows from the primary dataset?

A. Inner join
B. Right join
C. Left join
D. Cross join

Correct Answer: C

Explanation:
A left join preserves all rows from the left (primary) table while adding matching data from the right table where available.

5. Which scenario best justifies creating a new enriched table instead of adding columns to an existing one?

A. Adding a formatting column for display
B. Creating a reusable dimension table
C. Renaming an existing column
D. Filtering rows for a specific report

Correct Answer: B

Explanation:
Creating a new table is appropriate when building reusable dimension or reference tables that support multiple fact tables or analytics use cases.

6. Why should heavy transformation logic generally be avoided in Power BI semantic models?

A. Semantic models cannot handle transformations
B. It increases OneLake storage usage
C. It can negatively affect performance and maintainability
D. Semantic models do not support calculated columns

Correct Answer: C

Explanation:
While semantic models support calculated columns and tables, heavy transformation logic is better handled upstream to improve performance and simplify model maintenance.

7. Which of the following is an example of enriching data by adding a new column?

A. Changing workspace permissions
B. Adding a “Customer Segment” column based on business rules
C. Publishing a semantic model
D. Creating a new workspace

Correct Answer: B

Explanation:
Derived columns such as customer segmentation add business context to data, which is a classic enrichment scenario.

8. What is a key benefit of enriching data earlier in the data pipeline?

A. Faster report publishing
B. Reduced need for Power BI licenses
C. Improved consistency across analytics assets
D. Automatic index creation

Correct Answer: C

Explanation:
Enriching data upstream ensures that all downstream consumers use the same logic and definitions, improving consistency and governance.

9. Which Fabric storage format is typically used when creating enriched tables in a lakehouse?

A. CSV
B. Parquet
C. Delta
D. JSON

Correct Answer: C

Explanation:
Lakehouse tables in Microsoft Fabric are stored in Delta format, which supports ACID transactions and efficient analytics.

10. Which factor should most influence where enrichment logic is implemented?

A. User interface preferences
B. Reusability and performance considerations
C. The number of Power BI visuals
D. Workspace naming conventions

Correct Answer: B

Explanation:
Choosing where to enrich data depends on how reusable the logic is and how it affects performance, scalability, and governance.

Ingest or Access Data as Needed

This post is part of the DP-600: Implementing Analytics Solutions Using Microsoft Fabric Exam Prep Hub. This topic falls under these sections:
Prepare data
--> Get data
--> Ingest or access data as needed

A core responsibility of a Microsoft Fabric Analytics Engineer is deciding how data should be brought into Fabric—or whether it should be brought in at all. For the DP-600 exam, this topic focuses on selecting the right ingestion or access pattern based on performance, freshness, cost, and governance requirements.

Ingest vs. Access: Key Concept

Before choosing a tool or method, understand the distinction:

  • Ingest data: Physically copy data into Fabric-managed storage (OneLake)
  • Access data: Query or reference data where it already lives, without copying

The exam frequently tests your ability to choose the most appropriate option—not just a working one.

Common Data Ingestion Methods in Microsoft Fabric

1. Dataflows Gen2

Best for:

  • Low-code ingestion and transformation
  • Reusable ingestion logic
  • Business-friendly data preparation

Key characteristics:

  • Uses Power Query Online
  • Supports scheduled refresh
  • Stores results in OneLake (Lakehouse or Warehouse)
  • Ideal for centralized, governed ingestion

Exam tip:
Use Dataflows Gen2 when reuse, transformation, and governance are priorities.

2. Data Pipelines (Copy Activity)

Best for:

  • High-volume or frequent ingestion
  • Orchestration across multiple sources
  • ELT-style workflows

Key characteristics:

  • Supports many source and sink types
  • Enables scheduling, dependencies, and retries
  • Minimal transformation (primarily copy)

Exam tip:
Choose pipelines when performance and orchestration matter more than transformation.

3. Notebooks (Spark)

Best for:

  • Complex transformations
  • Data science or advanced engineering
  • Custom ingestion logic

Key characteristics:

  • Full control using Spark (PySpark, Scala, SQL)
  • Suitable for large-scale processing
  • Writes directly to OneLake

Exam tip:
Notebooks are powerful but require engineering skills—don’t choose them for simple ingestion scenarios.

Accessing Data Without Ingesting

1. OneLake Shortcuts

Best for:

  • Avoiding data duplication
  • Reusing data across workspaces
  • Accessing external storage

Key characteristics:

  • Logical reference only (no copy)
  • Supports external sources such as ADLS Gen2 and Amazon S3
  • Appears native in Lakehouse tables or files

Exam tip:
Shortcuts are often the best answer when the question mentions avoiding duplication or reducing storage cost.

2. DirectQuery

Best for:

  • Near-real-time data access
  • Large datasets that cannot be imported
  • Centralized source-of-truth systems

Key characteristics:

  • Queries run against the source system
  • Performance depends on source
  • Limited modeling flexibility compared to Import

Exam tip:
Expect trade-off questions involving DirectQuery vs. Import.

3. Real-Time Access (Eventstreams / KQL)

Best for:

  • Streaming and telemetry data
  • Operational and real-time analytics

Key characteristics:

  • Event-driven ingestion
  • Supports near-real-time dashboards
  • Often discovered via Real-Time hub

Exam tip:
Use real-time ingestion when freshness is measured in seconds, not hours.

Choosing the Right Approach (Exam-Critical)

You should be able to decide based on these factors:

Requirement | Best Option
Reusable ingestion logic | Dataflows Gen2
High-volume copy | Data pipelines
Complex transformations | Notebooks
Avoid duplication | OneLake shortcuts
Near real-time reporting | DirectQuery / Eventstreams
Governance and trust | Ingestion + endorsement

Governance and Security Considerations

  • Ingested data can inherit sensitivity labels
  • Access-based methods rely on source permissions
  • Workspace roles determine who can ingest or access data
  • Endorsed datasets should be preferred for reuse

DP-600 often frames ingestion questions within a governance context.

Common Exam Scenarios

You may be asked to:

  • Choose between ingesting data or accessing it directly
  • Identify when shortcuts are preferable to ingestion
  • Select the right tool for a specific ingestion pattern
  • Balance data freshness vs. performance
  • Reduce duplication across workspaces

Best Practices to Remember

  • Ingest when performance and modeling flexibility are required
  • Access when freshness, cost, or duplication is a concern
  • Centralize ingestion logic for reuse
  • Prefer Fabric-native patterns over external tools
  • Let business requirements drive architectural decisions

Key Takeaway
For the DP-600 exam, “Ingest or access data as needed” is about making intentional, informed choices. Microsoft Fabric provides multiple ways to bring data into analytics solutions, and the correct approach depends on scale, freshness, reuse, governance, and cost. Understanding why one method is better than another is far more important than memorizing features.

Practice Questions:

Here are 10 questions to test and help solidify your learning and knowledge. As you review these and other questions in your preparation, make sure to …

  • Identify and understand why an option is correct (or incorrect), not just which one
  • Look for keywords in exam questions and understand the scenario they signal (for example, low code/no code, large dataset, high-volume data, reuse, complex transformations)
  • Expect scenario-based questions rather than direct definitions

Also, keep in mind that …

  • DP-600 questions often include multiple valid options, but only one that best aligns with the scenario’s constraints. Always identify and consider factors such as:
    • Data volume
    • Freshness requirements
    • Reuse and duplication concerns
    • Transformation complexity

1. What is the primary difference between ingesting data and accessing data in Microsoft Fabric?

A. Ingested data cannot be secured
B. Accessed data is always slower
C. Ingesting copies data into OneLake, while accessing queries data in place
D. Accessed data requires a gateway

Correct Answer: C

Explanation:
Ingestion physically copies data into Fabric-managed storage (OneLake), while access-based approaches query or reference data where it already exists.

2. Which option is BEST when the goal is to avoid duplicating large datasets across multiple workspaces?

A. Import mode
B. Dataflows Gen2
C. OneLake shortcuts
D. Notebooks

Correct Answer: C

Explanation:
OneLake shortcuts allow data to be referenced without copying it, making them ideal for reuse and cost control.

3. A team needs reusable, low-code ingestion logic with scheduled refresh. Which Fabric feature should they use?

A. Spark notebooks
B. Data pipelines
C. Dataflows Gen2
D. DirectQuery

Correct Answer: C

Explanation:
Dataflows Gen2 provide Power Query–based ingestion with refresh scheduling and reuse across Fabric items.

4. Which ingestion method is MOST appropriate for complex transformations requiring custom logic?

A. Dataflows Gen2
B. Copy activity in pipelines
C. OneLake shortcuts
D. Spark notebooks

Correct Answer: D

Explanation:
Spark notebooks offer full control over transformation logic and are suited for complex, large-scale processing.

5. When should DirectQuery be preferred over Import mode?

A. When the dataset is small
B. When data freshness is critical
C. When transformations are complex
D. When performance must be maximized

Correct Answer: B

Explanation:
DirectQuery is preferred when near-real-time access to data is required, even though performance depends on the source system.

6. Which Fabric component is BEST suited for orchestrating high-volume data ingestion with dependencies and retries?

A. Dataflows Gen2
B. Data pipelines
C. Semantic models
D. Power BI Desktop

Correct Answer: B

Explanation:
Data pipelines are designed for orchestration, handling large volumes of data, scheduling, and dependency management.

7. A dataset is queried infrequently but must support advanced modeling features. Which approach is most appropriate?

A. DirectQuery
B. Access via shortcut
C. Import into OneLake
D. Eventstream ingestion

Correct Answer: C

Explanation:
Import mode supports full modeling capabilities and high query performance, making it suitable even for infrequently accessed data.

8. Which scenario best fits the use of real-time ingestion methods such as Eventstreams or KQL databases?

A. Monthly financial reporting
B. Static reference data
C. IoT telemetry and operational monitoring
D. Slowly changing dimensions

Correct Answer: C

Explanation:
Real-time ingestion is designed for continuous, event-driven data such as IoT telemetry and operational metrics.

9. Why might ingesting data be preferred over accessing it directly?

A. It always reduces storage costs
B. It eliminates the need for security
C. It improves performance and modeling flexibility
D. It avoids data refresh

Correct Answer: C

Explanation:
Ingesting data into OneLake enables faster query performance and full support for modeling features.

10. Which factor is MOST important when deciding between ingesting data and accessing it?

A. The color of the dashboard
B. The number of reports
C. Business requirements such as freshness, scale, and governance
D. The Fabric region

Correct Answer: C

Explanation:
The decision to ingest or access data should be driven by business needs, including performance, freshness, cost, and governance—not technical convenience alone.

Create a Data Connection in Microsoft Fabric

This post is part of the DP-600: Implementing Analytics Solutions Using Microsoft Fabric Exam Prep Hub. This topic falls under these sections:
Prepare data
--> Get data
--> Create a data connection

Creating data connections is a foundational skill for a Microsoft Fabric Analytics Engineer. In the DP-600 exam, this topic focuses on how to securely and efficiently connect Fabric workloads—such as Lakehouses, Warehouses, Dataflows Gen2, and semantic models—to a wide variety of data sources.

What a Data Connection Means in Microsoft Fabric

A data connection defines how Fabric authenticates to, accesses, and retrieves data from a source system. It includes:

  • The data source type
  • Connection details (server, database, endpoint, file path, etc.)
  • Authentication method
  • Optional privacy and credential reuse settings

Once created, a data connection can often be reused across multiple items within a workspace.

Common Data Sources in Fabric

For the exam, you should be familiar with connecting to the following categories of data sources:

1. Azure and Microsoft Data Sources

  • Azure SQL Database
  • Azure Synapse (dedicated and serverless pools)
  • Azure Data Lake Storage Gen2
  • Azure Blob Storage
  • OneLake (Fabric-native storage)
  • Power BI semantic models (DirectQuery)

2. On-Premises Data Sources

  • SQL Server
  • Oracle
  • Other relational databases

These typically require an On-premises Data Gateway.

3. Files and Semi-Structured Data

  • CSV, JSON, Parquet, Excel
  • Files stored in OneLake, ADLS Gen2, SharePoint, or local file systems

Where Data Connections Are Created

In Microsoft Fabric, data connections can be created from several entry points:

  • Lakehouse: Add data via shortcuts or ingestion
  • Warehouse: Connect external data or ingest via pipelines
  • Dataflows Gen2: Define connections as part of Power Query Online
  • Pipelines: Configure source connections in copy activities
  • Semantic models: Connect via Import or DirectQuery

Understanding where the connection is configured is important for exam scenarios.

Authentication Methods

The DP-600 exam commonly tests authentication concepts. Be familiar with:

  • Microsoft Entra ID (OAuth) – Recommended and most secure
  • Service principal – Common for automation and CI/CD
  • Account key / Shared Access Signature (SAS) – Often used for storage
  • Username and password – Less secure, sometimes legacy

You should also understand when credentials are:

  • Stored at the connection level
  • Managed per workspace
  • Reused across multiple items

Gateways and Connectivity Modes

On-Premises Data Gateway

Required when connecting Fabric to on-premises sources. Key points:

  • Can be standard or personal (standard is preferred)
  • Must be online for refresh and query operations
  • Uses outbound connections only

Connectivity Modes

  • Import: Data is loaded into Fabric storage
  • DirectQuery: Queries run against the source system
  • Shortcut-based access: Data remains external but appears native in OneLake

Security and Governance Considerations

When creating data connections, Fabric enforces governance through:

  • Workspace roles (Viewer, Contributor, Member, Admin)
  • Credential isolation per workspace
  • Sensitivity labels inherited from data sources (when applicable)

Exam questions may test your ability to choose the most secure and scalable connection method.

Best Practices (Exam-Relevant)

  • Prefer Entra ID authentication over credentials or keys
  • Use OneLake shortcuts to avoid unnecessary data duplication
  • Centralize connections in Dataflows Gen2 for reuse
  • Validate gateway availability for on-premises sources
  • Align connection methods with performance needs (Import vs DirectQuery)

How This Appears on the DP-600 Exam

You may be asked to:

  • Identify the correct data connection method for a scenario
  • Choose the appropriate authentication type
  • Determine when a gateway is required
  • Decide where to create a connection for reuse and governance
  • Troubleshoot refresh or connectivity issues

Key Takeaway
Creating data connections in Microsoft Fabric is about more than just accessing data—it’s about security, performance, reusability, and governance. For the DP-600 exam, focus on understanding source types, authentication options, gateways, and where connections are defined within the Fabric ecosystem.

Practice Questions:

Here are 10 questions to test and help solidify your learning and knowledge. As you review these and other questions in your preparation, make sure to …

  • Identify and understand why an option is correct (or incorrect), not just which one
  • Look for keywords in exam questions and understand the scenario they signal (for example, gateway, authentication, reuse, DirectQuery vs Import)
  • Expect scenario-based questions rather than direct definitions

1. Which authentication method is generally recommended when creating data connections in Microsoft Fabric?

A. Username and password
B. Shared Access Signature (SAS)
C. Microsoft Entra ID (OAuth)
D. Account key

Correct Answer: C

Explanation:
Microsoft Entra ID (OAuth) is the recommended authentication method because it provides centralized identity management, better security, support for conditional access, and easier credential rotation compared to passwords or keys.

2. When is an On-premises Data Gateway required in Microsoft Fabric?

A. When connecting to Azure SQL Database
B. When connecting to OneLake
C. When connecting to an on-premises SQL Server
D. When connecting to Azure Data Lake Storage Gen2

Correct Answer: C

Explanation:
An On-premises Data Gateway is required when Fabric needs to access data sources that are hosted on-premises. Cloud-based sources such as Azure SQL Database or ADLS Gen2 do not require a gateway.

3. Which Fabric feature allows external data to appear as if it is stored in OneLake without copying the data?

A. Import mode
B. DirectQuery mode
C. OneLake shortcuts
D. Data pipelines

Correct Answer: C

Explanation:
OneLake shortcuts provide a logical reference to external storage locations (such as ADLS Gen2 or S3) without physically moving or duplicating the data.

4. You want multiple Fabric items in the same workspace to reuse a single data connection. Where should you create the connection?

A. In each semantic model
B. In Dataflows Gen2
C. In Power BI Desktop only
D. In Excel

Correct Answer: B

Explanation:
Dataflows Gen2 are designed for centralized data ingestion and transformation, making them ideal for creating reusable data connections across multiple Fabric items.

5. Which connectivity mode loads data into Fabric storage and provides the best query performance?

A. DirectQuery
B. Live connection
C. Shortcut-based access
D. Import

Correct Answer: D

Explanation:
Import mode copies data into Fabric-managed storage, enabling high-performance queries and full modeling capabilities at the cost of data freshness.

6. Which statement about DirectQuery connections in Fabric is true?

A. Data is stored in OneLake
B. Queries are always faster than Import mode
C. Queries are executed against the source system
D. A gateway is never required

Correct Answer: C

Explanation:
With DirectQuery, queries are sent directly to the source system at runtime. Performance depends on the source, and a gateway may be required for on-premises sources.

7. Which role is required to create or edit data connections within a Fabric workspace?

A. Viewer
B. Contributor
C. Member
D. Admin

Correct Answer: B

Explanation:
Users must have at least Contributor permissions to create or modify data connections. Viewers have read-only access and cannot manage connections.

8. Which file formats are commonly supported when creating file-based data connections in Fabric?

A. CSV only
B. CSV, JSON, Parquet, Excel
C. TXT only
D. XML only

Correct Answer: B

Explanation:
Microsoft Fabric supports a wide range of structured and semi-structured file formats, including CSV, JSON, Parquet, and Excel, especially when stored in OneLake or ADLS Gen2.

9. What is the primary security benefit of using a service principal for data connections?

A. Faster query performance
B. No need for a gateway
C. Automated, non-interactive authentication
D. Unlimited access to all workspaces

Correct Answer: C

Explanation:
Service principals enable secure, automated authentication scenarios (such as CI/CD pipelines) without relying on individual user credentials.

10. A data refresh in Fabric fails because credentials are missing. What is the most likely cause?

A. The dataset is in Import mode
B. The gateway is offline or misconfigured
C. The semantic model contains calculated columns
D. The file format is unsupported

Correct Answer: B

Explanation:
If a data source requires an On-premises Data Gateway and the gateway is offline or incorrectly configured, Fabric cannot access the credentials, causing refresh failures.

Implement Incremental Refresh for Semantic Models

This post is part of the DP-600: Implementing Analytics Solutions Using Microsoft Fabric Exam Prep Hub. This topic falls under these sections:
Implement and manage semantic models (25-30%)
--> Optimize enterprise-scale semantic models
--> Implement Incremental Refresh for Semantic Models

Overview

Incremental refresh is a key optimization technique for enterprise-scale semantic models in Microsoft Fabric and Power BI. Instead of fully refreshing all data during each refresh cycle, incremental refresh allows you to refresh only new or changed data, significantly improving refresh performance, reducing resource consumption, and enabling scalability for large datasets.

In the DP-600 exam, this topic appears under Optimize enterprise-scale semantic models and focuses on when, why, and how to configure incremental refresh correctly.


What Is Incremental Refresh?

Incremental refresh is a feature for Import mode and Hybrid (Import + DirectQuery) semantic models that:

  • Partitions data based on date/time columns
  • Refreshes only a recent portion of data
  • Retains historical data without reprocessing it
  • Optionally supports real-time data using DirectQuery

Incremental refresh is not applicable to:

  • Direct Lake–only semantic models
  • Pure DirectQuery models

Key Benefits

Incremental refresh provides several enterprise-level advantages:

  • Faster refresh times for large datasets
  • Reduced memory and CPU usage
  • Improved reliability of scheduled refreshes
  • Better scalability for growing fact tables
  • Enables near-real-time analytics when combined with DirectQuery

Core Configuration Components

1. Date/Time Column Requirement

Incremental refresh requires a column that:

  • Is of type Date, DateTime, or DateTimeZone
  • Represents a monotonically increasing timeline (for example, OrderDate or TransactionDate)

This column is used to define data partitions.


2. RangeStart and RangeEnd Parameters

Incremental refresh relies on two Power Query parameters:

  • RangeStart – Beginning of the refresh window
  • RangeEnd – End of the refresh window

These parameters:

  • Must be of type Date/Time
  • Are used in a filter step in Power Query
  • Are evaluated dynamically during refresh

Exam tip: These parameters are required, not optional.
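
To illustrate, the filter step in the fact table's Power Query (M) query typically looks like the sketch below. The source, table, and column names (the server, SalesDb, FactSales, OrderDate) are placeholders; the key point is that one boundary is inclusive and the other exclusive so that no row falls into two partitions:

    let
        // Hypothetical relational source; any source that supports query folding behaves the same way
        Source = Sql.Database("myserver.database.windows.net", "SalesDb"),
        FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
        // RangeStart and RangeEnd are the required Date/Time parameters
        Filtered = Table.SelectRows(
            FactSales,
            each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd
        )
    in
        Filtered

Microsoft's guidance is to keep this filter foldable to the source so that each partition refresh queries only the rows it needs.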


3. Refresh and Storage Policies

When configuring incremental refresh, you define two key time windows:

Policy | Purpose
Store rows from the past | Defines how much historical data is retained
Refresh rows from the past | Defines how much recent data is refreshed

Example:

  • Store data for 5 years
  • Refresh data from the last 7 days

Only the refresh window is reprocessed during each refresh.


4. Optional: Detect Data Changes

Incremental refresh can optionally use a change detection column (for example, LastModifiedDate):

  • Only refreshes partitions where data has changed
  • Reduces unnecessary refresh operations
  • Column must be reliably updated when records change

This is especially useful for slowly changing dimensions.


Incremental Refresh with Real-Time Data (Hybrid Tables)

Incremental refresh can be combined with DirectQuery to support real-time data:

  • Historical data → Import mode
  • Recent data → DirectQuery

This configuration:

  • Uses the “Get the latest data in real time” option
  • Is commonly referred to as a Hybrid table
  • Balances performance with freshness

Deployment and Execution Behavior

  • Incremental refresh is defined in Power BI Desktop
  • Partitions are created only after publishing
  • Refresh execution happens in the Fabric service
  • Desktop refresh does not create partitions

Exam tip: Many questions test the difference between design-time configuration and service-side execution.


Limitations and Considerations

  • Requires Import or Hybrid mode
  • Date column must exist in the fact table
  • Cannot be configured directly in Fabric service
  • Schema changes may require full refresh
  • Partition count should be managed to avoid excessive overhead

Common DP-600 Exam Scenarios

You may be asked to:

  • Choose incremental refresh to solve long refresh times
  • Identify missing requirements (RangeStart/RangeEnd)
  • Decide between full refresh vs incremental refresh
  • Configure refresh windows for historical vs recent data
  • Combine incremental refresh with real-time analytics

When to Use Incremental Refresh (Exam Heuristic)

Choose incremental refresh when:

  • Fact tables are large and growing
  • Only recent data changes
  • Full refresh times are too long
  • Import mode is required for performance

Avoid it when:

  • Data volume is small
  • Real-time access is required for all data
  • Using Direct Lake–only models

Exam Tips

For DP-600, remember:

  • RangeStart / RangeEnd are mandatory
  • Incremental refresh = Import or Hybrid
  • Partitions are service-side
  • Refresh window ≠ storage window
  • Hybrid tables enable real-time + performance

Summary

Incremental refresh is a foundational optimization technique for large semantic models in Microsoft Fabric. For the DP-600 exam, focus on:

  • Required parameters (RangeStart, RangeEnd)
  • Refresh vs storage windows
  • Import and Hybrid model compatibility
  • Real-time and change detection scenarios
  • Service-side execution behavior

Practice Questions:

Here are 10 questions to test and help solidify your learning and knowledge. As you review these and other questions in your preparation, make sure to …

  • Identify and understand why an option is correct (or incorrect), not just which one
  • Look for keywords in exam questions and understand the scenario they signal
  • Expect scenario-based questions rather than direct definitions

Question 1

You have a large fact table with 5 years of historical data. Only the most recent data changes daily. Which feature should you implement to reduce refresh time?

A. DirectQuery mode
B. Incremental refresh
C. Calculated tables
D. Composite models

Correct Answer: B

Explanation:
Incremental refresh is designed to refresh only recent data while retaining historical partitions, significantly improving refresh performance for large datasets.


Question 2

Which two Power Query parameters are required to configure incremental refresh?

A. StartDate and EndDate
B. MinDate and MaxDate
C. RangeStart and RangeEnd
D. RefreshStart and RefreshEnd

Correct Answer: C

Explanation:
Incremental refresh requires RangeStart and RangeEnd parameters of type Date/Time to define partition boundaries.


Question 3

Where are incremental refresh partitions actually created?

A. Power BI Desktop during data load
B. Fabric Data Factory
C. Microsoft Fabric service after publishing
D. SQL endpoint

Correct Answer: C

Explanation:
Partitions are created and managed only in the Fabric service after the model is published. Desktop refresh does not create partitions.


Question 4

Which storage mode is required to use incremental refresh?

A. DirectQuery only
B. Direct Lake only
C. Import or Hybrid
D. Dual only

Correct Answer: C

Explanation:
Incremental refresh works with Import mode and Hybrid tables. It is not supported for DirectQuery-only or Direct Lake–only models.


Question 5

You configure incremental refresh to store 5 years of data and refresh the last 7 days. What happens during a scheduled refresh?

A. All data is fully refreshed
B. Only the last 7 days are refreshed
C. Only the last year is refreshed
D. Only new rows are loaded

Correct Answer: B

Explanation:
The refresh window defines how much data is reprocessed. Historical partitions outside that window are retained without refresh.


Question 6

Which column type is required for incremental refresh filtering?

A. Text
B. Integer
C. Boolean
D. Date/DateTime

Correct Answer: D

Explanation:
Incremental refresh requires a Date, DateTime, or DateTimeZone column to define time-based partitions.


Question 7

What is the purpose of the Detect data changes option?

A. To refresh all partitions automatically
B. To detect schema changes
C. To refresh only partitions where data has changed
D. To enable real-time DirectQuery

Correct Answer: C

Explanation:
Detect data changes uses a change-tracking column (e.g., LastModifiedDate) to avoid refreshing partitions when no data has changed.


Question 8

Which scenario best fits a Hybrid incremental refresh configuration?

A. All data must be queried in real time
B. Small dataset refreshed once per day
C. Historical data rarely changes, but recent data must be real time
D. Streaming data only

Correct Answer: C

Explanation:
Hybrid tables combine Import for historical data and DirectQuery for recent data, providing real-time access where needed.


Question 9

What happens if the date column used for incremental refresh contains null values?

A. Incremental refresh is automatically disabled
B. Only historical partitions fail
C. Refresh may fail or produce incorrect partitions
D. Null values are ignored safely

Correct Answer: C

Explanation:
The date column must be reliable. Null or invalid values can break partition logic and cause refresh failures.


Question 10

When should you avoid using incremental refresh?

A. When the dataset is large
B. When only recent data changes
C. When using Direct Lake–only semantic models
D. When refresh duration is long

Correct Answer: C

Explanation:
Incremental refresh is not supported for Direct Lake–only models, as Direct Lake handles freshness differently through OneLake access.


Create and configure deployment pipelines

This post is part of the DP-600: Implementing Analytics Solutions Using Microsoft Fabric Exam Prep Hub. This topic falls under these sections:
Maintain a data analytics solution
--> Maintain the analytics development lifecycle
--> Create and configure deployment pipelines

Deployment pipelines in Microsoft Fabric provide a structured, governed way to promote analytics content across environments—typically Development, Test, and Production. They are a core lifecycle management feature that helps teams deploy changes safely, consistently, and with minimal risk. For the DP-600 exam, you should understand what deployment pipelines are, how they are configured, what they support, and how they differ from Git-based version control.

What Are Deployment Pipelines?

A deployment pipeline is a Fabric feature that:

  • Connects multiple workspaces into an ordered promotion flow
  • Enables controlled deployment of items between environments
  • Supports validation and testing before production release

Pipelines are especially important for enterprise-scale analytics solutions.

Typical Pipeline Structure

A standard Fabric pipeline consists of three stages:

  1. Development
    • Active development
    • Frequent changes
    • Used by engineers and analysts
  2. Test
    • Validation and user acceptance testing
    • Data and logic verification
    • Limited access
  3. Production
    • Certified, trusted content
    • Broad consumer access
    • Minimal direct changes

Each stage is linked to a separate Fabric workspace.

Creating a Deployment Pipeline

At a high level, the process is:

  1. Create a deployment pipeline in Microsoft Fabric
  2. Assign a workspace to each stage:
    • Dev workspace
    • Test workspace
    • Prod workspace
  3. Configure pipeline settings
  4. Control who can deploy between stages

Once created, the pipeline provides a visual interface showing item differences across stages.

What Items Can Be Deployed Through Pipelines?

Deployment pipelines support deployment of many Fabric items, including:

  • Semantic models
  • Reports and dashboards
  • Dataflows Gen2
  • Lakehouses and Warehouses (supported scenarios)
  • Other supported analytics artifacts

Exam note:
Not every Fabric item supports pipeline deployment equally—expect questions to focus on Power BI and core analytics items.

How Deployment Works

Comparing Changes

  • Pipelines show differences between stages
  • You can review what will change before deploying

Deploying Content

  • Deploy from Dev → Test
  • Validate
  • Deploy from Test → Prod

Deployments:

  • Copy item definitions
  • Can update existing items or create new ones
  • Do not automatically move workspace permissions

Deployment Rules and Parameters

Pipelines support deployment rules, such as:

  • Changing data source connections per environment
  • Switching parameters between Dev, Test, and Prod
  • Avoiding hard-coded environment values

This is critical for:

  • Separating development and production data
  • Supporting safe testing

Pipelines vs Git Integration (Exam Comparison)

This distinction is frequently tested.

Feature | Deployment Pipelines | Git Integration
Purpose | Environment promotion | Source control
Focus | Deployment | Versioning
Tracks history | No | Yes
Supports branching | No | Yes
Typical use | Dev → Test → Prod | Code collaboration

Key insight:
They are complementary, not competing features.

Permissions and Governance

To use pipelines:

  • Users need appropriate pipeline permissions
  • Workspace access is still required
  • Production deployments are often restricted to a small group

Pipelines support governance by:

  • Reducing direct changes in production
  • Enforcing controlled release processes
  • Improving auditability

Common Exam Scenarios

You may be asked to:

  • Choose pipelines for controlled promotion of reports
  • Identify when pipelines are preferable to manual publishing
  • Combine pipelines with Git and PBIP
  • Configure different data sources per environment
  • Prevent accidental production changes

Example:

A report must be tested before being released to executives.
Correct concept: Use a development pipeline with Dev, Test, and Prod stages.

Best Practices to Remember

  • Use separate workspaces per environment
  • Restrict production deployment permissions
  • Combine pipelines with:
    • PBIP projects
    • Git integration
    • Endorsements and certification
  • Avoid direct editing in production

Key Exam Takeaways

  • Deployment pipelines manage content promotion across environments
  • They connect multiple Fabric workspaces
  • Pipelines support comparison, validation, and controlled deployment
  • They do not replace Git-based version control
  • A core feature of the Fabric analytics lifecycle

Exam Tips

  • If a question focuses on moving content safely from development to production, the correct answer is deployment pipelines.
  • If it focuses on tracking changes or collaboration, the answer is Git or PBIP.
  • Know how pipelines support:
    • Dev/Test/Prod lifecycle
    • Governance & change control
    • Environment-specific configuration
    • Enterprise-scale BI practices
  • Common exam traps:
    • Confusing workspace roles with deploy permissions
    • Assuming pipelines manage security or performance
    • Forgetting deployment rules

Practice Questions

Question 1 (Single choice)

What is the PRIMARY purpose of a deployment pipeline in Microsoft Fabric?

A. Schedule dataset refreshes
B. Promote content across lifecycle environments
C. Enable row-level security
D. Optimize DAX performance

Correct Answer: B

Explanation:
Deployment pipelines are designed to promote content across environments (for example, Development → Test → Production) in a controlled and governed manner.

  • ❌ A: Refresh scheduling is handled separately
  • ❌ C: Security is not the primary purpose
  • ❌ D: Performance tuning is unrelated

Question 2 (Multi-select)

Which stages are available by default in a Fabric deployment pipeline? (Select all that apply.)

A. Development
B. Test
C. Production
D. Sandbox

Correct Answers: A, B, C

Explanation:
Fabric deployment pipelines use a three-stage lifecycle:

  • Development
  • Test
  • Production

There is no default Sandbox stage.


Question 3 (Scenario-based)

A team wants analysts to freely modify reports, while only approved changes reach production. Which pipeline stage should analysts primarily work in?

A. Production
B. Test
C. Development
D. Any stage

Correct Answer: C

Explanation:
The Development stage is intended for:

  • Frequent changes
  • Experimentation
  • Initial validation

Higher stages are more controlled.


Question 4 (Single choice)

Which permission is required to deploy content from one stage to the next in a deployment pipeline?

A. Viewer
B. Contributor
C. Admin
D. Pipeline deploy permission

Correct Answer: D

Explanation:
Deploying content requires explicit pipeline deployment permissions, not just workspace roles.

  • ❌ Admin alone is not sufficient
  • ❌ Contributor may edit but not deploy

Question 5 (Scenario-based)

You deploy a semantic model from Test to Production. What happens to data source connections by default?

A. They are deleted
B. They remain unchanged
C. They can be overridden per stage
D. They must be manually reconfigured

Correct Answer: C

Explanation:
Deployment pipelines support parameter and data source rules, allowing environment-specific connections.


Question 6 (Multi-select)

Which items can be deployed using deployment pipelines? (Select all that apply.)

A. Reports
B. Semantic models
C. Dashboards
D. Notebooks

Correct Answers: A, B, C

Explanation:
Deployment pipelines support Power BI artifacts, including:

  • Reports
  • Semantic models
  • Dashboards

❌ Notebooks are Fabric artifacts but are not deployed via Power BI deployment pipelines.


Question 7 (Scenario-based)

A deployment shows warnings that some items are skipped. What is the MOST likely cause?

A. The workspace is full
B. Unsupported artifacts exist
C. The dataset is too large
D. Git integration is disabled

Correct Answer: B

Explanation:
Unsupported or incompatible artifacts (for example, unsupported report types) may be skipped during deployment.


Question 8 (Single choice)

Which feature allows different environments to use different data sources during deployment?

A. Row-level security
B. Dynamic format strings
C. Deployment rules
D. Incremental refresh

Correct Answer: C

Explanation:
Deployment rules allow:

  • Data source switching
  • Parameter overrides
  • Environment-specific configuration

Question 9 (Scenario-based)

You want production users to access only certified content. How do deployment pipelines help?

A. By enforcing sensitivity labels
B. By promoting tested content only
C. By encrypting production reports
D. By disabling edit access

Correct Answer: B

Explanation:
Deployment pipelines ensure:

  • Content is validated in Test
  • Only approved changes reach Production

They support trust and governance, not encryption or labeling.


Question 10 (Multi-select)

Which best practices apply when configuring deployment pipelines? (Select all that apply.)

A. Restrict deploy permissions
B. Use separate data sources per stage
C. Allow all users to deploy to Production
D. Validate content in Test before Production

Correct Answers: A, B, D

Explanation:
Best practices include:

  • Limited deploy access
  • Environment-specific configurations
  • Mandatory testing before production

❌ Allowing everyone to deploy defeats governance.


Merging Two Excel Files or Sheets Using Power Query (with the merge based on Multiple Columns)

Excel Power Query is a powerful, no-code/low-code tool that allows you to combine and transform data from multiple sources in a repeatable and refreshable way. One common use case is merging two Excel files or worksheets based on multiple matching columns, similar to a SQL join. Power Query is a major part of Power BI, but it is also built into Excel.

When to Use Power Query for Merging

Power Query is ideal when:

  • You receive recurring Excel files with the same structure
  • You need a reliable, refreshable merge process
  • You want to avoid complex formulas like VLOOKUP or XLOOKUP across many columns

Step-by-Step Overview

1. Load Both Data Sources into Power Query

  • Open Excel and go to Data → Get Data
  • Choose From Workbook (for separate files) or From Table/Range (for sheets in the same file)

Tip: Ensure the columns you plan to merge on have the same data types (e.g., text vs. number).


  • Load each dataset into Power Query as a separate query

2. Start the Merge Operation

  • In Power Query, select the primary table
  • Go to Home → Merge Queries
  • Choose the secondary table from the dropdown

3. Select Multiple Matching Columns

  • Click the first matching column in the primary table
  • Hold Ctrl (or Cmd on Mac) and select additional matching columns
  • Repeat the same column selections in the secondary table, in the same order

For example, to merge on CustomerID, OrderDate, and Region, you would click CustomerID, then hold the Ctrl key and click OrderDate, then (still holding Ctrl) click Region.

Power Query treats this as a composite key, and all selected columns must match for rows from both tables to merge.


4. Choose the Join Type

Select the appropriate join kind:

  • Left Outer – Keep all rows from the first table (most common) and bring in values from the matching rows in the second table
  • Inner – Keep only matching rows from both tables
  • Full Outer – Keep all rows from both tables, merging rows where there is a match and leaving the other table’s columns blank where there is not

Click OK to complete the merge.
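
Behind the scenes, the merge step that Power Query generates is roughly equivalent to the following M, using the CustomerID, OrderDate, and Region columns from the earlier example. The query names (PrimaryTable, SecondaryTable) and the expanded columns (OrderAmount, Status) are placeholders:

    let
        // All three columns form the composite key; every one must match
        Merged = Table.NestedJoin(
            PrimaryTable, {"CustomerID", "OrderDate", "Region"},
            SecondaryTable, {"CustomerID", "OrderDate", "Region"},
            "SecondaryTable", JoinKind.LeftOuter
        ),
        // Expand only the columns you need from the nested table (step 5 below)
        Expanded = Table.ExpandTableColumn(Merged, "SecondaryTable", {"OrderAmount", "Status"})
    in
        Expanded

You never have to write this by hand; it is simply what the UI produces, and seeing it can help when auditing or tweaking the query later.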


5. Expand the Merged Data

  • A new column appears containing nested tables
  • Click the expand icon to select which columns to bring in
  • Remove unnecessary columns to keep the dataset clean

6. Load and Refresh

  • Click Close & Load
  • The merged dataset is now available in Excel
  • When source files change, simply click Refresh to update everything automatically

Key Benefits

  • Handles multi-column joins cleanly and reliably
  • Eliminates fragile lookup formulas
  • Fully refreshable and auditable
  • Scales well as data volume grows

In Summary

Using Power Query to merge Excel data on multiple columns brings database-style joins into Excel, making your workflows more robust, maintainable, and professional. Once set up, it saves time and reduces errors—especially for recurring reporting and analytics tasks.

Thanks for reading!

How to update your Power BI source file location

The location of your source files has changed, and now you need to update your Power BI report to use the new location. To update the directory or location of your source file, in Power BI Desktop, click Transform Data -> Data Source Settings.

Then click on the entry that corresponds to the path you need to update.

Update or entirely change the path and click OK. Then apply your changes.

It becomes a little more complicated when you are changing from a local folder to a SharePoint location, which we will cover in another post, but for changing the location of single files, it’s that simple.
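
If you prefer, you can also make the change inside the query itself: the path is just a text value in the query’s Source step, visible in the Power Query Editor (or the Advanced Editor). A typical Excel source step looks roughly like this; the path and sheet name are examples only:

    let
        // The full file path is a plain text value; update it when the file moves
        Source = Excel.Workbook(File.Contents("C:\Reports\Sales.xlsx"), null, true),
        SalesSheet = Source{[Item = "Sales", Kind = "Sheet"]}[Data],
        PromotedHeaders = Table.PromoteHeaders(SalesSheet, [PromoteAllScalars = true])
    in
        PromotedHeaders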

Thanks for reading!

Power BI load error: load was cancelled by error in loading a previous table

You may run into this error when loading data in Power BI:

"load was cancelled by error in loading a previous table"

If you do get this error, keep scrolling down to find the “inducing” error. This message indicates that something failed earlier in the load process, before the current table was reached. The real, initial error will be more descriptive. Resolve that error (or errors) first, and this one will go away.

I hope you found this helpful.

Power BI refresh error: Column ‘X’ in table ‘Y’ contains blank values and this is not allowed for columns on the one-side of a many-to-one relationship or for columns that are used as the primary key of a table

I was getting this error message when I attempted to refresh a Power BI application:

"Column 'Date' in table 'Date Dim' contains blank values and this is not allowed for columns on the one-side of a many-to-one relationship or for columns that are used as the primary key of a table"

However, despite what the message indicated, I double-checked and confirmed that I did not have any blank values in the ‘Date Dim’ table.

It turns out that you may also get this error (although it is incorrectly worded, in my opinion) if the blanks are in the column on the other side of the join. In my case, I had blanks in a ‘Snapshot Date’ column in the fact table that was joined to the ‘Date Dim’ table. Once these blanks were filled, the refresh ran without error.

One thing to look out for in these cases (it is what happened in mine): if your source is Excel, clear all filters before checking for blank values across your columns, because filtered-out rows can inadvertently hide the rows with the blanks and cause you to miss them.

I hope you found this helpful.