Category: Data Governance

Understanding Microsoft Fabric Shortcuts

Microsoft Fabric is a central platform for data and analytics, and one of the features that support its all-in-one design is Shortcuts. Shortcuts provide a simple way to unify data across multiple locations without duplicating or moving it. This is a big deal because it saves much of the time and effort usually involved in moving data around.

What Are Shortcuts?

Shortcuts are references (or “pointers”) to data that resides in another storage location. Instead of copying the data into Fabric, a shortcut lets you access and query it as if it were stored locally.

This is especially valuable in today’s data landscape, where data often spans OneLake, Azure Data Lake Storage (ADLS), Amazon S3, or other environments.

Types of Shortcuts

There are two types of shortcuts: table shortcuts and file shortcuts.

  1. Table Shortcuts
    • Point to existing tables in other Fabric workspaces or external sources.
    • Allow you to query and analyze the table without physically moving it.
  2. File Shortcuts
    • Point to files (e.g., Parquet, CSV, Delta Lake) stored in OneLake or other supported storage systems.
    • Useful for scenarios where files are your system of record, but you want to use them in Fabric experiences like Power BI, Data Engineering, or Data Science.
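
Once a shortcut exists, you work with it exactly as if the data were local. As a minimal sketch (assuming a Fabric notebook attached to a lakehouse, with a hypothetical table shortcut named sales_orders and a hypothetical file shortcut named landing_files), reading shortcut data might look like this:

    # Minimal sketch: reading shortcut data from a Fabric notebook attached to a lakehouse.
    # The shortcut names "sales_orders" and "landing_files" are hypothetical examples.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # a Fabric notebook already provides a session

    # A table shortcut appears under Tables/ and is queried like any local Delta table.
    orders = spark.sql("SELECT order_id, amount FROM sales_orders WHERE amount > 1000")
    orders.show()

    # A file shortcut appears under Files/ and is read like any local file path.
    raw = spark.read.parquet("Files/landing_files/orders.parquet")
    print(raw.count())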

Benefits of Shortcuts

Shortcuts are a genuinely useful feature, and here are some of their benefits:

  • No Data Duplication: Saves storage costs and avoids data sprawl.
  • Single Source of Truth: Data stays in its original location while being usable across Fabric.
  • Speed and Efficiency: Query and analyze external data in place, without lengthy ETL processes.
  • Flexibility: Works across different storage platforms and Fabric workspaces.

How and Where Shortcuts Can Be Created

  • In OneLake: You can create shortcuts directly in OneLake to link to data from ADLS Gen2, Amazon S3, or other OneLake workspaces.
  • In Fabric Experiences: Whether you are working in Data Engineering, Data Science, Real-Time Analytics, or Power BI, shortcuts can be created in lakehouses or KQL (Kusto Query Language) databases and used directly as data in OneLake. Any Fabric service can then use them without copying data from the source.
  • In Workspaces: Shortcuts make it possible to connect lakehouses stored in different workspaces, breaking down silos within an organization. Shortcuts can be created in a lakehouse or a KQL database.
  • Note that warehouses do not support the creation of shortcuts. However, from a warehouse you can query data stored within other warehouses and lakehouses.

How Shortcuts Can Be Used

  • Cross-Workspace Data Access: Analysts can query data in another team’s workspace without requesting a copy.
  • Data Virtualization: Data scientists can work with files stored in ADLS without having to move them into Fabric.
  • BI and Reporting: Power BI models can use shortcuts to reference external files or tables, enabling consistent reporting without duplication.
  • ETL Simplification: Instead of moving raw files into Fabric, engineers can create shortcuts and build transformations directly on the source.

Common Scenarios

  • A finance team wants to build Power BI reports on data stored by the operations team without moving the data.
  • A data scientist needs access to parquet files in Amazon S3 but prefers to analyze them within Fabric.
  • A company with multiple Fabric workspaces wants to centralize access to shared reference data (like customer or product master data) without replication.

In summary: Microsoft Fabric Shortcuts simplify data access across locations and workspaces. Whether table-based or file-based, they allow organizations to unify data without duplication, streamline analytics, and improve collaboration.

Here is a link to the Microsoft Learn OneLake documentation about Shortcuts, where you can explore all the Shortcut topics in more depth:

Thanks for reading! I hope you found this information useful.

Microsoft Fabric OneLake Catalog – description and links to resources

What is OneLake Catalog?

Microsoft Fabric OneLake Catalog is the next-generation, enhanced version of the OneLake Data Hub. It gives team members (data engineers, data scientists, analysts, business users, and other stakeholders) a single, intuitive place to browse, manage, and govern all their data. Its user interface is efficient and genuinely simplifies the way content in Fabric is managed, explored, and used. Usage is contextual: the Catalog unifies all Fabric item types (including Power BI items), integrates experiences, and provides detailed views of data subitems.

Why use OneLake Catalog?

The OneLake Catalog makes your work within Fabric easier, reduces duplication of items through improved discoverability, and enhances your ability to govern data objects within the platform. Check out the resources below to learn more.

Here is a link to a detailed Microsoft blog post introducing the OneLake Catalog:

And here is a link to a Microsoft Learn OneLake Catalog overview:

And finally, this is a link to a great, short (less than 5 min) video that gives an overview of the OneLake Catalog:

Thanks for reading! Good luck on your data journey!

Data Cleaning methods

Data cleaning is an essential step in the data preprocessing pipeline when preparing data for analytics or data science. It involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset to improve its quality and reliability. Data should be cleaned before it is used in analysis, reporting, development, or integration. Here are some common data cleaning methods:

Handling missing values:

  • Delete rows or columns with a high percentage of missing values if they don’t contribute significantly to the analysis.
  • Impute missing values by replacing them with a statistical measure such as mean, median, mode, or using more advanced techniques like regression imputation or k-nearest neighbors imputation.
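
As a concrete illustration of both approaches, here is a minimal pandas sketch; the column names and the 50% threshold are made up for the example.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 40, 31], "city": ["NY", "LA", None, "NY"]})

    # Drop columns where more than half the values are missing (threshold is illustrative).
    df = df.loc[:, df.isna().mean() <= 0.5].copy()

    # Impute remaining gaps: median for numeric columns, mode for categorical columns.
    df["age"] = df["age"].fillna(df["age"].median())
    df["city"] = df["city"].fillna(df["city"].mode()[0])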

Handling categorical variables:

  • Encode categorical variables into numerical representations using techniques like one-hot encoding, label encoding, or target encoding.
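
For example, one-hot encoding with pandas might look like this (the columns are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "blue", "red"], "price": [10, 12, 9]})

    # One-hot encoding: one indicator column per category value.
    encoded = pd.get_dummies(df, columns=["color"], prefix="color")
    print(encoded.columns.tolist())  # ['price', 'color_blue', 'color_red']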

Removing duplicates:

  • Identify and remove duplicate records based on one or more key variables.
  • Be cautious when removing duplicates, as sometimes duplicated entries may be valid and intentional.
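
A minimal pandas sketch of key-based de-duplication (the key column is hypothetical):

    import pandas as pd

    df = pd.DataFrame({"customer_id": [1, 1, 2], "email": ["a@x.com", "a@x.com", "b@x.com"]})

    # Remove rows duplicated on the chosen key column(s), keeping the first occurrence.
    deduped = df.drop_duplicates(subset=["customer_id"], keep="first")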

Handling outliers:

  • Identify outliers using statistical methods like z-scores, box plots, or domain knowledge.
  • Decide whether to remove outliers or transform them based on the nature of the data and the analysis goals.
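
For instance, the box-plot (IQR) rule mentioned above can be applied in pandas like this; the sample values are made up:

    import pandas as pd

    df = pd.DataFrame({"amount": [100, 102, 98, 5000, 101]})

    # Box-plot (IQR) rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = df["amount"].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    outliers = df[~in_range]  # review these before deciding whether to drop or cap them
    cleaned = df[in_range]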

Correcting inconsistent data:

  • Standardize data formats: Convert data into a consistent format (e.g., converting dates to a specific format).
  • Resolve inconsistencies: Identify and correct inconsistent values (e.g., correcting misspelled words, merging similar categories).
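
A small pandas sketch of both fixes (the values and the "shiped" misspelling are made up; format="mixed" requires pandas 2.0 or later):

    import pandas as pd

    df = pd.DataFrame({
        "order_date": ["2024-01-05", "05/02/2024", "Feb 7, 2024"],
        "status": ["Shipped", "shiped", "SHIPPED"],
    })

    # Standardize date formats into a single datetime type (format="mixed" needs pandas 2.0+).
    df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")

    # Resolve inconsistent category values: trim, lower-case, and correct known misspellings.
    df["status"] = df["status"].str.strip().str.lower().replace({"shiped": "shipped"})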

Dealing with irrelevant or redundant features:

  • Remove irrelevant features that do not contribute to the analysis or prediction task.
  • Identify and handle redundant features that provide similar information to avoid multicollinearity issues.
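
One common way to spot redundant numeric features is a correlation check; the sketch below drops one column from each highly correlated pair (the 0.95 threshold and the columns are illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "height_cm": [170, 180, 165, 175],
        "height_in": [66.9, 70.9, 65.0, 68.9],  # nearly a duplicate of height_cm
        "weight_kg": [70, 62, 90, 75],
    })

    # Keep only the upper triangle of the absolute correlation matrix to avoid counting pairs twice.
    corr = df.corr(numeric_only=True).abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

    to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
    reduced = df.drop(columns=to_drop)  # to_drop == ['height_in'] for this sample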

Data normalization or scaling:

  • Normalize numerical features to a common scale (e.g., min-max scaling or z-score normalization) to prevent certain features from dominating the analysis due to their larger magnitudes.
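
Both scalings can be expressed directly in pandas; the columns below are made up for the example:

    import pandas as pd

    df = pd.DataFrame({"income": [30000, 60000, 90000], "age": [25, 40, 55]})

    # Min-max scaling to the [0, 1] range.
    minmax = (df - df.min()) / (df.max() - df.min())

    # Z-score standardization (mean 0, standard deviation 1).
    zscored = (df - df.mean()) / df.std()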

Data integrity issues:

Finally, you need to address data integrity issues.

  • Check for data integrity problems such as inconsistent data types, incorrect data ranges, or violations of business rules.
  • Resolve integrity issues by correcting or removing problematic data.
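
As a simple illustration, the sketch below flags rows that violate basic rules; the columns and the bounds are placeholders for whatever your actual business rules are:

    import pandas as pd

    df = pd.DataFrame({
        "age": [34, -2, 51],
        "signup_date": ["2024-01-03", "2023-11-20", "2030-05-01"],
    })
    df["signup_date"] = pd.to_datetime(df["signup_date"])

    # Flag rows that violate simple business rules (bounds are illustrative).
    bad_age = df[~df["age"].between(0, 120)]
    future_signups = df[df["signup_date"] > pd.Timestamp.today()]
    print(len(bad_age), len(future_signups))  # counts of rows violating each rule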

It’s important to note that the specific data cleaning methods that need to be applied to a dataset will vary depending on the nature of the dataset, the analysis goals, and domain knowledge. It’s recommended to thoroughly understand the data and consult with domain experts when preparing to perform data cleaning tasks.

Oracle Error when table or other object has the same name as the schema name

We recently had a situation where a procedure ran fine in two environments but failed in a third. During debugging, we determined that the call succeeded if the schema prefix was removed from the procedure call and failed otherwise.

The following error was produced:

ERROR at line 1:
ORA-06550: line 1, column 14:
PLS-00302: component 'MyProcedure' must be declared
ORA-06550: line 1, column 7:
PL/SQL: Statement ignored

After some research, the DBA found a web post that indicated that this error is generated if you have an object with the same name as the schema.

You can check if you have any such objects by running this SQL command:

SQL> select * from dba_objects where object_name = 'YOUR_SCHEMA_NAME';

(where YOUR_SCHEMA_NAME is the actual name of your schema, in uppercase, since the data dictionary stores unquoted identifiers in uppercase)

If you do find such an object, rename it, or remove it if it is no longer needed. If it is a valid object that is in use, you will need to update every place that references it.

Thanks for reading! Good luck on your data journey!

Power BI Workspace roles

Power BI workspaces have four roles. Those roles, in order of increasing access and capabilities, are Viewer, Contributor, Member, and Admin. Before granting roles to users in your environment, it is best to have a solid understanding of what each role has access to and is capable of doing.

The breakdown below lists the capabilities of each role. As you will see, each role “absorbs” or “inherits” the capabilities of all the roles below it in the hierarchy – for example, a Contributor can do everything a Viewer can do plus more, and a Member can do everything a Contributor can do plus more.


The Power BI Workspace roles

Viewer
  • View dashboards, reports, and workbooks in the workspace
  • Read data from dataflows in the workspace
  • Row-level security applies to viewers

Contributor (everything the Viewer can do, plus)
  • Add, edit, and delete content in the workspace
  • Schedule refreshes and use the on-premises gateway within the workspace
  • Feature dashboards and reports from the workspace
  • Have access to the lineage view
  • Have full access to all datasets within the workspace

Member (everything the Contributor can do, plus)
  • Add other users as members, contributors, or viewers to the workspace
  • Publish and update the workspace app
  • Share and allow others to reshare items from the workspace
  • Feature the workspace app

Admin (everything the Member can do, plus)
  • Update and delete the workspace
  • Add and remove other users of any role from the workspace

A few things to keep in mind regarding roles:

  • Only the Member and Admin roles can perform access-related tasks and publish apps.
  • Both Member and Admin roles can update workspaces, but only the Admin role can delete.
  • By default, the Contributor role cannot update apps, but there is a workspace setting that allows Contributors to update apps.
  • Both the Member and Admin roles can add users, but only the Admin role can add other Admins. 
  • A Power BI Pro license is needed to be able to fully utilize the Admin role. 

This article was intended to be an easy read; more detailed information regarding Power BI roles can be found here on the Microsoft site.

Thanks for reading!

Quality Assurance (QA) for Data Projects or Data Applications

This post discusses Quality Assurance (QA) activities for data projects.

What is Quality Assurance (QA)? Simply put, QA – also called testing or validation – is about testing an application or solution to ensure that all the stated, promised, and expected requirements are met. It is a critically important activity for any software development or implementation effort, and data applications are no different: they need to be tested to ensure they work as intended.

QA stands between development and deployment, and it makes the difference between a delivered product and a high-quality delivered product.

There are a number of things to keep in mind when you plan your Quality Assurance activities for data solutions. I present some of them in this post as suggestions, considerations, or prompting questions. The things mentioned here will not apply to all data applications but can be used as a guide or a check.

People / Teams

The number of people and teams involved in a project will vary depending on the size, scope and complexity of the project.

The technical team building the application needs to perform an initial level of validation of the solution.

If there is a Quality Assurance team that performs the validation tasks, then that team will need to perform the “official” validation.

The business analysts and end users of the application also need to validate it. Where possible, involve as many end users as you efficiently can; the more real users you have testing the application, the better the chances of finding issues early.

Where it makes sense, Test IDs that simulate various types of users or groups should be used to help test various usage and security scenarios. This is particularly useful in automated testing.

On large projects where there is a lot to be tested, it is best to split the testing across multiple people or teams. This helps prevent tester fatigue and sloppy work, and results in higher-quality testing.

Plan ahead to ensure that access for all the relevant users is set up in the testing environments.

Communication

With all the teams and people involved, it is important to have a plan for how they will communicate. Things to consider and have a plan for include:

  • How will members communicate within a team? Email, Microsoft Teams, SharePoint, and shared files are some options.
  • How will the various teams involved communicate with each other? In other words, how will cross-team communication be handled? Again, email, Microsoft Teams, SharePoint, and shared files are options.
  • How will issues and status be communicated? Weekly meetings, status emails or documents, and files in shared spaces are options.
  • How will changes and resolutions be tracked? Files, SDLC applications, and change management applications are options.
  • How will teams and individuals be notified when they need to perform a task? Manual communication or automated notifications from tools are options.

Data

The most important thing to ensure in data projects is that the data is of high quality, particularly the “base” data set. If the base data is incorrect, everything built on top of it will be wrong. The correctness of intermediate and user-facing data is just as important, but validating the base data is the foundation for getting correct data everywhere else.

  • Ensure that table counts, field counts, and row counts of key data are correct (a simple row-count comparison sketch follows this list).
  • Does the data warehouse data match the source data?
  • Test detailed, low-level records with small samples of data.
  • Test to ensure that the data and the values conform to what is expected. For example, ensure that there is no data older than three years, or that there are no account values outside a certain range. The Data Governance team may become involved in these activities across all projects.
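
Here is a minimal sketch of the row-count comparison mentioned in the first bullet. The table list and the count-gathering function are placeholders for whatever your environment provides (a SQL query, an API call, a metadata export, and so on).

    # Minimal sketch: compare row counts between the source system and the data warehouse.
    # get_row_count() and the sample numbers are placeholders for a real lookup.
    def get_row_count(system: str, table: str) -> int:
        sample_counts = {
            ("source", "orders"): 120_345,
            ("warehouse", "orders"): 120_345,
            ("source", "customers"): 8_912,
            ("warehouse", "customers"): 8_910,
        }
        return sample_counts[(system, table)]

    for table in ["orders", "customers"]:
        src = get_row_count("source", table)
        wh = get_row_count("warehouse", table)
        status = "OK" if src == wh else "MISMATCH"
        print(f"{table}: source={src:,} warehouse={wh:,} -> {status}")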

Next in line is the “intermediate” data such as derived metrics, aggregates, specialized subsets, and more. These will also need to be verified.

  • Are the calculated values correct?
  • Are the aggregates correct? Test aggregate data with small, medium and large sets of data
  • Verify metric calculations

Then the user-facing data or data prepared for self-service usage needs to be validated.

  • Does the data on the dashboard match the data in the database?
  • Are the KPIs correctly reflecting the status?

Test the full flow of the data. The validity of the data should be verified at each stage of the data flow – from the source, to the staging, to the final tables in the data warehouse, to aggregates or subsets, to the dashboard.

Take snapshots of key datasets or reports so you can compare results post data migration.

Some additional data prep might be needed in some cases.

  • These include making sure that you have sourced adequate data for testing. For example, if you need to test an annual trend, then it might be best to have at least a year’s worth of data, preferably two.
  • You may need to scramble or redact some data for testing. Often Test data is taken from the Production environment and then scrambled and/or redacted in order to not expose sensitive information.
  • You may need to temporarily load in data for testing. For various reasons, you may need to load some Production data into the QA environment just to test the solution or a particular feature and then remove the data after the testing is complete. While this can be time consuming, sometimes it’s necessary, and it’s good to be aware of the need early and make plans accordingly.

Aesthetics & Representation of Data

Presentation matters. Although the most critical thing is data correctness, how the data is presented is also very important. Good presentation helps with understanding, usability, and adoption. A few things to consider include:

  • Does the application, such as a dashboard, look good? Does it look right?
  • Are the components laid out properly so that there is no overcrowding?
  • Are the logos, colors and fonts in line with company expectations?
  • Are proper chart options used to display the various types of data and metrics?
  • Is the information provided in a way that users can digest?

Usage

The data application or solution should be user friendly – preferably intuitive, or at least well documented. The data must be useful to the intended audience, in that it should help them understand the information and make good decisions or take sensible actions based on it.

The application should present data in a manner that is effective – easy to access, and easy to understand.

The presentation should satisfy the analytic workflows of the various users. Users should be able to logically step through the application to find information at the appropriate level of detail that they need based on their role.

A few things that affect usability include:

  • Prompts – ensure that all the proper prompts or selections are available to users to slice and filter the data as necessary. And of course, verify that they work.
  • Drill downs and drill throughs – validate that users can drill down and across the data to find the information they need in a simple, logical manner.
  • Easy interrogation of the data – if the application is ad-hoc in nature, validate that users can navigate it or at least verify that the documentation is comprehensive enough for users to follow.

Security

Securing the application and its data so that only authorized users have access to it is critical.

Application security comprises “authentication” – access to the application – and “authorization” – what a user is authorized to do when he or she accesses the application.

Authorization (what a user is authorized to do within the application) can be broken into “object security” – what objects or features a user has access to, and “data security” – what data elements a user has access to within the various objects or features.

For example, a user has access to an application (authenticated / can log in), and within the application the user has access to (authorized to see and use) 3 of 10 reports (object-level security). The user is not authorized to see the other 7 reports (object-level security) and, therefore, will not have access to them. Now, within the 3 reports that the user has access to, he or she can only see data related to 1 of 5 departments (data-level security).

All object-level and data-level security needs to be validated, and this includes negative testing: not only should you test that users have the access they need, but you should also ensure that users do not have access they should not have.

  • Data for testing should be scrambled or redacted as appropriate to protect it.
  • Some extremely sensitive data may need to be filtered out entirely.
  • Can all the appropriate users access the application?
  • Are non-authorized users blocked from accessing the application?
  • Can users see the data they need to see to perform their jobs?

Performance

Performance of the data solution is important to user efficiency and user adoption. If users cannot get the results they need in a timely manner, they will look elsewhere to get what they need. Even if they have no choice, a poorly performing application will result in wasted time and dollars.

A few things to consider for ensuring quality around performance:

  • Application usage – is the performance acceptable? Do the results get returned in an acceptable time?
  • Data Integration – is the load performance acceptable?
  • Data processing – can the application perform all the processing it needs to do in a reasonable amount of time?
  • Stress Testing – how is performance with many users? How is it with a lot of data?
  • How is performance with various selections, or with no selections at all?
  • Is ad-hoc usage set up to be flexible while avoiding rogue analyses that could cripple the system?
  • Is real-time analysis needed and is the application quick enough?

These items need to be validated and any issues need to be reported to the appropriate teams for performance tuning before the application is released for general usage.

Methodology

Each organization, and even each team within an organization, will have a preferred methodology for application development and change management, including how they perform QA activities.

Some things to consider include:

  • Get QA resources involved in projects early so that they gain an early understanding of the requirements and the solutions to assess and plan how best to test.
  • When appropriate, do not wait until all testing is complete before notifying development teams of issues discovered. Notifying them early can make the difference between your project being on time or late.
  • Create a test plan and test scripts – even if they are high-level.
  • Where possible, execute tasks in an agile, iterative manner.
  • Each environment will have unique rules and guidelines that need to be validated. For example, your application may have a special naming convention, color & font guidelines, special metadata items, and more. You need to validate that these rules and guidelines are followed.
  • Use a checklist to ensure that you validate with consistency from deliverable to deliverable
  • When the solution being developed is replacing an existing system or dataset, use the new and old solutions in parallel to validate the new against the old.
  • Document test results. All testing participants should document what has been tested and the results. This may be as simple as a checkmark or a “Done” status, but may also include things like data entered, screenshots, results, errors, and more.
  • Update the appropriate tracking tools (such as your SDLC or Change Management tools) to document changes and validation. These tools will vary from company to company, but it is best to have a trail of the development, testing, and release to production.
  • For each company and application, there will be a specific, unique set of things that need to be done. It is best to have a standard test plan or test checklist to help you confirm that you have tested all important aspects and scenarios of the application.

This is not an all-encompassing coverage of Quality Assurance for data solutions, but I hope the article gives you enough information to get started or tips for improving what you currently have in place. You can share your questions, thoughts and input via comments to this post. Thanks for reading!

How to generate detailed Oracle BI (OBIEE) Repository Documentation

In this post, I will show the steps for using the OBIEE “Repository Documentation” utility to generate repository (RPD) lineage information. I will also provide a couple of examples of how this documentation (output file) can be used.

To access and run the Repository Documentation utility,  from the BI Admin Tool menu, select Tools -> Utilities.

[Screenshot: BI Admin Tool menu – Tools > Utilities]

From the Utilities dialog, select “Repository Documentation”, and click “Execute…”

[Screenshot: Utilities dialog]

In the “Save As” dialog, select the destination and enter the name you would like for the output file.

[Screenshot: Save As dialog for the repository documentation output]

When it finishes, the utility generates the output CSV file. Note that this will likely be a large file, since it contains all your repository objects.

[Screenshot: sample repository documentation output file]

The RPD documentation file will contain the following columns:
Subject Area, Presentation Table, Presentation Column, Description – Presentation Column, Business Model, Derived logical table, Derived logical column, Description – Derived Logical Column, Expression, Logical Table, Logical Column, Description – Logical Column, Logical Table Source, Expression, Initialization Block, Variable, Database, Physical Catalog, Physical Schema, Physical Table, Alias, Physical Column, Description – Physical Column

You can use this file to quickly track lineage from physical sources to logical columns to presentation columns, and to identify all the logic and variables in between.
You can also use it to identify where and how much a given table, column, or variable is used, which helps you identify dependencies and understand the effect of changing or deleting elements.
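
For example, assuming the utility produced a comma-delimited file named repository_documentation.csv with the column headings listed above, a quick dependency check for a single physical column might look like this (the file name and the ORDER_AMOUNT value are placeholders):

    import pandas as pd

    # Load the Repository Documentation output (file name is a placeholder).
    rpd = pd.read_csv("repository_documentation.csv")

    # Trace lineage for one physical column: which logical and presentation columns depend on it?
    hits = rpd[rpd["Physical Column"] == "ORDER_AMOUNT"]
    print(hits[["Subject Area", "Presentation Column", "Logical Column", "Physical Table"]])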

Development, Data Governance, and Quality Assurance teams may find this information useful in this format.

Oracle EBS Payables Data Flow and Data Model

This post gives a quick overview of the Oracle Payables data flow and data model for some of the most used tables in the module. It is not a complete coverage of the topic, but it aims to give a general idea of how records are stored in and move through the module.

[Diagram: Oracle Payables data flow and data model]

In Oracle Payables, before an AP transaction can be generated, a supplier must exist in the system. When suppliers are created, records are created in AP_SUPPLIERS and AP_SUPPLIER_SITES. Invoices are saved in AP_INVOICES (invoice header records) and AP_INVOICE_LINES (invoice line records).

Payments generate records in AP_INVOICE_PAYMENTS and AP_PAYMENTS_SCHEDULE.  These tables will have 1 record for an invoice if the invoice is paid using a single payment, but will have multiple records for an invoice if the invoice is paid in installments or with more than one payment.

When an invoice is approved and when it is paid, accounting transactions are generated. These transactions make their way to the GL_INTERFACE table via the Payables Transfer to General Ledger process. In addition to interfacing with GL, Payables also interfaces with many other Oracle modules, such as Purchasing, Assets, Projects, and Inventory (these are not covered in this post).

Invoices can be entered into Oracle Payables from external systems via the AP_INVOICES_INTERFACE and AP_INVOICE_LINES_INTERFACE tables. When the Payables Open Interface Import program is run, the records are brought into the regular header and lines tables (AP_INVOICES and AP_INVOICE_LINES).

When payments are made, the information about the payment is recorded in AP_CHECKS.

These tables are some of the sources for the star schemas (dimension and fact tables) used in the Oracle Business Intelligence Applications (OBIA) Financials module, in particular the Payables dashboards and analyses.

Oracle E-Business Suite (EBS) List of Flexfields

This post is simply a table listing the Key Flexfields (KFFs) in Oracle E-Business Suite (EBS), also known as Oracle Applications. The table contains each flexfield along with its code, owning application, and base table.

Flexfields are commonly used in OBIEE reporting for EBS source systems. Flexfields are also an integral part of the Oracle Business Analytics Warehouse (OBAW) data model. It is worthwhile to be familiar with them.

#  | Name                          | Code | Owning Application                        | Table Name
1  | Account Aliases               | MDSP | Oracle Inventory                          | MTL_GENERIC_DISPOSITIONS
2  | Accounting Flexfield          | GL#  | Oracle General Ledger                     | GL_CODE_COMBINATIONS
3  | Activity Flexfield            | FEAC | Enterprise Performance Foundation         |
4  | AHL Route                     | AHLR | Complex Maintenance, Repair, and Overhaul |
5  | Asset Key Flexfield           | KEY# | Oracle Assets                             | FA_ASSET_KEYWORDS
6  | Bank Details KeyFlexField     | BANK | Oracle Payroll                            | PAY_EXTERNAL_ACCOUNTS
7  | CAGR Flexfield                | CAGR | Oracle Human Resources                    |
8  | Category Flexfield            | CAT# | Oracle Assets                             | FA_CATEGORIES
9  | Competence Flexfield          | CMP  | Oracle Human Resources                    |
10 | Cost Allocation Flexfield     | COST | Oracle Payroll                            | PAY_COST_ALLOCATION_KEYFLEX
11 | Grade Flexfield               | GRD  | Oracle Human Resources                    | PER_GRADE_DEFINITIONS
12 | Item Catalogs                 | MICG | Oracle Inventory                          | MTL_ITEM_CATALOG_GROUPS
13 | Item Categories               | MCAT | Oracle Inventory                          | MTL_CATEGORIES
14 | Job Flexfield                 | JOB  | Oracle Human Resources                    | PER_JOB_DEFINITIONS
15 | Location Flexfield            | LOC# | Oracle Assets                             | FA_LOCATIONS
16 | Oracle Service Item Flexfield | SERV | Oracle Inventory                          | MTL_SYSTEM_ITEMS
17 | People Group Flexfield        | GRP  | Oracle Payroll                            | PAY_PEOPLE_GROUPS
18 | Personal Analysis Flexfield   | PEA  | Oracle Human Resources                    | PER_ANALYSIS_CRITERIA
19 | Position Flexfield            | POS  | Oracle Human Resources                    | PER_POSITION_DEFINITIONS
20 | Public Sector Budgeting       | BPS  | Oracle Public Sector Budgeting            |
21 | Sales Orders                  | MKTS | Oracle Inventory                          | SALES_ORDER_ID
22 | Sales Tax Location Flexfield  | RLOC | Oracle Receivables                        | AR_LOCATION_COMBINATIONS
23 | Soft Coded KeyFlexfield       | SCL  | Oracle Human Resources                    | HR_SOFT_CODING_KEYFLEX
24 | Stock Locators                | MTLL | Oracle Inventory                          | MTL_ITEM_LOCATIONS
25 | System Items                  | MSTK | Oracle Inventory                          | MTL_SYSTEM_ITEMS
26 | Territory Flexfield           | CT#  | Oracle Receivables                        | RA_TERRITORIES
27 | Training Resources            | RES  | Oracle Learning Management                |