Category: Data Integration

Data Cleaning Methods

Data cleaning is an essential step in the data preprocessing pipeline when preparing data for analytics or data science. It involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset to improve its quality and reliability, and it should be done before the data is used for analysis, reporting, development, or integration. Here are some common data cleaning methods:

Handling missing values:

  • Delete rows or columns with a high percentage of missing values if they don’t contribute significantly to the analysis.
  • Impute missing values by replacing them with a statistical measure such as mean, median, mode, or using more advanced techniques like regression imputation or k-nearest neighbors imputation.
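
As an illustration, here is a minimal pandas sketch of both approaches; the file name, column names, and the 50% threshold are placeholders for this example, not values from the post.

#-----------------------------------------------
# Minimal sketch: dropping and imputing missing values with pandas.
# Assumes a hypothetical orders.csv with a key column 'customer_id',
# a numeric column 'order_total', and a categorical column 'region'.
import pandas as pd

df = pd.read_csv("orders.csv")

# Drop columns where more than 50% of the values are missing
df = df.loc[:, df.isna().mean() <= 0.5]

# Drop rows that are missing the key identifier
df = df.dropna(subset=["customer_id"])

# Impute remaining gaps: median for numeric, mode for categorical
df["order_total"] = df["order_total"].fillna(df["order_total"].median())
df["region"] = df["region"].fillna(df["region"].mode().iloc[0])
#-----------------------------------------------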

Handling categorical variables:

  • Encode categorical variables into numerical representations using techniques like one-hot encoding, label encoding, or target encoding.
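
A minimal pandas sketch of one-hot and label encoding; the region and priority columns are made up for the example.

#-----------------------------------------------
# Minimal sketch: one-hot and label encoding with pandas.
import pandas as pd

df = pd.DataFrame({"region": ["West", "South", "West"],
                   "priority": ["Low", "High", "Medium"]})

# One-hot encoding: one 0/1 column per category value
df = pd.get_dummies(df, columns=["region"], prefix="region")

# Label encoding: map each category to an integer code
df["priority_code"] = df["priority"].astype("category").cat.codes
#-----------------------------------------------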

Removing duplicates:

  • Identify and remove duplicate records based on one or more key variables.
  • Be cautious when removing duplicates, as sometimes duplicated entries may be valid and intentional.
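
A minimal pandas sketch that inspects suspected duplicates before dropping them; the key columns are assumptions for the example.

#-----------------------------------------------
# Minimal sketch: flag and remove duplicates on assumed key columns.
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical input

# Inspect the suspected duplicates before deleting anything
dupes = df[df.duplicated(subset=["customer_id", "order_date"], keep=False)]
print(dupes.sort_values(["customer_id", "order_date"]))

# Keep only the first occurrence once the duplicates are confirmed invalid
df = df.drop_duplicates(subset=["customer_id", "order_date"], keep="first")
#-----------------------------------------------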

Handling outliers:

  • Identify outliers using statistical methods like z-scores, box plots, or domain knowledge.
  • Decide whether to remove outliers or transform them based on the nature of the data and the analysis goals.
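
A minimal pandas sketch of the z-score approach; the column name and the commonly used threshold of 3 are assumptions for the example.

#-----------------------------------------------
# Minimal sketch: identify outliers with z-scores, then remove or cap them.
import pandas as pd

df = pd.read_csv("orders.csv")   # hypothetical input
col = df["order_total"]          # hypothetical numeric column

z = (col - col.mean()) / col.std()
print(f"{(z.abs() > 3).sum()} potential outliers found")

# Option 1: remove them
df_clean = df[z.abs() <= 3]

# Option 2: cap them at the 1st/99th percentiles instead of removing
df["order_total"] = col.clip(lower=col.quantile(0.01), upper=col.quantile(0.99))
#-----------------------------------------------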

Correcting inconsistent data:

  • Standardize data formats: Convert data into a consistent format (e.g., converting dates to a specific format).
  • Resolve inconsistencies: Identify and correct inconsistent values (e.g., correcting misspelled words, merging similar categories).
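
A minimal pandas sketch of standardizing a date column and merging inconsistent category values; the column names and the mapping are made up for the example.

#-----------------------------------------------
# Minimal sketch: standardize date formats and clean up category values.
import pandas as pd

df = pd.read_csv("orders.csv")   # hypothetical input

# Convert mixed date strings to a single datetime representation
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Normalize text, then merge misspelled or similar categories
df["channel"] = df["channel"].str.strip().str.lower()
df["channel"] = df["channel"].replace({"call centre": "call center",
                                       "in store": "in-store"})
#-----------------------------------------------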

Dealing with irrelevant or redundant features:

  • Remove irrelevant features that do not contribute to the analysis or prediction task.
  • Identify and handle redundant features that provide similar information to avoid multicollinearity issues.
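
A minimal pandas sketch for flagging highly correlated (likely redundant) numeric features; the 0.95 threshold is an arbitrary example value.

#-----------------------------------------------
# Minimal sketch: report pairs of numeric columns that are almost
# perfectly correlated and therefore likely redundant.
import pandas as pd

df = pd.read_csv("orders.csv")   # hypothetical input
corr = df.select_dtypes("number").corr().abs()

for a in corr.columns:
    for b in corr.columns:
        if a < b and corr.loc[a, b] > 0.95:
            print(f"{a} and {b} look redundant (corr={corr.loc[a, b]:.2f})")
#-----------------------------------------------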

Data normalization or scaling:

  • Normalize numerical features to a common scale (e.g., min-max scaling or z-score normalization) to prevent certain features from dominating the analysis due to their larger magnitudes.
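
A minimal pandas sketch of both scaling approaches; the column name is a placeholder.

#-----------------------------------------------
# Minimal sketch: min-max scaling and z-score normalization.
import pandas as pd

df = pd.read_csv("orders.csv")   # hypothetical input
col = df["order_total"]

df["order_total_minmax"] = (col - col.min()) / (col.max() - col.min())
df["order_total_zscore"] = (col - col.mean()) / col.std()
#-----------------------------------------------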

Addressing data integrity issues:

  • Check for data integrity problems such as inconsistent data types, incorrect data ranges, or violations of business rules.
  • Resolve integrity issues by correcting or removing problematic data.
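
A minimal pandas sketch of a few such checks; the specific rules shown are examples only, not rules from this post.

#-----------------------------------------------
# Minimal sketch: simple integrity checks for types, ranges, and business rules.
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date", "ship_date"])  # hypothetical input

problems = []
if not pd.api.types.is_numeric_dtype(df["order_total"]):
    problems.append("order_total is not numeric")
if (df["order_total"] < 0).any():
    problems.append("negative order totals found")
if (df["ship_date"] < df["order_date"]).any():
    problems.append("ship_date earlier than order_date")

print(problems or "no integrity issues detected")
#-----------------------------------------------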

It’s important to note that the specific data cleaning methods to apply will vary depending on the nature of the dataset, the analysis goals, and domain knowledge. It’s recommended that you thoroughly understand the data and consult with domain experts before performing data cleaning tasks.

What is data analytics? And what are the different types of data analytics?

Data analytics is the overall process of capturing and using data to produce meaningful information, including metrics and trends, that can be used to better understand events and help make better decisions. Usually the goal is to improve the efficiency and outcomes of an operation, such as a business, a political campaign, or even an individual (such as an athlete). There are four (4) prevalent types of data analytics – descriptive, predictive, diagnostic, and prescriptive.

  1. Descriptive analytics – provides information about “what has happened”. Examples of questions answered by descriptive analytics include: How much are our sales this month and what is our year-over-year sales increase? How many website visitors did we have and how many signups?
  2. Predictive analytics – provides insight into “what may happen” in the future based on the past. Examples of questions answered by predictive analytics include: Based on previous customer service call patterns and outcomes, what is the likelihood of a customer switching to another provider? Based on a customer’s profile, how much should we charge them for insurance?
  3. Diagnostic analytics – provides information to explain “why something happened”. In addition to the direct data, this may also involve more indirect or macro data sources, such as weather data, local or national economic data, or competitor data. And it may also involve forming logical theories about the correlation of events. Examples of questions answered by diagnostic analytics include: How effective was the marketing blitz and which channel had the most impact? Did the weather affect sales or was it the price increase?
  4. Prescriptive analytics – provides insight into “what to do to make something happen”. Examples of questions answered by prescriptive analytics include: Based on the results of our test marketing blitz campaign, if we roll out the full campaign with adjustments to the channel spread, how many additional temporary customer service staff will we need to handle the increased volume without long wait times?
The four (4) types of data analytics

Descriptive analytics is the simplest and most common form of analytics used in organizations and is widely referred to as Business Intelligence (BI). There is widespread interest in predictive analytics but less than 50% of companies currently use it as it requires additional, more expensive skills. Diagnostic and prescriptive analytics have always been around because companies have always used information from descriptive analytics to hypothesize “why things happened” and make decisions on “what to do”. But it’s the automation of these types through new methods and the integration of more data inputs that is fairly new. The latter three forms are sometimes called Advanced Analytics or Data Science.

All types of analytics require some form of data integration and use some of the same data in an environment. Descriptive analytics, however, only needs data from the time periods being analyzed, and usually from a narrower data set, while predictive, prescriptive, and diagnostic analytics produce better results using as much data as is available, from a wider timeframe and a broader set of sources. The types also overlap: the analysis of “what may happen” is driven by “what has happened” in the past and “why it happened”, and determining “what to do” is driven by “what has happened”, “why it happened”, and “what may happen”. Companies on the forefront of data analytics will tend to use all four types.

BI Application getting ORA-02391 error

Last week we rolled out a new dashboard that uses a new data source.
In one of our BI environments, the application was throwing an error:
“ORA-02391: exceeded simultaneous SESSIONS_PER_USER limit at OCI call OCISessionBegin”

This is an Oracle Database error, and not an error directly from the BI Application.

For the “ORA-02391: exceeded simultaneous SESSIONS_PER_USER limit” error:
Cause: An attempt was made to exceed the maximum number of concurrent sessions allowed by the SESSIONS_PER_USER clause of the user profile.
Action: End one or more concurrent sessions, or ask the database administrator to increase the SESSIONS_PER_USER limit of the user profile.

It turns out the SESSIONS_PER_USER parameter was set too low; it was set to 3 for the user that the BI application uses to access the database. The same error could also be observed from an ETL tool accessing the database with an ID that has the same limit.

One of the DBAs bumped this parameter up to 30 for the user, and that resolved the issue.
We requested that this change be made to the BI application databases in all the environments – Development, Test, QA, and Production.

Although all seems to be well, we will now monitor to see how many sessions the application is using and if there is any negative impact on the source application. This will allow us to determine if we need to make any other adjustments.
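
Here is a rough sketch of how that monitoring could be scripted, assuming the python-oracledb driver and a monitoring account with SELECT access to V$SESSION; the connection details are placeholders.

#-----------------------------------------------
# Rough sketch: count sessions per database user so we can watch how close
# the BI application user gets to its SESSIONS_PER_USER limit.
# Connection details are placeholders; requires SELECT access to V$SESSION.
import oracledb

conn = oracledb.connect(user="monitor_user", password="********",
                        dsn="bi-db-host/ORCLPDB1")
with conn.cursor() as cur:
    cur.execute("""
        SELECT username, COUNT(*) AS session_count
        FROM   v$session
        WHERE  username IS NOT NULL
        GROUP  BY username
        ORDER  BY session_count DESC""")
    for username, session_count in cur:
        print(f"{username}: {session_count} sessions")
conn.close()
#-----------------------------------------------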

Thanks for reading. I hope you found this information useful.

Quality Assurance (QA) for Data Projects or Data Applications

This post discusses Quality Assurance (QA) activities for data projects.

What is Quality Assurance (QA)?  Simply put, Quality Assurance, also called QA, Testing or Validation, is about testing an application or solution to ensure that all the stated/promised/expected requirements are met. It is a critically important activity for all software application development or implementations. Data applications are no different. They need to be tested to ensure they work as intended.

QA stands between development and deployment. And QA makes the difference between a delivered product and a high quality delivered product.

There are a number of things to keep in mind when you plan your Quality Assurance activities for data solutions. I present some of them in this post as suggestions, considerations, or prompting questions. The things mentioned here will not apply to all data applications but can be used as a guide or a check.

People / Teams

The number of people and teams involved in a project will vary depending on the size, scope and complexity of the project.

The technical team building the application needs to perform an initial level of validation of the solution.

If there is a Quality Assurance team that performs the validation tasks, then that team will need to perform the “official” validation.

The business analysts and end-users of the application also need to validate. Where possible, involve as many end users as you efficiently can. The more real users you have testing the application, the better the chances of finding issues early.

Where it makes sense, Test IDs that simulate various types of users or groups should be used to help test various usage and security scenarios. This is particularly useful in automated testing.

On large projects where there is a lot to be tested, it is best to break up the testing across multiple people or teams. This will help to prevent testing fatigue and sloppy testing and result in higher quality testing.

Plan ahead to ensure that access for all the relevant users is set up in the testing environments.

Communication

With all the teams and people involved, it is important to have a plan for how they will communicate. Things to consider and have a plan for include:

  • How will team members communicate within their own team? Email, Microsoft Teams, SharePoint, and shared files are some options.
  • How will the various teams involved communicate with each other? In other words, how will cross-team communication be handled? Again, Email, Microsoft Teams, SharePoint, and shared files are some options.
  • How will issues and status be communicated? Weekly meetings, Status emails or documents, Shared files available on shared spaces are options.
  • How will changes and resolutions be tracked? Files, SDLC applications, Change Management applications are options.
  • How will teams and individuals be notified when they need to perform a task? Manual communication or automated notifications from tools are options.

Data

The most important thing to ensure in data projects is that the data is high quality, particularly the “base” data set. If the base data is incorrect, everything built on top of it will be wrong. The correctness of intermediate and user-facing data is just as important, but validating the base data is critical to getting correct data everywhere else.

  • Ensure that table counts, field counts and row counts of key data are correct.
  • Does the data warehouse data match the source data?
  • Test detailed, low-level records with small samples of data.
  • Test to ensure that the data and the values conform to what is expected. For example, ensuring that there is no data older than 3 years old, or ensuring that there are no account values outside a certain range. The Data Governance Team may become involved in these activities across all projects.
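
One way to automate the count checks above is a small reconciliation script, as sketched below. This is a rough sketch only: get_source_conn() and get_dw_conn() stand in for whatever DB-API connections you use, and the table mapping is a made-up example.

#-----------------------------------------------
# Rough sketch: compare row counts between source tables and their
# warehouse counterparts. The connection helpers and table names are
# hypothetical placeholders.
TABLE_MAP = {"ORDERS": "W_ORDERS_F", "CUSTOMERS": "W_CUSTOMER_D"}

def row_count(conn, table):
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    return cur.fetchone()[0]

src, dw = get_source_conn(), get_dw_conn()   # hypothetical connection helpers
for src_table, dw_table in TABLE_MAP.items():
    s, d = row_count(src, src_table), row_count(dw, dw_table)
    status = "OK" if s == d else "MISMATCH"
    print(f"{src_table} -> {dw_table}: source={s}, warehouse={d} [{status}]")
#-----------------------------------------------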

Next in line is the “intermediate” data such as derived metrics, aggregates, specialized subsets, and more. These will also need to be verified.

  • Are the calculated values correct?
  • Are the aggregates correct? Test aggregate data with small, medium, and large sets of data.
  • Verify metric calculations.
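
A sketch of how the aggregate checks above could be automated with pandas; the extract file names and the order_date / order_total columns are assumptions for illustration.

#-----------------------------------------------
# Rough sketch: re-derive a monthly aggregate from detail rows and compare
# it to the aggregate the application produced. Column names are assumed.
import pandas as pd

detail = pd.read_csv("order_detail_extract.csv")           # hypothetical extract
published = pd.read_csv("order_monthly_agg_extract.csv")   # hypothetical extract keyed by 'month'

detail["month"] = pd.to_datetime(detail["order_date"]).dt.to_period("M").astype(str)
recomputed = detail.groupby("month", as_index=False)["order_total"].sum()

check = recomputed.merge(published, on="month", suffixes=("_recomputed", "_published"))
check["diff"] = check["order_total_recomputed"] - check["order_total_published"]
print(check[check["diff"].abs() > 0.01])   # months where the aggregate is off
#-----------------------------------------------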

Then the user-facing data or data prepared for self-service usage needs to be validated.

  • Does the data on the dashboard match the data in the database?
  • Are the KPIs correctly reflecting the status?

Test the full flow of the data. The validity of the data should be verified at each stage of the data flow – from the source, to the staging, to the final tables in the data warehouse, to aggregates or subsets, to the dashboard.

Take snapshots of key datasets or reports so you can compare results post data migration.
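
A minimal sketch of that snapshot-and-compare idea; run_report_query() is a hypothetical function that returns the report's result set as a DataFrame.

#-----------------------------------------------
# Rough sketch: snapshot a key report before a migration, then compare
# headline figures afterwards. run_report_query() is hypothetical.
import pandas as pd

# Before the migration: save the baseline
before = run_report_query()
before.to_csv("sales_report_snapshot.csv", index=False)

# After the migration: pull the same report and compare against the snapshot
after = run_report_query()
before = pd.read_csv("sales_report_snapshot.csv")
if len(after) != len(before) or abs(after["order_total"].sum() - before["order_total"].sum()) > 0.01:
    print("Post-migration results differ from the pre-migration snapshot")
else:
    print("Post-migration results match the snapshot")
#-----------------------------------------------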

Some additional data prep might be needed in some cases.

  • These include making sure that you have sourced adequate data for testing. For example, if you need to test an annual trend, then it might be best to have at least a year’s worth of data, preferably two.
  • You may need to scramble or redact some data for testing. Often, test data is taken from the Production environment and then scrambled and/or redacted so that sensitive information is not exposed.
  • You may need to temporarily load in data for testing. For various reasons, you may need to load some Production data into the QA environment just to test the solution or a particular feature and then remove the data after the testing is complete. While this can be time consuming, sometimes it’s necessary, and it’s good to be aware of the need early and make plans accordingly.

Aesthetics & Representation of Data

Presentation matters. Although the most critical thing is data correctness, how the data is presented is also very important. Good presentation helps with understanding, usability, and adoption. A few things to consider include:

  • Does the application, such as dashboard, look good?  Does it look right? 
  • Are the components laid out properly so that there is no overcrowding?
  • Are the logos, colors and fonts in line with company expectations?
  • Are proper chart options used to display the various types of data and metrics?
  • Is the information provided in a way that users can digest?

Usage

The data application or solution should be user friendly – preferably intuitive, or at least well documented. The data must be useful to the intended audience, in that it should help them understand the information and make good decisions or take sensible actions based on it.

The application should present data in a manner that is effective – easy to access, and easy to understand.

The presentation should satisfy the analytic workflows of the various users. Users should be able to logically step through the application to find information at the appropriate level of detail that they need based on their role.

A few things that affect usability include:

  • Prompts – ensure that all the proper prompts or selections are available to users to slice and filter the data as necessary. And of course, verify that they work.
  • Drill downs and drill throughs – validate that users can drill-down and across data to find the information they need in a simple, logical manner.
  • Easy interrogation of the data – if the application is ad-hoc in nature, validate that users can navigate it or at least verify that the documentation is comprehensive enough for users to follow.

Security

Securing the application and its data so that only authorized users have access to it is critical.

Application security comprises “authentication” – access to the application, and “authorization” – what a user is authorized to do when he or she accesses the application.

Authorization (what a user is authorized to do within the application) can be broken into “object security” – what objects or features a user has access to, and “data security” – what data elements a user has access to within the various objects or features.

For example, a user has access to an application (authenticated / can log in), and within the application the user has access to (authorized to see and use) 3 of 10 reports (object-level security). The user is not authorized to see the other 7 reports (object-level security) and, therefore, will not have access to them. Now, within the 3 reports that the user has access to, he or she can only see data related to 1 of 5 departments (data-level security).

All object-level and data-level security needs to be validated. This includes negative testing: test not only that users have the access they need, but also that they do not have access they should not have.

  • Data for testing should be scrambled or redacted as appropriate to protect it.
  • Some extremely sensitive data may need to be filtered out entirely.
  • Can all the appropriate users access the application?
  • Are non-authorized users blocked from accessing the application?
  • Can users see the data they should be able to see to perform their jobs?
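
As a rough sketch of what parameterized positive and negative access checks could look like: the can_access() helper and the user/report names below are hypothetical, and you would wire this to however your BI platform exposes authorization checks.

#-----------------------------------------------
# Rough sketch: data-driven positive and negative access tests.
# can_access(user, report) is a hypothetical helper; the users and reports
# are example values only.
CASES = [
    ("finance_analyst", "GL Balance Report",   True),   # should have access
    ("finance_analyst", "HR Headcount Report", False),  # should be blocked
    ("hr_analyst",      "HR Headcount Report", True),
]

for user, report, expected in CASES:
    actual = can_access(user, report)   # hypothetical authorization check
    result = "PASS" if actual == expected else "FAIL"
    print(f"{result}: {user} -> {report} (expected={expected}, actual={actual})")
#-----------------------------------------------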

Performance

Performance of the data solution is important to user efficiency and user adoption. If users cannot get the results they need in a timely manner, they will look elsewhere to get what they need. Even if they have no choice, a poorly performing application will result in wasted time and dollars.

A few things to consider for ensuring quality around performance:

  • Application usage – is the performance acceptable? Do the results get returned in an acceptable time?
  • Data Integration – is the load performance acceptable?
  • Data processing – can the application perform all the processing it needs to do in a reasonable amount of time?
  • Stress Testing – how is performance with many users? How is it with a lot of data?
  • How is performance with various selections or with no selections at all?
  • Is ad-hoc usage set up to be flexible while avoiding rogue analyses that may cripple the system?
  • Is real-time analysis needed and is the application quick enough?

These items need to be validated and any issues need to be reported to the appropriate teams for performance tuning before the application is released for general usage.
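
For stress testing in particular, even a small script can help quantify response times under concurrent load. Below is a rough sketch: run_query() is a hypothetical function that issues one report request, and 20 simulated users is an arbitrary example value.

#-----------------------------------------------
# Rough sketch: time a report request under simulated concurrent users.
import time
from concurrent.futures import ThreadPoolExecutor

def timed_run(_):
    start = time.perf_counter()
    run_query()                      # hypothetical report/dashboard request
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=20) as pool:
    timings = list(pool.map(timed_run, range(20)))

print(f"avg={sum(timings)/len(timings):.2f}s  max={max(timings):.2f}s")
#-----------------------------------------------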

Methodology

Each organization, and even each team within an organization, will have a preferred methodology for application development and change management, including how they perform QA activities.

Some things to consider include:

  • Get QA resources involved in projects early so that they gain an early understanding of the requirements and the solutions to assess and plan how best to test.
  • When appropriate, do not wait until all testing is complete before notifying development teams of issues discovered. Notifying them early can make the difference between your project being on time or late.
  • Create a test plan and test scripts – even if they are high-level.
  • Where possible, execute tasks in an agile, iterative manner.
  • Each environment will have unique rules and guidelines that need to be validated. For example, your application may have a special naming convention, color & font guidelines, special metadata items, and more. You need to validate that these rules and guidelines are followed.
  • Use a checklist to ensure that you validate with consistency from deliverable to deliverable.
  • When the solution being developed is replacing an existing system or dataset, use the new and old solutions in parallel to validate the new against the old.
  • Document test results. All testing participants should document what has been tested and the results. This may be as simple as a checkmark or a “Done” status, but may also include things like data entered, screenshots, results, errors, and more.
  • Update the appropriate tracking tools (such as your SDLC or Change Management tools) to document changes and validation. These tools will vary from company to company, but it is best to have a trail of the development, testing, and release to production.
  • For each company and application, there will be a specific, unique set of things that need to be done. It is best if you have a standard test plan or test checklist to help you confirm that you have tested all important aspects and scenarios of the application.

This is not an all-encompassing coverage of Quality Assurance for data solutions, but I hope the article gives you enough information to get started or tips for improving what you currently have in place. You can share your questions, thoughts and input via comments to this post. Thanks for reading!

Learning Hadoop: The key features and benefits of Hadoop

What are the key features and benefits of Hadoop? Why is Hadoop such a successful platform?

Apache Hadoop, mostly called just Hadoop, is a software framework and platform for reading, processing, storing and analyzing very large amounts of data. There are several features of Hadoop that make it a very powerful solution for data analytics.

Hadoop is Distributed

With Hadoop, anywhere from a few to hundreds or thousands of commodity servers (called nodes) can be connected to form a cluster and work together to achieve whatever processing power and storage capacity is needed. The software platform enables the nodes to work together, passing work and data between them. Data and processing are distributed across the nodes, which spreads the load and significantly reduces the impact of any single failure.

Hadoop is Scalable

In the past, to achieve extremely powerful computing, a company would have to buy very expensive, large, monolithic computers. As data growth exploded, eventually even those supercomputers would become insufficient. With Hadoop, anywhere from a few to hundreds or thousands of commodity servers can be connected relatively easily to work together to achieve whatever processing power and storage capacity is needed. This allows a company or project to start small and then grow inexpensively as needed, without much concern about hitting a hard limit.

Hadoop is Fault Tolerant

Hadoop was designed and built around the fact that there will be frequent failures on the commodity hardware servers that make up the Hadoop cluster. When a failure occurs, the software handles the automatic reassignment of work and replication of data to other nodes in the cluster, and the system continues to function properly without manual intervention. When a node recovers, from a reboot for example, it will rejoin the cluster automatically and become available for work.

Hadoop is backed by the power of Open Source

Hadoop is open source software, which means it can be downloaded, installed, used and even modified for free. It is managed by the renowned non-profit Apache Software Foundation (ASF), hence the name Apache Hadoop. The foundation is made up of many brilliant people from all over the world, many of whom work at top technology companies and commit their time to managing the software. In addition, many developers contribute code to enhance Hadoop, add new features and functionality, or build new tools that work with Hadoop. The tools that have been built over the years to complement core Hadoop make up what is called the Hadoop ecosystem. With a large community of people from all over the world continuously adding to the growth of the Hadoop ecosystem in a well-managed way, it will only get better and become useful to many more use cases.

These are the reasons Hadoop has become such a force within the data world. Although there is some hype around the big data phenomenon, the benefits and solutions based on the Hadoop ecosystem are real.

You can learn more at https://hadoop.apache.org

InfatoODI – Informatica to ODI conversion tool

We are currently in the process of upgrading Oracle Business Intelligence Applications (OBIA) from version 7.9.6 to OBIA 11g.  Oracle has replaced Informatica as the data integration tool in the platform with its own tool, Oracle Data Integrator (ODI). This was a selfish, profit-driven move on Oracle’s part with no consideration for the impact on customers, but it is what it is.

Because of this, as a part of the upgrade to the new OBIA release, we need to convert all our hundreds of Informatica mappings to ODI.  As you can imagine, this is a lot of work.  We are getting help from a company that has developed a specialized conversion tool called InfatoODI, which converts Informatica mappings to ODI interfaces.

We are performing the conversions specifically for an OBIA application, but the tool can be used as a straight conversion tool for Informatica-to-ODI for any type of application.

We are in the beginning stages of the project, but early indications are that the tool will save us time; I am not yet sure how significant the savings will be. I will post updates on my experience with and opinion of the tool as we progress through the conversions.

Oracle Data Integrator (ODI) Knowledge Modules (KMs)

I am currently working on a project to upgrade Oracle Business Intelligence Applications (OBIA) 7 to OBIA 11g.  OBIA 11g and all future releases of OBIA (per Oracle) will use Oracle Data Integrator (ODI) as the ETL platform, replacing Informatica.

Due to this, I need to become very familiar with ODI to be able to manage and support the new release, and will be writing about ODI from time to time.

One key component of ODI is Knowledge Modules (KMs).  In this post, I will describe what Knowledge Modules are and the various types available in ODI.

Knowledge Modules (KMs) are generic code templates or modules that can be configured/coded to meet specific data integration needs and each type is dedicated to a specialized function in the overall data integration process.

Each of the 6 out-of-the-box (OOB) Knowledge Modules contains the “knowledge” to perform a specific set of actions on a specific combination of technologies, including connecting, extracting, transforming, loading, and checking data.  While the 6 OOB KMs meet most data integration needs, there will surely be cases when more custom features are needed. ODI KMs are extensible, and entirely new custom KMs can be built.

The 6 OOB KMs are:

Reverse Knowledge Module (RKM)
This KM is used to retrieve metadata from data sources and targets to the Oracle Data Integrator work repository. You can use it in models to perform customized reverse-engineering.

Loading Knowledge Module (LKM)
This KM is used to load heterogeneous data into a staging area. It is used in interfaces with heterogeneous sources.

Journalizing Knowledge Module (JKM)
This KM is used in models, sub models and databases to create, start and stop journals and to register subscribers. It creates the Change Data Capture framework objects in the source staging area.

Integration Knowledge Module (IKM)
This KM is used in interfaces to integrate data from the staging area into a target. The LKM and the IKM are probably the two most frequently used KMs in our environment.

Check Knowledge Module (CKM)
This KM is used to perform consistency checks of data against defined constraints. It is used in models, sub models and databases for data integrity audit, and used in interfaces for flow control or static control.

Service Knowledge Module (SKM)
This KM is used in models and databases. It is used to generate data manipulation web services.

These KMs are central to ODI. I will need to master their usage, and if you are planning to use ODI, you will need to as well.

Creating a Business Intelligence (BI) & Analytics Strategy and Roadmap

This post provides some of my thoughts on how to go about creating a Business Intelligence (BI) & Analytics Strategy and Roadmap for your client or company.  Please comment with your suggestions from your experience for improving this information.

 

When creating or updating the BI & Analytics Strategy and Roadmap for a company, one of the first things to understand is:

Who are all the critical stakeholders that need to be involved?

Understanding who needs and uses the BI & Analytics systems is critical for starting the process of understanding and documenting the “who needs what, why, and when”.

These are some of the roles that are typically important stakeholders:

  • High-level business executives that are paying for the projects
  • Business directors involved in the usage of the systems
  • IT directors involved in the development and support of the systems
  • Business Subject Matter Experts (SME’s) & Business Analysts
  • BI/Analytics/Data/System Architects
  • BI/Analytics/Data/System Developers and Administrators

 

Then, you need to ask all these stakeholders, especially those from the business:

What are the drivers for BI & Analytics? And what is the level of importance for each of these drivers?

This will help you to understand and document what business needs are creating the demand for new or modified BI & Analytics solutions. You should then go deeper to understand the business objectives and goals that are driving these business needs.  This will help you to understand and document the bigger picture so that a more comprehensive strategy and roadmap can be created.

The questions and discussions surrounding the above will require deep and broad business involvement. Getting the perspective of a wide range of users from all business areas that are using the BI & Analytics Systems is critical.  The business should be involved throughout the process of creating the strategy and roadmap, and all decisions should tie back to support for business objectives and goals. And the trail leading to all these decisions must be documented.

Some examples of business drivers include:

  • Gain more insight into who our best customers are and how best to acquire them.
  • Understand how weather affects our sales/revenue.
  • Determine how we can sell more to our existing customers.
  • Understand what causes employee turnover.
  • Gain insight into how we can improve staffing schedules.

 

And examples of business objectives and goals may include things like:

  • Increase corporate revenues by 10%
  • Grow our base of recurring customers
  • Stabilize corporate revenues over all seasons
  • Create an environment where employees love to work
  • Reduce payroll costs without a reduction in staff (for example, by reducing turnover)

 

Then, turn to understanding and documenting the current state (if not already known). Identify what systems (including data sources) are in place, who is using them (and why and how), what capabilities they offer, what the must-haves are, and what the pain points and positive highlights are.

Also, you will need to determine the current workload (and future workload if it can be determined) of the primary team members involved in developing, testing, and implementing BI & Analytics solutions.

This will help you understand a few things:

  • Some of the highest priority needs of the users
  • Gaps in capabilities and data between what is needed and what is currently in place (including an understanding of what is liked and disliked about the current systems)
  • Current user base knowledge and engagement
  • IT knowledge and skills
  • Resource availability – when are people available to work on new initiatives

 

What are the options and limitations?

  • Can existing systems be customized to meet the requirements?
  • Can they be upgraded to a new version that has the needed functionality?
  • Do we need to consider adding a new platform or replacing one or more of the existing systems with a new platform?
  • Can we migrate from/integrate one system to/with another system that we already have up and running?
  • Are any of our current systems losing vendor support or require an upgrade for other reasons? Has the pricing changed for any of our software applications?
  • What options does our budget permit us to explore?
  • What options do our knowledge and skills permit us to explore?

 

Once you have identified these items …

  • Identify and engage stakeholders, and document these roles and the people
  • Identify and document business drivers, objectives and goals
  • Understand and document the current landscape – needs (including must-haves), technology, gaps, users, IT staff, resource availability, and more
  • Identify and document options – based on current landscape, technology, budget, staff resources, etc.

… you can develop a “living” Strategy and Roadmap for BI & Analytics. And when I say “living”, I mean it will not be a static document; it will be fine-tuned over time as new information emerges and as business needs, technology, and staff resources change.

 

Your Strategy and Roadmap for BI & Analytics should include, but is not limited to:

  • BI & Analytics that will be used to satisfy business drivers, objectives and goals
  • Data acquisition and storage plan for meeting the analytics needs
  • Technology platforms that will be used to process and store data, and deliver the analytics
  • Information about any new technologies that need to be acquired or implemented, and schedules
  • Roles and Responsibilities for all stakeholders involved in BI & Analytics projects
  • Planned staffing allocations and schedules
  • Planned staffing changes and schedules
  • User training (business users) and Delivery team training (technical implementers & developers for example)
  • Dependencies for each item or set of items

How to load data from multiple Excel files with similar but different names and varying number of sheets in QlikView

This post describes a scenario for loading data into QlikView from multiple Excel files with similar but different names and a different number of tabs.

Let’s say you need to load multiple Excel files containing information about orders into your QlikView application.  These files have different names, and each file may have a different number of sheets.

For example, you may have several files with Order information from different sources for multiple dates such as:

CallCenter_Orders_20150312.xlsx
InStore_Orders_20150311.xlsx
SalesRep_Orders_20150312.xlsx
SalesRep_Orders_20150311.xlsx

Let’s say each file has one or more sheets representing regions/divisions – West, Mid-West, North East, and South.  Some files may have all 4 region/division sheets, while others may have just one region sheet.

This script is one possible way of loading this data in QlikView using a single script. With some adjustments, this script may also work for Qlik Sense, but I did not test exactly what changes would be needed.

//-----------------------------------------------
// set the errormode so that your script will not fail when one or more of
// the 4 sheets is not found in any particular file
 SET ErrorMode = 0;
OrdersFileData:
 LOAD [CustomerID]  as [Customer ID],
 [OrderID]          as [Order Number],
 [OrderDate]        as [Order Date],
 [ShipDate]         as [Ship Date],
 [Notes]            as [Order Notes],
 [Turn around days] as [Turnaround Days],
 'WEST'             as [Division] //identify region/division on all records
 FROM [..\Data\Text Files\*Orders*.xlsx] //wildcard allows load from all
                                         //xlsx files with "Orders" in the name
 (ooxml, embedded labels, table is WEST);  //load from the West sheet
CONCATENATE (OrdersFileData) //append data from Midwest sheet from all files
 LOAD [CustomerID]  as [Customer ID],
 [OrderID]          as [Order Number],
 [OrderDate]        as [Order Date],
 [ShipDate]         as [Ship Date],
 [Notes]            as [Order Notes],
 [Turn around days] as [Turnaround Days],
 'MIDWEST'          as [Division]
 FROM [..\Data\Text Files\*Orders*.xlsx]
 (ooxml, embedded labels, table is MIDWEST);
CONCATENATE (OrdersFileData) //append data from Northeast sheet from all files
 LOAD [CustomerID] as [Customer ID],
 [OrderID] as [Order Number],
 [OrderDate] as [Order Date],
 [ShipDate] as [Ship Date],
 [Notes] as [Order Notes],
 [Turn around days] as [Turnaround Days],
 'NORTHEAST' as [Division]
 FROM [..\Data\Text Files\*Orders*.xlsx]
 (ooxml, embedded labels, table is NORTHEAST);
CONCATENATE (OrdersFileData) //append data from South sheet from all files
 LOAD [CustomerID] as [Customer ID],
 [OrderID] as [Order Number],
 [OrderDate] as [Order Date],
 [ShipDate] as [Ship Date],
 [Notes] as [Order Notes],
 [Turn around days] as [Turnaround Days],
 'SOUTH' as [Division]
 FROM [..\Data\Text Files\*Orders*.xlsx]
 (ooxml, embedded labels, table is SOUTH);

STORE OrdersFileData into ..\Data\QVDs\OrdersData.QVD; // if loading to QVD
DROP Table OrdersFileData; //if loading to QVD and not needed in memory
//-----------------------------------------------

Oracle Business Intelligence Applications (OBIA) Fact Tables

Dimensionally modeled (star-schema designed) data warehouses are primarily made up of two types of tables – Fact and Dimension.  Fact tables store the measurements generated by business events (# of orders, amount of dollars, etc.); and Dimension tables store the descriptive attributes that provide context to the measurements (product [product name], customer [customer type], date, etc.).

This post describes the types of Fact tables found in Oracle Business Intelligence Applications (OBIA) data warehouse – Oracle Business Analytics Warehouse (OBAW).  There will be future posts that describe in detail the other table types in OBIA (Dimension, Internal, etc.).

The 5 types of Fact tables used in the OBAW are:

  1. Transactional
  2. Aggregate
  3. Cycle Lines
  4. Snapshot
  5. State Transition

The Transactional Fact Table is the main type of fact table. It stores the lowest-level of information from transactional sources. An example of a Fact table in OBIA (Financial Analytics) is: W_GL_BALANCE_F
Note: Fact tables in OBIA end with “_F”.
This table stores the current balance for GL accounts by GL_ACCOUNT and other dimensions.

The Aggregate Fact Table is typically used for performance improvements.  It is a summarized or rolled-up version of the Transactional fact table.  Instead of querying the data at the transactional level – which is the most detailed level and the level with the most records, the Aggregate table allows you to query the data at a more rolled up level when appropriate.  One of the most frequent roll-ups is time – for example, a transactional table at a day level is rolled up to the month level.
Aggregate tables can be tens (or even hundreds) of times smaller than their transactional versions.  These types of tables are also very common in OBIA and in data warehousing in general.

An example of an Aggregate Fact Table in OBIA (Financial Analytics) is: W_GL_BALANCE_A
Note: Aggregate Fact tables in OBIA end with “_A”.
This table stores the GL account balances aggregated by GL Account Segment and other dimensions. Instead of having data at the GL_ACCOUNT level as in the Transactional fact table, the data is at the GL Account Segment level in the Aggregate table.  Aggregate Fact tables are derived from Transactional Fact  Tables or other Aggregate Fact tables. This table is derived from the transactional fact table mentioned above: W_GL_BALANCE_F.

The Snapshot Fact Table stores “snapshots” of measurements taken at well-defined, predetermined time intervals – such as daily, monthly, annually, etc.  Examples include Inventory and Account Balance snapshots, and AR/AP aging snapshots.  Common items such as financial reports or bank statements are examples of reports from Snapshot Fact tables.

An example of a Snapshot table in OBIA(Supply Chain Analytics) is: W_INVENTORY_DAILY_BAL_F
Oracle’s description of this table will help to clarify its makeup and purpose.
The W_INVENTORY_DAILY_BAL_F fact table is used to represent at a point in time information of all inventory balances and inventory values related to products whose inventory is maintained by the business organization, these would typically include all inbound (purchased from external entities) products as well as outbound (sold to external entities) products. The inventory balance information is trended by copying historical snapshot information from this table at periodic points in time into history table W_INVENTORY_MONTHLY_BAL_F.
The W_INVENTORY_MONTHLY_BAL_F table stores a snapshot of inventory balance.
There is one row for each product and product storage location whose point in time inventory quantity and value information is maintained. The storage location could represent a warehouse or further divisions within a warehouse. This aspect is configurable within the product. All the dimension key links to the other Oracle Business Analytics Warehouse dimension tables, such as W_DAY_D, W_BUSN_LOC_D, W_PRODUCT_D, W_INVENTORY_PRODUCT_D, and so on, represent information associations at that point in time for that product inventory information. The DATE_WID column represents the date on which the inventory balance information is valid.

These tables can also have Aggregate versions:
As mentioned in the description for the W_INVENTORY_DAILY_BAL_F table above, there is an aggregate version.  However, snapshot tables are not necessarily aggregated like transactional tables, because many times the measures are non-additive or semi-additive. For example, you would not take your account monthly balance in January and add it to your account monthly balance in February to determine how much money you have – that would be wrong.

The W_INVENTORY_MONTHLY_BAL_F fact table is used to represent the monthly information of all the inventory balances and the inventory values related to products whose inventory is maintained by the business organization. This information includes all inbound (purchased from external entities) products and outbound (sold to external entities) products. The aggregation period is configurable, and has a preconfigured value of Monthly.
There is one row for each product and product storage location whose point in time (as of a month) inventory quantity and value information is maintained. All the dimension key links to the other Oracle Business Analytics Warehouse dimension tables such as W_DAY_D, W_BUSN_LOC_D, W_PRODUCT_D, W_INVENTORY_PRODUCT_D, and so on, and represents information and associations at that point in time for that product inventory information. The PERIOD_START_DT_WID and PERIOD_END_DT_WID column represents the aggregation bucket start and end dates. The column INV_BALANCE_DT_WID represents the date within this aggregation period on which the inventory balance information is valid.

The Cycle Lines Fact Table stores measurements for multiple related business events and is therefore typically derived from multiple fact tables. These tables typically store process cycle times or provide the ability to easily determine process cycle times.  They are also called Accumulating Snapshot Fact tables because they are snapshots of different events accumulated on top of each other.  An example of a Cycle Lines Fact table is W_PURCH_CYCLE_LINE_F.

Here is Oracle’s description of the table, which should help clarify its purpose: The W_PURCH_CYCLE_LINE_F table tracks the time duration of all events pertaining to the purchase process, commencing with a requisition. Information in this table enables analysis of the direct spend process within an organization, beginning with a purchase requisition, its approval, the creation of an approved purchase order, its submission to a supplier, and the creation of a purchase schedule, and ending with the receipt of the products. It can be used to calculate the time taken to receive products that have been ordered, and the time between the first and last receipt of products that have been scheduled for delivery. The W_PURCH_CYCLE_LINE_F table contains all the various dates associated with the processes such as submission, approval, ordering and receiving, as well as quantities and amounts. While other spend-related fact tables capture individual processes such as requesting, ordering, and scheduling, this table combines them all in one place for ease of analysis and reporting.

These Cycle Lines tables can also have aggregate versions. For example, W_PURCH_CYCLE_LINE_A is an aggregate of W_PURCH_CYCLE_LINE_F at a higher level of dimensionality. The Product dimension is replaced by a Product Type dimension to give a high-level analysis of the sourcing data. It stores Purchase Cycle Line records aggregated over a preconfigured monthly time period and by product type.

State Transition Fact Tables store state-transition metrics based on business events, such as customer state – new, top, dormant, lost, etc. – based on the customer’s order activity.  These tables store, or allow you to easily derive, counts of the various states.  State Transition Fact tables are derived from Transactional or Snapshot fact tables.

Below are two examples of State Transition Fact tables in OBIA (Marketing Analytics):

The Customer Status History Fact: W_CUSTOMER_STATUS_HIST_F
This is a fact table that tracks the status of customers based on the frequency of orders they place with the organization. Possible statuses are NEW, RECENT, DORMANT and LOST. The time duration for each status bucket is configurable, out of the box being a calendar year.
The grain of this table is at a Customer, Customer Status and the Status Start Date level. Other important columns in this table include the Sold to and the Ship to location for the customer. These are derived based on the status bucket start date against the Customer Locations dimension table.

The Loyalty Member Status History Fact: W_LOY_MEMBER_STATUS_HIST_F
W_LOY_MEMBER_STATUS_HIST_F Fact table stores status changes of Loyalty members. Grain: One record for each member status changed.

That’s it for OBIA fact tables.  Understanding the types of fact tables and their purpose helps us make better design choices when we set out to build new fact tables to represent business events, and it also helps us recognize and analyze the data in these tables more quickly.
I hope you found this information useful. If you have information about other fact table types, please share.