Python Libraries for Data Science

Python has quickly grown into one of the most widely used programming languages. While it is a powerful, general-purpose language used to build just about any type of application, it has become a go-to language for data science, rivaling even R, the longtime favorite language and platform in the field.

Python’s popularity for data-driven solutions has grown because of the many powerful, open-source, data-centric libraries available for it. Some of these libraries include:

NumPy

NumPy is a library for creating and manipulating multi-dimensional data arrays and for performing fast mathematical operations on them.

Pandas

Pandas is a library that provides easy-to-use but high-performance data structures, such as the DataFrame, and data analysis tools.

Matplotlib

Matplotlib is a library for data visualization, used to create histograms, bar charts, scatter plots, and much more.

SciPy

SciPy is a library that provides integration, statistics, and linear algebra packages for numerical computations.

Scikit-learn

Scikit-learn is a library used for machine learning. It is built on top of some other libraries including NumPy, Matplotlib, and SciPy.
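
To give a sense of how these libraries work together, below is a minimal, illustrative Python sketch (the values are made up for the example) that generates a small dataset with NumPy, summarizes it with Pandas, and plots it with Matplotlib.

# Illustrative only: simulate 30 days of sales, summarize, and plot.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)                    # NumPy: random number generator
sales = rng.normal(loc=1000, scale=150, size=30)        # 30 simulated daily sales values

df = pd.DataFrame({"day": np.arange(1, 31), "sales": sales})   # Pandas DataFrame

print(df["sales"].describe())                           # Pandas: summary statistics

plt.plot(df["day"], df["sales"], marker="o")            # Matplotlib: line chart
plt.title("Simulated daily sales")
plt.xlabel("Day")
plt.ylabel("Sales")
plt.show()

SciPy and Scikit-learn build on the same NumPy arrays for scientific computing and machine learning, respectively.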

There are many other data-centric Python libraries, and some will be introduced in future articles. You can learn more at https://www.python.org/

What is data analytics? And what are the different types of data analytics?

Data analytics is the overall process of capturing and using data to produce meaningful information, including metrics and trends, that can be used to better understand events and help make better decisions. Usually the goal is to improve the efficiency and outcomes of an operation, such as a business, a political campaign, or even an individual (an athlete, for example). There are four prevalent types of data analytics: descriptive, predictive, diagnostic, and prescriptive.

  1. Descriptive analytics – provides information about “what has happened”. Examples of questions answered by descriptive analytics include: What were our sales this month, and what is our year-over-year sales increase? How many website visitors did we have, and how many signups?
  2. Predictive analytics – provides insight into “what may happen” in the future based on the past. Examples of questions answered by predictive analytics include: Based on previous customer service call patterns and outcomes, what is the likelihood of a customer switching to another provider? Based on a customer’s profile, how much should we charge them for insurance?
  3. Diagnostic analytics – provides information to explain “why something happened”. In addition to the direct data, this may also involve more indirect or macro data sources, such as weather data, local or national economic data, or competitor data. It may also involve forming logical theories about the correlation of events. Examples of questions answered by diagnostic analytics include: How effective was the marketing blitz, and which channel had the most impact? Did the weather affect sales, or was it the price increase?
  4. Prescriptive analytics – provides insight into “what to do to make something happen”. Examples of questions answered by prescriptive analytics include: Based on the results of our test marketing blitz campaign, if we roll out the full campaign with adjustments to the channel spread, how many additional temporary customer service staff will we need to handle the increased volume without long wait times?
The four types of data analytics

Descriptive analytics is the simplest and most common form of analytics used in organizations and is widely referred to as Business Intelligence (BI). There is widespread interest in predictive analytics, but fewer than half of companies currently use it, as it requires additional, more expensive skills. Diagnostic and prescriptive analytics have always been around, because companies have always used information from descriptive analytics to hypothesize “why things happened” and decide “what to do”. What is fairly new is the automation of these types through new methods and the integration of more data inputs. The latter three forms are sometimes called Advanced Analytics or Data Science.

All types of analytics require some form of data integration and use some of the same data in an environment. However, while descriptive analytics needs only data from the time periods being analyzed, and usually from a narrower data set, predictive, prescriptive, and diagnostic analytics produce better results when they use as much data as is available, from a wider timeframe and a broader set of sources. The types of analytics also overlap: the analysis of “what may happen” is driven by “what has happened” in the past and “why it happened”, and determining “what to do” is driven by “what has happened”, “why it happened”, and “what may happen”. Companies at the forefront of data analytics will tend to use all four types.

Quality Assurance (QA) for Data Projects or Data Applications

This post discusses Quality Assurance (QA) activities for data projects.

What is Quality Assurance (QA)? Simply put, Quality Assurance, also called QA, Testing, or Validation, is about testing an application or solution to ensure that all the stated, promised, and expected requirements are met. It is a critically important activity for all software development and implementation projects. Data applications are no different: they need to be tested to ensure they work as intended.

QA stands between development and deployment, and it makes the difference between a delivered product and a high-quality delivered product.

There are a number of things to keep in mind when you plan your Quality Assurance activities for data solutions. I present some of them in this post as suggestions, considerations, or prompting questions. The things mentioned here will not apply to all data applications but can be used as a guide or a check.

People / Teams

The number of people and teams involved in a project will vary depending on the size, scope and complexity of the project.

The technical team building the application needs to perform an initial level of validation of the solution.

If there is a Quality Assurance team that performs the validation tasks, then that team will need to perform the “official” validation.

The business analysts and end-users of the application also need to validate. Where possible, work with as many end users as you efficiently can. The more real users you have testing the application, the better the chances of finding issues early.

Where it makes sense, Test IDs that simulate various types of users or groups should be used to help test various usage and security scenarios. This is particularly useful in automated testing.

On large projects where there is a lot to be tested, it is best to break up the testing across multiple people or teams. This helps prevent tester fatigue and sloppiness, and results in higher-quality testing.

Plan ahead to ensure that access for all the relevant users is set up in the testing environments.

Communication

With all the teams and people involved, it is important to have a plan for how they will communicate. Things to consider and have a plan for include:

  • How will teams communicate internally? Email, Microsoft Teams, SharePoint, and shared files are some options.
  • How will the various teams involved communicate with each other? In other words, how will cross-team communication be handled? As above, email, Microsoft Teams, SharePoint, and shared files are some options.
  • How will issues and status be communicated? Weekly meetings, status emails or documents, and files available on shared spaces are options.
  • How will changes and resolutions be tracked? Files, SDLC applications, and change management applications are options.
  • How will teams and individuals be notified when they need to perform a task? Manual communication or automated notifications from tools are options.

Data

The most important thing to ensure in data projects is that the data is of high quality, particularly the “base” data set. If the base data is incorrect, everything built on top of it will be wrong. The correctness of intermediate and user-facing data is, of course, just as important, but validating the base data is critical to achieving correct data throughout the solution. Some basic checks include the following (a minimal automated sketch appears after the list):

  • Ensure that table counts, field counts, and row counts of key data are correct.
  • Does the data warehouse data match the source data?
  • Test detailed, low-level records with small samples of data.
  • Test to ensure that the data and the values conform to what is expected. For example, ensure that there is no data older than three years, or that there are no account values outside a certain range. The Data Governance team may become involved in these activities across all projects.
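
As an illustration, here is a minimal sketch of automated base-data checks using Python and pandas. The file names, column names, and the three-year rule are hypothetical placeholders, not a prescribed approach.

# Illustrative base-data checks; file names, column names, and thresholds are placeholders.
import pandas as pd

source = pd.read_csv("source_extract.csv")         # data pulled from the source system
warehouse = pd.read_csv("warehouse_extract.csv")   # same data pulled from the warehouse

# Row and field counts should match between source and warehouse
assert len(source) == len(warehouse), "Row counts do not match"
assert len(source.columns) == len(warehouse.columns), "Field counts do not match"

# Values should conform to expectations, e.g. no data older than three years
cutoff = pd.Timestamp.today() - pd.DateOffset(years=3)
too_old = warehouse[pd.to_datetime(warehouse["transaction_date"]) < cutoff]
assert too_old.empty, f"{len(too_old)} records are older than three years"

# Account values should fall within an agreed range
out_of_range = warehouse[(warehouse["account_value"] < 0) | (warehouse["account_value"] > 1_000_000)]
assert out_of_range.empty, f"{len(out_of_range)} account values are out of range"

Checks like these can be run automatically after each data load.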

Next in line is the “intermediate” data, such as derived metrics, aggregates, specialized subsets, and more. These also need to be verified (a sketch of one such check follows the list below).

  • Are the calculated values correct?
  • Are the aggregates correct? Test aggregate data with small, medium, and large sets of data.
  • Verify metric calculations.
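
For example, one way to verify aggregates and metric calculations is to recompute them from the detail data and compare the results with the stored aggregate table. The sketch below is illustrative only; the file and column names are hypothetical.

# Recompute an aggregate from detail rows and compare it with the stored aggregate.
import pandas as pd

detail = pd.read_csv("sales_detail.csv")          # low-level transaction rows
stored = pd.read_csv("sales_by_region.csv")       # aggregate produced by the data pipeline

# Recompute total sales per region from the detail data
recomputed = detail.groupby("region", as_index=False)["sales_amount"].sum()

# Join the two results and flag differences larger than a rounding tolerance
merged = recomputed.merge(stored, on="region", suffixes=("_recomputed", "_stored"))
merged["difference"] = merged["sales_amount_recomputed"] - merged["sales_amount_stored"]
mismatches = merged[merged["difference"].abs() > 0.01]

print("All aggregates match" if mismatches.empty else mismatches)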

Then the user-facing data or data prepared for self-service usage needs to be validated.

  • Does the data on the dashboard match the data in the database?
  • Are the KPIs correctly reflecting the status?

Test the full flow of the data. The validity of the data should be verified at each stage of the data flow – from the source, to staging, to the final tables in the data warehouse, to aggregates or subsets, to the dashboard.

Take snapshots of key datasets or reports so you can compare results post data migration.

Some additional data prep might be needed in some cases.

  • These include making sure that you have sourced adequate data for testing. For example, if you need to test an annual trend, then it might be best to have at least a year’s worth of data, preferably two.
  • You may need to scramble or redact some data for testing. Often, test data is taken from the Production environment and then scrambled and/or redacted so as not to expose sensitive information.
  • You may need to temporarily load in data for testing. For various reasons, you may need to load some Production data into the QA environment just to test the solution or a particular feature and then remove the data after the testing is complete. While this can be time consuming, sometimes it’s necessary, and it’s good to be aware of the need early and make plans accordingly.

Aesthetics & Representation of Data

Presentation matters. Although the most critical thing is data correctness, how the data is presented is also very important. Good presentation helps with understanding, usability, and adoption. A few things to consider include:

  • Does the application, such as a dashboard, look good? Does it look right?
  • Are the components laid out properly so that there is no overcrowding?
  • Are the logos, colors and fonts in line with company expectations?
  • Are proper chart options used to display the various types of data and metrics?
  • Is the information provided in a way that users can digest?

Usage

The data application or solution should be user friendly: preferably intuitive, or at least accompanied by good documentation. The data must be useful to the intended audience, in that it should help them understand the information and make good decisions or take sensible actions based on it.

The application should present data in a manner that is effective – easy to access, and easy to understand.

The presentation should satisfy the analytic workflows of the various users. Users should be able to logically step through the application to find information at the appropriate level of detail that they need based on their role.

A few things that affect usability include:

  • Prompts – ensure that all the proper prompts or selections are available to users to slice and filter the data as necessary. And of course, verify that they work.
  • Drill downs and drill throughs – validate that users can drill down and across the data to find the information they need in a simple, logical manner.
  • Easy interrogation of the data – if the application is ad-hoc in nature, validate that users can navigate it or at least verify that the documentation is comprehensive enough for users to follow.

Security

Securing the application and its data so that only authorized users have access to it is critical.

Application security comprises “authentication” (access to the application) and “authorization” (what a user is authorized to do when he or she accesses the application).

Authorization (what a user is authorized to do within the application) can be broken into “object security” (what objects or features a user has access to) and “data security” (what data elements a user has access to within the various objects or features).

For example, a user has access to an application (authenticated / can log in), and within the application the user has access to (authorized to see and use) 3 of 10 reports (object-level security). The user is not authorized to see the other 7 reports (object-level security) and, therefore, will not have access to them. Now, within the 3 reports that the user has access to, he or she can only see data related to 1 of 5 departments (data-level security).
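
A test for this scenario could assert both levels of access directly. The sketch below is only illustrative; the client object, its methods, and the report and department attributes are hypothetical stand-ins for whatever test harness you use.

# Illustrative object-level and data-level security checks for a hypothetical test client.
def validate_user_access(client, user):
    # Object-level security: the user sees exactly the reports assigned to them
    visible = set(client.list_reports(user))
    assert visible == set(user.allowed_reports), "Object-level access is incorrect"

    # Negative test: reports outside the user's list must be refused
    for report in client.all_reports():
        if report not in user.allowed_reports:
            assert not client.can_open(user, report), f"{report} should be blocked"

    # Data-level security: allowed reports only return the user's own department
    for report in user.allowed_reports:
        rows = client.run_report(user, report)
        departments = {row["department"] for row in rows}
        assert departments <= {user.department}, "Data-level access is incorrect"

Running checks like these under several Test IDs, one per role, covers both the positive and the negative cases.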

All object-level and data-level security needs to be validated. This includes negative testing: not only should you test that users have the access they need, but you should also ensure that users do not have access that they should not have.

  • Data for testing should be scrambled or redacted as appropriate to protect it.
  • Some extremely sensitive data may need to be filtered out entirely.
  • Can all the appropriate users access the application?
  • Are non-authorized users blocked from accessing the application?
  • Can users see the data they should be able to see to perform their jobs?

Performance

Performance of the data solution is important to user efficiency and user adoption. If users cannot get the results they need in a timely manner, they will look elsewhere to get what they need. Even if they have no choice, a poorly performing application will result in wasted time and dollars.

A few things to consider for ensuring quality around performance:

  • Application usage – is the performance acceptable? Do the results get returned in an acceptable time?
  • Data Integration – is the load performance acceptable?
  • Data processing – can the application perform all the processing it needs to do in a reasonable amount of time?
  • Stress Testing – how is performance with many users? How is it with a lot of data?
  • How is performance with various selections or with no selections at all?
  • Is ad-hoc usage set up to be flexible while avoiding rogue analyses that may cripple the system?
  • Is real-time analysis needed and is the application quick enough?

These items need to be validated and any issues need to be reported to the appropriate teams for performance tuning before the application is released for general usage.

Methodology

Each organization, and even each team within an organization, will have a preferred methodology for application development and change management, including how they perform QA activities.

Some things to consider include:

  • Get QA resources involved in projects early so that they gain an early understanding of the requirements and the solution, and can assess and plan how best to test.
  • When appropriate, do not wait until all testing is complete before notifying development teams of issues discovered. Notifying them early could make the difference between your project being on time or late.
  • Create a test plan and test scripts – even if they are high-level.
  • Where possible, execute tasks in an agile, iterative manner.
  • Each environment will have unique rules and guidelines that need to be validated. For example, your application may have a special naming convention, color & font guidelines, special metadata items, and more. You need to validate that these rules and guidelines are followed.
  • Use a checklist to ensure that you validate with consistency from deliverable to deliverable.
  • When the solution being developed is replacing an existing system or dataset, use the new and old solutions in parallel to validate the new against the old.
  • Document test results. All testing participants should document what has been tested and the results. This may be as simple as a checkmark or a “Done” status, but may also include things like data entered, screenshots, results, errors, and more.
  • Update the appropriate tracking tools (such as your SDLC or Change Management tools) to document changes and validation. These tools will vary from company to company, but it is best to have a trail of the development, testing, and release to production.
  • For each company and application, there will be a specific, unique set of things that need to be done. It is best to have a standard test plan or test checklist to help you confirm that you have tested all important aspects and scenarios of the application.

This is not an all-encompassing coverage of Quality Assurance for data solutions, but I hope the article gives you enough information to get started or tips for improving what you currently have in place. You can share your questions, thoughts and input via comments to this post. Thanks for reading!

Learning Hadoop: The key features and benefits of Hadoop

What are the key features and benefits of Hadoop? Why is Hadoop such a successful platform?

Apache Hadoop, usually just called Hadoop, is a software framework and platform for reading, processing, storing, and analyzing very large amounts of data. Several features of Hadoop make it a very powerful solution for data analytics.

Hadoop is Distributed

With Hadoop, anywhere from a few to hundreds or thousands of commodity servers (called nodes) can be connected to form a cluster and work together to achieve whatever processing power and storage capacity is needed. The software platform enables the nodes to work together, passing work and data between them. Data and processing are distributed across the nodes, which spreads the load and significantly reduces the impact of any single failure.
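
As a small illustration of the kind of work that gets distributed, here is the classic word-count example written in Python in the style used with Hadoop Streaming, which lets simple scripts read from standard input and write to standard output. Hadoop splits the input across the nodes, runs the mapper on each node's slice of the data, sorts and shuffles the mapper output, and then runs the reducer; the script names are arbitrary.

# mapper.py: each node runs this against its local slice of the input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")                 # emit "word <tab> 1" for every word

# reducer.py: receives the mapper output sorted by word and totals the counts
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

If a node fails mid-job, Hadoop simply reschedules that node's map or reduce tasks elsewhere, which is what makes the fault tolerance described below possible.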

Hadoop is Scalable

In the past, to achieve extremely powerful computing, a company would have to buy very expensive, large, monolithic computers. As data growth exploded, eventually even those supercomputers would become insufficient. With Hadoop, anywhere from a few to many thousands of commodity servers can be connected relatively easily to work together to achieve whatever processing power and storage capacity is needed. This allows a company or project to start out small and then grow inexpensively as needed, without much concern about hitting a hard limit.

Hadoop is Fault Tolerant

Hadoop was designed and built around the fact that there will be frequent failures on the commodity hardware servers that make up the Hadoop cluster. When a failure occurs, the software handles the automatic reassignment of work and replication of data to other nodes in the cluster, and the system continues to function properly without manual intervention. When a node recovers, from a reboot for example, it will rejoin the cluster automatically and become available for work.

Hadoop is backed by the power of Open Source

Hadoop is open-source software, which means that it can be downloaded, installed, used, and even modified for free. It is managed by the renowned non-profit Apache Software Foundation (ASF), hence the name Apache Hadoop. The foundation is made up of many brilliant people from all over the world, many of whom work at top technology companies and commit their time to managing the software. In addition, many developers contribute code to enhance or add new features and functionality to Hadoop, or to add new tools that work with Hadoop. The various tools that have been built over the years to complement core Hadoop make up what is called the Hadoop ecosystem. With a large community of people from all over the world continuously adding to the growth of the Hadoop ecosystem in a well-managed way, it will only get better and become useful to many more use cases.

These are the reasons Hadoop has become such a force within the data world. Although there is some hype around the big data phenomenon, the benefits and solutions based on the Hadoop ecosystem are real.

You can learn more at https://hadoop.apache.org

Connecting to Microsoft SQL Server database from Oracle SQL Developer

If you work primarily with Oracle databases, you may use SQL Developer. But you may also need to connect to Microsoft SQL Server databases without necessarily wanting to install a new front-end database tool, such as Microsoft SQL Server Management Studio (SSMS). You can connect to SQL Server from SQL Developer.

First, download the appropriate JDBC Driver for the version of SQL Server that you need to connect to. Then follow the steps in the video at the link below on the Oracle website.

https://www.oracle.com/technetwork/developer-tools/sql-developer/sql-server-connection-viewlet-swf-089886.html

Good luck.

 

Creating a Business Intelligence (BI) & Analytics Strategy and Roadmap

This post provides some of my thoughts on how to go about creating a Business Intelligence (BI) & Analytics Strategy and Roadmap for your client or company.  Please comment with your suggestions from your experience for improving this information.

 

When creating or updating the BI & Analytics Strategy and Roadmap for a company, one of the first things to understand is:

Who are all the critical stakeholders that need to be involved?

Understanding who needs and uses the BI & Analytics systems is critical for starting the process of understanding and documenting the “who needs what, why, and when”.

These are some of the roles that are typically important stakeholders:

  • High-level business executives who are paying for the projects
  • Business directors involved in the usage of the systems
  • IT directors involved in the development and support of the systems
  • Business Subject Matter Experts (SME’s) & Business Analysts
  • BI/Analytics/Data/System Architects
  • BI/Analytics/Data/System Developers and Administrators

 

Then, you need to ask all these stakeholders, especially those from the business:

What are the drivers for BI & Analytics? And what is the level of importance for each of these drivers?

This will help you to understand and document what business needs are creating the demand for new or modified BI & Analytics solutions. You should then go deeper to understand the business objectives and goals that are driving these business needs. This will help you to understand and document the bigger picture so that a more comprehensive strategy and roadmap can be created.

The questions and discussions surrounding the above will require deep and broad business involvement. Getting the perspective of a wide range of users from all business areas that are using the BI & Analytics Systems is critical.  The business should be involved throughout the process of creating the strategy and roadmap, and all decisions should tie back to support for business objectives and goals. And the trail leading to all these decisions must be documented.

Some examples of business drivers include:

  • Gain more insight into who our best customers are and how best to acquire them.
  • Understand how weather affects our sales/revenue.
  • Determine how we can sell more to our existing customers.
  • Understand what causes employee turnover.
  • Gain insight into how we can improve staffing schedules.

 

And examples of business objectives and goals may include things like:

  • Increase corporate revenues by 10%
  • Grow our base of recurring customers
  • Stabilize corporate revenues over all seasons
  • Create an environment where employees love to work
  • Reduce payroll costs without a reduction in staff (for example, by reducing turnover)

 

Then, turn to understanding and documenting the current scenario (if it is not already known). Identify what systems (including data sources) are in place, who is using them (and why and how), what capabilities they offer, which are the must-haves, and what the pain points and positive highlights are.

Also, you will need to determine the current workload (and future workload if it can be determined) of the primary team members involved in developing, testing, and implementing BI & Analytics solutions.

This will help you understand a few things:

  • Some of the highest priority needs of the users
  • Gaps in capabilities and data between what is needed and what is currently in place (including an understanding of what is liked and disliked about the current systems)
  • Current user base knowledge and engagement
  • IT knowledge and skills
  • Resource availability – when are people available to work on new initiatives

 

What are the options and limitations?

  • Can existing systems be customized to meet the requirements?
  • Can they be upgraded to a new version that has the needed functionality?
  • Do we need to consider adding a new platform or replacing one or more of the existing systems with a new platform?
  • Can we migrate one of our existing systems to, or integrate it with, another system that we already have up and running?
  • Are any of our current systems losing vendor support, or do any require an upgrade for other reasons? Has the pricing changed for any of our software applications?
  • What options does our budget permit us to explore?
  • What options do our knowledge and skills permit us to explore?

 

Once you have identified these items …

  • Identify and engage stakeholders, and document these roles and the people
  • Identify and document business drivers, objectives and goals
  • Understand and document the current landscape – needs (including must-haves), technology, gaps, users, IT staff, resource availability, and more
  • Identify and document options – based on current landscape, technology, budget, staff resources, etc.

… you can develop a “living” Strategy and Roadmap for BI & Analytics. And when I say “living”, I mean it will not be a static document; it will be fine-tuned over time as new information emerges and as changes arise in business needs, technology, and staff resources.

 

Your Strategy and Roadmap for BI & Analytics should include, but is not limited to:

  • BI & Analytics that will be used to satisfy business drivers, objectives and goals
  • Data acquisition and storage plan for meeting the analytics needs
  • Technology platforms that will be used to process and store data, and deliver the analytics
  • Information about any new technologies that need to be acquired or implemented, and schedules
  • Roles and Responsibilities for all stakeholders involved in BI & Analytics projects
  • Planned staffing allocations and schedules
  • Planned staffing changes and schedules
  • User training (for business users) and delivery team training (for technical implementers and developers, for example)
  • Dependencies for each item or set of items

Your career in 2018 (referencing 2017 Gartner Magic Quadrant for Business Intelligence and Analytics)

How will you grow your career in 2018?

Each year Gartner publishes Magic Quadrants for several technologies, including one for Business Intelligence and Analytics. Let’s take a look at this year’s.

The “2017 Gartner Magic Quadrant for Business Intelligence and Analytics” document was published earlier in the year, but as we are at the end of the year, it is a good time to take stock of the industry and your career in this field.

Below is an image of the results published by Gartner for 2017. As you can see, from their perspective, Tableau and Microsoft lead the way, with Qlik also in the Leaders quadrant.

Source: Gartner

Since the same three players were also in the Leaders quadrant in 2016 and 2015 (along with others), we know they are solid.

So, based on this, if you are primarily an Oracle BI (OBI), IBM Cognos, or SAP Business Objects (BO) person (for example), you may consider learning or at least getting exposure to one or more of the 3 platforms in the Leaders quadrant.  There is still plenty of work out there for OBI, Cognos, BO and other platforms, but having additional technical/platform skills in leading platforms never hurts.

And this is just a broad suggestion, because you may find that developing skills in Salesforce (for example) is a better strategic move for you based on your situation.  The bottom-line is, assess your career/skills/goals/etc., and plan on learning something new in 2018.  Don’t stop learning!

Happy New Year to you, and best wishes for 2018!

Working with built-in datasets in R

This is a quick cheat sheet of commands for working with built-in datasets in R.

R has a number of base datasets that come with the install, and there are many packages that also include additional datasets.  These are very useful and easy to work with.

 

> ?datasets                                        # opens the help page for the datasets package

> library(help = "datasets")                       # detailed information on the datasets package, including a listing and description of its datasets

> data()                                           # lists the datasets currently available, grouped by package

> data(package = .packages(all.available = TRUE))  # lists the available datasets in all installed packages, even those not currently loaded

> data(EuStockMarkets)                             # loads the dataset EuStockMarkets into the workspace

> summary(AirPassengers)                           # provides a summary of the dataset AirPassengers

> ?airquality                                      # opens the help page for the dataset "airquality"

> str(airmiles)                                    # shows the structure of the dataset "airmiles"

> View(airmiles)                                   # shows the data of the dataset "airmiles" in a spreadsheet-like viewer

 

 

Updating R packages

In this post, I cover updating R packages in RStudio. There are a few options.

(Option 1) From the menu, select Tools -> “Check for Package Updates”.

You will get a message telling you that all packages are up to date if there is nothing to update.

(Option 2) From the Packages tab in the bottom-right pane, click “Update”.

(Option 3) Execute the command update.packages() in the console.

If any packages have updates available, you will be prompted for each one.

Enter “y” to update, “N” to not update, and “c” to cancel.

Thanks for reading!

Oracle EBS Receivables Data Flow and Data Model

The Accounts Receivable function is responsible for managing outgoing invoices to customers who purchased goods or services, and for the collection and application of all payments, including payments for those invoices. The Oracle Receivables module (part of the Oracle EBS Financials Suite) helps Accounts Receivable departments manage this function effectively and efficiently.

This post provides a summary of the Oracle Receivables data model and data flow. Some of these tables are source tables for the Oracle Business Intelligence Applications – Financial Analytics module, specifically providing information for the Receivables dashboards.

Diagram: Oracle Receivables data flow and data model

Before you can handle and process a payment in Receivables, you need a buyer/payer (most times this is a customer, but there are exceptions). Customer records are stored in the HZ_CUST_ACCOUNTS and HZ_PARTIES tables. Each customer needs to have a site (a location/address of business), for which information is stored in HZ_CUST_ACCT_SITES_ALL and HZ_PARTY_SITES_ALL.

When a customer purchases goods or services from your company, an invoice is generated for the customer.  These invoice transactions are recorded in RA_CUSTOMER_TRX_ALL (invoice headers) and RA_CUSTOMER_TRX_LINES_ALL (invoice lines).
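
As a rough illustration, a query that pulls recent invoice headers together with the customer name might look like the sketch below, issued here from Python with the python-oracledb driver. The connection details are placeholders, and the join columns (BILL_TO_CUSTOMER_ID, CUST_ACCOUNT_ID, PARTY_ID) are assumptions based on typical EBS schemas, so verify them against your own environment.

# Illustrative only: connection details are placeholders and join columns are assumptions.
import oracledb

conn = oracledb.connect(user="apps", password="apps_password", dsn="ebs-db:1521/EBSDB")

sql = """
    SELECT trx.trx_number,
           trx.trx_date,
           party.party_name
    FROM   ra_customer_trx_all trx
           JOIN hz_cust_accounts cust ON cust.cust_account_id = trx.bill_to_customer_id
           JOIN hz_parties party      ON party.party_id = cust.party_id
    WHERE  trx.trx_date >= ADD_MONTHS(SYSDATE, -12)
"""

with conn.cursor() as cursor:
    for trx_number, trx_date, party_name in cursor.execute(sql):
        print(trx_number, trx_date, party_name)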

When the customer makes a payment, new transactions are generated. These are recorded in AR_CASH_RECEIPTS_ALL and AR_CASH_RECEIPT_HISTORY. If there is an adjustment to an invoice, it is recorded in AR_ADJUSTMENTS.

Sometimes payments are received in batches, where a single payment is for multiple invoices.  These batch payments have records in AR_BATCHES.

The AR_PAYMENT_SCHEDULES_ALL table holds one record per payment schedule. For payments that pay an invoice in full, there will be only one record related to that invoice. However, if payments for an invoice are broken up into a payment plan, or if a partial payment is received for an invoice, additional records will be generated in this table for each payment.

I mentioned above that “most times payments are from customers, but there are exceptions”. An example of an exception is “payment from a bank for interest earned”.  The payment is not from a customer and it’s not for goods/services provided.  These types of payments are recorded in AR_MISC_CASH_DISTRIBUTIONS.

These transactions generate accounting entries that eventually make their way to the General Ledger (GL) when the Receivables Transfer to GL program is run. The accounting transactions are generated in RA_CUST_TRX_LINE_GL_DIST and AR_RECEIVABLE_APPLICATIONS.