Category: Data Science

Data Cleaning methods

Data cleaning is an essential step in the data preprocessing pipeline when preparing data for analytics or data science. It involves identifying and then correcting or removing errors, inconsistencies, and inaccuracies in a dataset to improve its quality and reliability. Data should be cleaned before it is used in analysis, reporting, development, or integration. Here are some common data cleaning methods:

Handling missing values:

  • Delete rows or columns with a high percentage of missing values if they don’t contribute significantly to the analysis.
  • Impute missing values by replacing them with a statistical measure such as mean, median, mode, or using more advanced techniques like regression imputation or k-nearest neighbors imputation.
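As a minimal sketch of these options (using Python's pandas library, with a made-up DataFrame and hypothetical column names), one might drop mostly-empty columns and impute the remaining gaps:

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 34, 41, np.nan],
    "income": [48000, 52000, np.nan, 61000, 58000],
})

# Drop columns that have fewer than 50% non-missing values
df = df.dropna(axis=1, thresh=int(len(df) * 0.5))

# Impute the remaining gaps with a simple statistic (median and mean here)
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())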

Handling categorical variables:

  • Encode categorical variables into numerical representations using techniques like one-hot encoding, label encoding, or target encoding.
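For example (a pandas sketch with a hypothetical "color" column), one-hot encoding and label encoding might look like this:

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map each category to an integer code
df["color_code"] = df["color"].astype("category").cat.codes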

Removing duplicates:

  • Identify and remove duplicate records based on one or more key variables.
  • Be cautious when removing duplicates, as sometimes duplicated entries may be valid and intentional.
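A minimal pandas sketch (with a hypothetical key column) for inspecting and then removing duplicates:

import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "email": ["a@example.com", "b@example.com", "b@example.com", "c@example.com"],
})

# Inspect which rows are duplicated on the key before dropping anything
print(df.duplicated(subset=["customer_id"]))

# Keep the first occurrence of each key and remove the rest
deduped = df.drop_duplicates(subset=["customer_id"], keep="first")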

Handling outliers:

  • Identify outliers using statistical methods like z-scores, box plots, or domain knowledge.
  • Decide whether to remove outliers or transform them based on the nature of the data and the analysis goals.
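As an illustration (a pandas/NumPy sketch with made-up numbers), outliers can be flagged with z-scores and then removed or otherwise treated:

import numpy as np
import pandas as pd

df = pd.DataFrame({"sales": [200, 220, 210, 195, 205, 5000]})

# Compute z-scores: how many standard deviations each value is from the mean
z = (df["sales"] - df["sales"].mean()) / df["sales"].std()

# Flag values more than 2 standard deviations away (3 is also a common cutoff)
outliers = df[np.abs(z) > 2]

# One option: keep only the non-flagged rows
cleaned = df[np.abs(z) <= 2]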

Correcting inconsistent data:

  • Standardize data formats: Convert data into a consistent format (e.g., converting dates to a specific format).
  • Resolve inconsistencies: Identify and correct inconsistent values (e.g., correcting misspelled words, merging similar categories).
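For example (a pandas sketch with hypothetical values), string dates can be converted to a single datetime type and variant spellings of the same category merged:

import pandas as pd

df = pd.DataFrame({
    "order_date": ["05/03/2021", "07/03/2021", "09/03/2021"],
    "country": ["USA", "U.S.A.", "United States"],
})

# Standardize string dates in a known format into a proper datetime type
df["order_date"] = pd.to_datetime(df["order_date"], format="%d/%m/%Y")

# Merge different spellings of the same category into one consistent value
df["country"] = df["country"].replace({"U.S.A.": "USA", "United States": "USA"})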

Dealing with irrelevant or redundant features:

  • Remove irrelevant features that do not contribute to the analysis or prediction task.
  • Identify and handle redundant features that provide similar information to avoid multicollinearity issues.
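One simple way to spot redundant numerical features (a pandas sketch with made-up columns) is to look at the correlation matrix and drop near-duplicates:

import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 180, 165, 175],
    "height_in": [66.9, 70.9, 65.0, 68.9],  # carries the same information as height_cm
    "weight_kg": [70, 82, 60, 77],
})

# Highly correlated pairs of features are candidates for removal
print(df.corr())

# Drop a feature judged redundant (or irrelevant) for the analysis
df = df.drop(columns=["height_in"])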

Data normalization or scaling:

  • Normalize numerical features to a common scale (e.g., min-max scaling or z-score normalization) to prevent certain features from dominating the analysis due to their larger magnitudes.
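Both scalings can be done directly in pandas (a minimal sketch with a hypothetical column):

import pandas as pd

df = pd.DataFrame({"income": [30000, 45000, 60000, 120000]})

# Min-max scaling: rescale values into the 0-1 range
df["income_minmax"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Z-score normalization: center on the mean and scale by the standard deviation
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()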

Data integrity issues:

Finally, you need to address data integrity issues.

  • Check for data integrity problems such as inconsistent data types, incorrect data ranges, or violations of business rules.
  • Resolve integrity issues by correcting or removing problematic data.
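A couple of simple integrity checks in pandas (a sketch; the column names and the business rule shown are hypothetical):

import pandas as pd

df = pd.DataFrame({"age": [25, -3, 41], "status": ["active", "active", "closed"]})

# Check that each column has the expected data type
print(df.dtypes)

# Flag rows that violate a simple business rule (age must be between 0 and 120)
bad_rows = df[(df["age"] < 0) | (df["age"] > 120)]
print(bad_rows)

# One option: remove the problematic rows
df = df[(df["age"] >= 0) & (df["age"] <= 120)]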

It’s important to note that the specific data cleaning methods that need to be applied to a dataset will vary depending on the nature of the dataset, the analysis goals, and domain knowledge. It’s recommended to thoroughly understand the data and consult with domain experts when preparing to perform data cleaning tasks.

Python Libraries for Data Science

Python has grown quickly to become one of the most widely used programming languages. While it’s a powerful, multi-purpose language used for creating just about any type of application, it has become a go-to language for data science, rivaling even “R”, the longtime favorite language and platform for data science.

Python’s popularity for data-based solutions has grown because of the many powerful, open-source, data-centric libraries available for it. Some of these libraries include:

NumPy

NumPy is a library for creating and manipulating multi-dimensional data arrays and for performing complex mathematical operations on them.

Pandas

Pandas is a library that provides easy-to-use but high-performance data structures, such as the DataFrame, and data analysis tools.

Matplotlib

Matplotlib is a library used for data visualization, such as creating histograms, bar charts, scatter plots, and much more.

SciPy

SciPy is a library that provides integration, statistics, and linear algebra packages for numerical computations.

Scikit-learn

Scikit-learn is a library used for machine learning. It is built on top of other libraries, including NumPy, SciPy, and Matplotlib.
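As a small illustration of how these libraries are typically used together (a sketch with synthetic data, not from any real project):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy linear relationship between x and y
x = np.arange(1, 11)
df = pd.DataFrame({"x": x, "y": 2 * x + np.random.normal(0, 1, size=10)})

# Fit a simple linear model with scikit-learn
model = LinearRegression().fit(df[["x"]], df["y"])

# Plot the data points and the fitted line with Matplotlib
plt.scatter(df["x"], df["y"])
plt.plot(df["x"], model.predict(df[["x"]]))
plt.show()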

There are many other data-centric Python libraries and some will be introduced in future articles. More can be learned here: https://www.python.org/

What is data analytics? And what are the different types of data analytics?

Data analytics is the overall process of capturing and using data to produce meaningful information, including metrics and trends, that can be used to better understand events and help make better decisions. Usually the goal is to improve the efficiency and outcomes of an operation, such as a business, a political campaign, or even an individual (such as an athlete). There are four (4) prevalent types of data analytics – descriptive, predictive, diagnostic, and prescriptive.

  1. Descriptive analytics – provides information about “what has happened”. Examples of questions answered by descriptive analytics include: How much are our sales this month, and what is our year-over-year sales increase? How many website visitors did we have, and how many of them signed up?
  2. Predictive analytics – provides insight into “what may happen” in the future based on the past. Examples of questions answered by predictive analytics include: Based on previous customer service call patterns and outcomes, what is the likelihood of a customer switching to another provider? Based on a customer’s profile, how much should we charge them for insurance?
  3. Diagnostic analytics – provides information to explain “why something happened”. In addition to the direct data, this may also involve more indirect or macro data sources, such as weather data, local or national economic data, or competitor data. It may also involve forming logical theories about the correlation of events. Examples of questions answered by diagnostic analytics include: How effective was the marketing blitz, and which channel had the most impact? Did the weather affect sales, or was it the price increase?
  4. Prescriptive analytics – provides insight into “what to do to make something happen”. Examples of questions answered by prescriptive analytics include: Based on the results of our test marketing blitz campaign, if we roll out the full campaign with adjustments to the channel spread, how many additional temporary customer service staff will we need to handle the increased volume without long wait times?
The four (4) types of data analytics

Descriptive analytics is the simplest and most common form of analytics used in organizations and is widely referred to as Business Intelligence (BI). There is widespread interest in predictive analytics but less than 50% of companies currently use it as it requires additional, more expensive skills. Diagnostic and prescriptive analytics have always been around because companies have always used information from descriptive analytics to hypothesize “why things happened” and make decisions on “what to do”. But it’s the automation of these types through new methods and the integration of more data inputs that is fairly new. The latter three forms are sometimes called Advanced Analytics or Data Science.

All types of analytics require some form of data integration and use some of the same data in an environment. However, while descriptive analytics only needs data from the time periods being analyzed, and usually from a narrower data set, predictive, prescriptive, and diagnostic analytics produce better results using as much data as is available, from a wider timeframe and a broader set of sources. The types also overlap: the analysis of “what may happen” is driven by “what has happened” in the past and “why it happened”, and determining “what to do” is driven by “what has happened”, “why it happened”, and “what may happen”. Companies at the forefront of data analytics tend to use all four types.

Salesforce Einstein Bots

What is a Salesforce Einstein Bot?

According to Salesforce, a bot is “a computer program which conducts a conversation via auditory or textual methods.”

So, before we get further into what a bot is, let’s first look at the platform they are created on: Salesforce’s Einstein Analytics.

Salesforce’s Einstein Analytics provides impressive mechanisms that help organizations and their users on the Salesforce platform connect, communicate, and interpret customer needs. By applying elements of artificial intelligence, data mining, and predictive analytics, Salesforce users can get deeper insights into their customers’ data and begin to build an improved base of knowledge about their business. With an underlying engine tuned for performance and a presentation layer that can display key details or high-level metrics on dashboards, Einstein Analytics is the next step in reporting on the health of your sales pipeline, exposing opportunities, and providing suggestions that help you identify and visualize growth that aligns with your business.

Now, back to bots …

Basically, a bot is a means of facilitating communication between humans and computers, with either voice or text, and then executing an action tied to the input provided. Bots can learn over time to interact with humans by leveraging Salesforce’s Einstein Analytics platform and the data that resides in your Salesforce org, and they respond using natural language processing. According to Wikipedia, “Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.”

Why are Salesforce Einstein Bots important?

Organizations are creating and implementing bots at an ever-increasing rate. By creating and implementing a bot, an organization can begin to get a handle on support cases, resolving many of them very quickly and, in some scenarios, eliminating the need to open a case at all.

Of course, a bot isn’t intended to supplant interaction with a human. However, bots can be leveraged to provide a decision path for customers and to route customers’ requests quickly based on their general needs, all while providing a positive initial reception that can augment your current customer service model.

Not only can bots improve agent productivity by freeing agents from spending time on some of the simpler, more frequent requests, but they also allow agents to focus on more time-consuming, complex issues.

Bots can, in a sense, also be considered another channel for content. However, instead of formulating questions from scratch, organizations should try to marry their current content to bot questions. Reusing content is good, but it should rely on content that is based on existing knowledge. This reuse of in-house documentation and materials will ultimately bring development costs down, leading toward a more uniform experience and a higher-quality interaction.

How to configure the platform for Salesforce Einstein Bots?

Before you jump in and start creating bots, you would be best served by allocating time to plan your bot and to consider how it will interact with your customers.

A good start is to collaborate with agents and solicit their feedback about the issues they see with customers, which are potential areas a bot could address.

Think about the bot’s persona: what its name should be and how you would like it to convey and reinforce a consistent image of the company overall.

Decisions related to which channels to use, the ways in which customers can enter their questions, which licenses are required, which profile to use, whether to provide a menu, what is not in scope for the bot, and so on, should all be worked out in advance of bot development.

At what point does a human need to take over from the bot’s interaction with the customer, if at all?

In Salesforce, you will need a Service Cloud license and a Chat or Messaging license. Once those are obtained, you will need to turn on Lightning Experience, and there is a guided setup flow for Chat that you will need to run through. If your organization has Knowledge articles you want to make available to customers through the bot, Knowledge will need to be enabled as well. In your Salesforce org, go to Setup and type Einstein Bots in the Quick Find box to bring up the Einstein Bots page you can click on. Then, under Settings, there is a toggle to enable Einstein Bots.

When ready, make an Embedded Chat button available on your published Salesforce site or community site for your customers to interact with. A Salesforce community site is preferred.

Check out the free training at https://trailhead.salesforce.com/en/home to find out more about how to create bots within Salesforce.

Things to consider when maintaining Salesforce Einstein Bots.

Salesforce documentation indicates that the following items should also be considered when planning bot creation:

  • Chat and Messaging licenses support different channels (such as SMS or Facebook Messenger) and might have different requirements.
  • Each org is provided 25 Einstein Bots conversations per month for each user with an active subscription.
  • To make full use of the Einstein Bots Performance page, obtain the Service Analytics App.

Learning Hadoop: The key features and benefits of Hadoop

What are the key features and benefits of Hadoop? Why is Hadoop such a successful platform?

Apache Hadoop, usually just called Hadoop, is a software framework and platform for reading, processing, storing, and analyzing very large amounts of data. Several features of Hadoop make it a very powerful solution for data analytics.

Hadoop is Distributed

With Hadoop, anywhere from a few to hundreds or thousands of commodity servers (called nodes) can be connected (forming a cluster) to work together to achieve whatever processing power and storage capability is needed. The software platform enables the nodes to work together, passing work and data between them. Data and processing are distributed across nodes, which spreads the load and significantly reduces the impact of a failure.

Hadoop is Scalable

In the past, to achieve extremely powerful computing, a company would have to buy very expensive, large, monolithic computers. As data growth exploded, eventually even those supercomputers would become insufficient. With Hadoop, a cluster can start with a few commodity servers and grow to hundreds or thousands as more processing power and storage capability is needed. This allows a company or project to start out small and then grow inexpensively as needed, without worrying about hitting a hard limit.

Hadoop is Fault Tolerant

Hadoop was designed and built around the fact that there will be frequent failures on the commodity hardware servers that make up the Hadoop cluster. When a failure occurs, the software handles the automatic reassignment of work and replication of data to other nodes in the cluster, and the system continues to function properly without manual intervention. When a node recovers, from a reboot for example, it will rejoin the cluster automatically and become available for work.

Hadoop is backed by the power of Open Source

Hadoop is open-source software, which means it can be downloaded, installed, used, and even modified for free. It is managed by the renowned non-profit Apache Software Foundation (ASF), hence the name Apache Hadoop. The foundation is made up of many brilliant people from all over the world, many of whom work at top technology companies, who commit their time to managing the software. In addition, many developers contribute code to enhance Hadoop, add new features and functionality, or build new tools that work with Hadoop. The various tools that have been built over the years to complement core Hadoop make up what is called the Hadoop ecosystem. With a large community of people from all over the world continuously adding to the growth of the Hadoop ecosystem in a well-managed way, it will only get better and become useful to many more use cases.

These are the reasons Hadoop has become such a force within the data world. Although there is some hype around the big data phenomenon, the benefits and solutions based on the Hadoop ecosystem are real.

You can learn more at https://hadoop.apache.org

Creating a Business Intelligence (BI) & Analytics Strategy and Roadmap

This post provides some of my thoughts on how to go about creating a Business Intelligence (BI) & Analytics Strategy and Roadmap for your client or company. Please comment with suggestions from your own experience for improving this information.

 

When creating or updating the BI & Analytics Strategy and Roadmap for a company, one of the first things to understand is:

Who are all the critical stakeholders that need to be involved?

Understanding who needs and uses the BI & Analytics systems is critical for starting the process of understanding and documenting the “who needs what, why, and when”.

These are some of the roles that are typically important stakeholders:

  • High-level business executives who are paying for the projects
  • Business directors involved in the usage of the systems
  • IT directors involved in the development and support of the systems
  • Business Subject Matter Experts (SMEs) & Business Analysts
  • BI/Analytics/Data/System Architects
  • BI/Analytics/Data/System Developers and Administrators

 

Then, you need to ask all these stakeholders, especially those from the business:

What are the drivers for BI & Analytics? And what is the level of importance for each of these drivers?

This will help you understand and document the business needs that are creating demand for new or modified BI & Analytics solutions. You should then go deeper to understand the business objectives and goals that are driving these needs. This will help you understand and document the bigger picture so that a more comprehensive strategy and roadmap can be created.

The questions and discussions surrounding the above will require deep and broad business involvement. Getting the perspective of a wide range of users from all business areas that are using the BI & Analytics Systems is critical.  The business should be involved throughout the process of creating the strategy and roadmap, and all decisions should tie back to support for business objectives and goals. And the trail leading to all these decisions must be documented.

Some examples of business drivers include:

  • Gain more insight into who our best customers are and how best to acquire them.
  • Understand how weather affects our sales/revenue.
  • Determine how we can sell more to our existing customers.
  • Understand what causes employee turnover.
  • Gain insight into how we can improve staffing schedules.

 

And examples of business objectives and goals may include things like:

  • Increase corporate revenues by 10%
  • Grow our base of recurring customers
  • Stabilize corporate revenues over all seasons
  • Create an environment where employees love to work
  • Reduce payroll costs without a reduction in staff (for example, by reducing turnover)

 

Then, turn to understanding and documenting the current scenario (if it is not already known). Identify what systems (including data sources) are in place, who is using them (and why and how), what capabilities they offer, what the must-haves are, and what the pain points and positive highlights are.

Also, you will need to determine the current workload (and future workload if it can be determined) of the primary team members involved in developing, testing, and implementing BI & Analytics solutions.

This will help you understand a few things:

  • Some of the highest priority needs of the users
  • Gaps in capabilities and data between what is needed and what is currently in place (including an understanding of what is liked and disliked about the current systems)
  • Current user base knowledge and engagement
  • IT knowledge and skills
  • Resource availability – when people are available to work on new initiatives

 

What are the options and limitations?

  • Can existing systems be customized to meet the requirements?
  • Can they be upgraded to a new version that has the needed functionality?
  • Do we need to consider adding a new platform or replacing one or more of the existing systems with a new platform?
  • Can we migrate from/integrate one system to/with another system that we already have up and running?
  • Are any of our current systems losing vendor support, or do any require an upgrade for other reasons? Has the pricing changed for any of our software applications?
  • What options does our budget permit us to explore?
  • What options do our knowledge and skills permit us to explore?

 

Once you have identified these items …

  • Identify and engage stakeholders, and document these roles and the people
  • Identify and document business drivers, objectives and goals
  • Understand and document the current landscape – needs (including must-haves), technology, gaps, users, IT staff, resource availability, and more
  • Identify and document options – based on current landscape, technology, budget, staff resources, etc.

… you can develop a “living” Strategy and Roadmap for BI & Analytics. And when I say “living”, I mean it will not be a static document; it will be fine-tuned over time as new information emerges and as changes arise in business needs, technology, and staff resources.

 

Your Strategy and Roadmap for BI & Analytics should include, but not be limited to, the following:

  • BI & Analytics that will be used to satisfy business drivers, objectives and goals
  • Data acquisition and storage plan for meeting the analytics needs
  • Technology platforms that will be used to process and store data, and deliver the analytics
  • Information about any new technologies that need to be acquired or implemented, and schedules
  • Roles and Responsibilities for all stakeholders involved in BI & Analytics projects
  • Planned staffing allocations and schedules
  • Planned staffing changes and schedules
  • User training (business users) and Delivery team training (technical implementers & developers for example)
  • Dependencies for each item or set of items

Working with built-in datasets in R

This is a quick cheat sheet of commands for working with built-in datasets in R.

R has a number of base datasets that come with the install, and there are many packages that also include additional datasets.  These are very useful and easy to work with.

 

> ?datasets                                         # get help info on the datasets package

> library(help = "datasets")                        # detailed information on the datasets package, including a listing and description of its datasets

> data()                                            # lists all the datasets currently available, by package

> data(package = .packages(all.available = TRUE))   # lists all available datasets by package, even for packages that are not currently loaded

> data(EuStockMarkets)                              # loads the dataset EuStockMarkets into the workspace

> summary(AirPassengers)                            # provides a summary of the dataset AirPassengers

> ?airquality                                       # outputs information about the dataset "airquality" in the Help window

> str(airmiles)                                     # shows the structure of the dataset "airmiles"

> View(airmiles)                                    # shows the data of the dataset "airmiles" in spreadsheet format

 

 

Statistics Basics – Descriptive vs Inferential Statistics

Descriptive Statistics
Statistics that quantitatively describe an observed data set. Analysis for descriptive statistics is performed on, and conclusions are drawn from, the observed data only; it does not take into account any larger population of data.

Inferential Statistics
Statistics that make inferences about a larger population of data based on the observed data set. Analysis for inferential statistics takes into account that the observed data is taken from a larger population of data, and infers or predicts characteristics about the population.
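As a small illustration of the difference (a Python sketch with a made-up sample), descriptive statistics summarize only the observed values, while inferential statistics use those values to estimate something about the larger population, here a confidence interval for the population mean:

import numpy as np
from scipy import stats

sample = np.array([4.1, 3.8, 5.0, 4.4, 4.7, 3.9, 4.3])

# Descriptive: summarize the observed data only
print("sample mean:", sample.mean())
print("sample standard deviation:", sample.std(ddof=1))

# Inferential: estimate the unknown population mean with a 95% confidence interval
ci = stats.t.interval(0.95, len(sample) - 1, loc=sample.mean(), scale=stats.sem(sample))
print("95% confidence interval for the population mean:", ci)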

Statistics Basics – Measures of Central Tendency & Measures of Variability

Measures of Central Tendency and Measures of Variability are frequently used in data analysis.  This post provides simple definitions of the common measures.

 

Measures of Central Tendency

Mean / Average – sum of all data points or observations in a dataset divided by the total number of data points or observations in the dataset.

The mean or average of this dataset with 5 numbers {2, 4, 6, 8, 10} is: 6

Sum of all data points: 2 + 4 + 6 + 8 + 10 = 30
Number of data points: 5
Mean = 30 / 5 = 6

Median – with the values (data points) in the dataset listed in increasing (ascending) order, the median is the midpoint of the values, such that there are an equal number of data points above and below the median. If there is an odd number of data points in the dataset, the median is the single midpoint value. If there is an even number of data points, the median is the mean/average of the two midpoint values.

The median of the same dataset {2, 4, 6, 8, 10} is:  6
This dataset has an odd number of data points (5), and the middle data point is the value 6, with 2 numbers below (2, 4) and 2 numbers above (8, 10).

Using an example of a dataset with an even number of data points:
The median of this dataset {2, 4, 6, 8, 10, 12} is: (6 + 8) / 2 = 7
Since there are 2 middle data points (6, 8), we need to calculate the mean of those 2 numbers to determine the median.

Mode – the data point that appears the most times in the dataset.

Using our original dataset {2, 4, 6, 8, 10}: since each of the values appears only once, with none appearing more often than the others, this dataset does not have a mode.

Using a new dataset {2, 2, 4, 4, 4, 4, 6, 8, 8, 8, 10}, the Mode in this case is: 4
4 is the value that appears the most times in the dataset.
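These measures are easy to verify in code; a minimal Python sketch using the example datasets above and the standard library's statistics module:

import statistics

data = [2, 4, 6, 8, 10]
print(statistics.mean(data))     # 6
print(statistics.median(data))   # 6

print(statistics.median([2, 4, 6, 8, 10, 12]))   # 7.0 (average of the two middle values)

print(statistics.mode([2, 2, 4, 4, 4, 4, 6, 8, 8, 8, 10]))   # 4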

Measures of Variability

Min – the minimum value of all the values in the dataset.
Min {2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9} is 1.

Max – the maximum value of all the values in the dataset.
Max {2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9} is 9.

Variance – a calculated value that quantifies how close or how dispersed the values in the dataset are to/from their average/mean value.  It is the average of the squared differences from the mean.

Variance of {2, 3, 4, 5, 6} is calculated as follows …

First find the Mean.  Mean = (2 + 3 + 4 + 5 + 6) / 5 = 4

Then, find the Squared Differences from the Mean … where ^2 means squared …
(2 – 4)^2 = (-2)^2 = 4
(3 – 4)^2 = (-1)^2 = 1
(4 – 4)^2 = (0)^2 = 0
(5 – 4)^2 = (1)^2 = 1
(6 – 4)^2 = (2)^2 = 4
Average of Squared Differences: (4 + 1 + 0 + 1 + 4) / 5 = 2

Standard Deviation – a calculated value that quantifies how close or how dispersed the values in the dataset are to/from their mean, expressed in the same units as the data.  It is the square root of the Variance (defined above).

For the above dataset, Standard Deviation {2, 3, 4, 5, 6} = Square Root (2) ≈ 1.414

Kurtosis – a calculated value that represents how heavy the tails of the dataset’s distribution are compared to the tails of a normal distribution*.

Skewness – a calculated value that represents how asymmetric the distribution of the dataset is compared to the symmetry of a normal distribution*.

* A normal distribution, also known as the bell curve, is a probability distribution in which most values are toward the center (closer to the average) and fewer and fewer observations occur as you move further from the center.

Range – the difference between the largest number in the dataset and the smallest number in the dataset.
Range {2, 4, 6, 8, 10} = 10 – 2 = 8
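Similarly, the measures of variability can be checked with a short Python sketch (NumPy and SciPy; np.var and np.std default to the population versions, matching the worked example above):

import numpy as np
from scipy import stats

data = [2, 3, 4, 5, 6]

print(np.var(data))    # 2.0 (population variance, matching the worked example)
print(np.std(data))    # ~1.414 (population standard deviation, the square root of the variance)

print(stats.skew(data))        # 0.0 for this symmetric dataset
print(stats.kurtosis(data))    # excess kurtosis (0 for a normal distribution)

data2 = [2, 4, 6, 8, 10]
print(min(data2), max(data2), max(data2) - min(data2))   # min, max, and range (8)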

 

Thanks for reading!

 

Updating R packages

In this post, I cover updating R packages in RStudio. You have a few options for doing this.

(Option 1) From the menu, select Tools -> “Check for Package Updates”

If all packages are already up-to-date, you will get a message telling you so.

(Option 2) From the Packages tab in the bottom-right pane, click on “Update”

(Option 3) Execute the command update.packages()


For each package that has an update available, you will be prompted in the console to confirm the update.

Enter “y” to update, “N” to not update, and “c” to cancel.

Thanks for reading!