Category: Data Science

Exploring the RStudio Interface

In this post we will explore the RStudio interface.  This is where you will manage your R environment, issue commands for processing and analyzing data, create scripts, view results, and much more.  Below is an image of the default RStudio interface.

RStudio_Environment

On the left:
Console – the window where you enter commands, and where output is displayed.

On the top-right:
Environment tab – shows the variables and values created through the console
History tab – shows the history of past executed commands

On the bottom-right:
Files tab – displays folders and files from the file system, from which you can select files, set working directory, create folders, copy and move folders and files, and more.
Plots tab – displays the plots that have been created and lets you export them.
Packages tab – displays all the packages currently installed and available.  Loaded packages have their checkbox checked; a package must be loaded before it can be used.
Help tab – useful for getting help about R and R packages.  Keyword search is available, which can be very helpful when you don’t know exactly what you are looking for.
Viewer tab – can be used to view local web content, such as static HTML files written to the session temporary directory or a locally run web application.
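Several of these tabs have rough console equivalents. The following is a minimal sketch using only base R functions; the variable name sales is made up for illustration, and the mapping to tabs is approximate rather than exact.

```r
# A made-up variable; once created, it is listed in the Environment tab.
sales <- c(120, 95, 143)

# Environment tab: names of the objects in the current workspace.
workspace_objects <- ls()
print(workspace_objects)

# Files tab: the current working directory and the files it contains.
print(getwd())
print(list.files())

# Packages tab: packages that are currently attached (checked in the list).
print(search())

# Help tab: open the documentation for a function, or search by keyword.
# help("mean")
# help.search("standard deviation")
```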

On the top-left (when a script is created or opened):
Script pane and tabs – When you create or open an R script, a new pane is created in the top-left of the application window, and the Console pane shifts down to the bottom-left area.

RStudio_Environment_with_Script_pane

A new Script tab will open in this pane for each new script opened or created. From this window, you will be able to run your script line by line or in its entirety, among many other functions.
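Running a script in its entirety is roughly what the base R function source() does from the Console. The sketch below writes a tiny two-line script to a temporary file and runs it; the file contents and the variable names x and total are invented for the example.

```r
# Create a small throwaway script in the session's temporary directory.
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10", "total <- sum(x)"), script_path)

# Run every line of the script, in order, in the current session.
source(script_path)

# Objects created by the script are now available in the workspace.
print(total)
```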

Thanks for reading!

Installing, Loading, Unloading, and Removing R Packages in RStudio

R has thousands of packages available for statistics and data analytics, but before you can use them, they need to be installed.  In this post I cover installing, loading, unloading, and removing R packages in RStudio. In these examples, I use the ggplot2 package – a popular graphics and visualization package in R.  Wherever you see ggplot2 in the examples below, you can replace it with the package you want to perform these actions on.

To install a package via the User Interface

In RStudio, select Tools -> Install Packages from the main menu, or click Install in the Packages tab on the bottom-right.
R_Installing_Packages_ToolsMenu

The Install Packages dialog appears.
R_Installing_Packages_ToolsMenu_InstallDialog

Start typing the name of the package you want to install, and a list of all packages that start with the letters you have typed will show up in the selection list.
R_Installing_Packages_ToolsMenu_InstallDialog_PartialNameFind

Select (or type the full name of) the package you want to install, ensure that “Install dependencies” is checked, and click “Install”.
The install.packages() statement will be automatically entered and run in the Console, as shown below.
R_Installing_Packages_ToolsMenu_Output

And the output will show if the package is successfully installed.
R_Installing_Packages_ToolsMenu_Output2

At this point, you will be able to see the package in the list in the Packages tab on the right.
R_Installing_Packages_ToolsMenu_Output3

To install a package via script
Instead of using the user interface (menu), you can also install packages directly via script.

install.packages("ggplot2")

See script statement below.  And as before, the package shows in the Packages tab.
R_Installing_Packages_Script
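In a script that is run repeatedly, it can be useful to install a package only when it is missing. The helper below is a hypothetical convenience function (install_if_missing is not part of R or RStudio); it relies on requireNamespace() to check availability, which also loads the package's namespace as a side effect.

```r
# Install a package only when it is not already available.
# install_if_missing() is a hypothetical helper written for this example.
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  install.packages(pkg)
  invisible(TRUE)             # an installation was attempted
}

# "stats" ships with R, so this call does not download anything.
installed_now <- install_if_missing("stats")
print(installed_now)

# For a CRAN package the call would look like this:
# install_if_missing("ggplot2")
```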

After a package has been installed, it needs to be loaded before you can use it.

Loading and Unloading a package (via user interface or script):

To load a package you can simply check the checkbox beside the package name in the Packages tab – as shown by the yellow box highlight below.  This will automatically enter and execute the command shown with the yellow arrow.

Or you can enter the script, by entering the command as shown with the green arrow:

library("ggplot2")

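As an alternative, the require(ggplot2) command will also load the package.  Unlike library(), which stops with an error when the package is not installed, require() returns FALSE and issues a warning.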
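The practical difference between the two commands shows up when a package is not installed. The sketch below uses a deliberately fake package name (an assumption: no package with this name exists on your system) so that both calls fail in their own way.

```r
# A package name that is assumed not to exist on any system.
missing_pkg <- "notARealPackage12345"

# require() reports failure through its return value (plus a warning).
require_result <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)
print(require_result)

# library() signals an error instead, which tryCatch() can intercept.
library_result <- tryCatch(
  {
    library(missing_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
print(library_result)
```

Because require() does not stop the script, library() is usually the safer choice at the top of a script: a missing package is reported immediately rather than several lines later.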

To unload the package, you can simply uncheck the checkbox beside the package name in the Packages tab, or enter the command shown by the red arrow:

detach("package:ggplot2", unload=TRUE)

R_Loading_Unloading_Packages

You will notice that after running the detach command, the package checkbox is unchecked.
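The checkbox state can also be checked from the Console, since attached packages appear in the output of search(). This sketch uses the splines package, which ships with R, instead of ggplot2 so that it runs without installing anything; is_attached is a hypothetical helper name.

```r
# TRUE when the package is on the search path (checkbox checked).
# is_attached() is a hypothetical helper written for this example.
is_attached <- function(pkg) paste0("package:", pkg) %in% search()

# Load the package, then confirm that it is attached.
library("splines")
attached_after_library <- is_attached("splines")
print(attached_after_library)

# Unload it again, then confirm that it is no longer attached.
detach("package:splines", unload = TRUE)
attached_after_detach <- is_attached("splines")
print(attached_after_detach)
```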

To remove (uninstall) a package (via user interface or script):

To remove a package, you can simply click the “x” icon shown to the right of the package in the Packages window. See the yellow box highlight beside ggplot2 below.

Or you can run the script command as shown below in the Console window.

remove.packages("ggplot2")

R_Removing_Packages_Script_or_GUI

The image below shows the output after removing a package.  You will notice that the package (ggplot2 in this example) is no longer in the Packages list on the right-hand side.
R_Removing_Packages_Output

An advantage of using the script option instead of the user interface methods to perform the above actions is that you will have a history of what you have done.

Thanks for reading!

Installing RStudio on Windows

RStudio is an open-source integrated development environment (IDE) for R. It also has commercial versions with expanded capabilities available (at a cost).  It runs on the desktop on multiple operating systems, or in a browser connected to an RStudio server.  RStudio includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging, and workspace management. This post covers installing RStudio.  Note that R needs to be installed first – see this post for installing R.

To get started, go to http://www.rstudio.com.

Under RStudio, click Download.
RStudio_download

Choose the desired version of RStudio.  You will likely want the “RStudio Desktop (Open Source License)” version. On the same page, you will be able to read about the various options available – free and pay versions.
RStudio_choose_version

This takes you further down the page to the installers.  Choose the installer that is appropriate for you.  In this example, we are installing on Windows, and so we chose the “RStudio 1.0.153 – Windows Vista/7/8/10” version.  Note: RStudio requires that R is installed.  If you have not already installed R, do so first (see this post for Installing R).
RStudio_choose_installer

After the download is complete, run the exe by double-clicking on it.
RStudio_install_run_exe

Click Next at the Welcome screen.
RStudio_install_welcome_1

Choose the install directory, click Next
RStudio_install_location_2

Choose a Start Menu Folder, click Install
RStudio_install_start_menu_folder_3

Installing …
RStudio_install_installing_4

Complete the installation.
RStudio_install_complete_5

Run RStudio
RStudio_install_RStudio_icon

RStudio IDE
RStudio_install_run_RStudio

You will notice that the left window “Console” is the same as the “R Console” window in the stand-alone R installation.  This is because RStudio is built on top of R.

Good luck on your R journey!

Installing R on Windows

R is an open-source software platform for data manipulation, statistical computing, calculation, analytics, and graphics.  It provides a wide variety of statistical/mathematical and high-quality graphical capabilities.  Some of the statistical capabilities include linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more.
You will find R useful in Analytics and Business Intelligence environments where data needs to be analyzed to uncover patterns, improve understanding, and help make predictions and decisions.
In this post, we cover the installation of R.

To get started, go to http://www.r-project.org.

Click the “download R” link (underlined in yellow below).
rprojectorg

Choose the CRAN (Comprehensive R Archive Network) mirror location closest to you.
ChooseBestLocation

Choose the version for your computer’s operating system (OS). In this example, we are installing on Windows – so we chose “Download R for Windows”.
R_ChooseOS

Assuming this is your first install, click “base” or “Install R for the first time”.
R_Install

Then, click “Download R 3.4.1 for Windows” (or whatever the appropriate version is at the time)
R_download

After the download is complete, go to the download directory, and double-click the R exe to run it.
R_run_exe

Choose your language
R_install_lang

Click Next
R_install_welcome_2

Review the license agreement, click Next
R_install_license_3

Accept the default directory or enter/select a new one.
R_install_dir_4

Select the components you want.  If your PC is 32-bit, then unselect 64-bit if it is shown as an option.  If your PC is 64-bit, you can install both 32-bit and 64-bit (default) or choose one of them.
R_install_components_5

Choose No and click Next (unless you want to customize the startup options for R, but this can be done later)
R_install_startup_6

Click Next
R_install_start_menu_folder_7

Choose Icon and Registry options
R_install_additional_8

Installing
R_install_installing_9

Click Finish to complete the installation
R_install_complete_10

Desktop and Quick Launch icons
R_install_desktop_icons     R_install_quicklaunch_icons

Run R.
R_install_runR
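Once the R Console is open, a few commands are enough to confirm that the installation works. This is a small sanity check rather than part of the installer steps; the heights_cm values are made up.

```r
# Show which version of R is running.
print(R.version.string)

# R works as a calculator: evaluate a simple expression.
result <- 1 + 1
print(result)

# Create a numeric vector (made-up values) and compute its average.
heights_cm <- c(160, 172, 181)
average_height <- mean(heights_cm)
print(average_height)
```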

Next, we’ll cover installing RStudio.

Good luck on your R journey.

Data Science Fundamentals: Matching

This is a continuation of a series of Data Science Fundamentals posts.  In this post I will briefly describe Matching.

Matching, also known as Similarity Matching, is a technique of using data about objects to identify “like” objects. For example, Amazon or Walmart may use matching to identify “like” customers based on their browsing, liking, and purchasing history.

This information can then be used to provide product recommendations to these customers.

matching-recommendations
Product recommendations based on browsing and purchase history, and similarity matching
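One common way to measure how alike two customers are is cosine similarity between their purchase counts. The sketch below is only an illustration: the customers, categories, and counts are invented, cosine similarity is one choice among many similarity measures, and cosine_similarity and most_similar are hypothetical helper names.

```r
# Made-up purchase counts per product category for three customers.
purchases <- rbind(
  alice = c(books = 5, games = 0, garden = 1),
  bob   = c(books = 4, games = 1, garden = 0),
  cara  = c(books = 0, games = 6, garden = 0)
)

# Cosine similarity: 1 means identical buying pattern, 0 means no overlap.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Return the name of the customer most similar to the target customer.
most_similar <- function(target, data) {
  others <- setdiff(rownames(data), target)
  scores <- sapply(others, function(other) {
    cosine_similarity(data[target, ], data[other, ])
  })
  names(which.max(scores))
}

match_for_alice <- most_similar("alice", purchases)
print(match_for_alice)
```

Products bought by the matched customer but not yet by the target customer are natural candidates for recommendations.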

The results of Matching can be used for Classification and Regression; and Matching underlies Clustering.  These techniques were described in previous posts.

Data Science Fundamentals: Regression

Data Science is very complementary to Business Intelligence, in that they are both used to gain insights from data. While Business Intelligence, generally speaking, is more about answering known questions, Data Science is more about discovery and providing information for previously unknown questions.

This is a continuation of a series of Data Science Fundamentals posts that I will be doing over the next few weeks.  In this post, I will be covering Regression and will include an example to make it more meaningful.  Previous posts covered Classification and Clustering. Upcoming posts over the next few days will cover Matching, and other data science fundamental concepts.

Regression analysis is a predictive modeling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. It can be used to predict the value of a variable or the class the variable belongs to, and it identifies the strength of the relationships between the variables and the strength of their impact.  There are many variations of regression, with linear and logistic regression being the most common methods.  The various regression methods will be explored at a later point in time.

regression

As an example of how Regression can be used, you may identify products similar to a given product, that is, products that are in the same class or category as your subject product. You can then review the historical performance of those similar products under certain promotions, and use that to estimate/predict how well the subject product will perform under similar promotions.

As another example, you may use the classification of a customer or prospect to estimate/predict how much that customer/prospect is likely to spend on your products and services each year.

Classification determines the group/class of an entity, whereas Regression determines where on the spectrum (expressed as a numerical value) of that class the entity falls.  An example using a hotel customer – Classification: Elite Customer; Regression: 200 nights per year (on a scale of 100-366 nights per year)  or  top 10% of customers.
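The promotion example can be sketched with linear regression using R's built-in lm() function. The data below is invented purely for illustration (discount percentages and units sold for five past promotions); a real analysis would use far more observations and check the model's fit.

```r
# Made-up history: discount offered and units sold in past promotions.
promotions <- data.frame(
  discount_pct = c(5, 10, 15, 20, 25),
  units_sold   = c(110, 118, 133, 139, 152)
)

# Fit a straight line: units_sold = intercept + slope * discount_pct.
fit <- lm(units_sold ~ discount_pct, data = promotions)
print(coef(fit))

# Predict units sold for a promotion that has not been run yet.
predicted_units <- predict(fit, newdata = data.frame(discount_pct = 30))
print(predicted_units)
```

The slope estimates how many additional units are sold per extra percentage point of discount, which is the "strength of impact" described above.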

Data Science Fundamentals: Clustering

Like Business Intelligence, the essential purpose of Data Science is to gain knowledge and insights from data. This knowledge can then be used for a variety of purposes, such as driving more sales, retaining more employees, reducing marketing costs, and saving lives.

This is a continuation of a series of Data Science Fundamentals posts that I will be doing over the next few weeks.  In this post, I will be covering Clustering and will include an example to make it more meaningful.  A previous post covered Classification. Upcoming posts over the next few days will cover Regression, Matching, and other data science fundamental concepts.

Clustering is similar to Classification in that both are used to categorize and segment data.  But Clustering differs from Classification in that it segments the data into groups (clusters) that were not previously defined, or even known in some cases.  Clustering explores the data and finds natural groupings/clusters/classes without any targets (previously defined classes).  This is called “unsupervised” segmentation.  It clusters the data entities based on some similarity that makes them more like each other than like entities in other clusters.  Therefore, it is a great first step when little is known about the data set.

clustering_with_outlier
Clustering: 3 clusters formed (with an outlier)

The Clustering process may yield clusters/groups that can later be used for Classification. Using the defined classes as targets is called “supervised” segmentation.  In the diagram, there are 3 clusters that have been formed (red pluses, blue circles, green diamonds).

After a Clustering process is completed, there may be some data entities that are clustered by themselves.  In other words, they do not fall into any of the other clusters containing multiple entities.  These are classified as outliers.  An example of this can be seen in the diagram where there is an outlier in the top-left corner (purple square).  Analysis on these outliers can sometimes yield additional insight.

Software such as R and Python provides functions for performing cluster analysis/segmentation on datasets.  Future posts will cover these topics along with more details on Clustering.
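As a small taste of what this looks like in R, the sketch below uses hierarchical clustering (the base R functions dist(), hclust(), and cutree()), which is one of several clustering methods. The ten two-dimensional points are invented to mimic the situation described above: three tight groups plus one point far from the rest.

```r
# Made-up observations: three tight groups and one far-away point.
observations <- rbind(
  c(1.0, 1.0), c(1.2, 0.9), c(0.8, 1.1),
  c(5.0, 5.0), c(5.1, 4.8), c(4.9, 5.2),
  c(9.0, 1.0), c(9.2, 1.1), c(8.8, 0.9),
  c(1.0, 9.0)
)

# Build a cluster tree from pairwise distances, then cut it into 4 groups.
tree <- hclust(dist(observations))
clusters <- cutree(tree, k = 4)

# Count the members of each cluster.
cluster_sizes <- table(clusters)
print(cluster_sizes)

# A cluster with a single member is a candidate outlier.
singleton_ids <- as.integer(names(cluster_sizes)[cluster_sizes == 1])
outlier_rows <- which(clusters %in% singleton_ids)
print(outlier_rows)
```

No class labels were supplied at any point; the groups come only from the distances between the points, which is what makes the segmentation unsupervised.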

Data Science Fundamentals: Classification and Class Probability Estimation (Scoring)

Over the next 3 months, I will be focusing on Data Science and my next few posts will cover some fundamental topics of Data Science.

The essential purpose of Data Science, like Business Intelligence, is to gain knowledge and insights from data. This knowledge can then be used for a variety of purposes, such as driving more sales, retaining more employees, reducing marketing costs, and saving lives.

In this post, I will be covering Classification and will include examples to make it more meaningful.  Upcoming posts over the next few days will cover Clustering, Regression, Matching, and other data science fundamental concepts.

Classification is the process of using characteristics, features, and attributes of a data entity (such as a person, company, or thing) to determine what class (group or category) it belongs to and assigning it to that class.  As an example, demographic data is usually a classification – marital status (married, single, divorced), income bracket (wealthy, middle-class, poor), homeowner status (homeowner or renter), age bracket (old, middle-aged, young), etc.

classification
Shapes are classified by characteristics such as number of sides, length of sides, etc.

When a large amount of data needs to be analyzed, Classification needs to be an automated process.  If the classes are not known ahead of time, a process called Clustering can be used on existing data to discover groups that can in some way be used to form the classes. (Clustering will be covered in an upcoming post.)

Class Probability Estimation (Scoring) is the process of producing a score that represents the probability of the data entity being in a particular class.  As an example, Income Bracket – top 5%.

A few Use Cases and examples of Classification and Class Probability Estimation/Scoring are:

(1) Financial: credit risk – High-Risk, Medium-Risk, Low-Risk, Safe.
A person’s past credit history (or lack of one) will determine their credit score. And their credit score will determine what class of credit risk they fall into, and therefore, will determine if they get the loan, and how favorable the terms of the loan would be.

As an example of Class Probability Estimation (Scoring) for this use case, a person may fall in the Low-Risk class, but their credit score (sometimes called a FICO score) shows that they are at the low end of the Low-Risk class, bordering on Medium-Risk.
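Assigning a class from a score can be sketched with the base R function cut(), which sorts numeric values into labeled ranges. The score thresholds below are hypothetical and chosen only for illustration; they are not actual lending cutoffs.

```r
# Made-up credit scores for four applicants.
credit_scores <- c(550, 600, 675, 800)

# Hypothetical boundaries: each class covers [lower, upper).
risk_class <- cut(
  credit_scores,
  breaks = c(300, 580, 670, 740, 850),
  labels = c("High-Risk", "Medium-Risk", "Low-Risk", "Safe"),
  right = FALSE,
  include.lowest = TRUE
)

print(data.frame(credit_scores, risk_class))
```

With these boundaries, the applicant scoring 675 is classed as Low-Risk but sits only a few points above the Medium-Risk range, which is the situation described above.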

(2) Marketing: Marketing offer/promotion interest – Highly likely, Likely, Unlikely
Based on past promotions and those who responded to them, classification can be used to determine the likelihood of a person being interested in a specific marketing offer/promotion.  This is known as targeted marketing, where specific promotions are sent only to those who will likely be interested; as a result, different classes/groups may receive different marketing messages from the same company.

As an example of Class Probability Estimation (Scoring) for this use case, a customer or prospect could be scored as 70% Unlikely, or 90% Highly Likely.
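Scores like these are often produced with logistic regression, which R provides through glm() with family = binomial. The sketch below is one possible approach, not the only one; the response history is invented, and past_purchases is an assumed predictor chosen for the example.

```r
# Made-up history: number of past purchases and whether the customer
# responded to an earlier promotion (1 = yes, 0 = no).
history <- data.frame(
  past_purchases = c(0, 1, 1, 2, 2, 3, 3, 4, 5, 6),
  responded      = c(0, 0, 1, 0, 1, 0, 1, 1, 1, 1)
)

# Logistic regression models the probability of responding.
model <- glm(responded ~ past_purchases, data = history, family = binomial)

# Score three new prospects: the result is a probability for each one.
prospects <- data.frame(past_purchases = c(0, 3, 6))
scores <- predict(model, newdata = prospects, type = "response")
print(round(scores, 2))
```

Each score can then be mapped to a class such as Unlikely, Likely, or Highly likely by choosing probability thresholds.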

(3) Customer Base: Top-customer, Seasonal Customer, Loyal customer, High-Chance of Losing customer, …
A company may use some set of criteria to classify customers into various categories. These categories can be used for various customer-focused efforts, such as marketing, special offers, rewards, and more.

(4) Fraud detection & security:  Transaction or Activity occurrence – Highly Unusual, Unusual, Normal
Based on past activity and all other activities as a whole, a person’s activity/transaction can be classified as unusual or normal, and the appropriate actions taken to protect their accounts.

(5) Healthcare:
Data from past health analysis and treatments can be used to classify the level of a patient’s illness and to determine their treatment class. This will then drive the recommended treatment.

(6) Human behavior/Workforce:
Today’s workforce consists of multiple generations of workers (Baby Boomers, GenX, GenY/Millennials, etc.).  Generational classification of people based on the period in which they were born is used for marketing purposes, but it is also used to help educate a diverse workforce on understanding team members of different generations and how to work with them.

There are of course many more types of classification and use cases. Feel free to share your use cases.