In this post we will explore the RStudio interface. This is where you will manage your R environment, issue commands for processing and analyzing data, create scripts, view results, and much more. Below is an image of the default RStudio interface.
On the left:
– Console – the window where you enter commands, and where output is displayed.
On the top-right:
– Environment tab – shows the variables and values created through the console
– History tab – shows the history of past executed commands
On the bottom-right:
– Files tab – displays folders and files from the file system, from which you can select files, set working directory, create folders, copy and move folders and files, and more.
– Plots tab – displays the plots that have been created and allows you to export them.
– Packages tab – displays all the packages currently installed and available. A checked checkbox means the package is loaded, and a package must be loaded before it can be used.
– Help tab – useful for getting help about R and R packages. Keyword search is available, which is very helpful when you don’t know exactly what you are looking for.
– Viewer tab – can be used to view local web content, such as static HTML files written to the session temporary directory or a locally run web application.
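Several of the Files tab actions also have console equivalents. The sketch below is only an illustration: it assumes a scratch folder named `rstudio-demo` (a hypothetical name) created inside the session temporary directory, so nothing in your own folders is touched.

```r
# Remember the current working directory so we can return to it.
original_dir <- getwd()
print(original_dir)

# Create a scratch folder inside the session temporary directory.
# "rstudio-demo" is a hypothetical name used only for this example.
demo_dir <- file.path(tempdir(), "rstudio-demo")
dir.create(demo_dir, showWarnings = FALSE)

# Make it the working directory, write a small file, and list the folder contents.
setwd(demo_dir)
writeLines("hello from R", "notes.txt")
files_found <- list.files()
print(files_found)

# Switch back to where we started.
setwd(original_dir)
```

After running this, navigating to the scratch folder in the Files tab should show the same `notes.txt` file.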
On the top-left (when a script is created or opened):
– Script pane and tabs – When you create or open an R script, a new pane opens in the top-left of the application window, and the Console pane shifts down to the bottom-left area. A new Script tab opens in this pane for each script you open or create. From this pane, you can run your script line by line or in its entirety, among many other functions.
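To see how the panes work together, here is a tiny script you could paste into a new Script tab and run line by line. The variable names and values are made up for illustration.

```r
# Hypothetical values: nights stayed by three hotel customers.
nights <- c(120, 200, 310)        # a numeric vector created in the session
average_nights <- mean(nights)    # a second object derived from the first
print(average_nights)             # output is displayed in the Console

# ls() lists the objects created so far, which is what the Environment tab displays.
objects_created <- ls()
print(objects_created)
```

As each line runs, the new objects should appear in the Environment tab and the executed statements in the History tab.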
R has thousands of packages available for statistics and data analytics, but before you can use them, they need to be installed. In this post I cover installing, loading, unloading, and removing R packages in RStudio. In these examples, I use the ggplot2 package – a popular graphics and visualization package in R. Wherever you see ggplot2 in the examples below, you can replace it with the package you want to perform these actions on.
To install a package via the User Interface
In RStudio, select Tools -> Install Packages from the main menu, or click Install in the Packages tab on the bottom-right.
The Install Packages dialog appears.
Start typing the name of the package you want to install, and a list of all packages that start with the letters you have typed will show up in the selection list.
Select (or type the full name of) the package you want to install, ensure that “Install dependencies” is checked, and click “Install”.
The statement will be automatically entered and run as shown below.
And the output will show if the package is successfully installed.
At this point, you will be able to see the package in the list in the Packages tab on the right.
To install a package via script
Instead of using the user interface (menu), you can also install packages directly via script.
install.packages("ggplot2")
Run the statement above in the Console or from a script. As before, the package then shows in the Packages tab.
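If you rerun a script often, you may not want to reinstall the package every time. The sketch below is one possible approach, not the only one: the helper names `is_installed` and `install_if_missing` are my own (hypothetical), built on the standard `requireNamespace()` and `install.packages()` functions.

```r
# TRUE when the package's namespace can be loaded from the library paths.
# Note: requireNamespace() loads the namespace but does not attach the package.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Install the package only when it is missing, then report whether it is available.
install_if_missing <- function(pkg) {
  if (!is_installed(pkg)) {
    install.packages(pkg)
  }
  invisible(is_installed(pkg))
}

# Example call (commented out so that nothing is downloaded unless you opt in):
# install_if_missing("ggplot2")
```

Uncomment the last line and replace ggplot2 with the package you need.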
After a package has been installed, it needs to be loaded before you can use it.
Loading and Unloading a package (via user interface or script):
To load a package you can simply check the checkbox beside the package name in the Packages tab – as shown by the yellow box highlight below. This will automatically enter and execute the command shown with the yellow arrow.
Or you can enter the script, by entering the command as shown with the green arrow:
library("ggplot2")
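As an alternative, the require(ggplot2) command will also load the package. Unlike library(), it returns FALSE with a warning, instead of stopping with an error, when the package is not installed.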
To unload the package, you can simply uncheck the checkbox beside the package name in the Packages tab, or enter the command shown by the red arrow:
detach("package:ggplot2", unload=TRUE)
You will notice that after running the detach command, the package checkbox will be unchecked.
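You can also confirm from the console whether a package is currently loaded, without looking at the checkbox. This is a minimal sketch: `is_attached` is a hypothetical helper name, and I use the `splines` package (which ships with R) as a stand-in for ggplot2 so the example runs even if ggplot2 is not installed.

```r
# TRUE when the package is on the search path, i.e. attached with library().
is_attached <- function(pkg) {
  paste0("package:", pkg) %in% search()
}

# "splines" ships with R and stands in for ggplot2 here.
library("splines")
attached_after_library <- is_attached("splines")
print(attached_after_library)

# Detach it again and check once more.
detach("package:splines", unload = TRUE)
attached_after_detach <- is_attached("splines")
print(attached_after_detach)
```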
To remove (uninstall) a package (via user interface or script):
To remove a package, you can simply click the “x” icon shown to the right of the package in the Packages window. See the yellow box highlight beside ggplot2 below.
Or you can run the script command as shown below in the Console window.
remove.packages("ggplot2")
The output below is shown after removing a package. You will notice that the removed package, ggplot2 in this example, is no longer in the Packages list on the right-hand side.
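The same check can be done from the console. The sketch below uses the standard `installed.packages()` function; the `package_table` name is just my choice for the example.

```r
# installed.packages() returns a matrix with one row per installed package.
installed <- installed.packages()

# Package names and versions, similar to what the Packages tab lists.
package_table <- data.frame(
  package = rownames(installed),
  version = installed[, "Version"],
  row.names = NULL
)
print(head(package_table))

# Prints TRUE while ggplot2 is installed and FALSE once it has been removed.
print("ggplot2" %in% package_table$package)
```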
An advantage of using the script option instead of the user interface methods to perform the above actions is that you will have a history of what you have done.
RStudio is an open-source integrated development environment (IDE) for R. Commercial versions with more extensive capabilities are also available (at a cost). It runs on the desktop on multiple operating systems, or in a browser connected to an RStudio server. RStudio includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. This post covers installing RStudio. Note that R needs to be installed first – see this post for installing R.
Choose the desired version of RStudio. You will likely want the “RStudio Desktop (Open Source License)” version. On the same page, you will be able to read about the various options available – free and paid versions.
This takes you further down the page to the installers. Choose the installer that is appropriate for you. In this example, we are installing on Windows, and so we chose the “RStudio 1.0.153 – Windows Vista/7/8/10” version. Note: RStudio requires that R is installed. If you have not already installed R, do so first (see this post for Installing R).
After the download is complete, run the exe by double-clicking on it.
Click Next at the Welcome screen.
Choose the install directory, click Next
Choose a Start Menu Folder, click Install
Installing …
Complete the installation.
Run RStudio
RStudio IDE
You will notice that the left window “Console” is the same as the “R Console” window in the stand-alone R installation. This is because RStudio is built on top of R.
R is an open source software platform for data manipulation, statistical computing, calculation, analytics, and graphics. It provides a wide variety of statistical/mathematical and high-quality graphical capabilities. Some of the statistical capabilities include linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more.
You will find R useful in Analytics and Business Intelligence environments where data needs to be analyzed to uncover patterns, improve understanding, and help make predictions and decisions.
In this post, we cover the installation of R.
Click the “download R” link (underlined in yellow below).
Choose the CRAN (Comprehensive R Archive Network) mirror location closest to you.
Choose the version for your install computer’s operating system (OS). In this example, we are installing on Windows – so we chose “Download R for Windows”.
Assuming this is your first install, click “base” or “Install R for the first time”.
Then, click “Download R 3.4.1 for Windows” (or whatever the appropriate version is at the time)
After the download is complete, go to the download directory, and double-click the R exe to run it.
Choose your language
Click Next
Review the license agreement, click Next
Accept the default directory or enter/select a new one.
Select the components you want. If your PC is 32-bit, then unselect 64-bit if it is shown as an option. If your PC is 64-bit, you can install both 32-bit and 64-bit (default) or choose one of them.
Choose No and click Next (unless you want to customize the startup options for R, but this can be done later)
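Once the installer has finished and you can open the R Console, a quick way to confirm which version was installed is to ask R itself. This short sketch uses only built-in objects and functions.

```r
# The R version string for the running installation.
version_text <- R.version.string
print(version_text)

# Major and minor version numbers as separate fields.
print(R.version$major)
print(R.version$minor)

# Folder where R is installed.
print(R.home())
```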
This is a continuation of a series of Data Science Fundamentals posts. In this post I will briefly describe Matching.
Matching, also known as Similarity Matching, is a technique of using data about objects to identify “like” objects. For example, Amazon or Walmart may use matching to identify “like” customers based on their browsing, liking, and purchasing history.
This information can then be used to provide product recommendations to these customers.
Product recommendations based on browsing and purchase history, and similarity matching
The results of Matching can be used for Classification and Regression, and Matching also underlies Clustering. These techniques were described in previous posts.
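To make similarity matching a little more concrete, here is a minimal sketch in R. Everything in it is an assumption for illustration: the customers and purchase counts are made up, and cosine similarity is just one of several similarity measures that could be used.

```r
# Hypothetical purchase counts: rows are customers, columns are product categories.
purchases <- rbind(
  alice = c(books = 5, garden = 0, toys = 1),
  bob   = c(books = 4, garden = 1, toys = 0),
  carol = c(books = 0, garden = 6, toys = 2)
)

# Cosine similarity between two numeric vectors (closer to 1 means more alike).
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Find the customer most similar to a given one.
most_similar <- function(data, customer) {
  others <- setdiff(rownames(data), customer)
  scores <- sapply(others, function(other) {
    cosine_similarity(data[customer, ], data[other, ])
  })
  names(which.max(scores))
}

best_match <- most_similar(purchases, "alice")
print(best_match)
```

In this made-up data, alice and bob both buy mostly books, so bob comes out as the closest match and his other purchases could feed recommendations for alice.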
Data Science is very complementary to Business Intelligence, in that they are both used to gain insights from data. While Business Intelligence, generally speaking, is more about answering known questions, Data Science is more about discovery and providing information for previously unknown questions.
This is a continuation of a series of Data Science Fundamentals posts that I will be doing over the next few weeks. In this post, I will be covering Regression and will include an example to make it more meaningful. Previous posts covered Classification and Clustering. Upcoming posts over the next few days will cover Matching, and other data science fundamental concepts.
Regression analysis is a predictive modeling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. It can be used to predict the value of a variable or the class the variable belongs to, and it identifies the strength of the relationships between the variables and the strength of their impact. There are many variations of regression, with linear and logistic regression being the most common methods used. The various regression methods will be explored at a later point in time.
As an example of how Regression can be used, you may identify products similar to a given product, that is, products in the same class or category as your subject product. You can then review the historical performance of those similar products under certain promotions, and use that to estimate how well the subject product will perform under similar promotions.
As another example, you may use the classification of a customer or prospect to estimate how much that customer or prospect is likely to spend on your products and services each year.
Classification determines the group/class of an entity, whereas Regression determines where on the spectrum (expressed as a numerical value) of that class the entity falls. An example using a hotel customer – Classification: Elite Customer; Regression: 200 nights per year (on a scale of 100-366 nights per year) or top 10% of customers.
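The promotion example above can be sketched with a simple linear regression in R. The data frame below is entirely hypothetical (made-up discounts and sales for "similar products"), and `lm()` is used here only as one straightforward way to fit such a model.

```r
# Hypothetical history for similar products: promotion discount (%) and units sold.
promotions <- data.frame(
  discount_pct = c(0, 5, 10, 15, 20),
  units_sold   = c(100, 118, 143, 160, 179)
)

# Fit a simple linear regression: units sold as a function of discount.
model <- lm(units_sold ~ discount_pct, data = promotions)
print(coef(model))

# Estimate units sold for the subject product at a 12% discount.
predicted_units <- predict(model, newdata = data.frame(discount_pct = 12))
print(predicted_units)
```

The fitted slope is the estimated number of extra units sold per additional percentage point of discount.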
Like Business Intelligence, the essential purpose of Data Science is to gain knowledge and insights from data. This knowledge can then be used for a variety of purposes, such as driving more sales, retaining more employees, reducing marketing costs, and saving lives.
This is a continuation of a series of Data Science Fundamentals posts that I will be doing over the next few weeks. In this post, I will be covering Clustering and will include an example to make it more meaningful. A previous post covered Classification. Upcoming posts over the next few days will cover Regression, Matching, and other data science fundamental concepts.
Clustering is similar to Classification in that they are both used to categorize and segment data. But Clustering differs from Classification in that it segments the data into groups (clusters) that were not previously defined, or even known in some cases. Clustering explores the data and finds natural groupings/clusters/classes without any targets (previously defined classes). This is called “unsupervised” segmentation. It clusters the data entities based on some similarity that makes them more like each other than entities in other clusters. Therefore, this is a great first step if information about the data set is unknown.
Clustering: 3 clusters formed (with an outlier)
The Clustering process may yield clusters/groups that can later be used for Classification. Using the defined classes as targets is called “supervised” segmentation. In the diagram to the right, there are 3 clusters that have been formed (red pluses, blue circles, green diamonds).
After a Clustering process is completed, there may be some data entities that are clustered by themselves. In other words, they do not fall into any of the other clusters containing multiple entities. These are classified as outliers. An example of this can be seen in the diagram where there is an outlier in the top-left corner (purple square). Analysis on these outliers can sometimes yield additional insight.
Software such as R and Python provides functions for performing cluster analysis/segmentation on datasets. Future posts will cover these topics along with more details on Clustering.
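As a small preview, here is a minimal sketch of cluster analysis in R. The data points are made up, and hierarchical clustering (`hclust()` with `cutree()`) is just one of the clustering functions available. I chose it here because it gives the same result on every run.

```r
# Hypothetical 2-D data: three tight groups of three points each.
observations <- rbind(
  c(1.0, 1.1), c(1.2, 0.9), c(0.9, 1.0),   # group near (1, 1)
  c(5.0, 5.2), c(5.1, 4.9), c(4.8, 5.0),   # group near (5, 5)
  c(9.0, 1.0), c(9.2, 1.1), c(8.9, 0.8)    # group near (9, 1)
)

# Hierarchical clustering on pairwise distances; no class labels are supplied.
tree <- hclust(dist(observations))

# Cut the tree into three clusters and show the cluster assigned to each point.
clusters <- cutree(tree, k = 3)
print(clusters)
```

Note that no targets are given to the algorithm. The three groups are found from the distances between the points alone, which is the "unsupervised" part.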
Over the next 3 months, I will be focusing on Data Science and my next few posts will cover some fundamental topics of Data Science.
The essential purpose of Data Science, like Business Intelligence, is to gain knowledge and insights from data. This knowledge can then be used for a variety of purposes, such as driving more sales, retaining more employees, reducing marketing costs, and saving lives.
In this post, I will be covering Classification and will include examples to make it more meaningful. Upcoming posts over the next few days will cover Clustering, Regression, Matching, and other data science fundamental concepts.
Classification is the process of using characteristics, features, and attributes of a data entity (such as a person, company, or thing) to determine what class (group or category) it belongs to and assigning it to that class. As an example, demographic data is usually a classification – marital status (married, single, divorced), income bracket (wealthy, middle-class, poor), homeowner status (homeowner or renter), age bracket (old, middle-aged, young), etc.
Shapes are classified by characteristics such as number of sides, length of sides, etc.
When a large amount of data needs to be analyzed, Classification needs to be an automated process. If the classes are not known ahead of time, a process called Clustering can be used on existing data to discover groups that can in some way be used to form the classes. (Clustering will be covered in an upcoming post.)
Class Probability Estimation (Scoring) is the process of producing a score that represents the probability of the data entity being in a particular class. As an example, Income Bracket – top 5%.
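One common way to produce such a score is logistic regression, sketched below in R. This is only an illustration under my own assumptions: the data is made up (years as a customer and whether the customer responded to an offer), and other scoring methods exist.

```r
# Hypothetical training data: years as a customer and whether they responded to an offer.
history <- data.frame(
  years     = c(1, 2, 3, 4, 5, 6, 7, 8),
  responded = c(0, 0, 1, 0, 1, 0, 1, 1)
)

# Logistic regression estimates the probability of the "responded" class.
model <- glm(responded ~ years, data = history, family = binomial)

# Score two new customers: each score is a probability between 0 and 1.
scores <- predict(model, newdata = data.frame(years = c(2, 7)), type = "response")
print(scores)
```

Each score can be read as the estimated probability that the customer belongs to the "responded" class.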
A few Use Cases and examples of Classification and Class Probability Estimation/Scoring are:
(1) Financial: credit risk – High-Risk, Medium-Risk, Low-Risk, Safe.
A person’s past credit history (or lack of one) will determine their credit score. And their credit score will determine what class of credit risk they fall into, and therefore, will determine if they get the loan, and how favorable the terms of the loan would be.
As an example of Class Probability Estimation (Scoring) for this use case, a person may fall in the Low-Risk class, but their credit score (sometimes called a FICO score) may place them at the low end of that class, bordering on Medium-Risk.
(2) Marketing: Marketing offer/promotion interest – Highly likely, Likely, Unlikely
Based on past promotions and those who responded to them, classification can be used to determine the likelihood of a person being interested in a specific marketing offer/promotion. This is known as targeted marketing, where specific promotions are sent only to those who will likely be interested. As a result, different classes/groups may receive different marketing messages from the same company.
As an example of Class Probability Estimation (Scoring) for this use case, a customer or prospect could be scored as 70% Unlikely, or 90% Highly Likely.
(3) Customer Base: Top-customer, Seasonal Customer, Loyal customer, High-Chance of Losing customer, …
A company may use some set of criteria to classify customers into various categories. These categories can be used for various customer-focused efforts, such as marketing, special offers, rewards, and more.
(4) Fraud detection & security: Transaction or Activity occurrence – Highly Unusual, Unusual, Normal
Based on past activity and all other activities as a whole, a person’s activity/transaction can be classified as unusual or normal, and the appropriate actions taken to protect their accounts.
(5) Healthcare:
Data from past health analysis and treatments can be used to classify the severity of a patient’s illness and to assign the patient to a treatment class. This will then drive the recommended treatment.
(6) Human behavior/Workforce:
Today’s workforce consists of multiple generations (Baby Boomers, GenX, GenY/Millennials, etc.) of workers. Generational classification of people based on the period in which they were born is used for marketing purposes, but it is also used to help educate a diverse workforce on understanding team members of different generations and how to work with them.
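To illustrate how a rule-based classification such as the credit risk use case (1) above might look in code, here is a minimal R sketch. The credit scores and the cut-off values are hypothetical, chosen only to show the mechanics; they are not real lending thresholds.

```r
# Hypothetical credit scores for four applicants.
credit_scores <- c(540, 640, 700, 790)

# Hypothetical cut-offs: each score falls into exactly one risk class.
risk_class <- cut(
  credit_scores,
  breaks = c(-Inf, 580, 670, 740, Inf),
  labels = c("High-Risk", "Medium-Risk", "Low-Risk", "Safe")
)
print(data.frame(credit_scores, risk_class))
```

With these made-up cut-offs, the four applicants land in four different classes, from High-Risk through Safe.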
There are of course many more types of classification and use cases. Feel free to share your use cases.
Information and resources for the data professionals' community