
Python Libraries for Data Science

Python has grown quickly to become one of the most widely used programming languages. While it’s a powerful, general-purpose language used for creating just about any type of application, it has become a go-to language for data science, rivaling even “R”, the longtime favorite language and platform in that field.

Python’s popularity for data-based solutions has grown because of the many powerful, open-source, data-centric libraries available for it. Some of these libraries include:

NumPy

NumPy is a library for creating and manipulating multi-dimensional data arrays and for performing fast, complex mathematical operations on them.
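For example, a minimal sketch of working with a NumPy array might look like this:

import numpy as np

# Build a 2-D array and apply vectorized operations to every element.
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)   # (2, 3)
print(a.mean())  # 3.5
print(a * 2)     # element-wise multiplication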

Pandas

Pandas is a library that provides easy-to-use but high-performance data structures, such as the DataFrame, and data analysis tools.
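As a small sketch, building and summarizing a DataFrame can be as simple as:

import pandas as pd

# Build a DataFrame from a dictionary of columns and summarize it.
df = pd.DataFrame({"age": [25, 32, 47], "score": [88.0, 92.5, 79.0]})
print(df.describe())     # count, mean, std, min, quartiles, max per column
print(df["age"].mean())  # mean of a single column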

Matplotlib

Matplotlib is a library used for data visualization such as creating histograms, bar charts, scatter plots, and much more.
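For example, a minimal sketch of a scatter plot:

import matplotlib.pyplot as plt

# Plot a handful of points as a scatter chart.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Example scatter plot")
plt.show()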

SciPy

SciPy is a library that provides integration, statistics, and linear algebra packages for numerical computations.
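For example, a small sketch using SciPy’s integration and statistics packages:

from scipy import integrate, stats

# Numerically integrate x^2 from 0 to 1 (the exact answer is 1/3).
result, error = integrate.quad(lambda x: x ** 2, 0, 1)
print(result)

# Run a basic two-sample t-test from scipy.stats.
t_stat, p_value = stats.ttest_ind([2, 4, 6, 8], [3, 5, 7, 9])
print(t_stat, p_value)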

Scikit-learn

Scikit-learn is a library used for machine learning. It is built on top of some other libraries including NumPy, Matplotlib, and SciPy.
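For example, a minimal sketch of training and scoring a classifier on one of scikit-learn’s built-in datasets:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the iris dataset, fit a simple classifier, and report its accuracy.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))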

There are many other data-centric Python libraries and some will be introduced in future articles. More can be learned here: https://www.python.org/

Statistics Basics – Descriptive vs Inferential Statistics

Descriptive Statistics
Statistics that quantitatively describe an observed data set. Descriptive analysis is performed on, and conclusions are drawn from, the observed data only; it does not take into account any larger population of data.

Inferential Statistics
Statistics that make inferences about a larger population of data based on the observed data set. Inferential analysis treats the observed data as a sample taken from a larger population, and infers or predicts characteristics of that population.

Statistics Basics – Measures of Central Tendency & Measures of Variability

Measures of Central Tendency and Measures of Variability are frequently used in data analysis. This post provides simple definitions of the common measures.

Measures of Central Tendency

Mean / Average – sum of all data points or observations in a dataset divided by the total number of data points or observations in the dataset.

The mean or average of this dataset with 5 numbers {2, 4, 6, 8, 10} is: 6

Sum of all data points: 2 + 4 + 6 + 8 + 10 = 30
Number of data points: 5
Mean: 30 / 5 = 6
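The same calculation in Python, by hand and with the standard library’s statistics module:

import statistics

data = [2, 4, 6, 8, 10]
print(sum(data) / len(data))  # mean computed by hand: 6.0
print(statistics.mean(data))  # mean from the standard library: 6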

Median – with the values (data points) in the dataset listed in increasing (ascending) order, the median is the midpoint of the values, such that there are an equal number of data points above and below the median. If there is an odd number of data points in the dataset, then the median is the single midpoint value. If there is an even number of data points in the dataset, then the median is the mean/average of the two midpoint values.

The median of the same dataset {2, 4, 6, 8, 10} is:  6
This dataset has an odd number of data points (5), and the middle data point is the value 6, with 2 numbers below (2, 4) and 2 numbers above (8, 10).

Using an example of a dataset with an even number of data points:
The median of this dataset {2, 4, 6, 8, 10, 12} is: (6 + 8) / 2 = 7
Since there are 2 middle data points (6 and 8), we calculate the mean of those 2 numbers to determine the median.
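Python’s statistics module handles both the odd and even cases:

import statistics

print(statistics.median([2, 4, 6, 8, 10]))      # odd count: the single middle value, 6
print(statistics.median([2, 4, 6, 8, 10, 12]))  # even count: average of the two middle values, 7.0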

Mode – the data point that appears the most times in the dataset.

Using our original dataset {2, 4, 6, 8, 10}, each value appears only once, so no value appears more often than the others and this dataset does not have a mode.

Using a new dataset {2, 2, 4, 4, 4, 4, 6, 8, 8, 8, 10}, the Mode in this case is: 4
4 is the value that appears the most times in the dataset.
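The standard library can find the mode as well (note that in recent Python versions, statistics.mode returns the first value encountered when no value repeats, rather than reporting “no mode”):

import statistics

print(statistics.mode([2, 2, 4, 4, 4, 4, 6, 8, 8, 8, 10]))  # 4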

Measures of Variability

Min – the minimum of all values in the dataset.
Min {2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9} is 1.

Max – the maximum of all values in the dataset.
Max {2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9} is 9.
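Python’s built-in min() and max() functions return these directly:

data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9]
print(min(data))  # 1
print(max(data))  # 9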

Variance – a calculated value that quantifies how close or how dispersed the values in the dataset are to/from their average/mean value.  It is the average of the squared differences from the mean.

Variance of {2, 3, 4, 5, 6} is calculated as follows …

First find the Mean.  Mean = (2 + 3 + 4 + 5 + 6) / 5 = 4

Then, find the Squared Differences from the Mean … where ^2 means squared …
(2 - 4)^2 = (-2)^2 = 4
(3 - 4)^2 = (-1)^2 = 1
(4 - 4)^2 = (0)^2 = 0
(5 - 4)^2 = (1)^2 = 1
(6 - 4)^2 = (2)^2 = 4
Average of Squared Differences: (4 + 1 + 0 + 1 + 4) / 5 = 2
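The same steps in Python, computed by hand and with the standard library’s pvariance (population variance, which divides by the number of data points as in the calculation above; statistics.variance would instead divide by n - 1 for a sample):

import statistics

data = [2, 3, 4, 5, 6]
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
print(variance)                    # 2.0, computed by hand
print(statistics.pvariance(data))  # 2, from the standard library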

Standard Deviation – a calculated value that quantifies how close or how dispersed the values in the dataset are to/from their mean, expressed in the same units as the data. It is the square root of the Variance (defined above).

For the above dataset, Standard Deviation {2, 3, 4, 5, 6} = Square Root (2) =~ 1.414
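In Python this is just the square root of the population variance, or statistics.pstdev directly:

import math
import statistics

data = [2, 3, 4, 5, 6]
print(math.sqrt(statistics.pvariance(data)))  # 1.4142...
print(statistics.pstdev(data))                # same result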

Kurtosis – a calculated value that represents how heavy or light the tails of the dataset’s distribution are compared to the tails of a normal distribution*.

Skewness – a calculated value that represents how asymmetric the dataset’s distribution is; a normal distribution* is perfectly symmetric and has a skewness of 0.

* A normal distribution, also known as the bell curve, is a probability distribution in which most values are toward the center (closer to the average) and fewer and fewer observations occur as you move further from the center.
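Both Kurtosis and Skewness are available in SciPy. As a small sketch (scipy.stats.kurtosis uses Fisher’s definition by default, so a normal distribution scores 0):

from scipy.stats import kurtosis, skew

data = [2, 3, 4, 5, 6]
print(skew(data))      # 0.0 for this perfectly symmetric dataset
print(kurtosis(data))  # negative here, meaning lighter tails than a normal distribution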

Range – the difference between the largest number in the dataset and the smallest number in the dataset.
Range {2, 4, 6, 8, 10} = 10 - 2 = 8
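In Python, this uses the same built-ins as Min and Max:

data = [2, 4, 6, 8, 10]
print(max(data) - min(data))  # 8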


Thanks for reading!