Glossary – 100 “Data Science” Terms

Below is a glossary that includes 100 “Data Science” terms and phrases, along with their definitions and examples, in alphabetical order. Enjoy!

Term	Definition & Example
A/B Testing	Comparing two variants. Example: Website layout test.
Accuracy	Share of all predictions that are correct. Example: 90 correct out of 100 predictions gives 90% accuracy.
Actionable Insight	Insight leading to action. Example: Improve onboarding.
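Algorithm	Step-by-step procedure for solving a problem, such as training a model. Example: Decision tree learning.
Alternative Hypothesis	Claim that an effect or difference exists, contrary to the null hypothesis. Example: Group A performs better than B.
AUC	Area under the ROC curve; the probability that a classifier ranks a random positive above a random negative. Example: An AUC of 0.5 is no better than chance.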
Bayesian Inference	Updating probabilities with new evidence. Example: Prior and posterior beliefs.
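Bias-Variance Tradeoff	Balance between error from overly simple models (bias) and error from sensitivity to the training sample (variance). Example: Tuning model complexity.
Bootstrapping	Resampling with replacement to estimate the variability of a statistic. Example: Estimating confidence intervals.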
Business Problem	Decision-focused question. Example: Why churn increased.
Causation	One variable directly affects another. Example: Price drop causes sales increase.
Classification	Predicting categories. Example: Spam detection.
Clustering	Grouping similar observations. Example: Market segmentation.
Computer Vision	Interpreting images and video. Example: Image classification.
Confidence Interval	Range produced by a procedure that captures the true value a stated share of the time. Example: 95% CI for average revenue.
Confusion Matrix	Table of predicted versus actual classes. Example: Counts of true positives, false positives, true negatives, and false negatives.
Correlation	Strength and direction of the association between two variables. Example: Ad spend vs revenue.
Cross-Validation	Repeated training/testing splits. Example: k-fold CV.
Data Drift	Change in input data distribution. Example: New demographics.
Data Imputation	Replacing missing values. Example: Median imputation.
Data Leakage	Training a model with information that would not be available at prediction time. Example: Using post-event data.
Data Science	Interdisciplinary field combining statistics, programming, and domain knowledge to extract insights from data. Example: Predicting customer churn.
Data Storytelling	Communicating insights effectively. Example: Executive dashboards.
Dataset	A structured collection of data for analysis. Example: Customer transactions table.
Deep Learning	Multi-layer neural networks. Example: Speech recognition.
Descriptive Statistics	Summary statistics of data. Example: Mean, median.
Dimensionality Reduction	Reducing number of features. Example: PCA.
Effect Size	Magnitude of difference or relationship. Example: Lift in conversion rate.
Ensemble Learning	Combining multiple models. Example: Boosting techniques.
Ethics in Data Science	Responsible use of data and models. Example: Avoiding biased predictions.
Experimentation	Testing hypotheses with data. Example: A/B testing.
Explainable AI (XAI)	Techniques to explain predictions. Example: SHAP values.
Exploratory Data Analysis (EDA)	Initial data investigation using statistics and visuals. Example: Distribution plots.
F1 Score	Harmonic mean of precision and recall. Example: Evaluating classifiers on imbalanced datasets.
Feature	An input variable used in modeling. Example: Customer age.
Feature Engineering	Creating new features from raw data. Example: Tenure calculated from signup date.
Forecasting	Predicting future values. Example: Demand forecasting.
Generalization	Model performance on unseen data. Example: Stable test accuracy.
Hazard Function	Instantaneous event rate at a given time, conditional on surviving up to that time. Example: Churn risk over time.
Holdout Set	Data reserved for final evaluation. Example: Final test dataset.
Hyperparameter	Configuration value set before training rather than learned from data. Example: Learning rate.
Hypothesis	A testable assumption about data. Example: Discounts increase conversion rates.
Hypothesis Testing	Statistical method to evaluate assumptions. Example: t-test for average sales.
Insight	Meaningful analytical finding. Example: High churn among new users.
Label	Known output used in supervised learning. Example: Fraud or not fraud.
Likelihood	Probability of the observed data given parameter values, viewed as a function of the parameters. Example: Maximum likelihood estimation.
Loss Function	Measures prediction error. Example: Mean squared error.
Mean	Arithmetic average. Example: Average sales value.
Median	Middle value of ordered data. Example: Median income.
Missing Values	Absent data points. Example: Null customer age.
Mode	Most frequent value. Example: Most common category.
Model	Mathematical representation learned from data. Example: Logistic regression.
Model Drift	Performance degradation over time. Example: Changing customer behavior.
Model Interpretability	Understanding model decisions. Example: Feature importance.
Monte Carlo Simulation	Random sampling to model uncertainty. Example: Risk modeling.
Natural Language Processing (NLP)	Analyzing human language. Example: Sentiment analysis.
Neural Network	Model built from layers of interconnected units with learned weights, loosely inspired by the brain. Example: Image recognition.
Null Hypothesis	Default assumption of no effect. Example: No difference between two groups.
Optimization	Finding parameter values that minimize (or maximize) an objective, such as a loss function. Example: Gradient descent.
Outlier	Value significantly different from others. Example: Unusually large purchase.
Overfitting	Model fits noise in the training data and fails to generalize. Example: High training accuracy but poor test performance.
Pipeline	End-to-end data science workflow. Example: Ingest → train → deploy.
Population	Entire group of interest. Example: All customers.
Posterior Probability	Updated belief after observing data. Example: Updated churn likelihood.
Precision	Share of predicted positives that are actually positive. Example: Of the transactions flagged as fraud, the fraction that are truly fraud.
Principal Component Analysis (PCA)	Linear dimensionality reduction technique. Example: Visualizing high-dimensional data.
Prior Probability	Initial belief before observing data. Example: Baseline churn rate.
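p-value	Probability of observing results at least as extreme as those seen, assuming the null hypothesis is true. Example: p < 0.05 is a common threshold for significance.
Recall	Share of actual positives that are correctly identified. Example: Fraction of sick patients detected in medical diagnosis.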
Regression	Predicting numeric values. Example: Sales forecasting.
Reinforcement Learning	Learning via rewards and penalties. Example: Game-playing AI.
Reproducibility	Ability to recreate results. Example: Fixed random seeds.
ROC Curve	Plot of true positive rate against false positive rate across classification thresholds. Example: Threshold comparison.
Sampling	Selecting a subset of data. Example: Survey sample.
Sampling Bias	Non-representative sampling. Example: Surveying only active users.
Seasonality	Repeating time-based patterns. Example: Holiday sales.
Semi-Structured Data	Data with flexible structure. Example: JSON files.
Stacking	Ensemble method in which a meta-model learns to combine the predictions of base models. Example: Combining classifiers.
Standard Deviation	Square root of the variance; the typical distance of values from the mean. Example: Price volatility.
Stationarity	Stable statistical properties over time. Example: Mean and variance do not change over time.
Statistical Power	Probability of detecting a true effect. Example: Larger sample sizes increase power.
Statistical Significance	Evidence that a result would be unlikely to arise by chance alone if the null hypothesis were true. Example: Rejecting the null hypothesis.
Structured Data	Data with a fixed schema. Example: SQL tables.
Supervised Learning	Learning with labeled data. Example: Credit risk prediction.
Survival Analysis	Modeling time-to-event data. Example: Customer churn timing.
Target Variable	The outcome a model predicts. Example: Loan default indicator.
Test Data	Data withheld from training and tuning, used to evaluate final model performance. Example: Held-out test set.
Text Mining	Extracting insights from text. Example: Topic modeling.
Time Series	Data indexed by time. Example: Daily stock prices.
Tokenization	Splitting text into units. Example: Words or subwords.
Training Data	Data used to train a model. Example: Historical transactions.
Transfer Learning	Reusing pretrained models. Example: Image models for medical scans.
Trend	Long-term direction in data. Example: Growing user base.
Underfitting	Model too simple to capture patterns. Example: High bias.
Unstructured Data	Data without predefined structure. Example: Text, images.
Unsupervised Learning	Learning without labels. Example: Customer clustering.
Uplift Modeling	Estimating the incremental effect of a treatment on an outcome. Example: Marketing campaign effectiveness.
Validation Set	Data used for tuning models. Example: Hyperparameter selection.
Variance	Average squared deviation from the mean. Example: Sales variability.
Word Embeddings	Numerical text representations. Example: Word2Vec.
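
Several of the evaluation terms above (Accuracy, Precision, Recall, F1 Score) are derived from the same four confusion matrix counts. The sketch below is one way this could look in Python; the helper name classification_metrics and the counts passed to it are hypothetical and chosen only for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Hypothetical helper: derive common metrics from confusion-matrix counts.

    tp/fp/fn/tn are the counts of true positives, false positives,
    false negatives, and true negatives.
    """
    total = tp + fp + fn + tn
    # Accuracy: share of all predictions that are correct.
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted positives that are actually positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were identified.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts, not taken from any real dataset.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, precision is higher than recall: most flagged cases are correct, but a third of the actual positives are missed. Accuracy alone would not show that gap.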