Below is a glossary that includes 100 “Data Science” terms and phrases, along with their definitions and examples, in alphabetical order. Enjoy!
| Term | Definition & Example |
| --- | --- |
| A/B Testing | Comparing two variants. Example: Website layout test. |
| Accuracy | Overall correct predictions rate. Example: 90% accuracy. |
| Actionable Insight | Insight leading to action. Example: Improve onboarding. |
| Algorithm | Step-by-step procedure for solving a problem or learning from data. Example: Decision trees. |
| Alternative Hypothesis | Assumption opposing the null hypothesis. Example: Group A performs better than B. |
| AUC | Area under ROC curve. Example: Model ranking metric. |
| Bayesian Inference | Updating probabilities with new evidence. Example: Prior and posterior beliefs. |
| Bias-Variance Tradeoff | Balance between error from overly simple models (bias) and error from overly flexible models (variance). Example: Model tuning. |
| Bootstrapping | Resampling technique for estimation. Example: Estimating confidence intervals. |
| Business Problem | Decision-focused question. Example: Why churn increased. |
| Causation | One variable directly affects another. Example: Price drop causes sales increase. |
| Classification | Predicting categories. Example: Spam detection. |
| Clustering | Grouping similar observations. Example: Market segmentation. |
| Computer Vision | Interpreting images and video. Example: Image classification. |
| Confidence Interval | Range likely containing the true value. Example: 95% CI for average revenue. |
| Confusion Matrix | Table evaluating classification results. Example: True positives vs false positives. |
| Correlation | Strength and direction of the association between two variables. Example: Ad spend vs revenue. |
| Cross-Validation | Repeated training/testing splits. Example: k-fold CV. |
| Data Drift | Change in input data distribution. Example: New demographics. |
| Data Imputation | Replacing missing values. Example: Median imputation. |
| Data Leakage | Training a model with information that would not be available at prediction time. Example: Using post-event data. |
| Data Science | Interdisciplinary field combining statistics, programming, and domain knowledge to extract insights from data. Example: Predicting customer churn. |
| Data Storytelling | Communicating insights effectively. Example: Executive dashboards. |
| Dataset | A structured collection of data for analysis. Example: Customer transactions table. |
| Deep Learning | Multi-layer neural networks. Example: Speech recognition. |
| Descriptive Statistics | Summary statistics of data. Example: Mean, median. |
| Dimensionality Reduction | Reducing number of features. Example: PCA. |
| Effect Size | Magnitude of difference or relationship. Example: Lift in conversion rate. |
| Ensemble Learning | Combining multiple models. Example: Boosting techniques. |
| Ethics in Data Science | Responsible use of data and models. Example: Avoiding biased predictions. |
| Experimentation | Testing hypotheses with data. Example: A/B testing. |
| Explainable AI (XAI) | Techniques to explain predictions. Example: SHAP values. |
| Exploratory Data Analysis (EDA) | Initial data investigation using statistics and visuals. Example: Distribution plots. |
| F1 Score | Harmonic mean of precision and recall. Example: Evaluating models on imbalanced datasets. |
| Feature | An input variable used in modeling. Example: Customer age. |
| Feature Engineering | Creating new features from raw data. Example: Tenure calculated from signup date. |
| Forecasting | Predicting future values. Example: Demand forecasting. |
| Generalization | Model performance on unseen data. Example: Stable test accuracy. |
| Hazard Function | Instantaneous event rate. Example: Churn risk over time. |
| Holdout Set | Data reserved for final evaluation. Example: Final test dataset. |
| Hyperparameter | Pre-set model configuration. Example: Learning rate. |
| Hypothesis | A testable assumption about data. Example: Discounts increase conversion rates. |
| Hypothesis Testing | Statistical method to evaluate assumptions. Example: t-test for average sales. |
| Insight | Meaningful analytical finding. Example: High churn among new users. |
| Label | Known output used in supervised learning. Example: Fraud or not fraud. |
| Likelihood | Probability of the observed data given parameters. Example: Used in maximum likelihood estimation and Bayesian models. |
| Loss Function | Measures prediction error. Example: Mean squared error. |
| Mean | Arithmetic average. Example: Average sales value. |
| Median | Middle value of ordered data. Example: Median income. |
| Missing Values | Absent data points. Example: Null customer age. |
| Mode | Most frequent value. Example: Most common category. |
| Model | Mathematical representation learned from data. Example: Logistic regression. |
| Model Drift | Performance degradation over time. Example: Changing customer behavior. |
| Model Interpretability | Understanding model decisions. Example: Feature importance. |
| Monte Carlo Simulation | Random sampling to model uncertainty. Example: Risk modeling. |
| Natural Language Processing (NLP) | Analyzing human language. Example: Sentiment analysis. |
| Neural Network | Model built from layers of interconnected nodes, loosely inspired by the human brain. Example: Image recognition. |
| Null Hypothesis | Default assumption of no effect. Example: No difference between two groups. |
| Optimization | Process of minimizing loss. Example: Gradient descent. |
| Outlier | Value significantly different from others. Example: Unusually large purchase. |
| Overfitting | Model memorizes training data. Example: Poor test performance. |
| Pipeline | End-to-end data science workflow. Example: Ingest → train → deploy. |
| Population | Entire group of interest. Example: All customers. |
| Posterior Probability | Updated belief after observing data. Example: Updated churn likelihood. |
| Precision | Share of predicted positives that are actually positive. Example: Fraud detection precision. |
| Principal Component Analysis (PCA) | Linear dimensionality reduction technique. Example: Visualizing high-dimensional data. |
| Prior Probability | Initial belief before observing data. Example: Baseline churn rate. |
| p-value | Probability of observing results at least as extreme as those seen, assuming the null hypothesis is true. Example: p < 0.05 is commonly treated as significant. |
| Recall | Share of actual positives that the model correctly identifies. Example: Medical diagnosis. |
| Regression | Predicting numeric values. Example: Sales forecasting. |
| Reinforcement Learning | Learning via rewards and penalties. Example: Game-playing AI. |
| Reproducibility | Ability to recreate results. Example: Fixed random seeds. |
| ROC Curve | Plot of true positive rate against false positive rate across classification thresholds. Example: Threshold comparison. |
| Sampling | Selecting subset of data. Example: Survey sample. |
| Sampling Bias | Non-representative sampling. Example: Surveying only active users. |
| Seasonality | Repeating time-based patterns. Example: Holiday sales. |
| Semi-Structured Data | Data with flexible structure. Example: JSON files. |
| Stacking | Ensemble method using meta-models. Example: Combining classifiers. |
| Standard Deviation | Square root of the variance; the typical spread of values around the mean. Example: Price volatility. |
| Stationarity | Stable statistical properties over time. Example: Mean doesn’t change. |
| Statistical Power | Probability of detecting a true effect. Example: Larger sample sizes increase power. |
| Statistical Significance | Evidence results are unlikely due to chance. Example: Rejecting the null hypothesis. |
| Structured Data | Data with a fixed schema. Example: SQL tables. |
| Supervised Learning | Learning with labeled data. Example: Credit risk prediction. |
| Survival Analysis | Modeling time-to-event data. Example: Customer churn timing. |
| Target Variable | The outcome a model predicts. Example: Loan default indicator. |
| Test Data | Data used to evaluate model performance. Example: Held-out test set. |
| Text Mining | Extracting insights from text. Example: Topic modeling. |
| Time Series | Data indexed by time. Example: Daily stock prices. |
| Tokenization | Splitting text into units. Example: Words or subwords. |
| Training Data | Data used to train a model. Example: Historical transactions. |
| Transfer Learning | Reusing pretrained models. Example: Image models for medical scans. |
| Trend | Long-term direction in data. Example: Growing user base. |
| Underfitting | Model too simple to capture patterns. Example: High bias. |
| Unstructured Data | Data without predefined structure. Example: Text, images. |
| Unsupervised Learning | Learning without labels. Example: Customer clustering. |
| Uplift Modeling | Estimating the incremental impact of a treatment on an outcome. Example: Marketing campaign effectiveness. |
| Validation Set | Data used for tuning models. Example: Hyperparameter selection. |
| Variance | Average squared deviation from the mean. Example: Sales variability. |
| Word Embeddings | Numerical text representations. Example: Word2Vec. |
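
To make the Accuracy, Precision, Recall, and F1 Score entries concrete, here is a minimal Python sketch that computes each one from Confusion Matrix counts. The function name `classification_metrics` and the example counts are hypothetical, chosen only for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives, and true negatives (illustrative inputs).
    """
    total = tp + fp + fn + tn
    # Accuracy: share of all predictions that are correct.
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted positives that are actually positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were identified.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a binary classifier evaluated on 100 examples.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, precision (0.8) is higher than recall (about 0.667), and the F1 score (about 0.727) falls between the two, which is why it is often preferred over accuracy on imbalanced datasets.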

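The Bootstrapping, Confidence Interval, Mean, Standard Deviation, and Reproducibility entries can be illustrated together. The sketch below uses only the Python standard library; the helper `bootstrap_ci`, its default parameters, and the sample values are assumptions made for this example, and the percentile method shown is just one of several ways to build a bootstrap interval.

```python
import random
import statistics


def bootstrap_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean.

    A fixed seed keeps the result reproducible across runs.
    """
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical sample, e.g. daily revenue in thousands.
sample = [12.0, 15.5, 9.8, 20.1, 14.3, 11.7, 18.4, 13.2, 16.9, 10.5]

sample_mean = statistics.fmean(sample)
sample_std = statistics.stdev(sample)  # sample standard deviation
ci_low, ci_high = bootstrap_ci(sample)

print(f"mean = {sample_mean:.2f}, std = {sample_std:.2f}")
print(f"95% bootstrap CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

Because the random generator is seeded, rerunning the script gives the same interval; changing `seed` or `n_resamples` will shift the bounds slightly.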
