Glossary – 100 “AI” Terms

Below is a glossary that includes 100 common “AI (Artificial Intelligence)” terms and phrases in alphabetical order. Enjoy!

Term | Definition & Example
Accuracy | Percentage of correct predictions. Example: 92% accuracy.
Agent | AI entity performing tasks autonomously. Example: Task-planning agent.
AI Alignment | Ensuring AI goals match human values. Example: Safe AI systems.
AI Bias | Systematic unfairness in AI outcomes. Example: Biased hiring models.
Algorithm | A step-by-step procedure for solving a problem, such as training a model. Example: Decision tree algorithm.
Artificial General Intelligence (AGI) | Hypothetical AI with human-level intelligence. Example: Broad reasoning across tasks.
Artificial Intelligence (AI) | Systems that perform tasks requiring human-like intelligence. Example: Chatbots answering questions.
Artificial Neural Network (ANN) | A network of interconnected artificial neurons. Example: Credit scoring models.
Attention Mechanism | Lets a model weight the most relevant parts of its input. Example: Language translation.
AUC | Area under the ROC curve, summarizing classifier performance across thresholds. Example: Model comparison.
AutoML | Automated model selection and tuning. Example: Auto-generated models.
Autonomous System | AI operating with minimal human input. Example: Self-driving cars.
Backpropagation | Method for computing the gradients used to update neural network weights. Example: Deep learning training.
Batch | Subset of data processed at once. Example: Batch size of 32.
Batch Inference | Predictions made in bulk. Example: Nightly scoring jobs.
Bias (Model Bias) | Error from oversimplified assumptions. Example: Linear model on non-linear data.
Bias–Variance Tradeoff | Balance between bias and variance. Example: Choosing model complexity.
Black Box Model | Model with opaque internal logic. Example: Deep neural networks.
Classification | Predicting categorical outcomes. Example: Email spam classification.
Clustering | Grouping similar data points. Example: Customer segmentation.
Computer Vision | AI for interpreting images and video. Example: Facial recognition.
Concept Drift | Changes in the relationship between inputs and the target. Example: Fraud patterns evolving.
Confusion Matrix | Table evaluating classification results. Example: True positives vs. false positives.
Data Augmentation | Expanding data via transformations. Example: Image rotation.
Data Drift | Changes in input data distribution. Example: New user demographics.
Data Leakage | Training on information that would not be available at prediction time. Example: Including test labels.
Decision Tree | Tree-based decision model. Example: Loan approval logic.
Deep Learning | ML using multi-layer neural networks. Example: Image recognition.
Dimensionality Reduction | Reducing number of features. Example: PCA for visualization.
Edge AI | AI running on local devices. Example: Smart cameras.
Embedding | Numerical vector representation of data. Example: Word embeddings.
Ensemble Model | Combining multiple models. Example: Random forest.
Epoch | One full pass through training data. Example: 50 training epochs.
Ethics in AI | Moral considerations in AI use. Example: Avoiding bias.
Explainable AI (XAI) | Making AI decisions understandable. Example: Feature importance charts.
F1 Score | Harmonic mean of precision and recall. Example: Evaluating models on imbalanced datasets.
Fairness | Equitable AI outcomes across groups. Example: Equal approval rates.
Feature | An input variable for a model. Example: Customer age.
Feature Engineering | Creating or transforming features to improve models. Example: Calculating customer tenure.
Federated Learning | Training models across decentralized data. Example: Mobile keyboard predictions.
Few-Shot Learning | Learning from few examples. Example: Custom classification with few samples.
Fine-Tuning | Further training a pre-trained model. Example: Custom chatbot training.
Generalization | Model’s ability to perform on new data. Example: Accurate predictions on unseen data.
Generative AI | AI that creates new content. Example: Text or image generation.
Gradient Boosting | Sequentially improving weak models. Example: XGBoost.
Gradient Descent | Optimization technique adjusting weights iteratively. Example: Training neural networks.
Hallucination | Model generates plausible but incorrect information. Example: False factual claims.
Hyperparameter | Configuration set before training. Example: Learning rate.
Inference | Using a trained model to predict. Example: Real-time recommendations.
K-Means | Clustering algorithm that partitions data into k groups. Example: Market segmentation.
Knowledge Graph | Graph-based representation of knowledge. Example: Search engines.
Label | The correct output for supervised learning. Example: “Fraud” or “Not Fraud”.
Large Language Model (LLM) | AI trained on massive text corpora. Example: ChatGPT.
Loss Function | Measures model error during training. Example: Mean squared error.
Machine Learning (ML) | AI that learns patterns from data without explicit programming. Example: Spam email detection.
MLOps | Practices for managing ML lifecycle. Example: CI/CD for models.
Model | A trained mathematical representation of patterns. Example: Logistic regression model.
Model Deployment | Making a model available for use. Example: API-based predictions.
Model Drift | Model performance degradation over time. Example: Changing customer behavior.
Model Interpretability | Ability to understand model behavior. Example: Decision tree visualization.
Model Versioning | Tracking model changes. Example: v1 vs. v2 models.
Monitoring | Tracking model performance in production. Example: Accuracy alerts.
Multimodal AI | AI handling multiple data types. Example: Text + image models.
Naive Bayes | Probabilistic classification algorithm. Example: Spam filtering.
Natural Language Processing (NLP) | AI for understanding human language. Example: Sentiment analysis.
Neural Network | Model inspired by the human brain’s structure. Example: Handwritten digit recognition.
Optimization | Process of minimizing loss. Example: Gradient descent.
Overfitting | Model learns noise instead of patterns. Example: Perfect training accuracy, poor test accuracy.
Pipeline | Automated ML workflow. Example: Training-to-deployment flow.
Precision | Share of positive predictions that are correct. Example: Fraud detection precision.
Pretrained Model | Model trained on general data. Example: GPT models.
Principal Component Analysis (PCA) | Technique for dimensionality reduction. Example: Compressing high-dimensional data.
Privacy | Protecting personal data. Example: Anonymizing training data.
Prompt | Input instruction for generative models. Example: “Summarize this text.”
Prompt Engineering | Crafting effective prompts. Example: Improving LLM responses.
Random Forest | Ensemble of decision trees. Example: Classification tasks.
Real-Time Inference | Immediate predictions on live data. Example: Fraud detection.
Recall | Share of actual positives that the model finds. Example: Cancer detection.
Regression | Predicting numeric values. Example: Sales forecasting.
Reinforcement Learning | Learning through rewards and penalties. Example: Game-playing AI.
Reproducibility | Ability to recreate results. Example: Fixed random seeds.
Robotics | AI applied to physical machines. Example: Warehouse robots.
ROC Curve | Plot of true positive rate against false positive rate across thresholds. Example: Threshold analysis.
Semi-Supervised Learning | Learning from a mix of labeled and unlabeled data. Example: Image classification with limited labels.
Speech Recognition | Converting speech to text. Example: Voice assistants.
Supervised Learning | Learning using labeled data. Example: Predicting house prices from known values.
Support Vector Machine (SVM) | Algorithm separating data with margins. Example: Text classification.
Synthetic Data | Artificially generated data. Example: Privacy-safe training.
Test Data | Data used to evaluate final model performance. Example: Held-out test set.
Threshold | Cutoff for classification decisions. Example: Probability > 0.7.
Token | Smallest unit of text processed by models. Example: Words or subwords.
Training Data | Data used to teach a model. Example: Historical sales records.
Transfer Learning | Reusing knowledge from another task. Example: Image model reused for medical scans.
Transformer | Attention-based neural architecture for sequence data. Example: Language translation models.
Underfitting | Model too simple to capture patterns. Example: High error on both training and test data.
Unsupervised Learning | Learning from unlabeled data. Example: Customer clustering.
Validation Data | Data used to tune hyperparameters and compare models. Example: Hyperparameter selection.
Variance | Error from sensitivity to data fluctuations. Example: Highly complex model.
XGBoost | Optimized gradient boosting algorithm. Example: Kaggle competitions.
Zero-Shot Learning | Performing tasks without task-specific examples. Example: Classifying unseen labels.
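
Several of the evaluation terms in this glossary (Threshold, Confusion Matrix, Accuracy, Precision, Recall, F1 Score) are connected by simple arithmetic. The sketch below is only an illustration: the labels, scores, and function names are made up, and the 0.7 cutoff is borrowed from the Threshold entry.

```python
def confusion_counts(y_true, scores, threshold=0.7):
    """Apply a probability cutoff and count the four confusion-matrix cells."""
    tp = fp = fn = tn = 0
    for truth, score in zip(y_true, scores):
        predicted_positive = score > threshold
        if predicted_positive and truth:
            tp += 1      # true positive
        elif predicted_positive:
            fp += 1      # false positive
        elif truth:
            fn += 1      # false negative
        else:
            tn += 1      # true negative
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from the counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical data: 1 = positive class, scores are predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.75, 0.2, 0.1, 0.3, 0.95]

counts = confusion_counts(y_true, scores)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

Raising the threshold usually trades recall for precision, which is why the cutoff is normally chosen with the cost of each error type in mind.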

Please share your suggestions for any terms that should be added.

Glossary – 100 “Data Engineering” Terms

Below is a glossary that includes 100 common “Data Engineering” terms and phrases in alphabetical order. Enjoy!

Term | Definition & Example
Access Control | Managing who can access data. Example: Role-based permissions.
At-Least-Once Processing | Each record is processed one or more times, so duplicates are possible. Example: Duplicate-safe pipelines.
At-Most-Once Processing | Each record is processed zero or one time, so data may be lost. Example: No retries on failure.
Backfill | Processing historical data. Example: Reloading last year’s data.
Batch Processing | Processing data in scheduled chunks. Example: Daily sales aggregation.
Blue-Green Deployment | Deployment strategy minimizing downtime. Example: Switching pipeline versions.
Canary Release | Gradual rollout to detect issues. Example: New pipeline tested on 5% of data.
Change Data Capture (CDC) | Capturing database changes. Example: Streaming updates from an OLTP database.
Checkpointing | Saving progress during processing. Example: Spark streaming checkpoints.
Cloud Storage | Scalable remote data storage. Example: Azure Data Lake Storage.
Cold Storage | Low-cost storage for infrequent access. Example: Archived logs.
Columnar Storage | Data stored by column instead of row. Example: Parquet files.
Compression | Reducing data size. Example: Gzip-compressed files.
Compute Engine | System performing data processing. Example: Spark cluster.
Consumption Layer | Data prepared for analytics. Example: Gold layer.
Cost Optimization | Reducing infrastructure costs. Example: Query optimization.
Curated Layer | Cleaned and transformed data. Example: Silver layer.
DAG (Directed Acyclic Graph) | Workflow structure with dependencies. Example: Airflow pipeline.
Data Catalog | Searchable inventory of data assets. Example: Azure Purview.
Data Contract | Agreement defining data structure and expectations. Example: Producer guarantees column names and types.
Data Engineering | The practice of designing, building, and maintaining data systems. Example: Creating pipelines that feed analytics dashboards.
Data Governance | Policies for data management and usage. Example: Access control rules.
Data Ingestion | Collecting data from source systems. Example: Ingesting API data hourly.
Data Lake | Centralized storage for raw data. Example: S3-based data lake.
Data Latency | Time delay in data availability. Example: 5-minute pipeline delay.
Data Lineage | Tracking data flow from source to output. Example: Source-to-dashboard trace.
Data Mart | Subset of warehouse for specific use. Example: Finance data mart.
Data Masking | Obscuring sensitive data. Example: Masked credit card numbers.
Data Mesh | Domain-oriented decentralized data ownership. Example: Teams own their data products.
Data Modeling | Designing data structures for usage. Example: Star schema design.
Data Observability | Monitoring data health and pipelines. Example: Freshness alerts.
Data Partition Pruning | Skipping irrelevant partitions. Example: Querying one date only.
Data Pipeline | An automated process that moves and transforms data. Example: Nightly ETL job from CRM to warehouse.
Data Platform | Integrated set of data tools. Example: End-to-end analytics stack.
Data Product | A dataset treated as a product. Example: Curated customer table.
Data Profiling | Analyzing data characteristics. Example: Value distributions.
Data Quality | Accuracy, completeness, and reliability of data. Example: No duplicate records.
Data Replay | Reprocessing historical events. Example: Rebuilding aggregates from logs.
Data Retention | Rules for data lifespan. Example: Delete logs after 1 year.
Data Security | Protecting data from unauthorized access. Example: Encryption at rest.
Data Serialization | Converting data for storage or transport. Example: Avro encoding.
Data Sink | The destination where data is stored. Example: Data warehouse.
Data Source | The origin of data. Example: ERP system, SaaS application.
Data Validation | Ensuring data meets expectations. Example: Null checks.
Data Versioning | Tracking dataset changes. Example: Snapshot tables.
Data Warehouse | Optimized storage for analytics queries. Example: Azure Synapse Analytics.
Dead Letter Queue (DLQ) | Storage for failed records. Example: Invalid messages routed for review.
Dimension Table | Table storing descriptive attributes. Example: Customer details.
ELT | Extract, Load, Transform approach. Example: Transforming data inside Snowflake.
ETL | Extract, Transform, Load process. Example: Cleaning data before loading into a database.
Event Time | Timestamp when event occurred. Example: User click time.
Event-Driven Architecture | Systems reacting to events in real time. Example: Trigger pipeline on file arrival.
Exactly-Once Processing | Ensuring each record takes effect only once. Example: Preventing duplicate events.
Fact Table | Table storing quantitative measures. Example: Order transactions.
Fault Tolerance | System resilience to failures. Example: Node failure recovery.
File Format | How data is stored on disk. Example: Parquet, CSV.
Foreign Key | Field linking tables together. Example: CustomerID in orders table.
Full Load | Reloading all data. Example: Initial table population.
High Availability | Design that keeps a system running with minimal downtime. Example: Multi-zone deployment.
Hot Storage | High-performance storage for frequent access. Example: Real-time tables.
Idempotency | Ability to rerun pipelines safely. Example: Reprocessing without duplicates.
Incremental Load | Loading only new or changed data. Example: CDC-based ingestion.
Indexing | Creating structures to speed queries. Example: Index on order date.
Infrastructure as Code (IaC) | Managing infrastructure via code. Example: Terraform scripts.
Lakehouse | Hybrid of data lake and warehouse. Example: Databricks Lakehouse.
Late-Arriving Data | Data that arrives after expected time. Example: Delayed event logs.
Logging | Recording system events. Example: Job execution logs.
Message Queue | Buffer for asynchronous data transfer. Example: Kafka topic for events.
Metadata | Data about data. Example: Table definitions and lineage.
Metrics | Quantitative indicators of performance. Example: Rows processed per run.
Orchestration | Coordinating pipeline execution. Example: DAG scheduling.
Partitioning | Dividing data for performance. Example: Partitioning by date.
Personally Identifiable Information (PII) | Data identifying individuals. Example: Email addresses.
Pipeline Monitoring | Tracking pipeline execution status. Example: Failure notifications.
Primary Key | Unique identifier for a record. Example: CustomerID.
Processing Time | Timestamp when data is processed. Example: Ingestion time.
Query Optimization | Improving query efficiency. Example: Predicate pushdown.
Raw Layer | Storage of unprocessed data. Example: Bronze layer.
Real-Time Data | Data available with minimal latency. Example: Live dashboard updates.
Retry Logic | Automatic reruns on failure. Example: Retry failed ingestion job.
Scalability | Ability to handle growing workloads. Example: Auto-scaling clusters.
Scheduler | Tool managing execution timing. Example: Cron, Airflow.
Schema | The structure of a dataset. Example: Table columns and data types.
Schema Evolution | Handling schema changes over time. Example: Adding new columns safely.
Secrets Management | Secure handling of credentials. Example: Key Vault for passwords.
Semi-Structured Data | Data with flexible schema. Example: JSON, XML.
Serverless | Compute whose infrastructure is provisioned and managed by the provider. Example: Serverless SQL pools.
Serving Layer | Layer optimized for consumption. Example: BI-ready tables.
Sharding | Distributing data across nodes. Example: User data split across servers.
Snowflake Schema | Normalized version of star schema. Example: Product broken into sub-dimensions.
Star Schema | Fact table surrounded by dimensions. Example: Sales fact with date dimension.
Stream Processing | Processing data in real time. Example: Clickstream event processing.
Structured Data | Data with a fixed schema. Example: SQL tables.
Technical Debt | Long-term cost of quick fixes. Example: Hardcoded transformations.
Throughput | Amount of data processed per unit time. Example: Records per second.
Transformation Layer | Layer where business logic is applied. Example: dbt models.
Unstructured Data | Data without a predefined structure. Example: Images, PDFs.
Watermark | Marker recording how far processing has progressed. Example: Last processed timestamp.
Windowing | Grouping stream data by time windows. Example: 5-minute aggregations.
Workload Isolation | Separating workloads to avoid contention. Example: Dedicated compute pools.
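
Three of the terms in this glossary, Incremental Load, Watermark, and Idempotency, tend to show up together in practice. Here is a minimal in-memory sketch of how they can interact. The `incremental_load` function, the row layout, and the dictionary standing in for a warehouse table are all hypothetical; a real pipeline would read from and write to actual systems.

```python
from datetime import datetime


def incremental_load(source_rows, sink, watermark):
    """Upsert rows changed after `watermark` into `sink`.

    Returns the new watermark (the latest `updated_at` that was loaded).
    """
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            # Keyed upsert: a rerun overwrites the same key instead of
            # appending a duplicate, which keeps the load idempotent.
            sink[row["id"]] = row
            new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark


# Hypothetical source table with a last-modified column.
source = [
    {"id": 1, "amount": 10, "updated_at": datetime(2025, 1, 1)},
    {"id": 2, "amount": 20, "updated_at": datetime(2025, 1, 2)},
    {"id": 3, "amount": 30, "updated_at": datetime(2025, 1, 3)},
]

sink = {}                                 # stands in for the target table
last_watermark = datetime(2025, 1, 1)     # rows up to here are already loaded

new_watermark = incremental_load(source, sink, last_watermark)

# Rerunning with the old watermark reprocesses the same rows safely.
rerun_watermark = incremental_load(source, sink, last_watermark)
print(sorted(sink), new_watermark)
```

Storing the returned watermark after a successful run is what lets the next run skip rows that were already loaded.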

Please share your suggestions for any terms that should be added.

Glossary – 100 “Data Analysis” Terms

Below is a glossary that includes 100 common “Data Analysis” terms and phrases in alphabetical order. Enjoy!

Term | Definition & Example
A/B Test | Comparing two variations to measure impact. Example: Two webpage layouts.
Actionable Insight | An insight that leads to a clear decision. Example: Improve onboarding experience.
Ad Hoc Analysis | One-off analysis for a specific question. Example: Investigating a sudden sales dip.
Aggregation | Summarizing data using functions like sum or average. Example: Total revenue by region.
Analytical Maturity | Organization’s capability to use data effectively. Example: Moving from descriptive to predictive analytics.
Bar Chart | A chart comparing categories. Example: Sales by region.
Baseline | A reference point for comparison. Example: Last year’s sales used as baseline.
Benchmark | A standard used to compare performance. Example: Industry average churn rate.
Bias | Systematic error in data or analysis. Example: Surveying only active users.
Business Question | A decision-focused question data aims to answer. Example: Which products drive profit?
Causation | A relationship where one variable causes another. Example: Price cuts causing sales growth.
Confidence Interval | Range that likely contains the true value of a parameter. Example: 95% CI for average sales.
Correlation | A statistical relationship between variables. Example: Sales and marketing spend.
Cumulative Total | A running total over time. Example: Year-to-date revenue.
Dashboard | A visual collection of key metrics. Example: Executive sales dashboard.
Data | Raw facts or measurements collected for analysis. Example: Sales transactions, sensor readings, survey responses.
Data Anomaly | Unexpected or unusual data pattern. Example: Sudden spike in user signups.
Data Cleaning | Correcting or removing inaccurate data. Example: Fixing misspelled country names.
Data Consistency | Uniform representation across datasets. Example: Same currency used everywhere.
Data Governance | Policies ensuring data quality, security, and usage. Example: Defined data ownership roles.
Data Imputation | Replacing missing values with estimated ones. Example: Filling null ages with the median.
Data Lineage | Tracking data origin and transformations. Example: Tracing metrics back to source systems.
Data Literacy | Ability to read, understand, and use data. Example: Interpreting charts correctly.
Data Model | The structure defining how data tables relate. Example: Star schema.
Data Pipeline | Automated flow of data from source to destination. Example: Daily ingestion job.
Data Profiling | Analyzing data characteristics. Example: Checking null percentages.
Data Quality | The accuracy, completeness, and reliability of data. Example: Valid dates and consistent formats.
Data Refresh | Updating data with the latest values. Example: Nightly refresh.
Data Refresh Frequency | How often data is updated. Example: Hourly vs. daily refresh.
Data Skewness | Degree of asymmetry in data distribution. Example: Income data skewed to the right.
Data Source | The origin of data. Example: SQL database, API.
Data Storytelling | Communicating insights using narrative and visuals. Example: Executive-ready presentation.
Data Transformation | Modifying data to improve usability or consistency. Example: Converting text dates to date data types.
Data Validation | Ensuring data meets rules and expectations. Example: No negative quantities.
Data Wrangling | Transforming raw data into a usable format. Example: Reshaping columns for analysis.
Dataset | A structured collection of related data. Example: A table of customer orders with dates, amounts, and regions.
Derived Metric | A metric calculated from other metrics. Example: Profit margin = Profit / Revenue.
Descriptive Analytics | Analysis that explains what happened. Example: Last quarter’s sales summary.
Diagnostic Analytics | Analysis that explains why something happened. Example: Revenue drop due to fewer customers.
Dice | Filtering data by multiple dimensions. Example: Sales for 2025 in the West region.
Dimension | A descriptive attribute used to slice data. Example: Date, region, product.
Dimension Table | A table containing descriptive attributes. Example: Product details.
Dimensionality | Number of features or variables in data. Example: High-dimensional customer data.
Distribution | How values are spread across a range. Example: Income distribution.
Drill Down | Navigating from summary to detail. Example: Yearly sales → monthly sales.
Drill Through | Jumping to a detailed view for a specific value. Example: Clicking a region to see store data.
ELT | Extract, Load, Transform approach. Example: Transforming data inside a warehouse.
ETL | Extract, Transform, Load process. Example: Loading CRM data into a warehouse.
Exploratory Data Analysis (EDA) | Initial investigation to understand data. Example: Visualizing distributions.
Fact Table | A table containing quantitative data. Example: Sales transactions.
Feature | An individual measurable property used in analysis. Example: Customer age used in churn analysis.
Feature Engineering | Creating new features from existing data. Example: Calculating customer tenure from signup date.
Filtering | Limiting data to a subset of interest. Example: Only orders from 2025.
Granularity | The level of detail in the data. Example: Daily sales vs. monthly sales.
Grouping | Organizing data into categories before aggregation. Example: Sales grouped by product category.
Histogram | A chart showing data distribution. Example: Frequency of order sizes.
Hypothesis | A testable assumption. Example: Discounts increase sales.
Incremental Load | Loading only new or changed data. Example: Yesterday’s transactions.
Insight | A meaningful finding that informs action. Example: High churn among new users.
KPI (Key Performance Indicator) | A critical metric tied to business objectives. Example: Monthly churn rate.
Kurtosis | Measure of how heavy the tails of a distribution are. Example: Detecting extreme outliers.
Latency | Delay between data generation and availability. Example: Real-time vs. daily data.
Line Chart | A chart showing trends over time. Example: Monthly revenue trend.
Mean | The arithmetic average. Example: Average order value.
Measure | A calculated numeric value, often aggregated. Example: SUM(Sales).
Median | The middle value in ordered data. Example: Median household income.
Metric | A quantifiable measure used to track performance. Example: Total sales, average order value.
Missing Values | Data points that are absent or null. Example: Blank customer age values.
Mode | The most frequent value. Example: Most common product category.
Multivariate Analysis | Analyzing multiple variables simultaneously. Example: Studying price, demand, and seasonality.
Normalization | Scaling data to a common range. Example: Normalizing values between 0 and 1.
Observation | A single record or row in a dataset. Example: A single customer order.
Outlier | A data point significantly different from others. Example: An unusually large transaction amount.
Percentile | Value below which a percentage of data falls. Example: 90th percentile response time.
Population | The full set of interest. Example: All customers.
Predictive Analytics | Analysis that forecasts future outcomes. Example: Predicting next month’s demand.
Prescriptive Analytics | Analysis that suggests actions. Example: Recommending price changes.
Quartile | Values dividing ordered data into four equal parts. Example: Q1, Q2, Q3.
Report | A structured presentation of analysis results. Example: Monthly performance report.
Reproducibility | Ability to recreate analysis results consistently. Example: Using versioned datasets.
Rolling Average | An average calculated over a moving window. Example: 7-day rolling average of sales.
Root Cause Analysis | Identifying the underlying cause of an issue. Example: Revenue loss due to inventory shortages.
Sample | A subset of a population. Example: Survey respondents.
Sampling Bias | Bias introduced by non-random samples. Example: Feedback collected only from power users.
Scatter Plot | A chart showing relationships between two variables. Example: Ad spend vs. revenue.
Seasonality | Repeating patterns tied to time cycles. Example: Holiday sales spikes.
Semi-Structured Data | Data with flexible structure. Example: JSON files.
Sensitivity Analysis | Evaluating how outcomes change with inputs. Example: Impact of price changes on profit.
Slice | Filtering data by a single dimension. Example: Sales for 2025 only.
Snapshot | Data captured at a specific point in time. Example: End-of-month balances.
Snowflake Schema | A normalized version of a star schema. Example: Product broken into sub-tables.
Standard Deviation | Typical distance of values from the mean (the square root of the variance). Example: Consistency of sales performance.
Standardization | Rescaling data to have mean 0 and standard deviation 1. Example: Preparing data for regression analysis.
Star Schema | A data model with facts surrounded by dimensions. Example: Sales fact with product and date dimensions.
Structured Data | Data with a fixed schema. Example: Relational tables.
Time Series | Data indexed by time. Example: Daily stock prices.
Trend | A general direction in data over time. Example: Increasing monthly revenue.
Unstructured Data | Data without a predefined schema. Example: Emails, images.
Variable | A characteristic or attribute that can take different values. Example: Age, revenue, product category.
Variance | Average squared deviation from the mean, a measure of data spread. Example: Variance in delivery times.
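
Many of the statistical terms in this glossary (Mean, Median, Variance, Standard Deviation, Normalization, Standardization, Rolling Average, Outlier) can be computed in a few lines. The sketch below uses Python's standard `statistics` module on a made-up week of daily sales. The data and helper names are assumptions for illustration only, and the population formulas (`pvariance`, `pstdev`) are used for simplicity.

```python
import statistics

# Hypothetical daily sales; the value 40 acts as an outlier.
daily_sales = [10, 12, 11, 13, 40, 12, 14]

mean = statistics.mean(daily_sales)          # arithmetic average
median = statistics.median(daily_sales)      # middle value, robust to the outlier
variance = statistics.pvariance(daily_sales) # average squared deviation from the mean
std_dev = statistics.pstdev(daily_sales)     # square root of the variance


def normalize(values):
    """Min-max normalization: rescale values into the range 0 to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Standardization: rescale to mean 0 and standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def rolling_average(values, window):
    """Average over a moving window, one result per full window."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


normalized = normalize(daily_sales)
z_scores = standardize(daily_sales)
rolling = rolling_average(daily_sales, window=3)

print(mean, median, round(std_dev, 2))
print(rolling)
```

Here the single outlier pulls the mean well above the median, a common reason analysts report both.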

Please share your suggestions for any terms that should be added.