Below is a glossary that includes 100 common “Data Engineering” terms and phrases in alphabetical order. Enjoy!
| Term | Definition & Example |
| --- | --- |
| Access Control | Managing who can access data. Example: Role-based permissions. |
| At-Least-Once Processing | Every record is processed at least once, so duplicates are possible. Example: Duplicate-safe pipelines. |
| At-Most-Once Processing | Each record is processed zero or one time, so data can be lost but never duplicated. Example: No retries on failure. |
| Backfill | Processing historical data. Example: Reloading last year’s data. |
| Batch Processing | Processing data in scheduled chunks. Example: Daily sales aggregation. |
| Blue-Green Deployment | Deployment strategy minimizing downtime. Example: Switching pipeline versions. |
| Canary Release | Gradual rollout to detect issues. Example: New pipeline tested on 5% of data. |
| Change Data Capture (CDC) | Capturing database changes. Example: Streaming updates from OLTP DB. |
| Checkpointing | Saving progress during processing. Example: Spark streaming checkpoints. |
| Cloud Storage | Scalable remote data storage. Example: Azure Data Lake Storage. |
| Cold Storage | Low-cost storage for infrequent access. Example: Archived logs. |
| Columnar Storage | Data stored by column instead of row. Example: Parquet files. |
| Compression | Reducing data size. Example: Gzip-compressed files. |
| Compute Engine | System performing data processing. Example: Spark cluster. |
| Consumption Layer | Data prepared for analytics. Example: Gold layer. |
| Cost Optimization | Reducing infrastructure costs. Example: Query optimization. |
| Curated Layer | Cleaned and transformed data. Example: Silver layer. |
| DAG (Directed Acyclic Graph) | Workflow structure with dependencies. Example: Airflow pipeline. |
| Data Catalog | Searchable inventory of data assets. Example: Microsoft Purview (formerly Azure Purview). |
| Data Contract | Agreement defining data structure and expectations. Example: Producer guarantees column names and types. |
| Data Engineering | The practice of designing, building, and maintaining data systems. Example: Creating pipelines that feed analytics dashboards. |
| Data Governance | Policies for data management and usage. Example: Access control rules. |
| Data Ingestion | Collecting data from source systems. Example: Ingesting API data hourly. |
| Data Lake | Centralized storage for raw data. Example: S3-based data lake. |
| Data Latency | Time delay in data availability. Example: 5-minute pipeline delay. |
| Data Lineage | Tracking data flow from source to output. Example: Source-to-dashboard trace. |
| Data Mart | Subset of warehouse for specific use. Example: Finance data mart. |
| Data Masking | Obscuring sensitive data. Example: Masked credit card numbers. |
| Data Mesh | Domain-oriented decentralized data ownership. Example: Teams own their data products. |
| Data Modeling | Designing data structures for usage. Example: Star schema design. |
| Data Observability | Monitoring data health and pipelines. Example: Freshness alerts. |
| Data Partition Pruning | Skipping irrelevant partitions. Example: Querying one date only. |
| Data Pipeline | An automated process that moves and transforms data. Example: Nightly ETL job from CRM to warehouse. |
| Data Platform | Integrated set of data tools. Example: End-to-end analytics stack. |
| Data Product | A dataset treated as a product. Example: Curated customer table. |
| Data Profiling | Analyzing data characteristics. Example: Value distributions. |
| Data Quality | Accuracy, completeness, and reliability of data. Example: No duplicate records. |
| Data Replay | Reprocessing historical events. Example: Rebuilding aggregates from logs. |
| Data Retention | Rules for data lifespan. Example: Delete logs after 1 year. |
| Data Security | Protecting data from unauthorized access. Example: Encryption at rest. |
| Data Serialization | Converting data for storage or transport. Example: Avro encoding. |
| Data Sink | The destination where data is stored. Example: Data warehouse. |
| Data Source | The origin of data. Example: ERP system, SaaS application. |
| Data Validation | Ensuring data meets expectations. Example: Null checks. |
| Data Versioning | Tracking dataset changes. Example: Snapshot tables. |
| Data Warehouse | Optimized storage for analytics queries. Example: Azure Synapse Analytics. |
| Dead Letter Queue (DLQ) | Storage for failed records. Example: Invalid messages routed for review. |
| Dimension Table | Table storing descriptive attributes. Example: Customer details. |
| ELT | Extract, Load, Transform approach. Example: Transforming data inside Snowflake. |
| ETL | Extract, Transform, Load process. Example: Cleaning data before loading into a database. |
| Event Time | Timestamp when event occurred. Example: User click time. |
| Event-Driven Architecture | Systems reacting to events in real time. Example: Trigger pipeline on file arrival. |
| Exactly-Once Processing | Ensuring each record affects the result exactly once, even when failures trigger retries. Example: Preventing duplicate events. |
| Fact Table | Table storing quantitative measures. Example: Order transactions. |
| Fault Tolerance | System resilience to failures. Example: Node failure recovery. |
| File Format | How data is stored on disk. Example: Parquet, CSV. |
| Foreign Key | Field linking tables together. Example: CustomerID in orders table. |
| Full Load | Reloading all data. Example: Initial table population. |
| High Availability | Designing a system to stay operational with minimal downtime. Example: Multi-zone deployment. |
| Hot Storage | High-performance storage for frequent access. Example: Real-time tables. |
| Idempotency | Ability to rerun pipelines safely. Example: Reprocessing without duplicates. |
| Incremental Load | Loading only new or changed data. Example: CDC-based ingestion. |
| Indexing | Creating structures to speed queries. Example: Index on order date. |
| Infrastructure as Code (IaC) | Managing infrastructure via code. Example: Terraform scripts. |
| Lakehouse | Hybrid of data lake and warehouse. Example: Databricks Lakehouse. |
| Late-Arriving Data | Data that arrives after expected time. Example: Delayed event logs. |
| Logging | Recording system events. Example: Job execution logs. |
| Message Queue | Buffer for asynchronous data transfer. Example: Kafka topic for events. |
| Metadata | Data about data. Example: Table definitions and lineage. |
| Metrics | Quantitative indicators of performance. Example: Rows processed per run. |
| Orchestration | Coordinating pipeline execution. Example: DAG scheduling. |
| Partitioning | Dividing data for performance. Example: Partitioning by date. |
| Personally Identifiable Information (PII) | Data identifying individuals. Example: Email addresses. |
| Pipeline Monitoring | Tracking pipeline execution status. Example: Failure notifications. |
| Primary Key | Unique identifier for a record. Example: CustomerID. |
| Processing Time | Timestamp when data is processed. Example: Ingestion time. |
| Query Optimization | Improving query efficiency. Example: Predicate pushdown. |
| Raw Layer | Storage of unprocessed data. Example: Bronze layer. |
| Real-Time Data | Data available with minimal latency. Example: Live dashboard updates. |
| Retry Logic | Automatic reruns on failure. Example: Retry failed ingestion job. |
| Scalability | Ability to handle growing workloads. Example: Auto-scaling clusters. |
| Scheduler | Tool managing execution timing. Example: Cron, Airflow. |
| Schema | The structure of a dataset. Example: Table columns and data types. |
| Schema Evolution | Handling schema changes over time. Example: Adding new columns safely. |
| Secrets Management | Secure handling of credentials. Example: Key Vault for passwords. |
| Semi-Structured Data | Data with flexible schema. Example: JSON, XML. |
| Serverless | Infrastructure managed by provider. Example: Serverless SQL pools. |
| Serving Layer | Layer optimized for consumption. Example: BI-ready tables. |
| Sharding | Distributing data across nodes. Example: User data split across servers. |
| Snowflake Schema | Normalized version of star schema. Example: Product broken into sub-dimensions. |
| Star Schema | Fact table surrounded by dimensions. Example: Sales fact with date dimension. |
| Stream Processing | Processing data continuously as it arrives. Example: Clickstream event processing. |
| Structured Data | Data with a fixed schema. Example: SQL tables. |
| Technical Debt | Long-term cost of quick fixes. Example: Hardcoded transformations. |
| Throughput | Amount of data processed per unit time. Example: Records per second. |
| Transformation Layer | Layer where business logic is applied. Example: dbt models. |
| Unstructured Data | Data without a predefined structure. Example: Images, PDFs. |
| Watermark | Marker recording how far processing has progressed. Example: Last processed timestamp. |
| Windowing | Grouping stream data by time windows. Example: 5-minute aggregations. |
| Workload Isolation | Separating workloads to avoid contention. Example: Dedicated compute pools. |
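
Several of these terms are easier to grasp together in code. Below is a minimal Python sketch of an incremental load that uses a watermark and stays idempotent by upserting on the primary key. The function name `incremental_load`, the row fields (`id`, `updated_at`, `amount`), and the in-memory dict standing in for a target table are all assumptions made for illustration, not part of any particular tool.

```python
from datetime import datetime


def incremental_load(source_rows, target, watermark):
    """Upsert rows newer than `watermark` into `target` and return the new watermark.

    `target` is a plain dict keyed by primary key (a hypothetical stand-in
    for a real table).
    """
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] <= watermark:
            continue  # incremental load: skip rows handled by an earlier run
        # Upsert by primary key, so reprocessing a row overwrites instead of duplicating.
        target[row["id"]] = row
        new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark


# Illustrative source data: customer 1 is updated twice.
source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1), "amount": 10},
    {"id": 2, "updated_at": datetime(2024, 1, 2), "amount": 20},
    {"id": 1, "updated_at": datetime(2024, 1, 3), "amount": 15},
]

target = {}
wm = incremental_load(source, target, datetime.min)  # first run behaves like a full load
wm_again = incremental_load(source, target, wm)      # second run finds nothing new
```

After the first run, `target` holds one row per `id` and `wm` is the latest `updated_at` seen. Rerunning from the original watermark would rewrite the same rows rather than append duplicates, which is the idempotency property described in the table.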
Please share your suggestions for any terms that should be added.
