This post is a part of the DP-900: Microsoft Azure Data Fundamentals Exam Prep Hub.
This topic falls under these sections:
Describe core data concepts (25–30%)
--> Identify options for data storage
--> Describe common formats for data files
Note that there are 10 practice questions (with answers and explanations) for each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available on the hub below the exam topics section.
In DP-900, Microsoft expects you to understand common data file formats, what type of data they typically store (structured, semi-structured, or unstructured), and why certain formats are used in analytics and Azure storage scenarios.
This topic connects directly to Azure Blob Storage, Azure Data Lake Storage, and analytics pipelines.
Why Data File Formats Matter
Data file formats define:
- How data is organized inside a file
- Whether the data is human-readable or binary
- How efficiently it can be stored and queried
- Which tools and services can process it
Choosing the right format impacts:
- Performance
- Storage cost
- Analytics capabilities
- Interoperability between systems
For DP-900, focus on understanding what each format is used for, not deep implementation details.
Common Data File Formats You Should Know
1. CSV (Comma-Separated Values)
CSV is one of the simplest and most widely used formats for structured data.
Key Characteristics
- Plain text
- Each row represents a record
- Columns separated by commas (or other delimiters)
- No embedded schema
- Human readable
Example:
CustomerID,Name,City
1,John,Seattle
2,Maria,Austin
Typical Use Cases
- Data exports and imports
- Simple datasets
- Spreadsheet interoperability
Exam Notes
- Represents structured data
- Lightweight and easy to move between systems
- No native support for nested structures or explicit data types (every value is plain text until a reader interprets it)
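To make the "no embedded schema" point concrete, here is a minimal sketch using Python's standard `csv` module. It assumes the sample customer data from this section is held in an in-memory string rather than a real file; the variable names are illustrative only.

```python
import csv
import io

# Sample customer data from this section, held in memory instead of a file.
csv_text = "CustomerID,Name,City\n1,John,Seattle\n2,Maria,Austin\n"

# DictReader treats the first row as the column names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV carries no type information, so CustomerID arrives as the string "1".
first_id = rows[0]["CustomerID"]

# Any typing has to be applied by the reader, for example converting IDs to int.
customer_ids = [int(row["CustomerID"]) for row in rows]

print(repr(first_id), customer_ids)
```

The reader, not the file, decides that `CustomerID` is a number. That is the practical meaning of CSV having no data types.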
2. JSON (JavaScript Object Notation)
JSON is the most common format for semi-structured data, especially in modern applications and APIs.
Key Characteristics
- Key–value pairs
- Supports nested objects and arrays
- Self-describing
- Human readable
- Schema-on-read
Example:
{
  "CustomerID": 1,
  "Name": "John",
  "Orders": [
    { "OrderID": 100, "Amount": 50 }
  ]
}
Typical Use Cases
- Web APIs
- Application data
- Azure Cosmos DB documents
- Logs and telemetry
Exam Notes
- Represents semi-structured data
- Flexible schema
- Commonly used with Azure Cosmos DB and Azure Data Lake
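As a rough illustration of "self-describing" and "flexible schema", the sketch below parses two JSON documents with Python's standard `json` module. The first is the customer document from this section; the second document (a customer without orders) is a made-up example added here to show that documents need not share the same fields.

```python
import json

# The customer document from this section, as a JSON string.
json_text = '{"CustomerID": 1, "Name": "John", "Orders": [{"OrderID": 100, "Amount": 50}]}'
customer = json.loads(json_text)

# Unlike CSV, numbers keep their type and nested arrays are navigable.
customer_id = customer["CustomerID"]
first_amount = customer["Orders"][0]["Amount"]

# Hypothetical second document with no "Orders" key: still valid JSON.
other = json.loads('{"CustomerID": 2, "Name": "Maria"}')

# Schema-on-read: the reader decides how to handle the missing field.
order_count = len(other.get("Orders", []))

print(customer_id, first_amount, order_count)
```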
3. XML (Extensible Markup Language)
XML is another semi-structured format that uses tags to describe data.
Key Characteristics
- Tag-based hierarchy
- Supports nested structures
- Human readable but verbose
- Self-describing
Example:
<Customer>
  <CustomerID>1</CustomerID>
  <Name>John</Name>
</Customer>
Typical Use Cases
- Legacy systems
- Configuration files
- Enterprise data exchange
Exam Notes
- Semi-structured
- Less common than JSON in modern Azure solutions
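For comparison with JSON, this small sketch reads the same kind of customer record from XML using Python's standard `xml.etree.ElementTree` module. The variable names are illustrative, and the XML is held in a string rather than a file.

```python
import xml.etree.ElementTree as ET

# The customer record from this section as an XML string.
xml_text = "<Customer><CustomerID>1</CustomerID><Name>John</Name></Customer>"

root = ET.fromstring(xml_text)

# Tags describe the data, which is what makes XML self-describing.
root_tag = root.tag
name = root.find("Name").text

# Element text is always a string, so the reader converts it where needed.
customer_id = int(root.find("CustomerID").text)

print(root_tag, customer_id, name)
```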
4. Parquet
Parquet is a columnar, binary file format optimized for analytics workloads.
Key Characteristics
- Column-based storage
- Highly compressed
- Not human readable
- Very fast for analytical queries
Typical Use Cases
- Big data analytics
- Azure Synapse Analytics
- Azure Data Lake Storage
Exam Notes
- Used for large analytical datasets
- Optimized for performance and storage efficiency
- Common in modern data engineering pipelines
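The following sketch is not the Parquet format itself (reading real Parquet files requires a library). It is only a conceptual illustration, in plain Python with made-up sample values, of why column-based storage suits analytical queries: an aggregate over one column only needs to touch that column's values.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"CustomerID": 1, "City": "Seattle", "Amount": 50},
    {"CustomerID": 2, "City": "Austin", "Amount": 75},
    {"CustomerID": 3, "City": "Seattle", "Amount": 20},
]

# Column-oriented layout: all values of one column are stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytical query such as SUM(Amount) reads a single column
# and can skip CustomerID and City entirely.
total_amount = sum(columns["Amount"])

print(columns["Amount"], total_amount)
```

Values within a single column also tend to be similar to one another (note the repeated city names), which is one reason columnar files usually compress well.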
5. Avro
Avro is a row-based binary format designed for data serialization and streaming.
Key Characteristics
- Compact binary format
- Includes schema with the data
- Efficient for data movement
- Not human readable
Typical Use Cases
- Data pipelines
- Event streaming
- Big data ingestion
Exam Notes
- Often used behind the scenes in data platforms (for example, Azure Event Hubs Capture can write captured events as Avro files)
- Supports schema evolution
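Schema evolution can be sketched without an Avro library. Avro schemas are written in JSON, and a reader using a newer schema can fill in a field missing from older data if that field declares a default. The `resolve` function below is a simplified, hypothetical stand-in for that behaviour, not the actual Avro implementation, and the `Customer` schema is an assumed example.

```python
# Schema the data was originally written with (Avro schemas are JSON documents).
writer_schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "CustomerID", "type": "int"},
        {"name": "Name", "type": "string"},
    ],
}

# A newer reader schema adds a City field with a default value.
reader_schema = {
    "type": "record",
    "name": "Customer",
    "fields": writer_schema["fields"]
    + [{"name": "City", "type": "string", "default": "Unknown"}],
}


def resolve(record, reader):
    """Simplified sketch: fill fields missing from old data using reader defaults."""
    resolved = {}
    for field in reader["fields"]:
        if field["name"] in record:
            resolved[field["name"]] = record[field["name"]]
        elif "default" in field:
            resolved[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for field {field['name']}")
    return resolved


# A record written under the old schema can still be read with the new one.
old_record = {"CustomerID": 1, "Name": "John"}
upgraded = resolve(old_record, reader_schema)

print(upgraded)
```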
6. Plain Text Files
Simple text files may also be used to store unstructured or loosely structured data.
Examples
- Log files
- Notes
- Raw exports
Exam Notes
- Usually treated as unstructured data
- Stored in Azure Blob Storage or Data Lake
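Because plain text has no built-in structure, any structure has to be imposed by the reader. The sketch below uses Python's standard `re` module on a made-up log excerpt; both the log lines and the timestamp pattern are assumptions for illustration, since real log layouts vary.

```python
import re

# Hypothetical log excerpt: two lines follow a pattern, one is free-form text.
log_text = """2024-01-01 12:00:00 INFO Service started
free-form note without a timestamp
2024-01-01 12:00:05 ERROR Disk almost full
"""

# Assumed layout: "<date> <time> <LEVEL> <message>".
pattern = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")

parsed = []
unparsed = []
for line in log_text.splitlines():
    match = pattern.match(line)
    if match:
        timestamp, level, message = match.groups()
        parsed.append({"timestamp": timestamp, "level": level, "message": message})
    else:
        # Nothing in the file guarantees that every line fits the pattern.
        unparsed.append(line)

print(len(parsed), len(unparsed))
```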
How These Formats Map to Data Types
This mapping is important for DP-900 questions:
| Format | Data Type |
|---|---|
| CSV | Structured |
| JSON | Semi-structured |
| XML | Semi-structured |
| Parquet | Structured (columnar, analytics-optimized) |
| Avro | Semi-structured |
| TXT | Unstructured |
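As a study aid, the mapping in the table can be expressed as a small lookup. This is only a sketch: the `classify` helper and the file names are hypothetical, and a file extension is a hint about the format, not a guarantee of what the file contains.

```python
from pathlib import PurePath

# The table above as a lookup from file extension to data category.
FORMAT_CATEGORY = {
    ".csv": "structured",
    ".json": "semi-structured",
    ".xml": "semi-structured",
    ".parquet": "structured",
    ".avro": "semi-structured",
    ".txt": "unstructured",
}


def classify(filename: str) -> str:
    """Return the exam-style category for a file name, or "unknown"."""
    suffix = PurePath(filename).suffix.lower()
    return FORMAT_CATEGORY.get(suffix, "unknown")


print(classify("sales.parquet"), classify("events.json"), classify("photo.png"))
```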
Where These Formats Are Stored in Azure
You’ll commonly see these formats stored in:
Azure Blob Storage
- Primary storage for files
- Supports all formats (CSV, JSON, Parquet, images, etc.)
- Used for unstructured and semi-structured data
Azure Data Lake Storage Gen2
- Built on Blob Storage
- Optimized for analytics
- Common for Parquet and Avro files
- Used with Azure Synapse and Azure Data Factory
Why This Matters for DP-900
On the exam, file formats typically appear in scenarios like:
- Choosing storage for CSV or JSON files
- Identifying formats used in analytics pipelines
- Recognizing Parquet in big data workloads
- Distinguishing structured vs semi-structured file types
You’re expected to understand purpose and characteristics, not internal file mechanics.
Summary — Exam-Relevant Takeaways
For DP-900, remember:
✔ CSV → structured, simple, text-based
✔ JSON / XML → semi-structured, flexible, self-describing
✔ Parquet → columnar, compressed, analytics-optimized
✔ Avro → binary, schema included, streaming-friendly
✔ TXT → unstructured
And:
- These formats are commonly stored in Azure Blob Storage or Azure Data Lake Storage
- Analytics formats (Parquet/Avro) are used with Azure Synapse and big data workloads
