Common formats for data files – The Data Community

This post is part of the DP-900: Microsoft Azure Data Fundamentals Exam Prep Hub.
This topic falls under these sections:
Describe core data concepts (25–30%)
   --> Identify options for data storage
      --> Describe common formats for data files

Each section has 10 practice questions (with answers and explanations) to help you solidify your knowledge of the material. The hub also offers two practice tests with 60 questions each, listed below the exam topics section.

In DP-900, Microsoft expects you to understand common data file formats, what type of data they typically store (structured, semi-structured, or unstructured), and why certain formats are used in analytics and Azure storage scenarios.

This topic connects directly to Azure Blob Storage, Azure Data Lake Storage, and analytics pipelines.

Why Data File Formats Matter

Data file formats define:

How data is organized inside a file
Whether the data is human-readable or binary
How efficiently it can be stored and queried
Which tools and services can process it

Choosing the right format impacts:

Performance
Storage cost
Analytics capabilities
Interoperability between systems

For DP-900, focus on understanding what each format is used for, not deep implementation details.

Common Data File Formats You Should Know

1. CSV (Comma-Separated Values)

CSV is one of the simplest and most widely used formats for structured data.

Key Characteristics

Plain text
Each row represents a record
Columns separated by commas (or other delimiters)
No embedded schema
Human readable

Example:

CustomerID,Name,City
1,John,Seattle
2,Maria,Austin

Typical Use Cases

Data exports and imports
Simple datasets
Spreadsheet interoperability

Exam Notes

Represents structured data
Lightweight and easy to move between systems
No native support for nested structures or data types (every value is stored as text)
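
The "no embedded schema" point is easy to see in code. Here is a minimal sketch using Python's standard csv module and the sample rows from the example above; the variable and function names are illustrative only.

```python
import csv
import io

# The sample rows from the CSV example, held in memory for illustration.
CSV_TEXT = """CustomerID,Name,City
1,John,Seattle
2,Maria,Austin
"""

def read_customers(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_customers(CSV_TEXT)

# CSV carries no type information, so every value arrives as a string.
print(type(rows[0]["CustomerID"]))

# The reader (not the file) decides that CustomerID should be an integer.
customer_ids = [int(row["CustomerID"]) for row in rows]
print(customer_ids)  # -> [1, 2]
```

Any typing, such as treating CustomerID as a number, has to be applied by whoever reads the file.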

2. JSON (JavaScript Object Notation)

JSON is the most common format for semi-structured data, especially in modern applications and APIs.

Key Characteristics

Key–value pairs
Supports nested objects and arrays
Self-describing
Human readable
Schema-on-read

Example:

{
  "CustomerID": 1,
  "Name": "John",
  "Orders": [
    { "OrderID": 100, "Amount": 50 }
  ]
}

Typical Use Cases

Web APIs
Application data
Azure Cosmos DB documents
Logs and telemetry

Exam Notes

Represents semi-structured data
Flexible schema
Commonly used with Azure Cosmos DB and Azure Data Lake
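
To make "nested" and "flexible schema" concrete, here is a small sketch that loads the sample document with Python's standard json module. The second document, including its Email field, is a made-up illustration and not part of the original example.

```python
import json

# The sample document from the JSON example.
JSON_TEXT = """
{
  "CustomerID": 1,
  "Name": "John",
  "Orders": [
    { "OrderID": 100, "Amount": 50 }
  ]
}
"""

customer = json.loads(JSON_TEXT)

# Unlike CSV, JSON distinguishes numbers from strings.
print(type(customer["CustomerID"]))

# Nested arrays and objects are reached by walking the structure.
order_total = sum(order["Amount"] for order in customer["Orders"])
print(order_total)  # -> 50

# Flexible schema: another document may add or omit fields (hypothetical data).
other = json.loads('{"CustomerID": 2, "Name": "Maria", "Email": "maria@example.com"}')
orders_for_other = other.get("Orders", [])  # missing field handled at read time
print(orders_for_other)  # -> []
```

Handling the missing Orders field when the data is read, rather than when it is written, is what schema-on-read means in practice.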

3. XML (Extensible Markup Language)

XML is another semi-structured format that uses tags to describe data.

Key Characteristics

Tag-based hierarchy
Supports nested structures
Human readable but verbose
Self-describing

Example:

<Customer>
  <CustomerID>1</CustomerID>
  <Name>John</Name>
</Customer>

Typical Use Cases

Legacy systems
Configuration files
Enterprise data exchange

Exam Notes

Semi-structured
Less common than JSON in modern Azure solutions
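
For comparison with JSON, here is a brief sketch that reads the sample XML with Python's standard xml.etree.ElementTree module. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The sample document from the XML example.
XML_TEXT = """<Customer>
  <CustomerID>1</CustomerID>
  <Name>John</Name>
</Customer>"""

root = ET.fromstring(XML_TEXT)

# Tags name the data, which is what makes the format self-describing.
customer = {child.tag: child.text for child in root}
print(root.tag)  # -> Customer

# Element text is always a string, so numeric values need converting.
customer_id = int(root.findtext("CustomerID"))
print(customer_id)  # -> 1
```

The opening and closing tags around every value are also why XML is more verbose than the equivalent JSON.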

4. Parquet

Parquet is a columnar, binary file format optimized for analytics workloads.

Key Characteristics

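Column-based storage (values from the same column are stored together)
Compresses well, because each column holds values of a single type
Not human readable
Fast for analytical queries that read only some of the columns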

Typical Use Cases

Big data analytics
Azure Synapse Analytics
Azure Data Lake Storage

Exam Notes

Used for large analytical datasets
Optimized for performance and storage efficiency
Common in modern data engineering pipelines
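
The benefit of column-based storage can be sketched without any Parquet library. The code below is a conceptual illustration in plain Python, not Parquet's actual on-disk layout: the sample rows are hypothetical, and run-length encoding stands in for the kind of encoding that columnar formats can apply to repetitive columns.

```python
from itertools import groupby

# Hypothetical sample rows (row-oriented, as they would appear in a CSV file).
rows = [
    {"CustomerID": 1, "City": "Seattle", "Amount": 50},
    {"CustomerID": 2, "City": "Seattle", "Amount": 75},
    {"CustomerID": 3, "City": "Austin", "Amount": 20},
    {"CustomerID": 4, "City": "Austin", "Amount": 30},
]

def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [record[name] for record in records] for name in records[0]}

def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total_amount = sum(columns["Amount"])
print(total_amount)  # -> 175

# Similar values sit next to each other, so they compress well.
city_runs = run_length_encode(columns["City"])
print(city_runs)  # -> [('Seattle', 2), ('Austin', 2)]
```

A row-oriented file would have to read every field of every row to compute the same total, which is the main reason analytical engines tend to prefer columnar formats.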

5. Avro

Avro is a row-based binary format designed for data serialization and streaming.

Key Characteristics

Compact binary format
Stores the schema (defined in JSON) alongside the data
Efficient for data movement
Not human readable

Typical Use Cases

Data pipelines
Event streaming
Big data ingestion

Exam Notes

Often used behind the scenes in ingestion and streaming pipelines (for example, Azure Event Hubs Capture writes Avro files)
Supports schema evolution
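
The idea of shipping the schema together with binary data can be sketched with a toy container. This is explicitly not the real Avro encoding (Avro uses its own container layout and variable-length integers); the layout, the function names, and the sample records below are all assumptions made for illustration.

```python
import json
import struct

# A JSON schema in the general style Avro uses (toy example, int fields only).
SCHEMA = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "CustomerID", "type": "int"},
        {"name": "Amount", "type": "int"},
    ],
}

def write_container(schema, records):
    """Toy layout: 4-byte header length, JSON schema, then packed int fields."""
    header = json.dumps(schema).encode("utf-8")
    body = b"".join(
        struct.pack("<i", record[field["name"]])
        for record in records
        for field in schema["fields"]
    )
    return struct.pack("<I", len(header)) + header + body

def read_container(data):
    """Recover the schema from the header, then use it to decode the records."""
    (header_len,) = struct.unpack_from("<I", data, 0)
    schema = json.loads(data[4:4 + header_len].decode("utf-8"))
    names = [field["name"] for field in schema["fields"]]
    body = data[4 + header_len:]
    record_size = 4 * len(names)  # each int field is packed into 4 bytes
    records = [
        dict(zip(names, struct.unpack_from(f"<{len(names)}i", body, offset)))
        for offset in range(0, len(body), record_size)
    ]
    return schema, records

original = [{"CustomerID": 1, "Amount": 50}, {"CustomerID": 2, "Amount": 75}]
blob = write_container(SCHEMA, original)
decoded_schema, decoded_records = read_container(blob)
print(decoded_records == original)  # -> True
```

Because the schema travels with the bytes, the reader needs no outside knowledge of the field names or types, while the records themselves stay compact and unreadable to humans.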

6. Plain Text Files

Simple text files may also be used to store unstructured or loosely structured data.

Examples

Log files
Notes
Raw exports

Exam Notes

Usually treated as unstructured data
Stored in Azure Blob Storage or Data Lake
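
"Unstructured" here means the file itself promises no layout, so any structure has to be imposed by the reader. Here is a minimal sketch, assuming a hypothetical log line format (date, time, level, message) that is not taken from any particular product:

```python
import re

# Hypothetical log content: two lines follow a pattern, one is free-form.
LOG_TEXT = """2024-01-15 10:32:01 ERROR Payment failed for order 100
2024-01-15 10:32:05 INFO Retry scheduled
free-form note without a timestamp
"""

# The pattern is the reader's assumption about the layout, not part of the file.
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)

def parse_line(line):
    """Return the named fields if the line matches, otherwise None."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None

parsed = [parse_line(line) for line in LOG_TEXT.splitlines()]
print(parsed[0]["level"])  # -> ERROR
print(parsed[2])           # -> None
```

Lines that do not fit the expected pattern simply fail to parse, which is the practical difference between plain text and a format such as CSV or JSON.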

How These Formats Map to Data Types

This mapping is important for DP-900 questions:

Format	Data Type
CSV	Structured
JSON	Semi-structured
XML	Semi-structured
Parquet	Structured (analytics-optimized)
Avro	Semi-structured
TXT	Unstructured
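
As a study aid, the table can be written as a simple lookup. This sketch assumes the conventional file extensions for each format; an extension is only a naming convention, so a real pipeline would not rely on it alone. The Parquet entry is shortened to "Structured".

```python
from pathlib import PurePosixPath

# Mapping from the table above, keyed by conventional file extension (assumed).
FORMAT_CATEGORY = {
    ".csv": "Structured",
    ".json": "Semi-structured",
    ".xml": "Semi-structured",
    ".parquet": "Structured",
    ".avro": "Semi-structured",
    ".txt": "Unstructured",
}

def classify(file_name):
    """Return the data category for a file name, or 'Unknown' if not listed."""
    suffix = PurePosixPath(file_name).suffix.lower()
    return FORMAT_CATEGORY.get(suffix, "Unknown")

print(classify("sales/2024/orders.parquet"))  # -> Structured
print(classify("telemetry.JSON"))             # -> Semi-structured
print(classify("photo.png"))                  # -> Unknown
```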

Where These Formats Are Stored in Azure

You’ll commonly see these formats stored in:

Azure Blob Storage

General-purpose object storage for files (blobs)
Supports all formats (CSV, JSON, Parquet, images, etc.)
Most often associated with unstructured and semi-structured data, but it can hold structured files such as CSV as well

Azure Data Lake Storage Gen2

Built on Blob Storage
Optimized for analytics
Common for Parquet and Avro files
Used with Azure Synapse and Azure Data Factory

Why This Matters for DP-900

On the exam, file formats typically appear in scenarios like:

Choosing storage for CSV or JSON files
Identifying formats used in analytics pipelines
Recognizing Parquet in big data workloads
Distinguishing structured vs semi-structured file types

You’re expected to understand purpose and characteristics, not internal file mechanics.

Summary — Exam-Relevant Takeaways

For DP-900, remember:

✔ CSV → structured, simple, text-based
✔ JSON / XML → semi-structured, flexible, self-describing
✔ Parquet → columnar, compressed, analytics-optimized
✔ Avro → binary, schema included, streaming-friendly
✔ TXT → unstructured

And:

These formats are commonly stored in Azure Blob Storage or Azure Data Lake Storage
Analytics formats (Parquet/Avro) are used with Azure Synapse and big data workloads

Go to the Practice Exam Questions for this topic.

Go to the DP-900 Exam Prep Hub main page.
