
Describe Common Formats for Data Files (DP-900 Exam Prep)

This post is a part of the DP-900: Microsoft Azure Data Fundamentals Exam Prep Hub. 
This topic falls under these sections:
Describe core data concepts (25–30%)
--> Identify options for data storage
--> Describe common formats for data files


Note that there are 10 practice questions (with answers and explanations) for each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available on the hub below the exam topics section.

In DP-900, Microsoft expects you to understand common data file formats, what type of data they typically store (structured, semi-structured, or unstructured), and why certain formats are used in analytics and Azure storage scenarios.

This topic connects directly to Azure Blob Storage, Azure Data Lake Storage, and analytics pipelines.


Why Data File Formats Matter

Data file formats define:

  • How data is organized inside a file
  • Whether the data is human-readable or binary
  • How efficiently it can be stored and queried
  • Which tools and services can process it

Choosing the right format impacts:

  • Performance
  • Storage cost
  • Analytics capabilities
  • Interoperability between systems

For DP-900, focus on understanding what each format is used for, not deep implementation details.


Common Data File Formats You Should Know

1. CSV (Comma-Separated Values)

CSV is one of the simplest and most widely used formats for structured data.

Key Characteristics

  • Plain text
  • Each row represents a record
  • Columns separated by commas (or other delimiters)
  • No embedded schema
  • Human readable

Example:

CustomerID,Name,City
1,John,Seattle
2,Maria,Austin

Typical Use Cases

  • Data exports and imports
  • Simple datasets
  • Spreadsheet interoperability

Exam Notes

  • Represents structured data
  • Lightweight and easy to move between systems
  • No native support for nested structures or data types (every value is read as text)
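
To make the "no embedded schema" point concrete, here is a minimal sketch that reads the sample CSV above with Python's standard csv module. The helper name read_customers is made up for illustration. It shows that every field arrives as a string, so the reader has to apply data types itself.

```python
import csv
import io

# Same sample data as the CSV example above.
CSV_TEXT = "CustomerID,Name,City\n1,John,Seattle\n2,Maria,Austin\n"


def read_customers(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_customers(CSV_TEXT)

# CSV carries no type information: "1" is a string until we convert it.
customer_ids = [int(row["CustomerID"]) for row in rows]

print(rows[0])
print(customer_ids)
```

The header row supplies column names only. Anything beyond that (integers, dates, required fields) is a decision made by whoever reads the file.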

2. JSON (JavaScript Object Notation)

JSON is one of the most common formats for semi-structured data, especially in modern applications and APIs.

Key Characteristics

  • Key–value pairs
  • Supports nested objects and arrays
  • Self-describing
  • Human readable
  • Schema-on-read

Example:

{
"CustomerID": 1,
"Name": "John",
"Orders": [
{ "OrderID": 100, "Amount": 50 }
]
}

Typical Use Cases

  • Web APIs
  • Application data
  • Azure Cosmos DB documents
  • Logs and telemetry

Exam Notes

  • Represents semi-structured data
  • Flexible schema
  • Commonly used with Azure Cosmos DB and Azure Data Lake
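
As a rough illustration of "self-describing" and "schema-on-read", the sketch below parses the sample document above with Python's standard json module. The Email field is a hypothetical one that the document does not contain; it is there only to show how a reader copes with a missing field.

```python
import json

# Same sample document as the JSON example above.
JSON_TEXT = """
{
  "CustomerID": 1,
  "Name": "John",
  "Orders": [
    { "OrderID": 100, "Amount": 50 }
  ]
}
"""

customer = json.loads(JSON_TEXT)

# Numbers keep their type, and the nested array becomes a Python list.
total_amount = sum(order["Amount"] for order in customer["Orders"])

# Schema-on-read: the reader decides what to do when a field is absent.
# "Email" is a hypothetical field used only for this illustration.
email = customer.get("Email", "unknown")

print(total_amount, email)
```

Unlike CSV, the field names travel with every record, and two records in the same file do not need to have the same fields.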

3. XML (Extensible Markup Language)

XML is another semi-structured format that uses tags to describe data.

Key Characteristics

  • Tag-based hierarchy
  • Supports nested structures
  • Human readable but verbose
  • Self-describing

Example:

<Customer>
<CustomerID>1</CustomerID>
<Name>John</Name>
</Customer>

Typical Use Cases

  • Legacy systems
  • Configuration files
  • Enterprise data exchange

Exam Notes

  • Semi-structured
  • Less common than JSON in modern Azure solutions
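
For comparison with JSON, here is a small sketch that reads the sample XML above using xml.etree.ElementTree from the Python standard library. As with CSV, element text comes back as a string, so the conversion to an integer is done by the reader.

```python
import xml.etree.ElementTree as ET

# Same sample document as the XML example above.
XML_TEXT = """
<Customer>
  <CustomerID>1</CustomerID>
  <Name>John</Name>
</Customer>
""".strip()

root = ET.fromstring(XML_TEXT)

# Tags describe the data; child elements are looked up by tag name.
customer_id = int(root.findtext("CustomerID"))
name = root.findtext("Name")

print(root.tag, customer_id, name)
```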

4. Parquet

Parquet is a columnar, binary file format optimized for analytics workloads.

Key Characteristics

  • Column-based storage
  • Highly compressed
  • Not human readable
  • Very fast for analytical queries, especially those that read only a few columns

Typical Use Cases

  • Big data analytics
  • Azure Synapse Analytics
  • Azure Data Lake Storage

Exam Notes

  • Used for large analytical datasets
  • Optimized for performance and storage efficiency
  • Common in modern data engineering pipelines
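
The idea behind column-based storage can be sketched in plain Python. This is not the Parquet file format and does not use a Parquet library; it only contrasts a row layout with a column layout, using made-up sample values.

```python
# Row-oriented layout: each record is kept together (like CSV or JSON).
# The values below are invented for illustration.
rows = [
    {"CustomerID": 1, "City": "Seattle", "Amount": 50},
    {"CustomerID": 2, "City": "Austin", "Amount": 75},
    {"CustomerID": 3, "City": "Seattle", "Amount": 20},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


# Column-oriented layout: all values of one column are kept together.
columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total = sum(columns["Amount"])

print(columns["City"])
print(total)
```

Keeping similar values next to each other is also what makes columnar files compress well, and it lets a query such as "total Amount" skip the columns it does not need.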

5. Avro

Avro is a binary format designed for data serialization and streaming.

Key Characteristics

  • Compact binary format
  • Includes schema with the data
  • Efficient for data movement
  • Not human readable

Typical Use Cases

  • Data pipelines
  • Event streaming
  • Big data ingestion

Exam Notes

  • Often used behind the scenes in analytics platforms
  • Supports schema evolution
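
"Includes schema with the data" can be illustrated with a toy container built from the Python standard library. This is only a sketch of the idea and is not the real Avro encoding: the layout (a length-prefixed JSON schema followed by fixed-width integers), the function names, and the sample values are all assumptions made for the example.

```python
import json
import struct

# Hypothetical schema for this illustration; every field is a 32-bit integer.
SCHEMA = {
    "name": "Order",
    "fields": [
        {"name": "OrderID", "type": "int"},
        {"name": "Amount", "type": "int"},
    ],
}


def write_container(schema, records):
    """Pack the schema (as JSON) and the records (as binary) into one blob."""
    names = [field["name"] for field in schema["fields"]]
    record_format = "<" + "i" * len(names)
    header = json.dumps(schema).encode("utf-8")
    body = b"".join(
        struct.pack(record_format, *(record[name] for name in names))
        for record in records
    )
    # 4-byte header length, then the schema, then the binary records.
    return struct.pack("<I", len(header)) + header + body


def read_container(blob):
    """Recover the schema from the blob and use it to decode the records."""
    (header_length,) = struct.unpack_from("<I", blob, 0)
    schema = json.loads(blob[4:4 + header_length].decode("utf-8"))
    names = [field["name"] for field in schema["fields"]]
    record_format = "<" + "i" * len(names)
    record_size = struct.calcsize(record_format)
    records = []
    for offset in range(4 + header_length, len(blob), record_size):
        values = struct.unpack_from(record_format, blob, offset)
        records.append(dict(zip(names, values)))
    return schema, records


blob = write_container(SCHEMA, [{"OrderID": 100, "Amount": 50}])
schema, records = read_container(blob)
print(records)
```

The reader needs nothing except the file itself: the field names come from the embedded schema, while the records stay compact because they do not repeat those names the way JSON does.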

6. Plain Text Files

Simple text files may also be used to store unstructured or loosely structured data.

Examples

  • Log files
  • Notes
  • Raw exports

Exam Notes

  • Usually treated as unstructured data
  • Stored in Azure Blob Storage or Data Lake
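
Because plain text has no predefined structure, any structure has to be imposed by the reader. The sketch below assumes a hypothetical "date time LEVEL message" log layout and invented log lines; lines that do not match the pattern are simply left unparsed.

```python
import re

# Invented log lines; the last one has no recognizable structure.
LOG_TEXT = """\
2024-01-15 10:02:11 INFO Service started
2024-01-15 10:02:15 ERROR Could not reach database
free-form note without a timestamp
"""

# Assumed layout: date, time, level, then the rest of the line as the message.
LINE_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.*)$")


def parse_line(line):
    """Return a dict for lines matching the assumed layout, else None."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    date, time, level, message = match.groups()
    return {"date": date, "time": time, "level": level, "message": message}


parsed = [parse_line(line) for line in LOG_TEXT.splitlines()]
errors = [entry for entry in parsed if entry and entry["level"] == "ERROR"]

print(errors)
```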

How These Formats Map to Data Types

This mapping is important for DP-900 questions:

Format     Data Type
CSV        Structured
JSON       Semi-structured
XML        Semi-structured
Parquet    Structured / Analytics
Avro       Semi-structured
TXT        Unstructured
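
As a study aid, the mapping above can be written as a simple lookup by file extension. This is only a sketch: the function name data_type_for and the sample file names are made up, and a file's extension is a convention, not a guarantee of what the file contains.

```python
import os

# The mapping from the table above, keyed by the usual file extension.
FORMAT_TO_DATA_TYPE = {
    ".csv": "Structured",
    ".json": "Semi-structured",
    ".xml": "Semi-structured",
    ".parquet": "Structured / Analytics",
    ".avro": "Semi-structured",
    ".txt": "Unstructured",
}


def data_type_for(filename):
    """Look up the data type category for a file name by its extension."""
    _, extension = os.path.splitext(filename.lower())
    return FORMAT_TO_DATA_TYPE.get(extension, "Unknown")


for name in ["sales.csv", "events.avro", "orders.parquet", "photo.png"]:
    print(name, "->", data_type_for(name))
```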

Where These Formats Are Stored in Azure

You’ll commonly see these formats stored in:

Azure Blob Storage

  • Primary storage for files
  • Supports all formats (CSV, JSON, Parquet, images, etc.)
  • Typically used for unstructured and semi-structured data, though it can hold any file type

Azure Data Lake Storage Gen2

  • Built on Blob Storage
  • Optimized for analytics
  • Common for Parquet and Avro files
  • Used with Azure Synapse and Azure Data Factory

Why This Matters for DP-900

On the exam, file formats typically appear in scenarios like:

  • Choosing storage for CSV or JSON files
  • Identifying formats used in analytics pipelines
  • Recognizing Parquet in big data workloads
  • Distinguishing structured vs semi-structured file types

You’re expected to understand purpose and characteristics, not internal file mechanics.


Summary — Exam-Relevant Takeaways

For DP-900, remember:

✔ CSV → structured, simple, text-based
✔ JSON / XML → semi-structured, flexible, self-describing
✔ Parquet → columnar, compressed, analytics-optimized
✔ Avro → binary, schema included, streaming-friendly
✔ TXT → unstructured

And:

  • These formats are commonly stored in Azure Blob Storage or Azure Data Lake Storage
  • Analytics formats (Parquet/Avro) are used with Azure Synapse and big data workloads
