This post is a part of the DP-900: Microsoft Azure Data Fundamentals Exam Prep Hub.
This topic falls under these sections:
Describe core data concepts (25–30%)
--> Describe ways to represent data
--> Describe features of semi-structured data
Note that there are 10 practice questions (with answers and explanations) for each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available on the hub below the exam topics section.
Introduction
Semi-structured data sits between structured and unstructured data. For the DP-900 exam, you’re expected to understand what it is, how it’s organized, and why Azure provides specialized services to store and query it.
What Is Semi-Structured Data?
Semi-structured data does not follow a rigid, tabular schema like relational data, but it still contains organizational markers or tags that make it partially structured and machine-readable.
Unlike structured data (rows and columns), semi-structured data:
- Does not require a predefined schema
- Can vary in shape from record to record
- Still contains self-describing elements such as key–value pairs or hierarchical structures
In other words, semi-structured data has some structure, just not a fixed table layout.
Common examples include:
- JSON documents
- XML files
- YAML
- Avro / Parquet (used in analytics pipelines)
Key Features of Semi-Structured Data
1. Schema-on-Read (Not Schema-on-Write)
One of the most important characteristics of semi-structured data is schema-on-read.
This means:
- Data is stored without enforcing a strict schema
- Structure is interpreted when the data is queried or analyzed
This contrasts with structured data, which uses schema-on-write, where structure must be defined before data is inserted.
For DP-900, remember:
Semi-structured data is flexible at ingestion time and structured at query time.
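The contrast between schema-on-write and schema-on-read can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `readings` table, the sensor records, and the `read_temperatures` helper are hypothetical, and an in-memory SQLite database stands in for any relational store.

```python
import json
import sqlite3

# Schema-on-write: the table definition is fixed before any row is inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id TEXT, temperature REAL)")
conn.execute("INSERT INTO readings VALUES (?, ?)", ("sensor-1", 72.0))

try:
    # 'humidity' is not part of the schema, so the database rejects the write.
    conn.execute(
        "INSERT INTO readings (device_id, humidity) VALUES (?, ?)",
        ("sensor-2", 40),
    )
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True

# Schema-on-read: raw JSON lines are stored as-is, whatever their shape.
raw_lines = [
    '{"device_id": "sensor-1", "temperature": 72}',
    '{"device_id": "sensor-2", "humidity": 40}',
]


def read_temperatures(lines):
    """Apply a schema at query time: keep only records with a temperature."""
    result = []
    for line in lines:
        record = json.loads(line)
        if "temperature" in record:
            result.append((record["device_id"], float(record["temperature"])))
    return result


temperatures = read_temperatures(raw_lines)
print(write_rejected, temperatures)
```

Both JSON records were accepted at ingestion time, and the decision about which fields matter was deferred until the data was read.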
2. Flexible and Evolving Structure
Each record in a semi-structured dataset can contain:
- Different fields
- Nested objects
- Optional attributes
Example (JSON):
{
  "CustomerID": 123,
  "Name": "Sarah",
  "Orders": [
    { "OrderID": 1, "Amount": 50 },
    { "OrderID": 2, "Amount": 75 }
  ]
}
Another record in the same dataset might include extra fields like Email or omit Orders entirely.
This flexibility makes semi-structured data ideal for:
- Application telemetry
- IoT data
- User activity logs
- Rapidly changing systems
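Code that consumes records like these usually has to tolerate missing and extra fields. The following is a minimal sketch, assuming a second, hypothetical customer record that carries an `Email` but no `Orders`; the `summarize` helper is likewise made up for illustration.

```python
import json

# Two records from the same dataset with different shapes.
# The second record (CustomerID 124) is a hypothetical example.
records_json = """
[
  {"CustomerID": 123, "Name": "Sarah",
   "Orders": [{"OrderID": 1, "Amount": 50}, {"OrderID": 2, "Amount": 75}]},
  {"CustomerID": 124, "Name": "Luis", "Email": "luis@example.com"}
]
"""


def summarize(record):
    """Build a uniform summary, treating Email and Orders as optional."""
    orders = record.get("Orders", [])  # a missing list is treated as empty
    return {
        "CustomerID": record["CustomerID"],
        "Email": record.get("Email"),  # None when the field is absent
        "OrderTotal": sum(order["Amount"] for order in orders),
    }


summaries = [summarize(record) for record in json.loads(records_json)]
for summary in summaries:
    print(summary)
```

Using `dict.get` with a default is one common way to handle optional attributes without defining a schema up front.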
3. Hierarchical or Nested Organization
Semi-structured data often uses hierarchies rather than flat tables.
For example:
- JSON objects inside objects
- XML elements within elements
This nested design lets a single document hold related data, such as a customer together with their orders, that a relational system would typically split across multiple tables.
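To make the comparison with relational systems concrete, here is a rough sketch of how one nested customer document could be split into rows for two tables. The table layout and the `to_relational_rows` helper are assumptions chosen for illustration, not a prescribed mapping.

```python
# One nested document: the customer and their orders live together.
document = {
    "CustomerID": 123,
    "Name": "Sarah",
    "Orders": [
        {"OrderID": 1, "Amount": 50},
        {"OrderID": 2, "Amount": 75},
    ],
}


def to_relational_rows(doc):
    """Split a document into a customer row and order rows.

    Each order row repeats CustomerID, playing the role of a foreign key.
    """
    customer_row = (doc["CustomerID"], doc["Name"])
    order_rows = [
        (order["OrderID"], doc["CustomerID"], order["Amount"])
        for order in doc.get("Orders", [])
    ]
    return customer_row, order_rows


customer_row, order_rows = to_relational_rows(document)
print(customer_row)
print(order_rows)
```

The document keeps the relationship implicit through nesting, while the relational form needs a second table and a repeated key to express the same thing.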
4. Self-Describing Format
Semi-structured data embeds its own metadata using:
- Keys
- Tags
- Field names
This makes the data self-describing, meaning applications can understand what each value represents without relying on an external schema definition.
Example:
"Temperature": 72
The key itself describes the value.
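Because the names travel with the values, a program can discover the fields by reading the data alone. The sketch below assumes a small hypothetical reading, expressed once as JSON and once as XML; the `Unit` field and the `Reading` element are invented for the example.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical reading in two self-describing formats.
json_text = '{"Temperature": 72, "Unit": "F"}'
xml_text = "<Reading><Temperature>72</Temperature><Unit>F</Unit></Reading>"

# JSON: the keys name the values.
json_fields = list(json.loads(json_text).keys())

# XML: the child element tags name the values.
xml_fields = [child.tag for child in ET.fromstring(xml_text)]

print(json_fields)
print(xml_fields)
```

Neither parse needed a separate schema file; the field names came from the data.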
5. Easily Transported Across Systems
Semi-structured formats such as JSON and XML are:
- Human readable
- Platform independent
- Widely supported across APIs and applications
This is why most modern web services exchange data using JSON.
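As a small illustration of that portability, the sketch below serializes a record to JSON text, encodes it as bytes the way it might travel over a network, and parses it back. The `reading` record is hypothetical, and no real network call is made.

```python
import json

# A hypothetical telemetry record with a nested list.
reading = {"DeviceID": "sensor-1", "Temperature": 72, "Tags": ["lab", "floor-2"]}

# Sender side: serialize to JSON text, then encode as UTF-8 bytes.
payload = json.dumps(reading).encode("utf-8")

# Receiver side: decode and parse; any language with a JSON parser can do this.
received = json.loads(payload.decode("utf-8"))

print(received == reading)
```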
Common Formats of Semi-Structured Data
You should recognize these for DP-900:
| Format | Description |
|---|---|
| JSON | Most common format for APIs and applications |
| XML | Tag-based hierarchical format |
| YAML | Human-friendly configuration format |
| Avro / Parquet | Binary formats used in analytics pipelines (Avro is row-based; Parquet is columnar) |
Where Semi-Structured Data Is Used in Azure
Microsoft Azure provides specialized services designed to handle semi-structured data:
Azure Cosmos DB
- Stores JSON documents
- Supports schema-less designs
- Designed for globally distributed applications
- Optimized for flexible data models
Azure Data Lake Storage
- Stores large volumes of semi-structured files
- Used in analytics pipelines
- Often paired with Azure Synapse Analytics or Azure Data Factory
These services are well suited to workloads where structure changes frequently or cannot be fully defined in advance.
Why Semi-Structured Data Matters for DP-900
Understanding semi-structured data helps you:
- Distinguish it from relational (structured) data
- Identify appropriate Azure services (especially Cosmos DB)
- Understand modern application and analytics architectures
On the exam, you’ll typically see semi-structured data appear in scenarios involving:
- JSON documents
- Application telemetry
- IoT data
- Log files
Structured vs Semi-Structured (Quick Comparison)
| Structured | Semi-Structured |
|---|---|
| Fixed schema | Flexible schema |
| Rows and columns | Documents / nested objects |
| Schema-on-write | Schema-on-read |
| SQL databases | Document databases |
| Highly consistent | Shape varies by record |
Summary — Exam-Relevant Takeaways
For DP-900, remember:
✔ Semi-structured data has no fixed schema
✔ Uses schema-on-read
✔ Supports nested and hierarchical structures
✔ Common formats: JSON, XML
✔ Often stored in Azure Cosmos DB or Data Lake
✔ Ideal for rapidly changing or document-based data
