Describe Features of Semi-Structured Data (DP-900 Exam Prep)

This post is a part of the DP-900: Microsoft Azure Data Fundamentals Exam Prep Hub. 
This topic falls under these sections:
Describe core data concepts (25–30%)
--> Describe ways to represent data
--> Describe features of semi-structured data


Note that there are 10 practice questions (with answers and explanations) for each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available on the hub below the exam topics section.

Introduction

Semi-structured data sits between structured and unstructured data. For the DP-900 exam, you’re expected to understand what it is, how it’s organized, and why Azure provides specialized services to store and query it.


What Is Semi-Structured Data?

Semi-structured data is data that does not follow a rigid, tabular schema like relational data, but still contains organizational markers or tags that make it partially structured and machine readable.

Unlike structured data (rows and columns), semi-structured data:

  • Does not require a predefined schema
  • Can vary in shape from record to record
  • Still contains self-describing elements such as key–value pairs or hierarchical structures

In other words, semi-structured data has some structure — just not fixed tables.

Common examples include:

  • JSON documents
  • XML files
  • YAML
  • Avro / Parquet (used in analytics pipelines)

Key Features of Semi-Structured Data

1. Schema-on-Read (Not Schema-on-Write)

One of the most important characteristics of semi-structured data is schema-on-read.

This means:

  • Data is stored without enforcing a strict schema
  • Structure is interpreted when the data is queried or analyzed

This contrasts with structured data, which uses schema-on-write, where structure must be defined before data is inserted.

For DP-900, remember:

Semi-structured data is flexible at ingestion time and structured at query time.
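A minimal sketch of schema-on-read, using only Python's standard library. The device names, values, and the read_temperatures helper are made up for illustration; the point is that nothing checks the shape of a record when it is stored, and the query decides which fields matter when it reads.

```python
import json

# Raw records are stored as-is, one JSON document per line.
# Nothing validates their shape at write time. Sample values are hypothetical.
raw_store = [
    '{"device": "sensor-1", "temperature": 72}',
    '{"device": "sensor-2", "temperature": 68, "humidity": 40}',
    '{"device": "sensor-3"}',
]

def read_temperatures(lines):
    """Apply structure at read time: keep only the fields this query needs."""
    rows = []
    for line in lines:
        doc = json.loads(line)
        # .get() returns None when a record has no temperature field.
        rows.append((doc["device"], doc.get("temperature")))
    return rows

temperatures = read_temperatures(raw_store)
print(temperatures)
```

A different query could read the same stored lines and pull out humidity instead, without any change to how the data was written.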


2. Flexible and Evolving Structure

Each record in a semi-structured dataset can contain:

  • Different fields
  • Nested objects
  • Optional attributes

Example (JSON):

{
  "CustomerID": 123,
  "Name": "Sarah",
  "Orders": [
    { "OrderID": 1, "Amount": 50 },
    { "OrderID": 2, "Amount": 75 }
  ]
}

Another record in the same dataset might include extra fields like Email or omit Orders entirely.
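As a rough illustration of how code copes with records of different shapes, the sketch below reads the customer above alongside a second, hypothetical record (customer 124 is invented here) that adds Email and omits Orders. The summarize helper is also an assumption, not part of any Azure SDK.

```python
import json

# First record matches the example above; the second is a made-up record
# that has an Email field and no Orders field.
records = json.loads("""
[
  {"CustomerID": 123, "Name": "Sarah",
   "Orders": [{"OrderID": 1, "Amount": 50}, {"OrderID": 2, "Amount": 75}]},
  {"CustomerID": 124, "Name": "Raj", "Email": "raj@example.com"}
]
""")

def summarize(record):
    """Build a uniform summary, supplying defaults for fields that are absent."""
    orders = record.get("Orders", [])       # missing Orders -> empty list
    return {
        "CustomerID": record["CustomerID"],
        "Email": record.get("Email"),       # missing Email -> None
        "OrderTotal": sum(order["Amount"] for order in orders),
    }

summaries = [summarize(r) for r in records]
for summary in summaries:
    print(summary)
```

Both records live in the same dataset, and the reader handles the optional fields rather than the storage layer rejecting them.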

This flexibility makes semi-structured data ideal for:

  • Application telemetry
  • IoT data
  • User activity logs
  • Rapidly changing systems

3. Hierarchical or Nested Organization

Semi-structured data often uses hierarchies rather than flat tables.

For example:

  • JSON objects inside objects
  • XML elements within elements

This nested design allows complex relationships to exist inside a single document — something that would require multiple tables in relational systems.
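To make the comparison with relational tables concrete, here is a small sketch that splits the nested customer document from earlier into the rows two tables would hold. The to_relational helper and the column order are assumptions chosen for illustration.

```python
# The nested document from the earlier JSON example.
customer_doc = {
    "CustomerID": 123,
    "Name": "Sarah",
    "Orders": [
        {"OrderID": 1, "Amount": 50},
        {"OrderID": 2, "Amount": 75},
    ],
}

def to_relational(doc):
    """Split one nested document into a customer row and its order rows."""
    customer_row = (doc["CustomerID"], doc["Name"])
    # Each order row carries CustomerID as a foreign key back to the customer.
    order_rows = [
        (order["OrderID"], doc["CustomerID"], order["Amount"])
        for order in doc.get("Orders", [])
    ]
    return customer_row, order_rows

customer_row, order_rows = to_relational(customer_doc)
print(customer_row)
print(order_rows)
```

One document becomes one row in a customers table plus two rows in an orders table, linked by CustomerID.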


4. Self-Describing Format

Semi-structured data embeds its own metadata using:

  • Keys
  • Tags
  • Field names

This makes the data self-describing, meaning applications can understand what each value represents without relying on an external schema definition.

Example:

"Temperature": 72

The key itself describes the value.
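A short sketch of what self-describing means in practice: a program can discover the field names from the data itself, whether they arrive as JSON keys or XML tags. The Unit field and the Reading element are hypothetical additions to the Temperature example.

```python
import json
import xml.etree.ElementTree as ET

# The same reading in two self-describing formats.
# "Unit" and "Reading" are made up for this illustration.
json_text = '{"Temperature": 72, "Unit": "F"}'
xml_text = "<Reading><Temperature>72</Temperature><Unit>F</Unit></Reading>"

# JSON: field names are the keys of the parsed object.
json_fields = list(json.loads(json_text).keys())

# XML: field names are the tags of the child elements.
xml_fields = [child.tag for child in ET.fromstring(xml_text)]

print(json_fields)
print(xml_fields)
```

Neither parser was told in advance which fields to expect; the names come from the documents.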


5. Easily Transported Across Systems

Semi-structured formats such as JSON and XML are:

  • Human readable
  • Platform independent
  • Widely supported across APIs and applications

This is one reason most modern web services exchange data using JSON.
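The sketch below shows the kind of round trip that makes JSON easy to move between systems: an in-memory object is serialized to text, encoded as bytes (as it would be for a network call), and rebuilt on the other side. The reading itself is invented sample data, and no real network call is made.

```python
import json

# Hypothetical sample reading held in memory by the sending application.
reading = {"Device": "sensor-1", "Temperature": 72, "Tags": ["lab", "floor-2"]}

# Sender: serialize to JSON text, then to UTF-8 bytes for transport.
payload = json.dumps(reading)
wire_bytes = payload.encode("utf-8")

# Receiver: decode the bytes and parse the text back into an object.
received = json.loads(wire_bytes.decode("utf-8"))

print(payload)
print(received == reading)
```

The receiver could be written in any language with a JSON parser, which is what platform independence means here.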


Common Formats of Semi-Structured Data

You should recognize these for DP-900:

Format         | Description
JSON           | Most common format for APIs and applications
XML            | Tag-based hierarchical format
YAML           | Human-friendly configuration format
Avro / Parquet | Binary formats used in analytics pipelines (Avro is row-oriented, Parquet is columnar)

Where Semi-Structured Data Is Used in Azure

Microsoft Azure provides specialized services designed to handle semi-structured data:

Azure Cosmos DB

  • Stores JSON documents
  • Supports schema-less designs
  • Designed for globally distributed applications
  • Optimized for flexible data models

Azure Data Lake Storage

  • Stores large volumes of semi-structured files
  • Used in analytics pipelines
  • Often paired with Azure Synapse Analytics or Azure Data Factory

These services are built specifically for workloads where structure changes frequently or cannot be fully defined in advance.


Why Semi-Structured Data Matters for DP-900

Understanding semi-structured data helps you:

  • Distinguish it from relational (structured) data
  • Identify appropriate Azure services (especially Cosmos DB)
  • Understand modern application and analytics architectures

On the exam, you’ll typically see semi-structured data appear in scenarios involving:

  • JSON documents
  • Application telemetry
  • IoT data
  • Log files

Structured vs Semi-Structured (Quick Comparison)

Structured        | Semi-Structured
Fixed schema      | Flexible schema
Rows and columns  | Documents / nested objects
Schema-on-write   | Schema-on-read
SQL databases     | Document databases
Highly consistent | Shape varies by record
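The contrast in the comparison above can be sketched with Python's built-in sqlite3 module standing in for a SQL database and a plain list standing in for a document store. Both stand-ins are assumptions for illustration, not Azure services, and customer 124 is made-up data.

```python
import json
import sqlite3

# Structured side: the table's columns are fixed before any data is inserted.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (CustomerID INTEGER PRIMARY KEY, Name TEXT NOT NULL)"
)
conn.execute("INSERT INTO customers (CustomerID, Name) VALUES (?, ?)", (123, "Sarah"))

# A row with a field the schema does not define is rejected at write time.
try:
    conn.execute(
        "INSERT INTO customers (CustomerID, Name, Email) VALUES (?, ?, ?)",
        (124, "Raj", "raj@example.com"),
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Semi-structured side: documents of different shapes are stored side by side.
documents = [
    json.dumps({"CustomerID": 123, "Name": "Sarah"}),
    json.dumps({"CustomerID": 124, "Name": "Raj", "Email": "raj@example.com"}),
]

print("SQL insert rejected:", rejected)
print("Documents stored:", len(documents))
```

Adding Email on the structured side would first require changing the table definition, while the document side accepts the new field immediately.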

Summary — Exam-Relevant Takeaways

For DP-900, remember:

✔ Semi-structured data has no fixed schema
✔ Uses schema-on-read
✔ Supports nested and hierarchical structures
✔ Common formats: JSON, XML
✔ Often stored in Azure Cosmos DB or Data Lake
✔ Ideal for rapidly changing or document-based data


