This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub.
This topic falls under these sections:
Implement information extraction solutions (10–15%)
--> Build retrieval and grounding pipelines
--> Configure RAG ingestion flow, including documents and using OCR
Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
For the AI-103: Develop AI Apps and Agents on Azure certification exam, one of the critical topics within Build retrieval and grounding pipelines is understanding how to configure a Retrieval-Augmented Generation (RAG) ingestion flow.
Modern AI applications and agents depend heavily on RAG architectures to:
- Retrieve enterprise data
- Ground AI responses
- Reduce hallucinations
- Provide current and trusted information
A major part of this process involves:
- Ingesting documents
- Extracting content
- Applying OCR
- Enriching data
- Creating searchable indexes
- Supporting semantic and vector retrieval
Understanding how these components work together is essential for the AI-103 exam.
What Is Retrieval-Augmented Generation (RAG)?
RAG combines:
- Information retrieval
- External knowledge sources
- Large Language Models (LLMs)
Instead of relying solely on model training data, a RAG system retrieves relevant enterprise content during inference.
Why RAG Matters
Without RAG:
- AI models may hallucinate
- Responses may be outdated
- Enterprise knowledge is inaccessible
- Answers may lack grounding
With RAG:
- Responses are grounded in real documents
- AI can use private organizational data
- Retrieval improves factual accuracy
- Answers become more trustworthy
High-Level RAG Architecture
A common RAG architecture looks like this:
Enterprise Documents → Ingestion Pipeline → OCR / Enrichment → Chunking → Embedding Generation → Vector Index → Retrieval → LLM Prompt → Grounded Response
This workflow appears frequently in AI-103 scenarios.
Core Azure Services Used
Several Azure services commonly appear in RAG ingestion architectures.
| Service | Purpose |
|---|---|
| Azure AI Search | Indexing, retrieval, vector search |
| Azure OpenAI Service | Embeddings and generative AI |
| Azure AI Vision | OCR and image analysis |
| Azure AI Document Intelligence | Layout extraction and document processing |
| Azure Blob Storage | Document storage |
| Azure Functions | Workflow automation and custom processing |
| Azure AI Foundry | AI orchestration and agent workflows |
Understanding the RAG Ingestion Flow
The ingestion flow prepares enterprise data for retrieval and grounding.
Core stages include:
- Document ingestion
- Content extraction
- OCR processing
- AI enrichment
- Chunking
- Embedding generation
- Indexing
Step 1: Document Ingestion
What Is Document Ingestion?
Document ingestion imports content into the retrieval pipeline.
Common sources:
- PDFs
- Word documents
- PowerPoint files
- HTML pages
- Scanned images
- Emails
- Knowledge base articles
- SharePoint repositories
Common Storage Locations
Many Azure architectures store documents in:
- Azure Blob Storage
- Azure Data Lake Storage
- SharePoint
- SQL databases
Blob Storage is especially common in AI-103 examples.
Step 2: Extracting Content
Documents may contain:
- Plain text
- Tables
- Images
- Scanned pages
- Handwriting
- Multi-column layouts
The extraction process converts raw files into machine-readable content.
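Which extraction path a file takes usually depends on its type. The following is a minimal sketch of that routing decision; the extension sets and the path names (`ocr`, `text`, `pdf-check`) are illustrative assumptions, not part of any Azure API.

```python
from pathlib import Path

# Hypothetical routing table: file types that already carry machine-readable
# text versus types that are images and therefore need OCR.
TEXT_EXTENSIONS = {".txt", ".md", ".html", ".docx", ".pptx"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def choose_extraction_path(filename: str) -> str:
    """Return the name of the extraction path for a file, based on its extension."""
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        return "ocr"
    if ext == ".pdf":
        # A PDF may hold embedded text or only scanned page images,
        # so it needs a content check before choosing OCR.
        return "pdf-check"
    if ext in TEXT_EXTENSIONS:
        return "text"
    return "unsupported"


for name in ["handbook.docx", "receipt.JPG", "contract.pdf", "archive.zip"]:
    print(name, "->", choose_extraction_path(name))
```

Routing like this keeps OCR, which is comparatively slow and costly, away from files that already contain text.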
Structured vs Unstructured Documents
| Structured | Unstructured |
|---|---|
| Databases | PDFs |
| CSV files | Emails |
| Tables | Scanned forms |
| JSON | Images |
RAG pipelines often focus on unstructured data.
Step 3: OCR Processing
What Is OCR?
OCR stands for Optical Character Recognition.
OCR extracts text from:
- Scanned PDFs
- Photos
- Screenshots
- Whiteboards
- Forms
- Image-based documents
This is one of the most heavily tested concepts in AI-103 information extraction topics.
Why OCR Is Important in RAG
Many enterprise documents are scanned images rather than machine-readable text.
Without OCR:
- The content cannot be searched
- Embeddings cannot be generated
- Retrieval becomes impossible
OCR converts images into searchable text.
OCR Workflow
Scanned PDF → OCR Processing → Extracted Text → Chunking → Embeddings → Search Index
Azure AI Vision OCR
Azure AI Vision provides OCR capabilities that can:
- Detect printed text
- Detect handwritten text
- Support multiple languages
- Extract text coordinates
Common outputs:
- Lines
- Words
- Bounding boxes
- Confidence scores
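Confidence scores are useful for filtering out noisy recognitions before the text is chunked and embedded. The sketch below assumes a simplified, hypothetical `OcrWord` shape (text, confidence, bounding box); it does not call the Azure AI Vision SDK, whose actual response objects differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrWord:
    """Hypothetical, simplified stand-in for one recognized word."""
    text: str
    confidence: float                      # 0.0 to 1.0
    bounding_box: Tuple[int, int, int, int]  # (x, y, width, height)


def join_confident_words(words: List[OcrWord], min_confidence: float = 0.8) -> str:
    """Keep only words at or above the confidence threshold and join them."""
    kept = [w.text for w in words if w.confidence >= min_confidence]
    return " ".join(kept)


words = [
    OcrWord("Invoice", 0.99, (10, 10, 80, 20)),
    OcrWord("#", 0.41, (95, 10, 10, 20)),   # low-confidence noise
    OcrWord("1042", 0.95, (110, 10, 50, 20)),
]
print(join_confident_words(words))  # Invoice 1042
```

The 0.8 threshold is an arbitrary starting point; a suitable value depends on scan quality and on how costly a missed word is.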
OCR in Azure AI Search Skillsets
OCR is commonly integrated directly into:
- Azure AI Search indexers
- Skillsets
Typical flow:
Blob Storage → Indexer → OCR Skill → Search Index
Step 4: AI Enrichment
After OCR or extraction, AI enrichment improves the content.
Common enrichment steps:
- Language detection
- Entity recognition
- Key phrase extraction
- Sentiment analysis
- Image tagging
- Translation
These enrichments improve:
- Retrieval quality
- Metadata
- Semantic search
- Grounding accuracy
Skillsets in Azure AI Search
A skillset is a pipeline of AI enrichment operations.
Example:
OCR Skill → Entity Recognition → Key Phrase Extraction → Embedding Generation
Skillsets are a core AI-103 topic.
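A skillset is defined as JSON and attached to an indexer. The sketch below builds such a definition as a Python dictionary with an OCR skill followed by a key phrase extraction skill. The skillset name and the `ocrText` target name are made up for illustration, and the property names should be checked against the current Azure AI Search REST reference before use.

```python
import json

# Illustrative skillset body; "rag-ocr-skillset" and "ocrText" are assumed names.
skillset = {
    "name": "rag-ocr-skillset",
    "description": "OCR scanned pages, then extract key phrases from the text",
    "skills": [
        {
            # Runs once per image that the indexer extracts from a document.
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "context": "/document/normalized_images/*",
            "defaultLanguageCode": "en",
            "detectOrientation": True,
            "inputs": [
                {"name": "image", "source": "/document/normalized_images/*"}
            ],
            "outputs": [{"name": "text", "targetName": "ocrText"}],
        },
        {
            # Consumes the OCR output produced by the previous skill.
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "context": "/document/normalized_images/*",
            "inputs": [
                {"name": "text", "source": "/document/normalized_images/*/ocrText"}
            ],
            "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
        },
    ],
}

print(json.dumps(skillset, indent=2))
```

The second skill's input path points at the first skill's output, which is how skills are chained inside a skillset.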
Step 5: Chunking Documents
Why Chunking Is Necessary
Large documents can exceed the token limits of both embedding models and LLMs.
Chunking divides documents into smaller pieces.
Benefits:
- Better retrieval precision
- Improved embedding quality
- More accurate grounding
- Reduced token usage
Chunking Strategies
Fixed-Size Chunking
Example:
500-token chunks
Semantic Chunking
Split by:
- Sections
- Headings
- Paragraphs
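As a rough illustration of splitting on structure rather than on a fixed size, the sketch below splits text at blank lines and packs whole paragraphs into chunks up to a character budget. The function name and the `max_chars` parameter are assumptions for this example; production pipelines typically count tokens rather than characters.

```python
import re
from typing import List


def split_by_paragraph(text: str, max_chars: int = 1000) -> List[str]:
    """Pack whole paragraphs into chunks without exceeding max_chars,
    except when a single paragraph is itself longer than the budget."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "A" * 30 + "\n\n" + "B" * 30 + "\n\n" + "C" * 30
for chunk in split_by_paragraph(sample, max_chars=70):
    print(len(chunk))
```

Because paragraphs are never cut in the middle, each chunk stays readable on its own when it is later placed in a prompt.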
Overlapping Chunks
Preserves context across chunks.
Example:
Chunk 1: Tokens 1–500
Chunk 2: Tokens 451–950 (a 50-token overlap with Chunk 1)
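Fixed-size chunking with overlap can be sketched as a sliding window. The example below treats each list element as one token, which is a simplification; a real pipeline would use the tokenizer of its embedding model. With 500-token chunks and a 50-token overlap, the window advances 450 tokens at a time.

```python
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence, chunk_size: int = 500, overlap: int = 50) -> List[list]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[list] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


tokens = list(range(1, 951))  # stand-in for a 950-token document
chunks = chunk_with_overlap(tokens)
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: tokens {chunk[0]}-{chunk[-1]}")
```

The overlap means a sentence that straddles a boundary appears in full in at least one chunk, at the cost of storing and embedding some text twice.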
Step 6: Generate Embeddings
What Are Embeddings?
Embeddings are numerical vector representations of content.
Embeddings enable:
- Semantic search
- Vector search
- Similarity matching
Generated using:
- Azure OpenAI Service
- Azure AI Foundry models
Embedding Workflow
Document Chunk → Embedding Model → Vector Embedding
The vectors are stored in a vector-enabled index.
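Similarity matching between embeddings is commonly measured with cosine similarity. The sketch below uses tiny hand-written 3-dimensional vectors purely for illustration; real embedding models return vectors with hundreds or thousands of dimensions, and the chunk names here are invented.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
chunk_vectors = {
    "refund policy": [0.8, 0.2, 0.1],
    "office hours": [0.0, 0.1, 0.9],
}
query_vector = [0.9, 0.1, 0.0]

best = max(chunk_vectors, key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]))
print(best)  # refund policy
```

A vector index performs the same comparison at scale, using approximate nearest-neighbor structures instead of scoring every stored vector.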
Step 7: Indexing Content
Azure AI Search Indexes
Indexes store:
- Document content
- Metadata
- Embeddings
- Enrichment outputs
Example fields:
| Field | Purpose |
|---|---|
| id | Unique identifier |
| content | Extracted text |
| title | Document title |
| contentVector | Embedding vector |
| language | Metadata |
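The fields in the table above could be expressed as an index definition along the following lines. This is a sketch of the JSON body as a Python dictionary: the index, profile, and algorithm names are invented, the 1536 dimension is an assumption that must match the chosen embedding model, and the property names should be verified against the current Azure AI Search REST reference.

```python
import json

# Illustrative index body; all names are assumptions for this example.
index_definition = {
    "name": "rag-chunks",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {"name": "title", "type": "Edm.String", "searchable": True},
        {"name": "language", "type": "Edm.String", "filterable": True, "facetable": True},
        {
            "name": "contentVector",
            "type": "Collection(Edm.Single)",
            "searchable": True,
            "dimensions": 1536,  # must equal the embedding model's output size
            "vectorSearchProfile": "default-vector-profile",
        },
    ],
    "vectorSearch": {
        "algorithms": [{"name": "default-hnsw", "kind": "hnsw"}],
        "profiles": [{"name": "default-vector-profile", "algorithm": "default-hnsw"}],
    },
}

print(json.dumps(index_definition, indent=2))
```

The vector field refers to a profile by name, and the profile in turn refers to an algorithm configuration, so those names have to line up.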
Vector Indexing
Vector indexes support:
- Semantic similarity retrieval
- Nearest-neighbor search
- Hybrid search
Important exam concept:
Vector search is foundational to RAG retrieval.
Hybrid Search
What Is Hybrid Search?
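Hybrid search runs a keyword (full-text) query and a vector query in the same request and merges the two result sets into one ranking. Semantic ranking can optionally rerank the merged results.
It combines:
- Keyword search
- Vector search
- Semantic ranking (optional reranking step)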
Benefits:
- Better relevance
- Higher recall
- Improved grounding
Hybrid search is strongly recommended for enterprise AI applications.
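Azure AI Search merges keyword and vector results using Reciprocal Rank Fusion (RRF). The sketch below is a standalone illustration of the RRF idea, not the service's implementation; the constant `k=60` is a commonly used default and is an assumption here, as are the document IDs.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each document scores 1 / (k + rank) per list."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_results = ["a", "b", "c"]  # ranked by full-text relevance
vector_results = ["b", "d", "a"]   # ranked by vector similarity
print(reciprocal_rank_fusion([keyword_results, vector_results]))
# ['b', 'a', 'd', 'c']
```

Documents that rank well in both lists rise to the top, which is why hybrid retrieval tends to be more robust than either method alone.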
Retrieval Stage
When a user submits a question:
- Query embedding is generated
- Search retrieves relevant chunks
- Retrieved chunks are inserted into the prompt
- LLM generates grounded response
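The prompt-construction step in the list above can be sketched as follows. The chunk shape (`title`, `content`), the instruction wording, and the `max_chunks` parameter are all assumptions for illustration; the search call and the LLM call are deliberately left out.

```python
from typing import Dict, List


def build_grounded_prompt(question: str, chunks: List[Dict[str, str]], max_chunks: int = 3) -> str:
    """Insert the top retrieved chunks into a prompt as numbered sources."""
    selected = chunks[:max_chunks]
    sources = "\n".join(
        f"[{i}] ({chunk['title']}) {chunk['content']}"
        for i, chunk in enumerate(selected, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


retrieved = [
    {"title": "Refund policy", "content": "Refunds are processed within 5 business days."},
    {"title": "Shipping FAQ", "content": "Orders ship within 2 business days."},
    {"title": "Office hours", "content": "Support is open 9 to 5 on weekdays."},
]
prompt = build_grounded_prompt("How long do refunds take?", retrieved, max_chunks=2)
print(prompt)
```

Numbering the sources makes it possible to ask the model to cite them, and capping the number of chunks keeps the prompt within the model's token budget.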
Example RAG Query Flow
User Question → Embedding Generation → Vector + Hybrid Search → Relevant Chunks Retrieved → Prompt Construction → Grounded AI Response
Document Intelligence and Layout Extraction
Many documents contain:
- Tables
- Forms
- Multi-column layouts
- Headers and footers
Simple OCR may lose structure.
Azure AI Document Intelligence preserves layout relationships.
Layout-Aware Retrieval
Example:
Invoice
├── Vendor
├── Invoice Number
├── Table of Charges
└── Total
Layout extraction preserves:
- Table rows
- Field relationships
- Reading order
This improves:
- Search quality
- Grounding accuracy
- Structured retrieval
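To show why preserved rows and columns matter, the sketch below rebuilds a table from cells that carry row and column positions and renders it as a pipe table that can be indexed as one chunk. The cell shape (`row`, `col`, `content`) is a simplified assumption, not the actual Azure AI Document Intelligence response format.

```python
from typing import Dict, List


def cells_to_rows(cells: List[Dict], row_count: int, column_count: int) -> List[List[str]]:
    """Place each cell at its (row, col) position in a grid of strings."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell["row"]][cell["col"]] = cell["content"]
    return grid


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a pipe table, treating the first row as the header."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Cells arrive in arbitrary order; positions restore the structure.
cells = [
    {"row": 1, "col": 1, "content": "120.00"},
    {"row": 0, "col": 0, "content": "Item"},
    {"row": 1, "col": 0, "content": "Hosting"},
    {"row": 0, "col": 1, "content": "Amount"},
]
print(rows_to_markdown(cells_to_rows(cells, row_count=2, column_count=2)))
```

Plain OCR would return these four strings as loose text, leaving no reliable way to tell which amount belongs to which item.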
Security Considerations
Enterprise RAG systems often require:
- RBAC
- Managed identities
- Private endpoints
- Data encryption
- Access-controlled retrieval
Important exam point:
Retrieval systems should return only authorized content.
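One documented pattern for this in Azure AI Search is security trimming: each indexed chunk stores the group IDs allowed to read it, and every query carries an OData filter built from the caller's groups. The sketch below only builds the filter string. The field name `group_ids` is an assumption, and the ID validation is a deliberately strict simplification that avoids having to escape special characters.

```python
import re
from typing import Iterable

# Accept only simple identifiers (letters, digits, hyphen, underscore),
# which covers GUID-style group IDs and sidesteps quoting issues.
_SAFE_ID = re.compile(r"^[A-Za-z0-9_-]+$")


def build_group_filter(user_groups: Iterable[str], field: str = "group_ids") -> str:
    """Build an OData filter matching documents shared with any of the user's groups."""
    groups = sorted(set(user_groups))
    if not groups:
        raise ValueError("user has no groups; the query should be denied")
    for group in groups:
        if not _SAFE_ID.match(group):
            raise ValueError(f"unexpected characters in group id: {group!r}")
    joined = ",".join(groups)
    return f"{field}/any(g: search.in(g, '{joined}'))"


print(build_group_filter(["finance", "hr"]))
# group_ids/any(g: search.in(g, 'finance,hr'))
```

Applying the filter at query time means unauthorized chunks never reach the prompt, so the LLM cannot leak them.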
Performance Optimization
Common optimization techniques:
- Incremental indexing
- Hybrid search
- Proper chunk sizing
- Metadata filtering
- Caching embeddings
- Selective OCR processing
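Caching embeddings can be sketched by keying each vector on a hash of the chunk text, so unchanged chunks are not re-embedded when a document is re-indexed. The `EmbeddingCache` class and `fake_embed` function below are hypothetical; `fake_embed` stands in for a real embedding model call.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """In-memory cache keyed by the SHA-256 hash of the chunk text."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the embedding function was called

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    """Placeholder for an embedding model call; not a real embedding."""
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
cache.get("Refunds take 5 days.")
cache.get("Refunds take 5 days.")   # served from the cache
cache.get("Orders ship in 2 days.")
print(cache.misses)  # 2
```

In practice the store would be persistent (for example a database or blob container) so that the cache survives between indexing runs.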
Common AI-103 Scenarios
Scenario 1
You need searchable scanned PDFs.
Solution:
- OCR Skill
- Azure AI Search
- Blob Storage
Scenario 2
You need semantic retrieval for an AI chatbot.
Solution:
- Embeddings
- Vector search
- Hybrid search
Scenario 3
You need invoice field extraction.
Solution:
- Azure AI Document Intelligence
- Layout extraction
Scenario 4
You need enterprise grounding with internal documents.
Solution:
- RAG architecture
- Azure AI Search
- Azure OpenAI
Important AI-103 Exam Tips
Know These Key Concepts
| Concept | Purpose |
|---|---|
| OCR | Extract text from images |
| Skillset | AI enrichment pipeline |
| Chunking | Split documents for retrieval |
| Embeddings | Vector representations |
| Vector search | Semantic retrieval |
| Hybrid search | Combined retrieval approach |
| Grounding | Provide trusted context to LLM |
Frequently Tested Knowledge Areas
Expect questions involving:
- OCR pipelines
- RAG architectures
- Azure AI Search indexers
- Skillsets
- Embedding generation
- Chunking strategies
- Hybrid search
- Layout-aware extraction
- Document Intelligence integration
Final Thoughts
Configuring RAG ingestion flows is one of the most important modern Azure AI skills.
For AI-103, focus heavily on:
- OCR workflows
- Document ingestion
- AI enrichment
- Chunking
- Embeddings
- Vector indexing
- Hybrid retrieval
- Grounding pipelines
These concepts are foundational to enterprise AI agents, copilots, and intelligent search applications.
Practice Exam Questions
Question 1
What is the primary purpose of OCR in a RAG ingestion pipeline?
A. Encrypt documents
B. Generate embeddings directly
C. Compress PDF files
D. Convert images and scanned documents into searchable text
Answer
D. Convert images and scanned documents into searchable text
Question 2
Which Azure service commonly provides OCR capabilities?
A. Azure Backup
B. Azure AI Vision
C. Azure DNS
D. Azure Firewall
Answer
B. Azure AI Vision
Question 3
What is the purpose of chunking documents in a RAG pipeline?
A. Reduce network latency only
B. Encrypt sensitive data
C. Improve retrieval and fit token limits
D. Remove metadata
Answer
C. Improve retrieval and fit token limits
Question 4
Which Azure service commonly stores searchable vector indexes?
A. Azure AI Search
B. Azure Virtual Machines
C. Azure Monitor
D. Azure Policy
Answer
A. Azure AI Search
Question 5
What is the role of embeddings in a RAG system?
A. Compress images
B. Store RBAC permissions
C. Represent content as numerical vectors for similarity search
D. Replace OCR processing
Answer
C. Represent content as numerical vectors for similarity search
Question 6
Which component commonly orchestrates AI enrichment during indexing?
A. Load balancer
B. Skillset
C. Resource group
D. Network security group
Answer
B. Skillset
Question 7
Why is hybrid search commonly recommended in enterprise RAG systems?
A. It reduces storage costs only
B. It replaces OCR processing
C. It eliminates embeddings entirely
D. It combines multiple retrieval techniques for better relevance
Answer
D. It combines multiple retrieval techniques for better relevance
Question 8
Which Azure service is best for preserving document layout and table structures?
A. Azure AI Document Intelligence
B. Azure Monitor
C. Azure Kubernetes Service
D. Azure Logic Apps
Answer
A. Azure AI Document Intelligence
Question 9
What is grounding in a generative AI solution?
A. Deleting unused indexes
B. Training foundation models from scratch
C. Providing trusted external context to the LLM
D. Compressing vector databases
Answer
C. Providing trusted external context to the LLM
Question 10
Which statement best describes a RAG architecture?
A. It relies only on model training data
B. It combines retrieval systems with generative AI models
C. It eliminates the need for search indexes
D. It only works with structured databases
Answer
B. It combines retrieval systems with generative AI models
