This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Implement computer vision solutions (10–15%)
   --> Design and implement multimodal understanding workflows
      --> Implement a solution that enables question-answering grounded in visual evidence

Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Modern multimodal AI systems can answer questions about images, videos, diagrams, documents, and other visual content. These systems combine:

Computer vision
Natural language understanding
Multimodal reasoning
Retrieval mechanisms
Context grounding

For the AI-103 certification exam, you should understand how to implement solutions that provide question-answering (QA) capabilities grounded in visual evidence.

This includes:

Visual Question Answering (VQA)
Image-based reasoning
Document question-answering
Video question-answering
Grounded multimodal responses
OCR-enhanced workflows
Retrieval-augmented multimodal systems

You should also understand:

Prompt engineering
Visual grounding
Confidence validation
Responsible AI concerns
Azure services used in multimodal QA systems
Workflow orchestration and monitoring

This topic falls under:

“Design and implement multimodal understanding workflows”

What Is Visual Question Answering (VQA)?

Definition

Visual Question Answering (VQA) is a multimodal AI capability that enables users to ask questions about visual content.

The system:

Processes visual input
Interprets contextual meaning
Understands the question
Generates an answer grounded in the visual evidence

Example VQA Workflow

Image:

A kitchen scene

Question:

What appliance is next to the refrigerator?

AI Response:

A microwave is positioned next to the refrigerator.

What Does “Grounded in Visual Evidence” Mean?

Grounding means that responses are based on actual visual content rather than unsupported assumptions.

Grounded systems:

Reference visible information
Reduce hallucinations
Improve trustworthiness
Support explainability

Why Grounding Matters

Without grounding, AI models may:

Invent objects
Misinterpret scenes
Produce misleading responses

Grounding improves:

Accuracy
Reliability
Transparency

Common Use Cases

Accessibility

Users can ask:

What is happening in this image?

This supports:

Screen readers
Assistive technologies
Inclusive applications

Retail and E-Commerce

Questions may include:

Which products are missing from the shelf?

Healthcare

Medical workflows may ask:

Are there visible signs of inflammation?

Human oversight remains essential.

Security and Monitoring

Systems may answer:

Is anyone wearing a safety helmet?

Manufacturing

Questions may include:

Is the assembly component installed correctly?

Education

Students may ask questions about:

Diagrams
Charts
Scientific illustrations

Types of Visual QA Systems

Image Question-Answering

Answers questions about single images.

Multi-Image QA

Analyzes multiple related images.

Example:

Which image contains damaged packaging?

Document Question-Answering

Answers questions about:

Forms
Invoices
Reports
PDFs

Video Question-Answering

Answers questions about:

Motion
Events
Temporal sequences

Example:

What happens after the person enters the room?

Core Components of Grounded Visual QA Systems

A grounded QA workflow often includes:

Image acquisition
OCR extraction
Object detection
Scene analysis
Multimodal prompting
AI reasoning
Evidence grounding
Safety validation
Output generation

Visual Understanding Tasks

Object Detection

Identifies:

Objects
Locations
Bounding boxes

OCR (Optical Character Recognition)

Extracts visible text from:

Signs
Documents
Labels
Screenshots

Scene Understanding

Analyzes:

Activities
Relationships
Environmental context

Spatial Reasoning

Understands:

Relative positioning
Distance relationships
Orientation

Example:

What object is behind the table?

Multimodal Prompt Engineering

What Is Multimodal Prompting?

Multimodal prompting combines:

Visual input
Natural-language instructions

to guide model reasoning.

Example Prompt

Image:

Warehouse photo

Prompt:

Identify any visible safety violations in the image

Effective Prompting Techniques

Ask Specific Questions

Strong:

How many forklifts are visible?

Weak:

Describe this image

Request Evidence-Based Responses

Example:

Answer only using visible evidence from the image

Request Structured Output

Example:

Return detected hazards as JSON

Visual Grounding

What Is Visual Grounding?

Visual grounding links AI responses to specific image regions or evidence.

This helps:

Explain answers
Improve transparency
Support auditing

Example of Grounding

Question:

What is the person holding?

Grounded response:

			
The person is holding a yellow umbrella visible in the upper-left portion of the image.

OCR-Enhanced Question-Answering

OCR improves QA systems by extracting visible text.

Example Workflow

Image:

Receipt

Question:

What is the total purchase amount?

OCR extracts:

$42.18

The multimodal model generates the final answer.

Chart and Diagram Question-Answering

Multimodal systems can answer questions about:

Graphs
Dashboards
Infographics
Technical diagrams

Example Questions

Which month had the highest sales?

What trend is visible in the chart?

Video Question-Answering

Video QA systems analyze:

Frame sequences
Actions
Motion
Temporal context

Example Video QA

Question:

What does the driver do after stopping the vehicle?

The system analyzes:

Multiple frames
Temporal order
Scene transitions

Retrieval-Augmented Generation (RAG)

What Is Multimodal RAG?

Multimodal RAG combines:

Visual retrieval
Text retrieval
AI reasoning

to improve grounded responses.

Example Workflow

User uploads image
System retrieves related documentation
Multimodal model analyzes image
AI generates grounded answer

Confidence Scoring

Grounded systems may provide:

Confidence values
Evidence references
Validation indicators

This improves reliability.

Hallucinations in Visual QA

What Are Hallucinations?

Hallucinations occur when models:

Invent objects
Misread scenes
Produce unsupported answers

Example Hallucination

Image:

Empty desk

Incorrect answer:

A laptop is sitting on the desk.

Reducing Hallucinations

Strategies include:

Visual grounding
OCR validation
Retrieval augmentation
Confidence thresholds
Human review

Responsible AI Considerations

Visual QA systems introduce important Responsible AI concerns.

Bias and Fairness

Models may:

Misidentify demographics
Reinforce stereotypes
Produce biased responses

Privacy Concerns

Images may contain:

Faces
Sensitive documents
Personal information

Organizations must secure user data.

Harmful Content

Images may include:

Violence
Explicit material
Hate symbols

Azure AI Content Safety

Microsoft provides:
Azure AI Content Safety

to help detect:

Unsafe imagery
Harmful prompts
Policy violations

Human-in-the-Loop Review

Manual review is often required for:

Healthcare systems
Legal workflows
Public-facing applications
High-risk decisions

Performance Considerations

Visual QA systems can be compute-intensive.

Factors affecting performance include:

Image resolution
Video duration
Model size
Context length
Retrieval complexity

GPU Acceleration

Multimodal AI systems commonly rely on GPUs for:

Parallel inference
Transformer processing
Vision model acceleration

Optimization Techniques

Image Resizing

Reduce unnecessary resolution.

Batch Processing

Process multiple requests efficiently.

Caching

Reuse frequent visual embeddings.

Asynchronous Workflows

Improve responsiveness.

Azure Services for Grounded Visual QA

Azure OpenAI Service

Supports:

Multimodal reasoning
Prompt-based image understanding
Grounded visual QA workflows

Azure AI Vision

Supports:

OCR
Object detection
Image analysis
Scene understanding

Azure AI Document Intelligence

Supports:

Document extraction
Form understanding
Layout analysis
Document QA workflows

Azure AI Foundry

Supports:

Prompt flows
Workflow orchestration
Evaluation pipelines

Azure Blob Storage

Frequently used for:

Image storage
Video storage
Document storage

Azure Functions

Often used for:

Event-driven workflows
Orchestration
Automated processing

Observability and Monitoring

Production systems should monitor:

Latency
Failed requests
Hallucination frequency
OCR accuracy
Safety violations
GPU utilization
Operational cost

Best Practices for Visual QA Systems

Use Grounded Prompts

Request evidence-based responses.

Combine OCR and Vision Models

This improves accuracy.

Validate Outputs

Apply confidence thresholds and moderation.

Use Human Review for High-Risk Scenarios

Especially important in regulated industries.

Optimize for Latency and Cost

Balance performance and operational efficiency.

Maintain Audit Logs

Track prompts, responses, and evidence references.

Protect Sensitive Data

Secure uploaded media and extracted information.

Real-World Example

A warehouse safety application may:

Upload facility images
Detect equipment and workers
Use OCR to read warning signs
Allow supervisors to ask:

Are all workers wearing protective equipment?

Generate grounded answers
Store results for compliance auditing

This demonstrates:

Visual grounding
OCR integration
Scene understanding
Safety-focused QA workflows

Exam Tips for AI-103

For the AI-103 exam, remember these important concepts:

Visual Question Answering (VQA) answers questions about visual data.
Grounding ensures answers are based on visible evidence.
OCR improves document and text-based QA.
Visual grounding improves explainability and trustworthiness.
Multimodal prompting guides visual reasoning behavior.
Hallucinations occur when models invent unsupported details.
Multimodal RAG improves grounded responses.
Azure AI Vision supports OCR and image analysis.
Azure AI Document Intelligence supports document QA workflows.
Azure AI Content Safety helps moderate unsafe content.
Human review may be necessary for sensitive scenarios.

Practice Exam Questions

Question 1

What is Visual Question Answering (VQA)?

A. Compressing video streams
B. Answering questions about visual content
C. Encrypting image metadata
D. Generating SQL queries

Answer

B. Answering questions about visual content

Explanation

VQA systems combine visual understanding with natural-language reasoning.

Question 2

What does “grounded in visual evidence” mean?

A. Responses are based on actual visual content
B. Images are compressed before processing
C. Outputs are encrypted automatically
D. GPU acceleration is enabled

Answer

A. Responses are based on actual visual content

Explanation

Grounded systems generate answers supported by visible evidence.

Question 3

What is a primary benefit of visual grounding?

A. Reduced storage requirements
B. Improved explainability and trustworthiness
C. Faster internet connectivity
D. Automatic video compression

Answer

B. Improved explainability and trustworthiness

Explanation

Visual grounding helps users understand why a model produced a response.

Question 4

How does OCR enhance visual QA systems?

A. By encrypting uploaded images
B. By extracting visible text from images and documents
C. By compressing video frames
D. By eliminating GPU requirements

Answer

B. By extracting visible text from images and documents

Explanation

OCR enables multimodal systems to reason over visible text.

Question 5

What is a hallucination in a visual QA system?

A. Automatic image resizing
B. Generating unsupported or incorrect answers
C. Compressing captions
D. Increasing GPU throughput

Answer

B. Generating unsupported or incorrect answers

Explanation

Hallucinations occur when models invent nonexistent objects or details.

Question 6

Which Azure service supports OCR and object detection?

A. Azure AI Vision
B. Azure Firewall
C. Azure DNS
D. Azure CDN

Answer

A. Azure AI Vision

Explanation

Azure AI Vision provides OCR, image analysis, and object detection capabilities.

Question 7

What is multimodal prompting?

A. Combining visual inputs with text instructions
B. Encrypting prompts before transmission
C. Compressing image metadata
D. Scaling GPU clusters automatically

Answer

A. Combining visual inputs with text instructions

Explanation

Multimodal prompting guides AI reasoning using multiple input types.

Question 8

Why are confidence scores useful in visual QA systems?

A. They reduce storage costs
B. They help assess response reliability
C. They eliminate hallucinations completely
D. They compress images automatically