This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub.
This topic falls under these sections:
Implement computer vision solutions (10–15%)
--> Design and implement multimodal understanding workflows
--> Implement a solution that enables question-answering grounded in visual evidence
Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
Modern multimodal AI systems can answer questions about images, videos, diagrams, documents, and other visual content. These systems combine:
- Computer vision
- Natural language understanding
- Multimodal reasoning
- Retrieval mechanisms
- Context grounding
For the AI-103 certification exam, you should understand how to implement solutions that provide question-answering (QA) capabilities grounded in visual evidence.
This includes:
- Visual Question Answering (VQA)
- Image-based reasoning
- Document question-answering
- Video question-answering
- Grounded multimodal responses
- OCR-enhanced workflows
- Retrieval-augmented multimodal systems
You should also understand:
- Prompt engineering
- Visual grounding
- Confidence validation
- Responsible AI concerns
- Azure services used in multimodal QA systems
- Workflow orchestration and monitoring
This topic falls under:
“Design and implement multimodal understanding workflows”
What Is Visual Question Answering (VQA)?
Definition
Visual Question Answering (VQA) is a multimodal AI capability that enables users to ask questions about visual content.
The system:
- Processes visual input
- Interprets contextual meaning
- Understands the question
- Generates an answer grounded in the visual evidence
Example VQA Workflow
Image:
- A kitchen scene
Question:
What appliance is next to the refrigerator?
AI Response:
A microwave is positioned next to the refrigerator.
What Does “Grounded in Visual Evidence” Mean?
Grounding means that responses are based on actual visual content rather than unsupported assumptions.
Grounded systems:
- Reference visible information
- Reduce hallucinations
- Improve trustworthiness
- Support explainability
Why Grounding Matters
Without grounding, AI models may:
- Invent objects
- Misinterpret scenes
- Produce misleading responses
Grounding improves:
- Accuracy
- Reliability
- Transparency
Common Use Cases
Accessibility
Users can ask:
What is happening in this image?
This supports:
- Screen readers
- Assistive technologies
- Inclusive applications
Retail and E-Commerce
Questions may include:
Which products are missing from the shelf?
Healthcare
Medical workflows may ask:
Are there visible signs of inflammation?
Human oversight remains essential.
Security and Monitoring
Systems may answer:
Is anyone wearing a safety helmet?
Manufacturing
Questions may include:
Is the assembly component installed correctly?
Education
Students may ask questions about:
- Diagrams
- Charts
- Scientific illustrations
Types of Visual QA Systems
Image Question-Answering
Answers questions about single images.
Multi-Image QA
Analyzes multiple related images.
Example:
Which image contains damaged packaging?
Document Question-Answering
Answers questions about:
- Forms
- Invoices
- Reports
- PDFs
Video Question-Answering
Answers questions about:
- Motion
- Events
- Temporal sequences
Example:
What happens after the person enters the room?
Core Components of Grounded Visual QA Systems
A grounded QA workflow often includes:
- Image acquisition
- OCR extraction
- Object detection
- Scene analysis
- Multimodal prompting
- AI reasoning
- Evidence grounding
- Safety validation
- Output generation
Visual Understanding Tasks
Object Detection
Identifies:
- Objects
- Locations
- Bounding boxes
OCR (Optical Character Recognition)
Extracts visible text from:
- Signs
- Documents
- Labels
- Screenshots
Scene Understanding
Analyzes:
- Activities
- Relationships
- Environmental context
Spatial Reasoning
Understands:
- Relative positioning
- Distance relationships
- Orientation
Example:
What object is behind the table?
Multimodal Prompt Engineering
What Is Multimodal Prompting?
Multimodal prompting combines:
- Visual input
- Natural-language instructions
to guide model reasoning.
Example Prompt
Image:
- Warehouse photo
Prompt:
Identify any visible safety violations in the image
Effective Prompting Techniques
Ask Specific Questions
Strong:
How many forklifts are visible?
Weak:
Describe this image
Request Evidence-Based Responses
Example:
Answer only using visible evidence from the image
Request Structured Output
Example:
Return detected hazards as JSON
Visual Grounding
What Is Visual Grounding?
Visual grounding links AI responses to specific image regions or evidence.
This helps:
- Explain answers
- Improve transparency
- Support auditing
Example of Grounding
Question:
What is the person holding?
Grounded response:
The person is holding a yellow umbrella visible in the upper-left portion of the image.
OCR-Enhanced Question-Answering
OCR improves QA systems by extracting visible text.
Example Workflow
Image:
- Receipt
Question:
What is the total purchase amount?
OCR extracts:
$42.18
The multimodal model generates the final answer.
Chart and Diagram Question-Answering
Multimodal systems can answer questions about:
- Graphs
- Dashboards
- Infographics
- Technical diagrams
Example Questions
Which month had the highest sales?
What trend is visible in the chart?
Video Question-Answering
Video QA systems analyze:
- Frame sequences
- Actions
- Motion
- Temporal context
Example Video QA
Question:
What does the driver do after stopping the vehicle?
The system analyzes:
- Multiple frames
- Temporal order
- Scene transitions
Retrieval-Augmented Generation (RAG)
What Is Multimodal RAG?
Multimodal RAG combines:
- Visual retrieval
- Text retrieval
- AI reasoning
to improve grounded responses.
Example Workflow
- User uploads image
- System retrieves related documentation
- Multimodal model analyzes image
- AI generates grounded answer
Confidence Scoring
Grounded systems may provide:
- Confidence values
- Evidence references
- Validation indicators
This improves reliability.
Hallucinations in Visual QA
What Are Hallucinations?
Hallucinations occur when models:
- Invent objects
- Misread scenes
- Produce unsupported answers
Example Hallucination
Image:
- Empty desk
Incorrect answer:
A laptop is sitting on the desk.
Reducing Hallucinations
Strategies include:
- Visual grounding
- OCR validation
- Retrieval augmentation
- Confidence thresholds
- Human review
Responsible AI Considerations
Visual QA systems introduce important Responsible AI concerns.
Bias and Fairness
Models may:
- Misidentify demographics
- Reinforce stereotypes
- Produce biased responses
Privacy Concerns
Images may contain:
- Faces
- Sensitive documents
- Personal information
Organizations must secure user data.
Harmful Content
Images may include:
- Violence
- Explicit material
- Hate symbols
Azure AI Content Safety
Microsoft provides:
Azure AI Content Safety
to help detect:
- Unsafe imagery
- Harmful prompts
- Policy violations
Human-in-the-Loop Review
Manual review is often required for:
- Healthcare systems
- Legal workflows
- Public-facing applications
- High-risk decisions
Performance Considerations
Visual QA systems can be compute-intensive.
Factors affecting performance include:
- Image resolution
- Video duration
- Model size
- Context length
- Retrieval complexity
GPU Acceleration
Multimodal AI systems commonly rely on GPUs for:
- Parallel inference
- Transformer processing
- Vision model acceleration
Optimization Techniques
Image Resizing
Reduce unnecessary resolution.
Batch Processing
Process multiple requests efficiently.
Caching
Reuse frequent visual embeddings.
Asynchronous Workflows
Improve responsiveness.
Azure Services for Grounded Visual QA
Azure OpenAI Service
Azure OpenAI Service
Supports:
- Multimodal reasoning
- Prompt-based image understanding
- Grounded visual QA workflows
Azure AI Vision
Azure AI Vision
Supports:
- OCR
- Object detection
- Image analysis
- Scene understanding
Azure AI Document Intelligence
Azure AI Document Intelligence
Supports:
- Document extraction
- Form understanding
- Layout analysis
- Document QA workflows
Azure AI Foundry
Azure AI Foundry
Supports:
- Prompt flows
- Workflow orchestration
- Evaluation pipelines
Azure Blob Storage
Azure Blob Storage
Frequently used for:
- Image storage
- Video storage
- Document storage
Azure Functions
Azure Functions
Often used for:
- Event-driven workflows
- Orchestration
- Automated processing
Observability and Monitoring
Production systems should monitor:
- Latency
- Failed requests
- Hallucination frequency
- OCR accuracy
- Safety violations
- GPU utilization
- Operational cost
Best Practices for Visual QA Systems
Use Grounded Prompts
Request evidence-based responses.
Combine OCR and Vision Models
This improves accuracy.
Validate Outputs
Apply confidence thresholds and moderation.
Use Human Review for High-Risk Scenarios
Especially important in regulated industries.
Optimize for Latency and Cost
Balance performance and operational efficiency.
Maintain Audit Logs
Track prompts, responses, and evidence references.
Protect Sensitive Data
Secure uploaded media and extracted information.
Real-World Example
A warehouse safety application may:
- Upload facility images
- Detect equipment and workers
- Use OCR to read warning signs
- Allow supervisors to ask:
Are all workers wearing protective equipment?
- Generate grounded answers
- Store results for compliance auditing
This demonstrates:
- Visual grounding
- OCR integration
- Scene understanding
- Safety-focused QA workflows
Exam Tips for AI-103
For the AI-103 exam, remember these important concepts:
- Visual Question Answering (VQA) answers questions about visual data.
- Grounding ensures answers are based on visible evidence.
- OCR improves document and text-based QA.
- Visual grounding improves explainability and trustworthiness.
- Multimodal prompting guides visual reasoning behavior.
- Hallucinations occur when models invent unsupported details.
- Multimodal RAG improves grounded responses.
- Azure AI Vision supports OCR and image analysis.
- Azure AI Document Intelligence supports document QA workflows.
- Azure AI Content Safety helps moderate unsafe content.
- Human review may be necessary for sensitive scenarios.
Practice Exam Questions
Question 1
What is Visual Question Answering (VQA)?
A. Compressing video streams
B. Answering questions about visual content
C. Encrypting image metadata
D. Generating SQL queries
Answer
B. Answering questions about visual content
Explanation
VQA systems combine visual understanding with natural-language reasoning.
Question 2
What does “grounded in visual evidence” mean?
A. Responses are based on actual visual content
B. Images are compressed before processing
C. Outputs are encrypted automatically
D. GPU acceleration is enabled
Answer
A. Responses are based on actual visual content
Explanation
Grounded systems generate answers supported by visible evidence.
Question 3
What is a primary benefit of visual grounding?
A. Reduced storage requirements
B. Improved explainability and trustworthiness
C. Faster internet connectivity
D. Automatic video compression
Answer
B. Improved explainability and trustworthiness
Explanation
Visual grounding helps users understand why a model produced a response.
Question 4
How does OCR enhance visual QA systems?
A. By encrypting uploaded images
B. By extracting visible text from images and documents
C. By compressing video frames
D. By eliminating GPU requirements
Answer
B. By extracting visible text from images and documents
Explanation
OCR enables multimodal systems to reason over visible text.
Question 5
What is a hallucination in a visual QA system?
A. Automatic image resizing
B. Generating unsupported or incorrect answers
C. Compressing captions
D. Increasing GPU throughput
Answer
B. Generating unsupported or incorrect answers
Explanation
Hallucinations occur when models invent nonexistent objects or details.
Question 6
Which Azure service supports OCR and object detection?
A. Azure AI Vision
B. Azure Firewall
C. Azure DNS
D. Azure CDN
Answer
A. Azure AI Vision
Explanation
Azure AI Vision provides OCR, image analysis, and object detection capabilities.
Question 7
What is multimodal prompting?
A. Combining visual inputs with text instructions
B. Encrypting prompts before transmission
C. Compressing image metadata
D. Scaling GPU clusters automatically
Answer
A. Combining visual inputs with text instructions
Explanation
Multimodal prompting guides AI reasoning using multiple input types.
Question 8
Why are confidence scores useful in visual QA systems?
A. They reduce storage costs
B. They help assess response reliability
C. They eliminate hallucinations completely
D. They compress images automatically
Answer
B. They help assess response reliability
Explanation
Confidence scores help users evaluate answer trustworthiness.
Question 9
Which Azure service supports document understanding workflows?
A. Azure AI Document Intelligence
B. Azure DNS
C. Azure Virtual WAN
D. Azure Monitor
Answer
A. Azure AI Document Intelligence
Explanation
Azure AI Document Intelligence processes forms, invoices, and documents.
Question 10
Why might human review be required in visual QA systems?
A. To increase network latency
B. To validate high-risk or sensitive outputs
C. To reduce GPU memory usage
D. To disable OCR processing
Answer
B. To validate high-risk or sensitive outputs
Explanation
Sensitive domains such as healthcare and legal systems often require human oversight.
Go to the AI-103 Exam Prep Hub main page
