This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub.
This topic falls under these sections:
Implement computer vision solutions (10–15%)
--> Design and implement multimodal understanding workflows
--> Implement visual understanding by configuring Azure Content Understanding in Foundry Tools to extract visual characteristics
Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
Modern AI applications increasingly rely on multimodal systems that analyze images, documents, videos, and other visual content to extract meaningful information. Microsoft provides tools across its Azure AI services that support visual understanding workflows through multimodal models and workflow orchestration.
For the AI-103 certification exam, you should understand how to configure visual understanding solutions using Azure AI tools and Foundry workflows to extract visual characteristics from media assets.
This includes:
- Object identification
- Scene understanding
- OCR extraction
- Attribute extraction
- Image captioning
- Spatial analysis
- Metadata enrichment
- Visual classification
- Workflow orchestration
You should also understand:
- Prompt engineering
- Multimodal reasoning
- Azure AI Foundry workflows
- Responsible AI practices
- Performance optimization
- Monitoring and observability
What Is Visual Understanding?
Definition
Visual understanding is the ability of AI systems to analyze and interpret visual information from:
- Images
- Videos
- Documents
- Diagrams
- Screenshots
The goal is to extract meaningful characteristics and contextual insights.
What Are Visual Characteristics?
Visual characteristics are identifiable attributes extracted from visual content.
Examples include:
- Objects
- Colors
- Shapes
- Text
- Actions
- Layouts
- Emotions
- Spatial relationships
- Environmental context
Example of Visual Characteristic Extraction
Image:
- Retail shelf
Extracted characteristics:
- Product categories
- Shelf placement
- Pricing labels
- Empty inventory slots
- Brand logos
What Is Azure AI Foundry?
Azure AI Foundry is a Microsoft platform for:
- Building AI applications
- Managing prompt flows
- Orchestrating AI workflows
- Evaluating models
- Integrating multimodal AI services
Foundry tools help developers create scalable AI workflows that integrate vision, language, and reasoning capabilities.
What Is Azure Content Understanding?
Azure Content Understanding is a capability available in Foundry Tools that combines:
- Computer vision
- OCR
- Multimodal AI
- Document understanding
- Language reasoning
to interpret and extract information from visual and multimedia content.
Why Visual Understanding Matters
Visual understanding enables:
- Automation
- Accessibility
- Search enrichment
- Content moderation
- Intelligent retrieval
- Business analytics
- Operational monitoring
Common Use Cases
Retail
Analyze:
- Inventory placement
- Shelf conditions
- Product labels
Healthcare
Interpret:
- Medical imagery
- Visual reports
- Diagnostic documentation
Manufacturing
Detect:
- Defects
- Safety issues
- Assembly validation
Document Processing
Extract:
- Forms
- Tables
- Handwritten text
- Layout structure
Security and Monitoring
Identify:
- Unauthorized access
- Safety hazards
- Environmental anomalies
Core Components of Visual Understanding Workflows
A typical workflow includes:
- Media ingestion
- Preprocessing
- OCR extraction
- Object detection
- Scene analysis
- Multimodal reasoning
- Metadata generation
- Storage and orchestration
Visual Analysis Capabilities
Object Detection
Identifies:
- Objects
- Locations
- Bounding boxes
Example:
- Cars
- People
- Traffic signs
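Object detection results are usually post-processed before they are used downstream. The sketch below is a minimal illustration in plain Python, not any Azure SDK: the `Detection` type, the `(x, y, width, height)` box layout, and the 0.5 threshold are all assumptions made for the example.

```python
from typing import NamedTuple

class Detection(NamedTuple):
    """Hypothetical detection record; real services return their own shapes."""
    label: str
    confidence: float
    box: tuple  # (x, y, width, height) in pixels, assumed layout

def filter_detections(detections, min_confidence=0.5):
    """Keep detections at or above the threshold, highest confidence first."""
    kept = [d for d in detections if d.confidence >= min_confidence]
    return sorted(kept, key=lambda d: d.confidence, reverse=True)

# Made-up sample values for illustration only.
sample = [
    Detection("person", 0.81, (300, 50, 60, 160)),
    Detection("car", 0.92, (40, 60, 200, 120)),
    Detection("traffic sign", 0.34, (500, 20, 30, 30)),
]

for d in filter_detections(sample):
    print(d.label, d.box)
```

Low-confidence detections are dropped rather than passed on, which keeps later steps from reasoning over objects that may not be present.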
Scene Understanding
Interprets:
- Activities
- Environments
- Relationships between objects
Example:
- Crowded airport terminal
- Outdoor sports event
Attribute Extraction
Extracts:
- Colors
- Clothing types
- Brand identifiers
- Vehicle types
- Product conditions
OCR (Optical Character Recognition)
OCR extracts visible text from:
- Signs
- Screenshots
- Receipts
- Documents
- Labels
Example OCR Extraction
Image:
- Invoice
Extracted text:
Invoice Total: $1,248.50
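Raw OCR text usually needs a parsing step before it becomes usable data. As a small sketch, assuming the OCR output contains the literal label "Invoice Total:" followed by a dollar amount (the pattern and function name are illustrative, not part of any Azure API):

```python
import re
from decimal import Decimal

# Assumed label and currency format; adjust for real invoices.
TOTAL_PATTERN = re.compile(r"Invoice Total:\s*\$([\d,]+\.\d{2})")

def parse_invoice_total(ocr_text):
    """Return the invoice total as a Decimal, or None if no total is found."""
    match = TOTAL_PATTERN.search(ocr_text)
    if match is None:
        return None
    # Strip thousands separators before converting.
    return Decimal(match.group(1).replace(",", ""))

print(parse_invoice_total("Invoice Total: $1,248.50"))  # 1248.50
```

`Decimal` is used instead of `float` so that currency values are not affected by binary rounding.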
Spatial Analysis
Spatial analysis interprets:
- Positioning
- Relative distances
- Orientation
Example:
The bicycle is positioned beside the parked vehicle.
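A relationship such as "beside" can be approximated from bounding boxes. The following is a rough sketch under stated assumptions: boxes are `(x, y, width, height)` in pixels, and `max_gap` is an arbitrary illustrative threshold. Production spatial analysis is considerably more involved.

```python
def horizontal_relation(box_a, box_b, max_gap=50):
    """Describe where box_a sits relative to box_b along the x-axis.

    Boxes are (x, y, width, height) in pixels. max_gap is a hypothetical
    threshold for how far apart boxes can be and still count as "beside".
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # The boxes overlap vertically if they share some range of y values.
    overlaps_vertically = ay < by + bh and by < ay + ah
    if ax + aw <= bx:
        gap = bx - (ax + aw)
        side = "left of"
    elif bx + bw <= ax:
        gap = ax - (bx + bw)
        side = "right of"
    else:
        return "overlapping"
    if overlaps_vertically and gap <= max_gap:
        return "beside (" + side + ")"
    return side

# Made-up coordinates for the bicycle and the parked vehicle.
bicycle = (100, 200, 80, 60)
vehicle = (200, 180, 240, 110)
print(horizontal_relation(bicycle, vehicle))  # beside (left of)
```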
Image Captioning
Captioning generates natural-language descriptions of visual content.
Example:
A worker wearing protective equipment operates machinery in a factory environment.
Dense Captioning
Dense captioning describes:
- Multiple regions
- Multiple objects
- Activities within a scene
Visual Classification
Classification categorizes images into labels.
Examples:
- Warehouse
- Beach
- Construction site
- Medical scan
Multimodal Reasoning
What Is Multimodal Reasoning?
Multimodal reasoning combines:
- Vision analysis
- Language understanding
- Contextual interpretation
to produce intelligent outputs.
Example
Image:
- Restaurant kitchen
Question:
Are food safety violations visible?
The system analyzes:
- Cooking equipment
- Worker behavior
- Environmental conditions
Prompt Engineering in Foundry Workflows
Why Prompt Engineering Matters
Prompt engineering guides how multimodal models interpret visual content.
Example Prompt
Extract all visible product labels and identify damaged packaging
Accessibility-Focused Prompt Example
Generate accessibility-focused image descriptions for screen readers
Structured Output Prompt Example
Return extracted visual characteristics as JSON
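Asking for JSON does not guarantee valid JSON, so it helps to validate the reply before using it. Here is a minimal sketch, assuming a hypothetical schema with `objects`, `text`, and `caption` keys (these key names are invented for illustration):

```python
import json

# Hypothetical schema: the keys a downstream system expects.
REQUIRED_KEYS = {"objects", "text", "caption"}

def parse_model_output(raw):
    """Parse a model reply and confirm it has the expected top-level keys."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if invalid
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

reply = '{"objects": ["box"], "text": ["FRAGILE"], "caption": "A damaged box"}'
print(parse_model_output(reply)["caption"])
```

Rejecting malformed replies early keeps bad data out of storage and makes retries or human review easy to trigger.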
Workflow Orchestration in Azure AI Foundry
Foundry workflows may orchestrate:
- OCR pipelines
- Vision analysis
- Prompt flows
- Safety checks
- Human review
- Data storage
Example Workflow
- User uploads image
- OCR extracts visible text
- Object detection identifies entities
- Multimodal model analyzes context
- AI generates structured metadata
- Results stored in Blob Storage
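The steps above can be sketched as a simple sequential pipeline. Every function here is a stub with placeholder return values, and a plain dictionary stands in for Blob Storage; none of this is Foundry or Azure SDK code, only an illustration of how each step can build on earlier results.

```python
def extract_text(image, context):
    return ["FRAGILE", "SKU 1042"]  # placeholder OCR output

def detect_objects(image, context):
    return ["box", "pallet"]  # placeholder detections

def analyze_context(image, context):
    # A multimodal model call would go here; this stub only combines earlier results.
    return f"{len(context['objects'])} objects, {len(context['text'])} text lines"

def build_metadata(image, context):
    return {"size_bytes": len(image), "summary": context["analysis"]}

# Order matters: later steps read what earlier steps wrote into the context.
STEPS = [
    ("text", extract_text),
    ("objects", detect_objects),
    ("analysis", analyze_context),
    ("metadata", build_metadata),
]

def run_workflow(image, store):
    """Run each step in order, then save the metadata to the store."""
    context = {}
    for name, step in STEPS:
        context[name] = step(image, context)
    store["result"] = context["metadata"]  # dict stands in for Blob Storage
    return context

store = {}
run_workflow(b"fake-image-bytes", store)
print(store["result"])
```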
Retrieval-Augmented Generation (RAG)
Multimodal RAG
Multimodal RAG combines:
- Visual retrieval
- Text retrieval
- AI reasoning
to improve grounded understanding.
Example
- User uploads equipment photo
- System retrieves maintenance documentation
- AI compares image to known equipment states
- System generates grounded analysis
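The retrieval step can be illustrated with a toy similarity search. This sketch assumes the photo and the documents have already been converted to embedding vectors by some model; the three-dimensional vectors and document titles below are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vector, documents, top_k=1):
    """Return the top_k documents most similar to the query vector."""
    ranked = sorted(
        documents,
        key=lambda d: cosine_similarity(query_vector, d["vector"]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical pre-computed embeddings.
docs = [
    {"title": "Pump maintenance guide", "vector": [0.9, 0.1, 0.0]},
    {"title": "Conveyor safety checklist", "vector": [0.1, 0.8, 0.3]},
]
image_vector = [0.85, 0.2, 0.05]  # stand-in embedding of the uploaded photo

print(retrieve(image_vector, docs)[0]["title"])
```

The retrieved document would then be passed to the model alongside the image so that its analysis is grounded in known reference material.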
Responsible AI Considerations
Visual understanding systems introduce important Responsible AI concerns.
Bias and Fairness
Models may:
- Misidentify demographics
- Reinforce stereotypes
- Produce biased classifications
Privacy Concerns
Images may contain:
- Faces
- Personal data
- Sensitive information
Organizations must secure visual data properly.
Hallucinations
What Are Hallucinations?
Hallucinations occur when models:
- Invent objects
- Misidentify scenes
- Produce unsupported conclusions
Reducing Hallucinations
Strategies include:
- OCR grounding
- Confidence scoring
- Human review
- Retrieval augmentation
- Structured prompts
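OCR grounding can be as simple as checking that any text the model claims to see actually appears in the OCR output. The following is a minimal sketch with hypothetical inputs; it uses case-insensitive substring matching, which is an assumption and would miss OCR misreads or paraphrases.

```python
def ungrounded_claims(claimed_text, ocr_lines):
    """Return claimed strings that do not appear in any OCR line."""
    haystack = [line.lower() for line in ocr_lines]
    return [
        claim
        for claim in claimed_text
        if not any(claim.lower() in line for line in haystack)
    ]

# The model claims two labels; OCR only found one of them.
claims = ["Fragile", "Express"]
ocr_output = ["FRAGILE - handle with care", "SKU 1042"]
print(ungrounded_claims(claims, ocr_output))  # ['Express']
```

Any claim returned by this check is a candidate hallucination and can be dropped or sent for human review.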
Azure AI Content Safety
Microsoft provides Azure AI Content Safety to help detect:
- Harmful imagery
- Unsafe prompts
- Policy violations
Human-in-the-Loop Review
Manual review may be required for:
- Healthcare workflows
- Legal systems
- Government applications
- Public-facing AI systems
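A review gate is often expressed as a simple routing rule. In this sketch the domain names and the 0.9 confidence threshold are illustrative assumptions, not recommended values:

```python
# Hypothetical set of domains that always require a human reviewer.
HIGH_RISK_DOMAINS = {"healthcare", "legal", "government"}

def needs_human_review(domain, confidence, threshold=0.9):
    """Route to a reviewer if the domain is high risk or confidence is low."""
    return domain in HIGH_RISK_DOMAINS or confidence < threshold

print(needs_human_review("healthcare", 0.99))  # True
print(needs_human_review("retail", 0.95))      # False
```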
Performance Considerations
Visual understanding systems can require substantial compute resources.
Factors affecting performance include:
- Image resolution
- Video length
- OCR complexity
- Model size
- Context window size
GPU Acceleration
Multimodal AI commonly relies on GPUs because of:
- Parallel processing
- Transformer inference
- Large-scale visual analysis
Optimization Techniques
Image Resizing
Reduce unnecessary resolution before analysis to lower latency and cost.
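As a small illustration, the target size can be computed while preserving the aspect ratio. The 1024-pixel limit is an arbitrary assumption; the actual resize would be done with an imaging library, which is omitted here.

```python
def fit_within(width, height, max_side=1024):
    """Return (width, height) scaled so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within(4000, 3000))  # (1024, 768)
```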
Batch Processing
Analyze multiple assets in one job to reduce per-request overhead.
Asynchronous Processing
Improve responsiveness by running long analysis jobs in the background instead of blocking the caller.
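One way to sketch this in Python is with `asyncio`, capping how many analyses run at once. The `analyze` coroutine below is a stand-in that sleeps instead of calling a real service, and the concurrency limit of 4 is an assumption.

```python
import asyncio

async def analyze(asset_id):
    """Stand-in for a slow analysis call; sleeps instead of calling a service."""
    await asyncio.sleep(0.01)
    return {"asset": asset_id, "status": "done"}

async def analyze_all(asset_ids, max_concurrent=4):
    """Analyze all assets concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(asset_id):
        async with semaphore:
            return await analyze(asset_id)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(limited(a) for a in asset_ids))

results = asyncio.run(analyze_all(["img-1", "img-2", "img-3"]))
print([r["asset"] for r in results])
```

The semaphore keeps a large batch from overwhelming a rate-limited service while still overlapping the waiting time of individual requests.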
Caching
Reuse previously generated embeddings and metadata.
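A minimal caching sketch keys results on a hash of the image bytes, so identical uploads are analyzed only once. The `AnalysisCache` class and the in-memory dictionary are illustrative assumptions; a real deployment would likely use a shared store.

```python
import hashlib

class AnalysisCache:
    """Hypothetical in-memory cache keyed on the SHA-256 of the image bytes."""

    def __init__(self, analyze):
        self._analyze = analyze  # function that performs the expensive analysis
        self._store = {}
        self.misses = 0

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._analyze(image_bytes)
        return self._store[key]

# The lambda stands in for a costly analysis call.
cache = AnalysisCache(lambda data: {"size_bytes": len(data)})
cache.get(b"same-image")
cache.get(b"same-image")
print(cache.misses)  # 1
```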
Azure Services Used in Visual Understanding Workflows
Azure OpenAI Service
Azure OpenAI Service supports:
- Multimodal reasoning
- Prompt-driven visual analysis
- Context-aware workflows
Azure AI Vision
Azure AI Vision supports:
- OCR
- Image analysis
- Object detection
- Caption generation
Azure AI Document Intelligence
Azure AI Document Intelligence supports:
- Form extraction
- Layout understanding
- Structured document analysis
Azure Blob Storage
Azure Blob Storage is frequently used for:
- Image storage
- Video storage
- Metadata storage
- Workflow integration
Azure Functions
Azure Functions is often used for:
- Trigger-based automation
- Event-driven workflows
- Orchestration pipelines
Observability and Monitoring
Production systems should monitor:
- Latency
- OCR accuracy
- Failed requests
- Hallucination frequency
- GPU utilization
- Safety violations
- Operational cost
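A few of these metrics can be derived from basic request logs. The sketch below assumes a hypothetical log format in which each request records `latency_ms` and an `ok` flag; the sample values are invented.

```python
import statistics

def summarize(requests):
    """Summarize latency and failure rate from a list of request records."""
    if not requests:
        raise ValueError("no requests to summarize")
    latencies = [r["latency_ms"] for r in requests]
    failed = sum(1 for r in requests if not r["ok"])
    return {
        "count": len(requests),
        "failure_rate": failed / len(requests),
        "median_latency_ms": statistics.median(latencies),
        "max_latency_ms": max(latencies),
    }

# Made-up log entries for illustration.
log = [
    {"latency_ms": 120, "ok": True},
    {"latency_ms": 300, "ok": False},
    {"latency_ms": 180, "ok": True},
    {"latency_ms": 200, "ok": True},
]
print(summarize(log))
```

In production these figures would normally come from a monitoring service rather than hand-rolled code, but the calculation is the same.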
Best Practices for Visual Understanding Solutions
Use Specific Prompts
Detailed prompts improve extraction quality.
Combine OCR and Vision Analysis
Pairing extracted text with visual analysis grounds model output in what is actually visible.
Validate Outputs
Check for hallucinations and inaccuracies.
Use Structured Outputs
JSON outputs simplify downstream automation.
Protect Sensitive Data
Secure uploaded media and extracted information.
Support Human Review
Especially important for high-risk workflows.
Optimize for Cost and Performance
Balance quality and operational efficiency.
Real-World Example
A logistics company may:
- Upload warehouse images
- Extract visible shipment labels with OCR
- Detect damaged packaging
- Identify forklift activity
- Generate structured metadata
- Store analysis results in Blob Storage
This demonstrates:
- OCR integration
- Object detection
- Spatial analysis
- Workflow orchestration
- Metadata enrichment
Exam Tips for AI-103
For the AI-103 exam, remember these important concepts:
- Visual understanding extracts meaningful information from images and videos.
- Azure AI Foundry supports workflow orchestration and prompt flows.
- OCR extracts visible text from images and documents.
- Multimodal reasoning combines vision and language understanding.
- Object detection identifies objects and locations.
- Scene understanding interprets activities and relationships.
- Structured outputs improve automation workflows.
- Hallucinations occur when models generate unsupported conclusions.
- Azure AI Vision supports OCR and image analysis.
- Azure AI Content Safety helps moderate unsafe content.
- Human review may be necessary for sensitive workflows.
Practice Exam Questions
Question 1
What is the primary goal of visual understanding systems?
A. Compressing media files
B. Extracting meaningful information from visual content
C. Encrypting image metadata
D. Reducing internet bandwidth usage
Answer
B. Extracting meaningful information from visual content
Explanation
Visual understanding systems analyze images and videos to extract useful insights.
Question 2
Which capability extracts visible text from images?
A. Object detection
B. OCR
C. Image compression
D. GPU scheduling
Answer
B. OCR
Explanation
OCR (Optical Character Recognition) extracts machine-readable text from images and documents.
Question 3
What is multimodal reasoning?
A. Combining visual and language understanding for contextual interpretation
B. Compressing videos into smaller files
C. Encrypting AI prompts
D. Scaling databases automatically
Answer
A. Combining visual and language understanding for contextual interpretation
Explanation
Multimodal reasoning integrates multiple input types to improve AI understanding.
Question 4
Which Azure service supports prompt flows and AI workflow orchestration?
A. Azure AI Foundry
B. Azure CDN
C. Azure Firewall
D. Azure DNS
Answer
A. Azure AI Foundry
Explanation
Azure AI Foundry supports orchestration, evaluation pipelines, and prompt workflows.
Question 5
What is a hallucination in visual understanding systems?
A. Automatic GPU scaling
B. Generating unsupported or incorrect conclusions
C. Compressing image embeddings
D. Encrypting metadata
Answer
B. Generating unsupported or incorrect conclusions
Explanation
Hallucinations occur when AI systems invent nonexistent details or relationships.
Question 6
Which Azure service supports image analysis and object detection?
A. Azure AI Vision
B. Azure DNS
C. Azure Firewall
D. Azure ExpressRoute
Answer
A. Azure AI Vision
Explanation
Azure AI Vision supports OCR, image analysis, and object detection capabilities.
Question 7
Why are structured outputs useful in visual understanding workflows?
A. They simplify downstream automation and integration
B. They eliminate GPU requirements
C. They automatically remove hallucinations
D. They compress images automatically
Answer
A. They simplify downstream automation and integration
Explanation
Structured outputs such as JSON are easier for downstream systems to process.
Question 8
What is a common use case for visual understanding in retail?
A. Detecting shelf inventory conditions
B. Encrypting payment transactions
C. Reducing internet latency
D. Scaling virtual machines automatically
Answer
A. Detecting shelf inventory conditions
Explanation
Retail workflows often analyze shelves, inventory placement, and product visibility.
Question 9
Which Azure service helps moderate unsafe visual content?
A. Azure AI Content Safety
B. Azure Virtual WAN
C. Azure DNS
D. Azure Load Balancer
Answer
A. Azure AI Content Safety
Explanation
Azure AI Content Safety helps detect harmful or policy-violating content.
Question 10
Why might human review be necessary in visual understanding workflows?
A. To validate sensitive or high-risk AI outputs
B. To disable OCR processing
C. To increase GPU throughput
D. To compress image metadata
Answer
A. To validate sensitive or high-risk AI outputs
Explanation
Human oversight helps ensure accuracy and safety in critical workflows.
