Tag: Multimodal models

AI, AI-103, Computer Vision, Generative AI, Microsoft Certification May 25, 2026May 25, 2026

Configure apps to produce concise or detailed captions for single or multiple images (AI-103 Exam Prep)

This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Implement computer vision solutions (10–15%)
   --> Design and implement multimodal understanding workflows
      --> Configure apps to produce concise or detailed captions for single or multiple images

Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Modern multimodal AI systems can automatically generate captions that describe visual content in natural language. Captioning capabilities are widely used in:

Accessibility solutions
Content management systems
E-commerce platforms
Media analysis systems
Social media applications
Digital asset management
Search and retrieval systems

For the AI-103 certification exam, you should understand how to configure applications that generate:

Concise captions
Detailed captions
Single-image captions
Multi-image summaries
Context-aware visual descriptions

You should also understand:

Multimodal prompting
Caption customization
Batch image workflows
Accessibility considerations
Responsible AI concerns
Performance optimization
Azure services commonly used for captioning solutions

This topic falls under:

“Design and implement multimodal understanding workflows”

What Is Image Captioning?

Definition

Image captioning is the process of generating natural-language descriptions from visual input.

A captioning system analyzes:

Objects
People
Actions
Relationships
Backgrounds
Contextual information

and produces descriptive text.

Example Caption

Image:

Dog running on a beach

Generated caption:

A golden retriever running along a sandy beach near the ocean

Why Image Captioning Matters

Captioning improves:

Accessibility
Searchability
Automation
User experience
Content organization

Common Use Cases

Accessibility

Captions help visually impaired users understand image content through:

Screen readers
Audio narration
Alternative text (alt text)

E-Commerce

Captioning can automatically describe:

Products
Product conditions
Visual features

Media and Content Management

Organizations use captioning to:

Tag assets
Search images
Organize media libraries

Social Media

Applications generate:

Suggested captions
Content summaries
Automatic alt text

Security and Monitoring

Captioning systems can describe:

Surveillance scenes
Operational events
Safety hazards

Concise vs Detailed Captions

Concise Captions

Concise captions provide short summaries of image content.

Example:

A child riding a bicycle

Advantages of Concise Captions

Benefits include:

Faster reading
Simpler accessibility support
Reduced token usage
Lower latency

Detailed Captions

Detailed captions provide richer contextual descriptions.

Example:

			
A young child wearing a red helmet rides a blue bicycle along a tree-lined suburban street on a sunny afternoon

Advantages of Detailed Captions

Benefits include:

More context
Better search indexing
Improved scene understanding
Enhanced accessibility

Captioning Workflows

A typical captioning workflow includes:

Upload image
Preprocess image
Run visual analysis
Generate caption
Validate output
Store or display caption

Single-Image Captioning

What Is Single-Image Captioning?

Single-image captioning generates descriptions for one image at a time.

This is common in:

Accessibility apps
Social media uploads
Product pages

Example Workflow

User uploads image
Multimodal model analyzes image
App requests concise caption
Caption returned to application

Multi-Image Captioning

What Is Multi-Image Captioning?

Multi-image captioning generates:

Individual captions
Combined summaries
Comparative descriptions

for multiple related images.

Example Use Cases

Product Catalogs

Describe multiple product images together.

Photo Albums

Generate event summaries.

Medical Imaging

Summarize related scans or frames.

Example Multi-Image Summary

Images:

Beach photos from vacation

Generated summary:

A family vacation featuring beach activities, ocean sunsets, and outdoor dining

Dense Captioning

What Is Dense Captioning?

Dense captioning describes multiple objects or regions within a single image.

Example:

Person sitting on bench
Dog nearby
Bicycle leaning against tree

Visual Context in Captioning

Captioning systems analyze:

Objects
Actions
Emotions
Spatial relationships
Scene composition

This enables richer descriptions.

Caption Personalization

Applications may customize captions based on:

Audience
Reading level
Language
Accessibility requirements
Business domain

Example Accessibility Caption

			
A person using a wheelchair enters a modern office building using a wheelchair-accessible ramp

Multimodal Prompting for Captioning

What Is Multimodal Prompting?

Multimodal prompting combines:

Visual input
Text instructions

to guide caption generation.

Example Prompt

Image input:

Retail shelf

Prompt:

Generate a concise inventory-focused caption

Detailed Caption Prompt Example

Generate a highly detailed accessibility-focused description of this image

Prompt Engineering Best Practices

Be Specific

Specific prompts improve:

Accuracy
Relevance
Style consistency

Define Desired Length

Example:

Generate a one-sentence caption

or:

Generate a detailed paragraph describing all visible activities

Request Structured Outputs

Applications may request:

JSON responses
Categorized descriptions
Tagged outputs

Example:

Return caption and detected objects as JSON

Caption Quality Factors

Caption quality depends on:

Image quality
Resolution
Model capability
Prompt clarity
Scene complexity

Challenges in Captioning

Ambiguity

Images may contain unclear or partially visible objects.

Context Limitations

Models may incorrectly infer:

Emotions
Intentions
Activities

Cultural Interpretation

Visual meaning may vary across cultures.

Hallucinations in Captioning

What Are Hallucinations?

Hallucinations occur when models describe objects or actions not actually present.

Example:

Describing a dog that is not visible

Reducing Hallucinations

Strategies include:

Better prompts
Confidence scoring
Human review
Object detection grounding

Caption Evaluation Metrics

Organizations may evaluate captions using:

Accuracy
Relevance
Completeness
Fluency
Accessibility quality

Accessibility Considerations

Captioning systems are important for:

Screen readers
Alt text generation
Inclusive design

Good Accessibility Captions

Good captions should:

Be descriptive
Avoid vague wording
Focus on important details

Weak Caption Example

An image of a thing

Strong Caption Example

A firefighter carrying a child away from a burning building

Batch Captioning Workflows

Enterprise systems often process images in bulk.

Example Batch Workflow

Upload image batch
Queue processing jobs
Generate captions
Validate outputs
Store metadata
Enable search indexing

Workflow Orchestration

Captioning systems often integrate:

OCR
Object detection
Search indexing
Safety filtering
Human review

Example Enterprise Workflow

User uploads image collection
OCR extracts visible text
AI generates captions
Search metadata created
Unsafe content filtered
Results stored

Responsible AI Considerations

Captioning systems introduce important Responsible AI concerns.

Bias and Fairness

Models may:

Misidentify demographics
Reinforce stereotypes
Generate biased descriptions

Privacy Concerns

Images may contain:

Faces
Sensitive documents
Personal information

Organizations must protect privacy.

Harmful Content

Images may contain:

Violence
Explicit material
Hate symbols

Azure AI Content Safety

Microsoft provides:
Azure AI Content Safety

to help detect:

Harmful imagery
Unsafe prompts
Policy violations

Human-in-the-Loop Review

Organizations often require manual review for:

Medical systems
Legal workflows
Public-facing accessibility systems
High-risk applications

Performance Considerations

Captioning performance depends on:

Image size
Batch size
Model complexity
Prompt size
GPU availability

GPU Acceleration

Captioning systems commonly use GPUs because of:

Parallel inference
Large-scale vision processing
Transformer model acceleration

Optimization Techniques

Image Resizing

Reduce unnecessary resolution.

Batch Processing

Process multiple images simultaneously.

Caching

Reuse frequently analyzed assets.

Asynchronous Processing

Improve application responsiveness.

Azure Services for Captioning Workflows

Azure OpenAI Service

Supports:

Multimodal reasoning
Prompt-based caption generation
Visual understanding

Azure AI Vision

Supports:

Image analysis
Caption generation
OCR
Object detection

Azure AI Foundry

Supports:

Workflow orchestration
Prompt flows
AI evaluation pipelines

Azure Blob Storage

Frequently used for:

Image storage
Caption metadata storage
Workflow integration

Azure Functions

Often used for:

Trigger-based processing
Batch orchestration
Event-driven workflows

Observability and Monitoring

Production systems should monitor:

Caption latency
GPU utilization
Failed requests
Caption quality metrics
Safety violations
Operational costs

Best Practices for Captioning Solutions

Use Clear Prompts

Specific prompts improve caption quality.

Match Caption Length to Use Case

Use concise or detailed captions appropriately.

Validate Outputs

Check for hallucinations and unsafe content.

Support Accessibility Standards

Generate meaningful alt text.

Use Human Review for Sensitive Workflows

Especially important in regulated industries.

Optimize for Cost and Performance

Balance detail level with operational efficiency.

Maintain Audit Logs

Track prompts, outputs, and moderation actions.

Real-World Example

An e-commerce retailer may implement a workflow that:

Uploads product images
Uses OCR to extract visible labels
Generates concise captions for product listings
Generates detailed captions for accessibility support
Runs content safety validation
Stores captions in Blob Storage

This demonstrates:

Single-image captioning
Multi-purpose caption generation
Accessibility support
Workflow orchestration

Exam Tips for AI-103

For the AI-103 exam, remember these important concepts:

Image captioning generates natural-language descriptions of visual content.
Concise captions provide short summaries.
Detailed captions provide richer contextual descriptions.
Dense captioning describes multiple regions or objects.
Multimodal prompting guides caption behavior.
OCR can enhance captioning workflows.
Hallucinations occur when models describe nonexistent objects.
Accessibility is a major use case for captioning systems.
Azure AI Vision supports image captioning and OCR.
Azure AI Content Safety helps moderate unsafe visual content.
Human review may be needed for sensitive workflows.

Practice Exam Questions

Question 1

What is image captioning?

A. Compressing image files
B. Generating natural-language descriptions from images
C. Encrypting image metadata
D. Rendering video animations

Answer

B. Generating natural-language descriptions from images

Explanation

Image captioning converts visual information into descriptive text.

Question 2

What is the primary advantage of concise captions?

A. Increased GPU usage
B. Faster readability and lower token usage
C. Higher rendering latency
D. Improved encryption

Answer

B. Faster readability and lower token usage

Explanation

Concise captions are shorter and easier to process quickly.

Question 3

What is dense captioning?

A. Compressing images at higher density
B. Describing multiple regions or objects within an image
C. Encrypting image outputs
D. Converting images into spreadsheets

Answer

B. Describing multiple regions or objects within an image

Explanation

Dense captioning generates descriptions for several objects or regions in one image.

Question 4

What is a common accessibility use case for image captioning?

A. GPU optimization
B. Alt text generation for screen readers
C. Database indexing
D. Network compression

Answer

B. Alt text generation for screen readers

Explanation

Captions improve accessibility for visually impaired users.

Question 5

What is a hallucination in image captioning?

A. A rendering optimization technique
B. Describing objects or actions not actually present
C. Compressing captions automatically
D. Encrypting generated text

Answer

B. Describing objects or actions not actually present

Explanation

Hallucinations occur when models generate inaccurate descriptions.

Question 6

Which Azure service supports image captioning and OCR?

A. Azure AI Vision
B. Azure DNS
C. Azure CDN
D. Azure Firewall

Answer

A. Azure AI Vision

Explanation

Azure AI Vision supports caption generation, OCR, and image analysis.

Question 7

Why might an application use detailed captions instead of concise captions?

A. To reduce context and detail
B. To provide richer scene understanding and accessibility support
C. To eliminate GPU usage
D. To compress image metadata

Answer

B. To provide richer scene understanding and accessibility support

Explanation

Detailed captions provide more contextual information.

Question 8

What is the purpose of multimodal prompting in captioning workflows?

A. Encrypting image data
B. Combining images and text instructions to guide caption generation
C. Compressing captions automatically
D. Eliminating storage requirements

Answer

B. Combining images and text instructions to guide caption generation

Explanation

Multimodal prompts help control caption style and content.

Question 9

Which Azure service commonly stores generated captions and image assets?

A. Azure Blob Storage
B. Azure Virtual WAN
C. Azure DNS
D. Azure Firewall

Answer

A. Azure Blob Storage

Explanation

Azure Blob Storage is commonly used for storing images and generated metadata.

Question 10

What is a major Responsible AI concern in captioning systems?

A. Bias and inaccurate descriptions
B. Reduced SQL query speed
C. Lower network throughput
D. GPU cooling issues

Answer

A. Bias and inaccurate descriptions

Explanation

Captioning systems may produce biased or incorrect descriptions that affect users.

Go to the AI-103 Exam Prep Hub main page

AI, AI-103, Computer Vision, Microsoft Certification May 25, 2026

Build a solution that analyzes visual context by using multimodal models (AI-103 Exam Prep)

This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Implement computer vision solutions (10–15%)
   --> Design and implement multimodal understanding workflows
      --> Build a solution that analyzes visual context by using multimodal models

Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Modern AI systems increasingly rely on multimodal models that can understand and reason across multiple data types simultaneously, including:

Images
Text
Video
Audio
Documents

For the AI-103 certification exam, you should understand how to build solutions that analyze visual context using multimodal models within Azure AI services.

This includes:

Image understanding
Visual reasoning
Caption generation
Scene interpretation
Visual question answering
Document understanding
Cross-modal reasoning
Multi-input workflows

You should also understand:

Prompt engineering for multimodal systems
Workflow orchestration
Retrieval-augmented generation (RAG)
Responsible AI considerations
Safety controls
Azure services used for multimodal AI

This topic falls under:

“Design and implement multimodal understanding workflows”

What Is a Multimodal Model?

Definition

A multimodal model is an AI model capable of processing and understanding multiple forms of input simultaneously.

Examples include:

Text + image
Video + audio
Image + prompt
Document + visual layout

Unlike traditional single-mode models, multimodal systems can reason across different information types.

What Is Visual Context?

Visual context refers to the meaning and relationships contained within visual data.

This includes:

Objects
Actions
Environments
Spatial relationships
Emotions
Text within images
Scene composition

Example of Visual Context Analysis

An image may contain:

A child holding an umbrella
Rain falling
Vehicles on a street

A multimodal model may infer:

The weather is rainy
The child is outdoors
Traffic conditions may be wet

This goes beyond simple object detection.

Why Multimodal AI Matters

Multimodal systems enable:

Richer AI understanding
Natural human interaction
Improved reasoning
Context-aware responses
Better automation

Common Use Cases

Visual Question Answering (VQA)

Users ask questions about images.

Example:

What is the person holding?

Image Captioning

Automatically generate descriptions for images.

Example:

A dog running through a grassy field

Document Understanding

Analyze:

Forms
Invoices
Receipts
PDFs
Charts

Video Understanding

Interpret:

Scenes
Actions
Motion
Events

Retail and E-Commerce

Analyze:

Products
Shelf layouts
Shopping behavior

Healthcare

Interpret:

Medical imagery
Visual documentation
Diagnostic content

Security and Monitoring

Detect:

Unsafe situations
Intrusions
Operational anomalies

Core Components of Multimodal Workflows

A multimodal workflow commonly includes:

Input acquisition
Data preprocessing
Visual analysis
Prompt engineering
AI reasoning
Response generation
Safety validation
Storage and orchestration

Types of Visual Analysis Tasks

Image Classification

Identifies the primary category of an image.

Example:

Cat
Car
Building

Object Detection

Identifies:

Objects
Locations
Bounding boxes

Scene Understanding

Interprets:

Environments
Activities
Relationships

Optical Character Recognition (OCR)

Extracts text from images or documents.

Examples:

Signs
Receipts
Forms

Visual Reasoning

Combines visual understanding with logical interpretation.

Example:

Is the person likely preparing food?

The model analyzes:

Kitchen items
Actions
Contextual clues

Multimodal Prompt Engineering

What Is Multimodal Prompting?

Multimodal prompting combines:

Visual input
Text instructions

to guide model behavior.

Example Multimodal Prompt

Input:

Product image

Prompt:

Describe the product and identify any visible defects

Effective Prompting Techniques

Be Specific

Good:

Describe all visible safety hazards in the image

Weak:

Describe the image

Request Structured Output

Example:

List detected objects as JSON

Use Contextual Instructions

Example:

Analyze this retail shelf image for out-of-stock products

Visual Grounding

What Is Visual Grounding?

Visual grounding links generated text to specific visual regions.

Example:

Identifying where an object appears in an image

This improves:

Explainability
Accuracy
Traceability

Image Captioning

What Is Image Captioning?

Image captioning generates natural-language descriptions of images.

Example:

A cyclist riding on a mountain trail during sunset

Dense Captioning

Dense captioning describes:

Multiple objects
Regions
Activities

within a single image.

Visual Question Answering (VQA)

What Is VQA?

VQA systems answer questions about visual content.

Example:
Image:

Parking lot

Question:

How many cars are visible?

Chart and Graph Understanding

Multimodal systems can analyze:

Charts
Dashboards
Diagrams
Infographics

Tasks include:

Trend identification
Data extraction
Summarization

Document Intelligence

Multimodal AI can process documents containing:

Text
Tables
Images
Layout structures

Common Document Tasks

Invoice Processing

Extract:

Vendor names
Totals
Dates

Form Extraction

Capture:

Structured fields
Checkboxes
Handwritten text

Contract Analysis

Identify:

Clauses
Dates
Key obligations

Video Understanding

Multimodal models can analyze:

Frame sequences
Motion
Temporal context
Events

Video Analysis Tasks

Scene Detection

Identify scene changes.

Action Recognition

Detect:

Running
Cooking
Driving
Fighting

Event Summarization

Generate video summaries.

Audio + Visual Understanding

Some multimodal workflows combine:

Speech
Visual scenes
Captions
Environmental audio

This enables:

Meeting analysis
Video transcription
Multimedia search

Retrieval-Augmented Generation (RAG)

What Is Multimodal RAG?

Multimodal RAG combines:

Visual retrieval
Text retrieval
AI reasoning

to improve responses.

Example Workflow

User uploads image
System retrieves related product information
Multimodal model analyzes image
AI generates grounded response

Workflow Orchestration

Enterprise multimodal systems often include:

Image preprocessing
OCR pipelines
AI reasoning
Safety checks
Human review
Storage workflows

Example Workflow

User uploads image
OCR extracts visible text
Object detection identifies items
Multimodal model analyzes context
AI generates explanation
Safety validation occurs
Results stored

Responsible AI Considerations

Multimodal systems introduce important Responsible AI concerns.

Bias and Fairness

Models may exhibit:

Cultural bias
Demographic bias
Representation imbalance

Privacy Concerns

Images may contain:

Faces
Personal data
Sensitive documents

Organizations must protect user privacy.

Harmful Content

Visual inputs may contain:

Violence
Hate symbols
Explicit content

Azure AI Content Safety

Microsoft provides:
Azure AI Content Safety

to help detect:

Unsafe imagery
Harmful prompts
Policy violations

Human-in-the-Loop Review

Organizations often require manual review for:

Medical workflows
Legal documents
Public-facing systems
High-risk decisions

Explainability

Multimodal systems should support:

Transparent reasoning
Traceable outputs
Confidence scoring

Performance Considerations

Multimodal workflows may require substantial compute resources.

Factors affecting performance include:

Image resolution
Video length
Model size
Context window size
Retrieval complexity

GPU Acceleration

Multimodal AI commonly relies on GPUs because of:

Parallel processing
Matrix computations
Large-scale inference

Latency Optimization

Optimization techniques include:

Image resizing
Batch processing
Caching
Parallel inference
Streaming analysis

Azure Services for Multimodal Workflows

Azure OpenAI Service

Supports:

Multimodal reasoning
Image understanding
Prompt-based visual analysis
Multi-input AI workflows

Azure AI Foundry

Supports:

Workflow orchestration
Prompt flows
Evaluation pipelines
AI experimentation

Azure AI Vision

Supports:

OCR
Object detection
Image analysis
Scene understanding

Azure AI Document Intelligence

Supports:

Form extraction
Invoice analysis
Layout understanding
Document workflows

Azure Blob Storage

Frequently used for:

Image storage
Video storage
Document storage
Workflow integration

Azure Functions

Often used for:

Trigger-based orchestration
Workflow automation
Event-driven processing

Observability and Monitoring

Production systems should monitor:

Latency
GPU utilization
Failed requests
Safety violations
OCR accuracy
Retrieval performance
Operational cost

Best Practices for Multimodal Workflows

Use Clear Prompts

Specific instructions improve results.

Combine Multiple AI Techniques

Use OCR, object detection, and reasoning together.

Validate Outputs

Apply safety and quality checks.

Optimize Media Size

Large files increase latency and cost.

Use Human Review for Sensitive Workflows

Especially important for regulated industries.

Maintain Audit Logs

Track prompts, outputs, and approvals.

Protect User Privacy

Secure uploaded media and extracted data.

Real-World Example

A retail company may implement a multimodal workflow that:

Uploads shelf images
Uses OCR to read pricing labels
Detects product placement
Uses a multimodal model to identify out-of-stock products
Generates a natural-language summary
Stores results in Blob Storage

This demonstrates:

Visual reasoning
OCR integration
Scene understanding
Workflow orchestration

Exam Tips for AI-103

For the AI-103 exam, remember these important concepts:

Multimodal models process multiple input types simultaneously.
Visual context includes objects, scenes, relationships, and activities.
OCR extracts text from visual content.
Visual Question Answering (VQA) answers questions about images.
Image captioning generates natural-language descriptions.
Multimodal RAG combines retrieval with visual reasoning.
Visual grounding links outputs to image regions.
Azure AI Vision supports object detection and OCR.
Azure AI Document Intelligence supports document workflows.
Azure AI Content Safety helps moderate unsafe content.
Human review may be necessary for sensitive workflows.

Practice Exam Questions

Question 1

What is a multimodal model?

A. A model that only processes text
B. A model that processes multiple data types simultaneously
C. A database indexing engine
D. A GPU scheduling system

Answer

B. A model that processes multiple data types simultaneously

Explanation

Multimodal models can analyze inputs such as images, text, audio, and video together.

Question 2

What does visual context primarily refer to?

A. Network latency statistics
B. Meaning and relationships within visual data
C. File compression metadata
D. Database schemas

Answer

B. Meaning and relationships within visual data

Explanation

Visual context includes objects, environments, actions, and relationships within images or videos.

Question 3

What is the primary purpose of OCR?

A. Compressing images
B. Extracting text from visual content
C. Generating videos automatically
D. Encrypting documents

Answer

B. Extracting text from visual content

Explanation

OCR converts visible text in images or documents into machine-readable text.

Question 4

What is Visual Question Answering (VQA)?

A. A system that creates SQL queries
B. A system that answers questions about visual content
C. A GPU rendering engine
D. A storage optimization method

Answer

B. A system that answers questions about visual content

Explanation

VQA systems combine image understanding with natural-language reasoning.

Question 5

What is visual grounding?

A. Encrypting image files
B. Linking generated outputs to visual regions
C. Reducing GPU utilization
D. Compressing video streams

Answer

B. Linking generated outputs to visual regions

Explanation

Visual grounding connects textual outputs to specific image areas.

Question 6

Which Azure service supports OCR and object detection?

A. Azure AI Vision
B. Azure Firewall
C. Azure DNS
D. Azure Monitor

Answer

A. Azure AI Vision

Explanation

Azure AI Vision supports OCR, image analysis, and object detection.

Question 7

What is a key benefit of multimodal RAG?

A. Eliminating GPU usage
B. Combining retrieval with multimodal reasoning
C. Compressing images automatically
D. Removing prompts from workflows

Answer

B. Combining retrieval with multimodal reasoning

Explanation

Multimodal RAG enhances responses by combining retrieval systems with AI reasoning.

Question 8

Why are GPUs commonly used in multimodal AI systems?

A. GPUs eliminate storage requirements
B. GPUs accelerate parallel inference operations
C. GPUs automatically moderate unsafe content
D. GPUs reduce internet bandwidth usage

Answer

B. GPUs accelerate parallel inference operations

Explanation

Multimodal AI requires large-scale matrix computations well suited for GPUs.

Question 9

Which Azure service helps analyze invoices and forms?

A. Azure AI Document Intelligence
B. Azure CDN
C. Azure DNS
D. Azure Virtual WAN

Answer

A. Azure AI Document Intelligence

Explanation

Azure AI Document Intelligence extracts structured information from documents.

Question 10

What is a key Responsible AI concern for multimodal systems?

A. Deepfake and privacy risks
B. Reduced SQL performance
C. Lower network throughput
D. GPU fan noise

Answer

A. Deepfake and privacy risks

Explanation

Multimodal systems may process sensitive images and generate misleading synthetic content.

Go to the AI-103 Exam Prep Hub main page

AI, AI-103, Microsoft Certification May 25, 2026May 25, 2026

Choose the appropriate Foundry Services for generative tasks, Grounding, Vector Search, Agent Workflows, or Multimodal Processing (AI-103 Exam Prep)

This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Plan and manage an Azure AI solution (25–30%)
   --> Choose the appropriate Foundry services for generative AI and agents
      --> Choose the Appropriate Foundry Services for generative tasks, Grounding, Vector Search, Agent Workflows, or Multimodal Processing

Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

One of the core responsibilities of an Azure AI developer is selecting the correct Azure AI Foundry services and supporting Azure technologies for specific AI workloads.

The AI-103 certification exam places significant emphasis on understanding how Azure AI Foundry services support:

Generative AI tasks
Grounding and Retrieval-Augmented Generation (RAG)
Vector search
AI agent workflows
Multimodal processing

Modern AI solutions are composed of multiple services working together rather than a single AI model.

For example:

A chatbot may require an LLM, vector search, embeddings, grounding, and agent orchestration.
A document assistant may require multimodal processing, OCR, embeddings, and RAG.
An AI agent may require tool calling, memory, orchestration, and workflow management.

Understanding which Foundry services to use in each scenario is critical both for the AI-103 exam and for real-world Azure AI development.

What Is Azure AI Foundry?

Azure AI Foundry is Microsoft’s unified AI development platform for:

Building AI applications
Developing AI agents
Managing models
Orchestrating workflows
Evaluating AI systems
Implementing responsible AI controls

Azure AI Foundry provides:

Model access
Prompt engineering tools
Agent frameworks
Retrieval and grounding tools
Evaluation systems
Safety controls
Deployment and monitoring capabilities

It integrates with many Azure AI services including:

Azure OpenAI
Azure AI Search
Azure AI Vision
Azure AI Language
Azure AI Document Intelligence
Azure AI Content Safety

Understanding the Core Service Categories

For the AI-103 exam, you should understand how Foundry services align to these major AI solution categories:

Generative AI services
Grounding and RAG services
Vector search services
Agent workflow services
Multimodal processing services
Evaluation and safety services

Generative AI Services

What Are Generative AI Services?

Generative AI services enable applications to:

Generate text
Summarize content
Create conversations
Produce code
Generate structured outputs
Perform reasoning tasks
Support AI copilots and assistants

The primary Foundry-related service for generative AI is:

Azure OpenAI Service

Azure OpenAI Service

Azure OpenAI provides access to advanced foundation models such as:

GPT models
GPT-4-class reasoning models
Multimodal GPT models
Embedding models
Audio-capable models

Azure OpenAI is commonly used for:

Chatbots
AI copilots
Content generation
AI agents
Coding assistants
Summarization
Question answering

When to Use Azure OpenAI

Use Azure OpenAI when the solution requires:

Natural language generation
Conversational AI
Complex reasoning
Function/tool calling
AI agents
Summarization
Code generation
Long-context processing

Example Generative AI Scenario

Scenario

A company wants to create an AI assistant that:

Answers employee questions
Summarizes internal documents
Generates emails
Uses enterprise data

Recommended Services:

Azure OpenAI
Azure AI Search
Embedding models
RAG architecture

Reason:

Azure OpenAI provides the conversational and reasoning capabilities.

Grounding and Retrieval-Augmented Generation (RAG)

What Is Grounding?

Grounding refers to providing AI models with reliable external data sources so responses are based on factual and current information.

Without grounding, LLMs may:

Hallucinate
Provide outdated information
Generate inaccurate answers

Grounding improves:

Accuracy
Relevance
Reliability
Enterprise trustworthiness

What Is Retrieval-Augmented Generation (RAG)?

RAG combines:

Retrieval systems
Embedding models
Vector search
Generative AI models

The workflow typically includes:

Convert documents into embeddings
Store vectors in a vector index
Convert user query into embeddings
Retrieve relevant content
Inject retrieved content into the LLM prompt
Generate grounded response

Azure Services Used for RAG

Common Azure services used for grounding and RAG include:

Azure AI Search
Azure OpenAI
Embedding models
Azure Storage
Azure Cosmos DB (optional)
Azure SQL Database with vector support

Azure AI Search

Azure AI Search is a core service for:

Vector search
Hybrid search
Semantic search
Enterprise retrieval
RAG pipelines

It enables applications to:

Index documents
Perform semantic retrieval
Store vector embeddings
Execute hybrid search queries

Types of Search in Azure AI Search

Keyword Search

Traditional lexical matching.

Example:

Exact term searches

Semantic Search

Understands contextual meaning.

Example:

Searching for “car” may also retrieve “vehicle.”

Vector Search

Uses embeddings to retrieve semantically similar content.

Example:

Finding conceptually similar documents even without exact keywords.

Hybrid Search

Combines:

Keyword search
Semantic ranking
Vector search

Hybrid search often produces the best retrieval quality.

When to Use Azure AI Search

Use Azure AI Search when applications require:

RAG
Semantic retrieval
Vector similarity search
Enterprise document retrieval
Knowledge-base search
Hybrid search scenarios

Example Grounding Scenario

Scenario

A healthcare chatbot must answer questions using the latest internal policy documents.

Recommended Services:

Azure OpenAI
Azure AI Search
Embedding models

Reason:

RAG enables grounded responses using current enterprise documents.

Vector Search Services

What Is Vector Search?

Vector search retrieves information based on semantic similarity rather than exact text matching.

Documents and queries are converted into numerical vectors called embeddings.

Similar meanings produce similar vectors.

Embedding Models

Embedding models transform content into vector representations.

These embeddings support:

Similarity matching
Semantic retrieval
Recommendation systems
RAG pipelines

Azure Services Supporting Vector Search

Azure AI Search

Primary enterprise vector search platform.

Azure Cosmos DB

Can support vector indexing and similarity search.

Useful for:

Globally distributed systems
High-scale AI applications

Azure SQL Database

Supports vector operations in modern AI workloads.

Useful for:

Structured enterprise systems
Integrated relational and AI workloads

Choosing the Correct Vector Search Service

Use Azure AI Search When:

Building enterprise RAG systems
Implementing hybrid search
Using semantic ranking
Creating AI copilots

Use Azure Cosmos DB When:

Global distribution is required
Massive scale is needed
NoSQL flexibility is important

Use Azure SQL Database When:

AI functionality must integrate with relational data
Existing SQL systems already exist

Agent Workflow Services

What Are AI Agents?

AI agents are AI systems capable of:

Reasoning
Planning
Tool usage
Multi-step execution
Task automation
Dynamic decision-making

Unlike basic chatbots, agents can:

Take actions
Call APIs
Use memory
Execute workflows
Interact with systems

Azure AI Foundry Agent Capabilities

Azure AI Foundry supports agent development with:

Tool calling
Function calling
Prompt orchestration
Workflow execution
Agent memory
Retrieval integration

Prompt Flow

Prompt Flow is a key Foundry tool for building:

AI workflows
Prompt chains
Tool orchestration
Agent pipelines
Multi-step AI systems

Prompt Flow helps developers:

Test prompts
Connect services
Evaluate outputs
Build reusable workflows

Tool Calling and Function Calling

LLMs can interact with external systems using:

Tool calling
Function calling

Examples:

Query databases
Call REST APIs
Retrieve documents
Send emails
Trigger workflows

This is a critical AI-103 topic.

Agent Workflow Scenario

Scenario

An AI travel assistant must:

Search flights
Check hotel pricing
Access calendars
Generate itineraries

Recommended Services:

Azure OpenAI
Prompt Flow
Agent orchestration tools
Tool/function calling

Reason:

This solution requires multi-step agent workflows.

Multimodal Processing Services

What Is Multimodal Processing?

Multimodal AI systems process multiple types of input such as:

Text
Images
Audio
Video
Documents

These systems combine multiple modalities to improve understanding.

Azure Services for Multimodal Processing

Common services include:

Azure OpenAI multimodal models
Azure AI Vision
Azure AI Document Intelligence
Azure AI Speech

Azure AI Vision

Azure AI Vision supports:

Image analysis
Object detection
OCR
Face analysis
Caption generation
Scene understanding

Use Azure AI Vision when applications require:

Image processing
Computer vision
OCR tasks
Visual analysis

Azure AI Document Intelligence

Azure AI Document Intelligence extracts structured information from documents such as:

Invoices
Receipts
Contracts
Forms
IDs

Capabilities include:

OCR
Key-value extraction
Layout analysis
Table extraction
Custom models

Azure AI Speech

Azure AI Speech supports:

Speech-to-text
Text-to-speech
Translation
Voice assistants
Real-time transcription

Choosing the Correct Multimodal Service

Use Azure AI Vision When:

Analyzing images
Detecting objects
Extracting text from images

Use Azure AI Document Intelligence When:

Extracting structured document data
Processing forms and invoices
Understanding layouts and tables

Use Azure AI Speech When:

Processing voice input
Building voice assistants
Performing speech transcription

Use Azure OpenAI Multimodal Models When:

Combining conversational reasoning with image understanding
Performing multimodal interactions
Building advanced AI assistants

Safety and Responsible AI Services

AI solutions require safety and governance.

Azure AI Foundry includes services such as:

Azure AI Content Safety
Content filtering
Prompt injection detection
Harm detection

These services help:

Detect unsafe content
Prevent abuse
Improve compliance
Support responsible AI development

Evaluation and Monitoring Services

Azure AI Foundry provides evaluation tools for:

Groundedness
Relevance
Accuracy
Latency
Cost
Toxicity
Hallucination detection

Evaluation is important because AI quality can vary significantly.

Choosing the Correct Foundry Service

The AI-103 exam frequently tests scenario-based service selection.

Scenario 1: Enterprise Knowledge Chatbot

Requirements:

Conversational AI
Enterprise document grounding
Semantic retrieval

Recommended Services:

Azure OpenAI
Azure AI Search
Embedding models

Scenario 2: Invoice Processing System

Requirements:

OCR
Table extraction
Structured document understanding

Recommended Services:

Azure AI Document Intelligence

Scenario 3: AI Agent with Workflow Automation

Requirements:

Tool usage
API calls
Multi-step execution

Recommended Services:

Azure OpenAI
Prompt Flow
Agent orchestration tools

Scenario 4: Image Analysis Application

Requirements:

Object detection
Image captioning
OCR

Recommended Services:

Azure AI Vision

Scenario 5: Semantic Product Search

Requirements:

Similarity search
Semantic retrieval
Vector indexing

Recommended Services:

Azure AI Search
Embedding models

Common AI-103 Exam Tips

Understand Service Roles

Know which services specialize in:

Generative AI
Retrieval
Search
Vision
Speech
Documents
Agent workflows

Know Common Service Pairings

Azure OpenAI + Azure AI Search

Used for:

RAG systems
Enterprise chatbots
Knowledge assistants

Azure OpenAI + Prompt Flow

Used for:

AI agents
Multi-step workflows
Tool orchestration

Azure AI Vision + Azure OpenAI

Used for:

Multimodal assistants
Visual question answering

Remember Hybrid Search

Hybrid search combines:

Vector search
Keyword search
Semantic ranking

This is commonly tested on AI-103.

Know When Specialized Services Are Better

Example:

Azure AI Document Intelligence is better for invoice extraction than using only a general-purpose LLM.

Summary

Selecting the appropriate Azure AI Foundry services is essential for building scalable, accurate, and cost-effective AI applications.

For the AI-103 exam, you should understand:

Which services support generative AI
How grounding and RAG work
When to use vector search
How AI agents are orchestrated
Which services support multimodal processing
How Azure AI Search integrates into enterprise AI systems
How Prompt Flow supports AI workflows
The role of specialized services like Vision and Document Intelligence

Strong service-selection skills are critical for both certification success and real-world Azure AI solution development.

Practice Exam Questions

Question 1

Which Azure service is MOST commonly used to provide generative AI chat capabilities?

A. Azure AI Search
B. Azure OpenAI
C. Azure AI Vision
D. Azure Monitor

Answer

B. Azure OpenAI

Explanation

Azure OpenAI provides access to GPT-based generative AI models.

Question 2

What is the primary purpose of Retrieval-Augmented Generation (RAG)?

A. Reduce GPU usage
B. Improve groundedness using retrieved data
C. Replace embeddings
D. Eliminate vector search

Answer

B. Improve groundedness using retrieved data

Explanation

RAG retrieves relevant information to ground LLM responses.

Question 3

Which Azure service is MOST appropriate for vector search and semantic retrieval?

A. Azure AI Search
B. Azure Backup
C. Azure DNS
D. Azure Automation

Answer

A. Azure AI Search

Explanation

Azure AI Search provides vector indexing and semantic retrieval capabilities.

Question 4

Which Foundry tool is designed for building multi-step AI workflows and prompt orchestration?

A. Azure Policy
B. Prompt Flow
C. Azure Backup
D. Azure DevOps

Answer

B. Prompt Flow

Explanation

Prompt Flow supports orchestration of prompts, tools, and workflows.

Question 5

A solution must extract tables and key-value pairs from invoices. Which service is MOST appropriate?

A. Azure AI Vision
B. Azure AI Document Intelligence
C. Azure Monitor
D. Azure AI Search

Answer

B. Azure AI Document Intelligence

Explanation

Document Intelligence specializes in structured document extraction.

Question 6

Which capability allows an LLM to interact with APIs and external systems?

A. OCR
B. Function calling
C. Vectorization
D. Semantic ranking

Answer

B. Function calling

Explanation

Function calling enables AI models to invoke external tools and APIs.

Question 7

Which Azure service is MOST appropriate for image analysis and object detection?

A. Azure AI Vision
B. Azure AI Search
C. Azure Cosmos DB
D. Azure SQL Database

Answer

A. Azure AI Vision

Explanation

Azure AI Vision provides computer vision capabilities.

Question 8

What is the main purpose of embeddings in AI applications?

A. Image generation
B. Semantic vector representation
C. Text-to-speech conversion
D. Function orchestration

Answer

B. Semantic vector representation

Explanation

Embeddings convert content into vectors for semantic similarity operations.

Question 9

Which search method combines vector search, keyword search, and semantic ranking?

A. Lexical search
B. OCR search
C. Hybrid search
D. Binary search

Answer

C. Hybrid search

Explanation

Hybrid search combines multiple retrieval methods for improved results.

Question 10

Which Azure AI service is MOST appropriate for speech-to-text transcription?

A. Azure AI Speech
B. Azure AI Search
C. Azure AI Vision
D. Azure Policy

Answer

A. Azure AI Speech

Explanation

Azure AI Speech provides speech recognition and transcription capabilities.

Go to the AI-103 Exam Prep Hub main page

AI, AI-901, Artificial Intelligence (AI), Computer Vision, Microsoft Certification May 18, 2026

Interpret visual input in prompts by using a deployed multimodal model (AI-901 Exam Prep)

This post is a part of the AI-901: Microsoft Azure AI Fundamentals Exam Prep Hub. 
This topic falls under these sections:
Implement AI solutions by using Microsoft Foundry (55–60%)
   --> Implement AI solutions with computer vision and image-generation capabilities by using Foundry
      --> Interpret visual input in prompts by using a deployed multimodal model

Note that there are 10 practice questions (with answers and explanations) for each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available on the hub below the exam topics section.

Modern AI systems are increasingly capable of understanding not only text and speech, but also visual information such as images and videos. Multimodal AI models combine multiple forms of input to generate intelligent responses and insights.

For the AI-901 certification exam, candidates should understand the foundational concepts behind interpreting visual input in prompts by using deployed multimodal models through Microsoft Azure AI Foundry and related Azure AI services.

This topic falls under the “Implement AI solutions with computer vision and image-generation capabilities by using Foundry” section of the AI-901 exam objectives.

What Is a Multimodal Model?

A multimodal model is an AI model capable of processing multiple types of input and output.

These modalities may include:

Text
Images
Speech/audio
Video

Multimodal models can combine information across different input types to generate responses.

What Is Visual Input?

Visual input refers to image or video data provided to an AI system.

Examples include:

Photographs
Screenshots
Documents
Charts
Diagrams
Videos

Example Visual Prompt

A user uploads a photo and asks:

“What objects are visible in this image?”

The AI analyzes the visual content and generates a response.

Computer Vision

Computer vision is the field of AI focused on enabling systems to interpret and understand visual information.

Computer vision tasks include:

Image classification
Object detection
Facial analysis
Optical character recognition (OCR)
Image captioning

Azure AI Vision

Azure AI Vision provides computer vision capabilities in Azure.

Features include:

Image analysis
OCR
Object detection
Image captioning
Face-related analysis

Azure AI Foundry

Azure AI Foundry provides tools for building and managing multimodal AI applications.

Developers can:

Deploy AI models
Test prompts
Analyze images
Build AI-powered apps

Deployed Models

A deployed model is an AI model made available for real-time use through a cloud endpoint.

Applications communicate with deployed models using APIs.

Visual Prompt Workflow

A common workflow includes:

User uploads image
Application sends image to multimodal model
Model analyzes visual content
Model generates response
Application displays results

Example Workflow

User Uploads Image

A photo of a dog playing in a park

User Prompt

“Describe this image.”

AI Response

“A brown dog is running through a grassy park.”

Image Classification

Image classification identifies the primary category of an image.

Example

Image

Picture of a cat

Classification

“Cat”

Object Detection

Object detection identifies and locates multiple objects within an image.

Example

Image

Street scene

Detected Objects

Car
Bicycle
Traffic light
Pedestrian

Optical Character Recognition (OCR)

OCR extracts text from images or scanned documents.

Example

Image

Photo of a receipt

Extracted Text

Store name
Total amount
Date

Image Captioning

Image captioning generates natural-language descriptions of images.

Example

Image

A child flying a kite

Caption

“A child flying a colorful kite in a field.”

Visual Question Answering

Some multimodal models can answer questions about images.

Example

Prompt

“How many people are in the image?”

The model analyzes the image and generates an answer.

Combining Text and Images

Multimodal systems often combine:

Text prompts
Visual input

This improves contextual understanding.

Example

Image

A restaurant menu

Prompt

“Which item appears to be vegetarian?”

The AI analyzes both the image and the prompt together.

APIs and Endpoints

Applications communicate with deployed multimodal models through:

APIs
Endpoints

These allow images and prompts to be submitted programmatically.

Authentication

Applications must securely authenticate before accessing Azure AI services.

Common methods include:

API keys
Azure credentials
Managed identities

User Interface Components

A lightweight visual AI application may include:

Image upload area
Prompt input box
Results display
Image preview

Real-Time Processing

Many multimodal applications support near real-time image analysis.

This enables interactive user experiences.

Common Real-World Scenarios

Scenario 1: Accessibility Assistant

Goal

Describe visual content for visually impaired users.

Features

Image captioning
OCR
Voice output

Scenario 2: Retail Product Recognition

Goal

Identify products from images.

Features

Object detection
Classification
Product lookup

Scenario 3: Document Processing

Goal

Extract information from scanned forms.

Features

OCR
Text extraction
Data analysis

Scenario 4: Content Moderation

Goal

Identify harmful or unsafe visual content.

Features

Image analysis
Safety filtering
Automated moderation

Responsible AI Considerations

Visual AI applications should follow Responsible AI principles.

Key considerations include:

Privacy
Fairness
Transparency
Inclusiveness
Accountability
Security

Privacy Concerns

Images may contain:

Personal information
Faces
Sensitive documents

Organizations should protect user data appropriately.

Bias and Fairness

Computer vision systems may perform unevenly across:

Skin tones
Age groups
Lighting conditions
Demographics

Organizations should evaluate models carefully for fairness.

Transparency

Users should understand:

AI is analyzing images
AI-generated descriptions may contain errors
Images may be stored or processed in the cloud

Hallucinations

Multimodal AI systems may generate inaccurate visual descriptions.

These incorrect outputs are called hallucinations.

Applications should not assume all AI-generated outputs are accurate.

Error Handling

Applications should handle:

Unsupported image formats
Low-quality images
Network failures
Authentication errors
Rate limits

Image Quality Challenges

Poor image quality can reduce accuracy.

Examples include:

Blurry images
Poor lighting
Occluded objects
Low resolution

Advantages of Visual AI Applications

Benefits include:

Automation
Faster analysis
Accessibility improvements
Improved user experiences
Scalable image processing

Limitations of Visual AI Applications

Challenges include:

Recognition inaccuracies
Bias
Privacy concerns
Hallucinations
Sensitivity to image quality

High-Level Workflow

A simplified workflow includes:

Upload image
Send image and prompt to model
Analyze visual content
Generate response
Display results

Example High-Level Pseudocode

			
image = upload_image()
prompt = get_prompt()
response = analyze_image(image, prompt)
display_response(response)

For AI-901, understanding the workflow is more important than memorizing exact syntax.

Important AI-901 Exam Tips

For the exam, remember these key points:

Multimodal models process multiple data types.
Visual input includes images and video.
Azure AI Vision supports computer vision workloads.
OCR extracts text from images.
Image captioning generates descriptions of images.
Object detection identifies multiple objects in images.
APIs and endpoints connect applications to AI services.
Authentication secures AI access.
Responsible AI principles apply to computer vision systems.
Hallucinations are inaccurate AI-generated outputs.

Quick Knowledge Check

Question 1

What is OCR used for?

Answer

Extracting text from images or scanned documents.

Question 2

What does image captioning do?

Answer

Generates natural-language descriptions of images.

Question 3

Why are multimodal models useful?

Answer

They can process multiple types of input such as text and images together.

Question 4

Why is fairness important in computer vision?

Answer

To reduce biased or uneven performance across different groups of people.

Practice Exam Questions

Question 1

What is a multimodal AI model?

A. A model that processes only text
B. A model capable of processing multiple types of input such as text and images
C. A model used only for networking
D. A model designed exclusively for spreadsheets

Correct Answer

B. A model capable of processing multiple types of input such as text and images

Explanation

Multimodal models can process and combine different forms of input, including text, images, audio, and video.

Why the Other Answers Are Incorrect

A. A model that processes only text

That describes a text-only model.

C. A model used only for networking

Networking is unrelated to multimodal AI.

D. A model designed exclusively for spreadsheets

This is unrelated to AI modalities.

Question 2

Which Azure service provides computer vision capabilities such as image analysis and OCR?

A. Azure AI Vision
B. Azure Backup
C. Azure Virtual Desktop
D. Azure Monitor

Correct Answer

A. Azure AI Vision

Explanation

Azure AI Vision provides computer vision features including OCR, object detection, and image captioning.

Why the Other Answers Are Incorrect

B. Azure Backup

This is a backup service.

C. Azure Virtual Desktop

This provides desktop virtualization.

D. Azure Monitor

This is used for monitoring and diagnostics.

Question 3

What does OCR stand for?

A. Optical Character Recognition
B. Operational Cloud Routing
C. Object Classification Registry
D. Open Compute Rendering

Correct Answer

A. Optical Character Recognition

Explanation

OCR extracts text from images or scanned documents.

Why the Other Answers Are Incorrect

B. Operational Cloud Routing

This is not an AI vision term.

C. Object Classification Registry

This is not the meaning of OCR.

D. Open Compute Rendering

This is unrelated to text extraction.

Question 4

What is the PRIMARY purpose of object detection?

A. To identify and locate objects within an image
B. To translate speech into text
C. To summarize long documents
D. To improve internet speed

Correct Answer

A. To identify and locate objects within an image

Explanation

Object detection identifies multiple objects and their positions within an image.

Why the Other Answers Are Incorrect

B. To translate speech into text

This is a speech recognition task.

C. To summarize long documents

This is a text analysis task.

D. To improve internet speed

Object detection does not affect networking.

Question 5

What does image captioning do?

A. Generates natural-language descriptions of images
B. Converts text into audio
C. Detects malware in files
D. Compresses images automatically

Correct Answer

A. Generates natural-language descriptions of images

Explanation

Image captioning uses AI to describe visual content in natural language.

Why the Other Answers Are Incorrect

B. Converts text into audio

This is speech synthesis.

C. Detects malware in files

This is unrelated to computer vision.

D. Compresses images automatically

Captioning does not perform compression.

Question 6

How do applications typically communicate with deployed multimodal models?

A. Through APIs and endpoints
B. Through USB-only connections
C. Through monitor drivers
D. Through spreadsheet templates

Correct Answer

A. Through APIs and endpoints

Explanation

Applications use APIs and endpoints to send prompts and images to AI services.

Why the Other Answers Are Incorrect

B. Through USB-only connections

Cloud AI services use network communication.

C. Through monitor drivers

These are unrelated to AI communication.

D. Through spreadsheet templates

This is unrelated to AI integration.

Question 7

Why is authentication important when accessing Azure AI services?

A. To secure access to AI resources
B. To increase image resolution
C. To improve keyboard performance
D. To reduce monitor brightness

Correct Answer

A. To secure access to AI resources

Explanation

Authentication ensures that only authorized users and applications can access Azure AI services.

Why the Other Answers Are Incorrect

B. To increase image resolution

Authentication does not affect image quality.

C. To improve keyboard performance

This is unrelated to AI services.

D. To reduce monitor brightness

Authentication does not control display settings.

Question 8

Which Responsible AI concern is especially important when analyzing images?

A. Protecting personal and sensitive visual information
B. Increasing video frame rates
C. Improving printer output quality
D. Accelerating spreadsheet calculations

Correct Answer

A. Protecting personal and sensitive visual information

Explanation

Images may contain faces, documents, or other sensitive information that must be protected.

Why the Other Answers Are Incorrect

B. Increasing video frame rates

This is unrelated to Responsible AI.

C. Improving printer output quality

Printers are unrelated to computer vision ethics.

D. Accelerating spreadsheet calculations

This is unrelated to image analysis.

Question 9

What are hallucinations in multimodal AI systems?

A. Incorrect or fabricated AI-generated outputs
B. Hardware installation failures
C. Internet connectivity issues
D. Audio recording problems

Correct Answer

A. Incorrect or fabricated AI-generated outputs

Explanation

Hallucinations occur when AI generates inaccurate or invented descriptions or answers.

Why the Other Answers Are Incorrect

B. Hardware installation failures

This is unrelated to AI-generated content.

C. Internet connectivity issues

This is a networking problem.

D. Audio recording problems

This relates to audio hardware or software.

Question 10

Which factor can negatively affect computer vision accuracy?

A. Poor image quality
B. Spreadsheet formatting
C. Screen brightness settings
D. Keyboard layout

Correct Answer

A. Poor image quality

Explanation

Blurry images, poor lighting, and low resolution can reduce computer vision accuracy.

Why the Other Answers Are Incorrect

B. Spreadsheet formatting

This does not affect image analysis.

C. Screen brightness settings

This does not directly affect AI image processing.

D. Keyboard layout

Keyboard settings are unrelated to computer vision.

Final Thoughts

Interpreting visual input using deployed multimodal models is an important topic for the AI-901 certification exam. Microsoft expects candidates to understand the foundational concepts behind computer vision and multimodal AI applications, including image analysis, OCR, object detection, image captioning, APIs, authentication, and Responsible AI principles.

Azure AI Vision and Azure AI Foundry provide powerful tools for building intelligent applications capable of understanding and responding to visual information in real-world scenarios.

Go to the AI-901 Exam Prep Hub main page

AI, AI-901, Artificial Intelligence (AI), Microsoft Certification May 18, 2026

Respond to spoken prompts by using a deployed multimodal model (AI-901 Exam Prep)

This post is a part of the AI-901: Microsoft Azure AI Fundamentals Exam Prep Hub. 
This topic falls under these sections:
Implement AI solutions by using Microsoft Foundry (55–60%)
   --> Implement AI solutions for text and speech by using Foundry
      --> Respond to spoken prompts by using a deployed multimodal model

Note that there are 10 practice questions (with answers and explanations) for each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available on the hub below the exam topics section.

Modern AI systems increasingly support multimodal interactions, allowing users to communicate using speech, text, images, and other forms of input. Multimodal AI models can process and combine multiple input types to generate intelligent responses.

For the AI-901 certification exam, candidates should understand the foundational concepts behind responding to spoken prompts by using deployed multimodal AI models within Microsoft Azure AI Foundry and related Azure AI services.

This topic falls under the “Implement AI solutions for text and speech by using Foundry” section of the AI-901 exam objectives.

What Is a Multimodal Model?

A multimodal model is an AI model capable of processing multiple forms of input and output.

Examples of modalities include:

Text
Speech/audio
Images
Video

A multimodal model can combine information from multiple sources to generate responses.

Examples of Multimodal AI Systems

Common examples include:

Voice assistants
AI copilots
Speech-enabled chatbots
Image-and-text AI assistants
Interactive educational tools

What Is a Spoken Prompt?

A spoken prompt is a voice-based user input provided through audio.

Instead of typing a question, the user speaks it aloud.

Example Spoken Prompt

“What is machine learning?”

The AI system converts the speech into text for processing.

Speech Recognition

Speech recognition converts spoken language into text.

This process is often called:

Speech-to-text (STT)
Automatic speech recognition (ASR)

Example Speech Recognition Workflow

Spoken Audio

“What time is the meeting tomorrow?”

Converted Text

“What time is the meeting tomorrow?”

The text is then processed by the AI model.

Speech Synthesis

Speech synthesis converts text into spoken audio.

This process is often called:

Text-to-speech (TTS)

Example

AI Response Text

“The meeting starts at 10 AM.”

Spoken Output

The AI system reads the response aloud.

Azure AI Speech

Azure AI Speech provides speech recognition and speech synthesis capabilities.

Features include:

Speech-to-text
Text-to-speech
Speech translation
Voice generation

Azure AI Foundry

Azure AI Foundry provides tools for building, deploying, and testing AI applications and multimodal solutions.

Basic Workflow for Spoken Prompt Applications

A typical workflow includes:

User speaks into microphone
Speech recognition converts audio to text
Text is sent to deployed multimodal model
AI model generates response
Optional speech synthesis converts response to audio
User hears spoken reply

Example End-to-End Scenario

User Speaks

“Summarize today’s sales report.”

Speech Recognition

Converts audio to text

AI Model

Generates summary

Speech Synthesis

Reads summary aloud

Deployed Models

A deployed model is an AI model made available through a cloud endpoint for real-time use.

Applications interact with deployed models using APIs.

APIs and Endpoints

Applications communicate with deployed models through:

APIs
Endpoints

The application sends requests and receives responses programmatically.

Authentication

Applications must securely authenticate before accessing AI services.

Common methods include:

API keys
Azure credentials
Managed identities

Lightweight Speech Applications

Lightweight speech-enabled applications typically include:

Microphone input
Speech processing
AI response generation
Audio playback

Conversation Context

Many speech-enabled applications maintain context between interactions.

This allows more natural conversations.

Example

User

“Who founded Microsoft?”

User Later

“When was it founded?”

The system remembers that “it” refers to Microsoft.

System Prompts

System prompts guide model behavior.

They help define:

Tone
Personality
Safety rules
Output style

Example System Prompt

“You are a professional customer support assistant.”

Model Parameters

Applications may configure settings such as:

Temperature
Maximum tokens
Top-p sampling

Temperature

Temperature controls response creativity.

Low Temperature	High Temperature
More predictable	More creative
More focused	More varied

Streaming Responses

Some applications stream speech or text responses incrementally.

Streaming improves responsiveness and user experience.

Real-Time Interaction

Speech-enabled AI systems often support real-time interaction.

This creates conversational experiences similar to human dialogue.

Common Real-World Use Cases

Scenario 1: Voice Assistant

Goal

Answer spoken user questions.

Features

Speech recognition
Conversational AI
Spoken responses

Scenario 2: Hands-Free AI Assistant

Goal

Allow users to interact without typing.

Features

Voice commands
Audio responses
Context retention

Scenario 3: Accessibility Support

Goal

Assist users with visual or mobility impairments.

Features

Voice interaction
Spoken guidance
Accessibility improvements

Responsible AI Considerations

Speech-enabled AI applications should follow Responsible AI principles.

Important considerations include:

Privacy
Security
Transparency
Fairness
Inclusiveness
Accountability

Privacy Concerns

Speech applications may process sensitive spoken information.

Organizations should:

Protect audio recordings
Secure conversations
Limit unnecessary data storage

Transparency

Users should understand:

AI is processing speech
Audio may be recorded or analyzed
AI-generated responses may contain inaccuracies

Inclusiveness

Speech systems should support:

Different accents
Languages
Speech patterns
Accessibility needs

Hallucinations

Generative AI models may produce inaccurate or fabricated responses.

These incorrect outputs are called hallucinations.

Applications should not assume all generated responses are correct.

Latency

Speech-enabled applications must minimize delays between:

Speech input
AI processing
Spoken responses

High latency negatively affects user experience.

Error Handling

Applications should handle:

Speech recognition errors
Background noise
Network failures
Authentication issues
Rate limits

Background Noise Challenges

Speech recognition may struggle with:

Loud environments
Multiple speakers
Poor microphone quality

Advantages of Spoken AI Interfaces

Benefits include:

Natural interaction
Hands-free operation
Accessibility improvements
Faster communication
Improved user experience

Limitations of Spoken AI Interfaces

Challenges include:

Speech recognition errors
Accent variability
Noise interference
Privacy concerns
Hallucinations
Latency

High-Level Application Workflow

A simplified workflow includes:

Capture speech
Convert speech to text
Send prompt to model
Receive response
Convert response to speech
Play audio response

Example High-Level Pseudocode

			
audio = capture_audio()
text = speech_to_text(audio)
response = generate_ai_response(text)
speak(response)

For AI-901, understanding the workflow is more important than memorizing exact syntax.

Important AI-901 Exam Tips

For the exam, remember these key points:

Multimodal models process multiple input types.
Spoken prompts use speech as input.
Speech recognition converts speech to text.
Speech synthesis converts text to speech.
Azure AI Speech supports speech workloads.
Azure AI Foundry supports AI application development.
APIs and endpoints connect applications to deployed models.
Authentication secures AI services.
Responsible AI principles apply to speech-enabled systems.
Hallucinations are inaccurate AI-generated outputs.

Quick Knowledge Check

Question 1

What does speech recognition do?

Answer

Converts spoken language into text.

Question 2

What does speech synthesis do?

Answer

Converts text into spoken audio.

Question 3

What is a multimodal model?

Answer

An AI model that processes multiple forms of input and output.

Question 4

Why is inclusiveness important in speech systems?

Answer

To support different accents, languages, and accessibility needs.

Practice Exam Questions

Question 1

What is a multimodal AI model?

A. A model that only processes text
B. A model capable of processing multiple forms of input and output
C. A model used only for spreadsheets
D. A model that stores physical hardware configurations

Correct Answer

B. A model capable of processing multiple forms of input and output

Explanation

Multimodal models can work with different data types such as text, speech, images, and video.

Why the Other Answers Are Incorrect

A. A model that only processes text

That describes a text-only model, not a multimodal model.

C. A model used only for spreadsheets

This is unrelated to AI modalities.

D. A model that stores physical hardware configurations

This is unrelated to AI processing.

Question 2

What is the PRIMARY purpose of speech recognition?

A. To convert speech into text
B. To convert images into audio
C. To increase internet speed
D. To generate video animations

Correct Answer

A. To convert speech into text

Explanation

Speech recognition, also called speech-to-text, converts spoken language into written text.

Why the Other Answers Are Incorrect

B. To convert images into audio

Speech recognition does not process images.

C. To increase internet speed

Speech recognition does not affect networking.

D. To generate video animations

This is unrelated to speech processing.

Question 3

What does speech synthesis perform?

A. Converts text into spoken audio
B. Compresses speech files
C. Detects objects in images
D. Removes network latency

Correct Answer

A. Converts text into spoken audio

Explanation

Speech synthesis, also called text-to-speech, generates spoken audio from text.

Why the Other Answers Are Incorrect

B. Compresses speech files

Compression is unrelated to synthesis.

C. Detects objects in images

This is a computer vision task.

D. Removes network latency

Speech synthesis does not control network performance.

Question 4

Which Azure service provides speech recognition and speech synthesis capabilities?

A. Azure AI Speech
B. Azure Backup
C. Azure Firewall
D. Azure Virtual Machines

Correct Answer

A. Azure AI Speech

Explanation

Azure AI Speech supports speech-to-text, text-to-speech, translation, and related speech capabilities.

Why the Other Answers Are Incorrect

B. Azure Backup

This is a storage protection service.

C. Azure Firewall

This is a security service.

D. Azure Virtual Machines

This provides compute infrastructure.

Question 5

What is the purpose of deploying an AI model?

A. To make the model available for applications through an endpoint
B. To physically install computer hardware
C. To permanently disable the model
D. To compress training data

Correct Answer

A. To make the model available for applications through an endpoint

Explanation

Deployment allows applications to access AI models for real-time use.

Why the Other Answers Are Incorrect

B. To physically install computer hardware

Deployment is typically cloud-based.

C. To permanently disable the model

Deployment enables usage rather than disabling it.

D. To compress training data

Deployment does not compress datasets.

Question 6

How do applications typically communicate with deployed AI models?

A. Through APIs and endpoints
B. Through USB-only connections
C. Through monitor settings
D. Through printer drivers

Correct Answer

A. Through APIs and endpoints

Explanation

Applications use APIs connected to endpoints to exchange requests and responses with AI models.

Why the Other Answers Are Incorrect

B. Through USB-only connections

Cloud AI systems use network communication.

C. Through monitor settings

These are unrelated to AI communication.

D. Through printer drivers

Printer drivers are unrelated to AI APIs.

Question 7

Why is conversation context important in speech-enabled AI systems?

A. It allows the AI to remember previous interactions
B. It improves monitor brightness
C. It increases microphone volume automatically
D. It reduces file storage size

Correct Answer

A. It allows the AI to remember previous interactions

Explanation

Maintaining context helps create more natural and coherent conversations.

Why the Other Answers Are Incorrect

B. It improves monitor brightness

Conversation context does not affect displays.

C. It increases microphone volume automatically

This is unrelated to conversation memory.

D. It reduces file storage size

Context retention does not compress files.

Question 8

Which Responsible AI concern is especially important for speech-enabled applications?

A. Protecting sensitive spoken information
B. Increasing screen resolution
C. Accelerating video rendering
D. Improving keyboard layouts

Correct Answer

A. Protecting sensitive spoken information

Explanation

Speech-enabled systems may process personal or confidential audio data, making privacy and security important.

Why the Other Answers Are Incorrect

B. Increasing screen resolution

This is unrelated to Responsible AI.

C. Accelerating video rendering

This is unrelated to speech AI.

D. Improving keyboard layouts

Speech systems are not focused on keyboards.

Question 9

What are hallucinations in generative AI systems?

A. Incorrect or fabricated AI-generated responses
B. Hardware overheating events
C. Audio recording failures
D. Slow network connections

Correct Answer

A. Incorrect or fabricated AI-generated responses

Explanation

Hallucinations occur when AI generates information that is inaccurate or invented.

Why the Other Answers Are Incorrect

B. Hardware overheating events

This is unrelated to AI output quality.

C. Audio recording failures

This is a hardware or software issue.

D. Slow network connections

This relates to connectivity, not AI accuracy.

Question 10

What is one advantage of spoken AI interfaces?

A. Hands-free and natural interaction
B. Elimination of all recognition errors
C. Guaranteed perfect accuracy
D. Removal of all privacy concerns

Correct Answer

A. Hands-free and natural interaction

Explanation

Voice-based interfaces provide convenient and natural interaction experiences.

Why the Other Answers Are Incorrect

B. Elimination of all recognition errors

Speech systems can still make mistakes.

C. Guaranteed perfect accuracy

No AI system is perfectly accurate.

D. Removal of all privacy concerns

Speech applications still require privacy protections.

Final Thoughts

Responding to spoken prompts using deployed multimodal models is an important topic for the AI-901 certification exam. Microsoft expects candidates to understand the foundational workflow behind speech-enabled AI applications, including speech recognition, multimodal processing, speech synthesis, APIs, authentication, and Responsible AI principles.

Azure AI Foundry and Azure AI Speech provide powerful tools for building intelligent conversational applications that support natural voice interactions and modern accessibility-focused experiences.

Go to the AI-901 Exam Prep Hub main page