Tag: Multi-modal models

AI, AI-103, Computer Vision, Generative AI, Microsoft Certification May 25, 2026May 25, 2026

Configure apps to produce concise or detailed captions for single or multiple images (AI-103 Exam Prep)

This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Implement computer vision solutions (10–15%)
   --> Design and implement multimodal understanding workflows
      --> Configure apps to produce concise or detailed captions for single or multiple images

Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Modern multimodal AI systems can automatically generate captions that describe visual content in natural language. Captioning capabilities are widely used in:

Accessibility solutions
Content management systems
E-commerce platforms
Media analysis systems
Social media applications
Digital asset management
Search and retrieval systems

For the AI-103 certification exam, you should understand how to configure applications that generate:

Concise captions
Detailed captions
Single-image captions
Multi-image summaries
Context-aware visual descriptions

You should also understand:

Multimodal prompting
Caption customization
Batch image workflows
Accessibility considerations
Responsible AI concerns
Performance optimization
Azure services commonly used for captioning solutions

This topic falls under:

“Design and implement multimodal understanding workflows”

What Is Image Captioning?

Definition

Image captioning is the process of generating natural-language descriptions from visual input.

A captioning system analyzes:

Objects
People
Actions
Relationships
Backgrounds
Contextual information

and produces descriptive text.

Example Caption

Image:

Dog running on a beach

Generated caption:

A golden retriever running along a sandy beach near the ocean

Why Image Captioning Matters

Captioning improves:

Accessibility
Searchability
Automation
User experience
Content organization

Common Use Cases

Accessibility

Captions help visually impaired users understand image content through:

Screen readers
Audio narration
Alternative text (alt text)

E-Commerce

Captioning can automatically describe:

Products
Product conditions
Visual features

Media and Content Management

Organizations use captioning to:

Tag assets
Search images
Organize media libraries

Social Media

Applications generate:

Suggested captions
Content summaries
Automatic alt text

Security and Monitoring

Captioning systems can describe:

Surveillance scenes
Operational events
Safety hazards

Concise vs Detailed Captions

Concise Captions

Concise captions provide short summaries of image content.

Example:

A child riding a bicycle

Advantages of Concise Captions

Benefits include:

Faster reading
Simpler accessibility support
Reduced token usage
Lower latency

Detailed Captions

Detailed captions provide richer contextual descriptions.

Example:

			
A young child wearing a red helmet rides a blue bicycle along a tree-lined suburban street on a sunny afternoon

Advantages of Detailed Captions

Benefits include:

More context
Better search indexing
Improved scene understanding
Enhanced accessibility

Captioning Workflows

A typical captioning workflow includes:

Upload image
Preprocess image
Run visual analysis
Generate caption
Validate output
Store or display caption

Single-Image Captioning

What Is Single-Image Captioning?

Single-image captioning generates descriptions for one image at a time.

This is common in:

Accessibility apps
Social media uploads
Product pages

Example Workflow

User uploads image
Multimodal model analyzes image
App requests concise caption
Caption returned to application

Multi-Image Captioning

What Is Multi-Image Captioning?

Multi-image captioning generates:

Individual captions
Combined summaries
Comparative descriptions

for multiple related images.

Example Use Cases

Product Catalogs

Describe multiple product images together.

Photo Albums

Generate event summaries.

Medical Imaging

Summarize related scans or frames.

Example Multi-Image Summary

Images:

Beach photos from vacation

Generated summary:

A family vacation featuring beach activities, ocean sunsets, and outdoor dining

Dense Captioning

What Is Dense Captioning?

Dense captioning describes multiple objects or regions within a single image.

Example:

Person sitting on bench
Dog nearby
Bicycle leaning against tree

Visual Context in Captioning

Captioning systems analyze:

Objects
Actions
Emotions
Spatial relationships
Scene composition

This enables richer descriptions.

Caption Personalization

Applications may customize captions based on:

Audience
Reading level
Language
Accessibility requirements
Business domain

Example Accessibility Caption

			
A person using a wheelchair enters a modern office building using a wheelchair-accessible ramp

Multimodal Prompting for Captioning

What Is Multimodal Prompting?

Multimodal prompting combines:

Visual input
Text instructions

to guide caption generation.

Example Prompt

Image input:

Retail shelf

Prompt:

Generate a concise inventory-focused caption

Detailed Caption Prompt Example

Generate a highly detailed accessibility-focused description of this image

Prompt Engineering Best Practices

Be Specific

Specific prompts improve:

Accuracy
Relevance
Style consistency

Define Desired Length

Example:

Generate a one-sentence caption

or:

Generate a detailed paragraph describing all visible activities

Request Structured Outputs

Applications may request:

JSON responses
Categorized descriptions
Tagged outputs

Example:

Return caption and detected objects as JSON

Caption Quality Factors

Caption quality depends on:

Image quality
Resolution
Model capability
Prompt clarity
Scene complexity

Challenges in Captioning

Ambiguity

Images may contain unclear or partially visible objects.

Context Limitations

Models may incorrectly infer:

Emotions
Intentions
Activities

Cultural Interpretation

Visual meaning may vary across cultures.

Hallucinations in Captioning

What Are Hallucinations?

Hallucinations occur when models describe objects or actions not actually present.

Example:

Describing a dog that is not visible

Reducing Hallucinations

Strategies include:

Better prompts
Confidence scoring
Human review
Object detection grounding

Caption Evaluation Metrics

Organizations may evaluate captions using:

Accuracy
Relevance
Completeness
Fluency
Accessibility quality

Accessibility Considerations

Captioning systems are important for:

Screen readers
Alt text generation
Inclusive design

Good Accessibility Captions

Good captions should:

Be descriptive
Avoid vague wording
Focus on important details

Weak Caption Example

An image of a thing

Strong Caption Example

A firefighter carrying a child away from a burning building

Batch Captioning Workflows

Enterprise systems often process images in bulk.

Example Batch Workflow

Upload image batch
Queue processing jobs
Generate captions
Validate outputs
Store metadata
Enable search indexing

Workflow Orchestration

Captioning systems often integrate:

OCR
Object detection
Search indexing
Safety filtering
Human review

Example Enterprise Workflow

User uploads image collection
OCR extracts visible text
AI generates captions
Search metadata created
Unsafe content filtered
Results stored

Responsible AI Considerations

Captioning systems introduce important Responsible AI concerns.

Bias and Fairness

Models may:

Misidentify demographics
Reinforce stereotypes
Generate biased descriptions

Privacy Concerns

Images may contain:

Faces
Sensitive documents
Personal information

Organizations must protect privacy.

Harmful Content

Images may contain:

Violence
Explicit material
Hate symbols

Azure AI Content Safety

Microsoft provides:
Azure AI Content Safety

to help detect:

Harmful imagery
Unsafe prompts
Policy violations

Human-in-the-Loop Review

Organizations often require manual review for:

Medical systems
Legal workflows
Public-facing accessibility systems
High-risk applications

Performance Considerations

Captioning performance depends on:

Image size
Batch size
Model complexity
Prompt size
GPU availability

GPU Acceleration

Captioning systems commonly use GPUs because of:

Parallel inference
Large-scale vision processing
Transformer model acceleration

Optimization Techniques

Image Resizing

Reduce unnecessary resolution.

Batch Processing

Process multiple images simultaneously.

Caching

Reuse frequently analyzed assets.

Asynchronous Processing

Improve application responsiveness.

Azure Services for Captioning Workflows

Azure OpenAI Service

Supports:

Multimodal reasoning
Prompt-based caption generation
Visual understanding

Azure AI Vision

Supports:

Image analysis
Caption generation
OCR
Object detection

Azure AI Foundry

Supports:

Workflow orchestration
Prompt flows
AI evaluation pipelines

Azure Blob Storage

Frequently used for:

Image storage
Caption metadata storage
Workflow integration

Azure Functions

Often used for:

Trigger-based processing
Batch orchestration
Event-driven workflows

Observability and Monitoring

Production systems should monitor:

Caption latency
GPU utilization
Failed requests
Caption quality metrics
Safety violations
Operational costs

Best Practices for Captioning Solutions

Use Clear Prompts

Specific prompts improve caption quality.

Match Caption Length to Use Case

Use concise or detailed captions appropriately.

Validate Outputs

Check for hallucinations and unsafe content.

Support Accessibility Standards

Generate meaningful alt text.

Use Human Review for Sensitive Workflows

Especially important in regulated industries.

Optimize for Cost and Performance

Balance detail level with operational efficiency.

Maintain Audit Logs

Track prompts, outputs, and moderation actions.

Real-World Example

An e-commerce retailer may implement a workflow that:

Uploads product images
Uses OCR to extract visible labels
Generates concise captions for product listings
Generates detailed captions for accessibility support
Runs content safety validation
Stores captions in Blob Storage

This demonstrates:

Single-image captioning
Multi-purpose caption generation
Accessibility support
Workflow orchestration

Exam Tips for AI-103

For the AI-103 exam, remember these important concepts:

Image captioning generates natural-language descriptions of visual content.
Concise captions provide short summaries.
Detailed captions provide richer contextual descriptions.
Dense captioning describes multiple regions or objects.
Multimodal prompting guides caption behavior.
OCR can enhance captioning workflows.
Hallucinations occur when models describe nonexistent objects.
Accessibility is a major use case for captioning systems.
Azure AI Vision supports image captioning and OCR.
Azure AI Content Safety helps moderate unsafe visual content.
Human review may be needed for sensitive workflows.

Practice Exam Questions

Question 1

What is image captioning?

A. Compressing image files
B. Generating natural-language descriptions from images
C. Encrypting image metadata
D. Rendering video animations

Answer

B. Generating natural-language descriptions from images

Explanation

Image captioning converts visual information into descriptive text.

Question 2

What is the primary advantage of concise captions?

A. Increased GPU usage
B. Faster readability and lower token usage
C. Higher rendering latency
D. Improved encryption

Answer

B. Faster readability and lower token usage

Explanation

Concise captions are shorter and easier to process quickly.

Question 3

What is dense captioning?

A. Compressing images at higher density
B. Describing multiple regions or objects within an image
C. Encrypting image outputs
D. Converting images into spreadsheets

Answer

B. Describing multiple regions or objects within an image

Explanation

Dense captioning generates descriptions for several objects or regions in one image.

Question 4

What is a common accessibility use case for image captioning?

A. GPU optimization
B. Alt text generation for screen readers
C. Database indexing
D. Network compression

Answer

B. Alt text generation for screen readers

Explanation

Captions improve accessibility for visually impaired users.

Question 5

What is a hallucination in image captioning?

A. A rendering optimization technique
B. Describing objects or actions not actually present
C. Compressing captions automatically
D. Encrypting generated text

Answer

B. Describing objects or actions not actually present

Explanation

Hallucinations occur when models generate inaccurate descriptions.

Question 6

Which Azure service supports image captioning and OCR?

A. Azure AI Vision
B. Azure DNS
C. Azure CDN
D. Azure Firewall

Answer

A. Azure AI Vision

Explanation

Azure AI Vision supports caption generation, OCR, and image analysis.

Question 7

Why might an application use detailed captions instead of concise captions?

A. To reduce context and detail
B. To provide richer scene understanding and accessibility support
C. To eliminate GPU usage
D. To compress image metadata

Answer

B. To provide richer scene understanding and accessibility support

Explanation

Detailed captions provide more contextual information.

Question 8

What is the purpose of multimodal prompting in captioning workflows?

A. Encrypting image data
B. Combining images and text instructions to guide caption generation
C. Compressing captions automatically
D. Eliminating storage requirements

Answer

B. Combining images and text instructions to guide caption generation

Explanation

Multimodal prompts help control caption style and content.

Question 9

Which Azure service commonly stores generated captions and image assets?

A. Azure Blob Storage
B. Azure Virtual WAN
C. Azure DNS
D. Azure Firewall

Answer

A. Azure Blob Storage

Explanation

Azure Blob Storage is commonly used for storing images and generated metadata.

Question 10

What is a major Responsible AI concern in captioning systems?

A. Bias and inaccurate descriptions
B. Reduced SQL query speed
C. Lower network throughput
D. GPU cooling issues

Answer

A. Bias and inaccurate descriptions

Explanation

Captioning systems may produce biased or incorrect descriptions that affect users.

Go to the AI-103 Exam Prep Hub main page

AI, AI-103, Computer Vision, Microsoft Certification May 25, 2026

Build a solution that analyzes visual context by using multimodal models (AI-103 Exam Prep)

This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Implement computer vision solutions (10–15%)
   --> Design and implement multimodal understanding workflows
      --> Build a solution that analyzes visual context by using multimodal models

Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Modern AI systems increasingly rely on multimodal models that can understand and reason across multiple data types simultaneously, including:

Images
Text
Video
Audio
Documents

For the AI-103 certification exam, you should understand how to build solutions that analyze visual context using multimodal models within Azure AI services.

This includes:

Image understanding
Visual reasoning
Caption generation
Scene interpretation
Visual question answering
Document understanding
Cross-modal reasoning
Multi-input workflows

You should also understand:

Prompt engineering for multimodal systems
Workflow orchestration
Retrieval-augmented generation (RAG)
Responsible AI considerations
Safety controls
Azure services used for multimodal AI

This topic falls under:

“Design and implement multimodal understanding workflows”

What Is a Multimodal Model?

Definition

A multimodal model is an AI model capable of processing and understanding multiple forms of input simultaneously.

Examples include:

Text + image
Video + audio
Image + prompt
Document + visual layout

Unlike traditional single-mode models, multimodal systems can reason across different information types.

What Is Visual Context?

Visual context refers to the meaning and relationships contained within visual data.

This includes:

Objects
Actions
Environments
Spatial relationships
Emotions
Text within images
Scene composition

Example of Visual Context Analysis

An image may contain:

A child holding an umbrella
Rain falling
Vehicles on a street

A multimodal model may infer:

The weather is rainy
The child is outdoors
Traffic conditions may be wet

This goes beyond simple object detection.

Why Multimodal AI Matters

Multimodal systems enable:

Richer AI understanding
Natural human interaction
Improved reasoning
Context-aware responses
Better automation

Common Use Cases

Visual Question Answering (VQA)

Users ask questions about images.

Example:

What is the person holding?

Image Captioning

Automatically generate descriptions for images.

Example:

A dog running through a grassy field

Document Understanding

Analyze:

Forms
Invoices
Receipts
PDFs
Charts

Video Understanding

Interpret:

Scenes
Actions
Motion
Events

Retail and E-Commerce

Analyze:

Products
Shelf layouts
Shopping behavior

Healthcare

Interpret:

Medical imagery
Visual documentation
Diagnostic content

Security and Monitoring

Detect:

Unsafe situations
Intrusions
Operational anomalies

Core Components of Multimodal Workflows

A multimodal workflow commonly includes:

Input acquisition
Data preprocessing
Visual analysis
Prompt engineering
AI reasoning
Response generation
Safety validation
Storage and orchestration

Types of Visual Analysis Tasks

Image Classification

Identifies the primary category of an image.

Example:

Cat
Car
Building

Object Detection

Identifies:

Objects
Locations
Bounding boxes

Scene Understanding

Interprets:

Environments
Activities
Relationships

Optical Character Recognition (OCR)

Extracts text from images or documents.

Examples:

Signs
Receipts
Forms

Visual Reasoning

Combines visual understanding with logical interpretation.

Example:

Is the person likely preparing food?

The model analyzes:

Kitchen items
Actions
Contextual clues

Multimodal Prompt Engineering

What Is Multimodal Prompting?

Multimodal prompting combines:

Visual input
Text instructions

to guide model behavior.

Example Multimodal Prompt

Input:

Product image

Prompt:

Describe the product and identify any visible defects

Effective Prompting Techniques

Be Specific

Good:

Describe all visible safety hazards in the image

Weak:

Describe the image

Request Structured Output

Example:

List detected objects as JSON

Use Contextual Instructions

Example:

Analyze this retail shelf image for out-of-stock products

Visual Grounding

What Is Visual Grounding?

Visual grounding links generated text to specific visual regions.

Example:

Identifying where an object appears in an image

This improves:

Explainability
Accuracy
Traceability

Image Captioning

What Is Image Captioning?

Image captioning generates natural-language descriptions of images.

Example:

A cyclist riding on a mountain trail during sunset

Dense Captioning

Dense captioning describes:

Multiple objects
Regions
Activities

within a single image.

Visual Question Answering (VQA)

What Is VQA?

VQA systems answer questions about visual content.

Example:
Image:

Parking lot

Question:

How many cars are visible?

Chart and Graph Understanding

Multimodal systems can analyze:

Charts
Dashboards
Diagrams
Infographics

Tasks include:

Trend identification
Data extraction
Summarization

Document Intelligence

Multimodal AI can process documents containing:

Text
Tables
Images
Layout structures

Common Document Tasks

Invoice Processing

Extract:

Vendor names
Totals
Dates

Form Extraction

Capture:

Structured fields
Checkboxes
Handwritten text

Contract Analysis

Identify:

Clauses
Dates
Key obligations

Video Understanding

Multimodal models can analyze:

Frame sequences
Motion
Temporal context
Events

Video Analysis Tasks

Scene Detection

Identify scene changes.

Action Recognition

Detect:

Running
Cooking
Driving
Fighting

Event Summarization

Generate video summaries.

Audio + Visual Understanding

Some multimodal workflows combine:

Speech
Visual scenes
Captions
Environmental audio

This enables:

Meeting analysis
Video transcription
Multimedia search

Retrieval-Augmented Generation (RAG)

What Is Multimodal RAG?

Multimodal RAG combines:

Visual retrieval
Text retrieval
AI reasoning

to improve responses.

Example Workflow

User uploads image
System retrieves related product information
Multimodal model analyzes image
AI generates grounded response

Workflow Orchestration

Enterprise multimodal systems often include:

Image preprocessing
OCR pipelines
AI reasoning
Safety checks
Human review
Storage workflows

Example Workflow

User uploads image
OCR extracts visible text
Object detection identifies items
Multimodal model analyzes context
AI generates explanation
Safety validation occurs
Results stored

Responsible AI Considerations

Multimodal systems introduce important Responsible AI concerns.

Bias and Fairness

Models may exhibit:

Cultural bias
Demographic bias
Representation imbalance

Privacy Concerns

Images may contain:

Faces
Personal data
Sensitive documents

Organizations must protect user privacy.

Harmful Content

Visual inputs may contain:

Violence
Hate symbols
Explicit content

Azure AI Content Safety

Microsoft provides:
Azure AI Content Safety

to help detect:

Unsafe imagery
Harmful prompts
Policy violations

Human-in-the-Loop Review

Organizations often require manual review for:

Medical workflows
Legal documents
Public-facing systems
High-risk decisions

Explainability

Multimodal systems should support:

Transparent reasoning
Traceable outputs
Confidence scoring

Performance Considerations

Multimodal workflows may require substantial compute resources.

Factors affecting performance include:

Image resolution
Video length
Model size
Context window size
Retrieval complexity

GPU Acceleration

Multimodal AI commonly relies on GPUs because of:

Parallel processing
Matrix computations
Large-scale inference

Latency Optimization

Optimization techniques include:

Image resizing
Batch processing
Caching
Parallel inference
Streaming analysis

Azure Services for Multimodal Workflows

Azure OpenAI Service

Supports:

Multimodal reasoning
Image understanding
Prompt-based visual analysis
Multi-input AI workflows

Azure AI Foundry

Supports:

Workflow orchestration
Prompt flows
Evaluation pipelines
AI experimentation

Azure AI Vision

Supports:

OCR
Object detection
Image analysis
Scene understanding

Azure AI Document Intelligence

Supports:

Form extraction
Invoice analysis
Layout understanding
Document workflows

Azure Blob Storage

Frequently used for:

Image storage
Video storage
Document storage
Workflow integration

Azure Functions

Often used for:

Trigger-based orchestration
Workflow automation
Event-driven processing

Observability and Monitoring

Production systems should monitor:

Latency
GPU utilization
Failed requests
Safety violations
OCR accuracy
Retrieval performance
Operational cost

Best Practices for Multimodal Workflows

Use Clear Prompts

Specific instructions improve results.

Combine Multiple AI Techniques

Use OCR, object detection, and reasoning together.

Validate Outputs

Apply safety and quality checks.

Optimize Media Size

Large files increase latency and cost.

Use Human Review for Sensitive Workflows

Especially important for regulated industries.

Maintain Audit Logs

Track prompts, outputs, and approvals.

Protect User Privacy

Secure uploaded media and extracted data.

Real-World Example

A retail company may implement a multimodal workflow that:

Uploads shelf images
Uses OCR to read pricing labels
Detects product placement
Uses a multimodal model to identify out-of-stock products
Generates a natural-language summary
Stores results in Blob Storage

This demonstrates:

Visual reasoning
OCR integration
Scene understanding
Workflow orchestration

Exam Tips for AI-103

For the AI-103 exam, remember these important concepts:

Multimodal models process multiple input types simultaneously.
Visual context includes objects, scenes, relationships, and activities.
OCR extracts text from visual content.
Visual Question Answering (VQA) answers questions about images.
Image captioning generates natural-language descriptions.
Multimodal RAG combines retrieval with visual reasoning.
Visual grounding links outputs to image regions.
Azure AI Vision supports object detection and OCR.
Azure AI Document Intelligence supports document workflows.
Azure AI Content Safety helps moderate unsafe content.
Human review may be necessary for sensitive workflows.

Practice Exam Questions

Question 1

What is a multimodal model?

A. A model that only processes text
B. A model that processes multiple data types simultaneously
C. A database indexing engine
D. A GPU scheduling system

Answer

B. A model that processes multiple data types simultaneously

Explanation

Multimodal models can analyze inputs such as images, text, audio, and video together.

Question 2

What does visual context primarily refer to?

A. Network latency statistics
B. Meaning and relationships within visual data
C. File compression metadata
D. Database schemas

Answer

B. Meaning and relationships within visual data

Explanation

Visual context includes objects, environments, actions, and relationships within images or videos.

Question 3

What is the primary purpose of OCR?

A. Compressing images
B. Extracting text from visual content
C. Generating videos automatically
D. Encrypting documents

Answer

B. Extracting text from visual content

Explanation

OCR converts visible text in images or documents into machine-readable text.

Question 4

What is Visual Question Answering (VQA)?

A. A system that creates SQL queries
B. A system that answers questions about visual content
C. A GPU rendering engine
D. A storage optimization method

Answer

B. A system that answers questions about visual content

Explanation

VQA systems combine image understanding with natural-language reasoning.

Question 5

What is visual grounding?

A. Encrypting image files
B. Linking generated outputs to visual regions
C. Reducing GPU utilization
D. Compressing video streams

Answer

B. Linking generated outputs to visual regions

Explanation

Visual grounding connects textual outputs to specific image areas.

Question 6

Which Azure service supports OCR and object detection?

A. Azure AI Vision
B. Azure Firewall
C. Azure DNS
D. Azure Monitor

Answer

A. Azure AI Vision

Explanation

Azure AI Vision supports OCR, image analysis, and object detection.

Question 7

What is a key benefit of multimodal RAG?

A. Eliminating GPU usage
B. Combining retrieval with multimodal reasoning
C. Compressing images automatically
D. Removing prompts from workflows

Answer

B. Combining retrieval with multimodal reasoning

Explanation

Multimodal RAG enhances responses by combining retrieval systems with AI reasoning.

Question 8

Why are GPUs commonly used in multimodal AI systems?

A. GPUs eliminate storage requirements
B. GPUs accelerate parallel inference operations
C. GPUs automatically moderate unsafe content
D. GPUs reduce internet bandwidth usage

Answer

B. GPUs accelerate parallel inference operations

Explanation

Multimodal AI requires large-scale matrix computations well suited for GPUs.

Question 9

Which Azure service helps analyze invoices and forms?

A. Azure AI Document Intelligence
B. Azure CDN
C. Azure DNS
D. Azure Virtual WAN

Answer

A. Azure AI Document Intelligence

Explanation

Azure AI Document Intelligence extracts structured information from documents.

Question 10

What is a key Responsible AI concern for multimodal systems?

A. Deepfake and privacy risks
B. Reduced SQL performance
C. Lower network throughput
D. GPU fan noise

Answer

A. Deepfake and privacy risks

Explanation

Multimodal systems may process sensitive images and generate misleading synthetic content.

Go to the AI-103 Exam Prep Hub main page

AI, AI-103, Artificial Intelligence (AI), Generative AI, Microsoft Certification May 25, 2026

Deploy and consume LLMs, small models, code models, and multimodal models (AI-103 Exam Prep)

This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Implement generative AI and agentic solutions (30–35%)
   --> Build generative applications by using Foundry
      --> Deploy and consume LLMs, small models, code models, and multimodal models

Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Modern AI applications rely on a wide variety of AI models.

Different models are optimized for different workloads, including:

Conversational AI
Code generation
Text summarization
Image understanding
Audio processing
Reasoning tasks
Agentic workflows

The AI-103: Develop AI Apps and Agents on Azure certification exam tests your understanding of how to deploy and consume AI models in Azure AI Foundry.

For the AI-103 exam, you should understand:

Large language models (LLMs)
Small language models (SLMs)
Code models
Multimodal models
Model deployment concepts
Model consumption patterns
API-based model access
Endpoint configuration
Performance and cost tradeoffs
Model selection strategies
Responsible AI considerations

What Are Large Language Models (LLMs)?

Large language models are advanced AI systems trained on massive datasets.

LLMs can:

Generate text
Summarize documents
Answer questions
Translate languages
Reason across prompts
Support conversational AI

Common LLM Use Cases

Typical use cases include:

AI assistants
Enterprise chatbots
Content generation
Knowledge retrieval
Agent orchestration
Workflow automation

Characteristics of LLMs

LLMs typically provide:

Strong reasoning
Broad general knowledge
Advanced conversational abilities
Complex instruction following

However, they also:

Require more compute
Cost more to run
May introduce higher latency

What Are Small Language Models (SLMs)?

Small language models are lightweight models optimized for:

Faster inference
Lower cost
Lower latency
Edge deployment
Specialized tasks

Common SLM Use Cases

SLMs are often used for:

Classification
Simple chatbots
Mobile applications
Embedded AI
Lightweight assistants

Benefits of Small Models

Advantages include:

Reduced infrastructure cost
Faster response times
Lower resource requirements
Easier deployment at scale

LLM vs SLM Tradeoffs

LLMs

Best for:

Complex reasoning
Broad knowledge
Multi-step tasks

Tradeoffs:

Higher cost
Higher latency
Larger infrastructure requirements

SLMs

Best for:

Lightweight inference
Narrow tasks
Cost-sensitive workloads

Tradeoffs:

Reduced reasoning capability
Smaller context windows
Less flexibility

What Are Code Models?

Code models are specialized AI models trained for software development tasks.

These models can:

Generate code
Explain code
Complete functions
Debug issues
Convert between languages

Common Code Model Use Cases

Typical scenarios include:

Developer copilots
Code generation
Documentation generation
Test generation
Refactoring assistance

Code Model Capabilities

Code models often support:

Multiple programming languages
Natural language prompts
Code reasoning
Syntax understanding

What Are Multimodal Models?

Multimodal models process multiple types of input.

Examples include:

Text and images
Text and audio
Video and text

Multimodal AI Capabilities

Multimodal models may support:

Image understanding
OCR
Visual question answering
Audio transcription
Speech interaction
Video analysis

Common Multimodal Use Cases

Examples include:

AI vision assistants
Document understanding
Medical imaging analysis
Voice assistants
Image captioning

Model Deployment in Azure AI Foundry

Azure AI Foundry enables developers to:

Discover models
Deploy models
Test models
Monitor deployments
Consume models through APIs

Model Catalogs

Azure AI Foundry provides access to:

Foundation models
Open-source models
Specialized models
Multimodal models

Deployment Concepts

A deployment makes a model available through:

APIs
Endpoints
Applications
Agent workflows

Deployment Types

Common deployment options include:

Managed online deployments
Serverless deployments
Real-time inference endpoints
Batch inference deployments

Real-Time Inference

Real-time inference is used for:

Interactive chat
AI assistants
Live applications
Agent workflows

Batch Inference

Batch inference is used for:

Large-scale document processing
Offline analysis
Scheduled workloads
Bulk content generation

Endpoint Configuration

Deployments expose endpoints for application access.

Endpoints may include:

Authentication
Rate limits
Scaling policies
Monitoring settings

Authentication and Authorization

Applications may access models using:

API keys
Managed identities
Microsoft Entra ID
Role-based access control (RBAC)

Consuming Models Through APIs

Applications consume deployed models using:

REST APIs
SDKs
Client libraries

Prompt-Based Interactions

Generative AI applications commonly interact with models through prompts.

Prompts may include:

Instructions
Context
Examples
Retrieved documents

System Prompts

System prompts define:

AI behavior
Tone
Constraints
Safety policies

Model Parameters

Common inference parameters include:

Temperature
Top-p
Max tokens
Frequency penalty
Presence penalty

Temperature

Temperature controls output randomness.

Lower temperature:

More deterministic
More predictable

Higher temperature:

More creative
More variable

Context Windows

Context windows determine how much information a model can process in a request.

Larger context windows support:

Long conversations
Large documents
Multi-document grounding

Streaming Responses

Streaming enables applications to receive responses incrementally.

Benefits include:

Improved user experience
Faster perceived response times

Grounding Models

Grounding improves factual accuracy by providing trusted data.

Grounded applications commonly use:

Vector search
Retrieval-Augmented Generation (RAG)
Enterprise knowledge sources

Model Selection Considerations

Developers should evaluate:

Accuracy
Cost
Latency
Context size
Reasoning ability
Multimodal support
Scalability

Choosing Between Models

Use LLMs When:

Complex reasoning is required
Broad knowledge is needed
Multi-step workflows are involved

Use SLMs When:

Low latency matters
Cost optimization is critical
Tasks are narrow or repetitive

Use Code Models When:

Building developer tools
Generating code
Supporting programming workflows

Use Multimodal Models When:

Images or audio are required
Visual understanding is needed
Mixed media inputs are processed

Scaling Model Deployments

Scaling strategies may include:

Autoscaling
Regional deployments
Load balancing
Rate limiting

Monitoring Deployments

Organizations should monitor:

Latency
Throughput
Token usage
Errors
Safety events
Cost

Cost Optimization

Cost optimization strategies include:

Choosing smaller models
Limiting token usage
Caching responses
Using batch processing

Responsible AI Considerations

Developers should implement:

Safety filters
Guardrails
Content moderation
Monitoring
Human oversight

Multimodal Safety Concerns

Multimodal systems may require:

Image moderation
OCR filtering
Audio moderation
Content safety evaluation

Agentic AI and Model Consumption

AI agents may use:

LLMs for reasoning
SLMs for lightweight tasks
Code models for automation
Multimodal models for perception

Common AI-103 Deployment Scenarios

Scenario 1: Enterprise Chatbot

Requirements:

Strong reasoning
Long conversations
Grounded responses

Recommended Model:

LLM with RAG

Scenario 2: Mobile AI Assistant

Requirements:

Fast responses
Low cost
Lightweight inference

Recommended Model:

Small language model

Scenario 3: Developer Copilot

Requirements:

Code generation
Programming assistance
Syntax awareness

Recommended Model:

Code model

Scenario 4: Image-Aware AI Assistant

Requirements:

Image analysis
OCR
Text generation

Recommended Model:

Multimodal model

Common AI-103 Exam Tips

Understand Model Categories

Know the differences between:

LLMs
SLMs
Code models
Multimodal models

Learn Deployment Concepts

Understand:

Endpoints
Real-time inference
Batch inference
Scaling

Learn Consumption Patterns

Know:

REST APIs
SDKs
Prompt engineering
System prompts

Understand Cost and Performance Tradeoffs

Know how:

Model size affects cost
Context size affects latency
Scaling impacts performance

Summary

Azure AI Foundry enables developers to deploy and consume a wide range of AI models.

For the AI-103 exam, you should understand:

LLMs
Small language models
Code models
Multimodal models
Deployment options
Model consumption patterns
Prompt engineering
Scaling strategies
Cost optimization
Responsible AI controls

Choosing the right model and deployment strategy is essential for building:

Scalable
Reliable
Efficient
Responsible AI solutions

These concepts are foundational for generative AI and agentic systems on Azure.

Practice Exam Questions

Question 1

What is a primary strength of large language models (LLMs)?

A. Minimal compute usage
B. Complex reasoning and broad knowledge
C. Guaranteed factual accuracy
D. Extremely low latency

Answer

B. Complex reasoning and broad knowledge

Explanation

LLMs excel at reasoning, conversation, and broad knowledge tasks.

Question 2

Which model type is best suited for lightweight, low-cost inference?

A. Large language model
B. Small language model
C. Multimodal model
D. Vision transformer only

Answer

B. Small language model

Explanation

SLMs are optimized for lower latency and reduced cost.

Question 3

Which model type is specifically optimized for programming tasks?

A. Vision model
B. Code model
C. Embedding model
D. Speech model

Answer

B. Code model

Explanation

Code models are trained for software development workflows.

Question 4

What is a defining feature of multimodal models?

A. They only process text
B. They process multiple input types
C. They eliminate inference costs
D. They require no prompting

Answer

B. They process multiple input types

Explanation

Multimodal models handle text, images, audio, and other media.

Question 5

Which deployment type is best for interactive AI chat applications?

A. Batch inference
B. Real-time inference
C. Archive deployment
D. Offline storage deployment

Answer

B. Real-time inference

Explanation

Interactive applications require low-latency real-time inference.

Question 6

What does the temperature parameter control?

A. Network throughput
B. Output randomness and creativity
C. Storage replication
D. GPU memory allocation

Answer

B. Output randomness and creativity

Explanation

Temperature affects how deterministic or creative outputs become.

Question 7

Which technique improves factual accuracy by using trusted data sources?

A. GPU scaling
B. Retrieval-Augmented Generation (RAG)
C. Semantic caching
D. Compression indexing

Answer

B. Retrieval-Augmented Generation (RAG)

Explanation

RAG grounds model outputs using retrieved enterprise data.

Question 8

What is a major benefit of streaming responses?

A. Reduced storage costs
B. Faster perceived response times
C. Elimination of monitoring
D. Improved vector indexing

Answer

B. Faster perceived response times

Explanation

Streaming improves user experience during response generation.

Question 9

Which authentication method supports passwordless access to Azure AI services?

A. Static credentials only
B. Managed identities
C. Anonymous access
D. Embedded API secrets in code

Answer

B. Managed identities

Explanation

Managed identities support secure, keyless authentication.

Question 10

Which model type is most appropriate for image understanding and OCR tasks?

A. Small language model
B. Multimodal model
C. Traditional relational database
D. Static rules engine

Answer

B. Multimodal model

Explanation

Multimodal models process images and text together.

Go to the AI-103 Exam Prep Hub main page

AI, AI-103, Artificial Intelligence (AI), Microsoft Certification May 25, 2026

Choose an appropriate model for each task, including large language models (LLMs), small language models, multimodal models, and Foundry Tools (AI-103 Exam Prep)

This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Plan and manage an Azure AI solution (25–30%)
   --> Choose the appropriate Foundry services for generative AI and agents
      --> Choose an appropriate model for each task, including large language models (LLMs), small language models, multimodal models, and Foundry Tools

Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

One of the most important skills for the AI-103: Develop AI Apps and Agents on Azure certification exam is understanding how to choose the correct AI model and supporting Azure AI Foundry tools for a given business or technical scenario.

Modern AI development is no longer about simply selecting “an AI model.” Instead, developers must evaluate:

The type of task being performed
Cost constraints
Latency requirements
Accuracy expectations
Reasoning complexity
Context window needs
Multimodal capabilities
Deployment environment
Security and governance requirements
Agent orchestration requirements

Azure AI Foundry provides access to multiple categories of models and tools that help developers build generative AI applications and AI agents efficiently.

For the AI-103 exam, you should understand:

When to use Large Language Models (LLMs)
When Small Language Models (SLMs) are preferable
When multimodal models are required
How Azure AI Foundry tools support model selection and orchestration
Tradeoffs between performance, cost, speed, and capability
Common real-world scenarios for each model category

Azure AI Foundry Overview

Azure AI Foundry is Microsoft’s unified platform for building, evaluating, deploying, and managing AI applications and agents.

Azure AI Foundry provides:

Access to foundation models
Agent development capabilities
Prompt engineering tools
Evaluation tools
Safety and content filtering
Retrieval-augmented generation (RAG) support
Fine-tuning capabilities
Monitoring and observability
Integration with Azure AI services

Azure AI Foundry enables developers to:

Compare multiple models
Test prompts
Evaluate outputs
Build AI agents
Connect enterprise data
Deploy scalable AI applications

For the AI-103 exam, understanding the relationship between model capabilities and Azure AI Foundry tools is extremely important.

Understanding Model Categories

The exam focuses heavily on selecting the correct model type for specific tasks.

The major categories include:

Large Language Models (LLMs)
Small Language Models (SLMs)
Multimodal Models
Embedding Models
Specialized Models

Each category serves different purposes.

Large Language Models (LLMs)

What Are Large Language Models?

Large Language Models are advanced AI models trained on massive datasets containing text, code, and other information.

LLMs are designed for:

Natural language understanding
Natural language generation
Complex reasoning
Summarization
Coding assistance
Question answering
Conversational AI
Agent workflows
Content creation

Examples include:

GPT-4 family models
GPT-4o models
GPT-4 Turbo
Phi large models
Other frontier foundation models available in Azure AI Foundry

Characteristics of LLMs

Strengths

LLMs are excellent at:

Complex Reasoning

Examples:

Multi-step problem solving
Data interpretation
Logical analysis
Decision support

Advanced Content Generation

Examples:

Marketing content
Technical documentation
Email drafting
Knowledge-base generation

Conversational Experiences

Examples:

AI chatbots
AI copilots
Virtual assistants
Interactive tutoring systems

Agentic Workflows

LLMs are commonly used as the “reasoning engine” behind AI agents.

They can:

Plan tasks
Determine next actions
Call tools
Use memory
Chain workflows
Interact with APIs

Limitations of LLMs

Although powerful, LLMs have tradeoffs.

Higher Cost

LLMs generally:

Require more compute
Cost more per token
Increase infrastructure expenses

Increased Latency

Larger models may:

Respond more slowly
Increase application response times
Affect real-time user experiences

Resource Requirements

LLMs require:

More GPU resources
More memory
Larger deployments

Overkill for Simple Tasks

Using GPT-4-level reasoning for basic classification or short summarization tasks may be unnecessary and expensive.

When to Use LLMs

Choose an LLM when tasks require:

Advanced reasoning
Long-context understanding
High-quality content generation
Complex conversational behavior
Tool calling and agent orchestration
Coding assistance
Sophisticated summarization
Enterprise copilots

Example LLM Scenarios

Scenario 1: Enterprise AI Copilot

A company wants an AI assistant that:

Reads internal documentation
Answers employee questions
Generates summaries
Explains policies
Uses tools and APIs

Best choice:

Large Language Model with RAG integration

Reason:

Requires reasoning and conversational understanding.

Scenario 2: AI Coding Assistant

A development team needs:

Code generation
Debugging suggestions
Refactoring support
Documentation generation

Best choice:

Advanced LLM

Reason:

Coding tasks require complex contextual reasoning.

Small Language Models (SLMs)

What Are Small Language Models?

Small Language Models are more lightweight AI models optimized for:

Faster responses
Lower costs
Lower resource consumption
Edge deployments
Narrower tasks

Examples include:

Smaller Phi models
Compact transformer-based models
Task-specific lightweight models

Characteristics of SLMs

Strengths

Lower Cost

SLMs:

Consume fewer resources
Cost less to run
Reduce token usage costs

Faster Inference

SLMs typically:

Respond more quickly
Improve responsiveness
Support near real-time interactions

Edge and Mobile Suitability

SLMs may run:

On edge devices
On mobile hardware
In constrained environments

Efficient for Narrow Tasks

SLMs work well for:

Classification
Basic summarization
Intent detection
Simple chat interactions
Lightweight automation

Limitations of SLMs

Reduced Reasoning Ability

Compared to LLMs, SLMs may struggle with:

Complex logic
Long context handling
Multi-step reasoning
Sophisticated conversations

Lower Output Quality

Outputs may:

Be less nuanced
Contain reduced detail
Provide weaker contextual understanding

When to Use SLMs

Choose an SLM when:

Speed is critical
Cost optimization matters
Tasks are relatively simple
Edge deployment is needed
High throughput is required
Lightweight AI experiences are sufficient

Example SLM Scenarios

Scenario 1: Customer Intent Classification

An application classifies support tickets into categories such as:

Billing
Technical support
Returns
Sales

Best choice:

Small Language Model

Reason:

Classification is relatively simple and does not require advanced reasoning.

Scenario 2: Edge Device Assistant

A manufacturing company deploys an AI assistant on factory equipment with limited compute.

Best choice:

Small Language Model

Reason:

Edge environments benefit from lightweight models.

Multimodal Models

What Are Multimodal Models?

Multimodal models can process multiple data types simultaneously.

Examples include:

Text
Images
Audio
Video
Documents

These models combine information across modalities to produce richer outputs.

Capabilities of Multimodal Models

Multimodal models can:

Analyze images and answer questions about them
Generate captions from images
Extract information from documents
Process speech and text together
Understand charts and diagrams
Support visual reasoning

Common Multimodal Tasks

Image Understanding

Examples:

Object detection
Scene analysis
Image captioning
Visual question answering

Document Intelligence

Examples:

Invoice extraction
Receipt processing
Form analysis
OCR workflows

Audio + Text Experiences

Examples:

Voice assistants
Meeting summarization
Speech transcription
Audio analysis

When to Use Multimodal Models

Choose multimodal models when applications involve:

Images and text together
Document processing
Speech interactions
Visual understanding
Cross-modal reasoning

Example Multimodal Scenarios

Scenario 1: Invoice Processing

A company needs to:

Read invoices
Extract totals
Identify vendors
Validate line items

Best choice:

Multimodal document processing model

Reason:

The solution must interpret both layout and text.

Scenario 2: Retail Image Assistant

Users upload photos of products and ask questions about them.

Best choice:

Multimodal model

Reason:

Requires simultaneous image and text understanding.

Embedding Models

What Are Embedding Models?

Embedding models convert text or other content into vector representations.

These vectors capture semantic meaning.

Embedding models are essential for:

Semantic search
Retrieval-Augmented Generation (RAG)
Similarity matching
Recommendation systems
Knowledge retrieval

Retrieval-Augmented Generation (RAG)

RAG combines:

Embedding models
Vector databases
LLMs

Workflow:

Convert documents into embeddings
Store embeddings in a vector index
Convert user query into embeddings
Retrieve relevant content
Send retrieved data to the LLM

RAG improves:

Accuracy
Freshness of information
Enterprise grounding
Hallucination reduction

Specialized Models

Some tasks are better handled by specialized AI models instead of general-purpose LLMs.

Examples:

Translation models
Speech models
OCR models
Vision models
Classification models

Why Specialized Models Matter

Specialized models may provide:

Better accuracy
Lower cost
Faster performance
Simpler deployment

Example:

Using a dedicated OCR service is often more efficient than asking an LLM to read text from images.

Model Selection Factors

The AI-103 exam heavily tests your ability to select the correct model based on requirements.

Factor 1: Task Complexity

Use LLMs For:

Advanced reasoning
Multi-step workflows
Complex conversations

Use SLMs For:

Simple classification
Lightweight interactions
Fast automation

Factor 2: Cost

LLMs

Higher operational cost
More expensive inference

SLMs

Lower operational cost
Better for high-volume workloads

Factor 3: Latency

Low-Latency Requirements

Prefer:

SLMs
Lightweight models

Complex Processing

Prefer:

LLMs

Even if response time increases.

Factor 4: Context Window

Some tasks require processing:

Long documents
Large conversations
Extensive histories

Choose models with larger context windows for:

Legal analysis
Knowledge assistants
Long-form summarization

Factor 5: Multimodal Requirements

If the application involves:

Images
Audio
Video
Documents

Choose multimodal-capable models.

Factor 6: Deployment Environment

Cloud-Hosted Applications

May use:

Large frontier models
GPU-intensive deployments

Edge or Mobile Deployments

Prefer:

Small models
Quantized models
Lightweight inference

Azure AI Foundry Tools

Azure AI Foundry includes numerous tools that support model selection and AI application development.

Model Catalog

The Model Catalog allows developers to:

Browse available models
Compare capabilities
Review benchmarks
Deploy models
Evaluate pricing

The catalog includes:

Microsoft-hosted models
Open-source models
Partner models
Frontier models

Prompt Flow

Prompt Flow helps developers:

Build AI workflows
Chain prompts together
Integrate tools
Evaluate prompts
Test model behavior

Prompt Flow is useful for:

Agent orchestration
RAG pipelines
Multi-step AI workflows

AI Agent Development Tools

Azure AI Foundry supports AI agents that can:

Use tools
Access data
Maintain memory
Perform actions
Execute workflows

Agent frameworks may include:

Tool calling
Function calling
Retrieval integration
Multi-agent orchestration

Evaluation Tools

Evaluation tools help developers assess:

Accuracy
Groundedness
Safety
Relevance
Latency
Cost

Evaluation is critical because model quality varies by task.

Content Safety Tools

Azure AI Foundry includes safety features such as:

Content filtering
Harm detection
Prompt injection detection
Responsible AI controls

These tools help ensure safe AI deployments.

Fine-Tuning Tools

Fine-tuning allows developers to customize models using:

Domain-specific data
Proprietary terminology
Specialized workflows

Fine-tuning may improve:

Accuracy
Consistency
Industry-specific responses

However, fine-tuning also:

Increases cost
Requires data preparation
Adds operational complexity

Choosing Between Prompt Engineering, RAG, and Fine-Tuning

This is a very important AI-103 exam topic.

Prompt Engineering

Use when:

You need quick customization
Tasks are general-purpose
No private data integration is needed

Advantages:

Fast
Cheap
Easy to maintain

RAG

Use when:

You need current or proprietary data
You want grounding in enterprise content
You need dynamic knowledge retrieval

Advantages:

Reduces hallucinations
Keeps knowledge current
Avoids retraining

Fine-Tuning

Use when:

Consistent specialized outputs are required
Domain language is highly unique
Behavioral customization is necessary

Advantages:

Tailored responses
Better domain alignment

Real-World Model Selection Examples

Example 1: FAQ Chatbot

Requirements:

Low cost
Fast responses
Basic conversational support

Best Choice:

Small Language Model + RAG

Example 2: Legal Document Assistant

Requirements:

Long-context understanding
Detailed summarization
Advanced reasoning

Best Choice:

Large Language Model with large context window

Example 3: Mobile AI App

Requirements:

Offline capability
Fast performance
Low resource usage

Best Choice:

Small Language Model

Example 4: Image-Based Customer Support

Requirements:

Analyze uploaded photos
Understand text and images
Generate responses

Best Choice:

Multimodal model

Key AI-103 Exam Tips

Understand Tradeoffs

You should know:

Bigger models are not always better
Simpler tasks may not require advanced LLMs
Cost and latency matter
Specialized models may outperform general models

Know Common Pairings

LLM + RAG

Used for:

Enterprise chatbots
Knowledge assistants
AI copilots

Embeddings + Vector Search

Used for:

Semantic search
Knowledge retrieval
Similarity matching

Multimodal Models

Used for:

Vision AI
Document processing
Audio interactions

Learn the Azure AI Foundry Ecosystem

Know the purpose of:

Model Catalog
Prompt Flow
Evaluation tools
Agent tools
Safety systems
Fine-tuning workflows

Summary

Selecting the correct AI model is one of the most important responsibilities for an Azure AI developer.

For the AI-103 exam, you should understand:

The differences between LLMs and SLMs
When multimodal models are required
How embedding models support RAG
When specialized models outperform general-purpose models
The tradeoffs between cost, speed, and reasoning capability
How Azure AI Foundry tools support AI development and orchestration

In real-world AI systems, choosing the correct model can dramatically improve:

Performance
User experience
Scalability
Operational cost
Reliability
Maintainability

A strong understanding of model selection is essential for designing effective Azure AI applications and AI agents.

Practice Exam Questions

Question 1

A company is building an enterprise AI assistant that must answer complex employee questions using internal documentation and perform multi-step reasoning. Which model type is MOST appropriate?

A. Small Language Model (SLM)
B. Embedding model only
C. Large Language Model (LLM)
D. OCR model

Answer

C. Large Language Model (LLM)

Explanation

Complex reasoning and conversational understanding are best handled by LLMs.

Question 2

Which model type is generally BEST for low-cost, low-latency classification tasks?

A. Large multimodal model
B. Small Language Model (SLM)
C. GPT-4-class reasoning model
D. Vision foundation model

Answer

B. Small Language Model (SLM)

Explanation

SLMs are optimized for lightweight and cost-efficient tasks.

Question 3

A solution must process uploaded invoices and extract totals, vendor names, and line items. Which model type is MOST appropriate?

A. Embedding model
B. Small Language Model
C. Multimodal model
D. Translation model

Answer

C. Multimodal model

Explanation

Invoice extraction requires understanding both layout and text.

Question 4

What is the primary purpose of embedding models?

A. Image generation
B. Semantic vector representation
C. Audio transcription
D. Tool orchestration

Answer

B. Semantic vector representation

Explanation

Embedding models convert content into vectors for semantic search and retrieval.

Question 5

Which Azure AI Foundry tool helps developers chain prompts, integrate tools, and build AI workflows?

A. Azure Monitor
B. Prompt Flow
C. Azure Policy
D. Azure Functions

Answer

B. Prompt Flow

Explanation

Prompt Flow is designed for workflow orchestration and prompt pipelines.

Question 6

A mobile AI application must operate with minimal compute resources and very fast response times. Which model type is MOST appropriate?

A. Large Language Model
B. Small Language Model
C. Large multimodal model
D. High-context reasoning model

Answer

B. Small Language Model

Explanation

SLMs are optimized for lightweight and edge deployments.

Question 7

Which approach is BEST when an AI chatbot must use current enterprise data without retraining the model?

A. Fine-tuning only
B. Prompt engineering only
C. Retrieval-Augmented Generation (RAG)
D. Quantization

Answer

C. Retrieval-Augmented Generation (RAG)

Explanation

RAG retrieves current information dynamically without retraining.

Question 8

Which factor MOST strongly indicates that a multimodal model is required?

A. Need for vector embeddings
B. Need for faster response times
C. Need to process images and text together
D. Need for lower cost

Answer

C. Need to process images and text together

Explanation

Multimodal models handle multiple input modalities simultaneously.

Question 9

What is a major tradeoff of using larger language models?

A. Reduced reasoning capability
B. Lower context windows
C. Increased operational cost
D. Inability to support agents

Answer

C. Increased operational cost

Explanation

Larger models typically require more compute resources and cost more.

Question 10

Which Azure AI Foundry capability helps evaluate model quality, safety, and groundedness?

A. Azure Load Balancer
B. Evaluation tools
C. Azure Backup
D. Traffic Manager

Answer

B. Evaluation tools

Explanation

Evaluation tools assess output quality, safety, and performance metrics.

Go to the AI-103 Exam Prep Hub main page

AI, AI-900, Artificial Intelligence (AI), Large Language Models (LLMs), Microsoft Certification January 31, 2026

Describe features and capabilities of Azure AI Foundry model catalog (AI-900 Exam Prep)

What Is the Azure AI Foundry Model Catalog?

The Azure AI Foundry model catalog (also known as Microsoft Foundry Models) is a centralized, searchable repository of AI models that developers and organizations can use to build generative AI solutions on Azure. It contains hundreds to thousands of models from multiple providers — including Microsoft, OpenAI, Anthropic, Meta, Cohere, DeepSeek, NVIDIA, and more — and provides tools to explore, compare, and deploy them for various AI workloads.

The model catalog is a key feature of Azure AI Foundry because it lets teams discover and evaluate the right models for specific tasks before integrating them into applications.

Key Capabilities of the Model Catalog

🌐 1. Wide and Diverse Model Selection

The catalog includes a broad set of models, such as:

Large language models (LLMs) for text generation and chat
Domain-specific models for legal, medical, or industry tasks
Multimodal models that handle text + images
Reasoning and specialized task models
These models come from multiple providers including Microsoft, OpenAI, Anthropic, Meta, Mistral AI, and more.

This diversity ensures that developers can find models that fit a wide range of use cases, from simple text completion to advanced multi-agent workflows.

🔍 2. Search and Filtering Tools

The model catalog provides tools to help you find the right model by:

Keyword search
Provider and collection filters
Filtering by capabilities (e.g., reasoning, tool calling)
Deployment type (e.g., serverless API vs managed compute)
Inference and fine-tune task types
Industry or domain tags

These filters make it easier to match models to specific AI workloads.

📊 3. Comparison and Benchmarking

The catalog includes features like:

Model performance leaderboards
Benchmark metrics for selected models
Side-by-side comparison tools

This lets organizations evaluate and compare models based on real-world performance metrics before deployment.

This is especially useful when choosing between models for accuracy, cost, or task suitability.

📄 4. Model Cards with Metadata

Each model in the catalog has a model card that provides:

Quick facts about the model
A description
Version and supported data types
Licenses and legal information
Benchmark results (if available)
Deployment status and options

Model cards help users understand model capabilities, constraints, and appropriate use cases.

🚀 5. Multiple Deployment Options

Models in the Foundry catalog can be deployed using:

Serverless API: A “Models as a Service” approach where the model is hosted and managed by Azure, and you pay per API call
Managed compute: Dedicated virtual machines for predictable performance and long-running applications

This gives teams flexibility in choosing cost and performance trade-offs.

⚙️ 6. Integration and Customization

The model catalog isn’t just for discovery — it also supports:

Fine-tuning of models based on your data
Custom deployments within your enterprise environment
Integration with other Azure tools and services, like Azure AI Foundry deployment workflows and AI development tooling

This makes the catalog a foundational piece of end-to-end generative AI development on Azure.

Model Categories in the Catalog

The model catalog is organized into key categories such as:

Models sold directly by Azure: Models hosted and supported by Microsoft with enterprise-grade integration, support, and compliant terms.
Partner and community models: Models developed by external organizations like OpenAI, Anthropic, Meta, or Cohere. These often extend capabilities or offer domain-specific strengths.

This structure helps teams select between fully supported enterprise models and innovative third-party models.

Scenarios Where You Would Use the Model Catalog

The Azure AI Foundry model catalog is especially useful when:

Exploring models for text generation, chat, summarization, or reasoning
Comparing multiple models for accuracy vs cost
Deploying models in different formats (serverless API vs compute)
Integrating models from multiple providers in a single AI pipeline

It is a central discovery and evaluation hub for generative AI on Azure.

How This Relates to AI-900

For the AI-900 exam, you should understand:

The model catalog is a core capability of Azure AI Foundry
It allows discovering, comparing, and deploying models
It supports multiple model providers
It offers deployment options and metadata to guide selection

If a question mentions finding the right generative model for a use case, evaluating model performance, or using a variety of models in Azure, then the Azure AI Foundry model catalog is likely being described.

Summary (Exam Highlights)

Azure AI Foundry model catalog provides discoverability for thousands of AI models.
Models can be filtered, compared, and evaluated.
Catalog entries include useful metadata (model cards) and benchmarking.
Models come from Microsoft and partner providers like OpenAI, Anthropic, Meta, etc.
Deployment options vary between serverless APIs and managed compute.

Go to the Practice Exam Questions for this topic.

Go to the AI-900 Exam Prep Hub main page.