This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Implement generative AI and agentic solutions (30–35%)
   --> Build generative applications by using Foundry
      --> Deploy and consume LLMs, small models, code models, and multimodal models

Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Modern AI applications rely on a wide variety of AI models.

Different models are optimized for different workloads, including:

Conversational AI
Code generation
Text summarization
Image understanding
Audio processing
Reasoning tasks
Agentic workflows

The AI-103: Develop AI Apps and Agents on Azure certification exam tests your understanding of how to deploy and consume AI models in Azure AI Foundry.

For the AI-103 exam, you should understand:

Large language models (LLMs)
Small language models (SLMs)
Code models
Multimodal models
Model deployment concepts
Model consumption patterns
API-based model access
Endpoint configuration
Performance and cost tradeoffs
Model selection strategies
Responsible AI considerations

What Are Large Language Models (LLMs)?

Large language models are advanced AI systems trained on massive datasets.

LLMs can:

Generate text
Summarize documents
Answer questions
Translate languages
Reason across prompts
Support conversational AI

Common LLM Use Cases

Typical use cases include:

AI assistants
Enterprise chatbots
Content generation
Knowledge retrieval
Agent orchestration
Workflow automation

Characteristics of LLMs

LLMs typically provide:

Strong reasoning
Broad general knowledge
Advanced conversational abilities
Complex instruction following

However, they also:

Require more compute
Cost more to run
May introduce higher latency

What Are Small Language Models (SLMs)?

Small language models are lightweight models optimized for:

Faster inference
Lower cost
Lower latency
Edge deployment
Specialized tasks

Common SLM Use Cases

SLMs are often used for:

Classification
Simple chatbots
Mobile applications
Embedded AI
Lightweight assistants

Benefits of Small Models

Advantages include:

Reduced infrastructure cost
Faster response times
Lower resource requirements
Easier deployment at scale

LLM vs SLM Tradeoffs

LLMs

Best for:

Complex reasoning
Broad knowledge
Multi-step tasks

Tradeoffs:

Higher cost
Higher latency
Larger infrastructure requirements

SLMs

Best for:

Lightweight inference
Narrow tasks
Cost-sensitive workloads

Tradeoffs:

Reduced reasoning capability
Smaller context windows
Less flexibility

What Are Code Models?

Code models are specialized AI models trained for software development tasks.

These models can:

Generate code
Explain code
Complete functions
Debug issues
Convert between languages

Common Code Model Use Cases

Typical scenarios include:

Developer copilots
Code generation
Documentation generation
Test generation
Refactoring assistance

Code Model Capabilities

Code models often support:

Multiple programming languages
Natural language prompts
Code reasoning
Syntax understanding

What Are Multimodal Models?

Multimodal models process multiple types of input.

Examples include:

Text and images
Text and audio
Video and text

Multimodal AI Capabilities

Multimodal models may support:

Image understanding
OCR
Visual question answering
Audio transcription
Speech interaction
Video analysis

Common Multimodal Use Cases

Examples include:

AI vision assistants
Document understanding
Medical imaging analysis
Voice assistants
Image captioning

Model Deployment in Azure AI Foundry

Azure AI Foundry enables developers to:

Discover models
Deploy models
Test models
Monitor deployments
Consume models through APIs

Model Catalogs

Azure AI Foundry provides access to:

Foundation models
Open-source models
Specialized models
Multimodal models

Deployment Concepts

A deployment makes a model available through:

APIs
Endpoints
Applications
Agent workflows

Deployment Types

Common deployment options include:

Managed online deployments
Serverless deployments
Real-time inference endpoints
Batch inference deployments

Real-Time Inference

Real-time inference is used for:

Interactive chat
AI assistants
Live applications
Agent workflows

Batch Inference

Batch inference is used for:

Large-scale document processing
Offline analysis
Scheduled workloads
Bulk content generation

Endpoint Configuration

Deployments expose endpoints for application access.

Endpoints may include:

Authentication
Rate limits
Scaling policies
Monitoring settings

Authentication and Authorization

Applications may access models using:

API keys
Managed identities
Microsoft Entra ID
Role-based access control (RBAC)

Consuming Models Through APIs

Applications consume deployed models using:

REST APIs
SDKs
Client libraries

Prompt-Based Interactions

Generative AI applications commonly interact with models through prompts.

Prompts may include:

Instructions
Context
Examples
Retrieved documents

System Prompts

System prompts define:

AI behavior
Tone
Constraints
Safety policies

Model Parameters

Common inference parameters include:

Temperature
Top-p
Max tokens
Frequency penalty
Presence penalty

Temperature

Temperature controls output randomness.

Lower temperature:

More deterministic
More predictable

Higher temperature:

More creative
More variable

Context Windows

Context windows determine how much information a model can process in a request.

Larger context windows support:

Long conversations
Large documents
Multi-document grounding

Streaming Responses

Streaming enables applications to receive responses incrementally.

Benefits include:

Improved user experience
Faster perceived response times

Grounding Models

Grounding improves factual accuracy by providing trusted data.

Grounded applications commonly use:

Vector search
Retrieval-Augmented Generation (RAG)
Enterprise knowledge sources

Model Selection Considerations

Developers should evaluate:

Accuracy
Cost
Latency
Context size
Reasoning ability
Multimodal support
Scalability

Choosing Between Models

Use LLMs When:

Complex reasoning is required
Broad knowledge is needed
Multi-step workflows are involved

Use SLMs When:

Low latency matters
Cost optimization is critical
Tasks are narrow or repetitive

Use Code Models When:

Building developer tools
Generating code
Supporting programming workflows

Use Multimodal Models When:

Images or audio are required
Visual understanding is needed
Mixed media inputs are processed

Scaling Model Deployments

Scaling strategies may include:

Autoscaling
Regional deployments
Load balancing
Rate limiting

Monitoring Deployments

Organizations should monitor:

Latency
Throughput
Token usage
Errors
Safety events
Cost

Cost Optimization

Cost optimization strategies include:

Choosing smaller models
Limiting token usage
Caching responses
Using batch processing

Responsible AI Considerations

Developers should implement:

Safety filters
Guardrails
Content moderation
Monitoring
Human oversight

Multimodal Safety Concerns

Multimodal systems may require:

Image moderation
OCR filtering
Audio moderation
Content safety evaluation

Agentic AI and Model Consumption

AI agents may use:

LLMs for reasoning
SLMs for lightweight tasks
Code models for automation
Multimodal models for perception

Common AI-103 Deployment Scenarios

Scenario 1: Enterprise Chatbot

Requirements:

Strong reasoning
Long conversations
Grounded responses

Recommended Model:

LLM with RAG

Scenario 2: Mobile AI Assistant

Requirements:

Fast responses
Low cost
Lightweight inference

Recommended Model:

Small language model

Scenario 3: Developer Copilot

Requirements:

Code generation
Programming assistance
Syntax awareness

Recommended Model:

Code model

Scenario 4: Image-Aware AI Assistant

Requirements:

Image analysis
OCR
Text generation

Recommended Model:

Multimodal model

Common AI-103 Exam Tips

Understand Model Categories

Know the differences between:

LLMs
SLMs
Code models
Multimodal models

Learn Deployment Concepts

Understand:

Endpoints
Real-time inference
Batch inference
Scaling

Learn Consumption Patterns

Know:

REST APIs
SDKs
Prompt engineering
System prompts

Understand Cost and Performance Tradeoffs

Know how:

Model size affects cost
Context size affects latency
Scaling impacts performance

Summary

Azure AI Foundry enables developers to deploy and consume a wide range of AI models.

For the AI-103 exam, you should understand:

LLMs
Small language models
Code models
Multimodal models
Deployment options
Model consumption patterns
Prompt engineering
Scaling strategies
Cost optimization
Responsible AI controls

Choosing the right model and deployment strategy is essential for building:

Scalable
Reliable
Efficient
Responsible AI solutions

These concepts are foundational for generative AI and agentic systems on Azure.

Practice Exam Questions

Question 1

What is a primary strength of large language models (LLMs)?

A. Minimal compute usage
B. Complex reasoning and broad knowledge
C. Guaranteed factual accuracy
D. Extremely low latency

Answer

B. Complex reasoning and broad knowledge

Explanation

LLMs excel at reasoning, conversation, and broad knowledge tasks.

Question 2

Which model type is best suited for lightweight, low-cost inference?

A. Large language model
B. Small language model
C. Multimodal model
D. Vision transformer only

Answer

B. Small language model

Explanation

SLMs are optimized for lower latency and reduced cost.

Question 3

Which model type is specifically optimized for programming tasks?

A. Vision model
B. Code model
C. Embedding model
D. Speech model

Answer

B. Code model

Explanation

Code models are trained for software development workflows.

Question 4

What is a defining feature of multimodal models?

A. They only process text
B. They process multiple input types
C. They eliminate inference costs
D. They require no prompting

Answer

B. They process multiple input types

Explanation

Multimodal models handle text, images, audio, and other media.

Question 5

Which deployment type is best for interactive AI chat applications?

A. Batch inference
B. Real-time inference
C. Archive deployment
D. Offline storage deployment

Answer

B. Real-time inference

Explanation

Interactive applications require low-latency real-time inference.

Question 6

What does the temperature parameter control?

A. Network throughput
B. Output randomness and creativity
C. Storage replication
D. GPU memory allocation

Answer

B. Output randomness and creativity

Explanation

Temperature affects how deterministic or creative outputs become.

Question 7

Which technique improves factual accuracy by using trusted data sources?

A. GPU scaling
B. Retrieval-Augmented Generation (RAG)
C. Semantic caching
D. Compression indexing

Answer

B. Retrieval-Augmented Generation (RAG)

Explanation

RAG grounds model outputs using retrieved enterprise data.

Question 8

What is a major benefit of streaming responses?

A. Reduced storage costs
B. Faster perceived response times
C. Elimination of monitoring
D. Improved vector indexing

Answer

B. Faster perceived response times

Explanation

Streaming improves user experience during response generation.

Question 9

Which authentication method supports passwordless access to Azure AI services?

A. Static credentials only
B. Managed identities
C. Anonymous access
D. Embedded API secrets in code

Answer

B. Managed identities

Explanation

Managed identities support secure, keyless authentication.

Question 10

Which model type is most appropriate for image understanding and OCR tasks?

A. Small language model
B. Multimodal model
C. Traditional relational database
D. Static rules engine

Answer

B. Multimodal model

Explanation

Multimodal models process images and text together.

Go to the AI-103 Exam Prep Hub main page

The Data Community

Leave a comment Cancel reply

Information and resources for the data professionals' community