Deploy and consume LLMs, small models, code models, and multimodal models (AI-103 Exam Prep)

This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Implement generative AI and agentic solutions (30–35%)
--> Build generative applications by using Foundry
--> Deploy and consume LLMs, small models, code models, and multimodal models


Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Modern AI applications rely on a wide variety of AI models.

Different models are optimized for different workloads, including:

  • Conversational AI
  • Code generation
  • Text summarization
  • Image understanding
  • Audio processing
  • Reasoning tasks
  • Agentic workflows

The AI-103: Develop AI Apps and Agents on Azure certification exam tests your understanding of how to deploy and consume AI models in Azure AI Foundry.

For the AI-103 exam, you should understand:

  • Large language models (LLMs)
  • Small language models (SLMs)
  • Code models
  • Multimodal models
  • Model deployment concepts
  • Model consumption patterns
  • API-based model access
  • Endpoint configuration
  • Performance and cost tradeoffs
  • Model selection strategies
  • Responsible AI considerations

What Are Large Language Models (LLMs)?

Large language models are advanced AI systems trained on massive datasets.

LLMs can:

  • Generate text
  • Summarize documents
  • Answer questions
  • Translate languages
  • Reason across prompts
  • Support conversational AI

Common LLM Use Cases

Typical use cases include:

  • AI assistants
  • Enterprise chatbots
  • Content generation
  • Knowledge retrieval
  • Agent orchestration
  • Workflow automation

Characteristics of LLMs

LLMs typically provide:

  • Strong reasoning
  • Broad general knowledge
  • Advanced conversational abilities
  • Complex instruction following

However, they also:

  • Require more compute
  • Cost more to run
  • May introduce higher latency

What Are Small Language Models (SLMs)?

Small language models are lightweight models optimized for:

  • Faster inference
  • Lower cost
  • Lower latency
  • Edge deployment
  • Specialized tasks

Common SLM Use Cases

SLMs are often used for:

  • Classification
  • Simple chatbots
  • Mobile applications
  • Embedded AI
  • Lightweight assistants

Benefits of Small Models

Advantages include:

  • Reduced infrastructure cost
  • Faster response times
  • Lower resource requirements
  • Easier deployment at scale

LLM vs SLM Tradeoffs

LLMs

Best for:

  • Complex reasoning
  • Broad knowledge
  • Multi-step tasks

Tradeoffs:

  • Higher cost
  • Higher latency
  • Larger infrastructure requirements

SLMs

Best for:

  • Lightweight inference
  • Narrow tasks
  • Cost-sensitive workloads

Tradeoffs:

  • Reduced reasoning capability
  • Smaller context windows
  • Less flexibility

What Are Code Models?

Code models are specialized AI models trained for software development tasks.

These models can:

  • Generate code
  • Explain code
  • Complete functions
  • Debug issues
  • Convert between languages

Common Code Model Use Cases

Typical scenarios include:

  • Developer copilots
  • Code generation
  • Documentation generation
  • Test generation
  • Refactoring assistance

Code Model Capabilities

Code models often support:

  • Multiple programming languages
  • Natural language prompts
  • Code reasoning
  • Syntax understanding

What Are Multimodal Models?

Multimodal models process multiple types of input.

Examples include:

  • Text and images
  • Text and audio
  • Video and text

Multimodal AI Capabilities

Multimodal models may support:

  • Image understanding
  • OCR
  • Visual question answering
  • Audio transcription
  • Speech interaction
  • Video analysis

Common Multimodal Use Cases

Examples include:

  • AI vision assistants
  • Document understanding
  • Medical imaging analysis
  • Voice assistants
  • Image captioning

Model Deployment in Azure AI Foundry

Azure AI Foundry enables developers to:

  • Discover models
  • Deploy models
  • Test models
  • Monitor deployments
  • Consume models through APIs

Model Catalogs

Azure AI Foundry provides access to:

  • Foundation models
  • Open-source models
  • Specialized models
  • Multimodal models

Deployment Concepts

A deployment makes a model available through:

  • APIs
  • Endpoints
  • Applications
  • Agent workflows

Deployment Types

Common deployment options include:

  • Managed online deployments
  • Serverless deployments
  • Real-time inference endpoints
  • Batch inference deployments

Real-Time Inference

Real-time inference is used for:

  • Interactive chat
  • AI assistants
  • Live applications
  • Agent workflows

Batch Inference

Batch inference is used for:

  • Large-scale document processing
  • Offline analysis
  • Scheduled workloads
  • Bulk content generation

Endpoint Configuration

Deployments expose endpoints for application access.

Endpoints may include:

  • Authentication
  • Rate limits
  • Scaling policies
  • Monitoring settings

Authentication and Authorization

Applications may access models using:

  • API keys
  • Managed identities
  • Microsoft Entra ID
  • Role-based access control (RBAC)

Consuming Models Through APIs

Applications consume deployed models using:

  • REST APIs
  • SDKs
  • Client libraries

Prompt-Based Interactions

Generative AI applications commonly interact with models through prompts.

Prompts may include:

  • Instructions
  • Context
  • Examples
  • Retrieved documents

System Prompts

System prompts define:

  • AI behavior
  • Tone
  • Constraints
  • Safety policies

Model Parameters

Common inference parameters include:

  • Temperature
  • Top-p
  • Max tokens
  • Frequency penalty
  • Presence penalty

Temperature

Temperature controls output randomness.

Lower temperature:

  • More deterministic
  • More predictable

Higher temperature:

  • More creative
  • More variable

Context Windows

Context windows determine how much information a model can process in a request.

Larger context windows support:

  • Long conversations
  • Large documents
  • Multi-document grounding

Streaming Responses

Streaming enables applications to receive responses incrementally.

Benefits include:

  • Improved user experience
  • Faster perceived response times

Grounding Models

Grounding improves factual accuracy by providing trusted data.

Grounded applications commonly use:

  • Vector search
  • Retrieval-Augmented Generation (RAG)
  • Enterprise knowledge sources

Model Selection Considerations

Developers should evaluate:

  • Accuracy
  • Cost
  • Latency
  • Context size
  • Reasoning ability
  • Multimodal support
  • Scalability

Choosing Between Models

Use LLMs When:

  • Complex reasoning is required
  • Broad knowledge is needed
  • Multi-step workflows are involved

Use SLMs When:

  • Low latency matters
  • Cost optimization is critical
  • Tasks are narrow or repetitive

Use Code Models When:

  • Building developer tools
  • Generating code
  • Supporting programming workflows

Use Multimodal Models When:

  • Images or audio are required
  • Visual understanding is needed
  • Mixed media inputs are processed

Scaling Model Deployments

Scaling strategies may include:

  • Autoscaling
  • Regional deployments
  • Load balancing
  • Rate limiting

Monitoring Deployments

Organizations should monitor:

  • Latency
  • Throughput
  • Token usage
  • Errors
  • Safety events
  • Cost

Cost Optimization

Cost optimization strategies include:

  • Choosing smaller models
  • Limiting token usage
  • Caching responses
  • Using batch processing

Responsible AI Considerations

Developers should implement:

  • Safety filters
  • Guardrails
  • Content moderation
  • Monitoring
  • Human oversight

Multimodal Safety Concerns

Multimodal systems may require:

  • Image moderation
  • OCR filtering
  • Audio moderation
  • Content safety evaluation

Agentic AI and Model Consumption

AI agents may use:

  • LLMs for reasoning
  • SLMs for lightweight tasks
  • Code models for automation
  • Multimodal models for perception

Common AI-103 Deployment Scenarios

Scenario 1: Enterprise Chatbot

Requirements:

  • Strong reasoning
  • Long conversations
  • Grounded responses

Recommended Model:

  • LLM with RAG

Scenario 2: Mobile AI Assistant

Requirements:

  • Fast responses
  • Low cost
  • Lightweight inference

Recommended Model:

  • Small language model

Scenario 3: Developer Copilot

Requirements:

  • Code generation
  • Programming assistance
  • Syntax awareness

Recommended Model:

  • Code model

Scenario 4: Image-Aware AI Assistant

Requirements:

  • Image analysis
  • OCR
  • Text generation

Recommended Model:

  • Multimodal model

Common AI-103 Exam Tips

Understand Model Categories

Know the differences between:

  • LLMs
  • SLMs
  • Code models
  • Multimodal models

Learn Deployment Concepts

Understand:

  • Endpoints
  • Real-time inference
  • Batch inference
  • Scaling

Learn Consumption Patterns

Know:

  • REST APIs
  • SDKs
  • Prompt engineering
  • System prompts

Understand Cost and Performance Tradeoffs

Know how:

  • Model size affects cost
  • Context size affects latency
  • Scaling impacts performance

Summary

Azure AI Foundry enables developers to deploy and consume a wide range of AI models.

For the AI-103 exam, you should understand:

  • LLMs
  • Small language models
  • Code models
  • Multimodal models
  • Deployment options
  • Model consumption patterns
  • Prompt engineering
  • Scaling strategies
  • Cost optimization
  • Responsible AI controls

Choosing the right model and deployment strategy is essential for building:

  • Scalable
  • Reliable
  • Efficient
  • Responsible AI solutions

These concepts are foundational for generative AI and agentic systems on Azure.


Practice Exam Questions

Question 1

What is a primary strength of large language models (LLMs)?

A. Minimal compute usage
B. Complex reasoning and broad knowledge
C. Guaranteed factual accuracy
D. Extremely low latency

Answer

B. Complex reasoning and broad knowledge

Explanation

LLMs excel at reasoning, conversation, and broad knowledge tasks.


Question 2

Which model type is best suited for lightweight, low-cost inference?

A. Large language model
B. Small language model
C. Multimodal model
D. Vision transformer only

Answer

B. Small language model

Explanation

SLMs are optimized for lower latency and reduced cost.


Question 3

Which model type is specifically optimized for programming tasks?

A. Vision model
B. Code model
C. Embedding model
D. Speech model

Answer

B. Code model

Explanation

Code models are trained for software development workflows.


Question 4

What is a defining feature of multimodal models?

A. They only process text
B. They process multiple input types
C. They eliminate inference costs
D. They require no prompting

Answer

B. They process multiple input types

Explanation

Multimodal models handle text, images, audio, and other media.


Question 5

Which deployment type is best for interactive AI chat applications?

A. Batch inference
B. Real-time inference
C. Archive deployment
D. Offline storage deployment

Answer

B. Real-time inference

Explanation

Interactive applications require low-latency real-time inference.


Question 6

What does the temperature parameter control?

A. Network throughput
B. Output randomness and creativity
C. Storage replication
D. GPU memory allocation

Answer

B. Output randomness and creativity

Explanation

Temperature affects how deterministic or creative outputs become.


Question 7

Which technique improves factual accuracy by using trusted data sources?

A. GPU scaling
B. Retrieval-Augmented Generation (RAG)
C. Semantic caching
D. Compression indexing

Answer

B. Retrieval-Augmented Generation (RAG)

Explanation

RAG grounds model outputs using retrieved enterprise data.


Question 8

What is a major benefit of streaming responses?

A. Reduced storage costs
B. Faster perceived response times
C. Elimination of monitoring
D. Improved vector indexing

Answer

B. Faster perceived response times

Explanation

Streaming improves user experience during response generation.


Question 9

Which authentication method supports passwordless access to Azure AI services?

A. Static credentials only
B. Managed identities
C. Anonymous access
D. Embedded API secrets in code

Answer

B. Managed identities

Explanation

Managed identities support secure, keyless authentication.


Question 10

Which model type is most appropriate for image understanding and OCR tasks?

A. Small language model
B. Multimodal model
C. Traditional relational database
D. Static rules engine

Answer

B. Multimodal model

Explanation

Multimodal models process images and text together.


Go to the AI-103 Exam Prep Hub main page

Leave a comment