This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub.
This topic falls under these sections:
Plan and manage an Azure AI solution (25–30%)
--> Manage, monitor, and secure AI systems
--> Manage quotas, scaling, rate limits, and cost footprints for model and agent workloads
Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
Modern AI applications and agent-based systems can consume significant compute resources and incur substantial operational costs.
Generative AI workloads often involve:
- Large Language Models (LLMs)
- Embedding generation
- Vector search
- Retrieval-Augmented Generation (RAG)
- AI agents
- Tool execution
- Workflow orchestration
- Multimodal processing
As AI applications scale, organizations must carefully manage:
- Quotas
- Throughput limits
- Rate limits
- Token usage
- Infrastructure scaling
- Operational costs
- Resource utilization
The AI-103: Develop AI Apps and Agents on Azure certification exam tests your understanding of how to manage and optimize AI workloads in Azure.
For the AI-103 exam, you should understand:
- Quota management
- Rate limiting
- Scaling strategies
- Throughput optimization
- Cost optimization
- Monitoring AI workloads
- Autoscaling
- Capacity planning
- Token management
- Model selection tradeoffs
- Agent workload optimization
Understanding AI Workload Consumption
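AI workloads consume resources differently than traditional applications: capacity and cost are driven largely by the number of tokens processed, not just by the number of requests.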
Key consumption factors include:
- Prompt size
- Response size
- Number of requests
- Model size
- Embedding generation
- Retrieval operations
- Concurrent users
- Tool execution
Tokens and Token Consumption
Generative AI models process text using tokens.
Tokens represent:
- Words
- Word fragments
- Characters
- Symbols
Token usage directly affects:
- Cost
- Latency
- Throughput
- Performance
Input Tokens
Input tokens include:
- User prompts
- System prompts
- Retrieved documents
- Conversation history
Output Tokens
Output tokens represent generated responses.
Longer responses increase:
- Costs
- Latency
- Resource consumption
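Because input and output tokens both contribute to cost, a per-request estimate can be sketched as follows. This is a minimal illustration: the function name and the prices are hypothetical placeholders, not published Azure rates.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate the cost of one request from its token counts.

    Prices are expressed per 1,000 tokens; input and output are priced
    separately because output tokens are often billed at a higher rate.
    """
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# Hypothetical prices, for illustration only.
cost = estimate_cost(1500, 500, input_price_per_1k=0.01, output_price_per_1k=0.03)
print(f"Estimated cost per request: {cost:.4f}")
```

Multiplying this figure by the expected number of requests per day gives a rough first-pass budget.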
Context Windows
A context window is the maximum number of tokens a model can process in a single request, typically counting both the input and the generated output.
Larger context windows:
- Support more information
- Increase token consumption
- Increase costs
- Potentially increase latency
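One way to keep requests within a context window is to trim older conversation history. The sketch below assumes a rough heuristic of about four characters per token for English text; a real application would use the tokenizer that matches its model. The function names are illustrative.

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose estimated tokens fit the budget."""
    kept = []
    used = 0
    # Walk backwards so the newest messages are kept first.
    for message in reversed(messages):
        cost = approx_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    # Restore chronological order.
    return list(reversed(kept))
```

Dropping the oldest turns is the simplest policy; summarizing them instead preserves more context at the cost of an extra model call.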
What Are Quotas?
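Quotas define resource usage limits for Azure AI services. For Azure OpenAI model deployments, quota is typically assigned per subscription, per region, and per model.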
Quotas help:
- Prevent overconsumption
- Ensure fair resource usage
- Protect service reliability
Common Azure AI Quotas
Common quotas include:
- Requests per minute (RPM)
- Tokens per minute (TPM)
- Concurrent requests
- Deployment limits
- Resource limits
Requests Per Minute (RPM)
RPM limits how many API requests can be processed each minute.
High request volumes may require:
- Additional deployments
- Provisioned throughput
- Load balancing
Tokens Per Minute (TPM)
TPM limits the number of tokens processed per minute.
High-token workloads often require:
- Throughput optimization
- Smaller prompts
- Efficient retrieval
- Better chunking strategies
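A quick capacity check makes the relationship between RPM and TPM concrete. This is a simplified sketch that treats the average request size as constant; the quota values and function names are hypothetical.

```python
def required_tpm(requests_per_minute: int, avg_input_tokens: int,
                 avg_output_tokens: int) -> int:
    """Tokens per minute the workload needs at the given request rate."""
    return requests_per_minute * (avg_input_tokens + avg_output_tokens)


def fits_quota(requests_per_minute: int, avg_input_tokens: int,
               avg_output_tokens: int, tpm_quota: int, rpm_quota: int) -> bool:
    """Return True only if both the RPM and TPM limits are respected."""
    needed = required_tpm(requests_per_minute, avg_input_tokens, avg_output_tokens)
    return requests_per_minute <= rpm_quota and needed <= tpm_quota


# Hypothetical workload: 60 requests/min, 1,200 input + 300 output tokens each.
print(required_tpm(60, 1200, 300))  # 90000
```

In this example, trimming the average prompt from 1,200 to 700 tokens would cut the required TPM from 90,000 to 60,000 without reducing request volume.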
Provisioned Throughput
Provisioned throughput reserves dedicated model capacity, which Azure OpenAI sells in provisioned throughput units (PTUs).
Benefits include:
- Predictable performance
- Consistent latency
- Higher throughput
Tradeoffs include:
- Higher cost
- Capacity planning requirements
Standard Deployments vs Provisioned Throughput
Standard Deployments
Advantages:
- Lower cost for variable or low-volume workloads (pay per token)
- Flexible scaling
- Simpler management
Disadvantages:
- Shared capacity
- Less predictable latency
Provisioned Throughput Deployments
Advantages:
- Dedicated capacity
- Predictable performance
- Enterprise reliability
Disadvantages:
- Higher cost
- Requires workload planning
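The cost side of this tradeoff can be approximated with a break-even comparison. The sketch below assumes a flat per-token price for standard deployments and a fixed monthly price for provisioned capacity; both figures are hypothetical, and the comparison ignores latency guarantees.

```python
def monthly_standard_cost(tokens_per_month: int, price_per_1k_tokens: float) -> float:
    """Pay-as-you-go cost for a month of token usage."""
    return tokens_per_month / 1000 * price_per_1k_tokens


def cheaper_option(tokens_per_month: int, price_per_1k_tokens: float,
                   provisioned_monthly_cost: float) -> str:
    """Return which deployment type costs less at this volume."""
    standard = monthly_standard_cost(tokens_per_month, price_per_1k_tokens)
    return "standard" if standard <= provisioned_monthly_cost else "provisioned"


# Hypothetical prices, for illustration only.
print(cheaper_option(100_000_000, 0.01, 2000.0))  # low volume
print(cheaper_option(500_000_000, 0.01, 2000.0))  # high, steady volume
```

Cost is only one input: a workload that needs predictable latency may justify provisioned capacity even below the break-even volume.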
Rate Limiting
Rate limiting controls how frequently clients can access services.
Benefits include:
- Preventing abuse
- Improving stability
- Protecting infrastructure
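A token bucket is one common way to implement rate limiting. The sketch below is a generic, in-process illustration and does not represent how Azure services enforce their limits; the class name and parameters are assumptions.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock             # injectable for testing
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller that receives `False` would typically reject the request or place it in a queue rather than forwarding it to the model.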
Why Rate Limits Matter
Without rate limits:
- Services may become overloaded
- Costs may increase rapidly
- Applications may experience outages
Handling Rate Limit Errors
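Applications should gracefully handle rate limit responses, which typically arrive as HTTP 429 (Too Many Requests) errors, often with a Retry-After header indicating how long to wait.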
Common strategies include:
- Retry policies
- Exponential backoff
- Queueing
- Load balancing
Exponential Backoff
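Exponential backoff increases the wait time after each failed attempt, typically doubling it (for example, 1 second, then 2, then 4), often with random jitter so that many clients do not retry at the same moment.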
Benefits:
- Reduces service overload
- Improves reliability
- Helps recover from temporary spikes
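A retry helper with exponential backoff might look like the sketch below. `RateLimitError` is a hypothetical exception standing in for whatever a real client raises on a rate limit response, and the delay values are illustrative defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception representing a rate limit response."""


def call_with_backoff(func, max_retries: int = 5, base_delay: float = 1.0,
                      max_delay: float = 30.0, sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Double the delay each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Add up to 10% jitter so clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

If the service returns a suggested wait time, honoring that value is generally preferable to the computed delay.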
Queue-Based Architectures
Queues help manage burst traffic.
Common Azure services include:
- Azure Service Bus
- Azure Queue Storage
Benefits:
- Improved reliability
- Controlled workload processing
- Better scalability
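The pattern can be illustrated with Python's standard library. In this sketch an in-process `queue.Queue` stands in for a managed service such as Azure Service Bus, and a fixed pool of workers drains a burst of requests at a controlled pace. The function names are illustrative.

```python
import queue
import threading


def worker(tasks: queue.Queue, results: list, handle) -> None:
    """Pull items off the queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this worker
            tasks.task_done()
            break
        results.append(handle(item))
        tasks.task_done()


def process_burst(items, handle, num_workers: int = 2) -> list:
    """Process a burst of items with a fixed number of workers."""
    tasks = queue.Queue()
    results = []
    threads = [threading.Thread(target=worker, args=(tasks, results, handle))
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return results  # completion order, not submission order
```

The number of workers caps concurrency against the model endpoint regardless of how many requests arrive at once.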
Scaling AI Workloads
AI systems must scale efficiently.
Horizontal Scaling
Horizontal scaling adds more instances.
Examples:
- Additional containers
- More API instances
- More worker nodes
Benefits:
- Better concurrency
- Higher throughput
- Improved resilience
Vertical Scaling
Vertical scaling increases the resource capacity of existing instances.
Examples:
- More CPU
- More memory
- Larger compute sizes
Autoscaling
Autoscaling dynamically adjusts resources based on workload demand.
Common Azure services supporting autoscaling:
- AKS
- Azure Functions
- Azure App Service
- Azure Container Apps
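The core calculation behind metric-based autoscaling can be sketched as a proportional rule, similar in spirit to the one used by the Kubernetes Horizontal Pod Autoscaler. The function name, bounds, and metric values below are illustrative assumptions.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Scale replicas in proportion to how far the metric is from its target."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Example: 2 replicas at 90 requests each, targeting 60 per replica.
print(desired_replicas(2, current_metric=90, target_metric=60))  # 3
```

The upper bound doubles as a cost control, since it limits how far a traffic spike can scale the deployment.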
Scaling AI Agents
AI agents often require additional scaling considerations.
Agent workloads may involve:
- Tool execution
- Retrieval pipelines
- Multi-step reasoning
- Long-running workflows
Multi-Agent Systems
Multi-agent systems may generate:
- High API volumes
- Increased orchestration complexity
- Heavy retrieval traffic
Scaling strategies may include:
- Distributed architectures
- Queue systems
- Parallel processing
Cost Footprints for AI Systems
AI systems can become expensive very quickly because most charges scale directly with usage.
Common AI Cost Drivers
Major cost drivers include:
- Token usage
- Large models
- Embedding generation
- Vector search
- Provisioned throughput
- Storage
- Networking
- Agent orchestration
Large Models vs Small Models
Large Models
Advantages:
- Better reasoning
- Higher-quality responses
- Stronger generalization
Disadvantages:
- Higher costs
- Increased latency
- Greater resource consumption
Small Models
Advantages:
- Lower cost
- Faster responses
- Reduced latency
Disadvantages:
- Reduced reasoning capability
- Less sophisticated outputs
Choosing the Right Model
Choose smaller models when:
- Tasks are simple
- Low latency matters
- Budget constraints exist
Choose larger models when:
- Advanced reasoning is required
- Complex workflows exist
- Higher quality is critical
Optimizing Prompt Design
Prompt design directly affects cost.
Long prompts:
- Increase token usage
- Increase latency
- Increase costs
Prompt Optimization Strategies
Strategies include:
- Shorter prompts
- Better instructions
- Efficient context usage
- Retrieval filtering
- Context summarization
Retrieval Optimization
RAG systems can significantly increase token usage.
Retrieved documents consume context window space.
Chunking Optimization
Chunking strategies affect:
- Retrieval accuracy
- Token consumption
- Latency
Poor chunking may:
- Increase irrelevant retrieval
- Increase costs
- Reduce quality
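A simple fixed-size chunker with overlap illustrates the parameters involved. This sketch splits on whitespace for readability; production pipelines more often chunk by tokens or by document structure. The function name and parameters are illustrative.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of `chunk_size` words sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_words("a b c d e f g", chunk_size=3, overlap=1))
# ['a b c', 'c d e', 'e f g']
```

Larger chunks carry more context per retrieval but consume more of the prompt; overlap reduces the chance that a relevant passage is split across two chunks.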
Hybrid Search Optimization
Hybrid search combines:
- Vector search
- Keyword search
Benefits include:
- Better retrieval accuracy
- Reduced hallucinations
- More relevant grounding
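One common way to merge vector and keyword results is Reciprocal Rank Fusion (RRF), in which each document scores the sum of 1 / (k + rank) across the result lists it appears in. The sketch below is a generic illustration with made-up document IDs, using the conventional constant k = 60.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["d1", "d2", "d3"]
keyword_hits = ["d2", "d4", "d1"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['d2', 'd1', 'd4', 'd3']
```

Documents that rank well in both lists rise to the top, which is why hybrid search tends to return more relevant grounding with fewer retrieved chunks.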
Monitoring AI Workloads
Monitoring is essential for operational management.
Azure Monitor
Azure Monitor provides:
- Metrics
- Alerts
- Logs
- Diagnostics
Application Insights
Application Insights supports:
- Telemetry
- Request tracing
- Dependency monitoring
- Performance analysis
Important Metrics to Monitor
Common AI metrics include:
- Token usage
- Latency
- Error rates
- Throughput
- Cost trends
- Retrieval quality
- Tool execution failures
Cost Monitoring
Organizations should track:
- Daily usage
- Monthly spend
- Per-user costs
- Per-agent costs
- API consumption
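Per-user cost tracking can start as a simple aggregation over usage records. The record shape and the flat price below are assumptions for illustration; real telemetry would come from application logs or Azure Monitor.

```python
from collections import defaultdict


def cost_by_user(usage_records: list[dict], price_per_1k_tokens: float) -> dict:
    """Sum estimated cost per user from records with 'user' and 'tokens' keys."""
    totals = defaultdict(float)
    for record in usage_records:
        totals[record["user"]] += record["tokens"] / 1000 * price_per_1k_tokens
    return dict(totals)


# Hypothetical usage data and price.
records = [
    {"user": "alice", "tokens": 2000},
    {"user": "bob", "tokens": 500},
    {"user": "alice", "tokens": 1000},
]
print(cost_by_user(records, price_per_1k_tokens=0.5))
# {'alice': 1.5, 'bob': 0.25}
```

The same grouping works per agent or per feature by changing the key used for aggregation.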
Azure Cost Management
Azure Cost Management helps:
- Analyze spending
- Forecast costs
- Create budgets
- Detect anomalies
Budget Alerts
Budget alerts notify teams when actual or forecasted spending crosses a defined threshold. Alerts only notify; they do not stop or throttle resources on their own.
Benefits include:
- Better cost control
- Early detection of anomalies
- Prevention of runaway spending
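The threshold logic behind a budget alert is straightforward. The sketch below is a local illustration of the idea, not the Azure Cost Management API, and the threshold percentages are example values.

```python
def triggered_thresholds(spend: float, budget: float,
                         thresholds: tuple = (0.5, 0.8, 1.0)) -> list:
    """Return the budget fractions that current spend has reached."""
    return [t for t in thresholds if spend >= budget * t]


# Example: 850 spent against a budget of 1,000.
print(triggered_thresholds(850, 1000))  # [0.5, 0.8]
```

Setting several thresholds below 100% gives teams time to react before the budget is fully consumed.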
Security and Cost Protection
Security issues can increase costs.
Examples include:
- API abuse
- Prompt injection attacks
- Excessive automated requests
API Management
Azure API Management helps:
- Apply throttling
- Control rate limits
- Secure APIs
- Monitor usage
Caching Strategies
Caching reduces repeated AI calls.
Benefits include:
- Reduced token usage
- Lower latency
- Lower costs
Common Caching Scenarios
Cache:
- Frequent responses
- Static retrieval results
- Reusable embeddings
- Common prompts
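A minimal response cache can be keyed on a hash of the model name and a normalized prompt. This sketch keeps entries in memory with no expiry; the class name is illustrative, and the normalization (lowercasing and collapsing whitespace) is a design assumption that may not suit case-sensitive prompts.

```python
import hashlib


class ResponseCache:
    """In-memory cache for model responses, keyed by model and prompt."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Normalize so trivially different prompts share one entry.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, generate):
        """Return a cached response, or call generate(prompt) and store it."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = generate(prompt)
        self._store[key] = result
        return result
```

Tracking hits and misses makes it possible to estimate how many tokens the cache is saving. A shared store such as Azure Cache for Redis would be needed once the application runs on more than one instance.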
High Availability Considerations
Scaling should also support:
- Reliability
- Fault tolerance
- Disaster recovery
Load Balancing
Load balancing distributes requests across instances.
Benefits:
- Improved scalability
- Better resilience
- Higher throughput
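Round-robin is the simplest distribution policy. The sketch below rotates through a list of endpoints; the endpoint names are hypothetical, and a production balancer would also account for health checks and rate limit responses.

```python
import itertools


class RoundRobinBalancer:
    """Hand out endpoints in a fixed rotation."""

    def __init__(self, endpoints):
        endpoints = list(endpoints)
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self) -> str:
        """Return the next endpoint in the rotation."""
        return next(self._cycle)


# Hypothetical deployment names.
balancer = RoundRobinBalancer(["deployment-east", "deployment-west"])
print([balancer.next_endpoint() for _ in range(3)])
# ['deployment-east', 'deployment-west', 'deployment-east']
```

Spreading requests across deployments in different regions can also raise the effective quota available to the application.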
Common AI-103 Operational Scenarios
Scenario 1: Enterprise AI Copilot
Requirements:
- High concurrency
- Predictable latency
- Cost monitoring
Recommended Strategy:
- Provisioned throughput
- Autoscaling
- Budget alerts
Scenario 2: Internal Knowledge Assistant
Requirements:
- Retrieval optimization
- Controlled costs
- Moderate scale
Recommended Strategy:
- Efficient chunking
- Hybrid search
- Smaller embedding models
Scenario 3: Multi-Agent Workflow Platform
Requirements:
- Heavy orchestration
- Parallel execution
- High throughput
Recommended Strategy:
- Queue-based architecture
- AKS autoscaling
- API throttling
Scenario 4: Public AI Chatbot
Requirements:
- Abuse protection
- Traffic spikes
- Cost protection
Recommended Strategy:
- API Management
- Rate limiting
- Caching
- Autoscaling
Common AI-103 Exam Tips
Understand Quota Concepts
Know:
- RPM limits
- TPM limits
- Provisioned throughput
- Concurrent request limits
Understand Scaling Strategies
Know the differences between:
- Horizontal scaling
- Vertical scaling
- Autoscaling
Learn Cost Optimization Techniques
Understand:
- Prompt optimization
- Model selection
- Retrieval optimization
- Caching
- Budget monitoring
Know Monitoring and Operational Management
Understand:
- Azure Monitor
- Application Insights
- Azure Cost Management
- API Management
Summary
Managing quotas, scaling, rate limits, and cost footprints is essential for production AI systems.
For the AI-103 exam, you should understand:
- Token consumption
- Quota management
- Throughput planning
- Rate limiting
- Scaling strategies
- Cost optimization
- Retrieval optimization
- Monitoring AI workloads
- Budget management
- Operational resilience
Strong operational management practices help ensure AI systems remain:
- Reliable
- Scalable
- Cost-effective
- Secure
- High performing
These concepts are critical for enterprise AI applications and agent-based solutions on Azure.
Practice Exam Questions
Question 1
What does TPM stand for in Azure AI workloads?
A. Tokens Per Minute
B. Tasks Per Model
C. Throughput Per Memory
D. Transactions Per Model
Answer
A. Tokens Per Minute
Explanation
TPM measures how many tokens can be processed each minute.
Question 2
Which deployment option provides dedicated processing capacity?
A. Shared deployment
B. Provisioned throughput deployment
C. Standard deployment
D. Public deployment
Answer
B. Provisioned throughput deployment
Explanation
Provisioned throughput reserves dedicated model capacity.
Question 3
What is the primary purpose of rate limiting?
A. Increase latency
B. Prevent abuse and protect services
C. Reduce storage replication
D. Encrypt prompts
Answer
B. Prevent abuse and protect services
Explanation
Rate limiting helps maintain service stability and prevent overload.
Question 4
Which retry strategy gradually increases wait times between retries?
A. Static retry
B. Exponential backoff
C. Parallel retry
D. Immediate retry
Answer
B. Exponential backoff
Explanation
Exponential backoff reduces overload during retry attempts.
Question 5
Which scaling strategy adds more instances to support increased workloads?
A. Vertical scaling
B. Horizontal scaling
C. Static scaling
D. Semantic scaling
Answer
B. Horizontal scaling
Explanation
Horizontal scaling increases capacity by adding instances.
Question 6
Which Azure service helps analyze and forecast cloud spending?
A. Azure Cost Management
B. Azure CDN
C. Azure Backup
D. Azure DNS
Answer
A. Azure Cost Management
Explanation
Azure Cost Management provides spending analysis and budgeting.
Question 7
What is one benefit of caching AI responses?
A. Increased token usage
B. Reduced costs and latency
C. Higher embedding size
D. Reduced monitoring
Answer
B. Reduced costs and latency
Explanation
Caching avoids repeated AI calls and improves performance.
Question 8
Which Azure service supports API throttling and traffic control?
A. Azure API Management
B. Azure Files
C. Azure DNS
D. Azure Backup
Answer
A. Azure API Management
Explanation
Azure API Management supports throttling, monitoring, and API governance.
Question 9
Which factor directly increases token consumption in generative AI systems?
A. Smaller prompts
B. Longer prompts and responses
C. Lower concurrency
D. Reduced context windows
Answer
B. Longer prompts and responses
Explanation
Larger prompts and outputs consume more tokens.
Question 10
Which Azure monitoring service provides telemetry and diagnostics for AI applications?
A. Application Insights
B. Azure Firewall
C. Azure CDN
D. Azure Files
Answer
A. Application Insights
Explanation
Application Insights provides telemetry, diagnostics, and performance monitoring.
