This post is part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub.
This topic falls under these sections:
Implement generative AI and agentic solutions (30–35%)
--> Build agents by using Foundry
--> Integrate monitoring into deployed agents, evaluate agent behavior, and perform error analysis
Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
Monitoring, evaluation, and error analysis are critical components of production-grade AI agent systems. In the AI-103 certification exam, Microsoft expects candidates to understand how to monitor deployed agents, assess their behavior, identify failures, improve safety and reliability, and continuously optimize agent performance.
Modern AI agents are dynamic systems that can reason, retrieve information, call tools, maintain memory, and execute multistep workflows. Because of this complexity, monitoring an AI agent goes far beyond checking whether an API endpoint is online. Developers must monitor prompts, tool usage, retrieval quality, token consumption, latency, failures, safety issues, hallucinations, and overall user satisfaction.
Azure AI Foundry provides tools and integrations that help developers monitor deployed agents, evaluate outputs, perform safety evaluations, collect telemetry, and conduct root-cause analysis when problems occur.
This article covers the key AI-103 exam concepts related to:
- Monitoring deployed AI agents
- Agent observability
- Telemetry collection
- Logging and tracing
- Evaluating agent behavior
- Measuring quality and safety
- Detecting hallucinations and grounding failures
- Tool-call monitoring
- Conversation analytics
- Error analysis techniques
- Root-cause investigation
- Failure handling and resiliency
- Responsible AI evaluation
- Continuous improvement workflows
Why Monitoring Matters in AI Agent Systems
Traditional software systems generally behave deterministically. Given the same input, the system usually produces the same output.
AI agents behave probabilistically. Outputs may vary even when the same prompt is submitted twice. Agents can also:
- Use external tools
- Retrieve documents
- Perform reasoning steps
- Maintain conversational memory
- Execute actions autonomously
- Interact with multiple systems
Because of this complexity, production AI systems require strong observability and monitoring capabilities.
Monitoring helps organizations:
- Detect failures quickly
- Identify hallucinations
- Measure quality
- Improve safety
- Optimize costs
- Detect prompt injection attempts
- Analyze user satisfaction
- Improve retrieval relevance
- Tune prompts and workflows
- Validate grounding quality
- Ensure compliance and auditing
Without monitoring, developers cannot reliably improve or trust deployed AI systems.
Core Monitoring Concepts
Observability
Observability refers to the ability to understand what an AI system is doing internally based on telemetry and logs.
An observable AI system provides insight into:
- Prompts
- Responses
- Tool calls
- Retrieval results
- Execution paths
- Latency
- Failures
- Safety violations
- Token usage
- Model selection
- User interactions
Observability enables developers to diagnose problems efficiently.
Telemetry
Telemetry is operational data collected from the AI system.
Examples include:
- API response times
- Number of tokens consumed
- Tool invocation counts
- Search query performance
- Error rates
- Memory usage
- Agent workflow duration
- Failed requests
- User feedback scores
Telemetry data is often stored in, or routed through:
- Azure Monitor
- Application Insights
- Log Analytics
- Event Hubs
- Data Lake storage
Trace Logging
Tracing records the sequence of operations executed during an agent interaction.
A trace may include:
- User prompt
- System prompt
- Retrieval request
- Retrieved documents
- Tool calls
- Model response
- Safety filter results
- Final output
Tracing is essential for debugging multistep agent workflows.
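To make the idea of a trace concrete, here is a minimal sketch in plain Python. The `Span` and `Trace` classes are hypothetical names invented for illustration, not part of any Azure SDK, and the two recorded steps are stand-ins for a real retrieval call and a real model call. In practice a tracing library would usually play this role.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    """One recorded step in an agent interaction (hypothetical structure)."""
    name: str
    attributes: dict
    duration_ms: float = 0.0


@dataclass
class Trace:
    """Ordered record of every step executed for a single user request."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def record(self, name, func, **attributes):
        """Run func, time it, and append a span describing the step."""
        start = time.perf_counter()
        result = func()
        elapsed_ms = (time.perf_counter() - start) * 1000
        self.spans.append(Span(name, attributes, elapsed_ms))
        return result


# Stand-in steps; a real agent would call a search index and a model here.
trace = Trace()
docs = trace.record("retrieval", lambda: ["doc-1", "doc-7"], query="refund policy")
answer = trace.record("model_response", lambda: "Refunds take 5 days.", model="example-model")

step_names = [span.name for span in trace.spans]
print(trace.trace_id, step_names)
```

Because every span shares one `trace_id`, a failure in a later step can be tied back to the retrieval results and inputs that preceded it.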
Monitoring Deployed Agents in Azure
Azure AI Foundry Monitoring
Azure AI Foundry provides monitoring capabilities for:
- Model deployments
- Agent workflows
- Prompt flows
- Evaluation pipelines
- Safety evaluations
- Token usage
- Latency metrics
- Failure tracking
Developers can analyze:
- Request success rates
- Response quality
- Grounding quality
- Safety incidents
- Performance bottlenecks
Azure Monitor
Azure Monitor collects metrics and logs across Azure resources.
Common AI monitoring scenarios include:
- Monitoring API latency
- Detecting spikes in failed requests
- Monitoring throughput
- Alerting on quota exhaustion
- Monitoring infrastructure health
Azure Monitor can trigger:
- Email alerts
- SMS notifications
- Logic Apps workflows
- Incident response tickets
Application Insights
Application Insights provides detailed application telemetry.
For AI agents, it can track:
- User sessions
- API calls
- Exceptions
- Dependency failures
- Custom events
- Prompt execution traces
- Response timing
Application Insights is commonly integrated into:
- Web applications
- Chatbots
- Agent orchestration systems
- API gateways
Log Analytics
Log Analytics enables querying and analyzing telemetry data.
Developers can:
- Search logs
- Build dashboards
- Analyze trends
- Correlate failures
- Investigate incidents
Kusto Query Language (KQL) is commonly used for analysis.
Example, counting failed requests by operation name:
requests
| where success == false
| summarize count() by operation_Name
Important Metrics for AI Agents
Latency
Latency measures how long it takes for the agent to respond.
High latency may be caused by:
- Slow model inference
- Large prompts
- Slow tool APIs
- Complex orchestration
- Vector search delays
- Network bottlenecks
Low latency is especially important for:
- Customer support bots
- Interactive copilots
- Real-time assistants
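When tracking latency, percentiles are often more informative than an average, because a few slow tool calls can hide behind a healthy mean. The sketch below summarizes a list of response times using only the Python standard library. The sample values are made up for illustration.

```python
import statistics


def latency_summary(latencies_ms):
    """Return median (p50) and 95th-percentile (p95) latency for the samples."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


# Hypothetical response times in milliseconds, including one slow outlier.
samples = [120, 135, 150, 160, 180, 210, 240, 300, 450, 2200]
summary = latency_summary(samples)
print(summary)
```

Here the single 2200 ms request barely moves the median but dominates the p95, which is the kind of tail behavior an interactive assistant needs to watch.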
Token Usage
High token consumption increases both cost and latency.
Developers monitor:
- Prompt tokens
- Completion tokens
- Total tokens per session
- Tokens per workflow step
Reducing token usage may involve:
- Shorter prompts
- Better chunking
- Summarized memory
- Smaller models
- Context pruning
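A rough sketch of per-session token accounting follows. The record shape and the field names `prompt_tokens` and `completion_tokens` are assumptions modeled on the usage data that chat-completion APIs commonly return, and the prices are placeholders, not real rates.

```python
from collections import defaultdict

# Hypothetical per-step usage records for two sessions.
usage_log = [
    {"session": "s1", "step": "plan", "prompt_tokens": 420, "completion_tokens": 80},
    {"session": "s1", "step": "answer", "prompt_tokens": 1300, "completion_tokens": 250},
    {"session": "s2", "step": "answer", "prompt_tokens": 600, "completion_tokens": 120},
]

# Placeholder prices per 1,000 tokens; substitute the rates for your deployment.
PROMPT_PRICE_PER_1K = 0.01
COMPLETION_PRICE_PER_1K = 0.03


def tokens_per_session(records):
    """Sum prompt and completion tokens for each session."""
    totals = defaultdict(lambda: {"prompt": 0, "completion": 0})
    for record in records:
        totals[record["session"]]["prompt"] += record["prompt_tokens"]
        totals[record["session"]]["completion"] += record["completion_tokens"]
    return dict(totals)


def estimated_cost(totals):
    """Estimate cost for one session from its token totals."""
    return (totals["prompt"] / 1000 * PROMPT_PRICE_PER_1K
            + totals["completion"] / 1000 * COMPLETION_PRICE_PER_1K)


session_totals = tokens_per_session(usage_log)
for session, totals in session_totals.items():
    print(session, totals, round(estimated_cost(totals), 4))
```

Grouping by `step` instead of `session` would show which workflow stage consumes the most tokens and is therefore the best candidate for shorter prompts or context pruning.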
Error Rates
Error monitoring helps identify instability.
Examples:
- Failed tool calls
- Timeout errors
- Retrieval failures
- API authentication errors
- Model overload conditions
- Rate-limit violations
High error rates indicate reliability issues.
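As an illustration, an overall error rate and a breakdown by error category can be computed from logged request events. The event shape and the category labels below are hypothetical.

```python
from collections import Counter


def error_breakdown(events):
    """Return (overall error rate, error categories sorted by frequency)."""
    total = len(events)
    errors = Counter(e["error"] for e in events if e.get("error"))
    rate = sum(errors.values()) / total if total else 0.0
    return rate, errors.most_common()


# Ten hypothetical request events: seven successes and three failures.
events = [{"error": None}] * 7 + [
    {"error": "tool_timeout"},
    {"error": "tool_timeout"},
    {"error": "rate_limit"},
]
rate, by_kind = error_breakdown(events)
print(rate, by_kind)
```

Knowing that most failures are tool timeouts rather than rate limits points the investigation at a specific dependency instead of the system as a whole.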
Throughput
Throughput measures how many requests the system can handle.
Important for:
- High-scale enterprise systems
- Public-facing chatbots
- Large customer-service systems
User Satisfaction
User feedback is critical for evaluating agent quality.
Methods include:
- Thumbs up/down feedback
- Star ratings
- Survey scores
- Conversation abandonment rates
- Escalation frequency
User feedback helps identify:
- Hallucinations
- Poor reasoning
- Irrelevant responses
- Unsafe behavior
Evaluating Agent Behavior
Why Evaluation Is Important
AI agents may appear functional while still producing:
- Unsafe outputs
- Incorrect reasoning
- Fabricated facts
- Poor tool usage
- Low-quality retrieval
- Biased responses
Evaluation ensures the system performs reliably.
Types of Evaluations
Quality Evaluation
Measures:
- Accuracy
- Completeness
- Helpfulness
- Relevance
- Coherence
Example questions:
- Did the response answer the user question?
- Was the answer correct?
- Was the response understandable?
Grounding Evaluation
Grounding evaluations verify whether responses are supported by retrieved data.
This is especially important in RAG systems.
Developers evaluate:
- Citation accuracy
- Retrieval relevance
- Hallucination frequency
- Source alignment
Poor grounding may indicate:
- Bad chunking
- Weak embeddings
- Incorrect search ranking
- Missing documents
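To illustrate what a grounding check looks for, the sketch below flags answer sentences whose words rarely appear in the retrieved sources. This is a deliberately crude word-overlap heuristic written for illustration. It is not how the Azure AI Foundry groundedness evaluator works, and the function name and the 0.6 threshold are assumptions.

```python
import re


def _tokens(text):
    """Lowercase word and number tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def sentence_support(answer, sources, threshold=0.6):
    """Score each answer sentence by the share of its words found in sources."""
    source_tokens = set()
    for source in sources:
        source_tokens |= _tokens(source)
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        overlap = len(words & source_tokens) / len(words)
        results.append((sentence, round(overlap, 2), overlap >= threshold))
    return results


sources = ["Refunds are processed within 5 business days of approval."]
answer = "Refunds are processed within 5 business days. Shipping is free worldwide."
report = sentence_support(answer, sources)
for sentence, score, supported in report:
    print(supported, score, sentence)
```

The second sentence has no support in the source text, so it is flagged as a possible hallucination. Production evaluators typically rely on a model-based judge rather than word overlap, since paraphrases defeat this kind of check.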
Safety Evaluation
Safety evaluations identify harmful or policy-violating outputs.
Examples:
- Hate speech
- Violence
- Self-harm content
- Prompt injection success
- Sensitive information leakage
- Toxic responses
Azure AI safety tooling can help detect these issues.
Tool Usage Evaluation
Agents may incorrectly:
- Select the wrong tool
- Pass invalid parameters
- Call tools too frequently
- Fail to call required tools
Tool evaluation measures:
- Tool selection accuracy
- Parameter correctness
- Tool success rates
- Tool latency
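One way to measure tool selection accuracy and parameter correctness is to compare logged tool calls against expected calls from a labeled test set. The sketch below assumes a hypothetical case format. The tool names `get_order` and `refund` are invented for the example.

```python
def evaluate_tool_calls(cases):
    """Compare the agent's actual tool calls with the expected ones."""
    selected = sum(1 for c in cases if c["actual_tool"] == c["expected_tool"])
    params_ok = sum(
        1 for c in cases
        if c["actual_tool"] == c["expected_tool"]
        and c["actual_args"] == c["expected_args"]
    )
    total = len(cases)
    return {
        "tool_selection_accuracy": selected / total if total else 0.0,
        # Parameter correctness is only meaningful when the right tool was chosen.
        "parameter_correctness": params_ok / selected if selected else 0.0,
    }


cases = [
    {"expected_tool": "get_order", "expected_args": {"order_id": "A1"},
     "actual_tool": "get_order", "actual_args": {"order_id": "A1"}},
    {"expected_tool": "get_order", "expected_args": {"order_id": "B2"},
     "actual_tool": "get_order", "actual_args": {"order_id": "B-2"}},
    {"expected_tool": "refund", "expected_args": {"order_id": "C3"},
     "actual_tool": "get_order", "actual_args": {"order_id": "C3"}},
    {"expected_tool": "refund", "expected_args": {"order_id": "D4"},
     "actual_tool": "refund", "actual_args": {"order_id": "D4"}},
]
scores = evaluate_tool_calls(cases)
print(scores)
```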
Conversation Evaluation
Conversation quality evaluation measures:
- Context retention
- Memory quality
- Conversation consistency
- Turn-by-turn coherence
- Goal completion success
Evaluators in Azure AI Foundry
Azure AI Foundry supports evaluators that help assess model and agent quality.
Evaluators may analyze:
- Relevance
- Groundedness
- Coherence
- Fluency
- Safety
- Similarity to reference answers
Evaluation pipelines may run:
- During development
- During testing
- After deployment
- Continuously in production
Detecting Hallucinations
What Is a Hallucination?
A hallucination occurs when the model generates false or fabricated information.
Examples:
- Invented facts
- Nonexistent citations
- False calculations
- Fabricated policies
- Incorrect summaries
Causes of Hallucinations
Common causes include:
- Weak grounding
- Missing context
- Poor prompts
- Overly broad tasks
- Outdated training data
- Low retrieval quality
Hallucination Detection Techniques
Methods include:
- Grounding evaluations
- Citation verification
- Reference-answer comparison
- Human review
- Fact-checking pipelines
- Confidence scoring
Monitoring Retrieval Quality
In RAG systems, retrieval quality strongly affects response quality.
Developers monitor:
- Search relevance
- Chunk quality
- Embedding effectiveness
- Citation accuracy
- Vector search latency
- Retrieval precision
- Retrieval recall
Poor retrieval causes:
- Irrelevant answers
- Missing context
- Hallucinations
- Reduced trustworthiness
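Retrieval precision and recall can be computed directly once a test query has a labeled set of relevant documents. The following sketch uses made-up document IDs and the standard precision@k and recall@k definitions.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision@k: share of the top k that is relevant.
    Recall@k: share of all relevant documents found in the top k."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical ranked search results and labeled relevant documents.
retrieved = ["d3", "d9", "d1", "d7", "d4"]
relevant = {"d1", "d3", "d5"}
precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
print(precision, recall)
```

Low precision means irrelevant chunks are crowding the context window, while low recall means a needed document (here `d5`) never reached the model at all.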
Error Analysis in AI Systems
What Is Error Analysis?
Error analysis is the process of investigating failures and identifying root causes.
The goal is to improve:
- Reliability
- Accuracy
- Safety
- Performance
- User experience
Common AI Agent Failure Types
Retrieval Failures
Examples:
- Wrong documents retrieved
- Missing relevant documents
- Low-quality embeddings
- Poor chunking strategy
Solutions:
- Improve chunking
- Use hybrid search
- Fine-tune or replace the embedding model
- Improve metadata filtering
Prompt Failures
Examples:
- Ambiguous prompts
- Missing instructions
- Weak system prompts
- Excessively large prompts
Solutions:
- Refine prompt templates
- Add examples
- Improve role instructions
- Use structured outputs
Tool Invocation Failures
Examples:
- Tool unavailable
- Invalid parameters
- Incorrect API schema
- Timeout issues
Solutions:
- Add retries
- Validate inputs
- Improve schemas
- Add fallback workflows
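The retry and fallback solutions above can be sketched as follows. The helper name `call_with_retries`, the retry count, and the delays are illustrative assumptions, and `flaky_tool` simulates a tool that times out twice before succeeding.

```python
import time


def call_with_retries(tool, fallback, attempts=3, base_delay=0.01):
    """Try tool() up to `attempts` times with exponential backoff,
    then return fallback() if every attempt fails."""
    for attempt in range(attempts):
        try:
            return tool()
        except (TimeoutError, ConnectionError):
            if attempt < attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    return fallback()


calls = {"count": 0}


def flaky_tool():
    """Simulated tool that fails on its first two invocations."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "live result"


def always_down():
    """Simulated tool that is unavailable."""
    raise ConnectionError("simulated outage")


result = call_with_retries(flaky_tool, fallback=lambda: "cached result")
degraded = call_with_retries(always_down, fallback=lambda: "cached result")
print(result, degraded)
```

Only transient error types are retried here. Invalid parameters or schema mismatches would fail the same way on every attempt, so those are better handled by input validation before the call.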
Reasoning Failures
Examples:
- Incorrect multistep logic
- Incomplete planning
- Contradictory outputs
- Failed task sequencing
Solutions:
- Break tasks into smaller steps
- Use orchestration frameworks
- Add verification stages
- Add human approval checkpoints
Memory Failures
Examples:
- Forgetting earlier conversation context
- Using outdated memory
- Injecting irrelevant memory
Solutions:
- Summarize memory
- Use memory expiration policies
- Improve retrieval logic
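A memory expiration policy can be as simple as dropping entries older than a time-to-live. The class below is a hypothetical sketch. Timestamps are passed in explicitly so the behavior is easy to test.

```python
import time


class ExpiringMemory:
    """Conversation memory that forgets entries older than ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._items = []  # list of (timestamp, text) pairs

    def add(self, text, now=None):
        """Store a memory entry, stamped with the current time by default."""
        self._items.append((time.time() if now is None else now, text))

    def recall(self, now=None):
        """Drop expired entries and return the remaining texts in order."""
        now = time.time() if now is None else now
        self._items = [(t, x) for t, x in self._items if now - t <= self.ttl_seconds]
        return [x for _, x in self._items]


memory = ExpiringMemory(ttl_seconds=60)
memory.add("user prefers email", now=0)
memory.add("order A1 is delayed", now=50)
remembered = memory.recall(now=70)
print(remembered)
```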
Root-Cause Analysis
Developers use logs and traces to identify:
- What failed
- Where it failed
- Why it failed
- Which dependency caused failure
Root-cause analysis often examines:
- Prompt versions
- Model versions
- Retrieved documents
- Tool responses
- System state
- User inputs
A/B Testing and Continuous Improvement
A/B Testing
A/B testing compares multiple versions of:
- Prompts
- Models
- Retrieval strategies
- Tool orchestration
- Agent workflows
Example:
- Version A uses GPT-4
- Version B uses a smaller model
Metrics are compared to determine the better approach.
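A minimal sketch of that comparison is shown below, assuming each variant's runs have been logged with a success flag, latency, and token count. All values are invented for illustration.

```python
from statistics import mean


def summarize(runs):
    """Aggregate logged runs for one variant into comparable metrics."""
    return {
        "success_rate": mean(1 if r["success"] else 0 for r in runs),
        "avg_latency_ms": mean(r["latency_ms"] for r in runs),
        "avg_tokens": mean(r["tokens"] for r in runs),
    }


# Hypothetical logged runs: variant A is a larger model, variant B a smaller one.
runs_a = [
    {"success": True, "latency_ms": 900, "tokens": 1500},
    {"success": True, "latency_ms": 1100, "tokens": 1700},
    {"success": False, "latency_ms": 1000, "tokens": 1600},
    {"success": True, "latency_ms": 1000, "tokens": 1600},
]
runs_b = [
    {"success": True, "latency_ms": 400, "tokens": 700},
    {"success": False, "latency_ms": 600, "tokens": 900},
    {"success": False, "latency_ms": 500, "tokens": 800},
    {"success": True, "latency_ms": 500, "tokens": 800},
]
summary_a = summarize(runs_a)
summary_b = summarize(runs_b)
print("A:", summary_a)
print("B:", summary_b)
```

In this made-up data the smaller model is faster and cheaper but succeeds less often, so the choice depends on which metric matters most. With only a handful of runs per variant, differences like these would not be statistically meaningful, so a real test needs far more traffic.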
Continuous Evaluation
Production AI systems should continuously evaluate:
- Safety
- Quality
- Relevance
- Cost
- Latency
- User satisfaction
Continuous evaluation helps detect:
- Drift
- Degradation
- Emerging risks
Responsible AI Monitoring
Responsible AI monitoring includes:
- Safety evaluations
- Bias detection
- Toxicity detection
- Compliance auditing
- Human oversight
- Approval workflows
Monitoring should ensure agents:
- Follow policies
- Avoid harmful outputs
- Respect privacy
- Operate within defined constraints
Human-in-the-Loop Monitoring
High-risk systems often include human review.
Examples:
- Financial recommendations
- Medical suggestions
- Legal analysis
- Security operations
Human reviewers may:
- Approve actions
- Review flagged outputs
- Escalate incidents
- Correct model errors
Alerting and Incident Response
Monitoring systems should generate alerts for:
- Increased hallucinations
- Safety violations
- Tool failures
- Excessive latency
- Rising error rates
- Unusual traffic spikes
Alerts support rapid incident response.
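The logic behind a threshold alert can be sketched in a few lines. The metric names and limits below are assumptions chosen for the example. In Azure, rules like these would normally be configured as Azure Monitor alert rules rather than written by hand.

```python
# Hypothetical limits for one monitoring window.
THRESHOLDS = {"error_rate": 0.05, "p95_latency_ms": 3000, "safety_violations": 0}


def check_alerts(window_metrics, thresholds=THRESHOLDS):
    """Return a message for every metric that exceeds its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = window_metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds limit {limit}")
    return alerts


window = {"error_rate": 0.08, "p95_latency_ms": 1200, "safety_violations": 1}
alerts = check_alerts(window)
for message in alerts:
    print("ALERT:", message)
```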
Dashboards and Visualization
Dashboards help teams monitor AI systems visually.
Typical dashboard metrics include:
- Request volume
- Token consumption
- Failure rates
- Latency
- Safety incidents
- Tool usage
- Retrieval quality
- User ratings
Azure dashboards commonly use:
- Azure Monitor
- Power BI
- Application Insights workbooks
Best Practices for Monitoring AI Agents
Enable Full Tracing
Capture:
- Inputs
- Outputs
- Tool calls
- Retrieval results
- Safety decisions
Log Prompt Versions
Always track:
- Prompt templates
- System messages
- Model versions
This simplifies debugging.
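One lightweight way to track versions is to log a short fingerprint of the prompt template, system message, and model name with every request. The function below is an illustrative sketch, and the model name is a placeholder.

```python
import hashlib
import json


def prompt_version(template, system_message, model):
    """Return a short, stable fingerprint for a prompt configuration."""
    payload = json.dumps(
        {"template": template, "system": system_message, "model": model},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


v1 = prompt_version("Answer using: {context}", "You are a support agent.", "example-model")
v2 = prompt_version("Answer briefly using: {context}", "You are a support agent.", "example-model")
print(v1, v2)
```

Any change to the template, system message, or model produces a different fingerprint, so a spike in failures can be matched to the exact configuration that was live at the time.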
Evaluate Continuously
Do not evaluate only during development.
Production evaluation is essential.
Use Human Review for High-Risk Tasks
High-impact decisions should include human oversight.
Monitor Cost and Performance
Track:
- Token usage
- Latency
- Throughput
- Scaling costs
Test Failure Scenarios
Simulate:
- Tool outages
- Bad retrieval
- Prompt injection
- Rate limits
- Safety attacks
AI-103 Exam Tips
For the AI-103 exam, remember these important points:
- Monitoring AI agents requires more than infrastructure monitoring.
- Observability includes prompts, tool calls, retrieval, memory, and outputs.
- Application Insights and Azure Monitor are commonly used for telemetry.
- Grounding evaluations help detect hallucinations.
- Safety evaluations identify harmful outputs.
- Trace logging is essential for debugging multistep workflows.
- Tool-call monitoring helps identify orchestration failures.
- Retrieval quality directly affects RAG system quality.
- Error analysis focuses on root causes and corrective actions.
- Human oversight is important in high-risk systems.
Practice Exam Questions
Question 1
What is the primary purpose of observability in AI agent systems?
A. Reduce cloud storage usage
B. Understand internal agent behavior through telemetry and logs
C. Eliminate all hallucinations
D. Increase GPU memory
Correct Answer
B. Understand internal agent behavior through telemetry and logs
Explanation
Observability helps developers understand prompts, tool calls, retrieval steps, failures, and outputs within AI systems.
Question 2
Which Azure service is commonly used for collecting application telemetry and exceptions?
A. Azure DNS
B. Azure Kubernetes Service
C. Application Insights
D. Azure Files
Correct Answer
C. Application Insights
Explanation
Application Insights collects telemetry, traces, exceptions, performance metrics, and dependency information.
Question 3
What is a hallucination in generative AI?
A. A successful retrieval operation
B. A fabricated or incorrect model output
C. A network timeout
D. A token optimization method
Correct Answer
B. A fabricated or incorrect model output
Explanation
Hallucinations occur when a model generates false or unsupported information.
Question 4
Which evaluation type verifies whether model responses are supported by retrieved documents?
A. Infrastructure evaluation
B. Throughput evaluation
C. Grounding evaluation
D. Scaling evaluation
Correct Answer
C. Grounding evaluation
Explanation
Grounding evaluations assess whether responses align with retrieved sources.
Question 5
Which issue is most likely caused by poor retrieval quality in a RAG system?
A. GPU overheating
B. Irrelevant or incomplete answers
C. Faster response times
D. Lower token usage
Correct Answer
B. Irrelevant or incomplete answers
Explanation
Poor retrieval quality reduces the relevance and accuracy of generated answers.
Question 6
What is the purpose of trace logging in AI workflows?
A. Increase storage costs
B. Encrypt prompts
C. Record workflow execution details for debugging
D. Replace vector search
Correct Answer
C. Record workflow execution details for debugging
Explanation
Trace logging captures execution steps, tool calls, retrieval results, and model outputs.
Question 7
Which metric directly measures how quickly an AI agent responds?
A. Recall
B. Latency
C. Groundedness
D. Fluency
Correct Answer
B. Latency
Explanation
Latency measures response time.
Question 8
What is a common strategy for improving reliability in high-risk AI systems?
A. Removing all monitoring
B. Disabling safety filters
C. Adding human-in-the-loop approvals
D. Eliminating trace logs
Correct Answer
C. Adding human-in-the-loop approvals
Explanation
Human review improves oversight and reduces risks in sensitive workflows.
Question 9
Which type of failure occurs when an agent selects the wrong API or tool?
A. Memory failure
B. Retrieval failure
C. Tool invocation failure
D. Scaling failure
Correct Answer
C. Tool invocation failure
Explanation
Incorrect tool selection or invalid tool parameters are tool invocation failures.
Question 10
Why is continuous evaluation important in production AI systems?
A. To permanently lock model behavior
B. To detect degradation, drift, and emerging risks
C. To reduce all network traffic
D. To eliminate telemetry collection
Correct Answer
B. To detect degradation, drift, and emerging risks
Explanation
Continuous evaluation helps organizations identify quality degradation, safety issues, and changing system behavior over time.
Final Thoughts
Monitoring and evaluating AI agents is one of the most important responsibilities for AI developers working with Azure AI Foundry. Production AI systems require continuous observability, telemetry analysis, safety evaluation, grounding validation, and error analysis.
For the AI-103 exam, candidates should understand:
- How to monitor AI agents
- Which Azure services support observability
- How to evaluate AI quality and safety
- How to detect hallucinations
- How to analyze failures
- How to improve agent reliability and performance
Strong monitoring and evaluation practices are essential for building trustworthy, scalable, and production-ready AI systems.
