This post is part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub.
This topic falls under these sections:
Implement generative AI and agentic solutions (30–35%)
   --> Build agents by using Foundry
      --> Integrate monitoring into deployed agents, evaluate agent behavior, and perform error analysis

Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Monitoring, evaluation, and error analysis are critical components of production-grade AI agent systems. In the AI-103 certification exam, Microsoft expects candidates to understand how to monitor deployed agents, assess their behavior, identify failures, improve safety and reliability, and continuously optimize agent performance.

Modern AI agents are dynamic systems that can reason, retrieve information, call tools, maintain memory, and execute multistep workflows. Because of this complexity, monitoring an AI agent goes far beyond checking whether an API endpoint is online. Developers must monitor prompts, tool usage, retrieval quality, token consumption, latency, failures, safety issues, hallucinations, and overall user satisfaction.

Azure AI Foundry provides tools and integrations that help developers monitor deployed agents, evaluate outputs, perform safety evaluations, collect telemetry, and conduct root-cause analysis when problems occur.

This article covers the key AI-103 exam concepts related to:

Monitoring deployed AI agents
Agent observability
Telemetry collection
Logging and tracing
Evaluating agent behavior
Measuring quality and safety
Detecting hallucinations and grounding failures
Tool-call monitoring
Conversation analytics
Error analysis techniques
Root-cause investigation
Failure handling and resiliency
Responsible AI evaluation
Continuous improvement workflows

Why Monitoring Matters in AI Agent Systems

Traditional software systems generally behave deterministically. Given the same input, the system usually produces the same output.

AI agents behave probabilistically. Outputs may vary even when the same prompt is submitted twice. Agents can also:

Use external tools
Retrieve documents
Perform reasoning steps
Maintain conversational memory
Execute actions autonomously
Interact with multiple systems

Because of this complexity, production AI systems require strong observability and monitoring capabilities.

Monitoring helps organizations:

Detect failures quickly
Identify hallucinations
Measure quality
Improve safety
Optimize costs
Detect prompt injection attempts
Analyze user satisfaction
Improve retrieval relevance
Tune prompts and workflows
Validate grounding quality
Ensure compliance and auditing

Without monitoring, developers cannot reliably improve or trust deployed AI systems.

Core Monitoring Concepts

Observability

Observability refers to the ability to understand what an AI system is doing internally based on telemetry and logs.

An observable AI system provides insight into:

Prompts
Responses
Tool calls
Retrieval results
Execution paths
Latency
Failures
Safety violations
Token usage
Model selection
User interactions

Observability enables developers to diagnose problems efficiently.

Telemetry

Telemetry is operational data collected from the AI system.

Examples include:

API response times
Number of tokens consumed
Tool invocation counts
Search query performance
Error rates
Memory usage
Agent workflow duration
Failed requests
User feedback scores

Telemetry data is often stored in, or routed through, services such as:

Azure Monitor
Application Insights
Log Analytics
Event Hubs
Data Lake storage

Trace Logging

Tracing records the sequence of operations executed during an agent interaction.

A trace may include:

User prompt
System prompt
Retrieval request
Retrieved documents
Tool calls
Model response
Safety filter results
Final output

Tracing is essential for debugging multistep agent workflows.
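
The trace contents listed above can be pictured as an ordered list of recorded steps. The following is a minimal sketch, assuming a hypothetical in-memory `Trace` and `TraceStep` structure; neither is an Azure AI Foundry API, and a production system would more likely export spans through a tracing library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TraceStep:
    """One recorded operation in an agent interaction (hypothetical structure)."""
    kind: str            # e.g. "user_prompt", "retrieval_request", "tool_call"
    detail: str          # short description of what happened
    duration_ms: float   # how long the step took
    ok: bool = True      # False when the step failed


@dataclass
class Trace:
    """Ordered record of the steps executed for a single interaction."""
    trace_id: str
    steps: List[TraceStep] = field(default_factory=list)

    def record(self, kind: str, detail: str, duration_ms: float, ok: bool = True) -> None:
        self.steps.append(TraceStep(kind, detail, duration_ms, ok))

    def first_failure(self) -> Optional[TraceStep]:
        """Return the earliest failed step, or None if every step succeeded."""
        return next((s for s in self.steps if not s.ok), None)

    def total_duration_ms(self) -> float:
        return sum(s.duration_ms for s in self.steps)


# Illustrative interaction: the tool call times out and the agent falls back.
trace = Trace(trace_id="demo-1")
trace.record("user_prompt", "What is my order status?", 1.0)
trace.record("retrieval_request", "search: order status policy", 120.0)
trace.record("tool_call", "get_order(order_id=42) timed out", 5000.0, ok=False)
trace.record("model_response", "fallback apology message", 800.0)

failed = trace.first_failure()
```

Because the steps are kept in execution order, `first_failure()` points at the earliest step that went wrong, which is usually where debugging of a multistep workflow starts.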

Monitoring Deployed Agents in Azure

Azure AI Foundry Monitoring

Azure AI Foundry provides monitoring capabilities for:

Model deployments
Agent workflows
Prompt flows
Evaluation pipelines
Safety evaluations
Token usage
Latency metrics
Failure tracking

Developers can analyze:

Request success rates
Response quality
Grounding quality
Safety incidents
Performance bottlenecks

Azure Monitor

Azure Monitor collects metrics and logs across Azure resources.

Common AI monitoring scenarios include:

Monitoring API latency
Detecting spikes in failed requests
Monitoring throughput
Alerting on quota exhaustion
Monitoring infrastructure health

Azure Monitor alerts can trigger, through action groups:

Email alerts
SMS notifications
Logic Apps workflows
Incident response tickets

Application Insights

Application Insights provides detailed application telemetry.

For AI agents, it can track:

User sessions
API calls
Exceptions
Dependency failures
Custom events
Prompt execution traces
Response timing

Application Insights is commonly integrated into:

Web applications
Chatbots
Agent orchestration systems
API gateways

Log Analytics

Log Analytics enables querying and analyzing telemetry data.

Developers can:

Search logs
Build dashboards
Analyze trends
Correlate failures
Investigate incidents

Kusto Query Language (KQL) is commonly used for analysis.

Example:

			
requests
| where success == false
| summarize count() by operation_Name

Important Metrics for AI Agents

Latency

Latency measures how long it takes for the agent to respond.

High latency may be caused by:

Slow model inference
Large prompts
Slow tool APIs
Complex orchestration
Vector search delays
Network bottlenecks

Low latency is especially important for:

Customer support bots
Interactive copilots
Real-time assistants
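
Latency is usually reported as percentiles rather than an average, because a few slow requests can hide behind a healthy-looking mean. The sketch below uses the nearest-rank method on hypothetical response times; the sample values and the `percentile` helper are assumptions for illustration only.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical end-to-end response times in milliseconds (one slow outlier).
latencies_ms = [220, 250, 240, 300, 260, 230, 4100, 280, 270, 245]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
```

In this sample the median stays near 250 ms while the p95 exposes the 4100 ms outlier, which is the kind of tail behavior interactive assistants need to watch.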

Token Usage

High token consumption increases both cost and latency.

Developers monitor:

Prompt tokens
Completion tokens
Total tokens per session
Tokens per workflow step

Reducing token usage may involve:

Shorter prompts
Better chunking
Summarized memory
Smaller models
Context pruning
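
The token categories above can be rolled up per workflow step and per session. This is a rough sketch under stated assumptions: the `TokenUsage` type is hypothetical, and the per-1,000-token prices are placeholders, not real Azure pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


# Placeholder prices per 1,000 tokens; NOT real Azure pricing.
PROMPT_PRICE_PER_1K = 0.01
COMPLETION_PRICE_PER_1K = 0.03


def estimate_cost(usage: TokenUsage) -> float:
    """Estimate cost from token counts using the placeholder prices above."""
    return (usage.prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
            + usage.completion_tokens / 1000 * COMPLETION_PRICE_PER_1K)


# Token usage for each workflow step in one hypothetical session.
session_steps = [
    TokenUsage(prompt_tokens=1200, completion_tokens=150),  # planning
    TokenUsage(prompt_tokens=3000, completion_tokens=400),  # retrieval-augmented answer
    TokenUsage(prompt_tokens=800, completion_tokens=50),    # summary
]

session_total = sum(step.total_tokens for step in session_steps)
session_cost = sum(estimate_cost(step) for step in session_steps)
```

Tracking usage per step makes it easier to see which part of the workflow (here, the retrieval-augmented answer) dominates the session total.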

Error Rates

Error monitoring helps identify instability.

Examples:

Failed tool calls
Timeout errors
Retrieval failures
API authentication errors
Model overload conditions
Rate-limit violations

High error rates indicate reliability issues.
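
An overall error rate is more useful when it is broken down by category. The sketch below assumes each request outcome has already been labeled with a category string (or `None` for success); the category names are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def error_breakdown(outcomes: Iterable[Optional[str]]) -> Dict[str, float]:
    """Map each error category to its share of all requests.

    Each outcome is None for a successful request or an error category string.
    """
    outcomes = list(outcomes)
    if not outcomes:
        return {}
    counts = Counter(o for o in outcomes if o is not None)
    return {category: n / len(outcomes) for category, n in counts.items()}


# Hypothetical request outcomes; category names are illustrative.
outcomes = [None, None, "timeout", None, "rate_limit", "timeout",
            None, None, None, "tool_failure"]

breakdown = error_breakdown(outcomes)
overall_error_rate = sum(breakdown.values())
```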

Throughput

Throughput measures how many requests the system can handle in a given period, for example requests per second.

Important for:

High-scale enterprise systems
Public-facing chatbots
Large customer-service systems

User Satisfaction

User feedback is critical for evaluating agent quality.

Methods include:

Thumbs up/down feedback
Star ratings
Survey scores
Conversation abandonment rates
Escalation frequency

User feedback helps identify:

Hallucinations
Poor reasoning
Irrelevant responses
Unsafe behavior

Evaluating Agent Behavior

Why Evaluation Is Important

AI agents may appear functional while still producing:

Unsafe outputs
Incorrect reasoning
Fabricated facts
Poor tool usage
Low-quality retrieval
Biased responses

Evaluation helps verify that the system performs reliably and surfaces these problems before users do.

Types of Evaluations

Quality Evaluation

Measures:

Accuracy
Completeness
Helpfulness
Relevance
Coherence

Example questions:

Did the response answer the user question?
Was the answer correct?
Was the response understandable?

Grounding Evaluation

Grounding evaluations verify whether responses are supported by retrieved data.

This is especially important in RAG systems.

Developers evaluate:

Citation accuracy
Retrieval relevance
Hallucination frequency
Source alignment

Poor grounding may indicate:

Bad chunking
Weak embeddings
Incorrect search ranking
Missing documents
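
To make the idea of "supported by retrieved data" concrete, here is a deliberately naive sketch that flags response sentences with little word overlap with any retrieved source. It is a lexical proxy written for illustration, not the groundedness evaluator in Azure AI Foundry; the function names and the 0.6 threshold are assumptions.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased word tokens, ignoring words of one or two characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 2}


def support_ratio(sentence: str, sources: List[str]) -> float:
    """Fraction of the sentence's tokens that appear in the best-matching source."""
    sent = _tokens(sentence)
    if not sent:
        return 1.0  # nothing to verify
    return max((len(sent & _tokens(src)) / len(sent) for src in sources), default=0.0)


def ungrounded_sentences(response: str, sources: List[str], threshold: float = 0.6) -> List[str]:
    """Return response sentences whose lexical support falls below the threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    return [s for s in sentences if support_ratio(s, sources) < threshold]


sources = ["Refunds are available within 30 days of purchase with a valid receipt."]
response = "Refunds are available within 30 days of purchase. Shipping is always free worldwide."

flagged = ungrounded_sentences(response, sources)
```

Word overlap misses paraphrases and cannot catch a sentence that reuses source words to say something false, so a check like this is at best a cheap first filter ahead of model-based or human evaluation.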

Safety Evaluation

Safety evaluations identify harmful or policy-violating outputs.

Examples:

Hate speech
Violence
Self-harm content
Prompt injection success
Sensitive information leakage
Toxic responses

Azure AI safety tooling, such as Azure AI Content Safety, can help detect these issues.

Tool Usage Evaluation

Agents may incorrectly:

Select the wrong tool
Pass invalid parameters
Call tools too frequently
Fail to call required tools

Tool evaluation measures:

Tool selection accuracy
Parameter correctness
Tool success rates
Tool latency
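
The first three tool metrics above can be computed from a labeled test set that pairs the expected tool call with what the agent actually did. The sketch below assumes a hypothetical `ToolCallCase` record; the tool names and arguments are made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToolCallCase:
    expected_tool: str
    expected_args: Dict[str, Any]
    actual_tool: str
    actual_args: Dict[str, Any]
    succeeded: bool


def tool_metrics(cases: List[ToolCallCase]) -> Dict[str, float]:
    """Compute selection accuracy, parameter correctness, and success rate."""
    if not cases:
        raise ValueError("cases must not be empty")
    right_tool = [c for c in cases if c.actual_tool == c.expected_tool]
    right_args = [c for c in right_tool if c.actual_args == c.expected_args]
    return {
        "tool_selection_accuracy": len(right_tool) / len(cases),
        # Parameter correctness is only measured when the right tool was chosen.
        "parameter_correctness": len(right_args) / len(right_tool) if right_tool else 0.0,
        "tool_success_rate": sum(c.succeeded for c in cases) / len(cases),
    }


# Hypothetical evaluation cases.
cases = [
    ToolCallCase("get_weather", {"city": "Paris"}, "get_weather", {"city": "Paris"}, True),
    ToolCallCase("get_weather", {"city": "Oslo"}, "get_weather", {"city": "Olso"}, True),
    ToolCallCase("search_docs", {"query": "refund"}, "get_weather", {"city": "refund"}, False),
    ToolCallCase("search_docs", {"query": "sla"}, "search_docs", {"query": "sla"}, True),
]

metrics = tool_metrics(cases)
```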

Conversation Evaluation

Conversation quality evaluation measures:

Context retention
Memory quality
Conversation consistency
Turn-by-turn coherence
Goal completion success

Evaluators in Azure AI Foundry

Azure AI Foundry supports evaluators that help assess model and agent quality.

Evaluators may analyze:

Relevance
Groundedness
Coherence
Fluency
Safety
Similarity to reference answers

Evaluation pipelines may run:

During development
During testing
After deployment
Continuously in production

Detecting Hallucinations

What Is a Hallucination?

A hallucination occurs when the model generates false or fabricated information.

Examples:

Invented facts
Nonexistent citations
False calculations
Fabricated policies
Incorrect summaries

Causes of Hallucinations

Common causes include:

Weak grounding
Missing context
Poor prompts
Overly broad tasks
Outdated training data
Low retrieval quality

Hallucination Detection Techniques

Methods include:

Grounding evaluations
Citation verification
Reference-answer comparison
Human review
Fact-checking pipelines
Confidence scoring
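
Citation verification is the easiest of these techniques to automate. The sketch below assumes the agent cites sources inline as `[doc-id]` and that the IDs of the retrieved documents are known; both conventions are assumptions for illustration.

```python
import re
from typing import List, Set


def unknown_citations(response: str, retrieved_ids: Set[str]) -> List[str]:
    """Return cited IDs (in [id] form) that were never retrieved, in order of first appearance."""
    cited = re.findall(r"\[([A-Za-z0-9_-]+)\]", response)
    seen: Set[str] = set()
    missing: List[str] = []
    for cid in cited:
        if cid not in retrieved_ids and cid not in seen:
            missing.append(cid)
            seen.add(cid)
    return missing


response = "The warranty lasts two years [doc1]. Returns are free [doc4]. See also [doc4]."
retrieved_ids = {"doc1", "doc2"}

missing = unknown_citations(response, retrieved_ids)
```

A citation to a document that was never retrieved is a strong hallucination signal, although a valid citation ID alone does not prove the cited text supports the claim.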

Monitoring Retrieval Quality

In RAG systems, retrieval quality strongly affects response quality.

Developers monitor:

Search relevance
Chunk quality
Embedding effectiveness
Citation accuracy
Vector search latency
Retrieval precision
Retrieval recall

Poor retrieval causes:

Irrelevant answers
Missing context
Hallucinations
Reduced trustworthiness
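
Retrieval precision and recall are typically measured at a cutoff k against a set of documents judged relevant for a query. A minimal sketch, with made-up document IDs and relevance labels:

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(doc in relevant for doc in top_k) / len(top_k)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical ranked results and relevance judgments for one query.
retrieved = ["doc3", "doc7", "doc1", "doc9", "doc4"]
relevant = {"doc1", "doc3", "doc5"}

p = precision_at_k(retrieved, relevant, k=3)
r = recall_at_k(retrieved, relevant, k=3)
```

Here two of the top three results are relevant, and two of the three relevant documents were found, so both values come out to two thirds. Low precision tends to show up as irrelevant answers, while low recall shows up as missing context.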

Error Analysis in AI Systems

What Is Error Analysis?

Error analysis is the process of investigating failures and identifying root causes.

The goal is to improve:

Reliability
Accuracy
Safety
Performance
User experience

Common AI Agent Failure Types

Retrieval Failures

Examples:

Wrong documents retrieved
Missing relevant documents
Low-quality embeddings
Poor chunking strategy

Solutions:

Improve chunking
Use hybrid search
Tune embeddings
Improve metadata filtering

Prompt Failures

Examples:

Ambiguous prompts
Missing instructions
Weak system prompts
Excessively large prompts

Solutions:

Refine prompt templates
Add examples
Improve role instructions
Use structured outputs

Tool Invocation Failures

Examples:

Tool unavailable
Invalid parameters
Incorrect API schema
Timeout issues

Solutions:

Add retries
Validate inputs
Improve schemas
Add fallback workflows
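
Retries and fallbacks are often combined in a small wrapper around the tool call. The following is a sketch under stated assumptions: `call_with_retries`, the backoff values, and the simulated `flaky_tool` are all hypothetical, and a real system would retry only on errors known to be transient.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retries(
    tool: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call a tool with exponential backoff; use the fallback if all attempts fail."""
    for attempt in range(max_attempts):
        try:
            return tool()
        except Exception:  # a real system would catch only retryable errors
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)  # delay doubles after each failure
    return fallback()


attempts = {"count": 0}


def flaky_tool() -> str:
    """Simulated tool that times out twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "order shipped"


delays: List[float] = []  # record requested delays instead of actually sleeping
result = call_with_retries(
    flaky_tool,
    fallback=lambda: "please try again later",
    sleep=delays.append,
)
```

Passing `sleep` as a parameter keeps the example fast to run and makes the backoff schedule easy to inspect in tests.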

Reasoning Failures

Examples:

Incorrect multistep logic
Incomplete planning
Contradictory outputs
Failed task sequencing

Solutions:

Break tasks into smaller steps
Use orchestration frameworks
Add verification stages
Add human approval checkpoints

Memory Failures

Examples:

Forgetting earlier conversation context
Using outdated memory
Injecting irrelevant memory

Solutions:

Summarize memory
Use memory expiration policies
Improve retrieval logic
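
A memory expiration policy can be as simple as a time-to-live plus a cap on how many items are kept. The sketch below assumes a hypothetical `MemoryItem` record with a creation timestamp; the TTL and cap values are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryItem:
    text: str
    created_at: float  # seconds since an arbitrary epoch


def active_memory(items: List[MemoryItem], now: float, ttl_s: float, max_items: int) -> List[MemoryItem]:
    """Drop items older than ttl_s, then keep only the most recent max_items."""
    fresh = [m for m in items if now - m.created_at <= ttl_s]
    fresh.sort(key=lambda m: m.created_at)
    return fresh[-max_items:] if max_items > 0 else []


# Hypothetical conversation memory.
items = [
    MemoryItem("prefers email contact", created_at=0),
    MemoryItem("asked about order 42", created_at=100),
    MemoryItem("asked about refund", created_at=3500),
    MemoryItem("wants refund to card", created_at=3590),
]

kept = active_memory(items, now=3600, ttl_s=3550, max_items=2)
kept_texts = [m.text for m in kept]
```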

Root-Cause Analysis

Developers use logs and traces to identify:

What failed
Where it failed
Why it failed
Which dependency caused failure

Root-cause analysis often examines:

Prompt versions
Model versions
Retrieved documents
Tool responses
System state
User inputs

A/B Testing and Continuous Improvement

A/B Testing

A/B testing compares multiple versions of:

Prompts
Models
Retrieval strategies
Tool orchestration
Agent workflows

Example:

Version A uses GPT-4
Version B uses a smaller model

Metrics such as quality, latency, and cost are then compared across versions to determine the better approach.
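
One common way to run such a comparison is to assign each user to a variant deterministically and then compare a metric per variant. The sketch below is an illustration only: the experiment name, the hashing scheme, and the star ratings are assumptions.

```python
import hashlib
from statistics import mean
from typing import Dict, List


def assign_variant(user_id: str, experiment: str = "model-size-test") -> str:
    """Deterministically assign a user to variant A or B by hashing their ID."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def summarize(ratings: Dict[str, List[int]]) -> Dict[str, float]:
    """Mean user rating per variant."""
    return {variant: mean(values) for variant, values in ratings.items() if values}


# Hypothetical 1-5 star ratings collected per variant.
ratings = {"A": [5, 4, 4, 5], "B": [4, 3, 4, 3]}

summary = summarize(ratings)
better = max(summary, key=summary.get)
```

Hashing keeps each user on the same variant across sessions. With samples this small the difference in means is not statistically meaningful; a real experiment needs enough traffic and a significance test before declaring a winner.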

Continuous Evaluation

Production AI systems should continuously evaluate:

Safety
Quality
Relevance
Cost
Latency
User satisfaction

Continuous evaluation helps detect:

Drift
Degradation
Emerging risks
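
Degradation can be detected by comparing a recent window of evaluation scores with a baseline window. A minimal sketch, assuming scores in the range 0 to 1 and an illustrative tolerance of 0.05:

```python
from statistics import mean
from typing import Sequence


def degraded(baseline: Sequence[float], recent: Sequence[float], max_drop: float = 0.05) -> bool:
    """True when the recent mean score is more than max_drop below the baseline mean."""
    if not baseline or not recent:
        raise ValueError("both windows need at least one score")
    return mean(baseline) - mean(recent) > max_drop


baseline_scores = [0.92, 0.90, 0.91, 0.93]  # hypothetical groundedness scores at launch
recent_scores = [0.84, 0.82, 0.85, 0.83]    # hypothetical scores from the latest window

alert = degraded(baseline_scores, recent_scores)
```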

Responsible AI Monitoring

Responsible AI monitoring includes:

Safety evaluations
Bias detection
Toxicity detection
Compliance auditing
Human oversight
Approval workflows

Monitoring should ensure agents:

Follow policies
Avoid harmful outputs
Respect privacy
Operate within defined constraints

Human-in-the-Loop Monitoring

High-risk systems often include human review.

Examples:

Financial recommendations
Medical suggestions
Legal analysis
Security operations

Human reviewers may:

Approve actions
Review flagged outputs
Escalate incidents
Correct model errors

Alerting and Incident Response

Monitoring systems should generate alerts for:

Increased hallucinations
Safety violations
Tool failures
Excessive latency
Rising error rates
Unusual traffic spikes

Alerts support rapid incident response.
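
Most of the alert conditions above reduce to a metric crossing a threshold. The sketch below shows that logic in isolation; the `AlertRule` type, metric names, and thresholds are hypothetical, and in Azure this would normally be configured as alert rules rather than written by hand.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    message: str


# Illustrative thresholds; real values depend on the workload.
RULES = (
    AlertRule("error_rate", 0.05, "Error rate above 5%"),
    AlertRule("p95_latency_ms", 3000, "p95 latency above 3 seconds"),
    AlertRule("safety_violations", 0, "Safety violation detected"),
)


def evaluate_rules(metrics: Dict[str, float], rules: Sequence[AlertRule] = RULES) -> List[str]:
    """Return the message of every rule whose metric exceeds its threshold."""
    return [r.message for r in rules if metrics.get(r.metric, 0) > r.threshold]


# Hypothetical metrics for the current monitoring window.
current = {"error_rate": 0.08, "p95_latency_ms": 2100, "safety_violations": 1}
fired = evaluate_rules(current)
```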

Dashboards and Visualization

Dashboards help teams monitor AI systems visually.

Typical dashboard metrics include:

Request volume
Token consumption
Failure rates
Latency
Safety incidents
Tool usage
Retrieval quality
User ratings

Azure dashboards commonly use:

Azure Monitor
Power BI
Application Insights workbooks

Best Practices for Monitoring AI Agents

Enable Full Tracing

Capture:

Inputs
Outputs
Tool calls
Retrieval results
Safety decisions

Log Prompt Versions

Always track:

Prompt templates
System messages
Model versions

This simplifies debugging.
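
One lightweight way to track versions is to log a short hash of the prompt text alongside the model version with every response. The sketch below is an assumption about how this could be done, not a Foundry feature; the field names and the `model-v1` label are placeholders.

```python
import hashlib
import json
from typing import Dict


def prompt_version(template: str) -> str:
    """Short, stable identifier derived from the prompt template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def log_record(template: str, system_message: str, model_version: str) -> str:
    """Build a JSON log line capturing which prompt and model produced a response."""
    record: Dict[str, str] = {
        "prompt_version": prompt_version(template),
        "system_message_version": prompt_version(system_message),
        "model_version": model_version,
    }
    return json.dumps(record, sort_keys=True)


line = log_record(
    template="Answer the question using only the provided context: {question}",
    system_message="You are a helpful support assistant.",
    model_version="model-v1",  # placeholder label
)
```

Any edit to the template changes the hash, so a regression in traces can be matched to the exact prompt revision that introduced it.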

Evaluate Continuously

Do not evaluate only during development.

Production evaluation is essential.

Use Human Review for High-Risk Tasks

High-impact decisions should include human oversight.

Monitor Cost and Performance

Track:

Token usage
Latency
Throughput
Scaling costs

Test Failure Scenarios

Simulate:

Tool outages
Bad retrieval
Prompt injection
Rate limits
Safety attacks

AI-103 Exam Tips

For the AI-103 exam, remember these important points:

Monitoring AI agents requires more than infrastructure monitoring.
Observability includes prompts, tool calls, retrieval, memory, and outputs.
Application Insights and Azure Monitor are commonly used for telemetry.
Grounding evaluations help detect hallucinations.
Safety evaluations identify harmful outputs.
Trace logging is essential for debugging multistep workflows.
Tool-call monitoring helps identify orchestration failures.
Retrieval quality directly affects RAG system quality.
Error analysis focuses on root causes and corrective actions.
Human oversight is important in high-risk systems.

Continuous evaluation helps organizations identify quality degradation, safety issues, and changing system behavior over time.

Final Thoughts

Monitoring and evaluating AI agents is one of the most important responsibilities for AI developers working with Azure AI Foundry. Production AI systems require continuous observability, telemetry analysis, safety evaluation, grounding validation, and error analysis.

For the AI-103 exam, candidates should understand: