This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub.
This topic falls under these sections:
Implement generative AI and agentic solutions (30–35%)
--> Build generative applications by using Foundry
--> Evaluate models and apps, including detecting fabrications, relevance, quality, and safety
Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
Building generative AI applications is only part of the development process.
Organizations must also evaluate whether AI systems are:
- Accurate
- Reliable
- Relevant
- Safe
- Grounded
- Trustworthy
AI systems can generate:
- Hallucinations
- Unsafe content
- Biased responses
- Irrelevant answers
- Inconsistent outputs
The AI-103: Develop AI Apps and Agents on Azure certification exam tests your understanding of evaluating models and applications.
For the AI-103 exam, you should understand:
- Model evaluation
- Application evaluation
- Fabrication detection
- Groundedness
- Relevance evaluation
- Quality evaluation
- Safety evaluation
- Responsible AI testing
- Automated evaluators
- Human evaluation
- Benchmarking
- Monitoring and continuous evaluation
Why AI Evaluation Matters
Evaluation is essential because generative AI systems are probabilistic.
This means:
- Responses may vary
- Outputs may be incorrect
- Safety risks may occur
- Hallucinations may appear
Without evaluation, organizations cannot reliably trust AI systems.
What Is AI Evaluation?
AI evaluation is the process of measuring:
- Accuracy
- Safety
- Reliability
- Relevance
- Groundedness
- User satisfaction
Types of AI Evaluation
Common evaluation categories include:
- Model evaluation
- Prompt evaluation
- Retrieval evaluation
- Application evaluation
- Safety evaluation
- Human evaluation
Model Evaluation
Model evaluation focuses on:
- Model quality
- Accuracy
- Performance
- Reasoning ability
Application Evaluation
Application evaluation measures:
- End-to-end user experience
- Workflow success
- Tool orchestration quality
- Groundedness
What Are Fabrications?
Fabrications are generated outputs that:
- Are incorrect
- Are unsupported
- Contain invented facts
- Misrepresent information
Fabrications are commonly called hallucinations.
Causes of Fabrications
Fabrications may occur because:
- The model lacks relevant knowledge
- Prompts are ambiguous
- Retrieval quality is poor
- Context is insufficient
- Safety constraints are weak
Fabrication Detection
Organizations should evaluate whether outputs:
- Match trusted sources
- Remain grounded
- Avoid unsupported claims
Groundedness Evaluation
Groundedness measures whether responses are supported by:
- Retrieved documents
- Enterprise data
- Trusted sources
Importance of Groundedness
Grounded responses:
- Improve trust
- Reduce hallucinations
- Increase explainability
Retrieval Quality Evaluation
RAG systems should evaluate:
- Search relevance
- Retrieved chunk quality
- Citation accuracy
- Context completeness
Relevance Evaluation
Relevance measures whether responses:
- Answer the user’s question
- Stay on-topic
- Match user intent
Quality Evaluation
Quality evaluations may assess:
- Clarity
- Completeness
- Coherence
- Fluency
- Professionalism
Consistency Evaluation
Consistency measures whether models:
- Produce stable responses
- Avoid contradictory outputs
- Maintain predictable behavior
Safety Evaluation
Safety evaluations identify:
- Harmful outputs
- Toxic content
- Unsafe instructions
- Policy violations
Responsible AI Evaluation
Responsible AI testing focuses on:
- Fairness
- Safety
- Transparency
- Accountability
- Privacy
Bias Evaluation
Organizations should evaluate whether models:
- Produce biased outputs
- Treat groups unfairly
- Reinforce stereotypes
Toxicity Detection
Toxicity evaluations identify:
- Offensive language
- Hate speech
- Harassment
- Abusive content
Jailbreak Testing
Jailbreak testing evaluates whether users can bypass:
- Safety controls
- Content filters
- Guardrails
Adversarial Testing
Adversarial testing intentionally challenges models using:
- Malicious prompts
- Edge cases
- Prompt injection attacks
Prompt Injection Testing
Prompt injection testing evaluates whether:
- External content manipulates model behavior
- Instructions override safety policies
Automated Evaluators
Automated evaluators use:
- Rules
- Scoring systems
- AI-based evaluators
To assess model outputs.
AI-Assisted Evaluation
Some systems use LLMs to evaluate:
- Relevance
- Groundedness
- Quality
- Safety
Human Evaluation
Human reviewers may evaluate:
- Accuracy
- Tone
- Helpfulness
- Safety
- Business alignment
Human-in-the-Loop Evaluation
Human-in-the-loop evaluation combines:
- Automated evaluation
- Human oversight
- Expert validation
Benchmarking Models
Benchmarking compares models using:
- Standard datasets
- Consistent prompts
- Defined metrics
A/B Testing
A/B testing compares:
- Different prompts
- Different models
- Different workflows
Evaluation Metrics
Common metrics include:
- Precision
- Recall
- Accuracy
- Relevance
- Groundedness
- Toxicity scores
- Latency
- User satisfaction
Precision and Recall
Precision
Measures how many retrieved results are relevant.
Recall
Measures how many relevant results were successfully retrieved.
Latency Evaluation
Organizations should measure:
- Response times
- Retrieval delays
- Tool execution times
Cost Evaluation
Cost evaluation considers:
- Token usage
- API calls
- Infrastructure consumption
User Satisfaction Evaluation
Organizations may measure:
- User feedback
- Completion success
- Satisfaction ratings
Continuous Evaluation
AI systems should be evaluated continuously because:
- User behavior changes
- Data evolves
- Model drift may occur
Model Drift
Model drift occurs when:
- Performance changes over time
- Inputs evolve
- User expectations shift
Monitoring Production Systems
Organizations should monitor:
- Safety violations
- Hallucination rates
- Retrieval failures
- Latency spikes
- Cost increases
Evaluation Pipelines
Evaluation pipelines automate:
- Testing
- Scoring
- Reporting
- Regression analysis
Regression Testing
Regression testing ensures updates do not:
- Reduce quality
- Break workflows
- Increase hallucinations
Azure AI Foundry Evaluation Capabilities
Azure AI Foundry supports:
- Evaluation workflows
- Automated evaluators
- Safety monitoring
- Groundedness evaluation
- Prompt testing
- Trace analysis
Trace Analysis
Trace analysis helps inspect:
- Tool calls
- Retrieval steps
- Agent decisions
- Workflow execution
Evaluation Datasets
Organizations should create datasets containing:
- Expected outputs
- Edge cases
- Adversarial prompts
- Real-world scenarios
Synthetic Test Data
Synthetic data may help test:
- Rare scenarios
- Adversarial prompts
- Safety boundaries
Real-World Evaluation Scenarios
Scenario 1: Enterprise Chatbot
Requirements:
- Accurate responses
- Citation support
- Low hallucination rate
Recommended Evaluation:
- Groundedness testing
- Retrieval quality evaluation
Scenario 2: Financial Assistant
Requirements:
- High accuracy
- Safety compliance
- Low fabrication risk
Recommended Evaluation:
- Human review
- Adversarial testing
- Approval workflows
Scenario 3: Customer Support Copilot
Requirements:
- Relevant responses
- Fast response times
- Consistent tone
Recommended Evaluation:
- Latency evaluation
- Quality scoring
- A/B testing
Scenario 4: Agentic Workflow System
Requirements:
- Tool accuracy
- Safe tool execution
- Workflow traceability
Recommended Evaluation:
- Trace analysis
- Tool execution monitoring
- HITL evaluation
Common AI-103 Exam Tips
Understand Evaluation Categories
Know the differences between:
- Relevance
- Quality
- Groundedness
- Safety
- Consistency
Learn Fabrication Detection Concepts
Understand:
- Hallucinations
- Unsupported claims
- Grounding validation
Understand Safety Testing
Know:
- Toxicity testing
- Jailbreak testing
- Prompt injection evaluation
- Adversarial testing
Learn Monitoring Concepts
Understand:
- Continuous evaluation
- Drift detection
- Trace analysis
- Regression testing
Summary
Evaluating generative AI systems is critical for building:
- Reliable
- Safe
- Grounded
- Trustworthy applications
For the AI-103 exam, you should understand:
- Fabrication detection
- Groundedness evaluation
- Retrieval quality
- Relevance testing
- Quality evaluation
- Safety evaluation
- Toxicity detection
- Adversarial testing
- Human evaluation
- Automated evaluators
- Monitoring and drift detection
- Evaluation pipelines
These concepts are foundational for developing enterprise-grade AI applications and agentic systems on Azure.
Practice Exam Questions
Question 1
What is a fabrication in generative AI?
A. A storage replication process
B. An unsupported or invented response
C. A vector indexing method
D. A deployment strategy
Answer
B. An unsupported or invented response
Explanation
Fabrications, also called hallucinations, are incorrect or invented outputs.
Question 2
What does groundedness measure?
A. GPU performance
B. Whether outputs are supported by trusted sources
C. Network bandwidth
D. Token compression efficiency
Answer
B. Whether outputs are supported by trusted sources
Explanation
Groundedness evaluates factual support from retrieved or trusted data.
Question 3
Which evaluation type focuses on harmful or unsafe outputs?
A. Latency evaluation
B. Safety evaluation
C. Compression evaluation
D. Replication evaluation
Answer
B. Safety evaluation
Explanation
Safety evaluations detect harmful, toxic, or policy-violating outputs.
Question 4
What is the purpose of retrieval quality evaluation in RAG systems?
A. Measure GPU speed
B. Assess search relevance and retrieved context quality
C. Reduce storage redundancy
D. Disable embeddings
Answer
B. Assess search relevance and retrieved context quality
Explanation
Retrieval quality measures how useful and relevant retrieved information is.
Question 5
What is jailbreak testing?
A. Testing storage failures
B. Evaluating attempts to bypass safety controls
C. Measuring retrieval latency
D. Compressing prompts
Answer
B. Evaluating attempts to bypass safety controls
Explanation
Jailbreak testing checks whether users can circumvent AI safety mechanisms.
Question 6
Which metric measures whether responses answer the user’s question appropriately?
A. Relevance
B. Replication
C. Throughput
D. Compression
Answer
A. Relevance
Explanation
Relevance evaluates how well outputs match user intent.
Question 7
Why is continuous evaluation important?
A. To eliminate all infrastructure costs
B. Because models and data can change over time
C. To remove all safety policies
D. To disable monitoring
Answer
B. Because models and data can change over time
Explanation
Continuous evaluation helps detect drift and performance degradation.
Question 8
What is adversarial testing?
A. Testing network redundancy
B. Challenging AI systems with malicious or difficult prompts
C. Increasing vector dimensions
D. Optimizing GPU allocation
Answer
B. Challenging AI systems with malicious or difficult prompts
Explanation
Adversarial testing identifies vulnerabilities and unsafe behaviors.
Question 9
What is a benefit of A/B testing in AI systems?
A. Eliminates monitoring requirements
B. Compares prompts or models to identify better performance
C. Removes the need for evaluation datasets
D. Disables retrieval pipelines
Answer
B. Compares prompts or models to identify better performance
Explanation
A/B testing helps optimize prompts, workflows, and models.
Question 10
Which Azure capability helps inspect workflow execution and tool calls?
A. Trace analysis
B. DNS failover
C. Storage mirroring
D. GPU partitioning
Answer
A. Trace analysis
Explanation
Trace analysis provides visibility into workflow execution and reasoning steps.
Go to the AI-103 Exam Prep Hub main page
