This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Implement generative AI and agentic solutions (30–35%)
   --> Build generative applications by using Foundry
      --> Evaluate models and apps, including detecting fabrications, relevance, quality, and safety

Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Building generative AI applications is only part of the development process.

Organizations must also evaluate whether AI systems are:

Accurate
Reliable
Relevant
Safe
Grounded
Trustworthy

AI systems can generate:

Hallucinations
Unsafe content
Biased responses
Irrelevant answers
Inconsistent outputs

The AI-103: Develop AI Apps and Agents on Azure certification exam tests your understanding of evaluating models and applications.

For the AI-103 exam, you should understand:

Model evaluation
Application evaluation
Fabrication detection
Groundedness
Relevance evaluation
Quality evaluation
Safety evaluation
Responsible AI testing
Automated evaluators
Human evaluation
Benchmarking
Monitoring and continuous evaluation

Why AI Evaluation Matters

Evaluation is essential because generative AI systems are probabilistic.

This means:

Responses may vary
Outputs may be incorrect
Safety risks may occur
Hallucinations may appear

Without evaluation, organizations cannot reliably trust AI systems.

What Is AI Evaluation?

AI evaluation is the process of measuring:

Accuracy
Safety
Reliability
Relevance
Groundedness
User satisfaction

Types of AI Evaluation

Common evaluation categories include:

Model evaluation
Prompt evaluation
Retrieval evaluation
Application evaluation
Safety evaluation
Human evaluation

Model Evaluation

Model evaluation focuses on:

Model quality
Accuracy
Performance
Reasoning ability

Application Evaluation

Application evaluation measures:

End-to-end user experience
Workflow success
Tool orchestration quality
Groundedness

What Are Fabrications?

Fabrications are generated outputs that:

Are incorrect
Are unsupported
Contain invented facts
Misrepresent information

Fabrications are commonly called hallucinations.

Causes of Fabrications

Fabrications may occur because:

The model lacks relevant knowledge
Prompts are ambiguous
Retrieval quality is poor
Context is insufficient
Safety constraints are weak

Fabrication Detection

Organizations should evaluate whether outputs:

Match trusted sources
Remain grounded
Avoid unsupported claims

Groundedness Evaluation

Groundedness measures whether responses are supported by:

Retrieved documents
Enterprise data
Trusted sources

Importance of Groundedness

Grounded responses:

Improve trust
Reduce hallucinations
Increase explainability

Retrieval Quality Evaluation

RAG systems should evaluate:

Search relevance
Retrieved chunk quality
Citation accuracy
Context completeness

Relevance Evaluation

Relevance measures whether responses:

Answer the user’s question
Stay on-topic
Match user intent

Quality Evaluation

Quality evaluations may assess:

Clarity
Completeness
Coherence
Fluency
Professionalism

Consistency Evaluation

Consistency measures whether models:

Produce stable responses
Avoid contradictory outputs
Maintain predictable behavior

Safety Evaluation

Safety evaluations identify:

Harmful outputs
Toxic content
Unsafe instructions
Policy violations

Responsible AI Evaluation

Responsible AI testing focuses on:

Fairness
Safety
Transparency
Accountability
Privacy

Bias Evaluation

Organizations should evaluate whether models:

Produce biased outputs
Treat groups unfairly
Reinforce stereotypes

Toxicity Detection

Toxicity evaluations identify:

Offensive language
Hate speech
Harassment
Abusive content

Jailbreak Testing

Jailbreak testing evaluates whether users can bypass:

Safety controls
Content filters
Guardrails

Adversarial Testing

Adversarial testing intentionally challenges models using:

Malicious prompts
Edge cases
Prompt injection attacks

Prompt Injection Testing

Prompt injection testing evaluates whether:

External content manipulates model behavior
Instructions override safety policies

Automated Evaluators

Automated evaluators use:

Rules
Scoring systems
AI-based evaluators

To assess model outputs.

AI-Assisted Evaluation

Some systems use LLMs to evaluate:

Relevance
Groundedness
Quality
Safety

Human Evaluation

Human reviewers may evaluate:

Accuracy
Tone
Helpfulness
Safety
Business alignment

Human-in-the-Loop Evaluation

Human-in-the-loop evaluation combines:

Automated evaluation
Human oversight
Expert validation

Benchmarking Models

Benchmarking compares models using:

Standard datasets
Consistent prompts
Defined metrics

A/B Testing

A/B testing compares:

Different prompts
Different models
Different workflows

Evaluation Metrics

Common metrics include:

Precision
Recall
Accuracy
Relevance
Groundedness
Toxicity scores
Latency
User satisfaction

Precision and Recall

Precision

Measures how many retrieved results are relevant.

Recall

Measures how many relevant results were successfully retrieved.

Latency Evaluation

Organizations should measure:

Response times
Retrieval delays
Tool execution times

Cost Evaluation

Cost evaluation considers:

Token usage
API calls
Infrastructure consumption

User Satisfaction Evaluation

Organizations may measure:

User feedback
Completion success
Satisfaction ratings

Continuous Evaluation

AI systems should be evaluated continuously because:

User behavior changes
Data evolves
Model drift may occur

Model Drift

Model drift occurs when:

Performance changes over time
Inputs evolve
User expectations shift

Monitoring Production Systems

Organizations should monitor:

Safety violations
Hallucination rates
Retrieval failures
Latency spikes
Cost increases

Evaluation Pipelines

Evaluation pipelines automate:

Testing
Scoring
Reporting
Regression analysis

Regression Testing

Regression testing ensures updates do not:

Reduce quality
Break workflows
Increase hallucinations

Azure AI Foundry Evaluation Capabilities

Azure AI Foundry supports:

Evaluation workflows
Automated evaluators
Safety monitoring
Groundedness evaluation
Prompt testing
Trace analysis

Trace Analysis

Trace analysis helps inspect:

Tool calls
Retrieval steps
Agent decisions
Workflow execution

Evaluation Datasets

Organizations should create datasets containing:

Expected outputs
Edge cases
Adversarial prompts
Real-world scenarios

Synthetic Test Data

Synthetic data may help test:

Rare scenarios
Adversarial prompts
Safety boundaries

Real-World Evaluation Scenarios

Scenario 1: Enterprise Chatbot

Requirements:

Accurate responses
Citation support
Low hallucination rate

Recommended Evaluation:

Groundedness testing
Retrieval quality evaluation

Scenario 2: Financial Assistant

Requirements:

High accuracy
Safety compliance
Low fabrication risk

Recommended Evaluation:

Human review
Adversarial testing
Approval workflows

Scenario 3: Customer Support Copilot

Requirements:

Relevant responses
Fast response times
Consistent tone

Recommended Evaluation:

Latency evaluation
Quality scoring
A/B testing

Scenario 4: Agentic Workflow System

Requirements:

Tool accuracy
Safe tool execution
Workflow traceability

Recommended Evaluation:

Trace analysis
Tool execution monitoring
HITL evaluation

Common AI-103 Exam Tips

Understand Evaluation Categories

Know the differences between:

Relevance
Quality
Groundedness
Safety
Consistency

Learn Fabrication Detection Concepts

Understand:

Hallucinations
Unsupported claims
Grounding validation

Understand Safety Testing

Know:

Toxicity testing
Jailbreak testing
Prompt injection evaluation
Adversarial testing

Learn Monitoring Concepts

Understand:

Continuous evaluation
Drift detection
Trace analysis
Regression testing

Summary

Evaluating generative AI systems is critical for building:

Reliable
Safe
Grounded
Trustworthy applications

For the AI-103 exam, you should understand:

Fabrication detection
Groundedness evaluation
Retrieval quality
Relevance testing
Quality evaluation
Safety evaluation
Toxicity detection
Adversarial testing
Human evaluation
Automated evaluators
Monitoring and drift detection
Evaluation pipelines

These concepts are foundational for developing enterprise-grade AI applications and agentic systems on Azure.

Practice Exam Questions

Question 1

What is a fabrication in generative AI?

A. A storage replication process
B. An unsupported or invented response
C. A vector indexing method
D. A deployment strategy

Answer

B. An unsupported or invented response

Explanation

Fabrications, also called hallucinations, are incorrect or invented outputs.

Question 2

What does groundedness measure?

A. GPU performance
B. Whether outputs are supported by trusted sources
C. Network bandwidth
D. Token compression efficiency

Answer

B. Whether outputs are supported by trusted sources

Explanation

Groundedness evaluates factual support from retrieved or trusted data.

Question 3

Which evaluation type focuses on harmful or unsafe outputs?

A. Latency evaluation
B. Safety evaluation
C. Compression evaluation
D. Replication evaluation

Answer

B. Safety evaluation

Explanation

Safety evaluations detect harmful, toxic, or policy-violating outputs.

Question 4

What is the purpose of retrieval quality evaluation in RAG systems?

A. Measure GPU speed
B. Assess search relevance and retrieved context quality
C. Reduce storage redundancy
D. Disable embeddings

Answer

B. Assess search relevance and retrieved context quality

Explanation

Retrieval quality measures how useful and relevant retrieved information is.

Question 5

What is jailbreak testing?

A. Testing storage failures
B. Evaluating attempts to bypass safety controls
C. Measuring retrieval latency
D. Compressing prompts

Answer

B. Evaluating attempts to bypass safety controls

Explanation

Jailbreak testing checks whether users can circumvent AI safety mechanisms.

Question 6

Which metric measures whether responses answer the user’s question appropriately?

A. Relevance
B. Replication
C. Throughput
D. Compression

Answer

A. Relevance

Explanation

Relevance evaluates how well outputs match user intent.

Question 7

Why is continuous evaluation important?

A. To eliminate all infrastructure costs
B. Because models and data can change over time
C. To remove all safety policies
D. To disable monitoring

Answer

B. Because models and data can change over time

Explanation

Continuous evaluation helps detect drift and performance degradation.

Question 8

What is adversarial testing?

A. Testing network redundancy
B. Challenging AI systems with malicious or difficult prompts
C. Increasing vector dimensions
D. Optimizing GPU allocation

Answer

B. Challenging AI systems with malicious or difficult prompts

Explanation

Adversarial testing identifies vulnerabilities and unsafe behaviors.

Question 9

What is a benefit of A/B testing in AI systems?

A. Eliminates monitoring requirements
B. Compares prompts or models to identify better performance
C. Removes the need for evaluation datasets
D. Disables retrieval pipelines