Evaluate models and apps, including detecting fabrications, relevance, quality, and safety (AI-103 Exam Prep)

This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Implement generative AI and agentic solutions (30–35%)
--> Build generative applications by using Foundry
--> Evaluate models and apps, including detecting fabrications, relevance, quality, and safety


Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Building generative AI applications is only part of the development process.

Organizations must also evaluate whether AI systems are:

  • Accurate
  • Reliable
  • Relevant
  • Safe
  • Grounded
  • Trustworthy

AI systems can generate:

  • Hallucinations
  • Unsafe content
  • Biased responses
  • Irrelevant answers
  • Inconsistent outputs

The AI-103: Develop AI Apps and Agents on Azure certification exam tests your understanding of evaluating models and applications.

For the AI-103 exam, you should understand:

  • Model evaluation
  • Application evaluation
  • Fabrication detection
  • Groundedness
  • Relevance evaluation
  • Quality evaluation
  • Safety evaluation
  • Responsible AI testing
  • Automated evaluators
  • Human evaluation
  • Benchmarking
  • Monitoring and continuous evaluation

Why AI Evaluation Matters

Evaluation is essential because generative AI systems are probabilistic.

This means:

  • Responses may vary
  • Outputs may be incorrect
  • Safety risks may occur
  • Hallucinations may appear

Without evaluation, organizations cannot reliably trust AI systems.


What Is AI Evaluation?

AI evaluation is the process of measuring:

  • Accuracy
  • Safety
  • Reliability
  • Relevance
  • Groundedness
  • User satisfaction

Types of AI Evaluation

Common evaluation categories include:

  • Model evaluation
  • Prompt evaluation
  • Retrieval evaluation
  • Application evaluation
  • Safety evaluation
  • Human evaluation

Model Evaluation

Model evaluation focuses on:

  • Model quality
  • Accuracy
  • Performance
  • Reasoning ability

Application Evaluation

Application evaluation measures:

  • End-to-end user experience
  • Workflow success
  • Tool orchestration quality
  • Groundedness

What Are Fabrications?

Fabrications are generated outputs that:

  • Are incorrect
  • Are unsupported
  • Contain invented facts
  • Misrepresent information

Fabrications are commonly called hallucinations.


Causes of Fabrications

Fabrications may occur because:

  • The model lacks relevant knowledge
  • Prompts are ambiguous
  • Retrieval quality is poor
  • Context is insufficient
  • Safety constraints are weak

Fabrication Detection

Organizations should evaluate whether outputs:

  • Match trusted sources
  • Remain grounded
  • Avoid unsupported claims

Groundedness Evaluation

Groundedness measures whether responses are supported by:

  • Retrieved documents
  • Enterprise data
  • Trusted sources

Importance of Groundedness

Grounded responses:

  • Improve trust
  • Reduce hallucinations
  • Increase explainability

Retrieval Quality Evaluation

RAG systems should evaluate:

  • Search relevance
  • Retrieved chunk quality
  • Citation accuracy
  • Context completeness

Relevance Evaluation

Relevance measures whether responses:

  • Answer the user’s question
  • Stay on-topic
  • Match user intent

Quality Evaluation

Quality evaluations may assess:

  • Clarity
  • Completeness
  • Coherence
  • Fluency
  • Professionalism

Consistency Evaluation

Consistency measures whether models:

  • Produce stable responses
  • Avoid contradictory outputs
  • Maintain predictable behavior

Safety Evaluation

Safety evaluations identify:

  • Harmful outputs
  • Toxic content
  • Unsafe instructions
  • Policy violations

Responsible AI Evaluation

Responsible AI testing focuses on:

  • Fairness
  • Safety
  • Transparency
  • Accountability
  • Privacy

Bias Evaluation

Organizations should evaluate whether models:

  • Produce biased outputs
  • Treat groups unfairly
  • Reinforce stereotypes

Toxicity Detection

Toxicity evaluations identify:

  • Offensive language
  • Hate speech
  • Harassment
  • Abusive content

Jailbreak Testing

Jailbreak testing evaluates whether users can bypass:

  • Safety controls
  • Content filters
  • Guardrails

Adversarial Testing

Adversarial testing intentionally challenges models using:

  • Malicious prompts
  • Edge cases
  • Prompt injection attacks

Prompt Injection Testing

Prompt injection testing evaluates whether:

  • External content manipulates model behavior
  • Instructions override safety policies

Automated Evaluators

Automated evaluators use:

  • Rules
  • Scoring systems
  • AI-based evaluators

To assess model outputs.


AI-Assisted Evaluation

Some systems use LLMs to evaluate:

  • Relevance
  • Groundedness
  • Quality
  • Safety

Human Evaluation

Human reviewers may evaluate:

  • Accuracy
  • Tone
  • Helpfulness
  • Safety
  • Business alignment

Human-in-the-Loop Evaluation

Human-in-the-loop evaluation combines:

  • Automated evaluation
  • Human oversight
  • Expert validation

Benchmarking Models

Benchmarking compares models using:

  • Standard datasets
  • Consistent prompts
  • Defined metrics

A/B Testing

A/B testing compares:

  • Different prompts
  • Different models
  • Different workflows

Evaluation Metrics

Common metrics include:

  • Precision
  • Recall
  • Accuracy
  • Relevance
  • Groundedness
  • Toxicity scores
  • Latency
  • User satisfaction

Precision and Recall

Precision

Measures how many retrieved results are relevant.

Recall

Measures how many relevant results were successfully retrieved.


Latency Evaluation

Organizations should measure:

  • Response times
  • Retrieval delays
  • Tool execution times

Cost Evaluation

Cost evaluation considers:

  • Token usage
  • API calls
  • Infrastructure consumption

User Satisfaction Evaluation

Organizations may measure:

  • User feedback
  • Completion success
  • Satisfaction ratings

Continuous Evaluation

AI systems should be evaluated continuously because:

  • User behavior changes
  • Data evolves
  • Model drift may occur

Model Drift

Model drift occurs when:

  • Performance changes over time
  • Inputs evolve
  • User expectations shift

Monitoring Production Systems

Organizations should monitor:

  • Safety violations
  • Hallucination rates
  • Retrieval failures
  • Latency spikes
  • Cost increases

Evaluation Pipelines

Evaluation pipelines automate:

  • Testing
  • Scoring
  • Reporting
  • Regression analysis

Regression Testing

Regression testing ensures updates do not:

  • Reduce quality
  • Break workflows
  • Increase hallucinations

Azure AI Foundry Evaluation Capabilities

Azure AI Foundry supports:

  • Evaluation workflows
  • Automated evaluators
  • Safety monitoring
  • Groundedness evaluation
  • Prompt testing
  • Trace analysis

Trace Analysis

Trace analysis helps inspect:

  • Tool calls
  • Retrieval steps
  • Agent decisions
  • Workflow execution

Evaluation Datasets

Organizations should create datasets containing:

  • Expected outputs
  • Edge cases
  • Adversarial prompts
  • Real-world scenarios

Synthetic Test Data

Synthetic data may help test:

  • Rare scenarios
  • Adversarial prompts
  • Safety boundaries

Real-World Evaluation Scenarios

Scenario 1: Enterprise Chatbot

Requirements:

  • Accurate responses
  • Citation support
  • Low hallucination rate

Recommended Evaluation:

  • Groundedness testing
  • Retrieval quality evaluation

Scenario 2: Financial Assistant

Requirements:

  • High accuracy
  • Safety compliance
  • Low fabrication risk

Recommended Evaluation:

  • Human review
  • Adversarial testing
  • Approval workflows

Scenario 3: Customer Support Copilot

Requirements:

  • Relevant responses
  • Fast response times
  • Consistent tone

Recommended Evaluation:

  • Latency evaluation
  • Quality scoring
  • A/B testing

Scenario 4: Agentic Workflow System

Requirements:

  • Tool accuracy
  • Safe tool execution
  • Workflow traceability

Recommended Evaluation:

  • Trace analysis
  • Tool execution monitoring
  • HITL evaluation

Common AI-103 Exam Tips

Understand Evaluation Categories

Know the differences between:

  • Relevance
  • Quality
  • Groundedness
  • Safety
  • Consistency

Learn Fabrication Detection Concepts

Understand:

  • Hallucinations
  • Unsupported claims
  • Grounding validation

Understand Safety Testing

Know:

  • Toxicity testing
  • Jailbreak testing
  • Prompt injection evaluation
  • Adversarial testing

Learn Monitoring Concepts

Understand:

  • Continuous evaluation
  • Drift detection
  • Trace analysis
  • Regression testing

Summary

Evaluating generative AI systems is critical for building:

  • Reliable
  • Safe
  • Grounded
  • Trustworthy applications

For the AI-103 exam, you should understand:

  • Fabrication detection
  • Groundedness evaluation
  • Retrieval quality
  • Relevance testing
  • Quality evaluation
  • Safety evaluation
  • Toxicity detection
  • Adversarial testing
  • Human evaluation
  • Automated evaluators
  • Monitoring and drift detection
  • Evaluation pipelines

These concepts are foundational for developing enterprise-grade AI applications and agentic systems on Azure.


Practice Exam Questions

Question 1

What is a fabrication in generative AI?

A. A storage replication process
B. An unsupported or invented response
C. A vector indexing method
D. A deployment strategy

Answer

B. An unsupported or invented response

Explanation

Fabrications, also called hallucinations, are incorrect or invented outputs.


Question 2

What does groundedness measure?

A. GPU performance
B. Whether outputs are supported by trusted sources
C. Network bandwidth
D. Token compression efficiency

Answer

B. Whether outputs are supported by trusted sources

Explanation

Groundedness evaluates factual support from retrieved or trusted data.


Question 3

Which evaluation type focuses on harmful or unsafe outputs?

A. Latency evaluation
B. Safety evaluation
C. Compression evaluation
D. Replication evaluation

Answer

B. Safety evaluation

Explanation

Safety evaluations detect harmful, toxic, or policy-violating outputs.


Question 4

What is the purpose of retrieval quality evaluation in RAG systems?

A. Measure GPU speed
B. Assess search relevance and retrieved context quality
C. Reduce storage redundancy
D. Disable embeddings

Answer

B. Assess search relevance and retrieved context quality

Explanation

Retrieval quality measures how useful and relevant retrieved information is.


Question 5

What is jailbreak testing?

A. Testing storage failures
B. Evaluating attempts to bypass safety controls
C. Measuring retrieval latency
D. Compressing prompts

Answer

B. Evaluating attempts to bypass safety controls

Explanation

Jailbreak testing checks whether users can circumvent AI safety mechanisms.


Question 6

Which metric measures whether responses answer the user’s question appropriately?

A. Relevance
B. Replication
C. Throughput
D. Compression

Answer

A. Relevance

Explanation

Relevance evaluates how well outputs match user intent.


Question 7

Why is continuous evaluation important?

A. To eliminate all infrastructure costs
B. Because models and data can change over time
C. To remove all safety policies
D. To disable monitoring

Answer

B. Because models and data can change over time

Explanation

Continuous evaluation helps detect drift and performance degradation.


Question 8

What is adversarial testing?

A. Testing network redundancy
B. Challenging AI systems with malicious or difficult prompts
C. Increasing vector dimensions
D. Optimizing GPU allocation

Answer

B. Challenging AI systems with malicious or difficult prompts

Explanation

Adversarial testing identifies vulnerabilities and unsafe behaviors.


Question 9

What is a benefit of A/B testing in AI systems?

A. Eliminates monitoring requirements
B. Compares prompts or models to identify better performance
C. Removes the need for evaluation datasets
D. Disables retrieval pipelines

Answer

B. Compares prompts or models to identify better performance

Explanation

A/B testing helps optimize prompts, workflows, and models.


Question 10

Which Azure capability helps inspect workflow execution and tool calls?

A. Trace analysis
B. DNS failover
C. Storage mirroring
D. GPU partitioning

Answer

A. Trace analysis

Explanation

Trace analysis provides visibility into workflow execution and reasoning steps.


Go to the AI-103 Exam Prep Hub main page

Leave a comment