This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub. 
This topic falls under these sections:
Implement text analysis solutions (10–15%)
   --> Implement speech solutions
      --> Implement workflows to convert speech to text and text to speech for agentic interactions

Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Modern AI agents increasingly communicate through voice. Organizations use speech-enabled AI systems to:

Power virtual assistants
Support customer service automation
Enable hands-free interactions
Provide accessibility features
Create multilingual conversational experiences
Enable real-time voice AI agents

For the AI-103 certification exam, you should understand how to implement:

Speech-to-text (STT)
Text-to-speech (TTS)
Real-time voice pipelines
Agentic conversational workflows
Speech orchestration in Azure AI Foundry
Responsible AI and speech safety controls

This topic falls under:

“Implement speech solutions”

What Are Speech Solutions?

Speech solutions allow AI systems to:

Understand spoken language
Generate spoken responses
Support voice-based interactions
Enable conversational AI experiences

Speech workflows are a major part of:

AI copilots
Voice assistants
AI contact centers
Accessibility systems

Core Speech Capabilities

Speech systems commonly include:

Speech-to-text (STT)
Text-to-speech (TTS)
Speaker recognition
Real-time transcription
Language detection
Voice translation

Azure AI Speech

Microsoft provides:
Azure AI Speech

to support:

Speech recognition
Voice synthesis
Real-time transcription
Custom voices
Multilingual speech workflows

Speech-to-Text (STT)

What Is Speech-to-Text?

Speech-to-text converts spoken audio into written text.

Example

Audio input:

"Schedule a meeting for tomorrow at 10 AM."

Transcribed output:

Schedule a meeting for tomorrow at 10 AM.

Common STT Use Cases

Organizations use STT for:

Call center transcription
Meeting transcription
Voice-enabled chatbots
Voice commands
Accessibility solutions

Real-Time Transcription

What Is Real-Time STT?

Real-time STT processes audio streams continuously as users speak.

Example Workflow

User speaks into microphone
Audio stream sent to speech service
Speech recognized incrementally
Transcript sent to AI agent
Agent generates response

Batch Transcription

Batch transcription processes prerecorded audio files.

Common examples:

Recorded meetings
Podcasts
Training videos
Customer support recordings

Text-to-Speech (TTS)

What Is Text-to-Speech?

TTS converts written text into synthesized speech.

Example

Input text:

Your appointment has been confirmed.

Generated output:

AI-generated spoken audio

Common TTS Use Cases

TTS is used for:

Voice assistants
Accessibility readers
AI agents
Automated announcements
Interactive voice response (IVR) systems

Neural Text-to-Speech

Modern TTS systems use neural networks to create:

Natural speech
Human-like intonation
Emotional tone
Improved pronunciation

SSML (Speech Synthesis Markup Language)

What Is SSML?

SSML controls synthesized speech characteristics.

It allows customization of:

Pitch
Speed
Pronunciation
Emphasis
Pauses

Example SSML

			
<speak>
    <prosody rate="slow">
        Welcome to Contoso support.
    </prosody>
</speak>

		

Voice AI Agents

What Are Voice Agents?

Voice agents combine:

Speech recognition
LLM reasoning
Text generation
Speech synthesis

to create conversational AI systems.

Agentic Voice Workflow

User speaks
Speech converted to text
AI agent interprets intent
Agent performs actions
Response generated
Response converted to speech
Spoken response returned

Azure AI Foundry

supports:

AI orchestration
Prompt flows
Speech-enabled workflows
Agentic pipelines

Azure OpenAI Service

supports:

Conversational AI
Agent reasoning
Prompt-based workflows
Voice-enabled copilots

Conversational Memory

Voice agents often maintain:

Conversation history
User context
Session state
Intent tracking

This improves:

Multi-turn conversations
Personalization
Context continuity

Interruptions and Turn-Taking

Advanced voice systems support:

Interruptions
Natural pauses
Multi-turn dialogue
Conversational turn-taking

Multilingual Speech Workflows

Speech systems may:

Detect spoken language
Translate conversations
Generate multilingual speech responses

Example Multilingual Pipeline

Detect spoken language
Convert speech to text
Translate text
Generate AI response
Convert translated response to speech

Voice Translation

Voice translation combines:

STT
Translation
TTS

to enable multilingual communication.

Speaker Recognition

What Is Speaker Recognition?

Speaker recognition identifies or verifies speakers.

Use cases:

Security
Authentication
Meeting analytics
Call center analysis

Custom Voices

Organizations may create branded AI voices.

Use cases:

Corporate assistants
Brand consistency
Accessibility applications

Responsible use policies are important for synthetic voice generation.

Responsible AI Considerations

Voice AI systems introduce risks including:

Impersonation
Deepfakes
Biased recognition
Privacy concerns
Unsafe responses

Speech Safety Controls

Organizations should:

Moderate generated content
Authenticate users
Log interactions
Apply access controls
Monitor misuse

Privacy Considerations

Speech systems may process:

Sensitive conversations
PII
Medical information
Financial data

Organizations should:

Encrypt audio
Restrict storage access
Apply retention policies
Use secure APIs

Latency in Voice Systems

Low latency is critical for natural conversations.

Sources of latency include:

Audio streaming
Speech recognition
LLM inference
TTS synthesis
Network delays

Reducing Voice Latency

Strategies include:

Streaming pipelines
Incremental transcription
Smaller response chunks
Optimized models
Edge processing

Monitoring and Observability

Production voice systems should monitor:

Recognition accuracy
Response latency
Audio quality
Failed transcriptions
Token usage
User interruptions
Safety violations

Hallucinations in Voice Agents

Voice agents may hallucinate:

Incorrect information
Unsupported claims
False actions

Grounding and retrieval help reduce hallucinations.

Retrieval-Augmented Generation (RAG)

Voice agents often use:

Vector search
Knowledge retrieval
Enterprise grounding

before generating spoken responses.

Real-World Example

A healthcare organization deploys a multilingual voice assistant.

Workflow:

Patient speaks naturally
Speech converted to text
AI retrieves patient policy information
AI generates response
Text converted to spoken audio
Interaction logged securely

This demonstrates:

STT
TTS
RAG
Multilingual speech
Responsible AI practices

Best Practices for Speech Workflows

Use Streaming Pipelines

Reduce conversational latency.

Ground Agent Responses

Reduce hallucinations using enterprise data.

Secure Audio Data

Protect sensitive speech information.

Monitor Recognition Accuracy

Track transcription quality continuously.

Use SSML Carefully

Improve speech quality and accessibility.

Implement Safety Controls

Prevent misuse and unsafe outputs.

Optimize for Low Latency

Voice interactions should feel natural and responsive.

Exam Tips for AI-103

For the AI-103 exam, remember these important concepts:

Speech-to-text converts spoken audio into text.
Text-to-speech converts text into synthesized speech.
Azure AI Speech provides speech AI capabilities.
SSML customizes synthesized voice behavior.
Voice agents combine STT, LLMs, and TTS.
Streaming pipelines reduce conversational latency.
Multilingual voice workflows may include translation.
Responsible AI is critical for voice systems.
Voice agents should be grounded to reduce hallucinations.
Azure AI Foundry supports orchestration of speech-enabled workflows.

Practice Exam Questions

Question 1

What is the purpose of speech-to-text (STT)?

A. Converting written text into audio
B. Translating images into captions
C. Converting spoken audio into written text
D. Compressing audio streams

Answer

C. Converting spoken audio into written text

Explanation

STT converts spoken language into machine-readable text.

Question 2

What is the purpose of text-to-speech (TTS)?

A. Converting text into synthesized speech
B. Detecting image objects
C. Encrypting audio files
D. Translating vector embeddings

Answer

A. Converting text into synthesized speech

Explanation

TTS generates spoken audio from written text.

Question 3

Which Azure service provides speech AI capabilities?

A. Azure VPN Gateway
B. Azure CDN
C. Azure Firewall
D. Azure AI Speech

Answer

D. Azure AI Speech

Explanation

Azure AI Speech supports speech recognition and speech synthesis workflows.

Question 4

What is SSML primarily used for?

A. Customizing synthesized speech behavior
B. Encrypting speech transcripts
C. Compressing audio files
D. Detecting unsafe prompts

Answer

A. Customizing synthesized speech behavior

Explanation

SSML controls pitch, rate, pauses, pronunciation, and emphasis.

Question 5

What is a major advantage of streaming speech pipelines?

A. Increased hallucination rates
B. Reduced conversational latency
C. Eliminated token usage
D. Reduced audio quality

Answer

B. Reduced conversational latency

Explanation

Streaming pipelines improve responsiveness for real-time voice interactions.

Question 6

What components are commonly combined in a voice AI agent?

A. VPN gateways and DNS zones
B. OCR, CDN, and firewall rules
C. Vector compression and SQL indexing
D. STT, LLM reasoning, and TTS

Answer

D. STT, LLM reasoning, and TTS

Explanation

Voice agents use speech recognition, AI reasoning, and synthesized responses.

Question 7

What is a common use case for batch transcription?

A. Processing prerecorded audio files
B. Generating vector embeddings
C. Translating images automatically
D. Detecting hallucinations

Answer

A. Processing prerecorded audio files

Explanation

Batch transcription processes stored audio recordings.

Question 8

Why is grounding important for voice agents?

A. It removes multilingual support
B. It increases network latency
C. It reduces hallucinations and unsupported responses
D. It disables speech recognition