This post is a part of the AI-103: Develop AI Apps and Agents on Azure Exam Prep Hub.
This topic falls under these sections:
Implement text analysis solutions (10–15%)
--> Implement speech solutions
--> Implement workflows to convert speech to text and text to speech for agentic interactions
Note that there are 10 practice questions (with answers and explanations) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.
Introduction
Modern AI agents increasingly communicate through voice. Organizations use speech-enabled AI systems to:
- Power virtual assistants
- Support customer service automation
- Enable hands-free interactions
- Provide accessibility features
- Create multilingual conversational experiences
- Enable real-time voice AI agents
For the AI-103 certification exam, you should understand how to implement:
- Speech-to-text (STT)
- Text-to-speech (TTS)
- Real-time voice pipelines
- Agentic conversational workflows
- Speech orchestration in Azure AI Foundry
- Responsible AI and speech safety controls
This topic falls under:
“Implement speech solutions”
What Are Speech Solutions?
Speech solutions allow AI systems to:
- Understand spoken language
- Generate spoken responses
- Support voice-based interactions
- Enable conversational AI experiences
Speech workflows are a major part of:
- AI copilots
- Voice assistants
- AI contact centers
- Accessibility systems
Core Speech Capabilities
Speech systems commonly include:
- Speech-to-text (STT)
- Text-to-speech (TTS)
- Speaker recognition
- Real-time transcription
- Language detection
- Voice translation
Azure AI Speech
Microsoft provides:
Azure AI Speech
to support:
- Speech recognition
- Voice synthesis
- Real-time transcription
- Custom voices
- Multilingual speech workflows
Speech-to-Text (STT)
What Is Speech-to-Text?
Speech-to-text converts spoken audio into written text.
Example
Audio input:
"Schedule a meeting for tomorrow at 10 AM."
Transcribed output:
Schedule a meeting for tomorrow at 10 AM.
Common STT Use Cases
Organizations use STT for:
- Call center transcription
- Meeting transcription
- Voice-enabled chatbots
- Voice commands
- Accessibility solutions
Real-Time Transcription
What Is Real-Time STT?
Real-time STT processes audio streams continuously as users speak.
Example Workflow
- User speaks into microphone
- Audio stream sent to speech service
- Speech recognized incrementally
- Transcript sent to AI agent
- Agent generates response
Batch Transcription
Batch transcription processes prerecorded audio files.
Common examples:
- Recorded meetings
- Podcasts
- Training videos
- Customer support recordings
Text-to-Speech (TTS)
What Is Text-to-Speech?
TTS converts written text into synthesized speech.
Example
Input text:
Your appointment has been confirmed.
Generated output:
- AI-generated spoken audio
Common TTS Use Cases
TTS is used for:
- Voice assistants
- Accessibility readers
- AI agents
- Automated announcements
- Interactive voice response (IVR) systems
Neural Text-to-Speech
Modern TTS systems use neural networks to create:
- Natural speech
- Human-like intonation
- Emotional tone
- Improved pronunciation
SSML (Speech Synthesis Markup Language)
What Is SSML?
SSML controls synthesized speech characteristics.
It allows customization of:
- Pitch
- Speed
- Pronunciation
- Emphasis
- Pauses
Example SSML
<speak> <prosody rate="slow"> Welcome to Contoso support. </prosody></speak>
Voice AI Agents
What Are Voice Agents?
Voice agents combine:
- Speech recognition
- LLM reasoning
- Text generation
- Speech synthesis
to create conversational AI systems.
Agentic Voice Workflow
- User speaks
- Speech converted to text
- AI agent interprets intent
- Agent performs actions
- Response generated
- Response converted to speech
- Spoken response returned
Azure AI Foundry
Azure AI Foundry
supports:
- AI orchestration
- Prompt flows
- Speech-enabled workflows
- Agentic pipelines
Azure OpenAI Service
Azure OpenAI Service
supports:
- Conversational AI
- Agent reasoning
- Prompt-based workflows
- Voice-enabled copilots
Conversational Memory
Voice agents often maintain:
- Conversation history
- User context
- Session state
- Intent tracking
This improves:
- Multi-turn conversations
- Personalization
- Context continuity
Interruptions and Turn-Taking
Advanced voice systems support:
- Interruptions
- Natural pauses
- Multi-turn dialogue
- Conversational turn-taking
Multilingual Speech Workflows
Speech systems may:
- Detect spoken language
- Translate conversations
- Generate multilingual speech responses
Example Multilingual Pipeline
- Detect spoken language
- Convert speech to text
- Translate text
- Generate AI response
- Convert translated response to speech
Voice Translation
Voice translation combines:
- STT
- Translation
- TTS
to enable multilingual communication.
Speaker Recognition
What Is Speaker Recognition?
Speaker recognition identifies or verifies speakers.
Use cases:
- Security
- Authentication
- Meeting analytics
- Call center analysis
Custom Voices
Organizations may create branded AI voices.
Use cases:
- Corporate assistants
- Brand consistency
- Accessibility applications
Responsible use policies are important for synthetic voice generation.
Responsible AI Considerations
Voice AI systems introduce risks including:
- Impersonation
- Deepfakes
- Biased recognition
- Privacy concerns
- Unsafe responses
Speech Safety Controls
Organizations should:
- Moderate generated content
- Authenticate users
- Log interactions
- Apply access controls
- Monitor misuse
Privacy Considerations
Speech systems may process:
- Sensitive conversations
- PII
- Medical information
- Financial data
Organizations should:
- Encrypt audio
- Restrict storage access
- Apply retention policies
- Use secure APIs
Latency in Voice Systems
Low latency is critical for natural conversations.
Sources of latency include:
- Audio streaming
- Speech recognition
- LLM inference
- TTS synthesis
- Network delays
Reducing Voice Latency
Strategies include:
- Streaming pipelines
- Incremental transcription
- Smaller response chunks
- Optimized models
- Edge processing
Monitoring and Observability
Production voice systems should monitor:
- Recognition accuracy
- Response latency
- Audio quality
- Failed transcriptions
- Token usage
- User interruptions
- Safety violations
Hallucinations in Voice Agents
Voice agents may hallucinate:
- Incorrect information
- Unsupported claims
- False actions
Grounding and retrieval help reduce hallucinations.
Retrieval-Augmented Generation (RAG)
Voice agents often use:
- Vector search
- Knowledge retrieval
- Enterprise grounding
before generating spoken responses.
Real-World Example
A healthcare organization deploys a multilingual voice assistant.
Workflow:
- Patient speaks naturally
- Speech converted to text
- AI retrieves patient policy information
- AI generates response
- Text converted to spoken audio
- Interaction logged securely
This demonstrates:
- STT
- TTS
- RAG
- Multilingual speech
- Responsible AI practices
Best Practices for Speech Workflows
Use Streaming Pipelines
Reduce conversational latency.
Ground Agent Responses
Reduce hallucinations using enterprise data.
Secure Audio Data
Protect sensitive speech information.
Monitor Recognition Accuracy
Track transcription quality continuously.
Use SSML Carefully
Improve speech quality and accessibility.
Implement Safety Controls
Prevent misuse and unsafe outputs.
Optimize for Low Latency
Voice interactions should feel natural and responsive.
Exam Tips for AI-103
For the AI-103 exam, remember these important concepts:
- Speech-to-text converts spoken audio into text.
- Text-to-speech converts text into synthesized speech.
- Azure AI Speech provides speech AI capabilities.
- SSML customizes synthesized voice behavior.
- Voice agents combine STT, LLMs, and TTS.
- Streaming pipelines reduce conversational latency.
- Multilingual voice workflows may include translation.
- Responsible AI is critical for voice systems.
- Voice agents should be grounded to reduce hallucinations.
- Azure AI Foundry supports orchestration of speech-enabled workflows.
Practice Exam Questions
Question 1
What is the purpose of speech-to-text (STT)?
A. Converting written text into audio
B. Translating images into captions
C. Converting spoken audio into written text
D. Compressing audio streams
Answer
C. Converting spoken audio into written text
Explanation
STT converts spoken language into machine-readable text.
Question 2
What is the purpose of text-to-speech (TTS)?
A. Converting text into synthesized speech
B. Detecting image objects
C. Encrypting audio files
D. Translating vector embeddings
Answer
A. Converting text into synthesized speech
Explanation
TTS generates spoken audio from written text.
Question 3
Which Azure service provides speech AI capabilities?
A. Azure VPN Gateway
B. Azure CDN
C. Azure Firewall
D. Azure AI Speech
Answer
D. Azure AI Speech
Explanation
Azure AI Speech supports speech recognition and speech synthesis workflows.
Question 4
What is SSML primarily used for?
A. Customizing synthesized speech behavior
B. Encrypting speech transcripts
C. Compressing audio files
D. Detecting unsafe prompts
Answer
A. Customizing synthesized speech behavior
Explanation
SSML controls pitch, rate, pauses, pronunciation, and emphasis.
Question 5
What is a major advantage of streaming speech pipelines?
A. Increased hallucination rates
B. Reduced conversational latency
C. Eliminated token usage
D. Reduced audio quality
Answer
B. Reduced conversational latency
Explanation
Streaming pipelines improve responsiveness for real-time voice interactions.
Question 6
What components are commonly combined in a voice AI agent?
A. VPN gateways and DNS zones
B. OCR, CDN, and firewall rules
C. Vector compression and SQL indexing
D. STT, LLM reasoning, and TTS
Answer
D. STT, LLM reasoning, and TTS
Explanation
Voice agents use speech recognition, AI reasoning, and synthesized responses.
Question 7
What is a common use case for batch transcription?
A. Processing prerecorded audio files
B. Generating vector embeddings
C. Translating images automatically
D. Detecting hallucinations
Answer
A. Processing prerecorded audio files
Explanation
Batch transcription processes stored audio recordings.
Question 8
Why is grounding important for voice agents?
A. It removes multilingual support
B. It increases network latency
C. It reduces hallucinations and unsupported responses
D. It disables speech recognition
Answer
C. It reduces hallucinations and unsupported responses
Explanation
Grounding improves reliability using trusted enterprise data.
Question 9
What is a responsible AI concern related to speech systems?
A. Faster vector indexing
B. Deepfake or voice impersonation misuse
C. Reduced OCR quality
D. Excessive semantic search accuracy
Answer
B. Deepfake or voice impersonation misuse
Explanation
Synthetic voice systems may be abused for impersonation or fraud.
Question 10
Which platform supports orchestration of speech-enabled AI workflows?
A. Azure AI Foundry
B. Azure ExpressRoute
C. Azure DNS
D. Azure Load Balancer
Answer
A. Azure AI Foundry
Explanation
Azure AI Foundry supports orchestration and workflow automation for AI solutions.
Go to the AI-103 Exam Prep Hub main page
