This post is a part of the AI-901: Microsoft Azure AI Fundamentals Exam Prep Hub.
This topic falls under these sections:
Identify AI concepts and capabilities (40–45%)
--> Identify AI workloads
--> Identify features and capabilities of Speech Recognition and Speech Synthesis
Each section includes 10 practice questions (with answers and explanations) to help you solidify your knowledge of the material. The hub also offers 2 practice tests with 60 questions each, listed below the exam topics section.
Speech AI is one of the major AI workloads covered in the AI-901 certification exam. Microsoft expects candidates to understand how AI systems process spoken language using technologies such as speech recognition and speech synthesis.
These capabilities allow computers to listen to, understand, and generate human speech, enabling more natural human-computer interaction.
What Is Speech AI?
Speech AI refers to AI technologies that process spoken language and audio.
Speech AI enables systems to:
- Recognize spoken words
- Convert speech into text
- Generate spoken responses
- Translate spoken language
- Identify speakers
Speech technologies are commonly used in modern AI assistants and accessibility tools.
Major Speech AI Capabilities
For the AI-901 exam, important speech AI capabilities include:
- Speech recognition
- Speech synthesis
- Speech translation
- Speaker recognition
The primary focus of this topic is speech recognition and speech synthesis.
What Is Speech Recognition?
Speech recognition converts spoken language into written text.
It is often called:
- Speech-to-text
- Automatic Speech Recognition (ASR)
Speech recognition allows computers to “listen” to human speech and interpret it as text.
How Speech Recognition Works
Speech recognition systems typically perform these steps:
- Capture audio input
- Analyze sound patterns
- Identify spoken words
- Convert speech into text output
Modern speech recognition systems often use:
- Machine learning
- Deep learning
- Neural networks
- Large speech datasets
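To make the "analyze sound patterns" step a little more concrete, here is a minimal Python sketch of one very simple kind of analysis: measuring how loud each short slice (frame) of audio is, to guess where speech occurs. The synthetic tone stands in for captured microphone audio, and the function names, frame size, and threshold are all assumptions chosen for illustration. Real speech recognition relies on trained models rather than a loudness threshold.

```python
import math

SAMPLE_RATE = 16_000   # samples per second (assumed; a common rate for speech audio)
FRAME_SIZE = 400       # 25 ms of audio per frame at 16 kHz


def synth_tone(millis, amplitude, freq=220.0):
    """Generate a sine wave as a stand-in for captured microphone audio."""
    n = SAMPLE_RATE * millis // 1000
    return [amplitude * math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)]


def frame_energies(samples, frame_size=FRAME_SIZE):
    """Split audio into fixed-size frames and return the RMS energy of each."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_size))
    return energies


def speech_frames(samples, threshold=0.1):
    """Flag each frame as True when its energy suggests someone is speaking."""
    return [energy > threshold for energy in frame_energies(samples)]


# Quiet background, then a louder "voiced" segment, then quiet again.
audio = synth_tone(100, 0.01) + synth_tone(200, 0.5) + synth_tone(100, 0.01)
flags = speech_frames(audio)
print(flags)
```

Only the frames in the louder middle segment are flagged. A production system would pass the audio to a trained acoustic and language model to identify the actual words.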
Speech Recognition Example
Spoken Input
“Schedule a meeting for tomorrow at 2 PM.”
Text Output
Schedule a meeting for tomorrow at 2 PM.
Common Features of Speech Recognition
Speech recognition systems may support:
- Real-time transcription
- Multiple languages
- Noise reduction
- Speaker identification
- Continuous speech recognition
- Command recognition
Real-Time Transcription
Real-time transcription converts speech into text immediately as someone speaks.
Common Use Cases
- Live captions
- Meeting transcription
- Accessibility tools
Example
A video conferencing platform generates live subtitles during meetings.
Continuous Speech Recognition
Continuous speech recognition processes natural conversation without requiring pauses between words.
Example
A voice assistant processes a full spoken sentence without the user pausing between words.
Command Recognition
Some speech systems focus on recognizing specific spoken commands.
Example Commands
- “Play music”
- “Turn on the lights”
- “Call John”
These systems are commonly used in smart devices.
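Once speech has been converted to text, command recognition can be as simple as matching the transcript against a list of known phrases. The following Python sketch assumes the transcript is already available as a string. The command table and action names are hypothetical.

```python
import string

# Hypothetical command table: spoken phrase -> action name.
COMMANDS = {
    "play music": "media.play",
    "turn on the lights": "lights.on",
    "call john": "phone.call",
}


def normalize(transcript):
    """Lowercase the transcript and drop punctuation so matching is forgiving."""
    cleaned = transcript.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def match_command(transcript):
    """Return the action for the first known phrase found in the transcript, or None."""
    text = normalize(transcript)
    for phrase, action in COMMANDS.items():
        if phrase in text:
            return action
    return None


print(match_command("Please turn on the lights."))
print(match_command("What's the weather?"))
```

Plain substring matching is a deliberate simplification. Real smart devices typically use trained models that tolerate rephrasing and misrecognized words.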
Noise Reduction
Speech recognition systems often include noise filtering capabilities.
This helps improve accuracy in noisy environments.
Example
Recognizing speech in a crowded airport.
Multilingual Speech Recognition
Many modern speech systems support multiple languages and accents.
Example
An AI assistant understanding English, Spanish, and French speakers.
Common Use Cases for Speech Recognition
Virtual Assistants
Voice-controlled assistants use speech recognition to understand spoken questions and requests before answering or performing an action.
Example
A user asks a smart speaker about the weather.
Accessibility Solutions
Speech recognition helps users who cannot type easily.
Example
Voice dictation software for users with disabilities.
Meeting Transcription
Organizations convert meetings into searchable text records.
Example
Automatic meeting notes.
Customer Service Systems
Interactive voice response (IVR) systems process spoken customer requests.
Example
A phone system asks customers to describe their issue verbally.
Hands-Free Applications
Speech recognition supports hands-free operation.
Example
Voice-controlled navigation while driving.
What Is Speech Synthesis?
Speech synthesis converts written text into spoken audio.
It is often called text-to-speech (TTS).
Speech synthesis allows computers to “speak” naturally to users.
How Speech Synthesis Works
Speech synthesis systems:
- Receive text input
- Analyze words and pronunciation
- Generate spoken audio output
Modern systems use AI-generated voices that sound increasingly human-like.
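The "analyze words and pronunciation" step usually includes rewriting numbers and abbreviations into words that can be spoken. Here is a rough Python sketch of that idea. The lookup tables are tiny and purely illustrative, and the function name is an assumption; real text-to-speech systems handle far more cases.

```python
import re

# Illustrative lookup tables only; real systems cover dates, currency, and much more.
NUMBER_WORDS = [
    "zero", "one", "two", "three", "four", "five", "six",
    "seven", "eight", "nine", "ten", "eleven", "twelve",
]
SPOKEN_FORMS = {"AM": "A M", "PM": "P M"}  # spell the letters out


def normalize_for_speech(text):
    """Expand small whole numbers and AM/PM into speakable words."""
    def speak(match):
        token = match.group(0)
        if token.isdigit():
            value = int(token)
            # Numbers outside the table are left unchanged in this sketch.
            return NUMBER_WORDS[value] if value < len(NUMBER_WORDS) else token
        return SPOKEN_FORMS[token]

    return re.sub(r"\b(?:\d+|AM|PM)\b", speak, text)


print(normalize_for_speech("Your appointment is scheduled for Monday at 10 AM."))
```

After this step, the system would map the words to sounds and generate the audio waveform.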
Speech Synthesis Example
Text Input
“Your appointment is scheduled for Monday at 10 AM.”
Spoken Output
The system reads the message aloud.
Features of Speech Synthesis
Speech synthesis systems may support:
- Natural-sounding voices
- Multiple languages
- Adjustable speaking speed
- Voice customization
- Emotional tone control
Natural Neural Voices
Modern AI systems use neural text-to-speech technology to create more human-like speech.
Benefits include:
- Improved pronunciation
- Better intonation
- More natural rhythm
Voice Customization
Some systems allow organizations to customize voices.
Example
A company creates a branded AI voice for customer support systems.
Adjustable Speech Settings
Speech synthesis systems may allow changes to:
- Speed
- Pitch
- Volume
- Pronunciation
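Settings like these are commonly expressed with Speech Synthesis Markup Language (SSML), a W3C standard that many text-to-speech services accept. The Python sketch below builds a small SSML document using the standard `prosody` element. The helper name `build_ssml` and the default voice name are assumptions for illustration, so check which voices and values your service supports.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice="en-US-JennyNeural", rate="medium", pitch="medium", volume="medium"):
    """Wrap text in SSML that requests a voice plus speed, pitch, and volume settings."""
    speak = ET.Element("speak", {"version": "1.0", "xmlns": SSML_NS, "xml:lang": "en-US"})
    voice_el = ET.SubElement(speak, "voice", {"name": voice})  # voice name is an assumption
    prosody = ET.SubElement(
        voice_el, "prosody", {"rate": rate, "pitch": pitch, "volume": volume}
    )
    prosody.text = text  # ElementTree escapes special characters such as & and <
    return ET.tostring(speak, encoding="unicode")


demo_ssml = build_ssml(
    "Your appointment is scheduled for Monday at 10 AM.", rate="slow", volume="loud"
)
print(demo_ssml)
```

SSML also defines elements for pronunciation control, such as `phoneme` and `sub`, which are left out here to keep the example short.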
Common Use Cases for Speech Synthesis
Accessibility Tools
Text-to-speech helps visually impaired users consume written content.
Example
Screen readers reading web pages aloud.
Navigation Systems
GPS applications provide spoken directions.
Example
A navigation app announcing upcoming turns.
Virtual Assistants
AI assistants respond using synthesized speech.
Example
A smart assistant answers spoken questions aloud.
Customer Service Bots
Automated phone systems communicate using AI-generated voices.
Example
A banking system reads account information to customers.
Audiobooks and Learning
Speech synthesis converts written content into audio.
Example
Educational content read aloud automatically.
Speech Translation
Speech translation typically combines:
- Speech recognition
- Language translation
- Speech synthesis (when the translated output is spoken rather than shown as text)
Example
A conference tool translates spoken English into spoken Spanish.
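The way these capabilities chain together can be sketched in a few lines of Python. Everything here is a toy stand-in: the `Audio` type, the phrasebook, and the three functions are hypothetical placeholders for real speech and translation models, and only the order of the steps is the point.

```python
from collections import namedtuple

# Pretend audio clip: we store the words it "contains" and its language code.
Audio = namedtuple("Audio", ["spoken_text", "language"])

# Tiny English -> Spanish phrasebook standing in for a translation model.
PHRASEBOOK = {"good morning": "buenos días", "thank you": "gracias"}


def recognize(audio):
    """Speech recognition: audio in, text out."""
    return audio.spoken_text.lower()


def translate(text):
    """Language translation: source text in, target text out (unknown text passes through)."""
    return PHRASEBOOK.get(text, text)


def synthesize(text, language):
    """Speech synthesis: text in, audio out."""
    return Audio(spoken_text=text, language=language)


def translate_speech(audio, target_language="es"):
    """Chain the three steps: recognize, translate, then synthesize."""
    text = recognize(audio)
    translated = translate(text)
    return synthesize(translated, target_language)


result = translate_speech(Audio("Good morning", "en"))
print(result)
```

If the translated result only needs to be displayed, for example as subtitles, the final synthesis step can be skipped.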
Speaker Recognition
Speaker recognition identifies or verifies individuals based on voice characteristics.
Types
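- Speaker identification: determines who is speaking from a group of enrolled voices
- Speaker verification: confirms that a speaker is who they claim to be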
Example
Voice-based authentication systems.
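The difference between the two types can be illustrated with a small Python sketch. It assumes each voice has already been reduced to a short list of numbers (a "voiceprint"). The enrolled values, names, and threshold below are made up for illustration, and real systems derive much larger vectors from trained models.

```python
import math

# Hypothetical enrolled voiceprints: made-up feature vectors per person.
ENROLLED = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
}


def cosine_similarity(a, b):
    """Similarity of two vectors: values near 1.0 mean they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def identify(sample):
    """Identification: which enrolled speaker sounds most like this sample?"""
    return max(ENROLLED, key=lambda name: cosine_similarity(sample, ENROLLED[name]))


def verify(sample, claimed_name, threshold=0.95):
    """Verification: does the sample match the one person the speaker claims to be?"""
    return cosine_similarity(sample, ENROLLED[claimed_name]) >= threshold


sample = [0.85, 0.15, 0.3]
print(identify(sample))
print(verify(sample, "alice"), verify(sample, "bob"))
```

Identification compares the sample against everyone enrolled, while verification checks it against a single claimed identity, which is the pattern used in voice-based authentication.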
Challenges in Speech AI
Speech AI systems may face challenges such as:
- Background noise
- Strong accents
- Multiple simultaneous speakers
- Poor audio quality
- Specialized vocabulary
Responsible AI Considerations
Speech AI systems should be designed responsibly.
Important considerations include:
- Privacy
- Consent
- Security
- Accessibility
- Bias reduction
- Transparency
Voice data may contain sensitive personal information.
Azure AI Speech Services
Azure AI Speech Services provide cloud-based speech AI capabilities including:
- Speech-to-text
- Text-to-speech
- Speech translation
- Speaker recognition
These services help developers integrate speech AI into applications without building models from scratch.
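As a rough illustration of what "integrating" can look like, the Python sketch below prepares (but does not send) a text-to-speech request for the service's REST interface. The endpoint pattern and header names reflect my understanding of the public REST documentation and should be verified against the current docs. The region, key, and helper name `build_tts_request` are placeholders.

```python
import urllib.request


def build_tts_request(ssml, region, key, audio_format="riff-24khz-16bit-mono-pcm"):
    """Prepare a text-to-speech REST request; the caller decides when to send it."""
    # Endpoint pattern and headers are assumptions based on the REST docs; verify before use.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,         # resource key (placeholder value here)
        "Content-Type": "application/ssml+xml",   # the request body is SSML
        "X-Microsoft-OutputFormat": audio_format, # requested audio format
        "User-Agent": "speech-demo",
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


request = build_tts_request("<speak/>", region="westus", key="YOUR_KEY")
print(request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` would return audio bytes if the key, region, and SSML body are valid. Microsoft also provides a Speech SDK for several languages that wraps these calls.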
Speech Recognition vs. Speech Synthesis
| Capability | Description |
|---|---|
| Speech Recognition | Converts speech into text |
| Speech Synthesis | Converts text into spoken audio |
Real-World Examples
Scenario 1: Live Meeting Captions
Goal
Convert spoken conversations into text.
Capability Used
Speech recognition
Scenario 2: GPS Navigation App
Goal
Read directions aloud.
Capability Used
Speech synthesis
Scenario 3: Voice-Controlled Smart Home
Goal
Understand spoken commands and respond verbally.
Capabilities Used
- Speech recognition
- Speech synthesis
Scenario 4: Audiobook Generator
Goal
Convert books into spoken audio.
Capability Used
Speech synthesis
Important AI-901 Exam Tips
For the exam, remember these key points:
- Speech recognition converts speech into text.
- Speech synthesis converts text into spoken audio.
- Speech-to-text is another term for speech recognition.
- Text-to-speech is another term for speech synthesis.
- Real-time transcription supports live captions.
- Neural voices produce more natural speech.
- Speech translation combines multiple speech technologies.
- Speaker recognition identifies individuals using voice characteristics.
- Speech AI is commonly used in assistants, accessibility tools, and customer service systems.
Quick Knowledge Check
Question 1
What does speech recognition do?
Answer
Converts spoken language into written text.
Question 2
What does speech synthesis do?
Answer
Converts text into spoken audio.
Question 3
What is another name for speech synthesis?
Answer
Text-to-speech (TTS).
Question 4
Which speech capability is used for live meeting captions?
Answer
Speech recognition.
Practice Exam Questions
Question 1
What is the PRIMARY function of speech recognition?
A. Converting images into text
B. Converting spoken language into written text
C. Generating images from prompts
D. Translating text into code
Correct Answer
B. Converting spoken language into written text
Explanation
Speech recognition, also called speech-to-text, converts spoken audio into written text.
Why the Other Answers Are Incorrect
A. Converting images into text
This is OCR functionality.
C. Generating images from prompts
This is a generative AI capability.
D. Translating text into code
This is unrelated to speech recognition.
Question 2
Which capability converts written text into spoken audio?
A. OCR
B. Speech synthesis
C. Object detection
D. Clustering
Correct Answer
B. Speech synthesis
Explanation
Speech synthesis, also called text-to-speech (TTS), generates spoken audio from text.
Why the Other Answers Are Incorrect
A. OCR
OCR extracts text from images.
C. Object detection
Object detection identifies objects in images.
D. Clustering
Clustering groups similar data.
Question 3
A company wants to generate live subtitles during online meetings.
Which AI capability should be used?
A. Speech recognition
B. Speech synthesis
C. Facial recognition
D. Image segmentation
Correct Answer
A. Speech recognition
Explanation
Speech recognition converts spoken conversations into text in real time, enabling live captions and subtitles.
Why the Other Answers Are Incorrect
B. Speech synthesis
Speech synthesis creates spoken audio from text.
C. Facial recognition
Facial recognition analyzes faces in images.
D. Image segmentation
Image segmentation is a computer vision task.
Question 4
What is another common name for speech synthesis?
A. Object detection
B. Text-to-speech
C. Speech-to-text
D. Named Entity Recognition
Correct Answer
B. Text-to-speech
Explanation
Speech synthesis is commonly referred to as text-to-speech (TTS).
Why the Other Answers Are Incorrect
A. Object detection
Object detection identifies objects in images.
C. Speech-to-text
Speech-to-text refers to speech recognition.
D. Named Entity Recognition
NER identifies entities in text.
Question 5
Which scenario BEST demonstrates speech synthesis?
A. A chatbot reading answers aloud to users
B. A camera identifying vehicles on a road
C. A system categorizing customer emails
D. A database sorting sales records
Correct Answer
A. A chatbot reading answers aloud to users
Explanation
Speech synthesis converts text responses into spoken audio for users.
Why the Other Answers Are Incorrect
B. A camera identifying vehicles on a road
This is computer vision.
C. A system categorizing customer emails
This is text classification.
D. A database sorting sales records
This is not a speech AI task.
Question 6
Which feature helps speech recognition systems perform better in noisy environments?
A. Image enhancement
B. Noise reduction
C. OCR optimization
D. Regression tuning
Correct Answer
B. Noise reduction
Explanation
Noise reduction filters background sounds to improve speech recognition accuracy.
Why the Other Answers Are Incorrect
A. Image enhancement
Image enhancement relates to visual processing.
C. OCR optimization
OCR works with images and text extraction.
D. Regression tuning
Regression is unrelated to speech audio processing.
Question 7
A navigation application reads driving directions aloud to users.
Which AI capability is being used?
A. Sentiment analysis
B. Speech synthesis
C. Object detection
D. Language detection
Correct Answer
B. Speech synthesis
Explanation
Speech synthesis converts written navigation instructions into spoken audio.
Why the Other Answers Are Incorrect
A. Sentiment analysis
Sentiment analysis evaluates emotional tone in text.
C. Object detection
Object detection analyzes images.
D. Language detection
Language detection identifies languages in text.
Question 8
Which statement BEST describes speech translation?
A. It converts images into searchable text
B. It combines speech recognition, translation, and speech synthesis
C. It identifies objects in audio recordings
D. It predicts future speech patterns
Correct Answer
B. It combines speech recognition, translation, and speech synthesis
Explanation
Speech translation systems convert spoken language into another language and often generate translated spoken output.
Why the Other Answers Are Incorrect
A. It converts images into searchable text
This is OCR.
C. It identifies objects in audio recordings
This is not a standard speech AI capability.
D. It predicts future speech patterns
This is unrelated to translation systems.
Question 9
What is the PRIMARY purpose of speaker recognition?
A. Generating synthetic voices
B. Identifying or verifying individuals using voice characteristics
C. Translating speech into multiple languages
D. Extracting keywords from documents
Correct Answer
B. Identifying or verifying individuals using voice characteristics
Explanation
Speaker recognition systems use voice patterns to identify or authenticate users.
Why the Other Answers Are Incorrect
A. Generating synthetic voices
This is speech synthesis.
C. Translating speech into multiple languages
This is speech translation.
D. Extracting keywords from documents
This is keyword extraction.
Question 10
Which pair correctly matches the capability with its function?
A. Speech recognition → Converts text into speech
B. Speech synthesis → Converts speech into text
C. Speech recognition → Converts speech into text
D. OCR → Generates spoken audio
Correct Answer
C. Speech recognition → Converts speech into text
Explanation
Speech recognition converts spoken language into written text, while speech synthesis converts text into spoken audio.
Why the Other Answers Are Incorrect
A. Speech recognition → Converts text into speech
This describes speech synthesis.
B. Speech synthesis → Converts speech into text
This describes speech recognition.
D. OCR → Generates spoken audio
OCR extracts text from images.
Final Thoughts
Speech AI technologies are essential components of modern AI systems and are an important topic for the AI-901 certification exam. Microsoft expects candidates to understand how speech recognition and speech synthesis work, along with common business scenarios where these technologies are applied.
These capabilities help organizations build more natural, accessible, and interactive AI-powered experiences using Azure AI services.
