This post is a part of the AI-901: Microsoft Azure AI Fundamentals Exam Prep Hub. 
This topic falls under these sections:
Identify AI concepts and capabilities (40–45%)
   --> Identify AI workloads
      --> Identify features and capabilities of Speech Recognition and Speech Synthesis

Each section includes 10 practice questions (with answers and explanations) to help you solidify your knowledge of the material. The hub also offers 2 practice tests of 60 questions each, listed below the exam topics section.

AI-901: Microsoft Azure AI Fundamentals (beta)

Speech AI is one of the major AI workloads covered in the AI-901 certification exam. Microsoft expects candidates to understand how AI systems process spoken language using technologies such as speech recognition and speech synthesis.

These capabilities allow computers to listen to, understand, and generate human speech, enabling more natural human-computer interaction.

This topic falls under the “Identify AI workloads” section of the AI-901 exam objectives.

What Is Speech AI?

Speech AI refers to AI technologies that process spoken language and audio.

Speech AI enables systems to:

Recognize spoken words
Convert speech into text
Generate spoken responses
Translate spoken language
Identify speakers

Speech technologies are commonly used in modern AI assistants and accessibility tools.

Major Speech AI Capabilities

For the AI-901 exam, important speech AI capabilities include:

Speech recognition
Speech synthesis
Speech translation
Speaker recognition

The primary focus of this topic is speech recognition and speech synthesis.

What Is Speech Recognition?

Speech recognition converts spoken language into written text.

It is often called:

Speech-to-text
Automatic Speech Recognition (ASR)

Speech recognition allows computers to “listen” to human speech and interpret it as text.

How Speech Recognition Works

Speech recognition systems typically perform these steps:

Capture audio input
Analyze sound patterns
Identify spoken words
Convert speech into text output

Modern speech recognition systems often use:

Machine learning
Deep learning
Neural networks
Large speech datasets

Speech Recognition Example

Spoken Input

“Schedule a meeting for tomorrow at 2 PM.”

Text Output

Schedule a meeting for tomorrow at 2 PM.

Common Features of Speech Recognition

Speech recognition systems may support:

Real-time transcription
Multiple languages
Noise reduction
Speaker identification (labeling which speaker said which words)
Continuous speech recognition
Command recognition

Real-Time Transcription

Real-time transcription converts speech into text immediately as someone speaks.

Common Use Cases

Live captions
Meeting transcription
Accessibility tools

Example

A video conferencing platform generates live subtitles during meetings.
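The idea of live captions can be sketched in a few lines of Python. This is only a simulation under stated assumptions: the word list stands in for the output of a real recognizer, and the function name live_captions and the "partial"/"final" labels are hypothetical, chosen for illustration. It shows how a caption grows word by word and is finalized when a sentence ends.

```python
from typing import Iterable, Iterator, Tuple


def live_captions(words: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield ("partial", text) as words arrive and ("final", text) at sentence end.

    `words` is a stand-in for a recognizer's word-by-word output (an assumption).
    """
    current = []
    for word in words:
        current.append(word)
        text = " ".join(current)
        if word.endswith((".", "?", "!")):
            # Sentence-ending punctuation closes the caption line.
            yield ("final", text)
            current = []
        else:
            # The caption so far; it may still grow.
            yield ("partial", text)
    if current:
        # Flush whatever was spoken last, even without punctuation.
        yield ("final", " ".join(current))


events = list(live_captions(["Welcome", "everyone.", "Let's", "begin"]))
for kind, text in events:
    print(f"{kind:8} {text}")
```

A captioning display would typically overwrite the current line on each partial result and start a new line after each final result.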

Continuous Speech Recognition

Continuous speech recognition processes natural conversation without requiring pauses between words.

Example

Voice assistants processing full spoken sentences naturally.

Command Recognition

Some speech systems focus on recognizing specific spoken commands.

Example Commands

“Play music”
“Turn on the lights”
“Call John”

These systems are commonly used in smart devices.
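Once speech has been converted to text, command recognition can be as simple as matching the transcript against a fixed set of phrases. The sketch below assumes the transcript is already available as a string; the command table, the action names, and the functions normalize and match_command are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical command table: phrase pattern -> action name.
COMMANDS = [
    (re.compile(r"^play music$"), "play_music"),
    (re.compile(r"^turn on the lights$"), "lights_on"),
    (re.compile(r"^call (?P<name>\w+)$"), "call_contact"),
]


def normalize(transcript):
    """Lowercase the transcript and drop punctuation so matching is forgiving."""
    lowered = transcript.lower()
    return re.sub(r"[^\w\s]", "", lowered).strip()


def match_command(transcript):
    """Return (action, parameters) for a known command, or (None, {}) otherwise."""
    text = normalize(transcript)
    for pattern, action in COMMANDS:
        found = pattern.match(text)
        if found:
            return action, found.groupdict()
    return None, {}


print(match_command("Call John"))
print(match_command("Turn on the lights."))
print(match_command("Order a pizza"))
```

Restricting recognition to a small command set like this is one reason smart devices can respond reliably to short phrases; anything outside the table is simply rejected.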

Noise Reduction

Speech recognition systems often include noise filtering capabilities.

This helps improve accuracy in noisy environments.

Example

Recognizing speech in a crowded airport.

Multilingual Speech Recognition

Many modern speech systems support multiple languages and accents.

Example

An AI assistant understanding English, Spanish, and French speakers.

Common Use Cases for Speech Recognition

Virtual Assistants

Voice-controlled assistants use speech recognition to understand spoken questions and requests before answering or performing an action.

Example

A user asks a smart speaker about the weather.

Accessibility Solutions

Speech recognition helps users who cannot type easily.

Example

Voice dictation software for users with disabilities.

Meeting Transcription

Organizations convert meetings into searchable text records.

Example

Automatic meeting notes.

Customer Service Systems

Interactive voice response (IVR) systems process spoken customer requests.

Example

A phone system asks customers to describe their issue verbally.

Hands-Free Applications

Speech recognition supports hands-free operation.

Example

Voice-controlled navigation while driving.

What Is Speech Synthesis?

Speech synthesis converts written text into spoken audio.

It is often called:

Text-to-speech (TTS)

Speech synthesis allows computers to “speak” naturally to users.

How Speech Synthesis Works

Speech synthesis systems:

Receive text input
Analyze words and pronunciation
Generate spoken audio output

Modern systems use AI-generated voices that sound increasingly human-like.

Speech Synthesis Example

Text Input

“Your appointment is scheduled for Monday at 10 AM.”

Spoken Output

The system reads the message aloud.

Features of Speech Synthesis

Speech synthesis systems may support:

Natural-sounding voices
Multiple languages
Adjustable speaking speed
Voice customization
Emotional tone control

Natural Neural Voices

Modern AI systems use neural text-to-speech technology to create more human-like speech.

Benefits include:

Improved pronunciation
Better intonation
More natural rhythm

Voice Customization

Some systems allow organizations to customize voices.

Example

A company creates a branded AI voice for customer support systems.

Adjustable Speech Settings

Speech synthesis systems may allow changes to:

Speed
Pitch
Volume
Pronunciation
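Many text-to-speech services accept these settings through Speech Synthesis Markup Language (SSML), a W3C standard. The sketch below builds an SSML string whose prosody element sets speed (rate), pitch, and volume. The helper name build_ssml and its defaults are assumptions for illustration, and the voice name is only an example; check which voices and values your service supports.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural",
               rate="medium", pitch="medium", volume="medium"):
    """Wrap text in SSML that requests a voice and prosody settings.

    The voice name is an example value (an assumption), not a requirement.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)} "
        f"volume={quoteattr(volume)}>"
        f"{escape(text)}"  # escape &, <, > so the markup stays valid
        "</prosody></voice></speak>"
    )


ssml = build_ssml("Your appointment is scheduled for Monday at 10 AM.",
                  rate="slow", volume="loud")
print(ssml)
```

Pronunciation is not covered by this sketch; SSML handles it with separate elements such as phoneme and sub.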

Common Use Cases for Speech Synthesis

Accessibility Tools

Text-to-speech helps visually impaired users consume written content.

Example

Screen readers reading web pages aloud.

Navigation Systems

GPS applications provide spoken directions.

Example

A navigation app announcing upcoming turns.

Virtual Assistants

AI assistants respond using synthesized speech.

Example

A smart assistant answers spoken questions aloud.

Customer Service Bots

Automated phone systems communicate using AI-generated voices.

Example

A banking system reads account information to customers.

Audiobooks and Learning

Speech synthesis converts written content into audio.

Example

Educational content read aloud automatically.

Speech Translation

Speech translation typically combines:

Speech recognition
Language translation
Speech synthesis (when the translated output is spoken rather than shown as text)

Example

A conference tool translates spoken English into spoken Spanish.
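The way these stages chain together can be sketched as a small pipeline. Everything here is a stand-in: the three fake_ functions and the one-entry phrase table are hypothetical placeholders for real recognition, translation, and synthesis services, and "audio" is faked as UTF-8 bytes so the example runs without any speech engine.

```python
from typing import Callable


def translate_speech(audio: bytes,
                     recognize: Callable[[bytes], str],
                     translate: Callable[[str], str],
                     synthesize: Callable[[str], bytes]) -> bytes:
    """Run the three stages in order: speech -> text -> translated text -> speech."""
    source_text = recognize(audio)
    target_text = translate(source_text)
    return synthesize(target_text)


# Stand-in stages (assumptions for illustration only).
PHRASES = {"good morning": "buenos días"}


def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_translate(text: str) -> str:
    return PHRASES.get(text.lower(), text)


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


spanish_audio = translate_speech(b"Good morning",
                                 fake_recognize, fake_translate, fake_synthesize)
print(spanish_audio.decode("utf-8"))
```

Because each stage is passed in as a function, the final synthesis step can be swapped for one that simply returns text when captions are wanted instead of spoken output.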

Speaker Recognition

Speaker recognition identifies or verifies individuals based on voice characteristics.

Types

Speaker identification (determining which known person is speaking)
Speaker verification (confirming that a speaker is who they claim to be)

Example

Voice-based authentication systems.
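The difference between the two types can be illustrated with a toy comparison of voice profiles. This sketch assumes each voice has already been reduced to a short list of numbers (real systems derive much larger profiles from audio); the enrolled values, the 0.95 threshold, and the function names are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two vectors: 1.0 means they point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical enrolled voice profiles (made-up numbers).
ENROLLED = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
}


def identify(sample):
    """Identification: which enrolled speaker does this sample resemble most?"""
    return max(ENROLLED, key=lambda name: cosine_similarity(sample, ENROLLED[name]))


def verify(sample, claimed, threshold=0.95):
    """Verification: is the sample close enough to the claimed speaker's profile?"""
    return cosine_similarity(sample, ENROLLED[claimed]) >= threshold


sample = [0.85, 0.15, 0.35]
print(identify(sample))          # alice
print(verify(sample, "alice"))   # True
print(verify(sample, "bob"))     # False
```

Identification answers "who is this?" by searching all profiles, while verification answers "is this really Alice?" by checking one profile against a threshold.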

Challenges in Speech AI

Speech AI systems may face challenges such as:

Background noise
Strong accents
Multiple simultaneous speakers
Poor audio quality
Specialized vocabulary
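One common way to measure how much these challenges affect accuracy is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the correct one, divided by the number of words in the correct transcript. The sketch below is a minimal implementation for illustration; the function name and the simple lowercase-and-split tokenization are assumptions, not a standard tool.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


correct = "Schedule a meeting for tomorrow at 2 PM"
heard = "Schedule a meeting for tomorrow at 2 AM"
print(word_error_rate(correct, heard))  # 0.125 (1 wrong word out of 8)
```

A lower value is better: 0.0 means the transcript matched the reference exactly.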

Responsible AI Considerations

Speech AI systems should be designed responsibly.

Important considerations include:

Privacy
Consent
Security
Accessibility
Bias reduction
Transparency

Voice data may contain sensitive personal information.

Azure AI Speech Services

The Azure AI Speech service provides cloud-based speech AI capabilities, including:

Speech-to-text
Text-to-speech
Speech translation
Speaker recognition

These services help developers integrate speech AI into applications without building models from scratch.
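As a rough sketch of what this looks like in practice, the following uses the Azure Speech SDK for Python (the azure-cognitiveservices-speech package) for one speech-to-text call and one text-to-speech call. It assumes you have a Speech resource and have stored its key and region in environment variables; the variable names SPEECH_KEY and SPEECH_REGION and the helper names are assumptions made for this example. Hands-on coding like this is shown only to make the concepts concrete.

```python
import os


def load_speech_settings(env=os.environ):
    """Read the resource key and region (environment variable names are assumptions)."""
    key = env.get("SPEECH_KEY")
    region = env.get("SPEECH_REGION")
    if not key or not region:
        raise RuntimeError("Set SPEECH_KEY and SPEECH_REGION first")
    return key, region


def transcribe_once():
    """Speech recognition: listen on the default microphone, return the text."""
    # Imported here so the file still loads when the SDK is not installed.
    import azure.cognitiveservices.speech as speechsdk

    key, region = load_speech_settings()
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    recognizer = speechsdk.SpeechRecognizer(speech_config=config)
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return None  # nothing recognized, or the request was canceled


def speak(text):
    """Speech synthesis: read the text aloud on the default speaker."""
    import azure.cognitiveservices.speech as speechsdk

    key, region = load_speech_settings()
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config)
    result = synthesizer.speak_text_async(text).get()
    return result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted
```

With the SDK installed and the variables set, calling transcribe_once() and passing its result to speak() would echo back whatever was said, exercising speech recognition and speech synthesis in turn.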

Speech Recognition vs. Speech Synthesis

Capability	Description
Speech Recognition	Converts speech into text
Speech Synthesis	Converts text into spoken audio

Real-World Examples

Scenario 1: Live Meeting Captions

Goal

Convert spoken conversations into text.

Capability Used

Speech recognition

Scenario 2: GPS Navigation App

Goal

Read directions aloud.

Capability Used

Speech synthesis

Scenario 3: Voice-Controlled Smart Home

Goal

Understand spoken commands and respond verbally.

Capabilities Used

Speech recognition
Speech synthesis

Scenario 4: Audiobook Generator

Goal

Convert books into spoken audio.

Capability Used

Speech synthesis

Important AI-901 Exam Tips

For the exam, remember these key points:

Speech recognition converts speech into text.
Speech synthesis converts text into spoken audio.
Speech-to-text is another term for speech recognition.
Text-to-speech is another term for speech synthesis.
Real-time transcription supports live captions.
Neural voices produce more natural speech.
Speech translation combines multiple speech technologies.
Speaker recognition identifies individuals using voice characteristics.
Speech AI is commonly used in assistants, accessibility tools, and customer service systems.

Quick Knowledge Check

Question 1

What does speech recognition do?

Answer

Converts spoken language into written text.

Question 2

What does speech synthesis do?

Answer

Converts text into spoken audio.

Question 3

What is another name for speech synthesis?

Answer

Text-to-speech (TTS).

Question 4

Which speech capability is used for live meeting captions?

Answer

Speech recognition.

Practice Exam Questions

Question 1

What is the PRIMARY function of speech recognition?

A. Converting images into text
B. Converting spoken language into written text
C. Generating images from prompts
D. Translating text into code

Correct Answer

B. Converting spoken language into written text

Explanation

Speech recognition, also called speech-to-text, converts spoken audio into written text.

Why the Other Answers Are Incorrect

A. Converting images into text

This is optical character recognition (OCR), a computer vision capability.

C. Generating images from prompts

This is a generative AI capability.

D. Translating text into code

This is unrelated to speech recognition.

Question 2

Which capability converts written text into spoken audio?

A. OCR
B. Speech synthesis
C. Object detection
D. Clustering

Correct Answer

B. Speech synthesis

Explanation

Speech synthesis, also called text-to-speech (TTS), generates spoken audio from text.

Why the Other Answers Are Incorrect

A. OCR

OCR extracts text from images.

C. Object detection

Object detection identifies objects in images.

D. Clustering

Clustering groups similar data.

Question 3

A company wants to generate live subtitles during online meetings.

Which AI capability should be used?

A. Speech recognition
B. Speech synthesis
C. Facial recognition
D. Image segmentation

Correct Answer

A. Speech recognition

Explanation

Speech recognition converts spoken conversations into text in real time, enabling live captions and subtitles.

Why the Other Answers Are Incorrect

B. Speech synthesis

Speech synthesis creates spoken audio from text.

C. Facial recognition

Facial recognition analyzes faces in images.

D. Image segmentation

Image segmentation is a computer vision task.

Question 4

What is another common name for speech synthesis?

A. Object detection
B. Text-to-speech
C. Speech-to-text
D. Named Entity Recognition

Correct Answer

B. Text-to-speech

Explanation

Speech synthesis is commonly referred to as text-to-speech (TTS).

Why the Other Answers Are Incorrect

A. Object detection

Object detection identifies objects in images.

C. Speech-to-text

Speech-to-text refers to speech recognition.

D. Named Entity Recognition

NER identifies entities in text.

Question 5

Which scenario BEST demonstrates speech synthesis?

A. A chatbot reading answers aloud to users
B. A camera identifying vehicles on a road
C. A system categorizing customer emails
D. A database sorting sales records

Correct Answer

A. A chatbot reading answers aloud to users

Explanation

Speech synthesis converts text responses into spoken audio for users.

Why the Other Answers Are Incorrect

B. A camera identifying vehicles on a road

This is computer vision.

C. A system categorizing customer emails

This is text classification.

D. A database sorting sales records

This is not a speech AI task.

Question 6

Which feature helps speech recognition systems perform better in noisy environments?

A. Image enhancement
B. Noise reduction
C. OCR optimization
D. Regression tuning

Correct Answer

B. Noise reduction

Explanation

Noise reduction filters background sounds to improve speech recognition accuracy.

Why the Other Answers Are Incorrect

A. Image enhancement

Image enhancement relates to visual processing.

C. OCR optimization

OCR works with images and text extraction.

D. Regression tuning

Regression is a machine learning technique for predicting numeric values and is unrelated to filtering audio noise.

Question 7

A navigation application reads driving directions aloud to users.

Which AI capability is being used?

A. Sentiment analysis
B. Speech synthesis
C. Object detection
D. Language detection

Correct Answer

B. Speech synthesis

Explanation

Speech synthesis converts written navigation instructions into spoken audio.

Why the Other Answers Are Incorrect

A. Sentiment analysis

Sentiment analysis evaluates emotional tone in text.

C. Object detection

Object detection analyzes images.

D. Language detection

Language detection identifies languages in text.

Question 8

Which statement BEST describes speech translation?

A. It converts images into searchable text
B. It combines speech recognition, translation, and speech synthesis
C. It identifies objects in audio recordings
D. It predicts future speech patterns

Correct Answer

B. It combines speech recognition, translation, and speech synthesis

Explanation

Speech translation systems convert spoken language into another language and often generate translated spoken output.

Why the Other Answers Are Incorrect

A. It converts images into searchable text

This is OCR.

C. It identifies objects in audio recordings

This is not a standard speech AI capability.

D. It predicts future speech patterns

This is unrelated to translation systems.

Question 9

What is the PRIMARY purpose of speaker recognition?

A. Generating synthetic voices
B. Identifying or verifying individuals using voice characteristics
C. Translating speech into multiple languages
D. Extracting keywords from documents

Correct Answer

B. Identifying or verifying individuals using voice characteristics

Explanation

Speaker recognition systems use voice patterns to identify or authenticate users.

Why the Other Answers Are Incorrect

A. Generating synthetic voices

This is speech synthesis.

C. Translating speech into multiple languages

This is speech translation.

D. Extracting keywords from documents

This is keyword extraction.

Question 10

Which pair correctly matches the capability with its function?

A. Speech recognition → Converts text into speech
B. Speech synthesis → Converts speech into text
C. Speech recognition → Converts speech into text
D. OCR → Generates spoken audio

Correct Answer

C. Speech recognition → Converts speech into text

Explanation

Speech recognition converts spoken language into written text, while speech synthesis converts text into spoken audio.

Why the Other Answers Are Incorrect

A. Speech recognition → Converts text into speech

This describes speech synthesis.

B. Speech synthesis → Converts speech into text

This describes speech recognition.

D. OCR → Generates spoken audio

OCR extracts text from images.

Final Thoughts

Speech AI technologies are essential components of modern AI systems and are an important topic for the AI-901 certification exam. Microsoft expects candidates to understand how speech recognition and speech synthesis work, along with common business scenarios where these technologies are applied.

These capabilities help organizations build more natural, accessible, and interactive AI-powered experiences using Azure AI services.
