Identify Features and Uses for Speech Recognition and Synthesis (AI-900 Exam Prep)

Where This Fits in the Exam

  • Exam area: Describe features of Natural Language Processing (NLP) workloads on Azure (15–20%)
  • Sub-area: Identify features of common NLP workload scenarios
  • Key focus: Understanding what speech recognition and synthesis do, when to use them, and which Azure services support them

This topic is highly scenario-driven on the exam.


Overview: Speech in NLP Workloads

Speech-related NLP workloads allow AI systems to:

  • Understand spoken language (speech recognition)
  • Generate spoken language (speech synthesis)

Together, these capabilities enable voice-based interactions such as virtual assistants, voice bots, dictation tools, and accessibility solutions.


Speech Recognition

What Is Speech Recognition?

Speech recognition (also called speech-to-text) is the process of converting spoken audio into written text.

The AI system analyzes:

  • Audio signals
  • Phonemes and pronunciation
  • Language patterns
  • Context

and produces text that represents what was spoken.


Key Features of Speech Recognition

Speech recognition solutions can (a short code sketch follows this list):

  • Convert live or recorded audio into text
  • Support real-time transcription
  • Handle multiple languages and accents
  • Apply noise reduction
  • Recognize custom vocabulary (e.g., medical or technical terms)
  • Provide timestamps for spoken words or phrases
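
Several of these features can be tried with just a few lines of code. Below is a minimal sketch of real-time recognition from the default microphone, assuming the azure-cognitiveservices-speech Python SDK; the key and region values are placeholders for your own Speech resource.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials -- replace with your own Speech resource key and region.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="YOUR_REGION")
speech_config.speech_recognition_language = "en-US"

# Listen on the default microphone and transcribe a single utterance.
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

print("Say something...")
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized:", result.text)
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized.")
```

For longer audio, the same recognizer also exposes continuous-recognition methods rather than the single-utterance call shown here.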

Common Uses of Speech Recognition

Speech recognition is used when users speak instead of type.

Common scenarios include:

  • Voice commands (e.g., “Turn on the lights”)
  • Call center transcription
  • Meeting and lecture transcription
  • Voice-controlled applications
  • Accessibility tools for users with limited mobility
  • Voice input for chatbots and virtual assistants

Azure Services for Speech Recognition

In Azure, speech recognition is provided by:

Azure AI Speech (Speech service)

Capabilities include (see the sketch below):

  • Speech-to-text
  • Real-time and batch transcription
  • Language detection
  • Custom speech models

Speech Synthesis

What Is Speech Synthesis?

Speech synthesis (also called text-to-speech) is the process of converting written text into spoken audio.

The goal is to produce natural, human-like speech that sounds fluent and expressive.


Key Features of Speech Synthesis

Speech synthesis solutions can (an SSML sketch follows this list):

  • Convert text into spoken audio
  • Use natural-sounding neural voices
  • Support multiple languages and accents
  • Adjust:
    • Pitch
    • Speed
    • Tone
  • Apply SSML (Speech Synthesis Markup Language) for fine control
  • Generate speech for audio files or real-time playback
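
SSML is a common exam hint, so here is a minimal sketch of SSML-driven synthesis, again assuming the azure-cognitiveservices-speech Python SDK; the key and region are placeholders, and en-US-JennyNeural is just one example of a neural voice.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials; audio plays through the default speaker.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="YOUR_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# SSML selects a neural voice and adjusts speaking rate and pitch.
ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="-10%" pitch="+5%">
      Your order has shipped and should arrive on Thursday.
    </prosody>
  </voice>
</speak>"""

result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Finished speaking.")
```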

Common Uses of Speech Synthesis

Speech synthesis is used when systems need to speak to users.

Common scenarios include:

  • Virtual assistants and chatbots
  • Navigation and GPS systems
  • Accessibility tools for visually impaired users
  • Audiobooks and e-learning content
  • Automated announcements
  • Customer service voice bots

Azure Services for Speech Synthesis

In Azure, speech synthesis is also provided by:

Azure AI Speech (Speech service)

Capabilities include (see the sketch below):

  • Text-to-speech
  • Neural voices
  • Voice customization
  • Multilingual speech output
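
To show these capabilities in code, the sketch below generates an audio file from text using a neural voice, assuming the azure-cognitiveservices-speech Python SDK; the key, region, voice name, and output file name are placeholders.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder key/region; the voice and output file name are examples.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="YOUR_REGION")
speech_config.speech_synthesis_voice_name = "en-US-AriaNeural"

# Write the generated speech to a WAV file instead of the speaker.
audio_config = speechsdk.audio.AudioOutputConfig(filename="announcement.wav")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

result = synthesizer.speak_text_async("The museum closes in fifteen minutes.").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Saved announcement.wav")
```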

Speech Recognition vs Speech Synthesis

  Capability   | Speech Recognition   | Speech Synthesis
  -------------|----------------------|--------------------
  Direction    | Speech → Text        | Text → Speech
  Input        | Audio                | Text
  Output       | Text                 | Audio
  Common name  | Speech-to-text       | Text-to-speech
  Example      | Transcribing a call  | Reading text aloud

Combined Speech Workloads

Many real-world solutions use both capabilities together.

Example (sketched in code at the end of this section):

  1. User speaks a question (speech recognition)
  2. System processes the text using NLP or AI logic
  3. System responds verbally (speech synthesis)

This is the foundation of:

  • Voice assistants
  • Conversational AI
  • Interactive voice response (IVR) systems
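
Here is a minimal sketch of that three-step loop, assuming the azure-cognitiveservices-speech Python SDK; the key and region are placeholders, and the reply logic is a stand-in for real NLP processing.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials; defaults are the microphone for input and the speaker for output.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="YOUR_REGION")
speech_config.speech_recognition_language = "en-US"
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# 1. Speech recognition: capture the user's spoken question as text.
print("Ask a question...")
heard = recognizer.recognize_once()

if heard.reason == speechsdk.ResultReason.RecognizedSpeech:
    # 2. Placeholder for the NLP step (intent detection, search, a bot, etc.).
    reply = f"You asked: {heard.text} Let me check that for you."

    # 3. Speech synthesis: speak the response back to the user.
    synthesizer.speak_text_async(reply).get()
```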

Exam-Focused Clues to Watch For 👀

On the AI-900 exam, speech workloads are usually described using phrases like:

  • “Convert spoken audio into text” → Speech recognition
  • “Generate spoken responses from text” → Speech synthesis
  • “Voice-enabled application” → Azure AI Speech
  • “Real-time transcription” → Speech recognition
  • “Reads text aloud” → Speech synthesis

Key Takeaways for AI-900

  • Speech recognition converts speech to text
  • Speech synthesis converts text to speech
  • Both are part of NLP workloads
  • Azure AI Speech is the primary Azure service for both
  • Common exam scenarios involve:
    • Voice assistants
    • Transcription
    • Accessibility
    • Customer service automation

Go to the Practice Exam Questions for this topic.

Go to the AI-900 Exam Prep Hub main page.
