This post is a part of the AI-901: Microsoft Azure AI Fundamentals Exam Prep Hub.
This topic falls under these sections:
Implement AI solutions by using Microsoft Foundry (55–60%)
--> Implement AI solutions for text and speech by using Foundry
--> Respond to spoken prompts by using a deployed multimodal model
Note that there are 10 practice questions (with answers and explanations) for each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available on the hub below the exam topics section.
Modern AI systems increasingly support multimodal interactions, allowing users to communicate using speech, text, images, and other forms of input. Multimodal AI models can process and combine multiple input types to generate intelligent responses.
For the AI-901 certification exam, candidates should understand the foundational concepts behind responding to spoken prompts by using deployed multimodal AI models within Microsoft Azure AI Foundry and related Azure AI services.
This topic falls under the “Implement AI solutions for text and speech by using Foundry” section of the AI-901 exam objectives.
What Is a Multimodal Model?
A multimodal model is an AI model capable of processing multiple forms of input and output.
Examples of modalities include:
- Text
- Speech/audio
- Images
- Video
A multimodal model can combine information from multiple sources to generate responses.
Examples of Multimodal AI Systems
Common examples include:
- Voice assistants
- AI copilots
- Speech-enabled chatbots
- Image-and-text AI assistants
- Interactive educational tools
What Is a Spoken Prompt?
A spoken prompt is a voice-based user input provided through audio.
Instead of typing a question, the user speaks it aloud.
Example Spoken Prompt
“What is machine learning?”
The AI system converts the speech into text for processing.
Speech Recognition
Speech recognition converts spoken language into text.
This process is often called:
- Speech-to-text (STT)
- Automatic speech recognition (ASR)
Example Speech Recognition Workflow
Spoken Audio
“What time is the meeting tomorrow?”
Converted Text
“What time is the meeting tomorrow?”
The text is then processed by the AI model.
Speech Synthesis
Speech synthesis converts text into spoken audio.
This process is often called:
- Text-to-speech (TTS)
Example
AI Response Text
“The meeting starts at 10 AM.”
Spoken Output
The AI system reads the response aloud.
Azure AI Speech
Azure AI Speech provides speech recognition and speech synthesis capabilities.
Features include:
- Speech-to-text
- Text-to-speech
- Speech translation
- Voice generation
Azure AI Foundry
Azure AI Foundry provides tools for building, deploying, and testing AI applications and multimodal solutions.
Basic Workflow for Spoken Prompt Applications
A typical workflow includes:
- User speaks into microphone
- Speech recognition converts audio to text
- Text is sent to deployed multimodal model
- AI model generates response
- Optional speech synthesis converts response to audio
- User hears spoken reply
Example End-to-End Scenario
User Speaks
“Summarize today’s sales report.”
Speech Recognition
Converts audio to text
AI Model
Generates summary
Speech Synthesis
Reads summary aloud
Deployed Models
A deployed model is an AI model made available through a cloud endpoint for real-time use.
Applications interact with deployed models using APIs.
APIs and Endpoints
Applications communicate with deployed models through:
- APIs
- Endpoints
The application sends requests and receives responses programmatically.
Authentication
Applications must securely authenticate before accessing AI services.
Common methods include:
- API keys
- Azure credentials
- Managed identities
Lightweight Speech Applications
Lightweight speech-enabled applications typically include:
- Microphone input
- Speech processing
- AI response generation
- Audio playback
Conversation Context
Many speech-enabled applications maintain context between interactions.
This allows more natural conversations.
Example
User
“Who founded Microsoft?”
User Later
“When was it founded?”
The system remembers that “it” refers to Microsoft.
System Prompts
System prompts guide model behavior.
They help define:
- Tone
- Personality
- Safety rules
- Output style
Example System Prompt
“You are a professional customer support assistant.”
Model Parameters
Applications may configure settings such as:
- Temperature
- Maximum tokens
- Top-p sampling
Temperature
Temperature controls response creativity.
| Low Temperature | High Temperature |
|---|---|
| More predictable | More creative |
| More focused | More varied |
Streaming Responses
Some applications stream speech or text responses incrementally.
Streaming improves responsiveness and user experience.
Real-Time Interaction
Speech-enabled AI systems often support real-time interaction.
This creates conversational experiences similar to human dialogue.
Common Real-World Use Cases
Scenario 1: Voice Assistant
Goal
Answer spoken user questions.
Features
- Speech recognition
- Conversational AI
- Spoken responses
Scenario 2: Hands-Free AI Assistant
Goal
Allow users to interact without typing.
Features
- Voice commands
- Audio responses
- Context retention
Scenario 3: Accessibility Support
Goal
Assist users with visual or mobility impairments.
Features
- Voice interaction
- Spoken guidance
- Accessibility improvements
Responsible AI Considerations
Speech-enabled AI applications should follow Responsible AI principles.
Important considerations include:
- Privacy
- Security
- Transparency
- Fairness
- Inclusiveness
- Accountability
Privacy Concerns
Speech applications may process sensitive spoken information.
Organizations should:
- Protect audio recordings
- Secure conversations
- Limit unnecessary data storage
Transparency
Users should understand:
- AI is processing speech
- Audio may be recorded or analyzed
- AI-generated responses may contain inaccuracies
Inclusiveness
Speech systems should support:
- Different accents
- Languages
- Speech patterns
- Accessibility needs
Hallucinations
Generative AI models may produce inaccurate or fabricated responses.
These incorrect outputs are called hallucinations.
Applications should not assume all generated responses are correct.
Latency
Speech-enabled applications must minimize delays between:
- Speech input
- AI processing
- Spoken responses
High latency negatively affects user experience.
Error Handling
Applications should handle:
- Speech recognition errors
- Background noise
- Network failures
- Authentication issues
- Rate limits
Background Noise Challenges
Speech recognition may struggle with:
- Loud environments
- Multiple speakers
- Poor microphone quality
Advantages of Spoken AI Interfaces
Benefits include:
- Natural interaction
- Hands-free operation
- Accessibility improvements
- Faster communication
- Improved user experience
Limitations of Spoken AI Interfaces
Challenges include:
- Speech recognition errors
- Accent variability
- Noise interference
- Privacy concerns
- Hallucinations
- Latency
High-Level Application Workflow
A simplified workflow includes:
- Capture speech
- Convert speech to text
- Send prompt to model
- Receive response
- Convert response to speech
- Play audio response
Example High-Level Pseudocode
audio = capture_audio()text = speech_to_text(audio)response = generate_ai_response(text)speak(response)
For AI-901, understanding the workflow is more important than memorizing exact syntax.
Important AI-901 Exam Tips
For the exam, remember these key points:
- Multimodal models process multiple input types.
- Spoken prompts use speech as input.
- Speech recognition converts speech to text.
- Speech synthesis converts text to speech.
- Azure AI Speech supports speech workloads.
- Azure AI Foundry supports AI application development.
- APIs and endpoints connect applications to deployed models.
- Authentication secures AI services.
- Responsible AI principles apply to speech-enabled systems.
- Hallucinations are inaccurate AI-generated outputs.
Quick Knowledge Check
Question 1
What does speech recognition do?
Answer
Converts spoken language into text.
Question 2
What does speech synthesis do?
Answer
Converts text into spoken audio.
Question 3
What is a multimodal model?
Answer
An AI model that processes multiple forms of input and output.
Question 4
Why is inclusiveness important in speech systems?
Answer
To support different accents, languages, and accessibility needs.
Practice Exam Questions
Question 1
What is a multimodal AI model?
A. A model that only processes text
B. A model capable of processing multiple forms of input and output
C. A model used only for spreadsheets
D. A model that stores physical hardware configurations
Correct Answer
B. A model capable of processing multiple forms of input and output
Explanation
Multimodal models can work with different data types such as text, speech, images, and video.
Why the Other Answers Are Incorrect
A. A model that only processes text
That describes a text-only model, not a multimodal model.
C. A model used only for spreadsheets
This is unrelated to AI modalities.
D. A model that stores physical hardware configurations
This is unrelated to AI processing.
Question 2
What is the PRIMARY purpose of speech recognition?
A. To convert speech into text
B. To convert images into audio
C. To increase internet speed
D. To generate video animations
Correct Answer
A. To convert speech into text
Explanation
Speech recognition, also called speech-to-text, converts spoken language into written text.
Why the Other Answers Are Incorrect
B. To convert images into audio
Speech recognition does not process images.
C. To increase internet speed
Speech recognition does not affect networking.
D. To generate video animations
This is unrelated to speech processing.
Question 3
What does speech synthesis perform?
A. Converts text into spoken audio
B. Compresses speech files
C. Detects objects in images
D. Removes network latency
Correct Answer
A. Converts text into spoken audio
Explanation
Speech synthesis, also called text-to-speech, generates spoken audio from text.
Why the Other Answers Are Incorrect
B. Compresses speech files
Compression is unrelated to synthesis.
C. Detects objects in images
This is a computer vision task.
D. Removes network latency
Speech synthesis does not control network performance.
Question 4
Which Azure service provides speech recognition and speech synthesis capabilities?
A. Azure AI Speech
B. Azure Backup
C. Azure Firewall
D. Azure Virtual Machines
Correct Answer
A. Azure AI Speech
Explanation
Azure AI Speech supports speech-to-text, text-to-speech, translation, and related speech capabilities.
Why the Other Answers Are Incorrect
B. Azure Backup
This is a storage protection service.
C. Azure Firewall
This is a security service.
D. Azure Virtual Machines
This provides compute infrastructure.
Question 5
What is the purpose of deploying an AI model?
A. To make the model available for applications through an endpoint
B. To physically install computer hardware
C. To permanently disable the model
D. To compress training data
Correct Answer
A. To make the model available for applications through an endpoint
Explanation
Deployment allows applications to access AI models for real-time use.
Why the Other Answers Are Incorrect
B. To physically install computer hardware
Deployment is typically cloud-based.
C. To permanently disable the model
Deployment enables usage rather than disabling it.
D. To compress training data
Deployment does not compress datasets.
Question 6
How do applications typically communicate with deployed AI models?
A. Through APIs and endpoints
B. Through USB-only connections
C. Through monitor settings
D. Through printer drivers
Correct Answer
A. Through APIs and endpoints
Explanation
Applications use APIs connected to endpoints to exchange requests and responses with AI models.
Why the Other Answers Are Incorrect
B. Through USB-only connections
Cloud AI systems use network communication.
C. Through monitor settings
These are unrelated to AI communication.
D. Through printer drivers
Printer drivers are unrelated to AI APIs.
Question 7
Why is conversation context important in speech-enabled AI systems?
A. It allows the AI to remember previous interactions
B. It improves monitor brightness
C. It increases microphone volume automatically
D. It reduces file storage size
Correct Answer
A. It allows the AI to remember previous interactions
Explanation
Maintaining context helps create more natural and coherent conversations.
Why the Other Answers Are Incorrect
B. It improves monitor brightness
Conversation context does not affect displays.
C. It increases microphone volume automatically
This is unrelated to conversation memory.
D. It reduces file storage size
Context retention does not compress files.
Question 8
Which Responsible AI concern is especially important for speech-enabled applications?
A. Protecting sensitive spoken information
B. Increasing screen resolution
C. Accelerating video rendering
D. Improving keyboard layouts
Correct Answer
A. Protecting sensitive spoken information
Explanation
Speech-enabled systems may process personal or confidential audio data, making privacy and security important.
Why the Other Answers Are Incorrect
B. Increasing screen resolution
This is unrelated to Responsible AI.
C. Accelerating video rendering
This is unrelated to speech AI.
D. Improving keyboard layouts
Speech systems are not focused on keyboards.
Question 9
What are hallucinations in generative AI systems?
A. Incorrect or fabricated AI-generated responses
B. Hardware overheating events
C. Audio recording failures
D. Slow network connections
Correct Answer
A. Incorrect or fabricated AI-generated responses
Explanation
Hallucinations occur when AI generates information that is inaccurate or invented.
Why the Other Answers Are Incorrect
B. Hardware overheating events
This is unrelated to AI output quality.
C. Audio recording failures
This is a hardware or software issue.
D. Slow network connections
This relates to connectivity, not AI accuracy.
Question 10
What is one advantage of spoken AI interfaces?
A. Hands-free and natural interaction
B. Elimination of all recognition errors
C. Guaranteed perfect accuracy
D. Removal of all privacy concerns
Correct Answer
A. Hands-free and natural interaction
Explanation
Voice-based interfaces provide convenient and natural interaction experiences.
Why the Other Answers Are Incorrect
B. Elimination of all recognition errors
Speech systems can still make mistakes.
C. Guaranteed perfect accuracy
No AI system is perfectly accurate.
D. Removal of all privacy concerns
Speech applications still require privacy protections.
Final Thoughts
Responding to spoken prompts using deployed multimodal models is an important topic for the AI-901 certification exam. Microsoft expects candidates to understand the foundational workflow behind speech-enabled AI applications, including speech recognition, multimodal processing, speech synthesis, APIs, authentication, and Responsible AI principles.
Azure AI Foundry and Azure AI Speech provide powerful tools for building intelligent conversational applications that support natural voice interactions and modern accessibility-focused experiences.
Go to the AI-901 Exam Prep Hub main page

One thought on “Respond to spoken prompts by using a deployed multimodal model (AI-901 Exam Prep)”