This post is a part of the AI-901: Microsoft Azure AI Fundamentals Exam Prep Hub.
This topic falls under these sections:
Implement AI solutions by using Microsoft Foundry (55–60%)
--> Implement AI solutions with computer vision and image-generation capabilities by using Foundry
--> Interpret visual input in prompts by using a deployed multimodal model
Note that there are 10 practice questions (with answers and explanations) for each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available on the hub below the exam topics section.
Modern AI systems are increasingly capable of understanding not only text and speech, but also visual information such as images and videos. Multimodal AI models combine multiple forms of input to generate intelligent responses and insights.
For the AI-901 certification exam, candidates should understand the foundational concepts behind interpreting visual input in prompts by using deployed multimodal models through Microsoft Azure AI Foundry and related Azure AI services.
This topic falls under the “Implement AI solutions with computer vision and image-generation capabilities by using Foundry” section of the AI-901 exam objectives.
What Is a Multimodal Model?
A multimodal model is an AI model capable of processing multiple types of input and output.
These modalities may include:
- Text
- Images
- Speech/audio
- Video
Multimodal models can combine information across different input types to generate responses.
What Is Visual Input?
Visual input refers to image or video data provided to an AI system.
Examples include:
- Photographs
- Screenshots
- Documents
- Charts
- Diagrams
- Videos
Example Visual Prompt
A user uploads a photo and asks:
“What objects are visible in this image?”
The AI analyzes the visual content and generates a response.
Computer Vision
Computer vision is the field of AI focused on enabling systems to interpret and understand visual information.
Computer vision tasks include:
- Image classification
- Object detection
- Facial analysis
- Optical character recognition (OCR)
- Image captioning
Azure AI Vision
Azure AI Vision provides computer vision capabilities in Azure.
Features include:
- Image analysis
- OCR
- Object detection
- Image captioning
- Face-related analysis
Azure AI Foundry
Azure AI Foundry provides tools for building and managing multimodal AI applications.
Developers can:
- Deploy AI models
- Test prompts
- Analyze images
- Build AI-powered apps
Deployed Models
A deployed model is an AI model made available for real-time use through a cloud endpoint.
Applications communicate with deployed models using APIs.
Visual Prompt Workflow
A common workflow includes:
- User uploads image
- Application sends image to multimodal model
- Model analyzes visual content
- Model generates response
- Application displays results
Example Workflow
User Uploads Image
A photo of a dog playing in a park
User Prompt
“Describe this image.”
AI Response
“A brown dog is running through a grassy park.”
Image Classification
Image classification identifies the primary category of an image.
Example
Image
Picture of a cat
Classification
“Cat”
Object Detection
Object detection identifies and locates multiple objects within an image.
Example
Image
Street scene
Detected Objects
- Car
- Bicycle
- Traffic light
- Pedestrian
Optical Character Recognition (OCR)
OCR extracts text from images or scanned documents.
Example
Image
Photo of a receipt
Extracted Text
- Store name
- Total amount
- Date
Image Captioning
Image captioning generates natural-language descriptions of images.
Example
Image
A child flying a kite
Caption
“A child flying a colorful kite in a field.”
Visual Question Answering
Some multimodal models can answer questions about images.
Example
Prompt
“How many people are in the image?”
The model analyzes the image and generates an answer.
Combining Text and Images
Multimodal systems often combine:
- Text prompts
- Visual input
This improves contextual understanding.
Example
Image
A restaurant menu
Prompt
“Which item appears to be vegetarian?”
The AI analyzes both the image and the prompt together.
APIs and Endpoints
Applications communicate with deployed multimodal models through:
- APIs
- Endpoints
These allow images and prompts to be submitted programmatically.
Authentication
Applications must securely authenticate before accessing Azure AI services.
Common methods include:
- API keys
- Azure credentials
- Managed identities
User Interface Components
A lightweight visual AI application may include:
- Image upload area
- Prompt input box
- Results display
- Image preview
Real-Time Processing
Many multimodal applications support near real-time image analysis.
This enables interactive user experiences.
Common Real-World Scenarios
Scenario 1: Accessibility Assistant
Goal
Describe visual content for visually impaired users.
Features
- Image captioning
- OCR
- Voice output
Scenario 2: Retail Product Recognition
Goal
Identify products from images.
Features
- Object detection
- Classification
- Product lookup
Scenario 3: Document Processing
Goal
Extract information from scanned forms.
Features
- OCR
- Text extraction
- Data analysis
Scenario 4: Content Moderation
Goal
Identify harmful or unsafe visual content.
Features
- Image analysis
- Safety filtering
- Automated moderation
Responsible AI Considerations
Visual AI applications should follow Responsible AI principles.
Key considerations include:
- Privacy
- Fairness
- Transparency
- Inclusiveness
- Accountability
- Security
Privacy Concerns
Images may contain:
- Personal information
- Faces
- Sensitive documents
Organizations should protect user data appropriately.
Bias and Fairness
Computer vision systems may perform unevenly across:
- Skin tones
- Age groups
- Lighting conditions
- Demographics
Organizations should evaluate models carefully for fairness.
Transparency
Users should understand:
- AI is analyzing images
- AI-generated descriptions may contain errors
- Images may be stored or processed in the cloud
Hallucinations
Multimodal AI systems may generate inaccurate visual descriptions.
These incorrect outputs are called hallucinations.
Applications should not assume all AI-generated outputs are accurate.
Error Handling
Applications should handle:
- Unsupported image formats
- Low-quality images
- Network failures
- Authentication errors
- Rate limits
Image Quality Challenges
Poor image quality can reduce accuracy.
Examples include:
- Blurry images
- Poor lighting
- Occluded objects
- Low resolution
Advantages of Visual AI Applications
Benefits include:
- Automation
- Faster analysis
- Accessibility improvements
- Improved user experiences
- Scalable image processing
Limitations of Visual AI Applications
Challenges include:
- Recognition inaccuracies
- Bias
- Privacy concerns
- Hallucinations
- Sensitivity to image quality
High-Level Workflow
A simplified workflow includes:
- Upload image
- Send image and prompt to model
- Analyze visual content
- Generate response
- Display results
Example High-Level Pseudocode
image = upload_image()prompt = get_prompt()response = analyze_image(image, prompt)display_response(response)
For AI-901, understanding the workflow is more important than memorizing exact syntax.
Important AI-901 Exam Tips
For the exam, remember these key points:
- Multimodal models process multiple data types.
- Visual input includes images and video.
- Azure AI Vision supports computer vision workloads.
- OCR extracts text from images.
- Image captioning generates descriptions of images.
- Object detection identifies multiple objects in images.
- APIs and endpoints connect applications to AI services.
- Authentication secures AI access.
- Responsible AI principles apply to computer vision systems.
- Hallucinations are inaccurate AI-generated outputs.
Quick Knowledge Check
Question 1
What is OCR used for?
Answer
Extracting text from images or scanned documents.
Question 2
What does image captioning do?
Answer
Generates natural-language descriptions of images.
Question 3
Why are multimodal models useful?
Answer
They can process multiple types of input such as text and images together.
Question 4
Why is fairness important in computer vision?
Answer
To reduce biased or uneven performance across different groups of people.
Practice Exam Questions
Question 1
What is a multimodal AI model?
A. A model that processes only text
B. A model capable of processing multiple types of input such as text and images
C. A model used only for networking
D. A model designed exclusively for spreadsheets
Correct Answer
B. A model capable of processing multiple types of input such as text and images
Explanation
Multimodal models can process and combine different forms of input, including text, images, audio, and video.
Why the Other Answers Are Incorrect
A. A model that processes only text
That describes a text-only model.
C. A model used only for networking
Networking is unrelated to multimodal AI.
D. A model designed exclusively for spreadsheets
This is unrelated to AI modalities.
Question 2
Which Azure service provides computer vision capabilities such as image analysis and OCR?
A. Azure AI Vision
B. Azure Backup
C. Azure Virtual Desktop
D. Azure Monitor
Correct Answer
A. Azure AI Vision
Explanation
Azure AI Vision provides computer vision features including OCR, object detection, and image captioning.
Why the Other Answers Are Incorrect
B. Azure Backup
This is a backup service.
C. Azure Virtual Desktop
This provides desktop virtualization.
D. Azure Monitor
This is used for monitoring and diagnostics.
Question 3
What does OCR stand for?
A. Optical Character Recognition
B. Operational Cloud Routing
C. Object Classification Registry
D. Open Compute Rendering
Correct Answer
A. Optical Character Recognition
Explanation
OCR extracts text from images or scanned documents.
Why the Other Answers Are Incorrect
B. Operational Cloud Routing
This is not an AI vision term.
C. Object Classification Registry
This is not the meaning of OCR.
D. Open Compute Rendering
This is unrelated to text extraction.
Question 4
What is the PRIMARY purpose of object detection?
A. To identify and locate objects within an image
B. To translate speech into text
C. To summarize long documents
D. To improve internet speed
Correct Answer
A. To identify and locate objects within an image
Explanation
Object detection identifies multiple objects and their positions within an image.
Why the Other Answers Are Incorrect
B. To translate speech into text
This is a speech recognition task.
C. To summarize long documents
This is a text analysis task.
D. To improve internet speed
Object detection does not affect networking.
Question 5
What does image captioning do?
A. Generates natural-language descriptions of images
B. Converts text into audio
C. Detects malware in files
D. Compresses images automatically
Correct Answer
A. Generates natural-language descriptions of images
Explanation
Image captioning uses AI to describe visual content in natural language.
Why the Other Answers Are Incorrect
B. Converts text into audio
This is speech synthesis.
C. Detects malware in files
This is unrelated to computer vision.
D. Compresses images automatically
Captioning does not perform compression.
Question 6
How do applications typically communicate with deployed multimodal models?
A. Through APIs and endpoints
B. Through USB-only connections
C. Through monitor drivers
D. Through spreadsheet templates
Correct Answer
A. Through APIs and endpoints
Explanation
Applications use APIs and endpoints to send prompts and images to AI services.
Why the Other Answers Are Incorrect
B. Through USB-only connections
Cloud AI services use network communication.
C. Through monitor drivers
These are unrelated to AI communication.
D. Through spreadsheet templates
This is unrelated to AI integration.
Question 7
Why is authentication important when accessing Azure AI services?
A. To secure access to AI resources
B. To increase image resolution
C. To improve keyboard performance
D. To reduce monitor brightness
Correct Answer
A. To secure access to AI resources
Explanation
Authentication ensures that only authorized users and applications can access Azure AI services.
Why the Other Answers Are Incorrect
B. To increase image resolution
Authentication does not affect image quality.
C. To improve keyboard performance
This is unrelated to AI services.
D. To reduce monitor brightness
Authentication does not control display settings.
Question 8
Which Responsible AI concern is especially important when analyzing images?
A. Protecting personal and sensitive visual information
B. Increasing video frame rates
C. Improving printer output quality
D. Accelerating spreadsheet calculations
Correct Answer
A. Protecting personal and sensitive visual information
Explanation
Images may contain faces, documents, or other sensitive information that must be protected.
Why the Other Answers Are Incorrect
B. Increasing video frame rates
This is unrelated to Responsible AI.
C. Improving printer output quality
Printers are unrelated to computer vision ethics.
D. Accelerating spreadsheet calculations
This is unrelated to image analysis.
Question 9
What are hallucinations in multimodal AI systems?
A. Incorrect or fabricated AI-generated outputs
B. Hardware installation failures
C. Internet connectivity issues
D. Audio recording problems
Correct Answer
A. Incorrect or fabricated AI-generated outputs
Explanation
Hallucinations occur when AI generates inaccurate or invented descriptions or answers.
Why the Other Answers Are Incorrect
B. Hardware installation failures
This is unrelated to AI-generated content.
C. Internet connectivity issues
This is a networking problem.
D. Audio recording problems
This relates to audio hardware or software.
Question 10
Which factor can negatively affect computer vision accuracy?
A. Poor image quality
B. Spreadsheet formatting
C. Screen brightness settings
D. Keyboard layout
Correct Answer
A. Poor image quality
Explanation
Blurry images, poor lighting, and low resolution can reduce computer vision accuracy.
Why the Other Answers Are Incorrect
B. Spreadsheet formatting
This does not affect image analysis.
C. Screen brightness settings
This does not directly affect AI image processing.
D. Keyboard layout
Keyboard settings are unrelated to computer vision.
Final Thoughts
Interpreting visual input using deployed multimodal models is an important topic for the AI-901 certification exam. Microsoft expects candidates to understand the foundational concepts behind computer vision and multimodal AI applications, including image analysis, OCR, object detection, image captioning, APIs, authentication, and Responsible AI principles.
Azure AI Vision and Azure AI Foundry provide powerful tools for building intelligent applications capable of understanding and responding to visual information in real-world scenarios.
Go to the AI-901 Exam Prep Hub main page
