Interpret visual input in prompts by using a deployed multimodal model (AI-901 Exam Prep)

This post is a part of the AI-901: Microsoft Azure AI Fundamentals Exam Prep Hub. 
This topic falls under these sections:
Implement AI solutions by using Microsoft Foundry (55–60%)
--> Implement AI solutions with computer vision and image-generation capabilities by using Foundry
--> Interpret visual input in prompts by using a deployed multimodal model


Note that there are 10 practice questions (with answers and explanations) for each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available on the hub below the exam topics section.

Modern AI systems are increasingly capable of understanding not only text and speech, but also visual information such as images and videos. Multimodal AI models combine multiple forms of input to generate intelligent responses and insights.

For the AI-901 certification exam, candidates should understand the foundational concepts behind interpreting visual input in prompts by using deployed multimodal models through Microsoft Azure AI Foundry and related Azure AI services.

This topic falls under the “Implement AI solutions with computer vision and image-generation capabilities by using Foundry” section of the AI-901 exam objectives.


What Is a Multimodal Model?

A multimodal model is an AI model capable of processing multiple types of input and output.

These modalities may include:

  • Text
  • Images
  • Speech/audio
  • Video

Multimodal models can combine information across different input types to generate responses.


What Is Visual Input?

Visual input refers to image or video data provided to an AI system.

Examples include:

  • Photographs
  • Screenshots
  • Documents
  • Charts
  • Diagrams
  • Videos

Example Visual Prompt

A user uploads a photo and asks:

“What objects are visible in this image?”

The AI analyzes the visual content and generates a response.


Computer Vision

Computer vision is the field of AI focused on enabling systems to interpret and understand visual information.

Computer vision tasks include:

  • Image classification
  • Object detection
  • Facial analysis
  • Optical character recognition (OCR)
  • Image captioning

Azure AI Vision

Azure AI Vision provides computer vision capabilities in Azure.

Features include:

  • Image analysis
  • OCR
  • Object detection
  • Image captioning
  • Face-related analysis

Azure AI Foundry

Azure AI Foundry provides tools for building and managing multimodal AI applications.

Developers can:

  • Deploy AI models
  • Test prompts
  • Analyze images
  • Build AI-powered apps

Deployed Models

A deployed model is an AI model made available for real-time use through a cloud endpoint.

Applications communicate with deployed models using APIs.


Visual Prompt Workflow

A common workflow includes:

  1. User uploads image
  2. Application sends image to multimodal model
  3. Model analyzes visual content
  4. Model generates response
  5. Application displays results

Example Workflow

User Uploads Image

A photo of a dog playing in a park

User Prompt

“Describe this image.”

AI Response

“A brown dog is running through a grassy park.”


Image Classification

Image classification identifies the primary category of an image.


Example

Image

Picture of a cat

Classification

“Cat”


Object Detection

Object detection identifies and locates multiple objects within an image.


Example

Image

Street scene

Detected Objects

  • Car
  • Bicycle
  • Traffic light
  • Pedestrian

Optical Character Recognition (OCR)

OCR extracts text from images or scanned documents.


Example

Image

Photo of a receipt

Extracted Text

  • Store name
  • Total amount
  • Date

Image Captioning

Image captioning generates natural-language descriptions of images.


Example

Image

A child flying a kite

Caption

“A child flying a colorful kite in a field.”


Visual Question Answering

Some multimodal models can answer questions about images.


Example

Prompt

“How many people are in the image?”

The model analyzes the image and generates an answer.


Combining Text and Images

Multimodal systems often combine:

  • Text prompts
  • Visual input

This improves contextual understanding.


Example

Image

A restaurant menu

Prompt

“Which item appears to be vegetarian?”

The AI analyzes both the image and the prompt together.


APIs and Endpoints

Applications communicate with deployed multimodal models through:

  • APIs
  • Endpoints

These allow images and prompts to be submitted programmatically.


Authentication

Applications must securely authenticate before accessing Azure AI services.

Common methods include:

  • API keys
  • Azure credentials
  • Managed identities

User Interface Components

A lightweight visual AI application may include:

  • Image upload area
  • Prompt input box
  • Results display
  • Image preview

Real-Time Processing

Many multimodal applications support near real-time image analysis.

This enables interactive user experiences.


Common Real-World Scenarios


Scenario 1: Accessibility Assistant

Goal

Describe visual content for visually impaired users.

Features

  • Image captioning
  • OCR
  • Voice output

Scenario 2: Retail Product Recognition

Goal

Identify products from images.

Features

  • Object detection
  • Classification
  • Product lookup

Scenario 3: Document Processing

Goal

Extract information from scanned forms.

Features

  • OCR
  • Text extraction
  • Data analysis

Scenario 4: Content Moderation

Goal

Identify harmful or unsafe visual content.

Features

  • Image analysis
  • Safety filtering
  • Automated moderation

Responsible AI Considerations

Visual AI applications should follow Responsible AI principles.

Key considerations include:

  • Privacy
  • Fairness
  • Transparency
  • Inclusiveness
  • Accountability
  • Security

Privacy Concerns

Images may contain:

  • Personal information
  • Faces
  • Sensitive documents

Organizations should protect user data appropriately.


Bias and Fairness

Computer vision systems may perform unevenly across:

  • Skin tones
  • Age groups
  • Lighting conditions
  • Demographics

Organizations should evaluate models carefully for fairness.


Transparency

Users should understand:

  • AI is analyzing images
  • AI-generated descriptions may contain errors
  • Images may be stored or processed in the cloud

Hallucinations

Multimodal AI systems may generate inaccurate visual descriptions.

These incorrect outputs are called hallucinations.

Applications should not assume all AI-generated outputs are accurate.


Error Handling

Applications should handle:

  • Unsupported image formats
  • Low-quality images
  • Network failures
  • Authentication errors
  • Rate limits

Image Quality Challenges

Poor image quality can reduce accuracy.

Examples include:

  • Blurry images
  • Poor lighting
  • Occluded objects
  • Low resolution

Advantages of Visual AI Applications

Benefits include:

  • Automation
  • Faster analysis
  • Accessibility improvements
  • Improved user experiences
  • Scalable image processing

Limitations of Visual AI Applications

Challenges include:

  • Recognition inaccuracies
  • Bias
  • Privacy concerns
  • Hallucinations
  • Sensitivity to image quality

High-Level Workflow

A simplified workflow includes:

  1. Upload image
  2. Send image and prompt to model
  3. Analyze visual content
  4. Generate response
  5. Display results

Example High-Level Pseudocode

image = upload_image()
prompt = get_prompt()
response = analyze_image(image, prompt)
display_response(response)

For AI-901, understanding the workflow is more important than memorizing exact syntax.


Important AI-901 Exam Tips

For the exam, remember these key points:

  • Multimodal models process multiple data types.
  • Visual input includes images and video.
  • Azure AI Vision supports computer vision workloads.
  • OCR extracts text from images.
  • Image captioning generates descriptions of images.
  • Object detection identifies multiple objects in images.
  • APIs and endpoints connect applications to AI services.
  • Authentication secures AI access.
  • Responsible AI principles apply to computer vision systems.
  • Hallucinations are inaccurate AI-generated outputs.

Quick Knowledge Check

Question 1

What is OCR used for?

Answer

Extracting text from images or scanned documents.


Question 2

What does image captioning do?

Answer

Generates natural-language descriptions of images.


Question 3

Why are multimodal models useful?

Answer

They can process multiple types of input such as text and images together.


Question 4

Why is fairness important in computer vision?

Answer

To reduce biased or uneven performance across different groups of people.


Practice Exam Questions

Question 1

What is a multimodal AI model?

A. A model that processes only text
B. A model capable of processing multiple types of input such as text and images
C. A model used only for networking
D. A model designed exclusively for spreadsheets


Correct Answer

B. A model capable of processing multiple types of input such as text and images


Explanation

Multimodal models can process and combine different forms of input, including text, images, audio, and video.


Why the Other Answers Are Incorrect

A. A model that processes only text

That describes a text-only model.

C. A model used only for networking

Networking is unrelated to multimodal AI.

D. A model designed exclusively for spreadsheets

This is unrelated to AI modalities.


Question 2

Which Azure service provides computer vision capabilities such as image analysis and OCR?

A. Azure AI Vision
B. Azure Backup
C. Azure Virtual Desktop
D. Azure Monitor


Correct Answer

A. Azure AI Vision


Explanation

Azure AI Vision provides computer vision features including OCR, object detection, and image captioning.


Why the Other Answers Are Incorrect

B. Azure Backup

This is a backup service.

C. Azure Virtual Desktop

This provides desktop virtualization.

D. Azure Monitor

This is used for monitoring and diagnostics.


Question 3

What does OCR stand for?

A. Optical Character Recognition
B. Operational Cloud Routing
C. Object Classification Registry
D. Open Compute Rendering


Correct Answer

A. Optical Character Recognition


Explanation

OCR extracts text from images or scanned documents.


Why the Other Answers Are Incorrect

B. Operational Cloud Routing

This is not an AI vision term.

C. Object Classification Registry

This is not the meaning of OCR.

D. Open Compute Rendering

This is unrelated to text extraction.


Question 4

What is the PRIMARY purpose of object detection?

A. To identify and locate objects within an image
B. To translate speech into text
C. To summarize long documents
D. To improve internet speed


Correct Answer

A. To identify and locate objects within an image


Explanation

Object detection identifies multiple objects and their positions within an image.


Why the Other Answers Are Incorrect

B. To translate speech into text

This is a speech recognition task.

C. To summarize long documents

This is a text analysis task.

D. To improve internet speed

Object detection does not affect networking.


Question 5

What does image captioning do?

A. Generates natural-language descriptions of images
B. Converts text into audio
C. Detects malware in files
D. Compresses images automatically


Correct Answer

A. Generates natural-language descriptions of images


Explanation

Image captioning uses AI to describe visual content in natural language.


Why the Other Answers Are Incorrect

B. Converts text into audio

This is speech synthesis.

C. Detects malware in files

This is unrelated to computer vision.

D. Compresses images automatically

Captioning does not perform compression.


Question 6

How do applications typically communicate with deployed multimodal models?

A. Through APIs and endpoints
B. Through USB-only connections
C. Through monitor drivers
D. Through spreadsheet templates


Correct Answer

A. Through APIs and endpoints


Explanation

Applications use APIs and endpoints to send prompts and images to AI services.


Why the Other Answers Are Incorrect

B. Through USB-only connections

Cloud AI services use network communication.

C. Through monitor drivers

These are unrelated to AI communication.

D. Through spreadsheet templates

This is unrelated to AI integration.


Question 7

Why is authentication important when accessing Azure AI services?

A. To secure access to AI resources
B. To increase image resolution
C. To improve keyboard performance
D. To reduce monitor brightness


Correct Answer

A. To secure access to AI resources


Explanation

Authentication ensures that only authorized users and applications can access Azure AI services.


Why the Other Answers Are Incorrect

B. To increase image resolution

Authentication does not affect image quality.

C. To improve keyboard performance

This is unrelated to AI services.

D. To reduce monitor brightness

Authentication does not control display settings.


Question 8

Which Responsible AI concern is especially important when analyzing images?

A. Protecting personal and sensitive visual information
B. Increasing video frame rates
C. Improving printer output quality
D. Accelerating spreadsheet calculations


Correct Answer

A. Protecting personal and sensitive visual information


Explanation

Images may contain faces, documents, or other sensitive information that must be protected.


Why the Other Answers Are Incorrect

B. Increasing video frame rates

This is unrelated to Responsible AI.

C. Improving printer output quality

Printers are unrelated to computer vision ethics.

D. Accelerating spreadsheet calculations

This is unrelated to image analysis.


Question 9

What are hallucinations in multimodal AI systems?

A. Incorrect or fabricated AI-generated outputs
B. Hardware installation failures
C. Internet connectivity issues
D. Audio recording problems


Correct Answer

A. Incorrect or fabricated AI-generated outputs


Explanation

Hallucinations occur when AI generates inaccurate or invented descriptions or answers.


Why the Other Answers Are Incorrect

B. Hardware installation failures

This is unrelated to AI-generated content.

C. Internet connectivity issues

This is a networking problem.

D. Audio recording problems

This relates to audio hardware or software.


Question 10

Which factor can negatively affect computer vision accuracy?

A. Poor image quality
B. Spreadsheet formatting
C. Screen brightness settings
D. Keyboard layout


Correct Answer

A. Poor image quality


Explanation

Blurry images, poor lighting, and low resolution can reduce computer vision accuracy.


Why the Other Answers Are Incorrect

B. Spreadsheet formatting

This does not affect image analysis.

C. Screen brightness settings

This does not directly affect AI image processing.

D. Keyboard layout

Keyboard settings are unrelated to computer vision.


Final Thoughts

Interpreting visual input using deployed multimodal models is an important topic for the AI-901 certification exam. Microsoft expects candidates to understand the foundational concepts behind computer vision and multimodal AI applications, including image analysis, OCR, object detection, image captioning, APIs, authentication, and Responsible AI principles.

Azure AI Vision and Azure AI Foundry provide powerful tools for building intelligent applications capable of understanding and responding to visual information in real-world scenarios.


Go to the AI-901 Exam Prep Hub main page

One thought on “Interpret visual input in prompts by using a deployed multimodal model (AI-901 Exam Prep)”

Leave a comment