Describe How Training and Validation Datasets Are Used in Machine Learning (AI-900 Exam Prep)

This section of the AI-900: Microsoft Azure AI Fundamentals exam focuses on understanding how machine learning models learn from data and how their performance is evaluated. Specifically, it covers the role of training datasets and validation datasets, which are core concepts in supervised machine learning.

This topic appears under: Describe fundamental principles of machine learning on Azure (15–20%) → Describe core machine learning concepts

You are not expected to build or tune models for AI-900, but you must be able to describe the purpose of training and validation datasets and how they differ.

Why Datasets Are Split in Machine Learning

In machine learning, using the same data to both train and evaluate a model can lead to misleading results. To avoid this, datasets are commonly split into separate subsets, each with a distinct purpose.

At a minimum, most machine learning workflows use:

A training dataset
A validation dataset

These datasets help ensure that a model can generalize to new, unseen data.

Training Dataset

A training dataset is the portion of data used to teach the machine learning model how to make predictions.

Key Characteristics of Training Data

Contains both features and labels (in supervised learning)
Used to identify patterns and relationships in the data
Typically makes up the largest portion of the dataset

What Happens During Training

The model makes predictions using the features
Predictions are compared to the known labels
The model adjusts its internal parameters to reduce errors

In Azure Machine Learning, this is the phase where the model “learns” from historical data.
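
The three training steps above can be sketched in a few lines of plain Python. This is a toy illustration, not how Azure Machine Learning trains models: the data, the single-parameter model (y ≈ w * x), the learning rate, and the epoch count are all assumptions chosen to keep the example small.

```python
# Toy supervised training loop: fit y ≈ w * x with gradient descent.
# All values here are illustrative assumptions.

features = [1.0, 2.0, 3.0, 4.0]
labels = [2.0, 4.0, 6.0, 8.0]  # known outcomes (here y = 2x)

def train(features, labels, learning_rate=0.01, epochs=200):
    w = 0.0  # the model's single internal parameter
    for _ in range(epochs):
        for x, y in zip(features, labels):
            prediction = w * x              # 1. predict from the feature
            error = prediction - y          # 2. compare with the known label
            w -= learning_rate * error * x  # 3. adjust the parameter to reduce the error
    return w

w = train(features, labels)
print(round(w, 3))
```

After enough passes over the training data, the learned parameter settles close to 2, the relationship hidden in the labels.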

Validation Dataset

A validation dataset is used to evaluate how well the model performs on data it was not trained on, typically while the model is still being developed.

Key Characteristics of Validation Data

Separate from the training dataset
Contains features and labels
Used to assess model accuracy and generalization

Why Validation Data Is Important

Helps detect overfitting (when a model memorizes training data)
Provides a more realistic estimate of model performance than the training data can
Supports decisions about model selection or improvement

For AI-900, the key idea is that validation data is not used to train the model, only to evaluate it.
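
As a minimal sketch of "evaluate, don't train", the snippet below scores an already-trained model on held-out rows without changing it. The threshold model, the `learned_threshold` value, and the validation rows are hypothetical and exist only for illustration.

```python
# Hypothetical already-trained model: predicts 1 when the feature exceeds a threshold.
def predict(threshold, x):
    return 1 if x > threshold else 0

def accuracy(threshold, rows):
    """Fraction of rows whose predicted label matches the known label."""
    correct = sum(1 for x, label in rows if predict(threshold, x) == label)
    return correct / len(rows)

learned_threshold = 5.0  # assume this value came out of training

# Held-out (feature, label) pairs the model never saw during training.
validation_rows = [(2.0, 0), (4.5, 0), (5.5, 1), (6.0, 0), (9.0, 1)]

validation_accuracy = accuracy(learned_threshold, validation_rows)
print(validation_accuracy)  # 0.8 (4 of 5 correct)
```

Note that `learned_threshold` is only read, never updated: the validation rows measure the model but do not teach it.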

Training vs Validation: Key Differences

Aspect	Training Dataset	Validation Dataset
Primary purpose	Teach the model	Evaluate the model
Used to adjust model parameters	Yes	No
Seen by the model during learning	Yes	No
Helps detect overfitting	Indirectly	Yes

Understanding this distinction is essential for AI-900 exam questions.

Common Data Split Ratios

While AI-900 does not test exact percentages, common industry practices include:

70% training / 30% validation
80% training / 20% validation

The exact split depends on dataset size and use case, but the concept is what matters for the exam.
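
One possible way to produce such a split is to shuffle the rows and cut the list at the chosen fraction. The helper name `split_dataset`, the fixed seed, and the placeholder rows below are assumptions made for this sketch; real projects often rely on a library utility instead.

```python
import random

def split_dataset(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and split it into (training, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # placeholder for 100 labeled records
training, validation = split_dataset(rows, train_fraction=0.8)
print(len(training), len(validation))  # 80 20
```

Shuffling before cutting matters when the original data is ordered (for example by date or by label), because otherwise the two subsets may not resemble each other.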

Example Scenario

A company is building a model to predict whether customers will cancel a subscription.

Training dataset:
- Used to teach the model using historical customer behavior and known outcomes
Validation dataset:
- Used to test how accurately the model predicts cancellations for customers it has not seen before

This approach helps ensure the model performs well in real-world scenarios.

Overfitting and Generalization

One of the main reasons for using a validation dataset is to detect overfitting.

Overfitting occurs when a model performs well on training data but poorly on new data
Validation data helps confirm that the model can generalize beyond the training set

For AI-900, you only need to recognize this relationship, not the mathematical details.
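
To make the relationship concrete, here is a deliberately extreme sketch: a "model" that only memorizes its training rows. The data and the memorizing model are hypothetical, chosen so that the gap between training and validation accuracy is easy to see.

```python
# A "model" that memorizes: it stores every training example in a lookup table
# and falls back to a fixed guess for anything it has not seen.
def train_memorizer(rows):
    return {features: label for features, label in rows}

def predict(model, features, default=0):
    return model.get(features, default)

def accuracy(model, rows):
    return sum(predict(model, f) == label for f, label in rows) / len(rows)

# Hypothetical (features, label) pairs; the label is 1 when the two numbers sum to more than 10.
training_rows = [((1, 2), 0), ((6, 7), 1), ((3, 3), 0), ((9, 4), 1)]
validation_rows = [((8, 5), 1), ((2, 2), 0), ((7, 6), 1), ((10, 3), 1)]

model = train_memorizer(training_rows)
print(accuracy(model, training_rows))    # 1.0
print(accuracy(model, validation_rows))  # 0.25
```

Perfect accuracy on the training rows alongside poor accuracy on the validation rows is the signature of overfitting: the model recalled examples instead of learning the underlying rule.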

Azure Context for AI-900

In Azure Machine Learning:

Training data is used to train machine learning models
Validation data is used to evaluate model performance during development
This separation supports reliable and responsible AI solutions

Exam Tips for AI-900

If the question mentions learning or adjusting the model, think training dataset
If the question mentions evaluation or performance on unseen data, think validation dataset
Validation data is not used to teach the model
AI-900 focuses on understanding why datasets are separated

Key Takeaways

Training datasets are used to teach machine learning models
Validation datasets are used to evaluate model performance
Separating datasets helps detect overfitting
Understanding these roles is a core AI-900 exam skill

