Describe How Training and Validation Datasets Are Used in Machine Learning (AI-900 Exam Prep)

This section of the AI-900: Microsoft Azure AI Fundamentals exam focuses on understanding how machine learning models learn from data and how their performance is evaluated. Specifically, it covers the role of training datasets and validation datasets, which are core concepts in supervised machine learning.

This topic appears under: Describe fundamental principles of machine learning on Azure (15–20%) → Describe core machine learning concepts

You are not expected to build or tune models for AI-900, but you must be able to describe the purpose of training and validation datasets and how they differ.


Why Datasets Are Split in Machine Learning

In machine learning, using the same data to both train and evaluate a model can lead to misleading results. To avoid this, datasets are commonly split into separate subsets, each with a distinct purpose.

At a minimum, most machine learning workflows use:

  • A training dataset
  • A validation dataset

These datasets help ensure that a model can generalize to new, unseen data.


Training Dataset

A training dataset is the portion of data used to teach the machine learning model how to make predictions.

Key Characteristics of Training Data

  • Contains both features and labels (in supervised learning)
  • Used to identify patterns and relationships in the data
  • Typically makes up the largest portion of the dataset

What Happens During Training

  • The model makes predictions using the features
  • Predictions are compared to the known labels
  • The model adjusts its internal parameters to reduce errors

In Azure Machine Learning, this is the phase where the model “learns” from historical data.
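The three steps above can be sketched as a tiny training loop. This is only an illustration in plain Python, not how Azure Machine Learning is implemented: the data, the single parameter `w`, the learning rate, and the number of passes are all made-up assumptions, and AI-900 does not require you to write code like this.

```python
# Toy training loop: fit y ≈ w * x using one adjustable parameter.
# All values here are invented for illustration.

features = [1.0, 2.0, 3.0, 4.0]
labels = [2.0, 4.0, 6.0, 8.0]  # known outcomes; the hidden rule is y = 2x

w = 0.0                # the model's single internal parameter
learning_rate = 0.01   # how far to move w on each adjustment


def mean_squared_error(weight):
    """Average squared gap between predictions and known labels."""
    return sum((weight * x - y) ** 2 for x, y in zip(features, labels)) / len(features)


initial_error = mean_squared_error(w)

for epoch in range(200):
    # 1. The model makes predictions using the features.
    predictions = [w * x for x in features]
    # 2. Predictions are compared to the known labels.
    errors = [p - y for p, y in zip(predictions, labels)]
    # 3. The parameter is adjusted in the direction that reduces the error
    #    (the gradient of the mean squared error with respect to w).
    gradient = 2 * sum(e * x for e, x in zip(errors, features)) / len(features)
    w -= learning_rate * gradient

final_error = mean_squared_error(w)
print(f"learned w = {w:.3f}; error went from {initial_error:.2f} to {final_error:.6f}")
```

The loop only ever touches the training features and labels, which is exactly why a separate dataset is needed to judge the result.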


Validation Dataset

A validation dataset is used during model development to evaluate how well the model performs on data it was not trained on.

Key Characteristics of Validation Data

  • Separate from the training dataset
  • Contains features and labels
  • Used to assess model accuracy and generalization

Why Validation Data Is Important

  • Helps detect overfitting (when a model memorizes training data)
  • Provides a more realistic estimate of model performance than scores measured on the training data
  • Supports decisions about model selection or improvement

For AI-900, the key idea is that validation data is not used to train the model, only to evaluate it.


Training vs Validation: Key Differences

Aspect                            | Training Dataset | Validation Dataset
Primary purpose                   | Teach the model  | Evaluate the model
Used to adjust model parameters   | Yes              | No
Seen by the model during learning | Yes              | No
Helps detect overfitting          | Indirectly       | Yes

Understanding this distinction is essential for AI-900 exam questions.


Common Data Split Ratios

While AI-900 does not test exact percentages, common industry practices include:

  • 70% training / 30% validation
  • 80% training / 20% validation

The exact split depends on dataset size and use case, but the concept is what matters for the exam.
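As a rough sketch of what an 80/20 split looks like in practice, the snippet below shuffles a list of rows and cuts it in two. The function name `split_dataset`, its parameters, and the 100 placeholder rows are hypothetical and chosen only for this example; real tools (including Azure Machine Learning) provide their own ways to do this.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and return (training_rows, validation_rows)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 data records
train_rows, validation_rows = split_dataset(rows, train_fraction=0.8)
print(len(train_rows), len(validation_rows))
```

Shuffling before cutting matters when the original data is sorted (for example by date or by outcome); otherwise the two subsets may not resemble each other.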


Example Scenario

A company is building a model to predict whether customers will cancel a subscription.

  • Training dataset:
    • Used to teach the model using historical customer behavior and known outcomes
  • Validation dataset:
    • Used to test how accurately the model predicts cancellations for customers it has not seen before

This approach helps ensure the model performs well in real-world scenarios.
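To make the scenario concrete, here is a minimal sketch with invented data. It assumes each customer record is just a pair of (logins last month, whether they cancelled) and that the "model" is a single login threshold; both are simplifications made up for this example, not a realistic churn model.

```python
# Hypothetical records: (logins_last_month, cancelled)
training_data = [(1, True), (2, True), (3, True), (4, True),
                 (6, False), (8, False), (9, False), (12, False)]
validation_data = [(2, True), (6, True), (7, False), (10, False)]


def predict(logins, threshold):
    """Predict a cancellation when logins fall below the threshold."""
    return logins < threshold


def accuracy(data, threshold):
    """Fraction of records where the prediction matches the known outcome."""
    correct = sum(predict(logins, threshold) == cancelled for logins, cancelled in data)
    return correct / len(data)


def train(data):
    """Choose the threshold that gives the best accuracy on the data passed in."""
    candidates = sorted({logins for logins, _ in data})
    return max(candidates, key=lambda t: accuracy(data, t))


# Only the training data is used to choose the threshold.
threshold = train(training_data)

# The validation data is used only to check the result.
train_accuracy = accuracy(training_data, threshold)
validation_accuracy = accuracy(validation_data, threshold)
print(threshold, train_accuracy, validation_accuracy)
```

Note that `validation_data` is never passed to `train`; it is held back so the accuracy measured on it reflects customers the model has not seen.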


Overfitting and Generalization

One of the main reasons for using a validation dataset is to detect overfitting.

  • Overfitting occurs when a model performs well on training data but poorly on new data
  • Validation data helps confirm that the model can generalize beyond the training set

For AI-900, you only need to recognize this relationship, not the mathematical details.
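If it helps to see the relationship in action, the sketch below compares two toy models on an invented task (labeling a number True when it is 50 or more). One model simply memorizes the training examples; the other learns a single threshold. The task, the data, and both models are assumptions made for illustration only.

```python
def true_label(x):
    """The hidden rule behind the toy data."""
    return x >= 50


training = [(x, true_label(x)) for x in range(0, 100, 2)]    # even numbers
validation = [(x, true_label(x)) for x in range(1, 100, 2)]  # odd numbers, unseen in training

# Model A: memorizes every training example and guesses False for anything else.
memory = dict(training)


def memorizer(x):
    return memory.get(x, False)


# Model B: learns one threshold halfway between the largest False example
# and the smallest True example in the training data.
largest_false = max(x for x, label in training if not label)
smallest_true = min(x for x, label in training if label)
threshold = (largest_false + smallest_true) / 2


def threshold_model(x):
    return x >= threshold


def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == label for x, label in data) / len(data)


for name, model in [("memorizer", memorizer), ("threshold", threshold_model)]:
    print(name, accuracy(model, training), accuracy(model, validation))
```

Both models score perfectly on the training data, so training accuracy alone cannot tell them apart. Only the validation score exposes that the memorizer fails on new numbers while the threshold model generalizes.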


Azure Context for AI-900

In Azure Machine Learning:

  • Training data is used to train machine learning models
  • Validation data is used to evaluate model performance during development
  • This separation supports reliable and responsible AI solutions

Exam Tips for AI-900

  • If the question mentions learning or adjusting the model, think training dataset
  • If the question mentions evaluation or performance on unseen data, think validation dataset
  • Validation data is not used to teach the model
  • AI-900 focuses on understanding why datasets are separated

Key Takeaways

  • Training datasets are used to teach machine learning models
  • Validation datasets are used to evaluate model performance
  • Separating datasets helps detect overfitting and confirm that a model generalizes
  • Understanding these roles is a core AI-900 exam skill
