This section of the AI-900: Microsoft Azure AI Fundamentals exam focuses on understanding how machine learning models learn from data and how their performance is evaluated. Specifically, it covers the role of training datasets and validation datasets, which are core concepts in supervised machine learning.
This topic appears under: Describe fundamental principles of machine learning on Azure (15–20%) → Describe core machine learning concepts
You are not expected to build or tune models for AI-900, but you must be able to describe the purpose of training and validation datasets and how they differ.
Why Datasets Are Split in Machine Learning
In machine learning, evaluating a model on the same data it was trained on can produce misleadingly optimistic results, because the model is being tested on examples it has already seen. To avoid this, datasets are commonly split into separate subsets, each with a distinct purpose.
At a minimum, most machine learning workflows use:
- A training dataset
- A validation dataset
These datasets help ensure that a model can generalize to new, unseen data.
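AI-900 does not require writing any code, but if you are curious what a split looks like in practice, the short sketch below uses scikit-learn purely as an illustration (the library, the synthetic data, and the 80/20 ratio are assumptions of this example, not exam material):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic dataset: X holds the features, y holds the known labels
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold back 20% of the rows as a validation set; the rest is used for training
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), "training rows")    # 800
print(len(X_val), "validation rows")    # 200
```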
Training Dataset
A training dataset is the portion of data used to teach the machine learning model how to make predictions.
Key Characteristics of Training Data
- Contains both features and labels (in supervised learning)
- Used to identify patterns and relationships in the data
- Typically makes up the largest portion of the dataset
What Happens During Training
- The model makes predictions using the features
- Predictions are compared to the known labels
- The model adjusts its internal parameters to reduce errors
In Azure Machine Learning, this is the phase where the model “learns” from historical data.
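As a purely illustrative sketch (again assuming scikit-learn and synthetic data, neither of which is required for AI-900), the training step comes down to a single fit call on the training features and labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# fit() runs the learning step: the model compares its predictions to the known
# labels in y_train and adjusts its internal parameters to reduce the error
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```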
Validation Dataset
A validation dataset is used to evaluate how well the model performs on unseen data during the training process.
Key Characteristics of Validation Data
- Separate from the training dataset
- Contains features and labels
- Used to assess model accuracy and generalization
Why Validation Data Is Important
- Helps detect overfitting (when a model memorizes the training data instead of learning general patterns)
- Provides an unbiased evaluation of model performance
- Supports decisions about model selection or improvement
For AI-900, the key idea is that validation data is not used to train the model, only to evaluate it.
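The evaluation step can be sketched the same way. In this assumed scikit-learn example, the validation rows are passed only to the scoring call, never to fit:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)            # the model only ever sees the training rows

# The validation rows are used purely to measure performance on unseen data
val_accuracy = model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.2f}")
```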
Training vs Validation: Key Differences
| Aspect | Training Dataset | Validation Dataset |
|---|---|---|
| Primary purpose | Teach the model | Evaluate the model |
| Used to adjust model parameters | Yes | No |
| Seen by the model during learning | Yes | No |
| Helps detect overfitting | Indirectly | Yes |
Understanding this distinction is essential for AI-900 exam questions.
Common Data Split Ratios
While AI-900 does not test exact percentages, common industry practices include:
- 70% training / 30% validation
- 80% training / 20% validation
The exact split depends on dataset size and use case, but the concept is what matters for the exam.
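In most libraries the ratio is controlled by a single parameter. In the assumed scikit-learn sketch below, test_size determines how much data is held back for validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 70% training / 30% validation
X_tr70, X_val30, y_tr70, y_val30 = train_test_split(X, y, test_size=0.3, random_state=42)

# 80% training / 20% validation
X_tr80, X_val20, y_tr80, y_val20 = train_test_split(X, y, test_size=0.2, random_state=42)

print(len(X_tr70), len(X_val30))   # 700 300
print(len(X_tr80), len(X_val20))   # 800 200
```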
Example Scenario
A company is building a model to predict whether customers will cancel a subscription.
- Training dataset: used to teach the model using historical customer behavior and known outcomes
- Validation dataset: used to test how accurately the model predicts cancellations for customers it has not seen before
This approach helps ensure the model performs well in real-world scenarios.
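A minimal sketch of this scenario is shown below. The customer columns and values are invented for illustration and are not taken from any real dataset:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical historical customer data; column names are invented for this example
data = pd.DataFrame({
    "months_subscribed": [1, 24, 6, 36, 3, 12, 48, 2, 18, 9],
    "support_tickets":   [5, 0, 3, 1, 4, 2, 0, 6, 1, 3],
    "cancelled":         [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],  # known outcome (label)
})

X = data[["months_subscribed", "support_tickets"]]   # features
y = data["cancelled"]                                 # labels

# Training rows teach the model; validation rows are customers it has never seen
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))
```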
Overfitting and Generalization
One of the main reasons for using a validation dataset is to avoid overfitting.
- Overfitting occurs when a model performs well on training data but poorly on new data
- Validation data helps confirm that the model can generalize beyond the training set
For AI-900, you only need to recognize this relationship, not the mathematical details.
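To see why the comparison matters, the sketch below (an assumed example using an unconstrained decision tree, which tends to memorize its training data) compares accuracy on the training set with accuracy on the validation set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained tree can memorize the training data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)   # typically close to 1.00 (memorized)
val_acc = model.score(X_val, y_val)         # noticeably lower on unseen data
print(f"Training accuracy:   {train_acc:.2f}")
print(f"Validation accuracy: {val_acc:.2f}")

# A large gap between the two scores is a sign of overfitting
```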
Azure Context for AI-900
In Azure Machine Learning:
- Training data is used to train machine learning models
- Validation data is used to evaluate model performance during development
- This separation supports reliable and responsible AI solutions
Exam Tips for AI-900
- If the question mentions learning or adjusting the model, think training dataset
- If the question mentions evaluation or performance on unseen data, think validation dataset
- Validation data is not used to teach the model
- AI-900 focuses on understanding why datasets are separated
Key Takeaways
- Training datasets are used to teach machine learning models
- Validation datasets are used to evaluate model performance
- Separating datasets helps prevent overfitting
- Understanding these roles is a core AI-900 exam skill
