
Describe Data and Compute Services for Data Science and Machine Learning (AI-900 Exam Prep)

This topic focuses on understanding which Azure services are used to store data and provide compute power for data science and machine learning workloads — not on how to configure them in depth. For the AI-900 exam, you should recognize what each service is used for and when you would choose one over another.


Why Data and Compute Matter in Machine Learning

Machine learning solutions require two essential components:

  • Data services → where training and inference data is stored and accessed
  • Compute services → where models are trained and executed

Azure provides scalable, cloud-based services for both, allowing organizations to build, train, and deploy machine learning solutions efficiently.


Data Services for Machine Learning on Azure

Azure offers several data storage services commonly used in machine learning scenarios.

Azure Blob Storage

Azure Blob Storage is one of the most commonly used data stores for machine learning on Azure.

Key characteristics:

  • Stores unstructured data (files, images, videos, CSVs)
  • Highly scalable and cost-effective
  • Frequently used as the data source for Azure Machine Learning experiments

Typical use cases:

  • Training datasets
  • Model artifacts
  • Logs and output files

👉 On AI-900: If the question mentions large datasets, files, or unstructured data, Blob Storage is usually the answer.


Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is optimized for big data analytics and machine learning workloads.

Key characteristics:

  • Built on Azure Blob Storage
  • Supports hierarchical namespaces
  • Designed for analytics workloads

Typical use cases:

  • Large-scale machine learning projects
  • Advanced analytics and data science pipelines

👉 On AI-900: Think of Data Lake Storage when big data and analytics are mentioned.


Azure SQL Database

Azure SQL Database stores structured, relational data.

Key characteristics:

  • Table-based storage
  • Uses SQL for querying
  • Suitable for well-defined schemas

Typical use cases:

  • Business and transactional data
  • Structured datasets used in ML training

👉 On AI-900: If the data is relational and structured, Azure SQL Database is a common choice.


Compute Services for Machine Learning on Azure

Compute services provide the processing power needed to train and run machine learning models.


Azure Machine Learning Compute

Azure Machine Learning provides managed compute resources specifically designed for ML workloads.

Key characteristics:

  • Scalable CPU and GPU compute
  • Used for training and inference
  • Managed through an Azure Machine Learning workspace

Typical use cases:

  • Model training
  • Experimentation
  • Batch inference

👉 On AI-900: This is the primary compute service for machine learning.


Azure Virtual Machines

Azure Virtual Machines (VMs) offer full control over the compute environment.

Key characteristics:

  • Customizable CPU or GPU configurations
  • Supports specialized ML workloads
  • More management responsibility

Typical use cases:

  • Custom machine learning environments
  • Legacy or specialized ML tools

👉 On AI-900: VMs appear when flexibility or custom configuration is required.


Azure Kubernetes Service (AKS)

AKS is used primarily for deploying machine learning models at scale.

Key characteristics:

  • Container orchestration
  • High availability and scalability
  • Often used for real-time inference

Typical use cases:

  • Production ML model deployment
  • Scalable inference endpoints

👉 On AI-900: AKS is associated with deployment, not training.


How These Services Work Together

In a typical Azure machine learning workflow:

  1. Data is stored in Blob Storage, Data Lake, or SQL Database
  2. Models are trained using Azure Machine Learning compute or VMs
  3. Models are deployed using Azure Machine Learning or AKS
  4. Predictions are generated and consumed by applications

Azure provides built-in scalability, security features, and integration across these services.


Key Exam Takeaways

For AI-900, remember:

  • Blob Storage → unstructured ML data
  • Data Lake Storage → big data analytics
  • Azure SQL Database → structured data
  • Azure Machine Learning compute → training and experimentation
  • Virtual Machines → custom compute environments
  • AKS → scalable model deployment

You are not expected to configure these services — only recognize their purpose.
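
As a study aid, the takeaways above can be written as a small keyword lookup. This is only a sketch: the keyword strings and the `suggest_service` helper are illustrative assumptions, not official exam wording or an Azure API.

```python
# Study-aid lookup built from the takeaways above.
# The keywords are hypothetical cues, not official exam phrasing.
SERVICE_BY_KEYWORD = {
    "unstructured": "Azure Blob Storage",
    "big data analytics": "Azure Data Lake Storage Gen2",
    "relational": "Azure SQL Database",
    "training": "Azure Machine Learning compute",
    "custom environment": "Azure Virtual Machines",
    "deployment at scale": "Azure Kubernetes Service (AKS)",
}


def suggest_service(question):
    """Return every service whose keyword appears in the question text."""
    text = question.lower()
    return [service for keyword, service in SERVICE_BY_KEYWORD.items() if keyword in text]


print(suggest_service("Which service stores unstructured image files?"))
```

Real exam questions are worded more loosely than this, so treat the lookup as a memory aid rather than a rule.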


Exam Tip 💡

If a question asks:

  • “Where is ML data stored?” → Blob Storage or Data Lake
  • “Where is the model trained?” → Azure Machine Learning compute
  • “How is a model deployed at scale?” → AKS


Glossary – 100 “Data Science” Terms

Below is a glossary that includes 100 “Data Science” terms and phrases, along with their definitions and examples, in alphabetical order. Enjoy!

Term | Definition & Example
A/B Testing | Comparing two variants. Example: Website layout test.
Accuracy | Overall rate of correct predictions. Example: 90% accuracy.
Actionable Insight | Insight leading to action. Example: Improve onboarding.
Algorithm | Procedure used to train models. Example: Decision trees.
Alternative Hypothesis | Assumption opposing the null hypothesis. Example: Group A performs better than B.
AUC | Area under the ROC curve. Example: Model ranking metric.
Bayesian Inference | Updating probabilities with new evidence. Example: Prior and posterior beliefs.
Bias-Variance Tradeoff | Balance between simplicity and flexibility. Example: Model tuning.
Bootstrapping | Resampling technique for estimation. Example: Estimating confidence intervals.
Business Problem | Decision-focused question. Example: Why churn increased.
Causation | One variable directly affects another. Example: Price drop causes sales increase.
Classification | Predicting categories. Example: Spam detection.
Clustering | Grouping similar observations. Example: Market segmentation.
Computer Vision | Interpreting images and video. Example: Image classification.
Confidence Interval | Range likely containing the true value. Example: 95% CI for average revenue.
Confusion Matrix | Table evaluating classification results. Example: True positives vs false positives.
Correlation | Strength of relationship between variables. Example: Ad spend vs revenue.
Cross-Validation | Repeated training/testing splits. Example: k-fold CV.
Data Drift | Change in input data distribution. Example: New demographics.
Data Imputation | Replacing missing values. Example: Median imputation.
Data Leakage | Training a model with information that is unavailable at prediction time. Example: Using post-event data.
Data Science | Interdisciplinary field combining statistics, programming, and domain knowledge to extract insights from data. Example: Predicting customer churn.
Data Storytelling | Communicating insights effectively. Example: Executive dashboards.
Dataset | A structured collection of data for analysis. Example: Customer transactions table.
Deep Learning | Multi-layer neural networks. Example: Speech recognition.
Descriptive Statistics | Summary statistics of data. Example: Mean, median.
Dimensionality Reduction | Reducing number of features. Example: PCA.
Effect Size | Magnitude of difference or relationship. Example: Lift in conversion rate.
Ensemble Learning | Combining multiple models. Example: Boosting techniques.
Ethics in Data Science | Responsible use of data and models. Example: Avoiding biased predictions.
Experimentation | Testing hypotheses with data. Example: A/B testing.
Explainable AI (XAI) | Techniques to explain predictions. Example: SHAP values.
Exploratory Data Analysis (EDA) | Initial data investigation using statistics and visuals. Example: Distribution plots.
F1 Score | Harmonic mean of precision and recall. Example: Imbalanced datasets.
Feature | An input variable used in modeling. Example: Customer age.
Feature Engineering | Creating new features from raw data. Example: Tenure calculated from signup date.
Forecasting | Predicting future values. Example: Demand forecasting.
Generalization | Model performance on unseen data. Example: Stable test accuracy.
Hazard Function | Instantaneous event rate. Example: Churn risk over time.
Holdout Set | Data reserved for final evaluation. Example: Final test dataset.
Hyperparameter | Pre-set model configuration. Example: Learning rate.
Hypothesis | A testable assumption about data. Example: Discounts increase conversion rates.
Hypothesis Testing | Statistical method to evaluate assumptions. Example: t-test for average sales.
Insight | Meaningful analytical finding. Example: High churn among new users.
Label | Known output used in supervised learning. Example: Fraud or not fraud.
Likelihood | Probability of data given parameters. Example: Used in Bayesian models.
Loss Function | Measures prediction error. Example: Mean squared error.
Mean | Arithmetic average. Example: Average sales value.
Median | Middle value of ordered data. Example: Median income.
Missing Values | Absent data points. Example: Null customer age.
Mode | Most frequent value. Example: Most common category.
Model | Mathematical representation learned from data. Example: Logistic regression.
Model Drift | Performance degradation over time. Example: Changing customer behavior.
Model Interpretability | Understanding model decisions. Example: Feature importance.
Monte Carlo Simulation | Random sampling to model uncertainty. Example: Risk modeling.
Natural Language Processing (NLP) | Analyzing human language. Example: Sentiment analysis.
Neural Network | Model inspired by the human brain. Example: Image recognition.
Null Hypothesis | Default assumption of no effect. Example: No difference between two groups.
Optimization | Process of minimizing loss. Example: Gradient descent.
Outlier | Value significantly different from others. Example: Unusually large purchase.
Overfitting | Model memorizes training data. Example: Poor test performance.
Pipeline | End-to-end data science workflow. Example: Ingest → train → deploy.
Population | Entire group of interest. Example: All customers.
Posterior Probability | Updated belief after observing data. Example: Updated churn likelihood.
Precision | Share of predicted positives that are truly positive. Example: Fraud detection precision.
Principal Component Analysis (PCA) | Linear dimensionality reduction technique. Example: Visualizing high-dimensional data.
Prior Probability | Initial belief before observing data. Example: Baseline churn rate.
p-value | Probability of observing results at least as extreme as the data, assuming the null hypothesis is true. Example: p < 0.05 is a common significance threshold.
Recall | Share of actual positives that are correctly identified. Example: Medical diagnosis.
Regression | Predicting numeric values. Example: Sales forecasting.
Reinforcement Learning | Learning via rewards and penalties. Example: Game-playing AI.
Reproducibility | Ability to recreate results. Example: Fixed random seeds.
ROC Curve | Classifier performance visualization. Example: Threshold comparison.
Sampling | Selecting subset of data. Example: Survey sample.
Sampling Bias | Non-representative sampling. Example: Surveying only active users.
Seasonality | Repeating time-based patterns. Example: Holiday sales.
Semi-Structured Data | Data with flexible structure. Example: JSON files.
Stacking | Ensemble method using meta-models. Example: Combining classifiers.
Standard Deviation | Typical distance of values from the mean (square root of the variance). Example: Price volatility.
Stationarity | Stable statistical properties over time. Example: Mean doesn’t change.
Statistical Power | Probability of detecting a true effect. Example: Larger sample sizes increase power.
Statistical Significance | Evidence results are unlikely due to chance. Example: Rejecting the null hypothesis.
Structured Data | Data with a fixed schema. Example: SQL tables.
Supervised Learning | Learning with labeled data. Example: Credit risk prediction.
Survival Analysis | Modeling time-to-event data. Example: Customer churn timing.
Target Variable | The outcome a model predicts. Example: Loan default indicator.
Test Data | Data used to evaluate model performance. Example: Held-out test set.
Text Mining | Extracting insights from text. Example: Topic modeling.
Time Series | Data indexed by time. Example: Daily stock prices.
Tokenization | Splitting text into units. Example: Words or subwords.
Training Data | Data used to train a model. Example: Historical transactions.
Transfer Learning | Reusing pretrained models. Example: Image models for medical scans.
Trend | Long-term direction in data. Example: Growing user base.
Underfitting | Model too simple to capture patterns. Example: High bias.
Unstructured Data | Data without predefined structure. Example: Text, images.
Unsupervised Learning | Learning without labels. Example: Customer clustering.
Uplift Modeling | Estimating the incremental impact of a treatment. Example: Marketing campaign effectiveness.
Validation Set | Data used for tuning models. Example: Hyperparameter selection.
Variance | Measure of data spread. Example: Sales variability.
Word Embeddings | Numerical text representations. Example: Word2Vec.
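
Several of the evaluation terms in the glossary (accuracy, precision, recall, F1 score) come straight from the four cells of a confusion matrix. Here is a minimal sketch of those formulas in plain Python; the function name and the counts are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts.

    tp/fp/fn/tn = true positives, false positives, false negatives, true negatives.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a binary classifier evaluated on 100 examples.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision (0.8) is noticeably higher than recall (about 0.667), which is the kind of gap the F1 score is meant to summarize.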

What Exactly Does a Data Scientist Do?

A Data Scientist focuses on using statistical analysis, experimentation, and machine learning to understand complex problems and make predictions about what is likely to happen next. While Data Analysts often explain what has already happened, and Data Engineers build the systems that deliver data, Data Scientists explore patterns, probabilities, and future outcomes.

At their best, Data Scientists help organizations move from descriptive insights to predictive and prescriptive decision-making.


The Core Purpose of a Data Scientist

At its core, the role of a Data Scientist is to:

  • Explore complex and ambiguous problems using data
  • Build models that explain or predict outcomes
  • Quantify uncertainty and risk
  • Inform decisions with probabilistic insights

Data Scientists are not just model builders—they are problem solvers who apply scientific thinking to business questions.


Typical Responsibilities of a Data Scientist

While responsibilities vary by organization and maturity, most Data Scientists work across the following areas.


Framing the Problem and Defining Success

Data Scientists work with stakeholders to:

  • Clarify the business objective
  • Determine whether a data science approach is appropriate
  • Define measurable success criteria
  • Identify constraints and assumptions

A key skill is knowing when not to use machine learning.


Exploring and Understanding Data

Before modeling begins, Data Scientists:

  • Perform exploratory data analysis (EDA)
  • Investigate distributions, correlations, and outliers
  • Identify data gaps and biases
  • Assess data quality and suitability for modeling

This phase often determines whether a project succeeds or fails.
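
To make the EDA steps above concrete, here is a small sketch using only Python's standard `statistics` module. The purchase values are invented, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only option.

```python
import statistics

# Hypothetical purchase amounts; the last value is deliberately extreme.
purchases = [12.0, 15.5, 14.0, 13.5, 16.0, 15.0, 14.5, 98.0]

mean_value = statistics.mean(purchases)
median_value = statistics.median(purchases)
std_dev = statistics.stdev(purchases)  # sample standard deviation

# Quartiles, then the 1.5 * IQR fences for flagging outliers.
q1, _, q3 = statistics.quantiles(purchases, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in purchases if x < lower_fence or x > upper_fence]

print(f"mean={mean_value:.2f} median={median_value:.2f} stdev={std_dev:.2f}")
print(f"outliers={outliers}")
```

The mean lands well above the median here because a single extreme value pulls it up, a typical sign that the distribution deserves a closer look before modeling.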


Feature Engineering and Data Preparation

Transforming raw data into meaningful inputs is a major part of the job:

  • Creating features that capture real-world behavior
  • Encoding categorical variables
  • Handling missing or noisy data
  • Scaling and normalizing data where needed

Good features often matter more than complex models.
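
Here is a minimal, library-free sketch of three of the preparation steps listed above: median imputation, min-max scaling, and one-hot encoding. The customer records and field names are hypothetical; in practice these steps are usually done with a library such as pandas or scikit-learn.

```python
import statistics

# Hypothetical raw records; None marks a missing age.
customers = [
    {"age": 34, "plan": "basic"},
    {"age": None, "plan": "premium"},
    {"age": 52, "plan": "basic"},
    {"age": 41, "plan": "trial"},
]

# 1. Handle missing data: replace missing ages with the median of known ages.
known_ages = [c["age"] for c in customers if c["age"] is not None]
median_age = statistics.median(known_ages)
ages = [c["age"] if c["age"] is not None else median_age for c in customers]

# 2. Scale: min-max scaling maps ages into the range [0, 1].
min_age, max_age = min(ages), max(ages)
scaled_ages = [(age - min_age) / (max_age - min_age) for age in ages]

# 3. Encode categories: one 0/1 column per distinct plan (one-hot encoding).
plans = sorted({c["plan"] for c in customers})
one_hot = [[1 if c["plan"] == plan else 0 for plan in plans] for c in customers]

# Final feature rows: [scaled_age, plan columns in sorted order]
features = [[scaled] + encoded for scaled, encoded in zip(scaled_ages, one_hot)]
for row in features:
    print(row)
```

One caveat: on a real project the median, minimum, and maximum should be computed from the training data only and then reused on new data, otherwise information leaks from the test set into training.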


Building and Evaluating Models

Data Scientists develop and test models such as:

  • Regression and classification models
  • Time-series forecasting models
  • Clustering and segmentation techniques
  • Anomaly detection systems

They evaluate models using appropriate metrics and validation techniques, balancing accuracy with interpretability and robustness.
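
One validation technique worth seeing in miniature is k-fold cross-validation. The sketch below is a simplified illustration: the `k_fold_indices` helper, the data, and the "predict the training mean" baseline are all stand-ins, and the folds are contiguous with no shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds (no shuffling)."""
    # Spread any remainder across the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        end = start + size
        test_idx = list(range(start, end))
        train_idx = [i for i in range(n_samples) if i < start or i >= end]
        yield train_idx, test_idx
        start = end


# Hypothetical target values (e.g., weekly sales).
y = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]

fold_errors = []
for train_idx, test_idx in k_fold_indices(len(y), k=3):
    # Baseline "model": always predict the mean of the training targets.
    prediction = sum(y[i] for i in train_idx) / len(train_idx)
    # Evaluate with mean squared error on the held-out fold.
    mse = sum((y[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
    fold_errors.append(mse)

cv_mse = sum(fold_errors) / len(fold_errors)
print(f"per-fold MSE: {fold_errors}")
print(f"cross-validated MSE: {cv_mse:.2f}")
```

Averaging the error over several held-out folds gives a steadier estimate of how a model generalizes than a single train/test split, and a simple baseline like this one is a useful yardstick for judging whether a more complex model is worth it.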


Communicating Results and Recommendations

A critical responsibility is explaining:

  • What the model does and does not do
  • How confident the predictions are
  • What trade-offs exist
  • How results should be used in decision-making

A model that cannot be understood or trusted will rarely be adopted.


Common Tools Used by Data Scientists

While toolsets vary, Data Scientists commonly use:

  • Programming Languages such as Python or R
  • Statistical & ML Libraries (e.g., scikit-learn, TensorFlow, PyTorch)
  • SQL for data access and exploration
  • Notebooks for experimentation and analysis
  • Visualization Libraries for data exploration
  • Version Control for reproducibility

The emphasis is on experimentation, iteration, and learning.


What a Data Scientist Is Not

Clarifying misconceptions is important.

A Data Scientist is typically not:

  • A report or dashboard developer
  • A data engineer focused on pipelines and infrastructure
  • An AI product that automatically solves business problems
  • A decision-maker replacing human judgment

In practice, Data Scientists collaborate closely with analysts, engineers, and business leaders.


What the Role Looks Like Day-to-Day

A typical day for a Data Scientist may include:

  • Exploring a new dataset or feature
  • Testing model assumptions
  • Running experiments and comparing results
  • Reviewing model performance
  • Discussing findings with stakeholders
  • Iterating based on feedback or new data

Much of the work is exploratory and non-linear.


How the Role Evolves Over Time

As organizations mature, the Data Scientist role often evolves:

  • From ad-hoc modeling → repeatable experimentation
  • From isolated analysis → productionized models
  • From accuracy-focused → impact-focused outcomes
  • From individual contributor → technical or domain expert

Senior Data Scientists often guide model strategy, ethics, and best practices.


Why Data Scientists Are So Important

Data Scientists add value by:

  • Quantifying uncertainty and risk
  • Anticipating future outcomes
  • Enabling proactive decision-making
  • Supporting innovation through experimentation

They help organizations move beyond hindsight and into foresight.


Final Thoughts

A Data Scientist’s job is not simply to build complex models—it is to apply scientific thinking to messy, real-world problems using data.

When Data Scientists succeed, their work informs smarter decisions, better products, and more resilient strategies—always in partnership with engineering, analytics, and the business.

Good luck on your data journey!