Tag: Spark

DP-700, Microsoft Certification, Microsoft Fabric June 3, 2026

Optimize Spark performance (DP-700 Exam Prep)

This post is a part of the DP-700: Implementing Data Engineering Solutions Using Microsoft Fabric Exam Prep Hub.
This topic falls under these sections:
Monitor and optimize an analytics solution (30–35%)
   --> Optimize performance
      --> Optimize Spark performance

Note that there are 10 practice questions (with answers) at the end of each section to help you solidify your knowledge of the material. Also, there are 2 practice tests with 60 questions each available from the hub's main page below the exam topics section.

Introduction

Apache Spark is the primary distributed data processing engine used throughout Microsoft Fabric for large-scale data engineering workloads. Spark powers notebooks, Lakehouse processing, data transformations, machine learning workloads, and streaming analytics. While Spark provides significant scalability and performance advantages, poorly designed Spark jobs can consume excessive compute resources, run slowly, and increase costs.

For the DP-700 exam, candidates must understand how to identify and implement Spark performance optimization techniques in Microsoft Fabric. This includes optimizing:

Data storage formats
Partitioning strategies
Query execution
Cluster utilization
Memory management
Shuffle operations
Caching
Delta Lake performance features

Understanding Spark optimization is essential because Fabric capacities are shared resources. Efficient Spark workloads improve performance while reducing capacity consumption and operational costs.

Understanding Spark Performance Fundamentals

Spark is a distributed computing framework that divides work across multiple executors running on multiple nodes.

Several factors influence performance:

Amount of data processed
Number of partitions
Data storage format
Query complexity
Network communication
Memory availability
Cluster size
Shuffle operations

A common exam theme is recognizing that Spark performance problems are often caused by inefficient data movement rather than insufficient compute power.

Use Efficient Storage Formats

Prefer Delta Lake Format

Microsoft Fabric Lakehouses use Delta Lake as the preferred storage format.

Delta Lake provides:

ACID transactions
Efficient metadata management
Optimized reads
Schema enforcement
Time travel capabilities

Compared to CSV files, Delta tables significantly improve query performance.

Example

Less efficient:

df = spark.read.csv("/Files/sales.csv")

More efficient:

df = spark.read.format("delta").load("/Tables/Sales")

Benefits include:

Faster scans
Better optimization
Improved metadata handling

Avoid CSV for Large Workloads

CSV files:

Lack schema enforcement
Require parsing on every read
Consume more storage

Preferred formats:

Delta Lake
Parquet

These columnar formats reduce I/O and improve Spark execution performance.

Optimize Partitioning

Understand Partitioning

Spark processes data in partitions.

Too few partitions:

Underutilized cluster resources

Too many partitions:

Excessive scheduling overhead

Proper partitioning balances workload distribution across executors.

Repartition When Necessary

Use repartitioning when data is unevenly distributed.

Example:

df = df.repartition(100)

This redistributes data across partitions.

Benefits:

Improved parallelism
Better workload distribution

However, repartitioning causes a shuffle operation and should be used only when beneficial.

Use Coalesce for Reducing Partitions

When decreasing partition counts:

df = df.coalesce(10)

Unlike repartition:

Minimizes data movement
Reduces shuffle overhead

This is useful when writing output files.

Minimize Shuffle Operations

Why Shuffles Are Expensive

A shuffle occurs when Spark moves data between executors.

Examples:

GROUP BY
DISTINCT
JOIN
ORDER BY

Shuffles typically represent one of the largest performance bottlenecks in Spark workloads.

Reduce Unnecessary Shuffles

Instead of repeatedly sorting and grouping data:

Perform filtering early.
Reduce dataset size before joins.
Aggregate before joining when possible.

Example:

Less efficient:

large_df.join(other_df)

More efficient:

			
filtered_df = large_df.filter(...)
filtered_df.join(other_df)

Reducing rows before joins minimizes shuffle volume.

Optimize Joins

Broadcast Small Tables

When joining a large table with a small lookup table, use a broadcast join.

Example:

			
from pyspark.sql.functions import broadcast
result = sales.join(
    broadcast(products),
    "ProductID"
)

		

Benefits:

Eliminates expensive shuffle operations
Improves join performance

This is one of the most commonly tested Spark optimization concepts.

Avoid Skewed Joins

Data skew occurs when certain partition values contain significantly more records than others.

Example:

			
Region
------
US      95%
UK       2%
CA       2%
FR       1%

		

One executor may receive most of the workload, causing bottlenecks.

Optimization techniques:

Repartition data
Salt skewed keys
Filter data before joins

Use Predicate Pushdown

What Is Predicate Pushdown?

Predicate pushdown allows Spark to push filters directly to the storage layer.

Example:

df.filter(df.OrderDate >= "2026-01-01")

Instead of loading all records, Spark retrieves only relevant data.

Benefits:

Reduced I/O
Faster scans
Lower memory consumption

Delta and Parquet formats support predicate pushdown.

Filter Early

One of the most important Spark optimization principles is:

Filter as early as possible.

Example:

Less efficient:

df.join(customers).filter(df.Region == "East")

More efficient:

			
east = df.filter(df.Region == "East")
east.join(customers)

Benefits:

Smaller datasets
Reduced memory usage
Faster joins

Cache Frequently Used Data

Why Cache?

When a DataFrame is reused multiple times, Spark may recompute it repeatedly.

Example:

sales.cache()

Benefits:

Faster repeated access
Reduced recomputation

Use caching when:

Data is reused multiple times
Data fits comfortably in memory

Avoid Excessive Caching

Caching everything can:

Consume memory
Cause executor pressure
Reduce overall performance

Cache only data that provides measurable benefit.

Optimize Delta Lake Tables

Use OPTIMIZE

Over time, Delta tables can accumulate many small files.

Small file problems include:

Slower reads
Increased metadata operations

Use:

OPTIMIZE Sales

Benefits:

File compaction
Improved query performance

This is a highly important Fabric-specific optimization technique.

Use V-Order Optimization

Microsoft Fabric supports V-Order optimization.

Benefits:

Improved compression
Faster reads
Better query performance

V-Order is particularly beneficial for analytics workloads and large-scale scans.

Use Z-Ordering When Appropriate

Z-Ordering physically organizes data based on frequently filtered columns.

Example:

			
OPTIMIZE Sales
ZORDER BY (CustomerID)

Benefits:

Reduced data scanning
Faster filtering performance

Best used on frequently queried columns.

Optimize Notebook Execution

Avoid Excessive Actions

Spark transformations are lazy.

Actions trigger execution:

			
count()
collect()
show()
display()

Repeated actions can cause repeated computations.

Instead:

Cache reusable data
Minimize unnecessary actions

Use Notebook Monitoring

Fabric provides Spark monitoring capabilities that allow engineers to review:

Job execution
Stage duration
Shuffle size
Executor utilization
Memory consumption

Monitoring helps identify bottlenecks before scaling resources unnecessarily.

Optimize Spark Pools and Compute Resources

Right-Size Compute Resources

Larger clusters are not always faster.

Consider:

Dataset size
Concurrency requirements
Transformation complexity

Overprovisioning increases costs without necessarily improving performance.

Use Autoscaling

Autoscaling allows Spark resources to expand and contract based on workload demand.

Benefits:

Reduced idle resources
Improved cost efficiency
Better workload management

Monitor Spark Performance

Important metrics include:

Job Duration

Measures overall execution time.

Stage Duration

Identifies bottlenecks within jobs.

Shuffle Read/Write

High values often indicate optimization opportunities.

Executor Utilization

Shows whether compute resources are effectively used.

Memory Usage

Helps identify memory pressure and caching issues.

Task Failures

May indicate:

Data skew
Resource exhaustion
Code inefficiencies

Common DP-700 Exam Scenarios

You may encounter questions involving:

Small-file problems solved with OPTIMIZE
Join optimization using broadcast joins
Delta Lake versus CSV performance
Predicate pushdown benefits
Data skew troubleshooting
Repartition versus coalesce decisions
Spark monitoring interpretation
Shuffle reduction techniques
V-Order optimization
Caching strategies

Exam Tips

Remember these key points:

Delta Lake generally performs better than CSV.
Filter data as early as possible.
Broadcast small lookup tables.
Minimize shuffle operations.
Use repartition carefully.
Use coalesce when reducing partitions.
Cache only frequently reused datasets.
Run OPTIMIZE on fragmented Delta tables.
Monitor shuffle metrics and stage durations.
Use V-Order and Z-Ordering to improve query performance.

Practice Exam Questions

Question 1

A Spark job repeatedly joins a 5 TB sales table with a 5 MB product lookup table. Query performance is poor.

Which optimization should be implemented?

A. Broadcast the product lookup table

B. Increase the number of Delta tables

C. Convert the lookup table to CSV

D. Disable partitioning

Correct Answer: A

Explanation: Broadcasting the small lookup table eliminates expensive shuffle operations and significantly improves join performance.

Question 2

A Delta table contains thousands of small files after months of incremental loads.

Which command should be used?

A. REBUILD

B. OPTIMIZE

C. VACUUM

D. MERGE

Correct Answer: B

Explanation: OPTIMIZE compacts small files into larger files, improving read performance and reducing metadata overhead.

Question 3

Which operation is most likely to trigger a costly Spark shuffle?

A. SELECT specific columns

B. Filter rows

C. GROUP BY

D. Read a Delta table

Correct Answer: C

Explanation: GROUP BY redistributes data across executors and typically requires a shuffle operation.

Question 4

A DataFrame is reused multiple times throughout a notebook.

Which optimization is most appropriate?

A. Repartition it repeatedly

B. Convert it to CSV

C. Cache it

D. Broadcast it

Correct Answer: C

Explanation: Caching prevents repeated recomputation and improves performance when data is reused.

Question 5

What is the primary purpose of predicate pushdown?

A. Increase memory usage

B. Push filters to the storage layer

C. Increase shuffle operations

D. Create additional partitions

Correct Answer: B

Explanation: Predicate pushdown allows filters to be applied at the storage layer, reducing the amount of data read.

Question 6

A data engineer wants to reduce partitions before writing output files while minimizing data movement.

Which operation should be used?

A. Broadcast

B. OPTIMIZE

C. Repartition

D. Coalesce

Correct Answer: D

Explanation: Coalesce reduces partitions efficiently with less data movement than repartition.

Question 7

Which storage format generally provides the best Spark performance in Microsoft Fabric Lakehouses?

A. TXT

B. CSV

C. Delta Lake

D. XML

Correct Answer: C

Explanation: Delta Lake provides optimized storage, metadata management, ACID transactions, and better query performance.

Question 8

A Spark job shows one executor consistently running much longer than all others.

What is the most likely cause?

A. Predicate pushdown

B. Data skew

C. V-Order optimization

D. Delta caching

Correct Answer: B

Explanation: Data skew causes uneven workload distribution and often results in one executor becoming a bottleneck.

Question 9

Which Fabric optimization improves data compression and analytical query performance for Delta tables?

A. V-Order

B. Autoscaling

C. Caching

D. Broadcast joins

Correct Answer: A

Explanation: V-Order optimizes data layout for analytics workloads, improving compression and read performance.

Question 10

A Spark job performs several joins and aggregations before filtering rows.

What optimization would likely improve performance?

A. Increase retention

B. Apply filtering earlier in the workflow

C. Increase notebook count

D. Export data to CSV first

Correct Answer: B

Explanation: Early filtering reduces dataset size before expensive operations such as joins and aggregations, decreasing shuffle volume and improving overall performance.

Go to the DP-700 Exam Prep Hub main page.

Big Data August 6, 2016January 1, 2026

The Apache Hadoop Ecosystem

Apache Hadoop, simply termed Hadoop, is an increasingly popular open-source framework for distributed computing. It has had a major impact on the business intelligence / data analytics / data warehousing space, spawning a new practice in this space, referred to as Big Data. Hadoop’s core architecture consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. It provides a reliable, scalable, and cost-effective means for storing and processing large data sets, and it does so like no other software frameworks before its time.

It is cost-effective and scalable because it is designed to run on commodity hardware servers that can be scaled from one to hundreds, or even thousands, therefore avoiding the cost of the expensive super-computers (which eventually hits limits). With Hadoop, you are able to add commodity servers as needed without much difficulty at minimal costs.

It is reliable because all the modules in Hadoop are designed with a fundamental assumption that hardware failures will occur and these failures should be automatically handled in software by the Hadoop framework.

Beyond the core components, the Hadoop eco-system has grown to include a number of additional packages that run on top of or alongside the core Hadoop components, including but not limited to, Apache Hive, Apache Pig, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Impala, Apache Flume, Apache Sqoop, Apache Oozie, Apache Storm, Apache Mahout, Ambari, Apache Drill, Tez, and others. This post will serve as a quick look-up for the components of the eco-system to allow you to quickly identify what the components are and understand what they do.

Hadoop component	Component Category	Purpose / Usage
Hadoop	The ecosystem	The core Apache Hadoop framework is composed of the following modules: Hadoop Common Hadoop Distributed File System (HDFS) Hadoop YARN Hadoop MapReduce
Hadoop Common	Software Libraries shared across the ecosystem	Hadoop Common contains libraries and utilities needed by other Hadoop modules
Hadoop Distributed File System (HDFS)	Distributed Storage	HDFS is a distributed file system that is the foundational storage component of Hadoop and it sits on top of the file system of the commodity hardware that Hadoop runs on. It stores data on these commodity servers and provides high bandwidth and throughput across the cluster of servers.
Hadoop YARN	Resource Management & Scheduling	YARN (which stands for Yet Another Resource Negotiator) is a resource-management platform that manages computing resources in Hadoop clusters and uses them to schedule users’ applications.
Hadoop MapReduce	Distributed Processing	MapReduce is a programming and processing paradigm that pairs with HDFS for large scale data processing. It is a distributed computational algorithm comprised of a Map() procedure and a Reduce() procedure that pushes computation down to each server in the Hadoop cluster. The Map procedure performs functions such as filtering and sorting of data; while the Reduce() procedure performs summary / aggregate type operations on the data.
Hive	MapReduce Abstraction / Analysis / Querying	Apache Hive is a data warehouse infrastructure that provides an abstraction layer on top of MapReduce. It provides a SQL-like language called HiveQL and transparently converts queries to MapReduce, Apache Tez, and Apache Spark jobs. It can handle analysis of large datasets and provides functionality for indexing, data summarization, query, and analysis of the data stored in HDFS or other compatible file systems.
Pig	MapReduce Abstraction / Analysis / Querying	Pig is a functional programming interface that allows you to use a higher level scripting language (called Pig Latin) to create MapReduce code for Hadoop. Pig is similar to PL/SQL and can be extended using UDF’s written in Java, Python and other languages. It was originally developed to provide analysts an ad-hoc way of creating and executing map-reduce jobs on very large data sets.
Ambari	Monitoring & Management of Clusters	Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It includes support for many of the key components of the Hadoop eco-system, such as, Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a user-friendly dashboard for viewing cluster health and MapReduce, Pig and Hive applications, And it also provides features to diagnose the performance of the various components.
HBase	Storage / database	HBase is a non-relational (NoSQL) distributed, fast, and scalable database that runs on top of HDFS. It is modeled after Google’s Big Table, providing BigTable-like capabilities to Hadoop, and is written in Java. It provides fault-tolerant storage and retrieval of huge quantities of sparse data – such as top 10 out of 10 billion records or the 0.1% of recrods that are non-zero. HBase features include compression, and in-memory operation. HBase tables can be used as the input and output for MapReduce jobs run in Hadoop, and are accessed through APIs. HBase can be integrated with BI and Analytics applications through drivers and through Apache Phoenix’s SQL layer. However, HBase is not a RDBMS replacement.
Hue	Web GUI	Hue is an open-source Web interface for end users that supports Apache Hadoop and its ecosystem. Hue provides a single interface for the most common Apache Hadoop components with an emphasis on user experience. Its main goal is to have the users make the most of Hadoop without worrying about the underlying complexity or using a command line.
Sqoop	Data Integration	Sqoop, named from a combination of SQL+Hadoop, is an application with a command-line interface that pulls and pushes data from/to relational data sources, to/from Hadoop. It supports compression, incremental loads of a single table, or a free form SQL query. You can also save jobs which can be run multiple times to perform the incremental loads. Imports can also be used to populate tables in Hive or HBase. Exports can be used to put data from Hadoop into a relational database. Several software vendors provide Sqoop-based functionality into their database and BI/analytics products.
Flume	Data Integration	Apache Flume is a distributed service for efficiently collecting, aggregating, and moving large amounts of log data (such as web logs or sensor data) into and out of Hadoop (HDFS). It features include fault tolerance and a simple extensible data model that supports streaming data flows and allows for online analysis.
Impala	Analysis / Querying	Cloudera Impala is a massively parallel processing, low-latency SQL Query engine that runs on Hadoop and communicates directly with HDFS, bypassing MapReduce. It allows you to run SQL queries in lower data volume scenarios on data stored in HDFS and HBase, and returns results much quicker than Pig and Hive. Impala is designed and integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software, which allows for both large scale data processing and interactive queries to be done on the same system. Impala is great for data analysts and scientists to perform analytics on data stored in Hadoop via SQL or other business intelligence tools.
Avro	Data Integration	Avro is a data interchange protocol/framework that provides data serialization and de-serialization in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
Storm	Data Integration	Apache Storm is a distributed computation framework, written predominantly in the Clojure programming language that moves streaming data into and out of Hadoop. It allows for the definition of information sources and manipulations to allow batch, distributed processing of streaming data. Storm’s architecture acts as a data transformation pipeline. At a very high level the general architecture is similar to a MapReduce job, with the main difference being that data is processed in real-time as opposed to in individual batches.
Oozie	Workflow Builder	Apache Oozie is a server-based workflow scheduling system, built using Java, to manage Hadoop jobs. It chains together MapReduce jobs, and data import/export scripts. Workflows in Oozie are defined as a collection of control flow and action nodes. Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task. Oozie provides support for different types of actions including Hadoop MapReduce, HDFS operations, Pig, SSH, and email; and it can be extended to support additional types of actions.
Mahout	Machine Learning	Apache Mahout is a set of libraries for distributed, scalable machine learning; data mining; and mathematical algorithms that run primarily on the Hadoop, and focused primarily in the areas of collaborative filtering, clustering and classification. Mahout’s core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, but is not restricted to Hadoop-based implementations.
ZooKeeper	Coordination	ZooKeeper is a high-performance, high-availability coordination service for distributed applications. It provides a distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper is used by open source enterprise search systems like Solr.
Spark	Data Integration, Processing, Machine Learning	Apache Spark sits directly on top of HDFS, bypassing MapReduce, and is a fast, general compute engine for Hadoop. It is said that Spark could eventually replace MapReduce because it provides solutions for everything MapReduce does, plus adds a lot more functionality. It uses a different paradigm from MapReduce (synonymous to rows vs sets processing in SQL), and uses more in-memory capabilities which makes it typically faster than MapReduce. In contrast to MapReduce’s two-stage disk-based paradigm, Spark’s multi-stage in-memory primitives provides performance up to 100 times faster for certain applications. Spark is very versatile and provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. Spark requires a cluster manager and supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos; and also requires a distributed storage system such as Hadoop Distributed File System (HDFS), Cassandra, Amazon S3, or even custom systems; but it does support a pseudo-distributed local mode (for development and testing). Spark is one of the most active projects in the Apache Software Foundation.
Phoenix	Data Manipulation	Apache Phoenix is a massively parallel, relational database layer on top of noSQL stores such as Apache HBase. Phoenix provides a JDBC driver that hides the complexities of the noSQL store enabling users to use the familiar SQL to create, delete, and alter SQL tables, views, indexes, and sequences; insert, update, and delete rows singly and in bulk; and query data. Phoenix compiles queries and other statements into native noSQL store APIs rather than using MapReduce enabling the building of low latency applications on top of noSQL stores.
Cassandra	Storage / database	Apache Cassandra is an open source, scalable, multi-master, high-performance, distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra supports clusters spanning multiple datacenters, with asynchronous masterless replication allowing for low latency operations.
Solr	Search	Solr (pronounced “solar”) is an open source enterprise search platform, written in Java, and runs as a standalone full-text search server. Its features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL capabilities, and rich document (e.g., Word, PDF) handling. Solr is designed for scalability and Fault tolerance with distributed search and index replication. Solr is a very popular enterprise search engine. Solr has APIs and a plugin architecture that makes it customizable using various programming languages. There is the flexibility to root the data being brought in by SQOOP and FLUME directly into SOLR to do indexing on the fly. But you can also tell SOLR to index the data in batches.
MongoDB	Storage / database	MongoDB (from humongous) is an open-source, cross-platform document-oriented, NoSQL database. It uses a JSON-like structure, called BSON, with dynamic schemas which makes the integration of data in certain types of applications easier and faster. MongoDB is one of the most popular type of database management systems, and is said to be the most popular for document stores.
Kafka	Data Integration	Apache Kafka is an open-source message broker project written in Scala. It provides a unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs. Apache Kafka was originally developed by LinkedIn, and was subsequently open sourced in early 2011.
Accumulo	Storage / database	Apache Accumulo is a data store with a sorted, distributed key/value and, like HBase, is based on the BigTable technology from Google. It is written in Java, and is built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift. Accumulo is said to be third most popular NoSQL wide column store behind Apache Cassandra and HBase as of 2015.
Chukwa	Data Integration	Chukwa is a data collection system for managing large distributed systems.
Tez	Processing	Tez is a flexible data-flow programming framework, built on Hadoop YARN, that processes both in batch and interactive modes. It is being adopted by Hive, Pig and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop MapReduce as the underlying execution engine.
Drill	Processing	Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets, even across multiple data stores, at amazing speed. It is the open source version of Google’s Dremel system which is available as a service called Google BigQuery, and it supports a variety of NoSQL databases and file systems, including HBase, MongoDB, HDFS, Amazon S3, Azure Blob Storage, local files, and more.
Apache Sentry	Security	Apache Sentry is a system for enforcing fine grained role based authorization to data and metadata stored on a Hadoop cluster.

You can get more information about Apache Hadoop here: http://hadoop.apache.org/

Introduction

Understanding Spark Performance Fundamentals

Use Efficient Storage Formats

Prefer Delta Lake Format

Example

Avoid CSV for Large Workloads

Optimize Partitioning

Understand Partitioning

Repartition When Necessary

Use Coalesce for Reducing Partitions

Minimize Shuffle Operations

Why Shuffles Are Expensive

Reduce Unnecessary Shuffles

Optimize Joins

Broadcast Small Tables

Avoid Skewed Joins

Use Predicate Pushdown

What Is Predicate Pushdown?

Filter Early

Cache Frequently Used Data

Why Cache?

Avoid Excessive Caching

Optimize Delta Lake Tables

Use OPTIMIZE

Use V-Order Optimization

Use Z-Ordering When Appropriate

Optimize Notebook Execution

Avoid Excessive Actions

Use Notebook Monitoring

Optimize Spark Pools and Compute Resources

Right-Size Compute Resources

Use Autoscaling

Monitor Spark Performance

Job Duration

Stage Duration

Shuffle Read/Write

Executor Utilization

Memory Usage

Task Failures

Common DP-700 Exam Scenarios

Exam Tips

Practice Exam Questions

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8

Question 9

Question 10

Information and resources for the data professionals' community