Distributed Computing and PySpark

This course takes you from the moment a single machine runs out of memory to a distributed pipeline you can build and defend. Across Modules 0 to 7 it covers Spark architecture, the DataFrame API and Spark SQL, streaming with Kafka, and machine learning with MLlib. The core path runs on Databricks Free Edition at no cost, and every sample on that path is tested to run there; one optional track, H2O Sparkling Water, needs classic compute. The course is built to prepare you for the Databricks Certified Associate Developer for Apache Spark certification.

Course map

The main path runs top to bottom. The H2O track is optional and needs classic compute. The two extras are practice and reference you can use anytime. The full module list, with links, is below the map.

graph TD
    O["0.0 Course Overview"] --> G["0.1 Getting Started"]
    G --> M1["1 Big Data and Spark"]
    M1 --> M2["2 Architecture and RDDs"]
    M2 --> M3["3 DataFrames and Spark SQL"]
    M3 --> M4["4 Data Sources and Sinks"]
    M4 --> S0["5.0 Spark Streaming"]
    S0 --> S1["5.1 Deploying Kafka"]
    S1 --> S2["5.2 Producers and Consumers"]
    S2 --> S3["5.3 Structured Streaming"]
    S3 --> ML0["6.0 MLlib Fundamentals"]
    ML0 --> ML1["6.1 Supervised Learning"]
    ML1 --> ML2["6.2 Unsupervised and Recommendation"]
    ML2 --> ML3["6.3 ML in Production"]
    ML3 --> C["7 Capstone Project"]
    S3 -. optional, classic compute .-> H4["6.4 H2O Sparkling Water"]
    H4 --> H5["6.5 Distributed ML with H2O"]
    H5 --> H6["6.6 Running H2O"]
    subgraph EXTRAS["Practice and reference, use anytime"]
        EX["Reviewing AI-Generated Code"]
        TU["Tuning Essentials"]
    end
    M4 -. extras .-> EXTRAS
    classDef setup fill:#ecfdf5,stroke:#10b981,color:#065f46;
    classDef found fill:#eff6ff,stroke:#3b82f6,color:#1e3a8a;
    classDef stream fill:#fffbeb,stroke:#f59e0b,color:#92400e;
    classDef ml fill:#f5f3ff,stroke:#8b5cf6,color:#5b21b6;
    classDef h2o fill:#f1f5f9,stroke:#94a3b8,color:#334155;
    classDef cap fill:#fee2e2,stroke:#ef4444,color:#991b1b;
    classDef supp fill:#f8fafc,stroke:#cbd5e1,color:#475569;
    class O,G setup;
    class M1,M2,M3,M4 found;
    class S0,S1,S2,S3 stream;
    class ML0,ML1,ML2,ML3 ml;
    class H4,H5,H6 h2o;
    class C cap;
    class EX,TU supp;
    click O "/pyspark-fundamentals-and-advanced-topics/" "0.0 Course Overview"
    click G "/pyspark-getting-started-on-databricks/" "0.1 Getting Started"
    click M1 "/pyspark-introduction-to-big-data-and-apache-spark/" "Module 1"
    click M2 "/pyspark-spark-architecture-and-rdds/" "Module 2"
    click M3 "/pyspark-dataframes-and-spark-sql/" "Module 3"
    click M4 "/pyspark-data-sources-and-sinks/" "Module 4"
    click S0 "/pyspark-spark-streaming-introduction/" "Module 5.0"
    click S1 "/pyspark-deploying-apache-kafka-single-node/" "Module 5.1"
    click S2 "/pyspark-kafka-python-producers-and-consumers/" "Module 5.2"
    click S3 "/pyspark-kafka-spark-structured-streaming/" "Module 5.3"
    click ML0 "/pyspark-mllib-fundamentals/" "Module 6.0"
    click ML1 "/pyspark-supervised-learning-at-scale/" "Module 6.1"
    click ML2 "/pyspark-unsupervised-learning-and-recommendation/" "Module 6.2"
    click ML3 "/pyspark-ml-pipelines-in-production/" "Module 6.3"
    click C "/pyspark-capstone-project/" "Module 7 Capstone"
    click H4 "/pyspark-scalable-machine-learning-h2o-sparkling-water/" "Module 6.4"
    click H5 "/pyspark-introduction-to-distributed-machine-learning/" "Module 6.5"
    click H6 "/pyspark-h2o-sparkling-water-setup/" "Module 6.6"
    click EX "/pyspark-review-an-ai-generated-pipeline/" "Reviewing AI-Generated Code"
    click TU "/pyspark-tuning-essentials/" "Tuning Essentials"

PySpark - Module 0.0: Course Overview

Learn PySpark for data scientists and data engineers, on Databricks

PySpark - Module 0.1: Getting Started on Databricks

Create a free Databricks account and confirm your setup before the first session

PySpark - Module 1: Big Data and Apache Spark

Learn PySpark: an introduction to big data and Apache Spark

PySpark - Module 2: Spark Architecture and RDDs

Learn PySpark: Spark architecture and RDDs

PySpark - Module 3: DataFrames and Spark SQL

Learn PySpark: DataFrames and Spark SQL on Databricks

PySpark - Module 4: Data Sources and Sinks

Learn PySpark: data sources and sinks

PySpark - Module 5.0: Spark Streaming

Learn PySpark: Spark Streaming

PySpark - Module 5.1: Deploying Kafka

Deploying Apache Kafka on a Single Node

PySpark - Module 5.2: Kafka Producers and Consumers

Python-Based Producers and Consumers

PySpark - Module 5.3: Kafka and Structured Streaming

Apache Spark Structured Streaming

PySpark - Module 6.0: MLlib Fundamentals

MLlib fundamentals: the Pipeline API and feature engineering on Databricks

PySpark - Module 6.1: Supervised Learning at Scale

Supervised learning at scale: trees, ensembles, evaluation, and tuning with CrossValidator

PySpark - Module 6.2: Unsupervised Learning and Recommendation

Unsupervised learning and recommendation: K-Means clustering and ALS collaborative filtering

PySpark - Module 6.3: ML Pipelines in Production

ML pipelines in production: saving models, batch inference, MLflow tracking, and drift

PySpark - Module 6.4: H2O Sparkling Water

Scalable Machine Learning with H2O and Sparkling Water

PySpark - Module 6.5: Distributed ML with H2O

Introduction to Distributed Machine Learning

PySpark - Module 6.6: Running H2O Sparkling Water

Running H2O Sparkling Water (requires classic Databricks compute)

PySpark - Module 7: Capstone Project

The capstone: apply the full stack to a real problem, on one of two tracks

PySpark - Exercise: Reviewing AI-Generated Code

A hands-on exercise: find and fix the performance traps in an AI-generated Spark pipeline

PySpark - Tuning Essentials

Tuning essentials: reading the plan, partitioning, and caching, on serverless and classic