Distributed Computing and PySpark

This course takes you from the moment a single machine runs out of memory to a distributed pipeline you can build and defend. Across Modules 0 to 7 it covers Spark architecture, the DataFrame API and Spark SQL, streaming with Kafka, and machine learning with MLlib. The core path runs on Databricks Free Edition at no cost, and every sample on that path is tested to run there; one optional track, H2O Sparkling Water, needs classic compute. The course is built to prepare you for the Databricks Certified Associate Developer for Apache Spark exam.

Course map

The main path runs top to bottom. The H2O track is optional and needs classic compute. The two extras are practice and reference you can use anytime. The full module list, with links, is below the map.

graph TD
    O["0.0 Course Overview"] --> G["0.1 Getting Started"]
    G --> M1["1 Big Data and Spark"]
    M1 --> M2["2 Architecture and RDDs"]
    M2 --> M3["3 DataFrames and Spark SQL"]
    M3 --> M4["4 Data Sources and Sinks"]
    M4 --> S0["5.0 Spark Streaming"]
    S0 --> S1["5.1 Deploying Kafka"]
    S1 --> S2["5.2 Producers and Consumers"]
    S2 --> S3["5.3 Structured Streaming"]
    S3 --> ML0["6.0 MLlib Fundamentals"]
    ML0 --> ML1["6.1 Supervised Learning"]
    ML1 --> ML2["6.2 Unsupervised and Recommendation"]
    ML2 --> ML3["6.3 ML in Production"]
    ML3 --> C["7 Capstone Project"]
    S3 -. optional, classic compute .-> H4["6.4 H2O Sparkling Water"]
    H4 --> H5["6.5 Distributed ML with H2O"]
    H5 --> H6["6.6 Running H2O"]
    subgraph EXTRAS["Practice and reference, use anytime"]
        EX["Reviewing AI-Generated Code"]
        TU["Tuning Essentials"]
    end
    M4 -. extras .-> EXTRAS
    classDef setup fill:#ecfdf5,stroke:#10b981,color:#065f46;
    classDef found fill:#eff6ff,stroke:#3b82f6,color:#1e3a8a;
    classDef stream fill:#fffbeb,stroke:#f59e0b,color:#92400e;
    classDef ml fill:#f5f3ff,stroke:#8b5cf6,color:#5b21b6;
    classDef h2o fill:#f1f5f9,stroke:#94a3b8,color:#334155;
    classDef cap fill:#fee2e2,stroke:#ef4444,color:#991b1b;
    classDef supp fill:#f8fafc,stroke:#cbd5e1,color:#475569;
    class O,G setup;
    class M1,M2,M3,M4 found;
    class S0,S1,S2,S3 stream;
    class ML0,ML1,ML2,ML3 ml;
    class H4,H5,H6 h2o;
    class C cap;
    class EX,TU supp;
    click O "/pyspark-fundamentals-and-advanced-topics/" "0.0 Course Overview"
    click G "/pyspark-getting-started-on-databricks/" "0.1 Getting Started"
    click M1 "/pyspark-introduction-to-big-data-and-apache-spark/" "Module 1"
    click M2 "/pyspark-spark-architecture-and-rdds/" "Module 2"
    click M3 "/pyspark-dataframes-and-spark-sql/" "Module 3"
    click M4 "/pyspark-data-sources-and-sinks/" "Module 4"
    click S0 "/pyspark-spark-streaming-introduction/" "Module 5.0"
    click S1 "/pyspark-deploying-apache-kafka-single-node/" "Module 5.1"
    click S2 "/pyspark-kafka-python-producers-and-consumers/" "Module 5.2"
    click S3 "/pyspark-kafka-spark-structured-streaming/" "Module 5.3"
    click ML0 "/pyspark-mllib-fundamentals/" "Module 6.0"
    click ML1 "/pyspark-supervised-learning-at-scale/" "Module 6.1"
    click ML2 "/pyspark-unsupervised-learning-and-recommendation/" "Module 6.2"
    click ML3 "/pyspark-ml-pipelines-in-production/" "Module 6.3"
    click C "/pyspark-capstone-project/" "Module 7 Capstone"
    click H4 "/pyspark-scalable-machine-learning-h2o-sparkling-water/" "Module 6.4"
    click H5 "/pyspark-introduction-to-distributed-machine-learning/" "Module 6.5"
    click H6 "/pyspark-h2o-sparkling-water-setup/" "Module 6.6"
    click EX "/pyspark-review-an-ai-generated-pipeline/" "Reviewing AI-Generated Code"
    click TU "/pyspark-tuning-essentials/" "Tuning Essentials"