PySpark

Learn PySpark for Data Scientists and Data Engineers (in Databricks)

Create a free Databricks account and confirm your setup before the first session

Learn PySpark, Introduction to Big Data and Apache Spark

Learn PySpark, Spark Architecture and RDDs

Learn PySpark, DataFrames and Spark SQL (in Databricks)

Learn PySpark, Data Sources and Sinks

Learn PySpark, Spark Streaming

Deploying Apache Kafka on a Single Node

Python-Based Producers and Consumers

Apache Spark Structured Streaming

MLlib fundamentals: the Pipeline API and feature engineering on Databricks

Supervised learning at scale: trees, ensembles, evaluation, and tuning with CrossValidator

Unsupervised learning and recommendation: K-Means clustering and ALS collaborative filtering

ML pipelines in production: saving models, batch inference, MLflow tracking, and drift

Scalable Machine Learning with H2O and Sparkling Water

Introduction to Distributed Machine Learning

Running H2O Sparkling Water (requires classic Databricks compute)

The capstone: apply the full stack to a real problem, on one of two tracks

A hands-on exercise: find and fix the performance traps in an AI-generated Spark pipeline

Tuning essentials: reading the plan, partitioning, and caching, on serverless and classic