PySpark - Module 0.0: Course Overview

PySpark Fundamentals and Advanced Topics

Distributed computing is at the heart of recent advances in data science and AI. Instead of scaling up a single machine, we can now scale out across many. That makes it possible to retrieve data, prepare it, and train our models not only faster, but often more cheaply.

One of the pioneering commercial players in the field of distributed computing is Databricks, the company founded by the original creators of the open-source Apache Spark project.

Can you answer these by the end of this module?

Not a quiz. Think of it as a preview of where this course goes. If these feel out of reach right now, that is exactly the point.

  • Why does a Python UDF so often wreck your Spark performance, and what should you reach for before you write one?
  • Your job runs fine on sample data but crawls in production because one key holds most of the rows. What is data skew, and how do you deal with it?
  • When does checkpointing actually save you, and how is it different from caching?
  • What is Delta Lake, and why did “just write Parquet” stop being good enough?
  • What does “data lakehouse” mean, and which problem is it solving that neither the warehouse nor the lake could on its own?
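To make the data skew question a little more concrete, here is a small pure-Python simulation. It is not Spark code. It assumes a simple hash partitioner (zlib.crc32 stands in for whatever hash a real engine uses), eight partitions, and invented customer keys, all of which are hypothetical. It counts how many rows each partition receives before and after "salting", which is one common way to spread a hot key.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Deterministic hash partitioner (crc32 is an assumption, not Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rows_per_partition(keys, num_partitions):
    """Count how many rows each partition would receive."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salt(keys, num_salts):
    """Append a rotating suffix so one hot key becomes num_salts distinct keys."""
    return [f"{k}_{i % num_salts}" for i, k in enumerate(keys)]


NUM_PARTITIONS = 8

# Hypothetical data: one customer holds 90% of the rows.
keys = ["big_customer"] * 9_000 + [f"customer_{i}" for i in range(1_000)]

skewed = rows_per_partition(keys, NUM_PARTITIONS)
salted = rows_per_partition(salt(keys, num_salts=8), NUM_PARTITIONS)

print("max rows in one partition, raw keys:   ", max(skewed))
print("max rows in one partition, salted keys:", max(salted))
```

With the raw keys, every row for the hot key lands in the same partition, so one worker does most of the work while the others wait. On a real cluster, salted keys have to be combined again in a second aggregation step, so treat this as an illustration of the idea only.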

Course Objective

Here is the situation this course is built around. Your pandas script ran fine on a sample, but on the full dataset it has been running for six hours, the kernel just died on an out-of-memory error, and the deadline is tomorrow. The point where the data no longer fits in one machine’s memory is where distributed computing begins. PySpark lets you keep writing familiar DataFrame code while Spark spreads the work across a cluster, so the same job finishes in minutes instead of hours. I teach it on Databricks because that is where most teams actually run Spark, and because it lets you spend your hours on the ideas instead of debugging a local install.
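The idea of spreading the work can be sketched without a cluster. The toy example below is plain Python, not PySpark: it splits rows into partitions, aggregates each partition independently (threads stand in for executors), and merges the partial results. The function names and the sample rows are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def partial_sum(partition):
    """Work done independently on one partition: sum the amount per key."""
    totals = Counter()
    for key, amount in partition:
        totals[key] += amount
    return totals


def distributed_sum(rows, num_partitions=4):
    """Aggregate each partition separately, then merge the partial results."""
    partitions = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum, partitions))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds the counts together
    return dict(merged)


# Hypothetical (key, amount) rows.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
result = distributed_sum(rows, num_partitions=2)
print(result)
```

No single worker ever needs to hold all the rows, only its own partition and a small table of partial totals. That is the property that lets a job keep running after the data outgrows one machine's memory.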

The published material is roughly 31 hours of distributed computing concepts and hands-on exercises across Modules 1 to 6, built around free online resources and Databricks notebooks.

How this course uses AI

You will use an AI assistant in this course, and you are expected to. It can write a join, a window function, or a streaming job faster than you can type. What it cannot see is your cluster, your data skew, or your cost per query. So it will hand you code that looks correct and runs slowly, or runs fine on a sample and falls over on the full dataset. Your job is to read what it produces, run it, and judge it: trace how it actually ran, find the stage that is costing you, and know why. Every module here is built to give you that judgment. The syntax is the easy part now. Knowing when the generated answer is wrong for the data in front of you is the part that is actually hard. You will practise exactly that in the exercise Reviewing AI-Generated Code.

Course Highlights

  • Fundamentals of PySpark: Learn the basics of PySpark, including its architecture, Resilient Distributed Datasets (RDDs), and DataFrames.
  • Data Processing and SQL: Master the use of DataFrames and Spark SQL for data manipulation and querying.
  • Streaming and Real-Time Data: Understand how to process real-time data using Spark Structured Streaming.
  • Machine Learning: Build and evaluate distributed models with Spark MLlib (Modules 6.0 to 6.3).
  • Capstone Project: Apply the full stack to a real-world problem on a Data Science or Data Engineering track.

Learning Outcomes

After this course you will be able to:

  • Read a production Spark pipeline in Databricks and explain what each stage does and why.
  • Take a dataset larger than memory and build a working pipeline on Databricks: ingest it, clean and transform it, and write it back out as Delta.
  • Run a real-time path with Kafka feeding Spark Structured Streaming, reason about what happens when a node fails mid-stream, and build and evaluate a distributed model with Spark MLlib. An optional H2O Sparkling Water track shows the same on classic compute.

Programme Competencies

  • Configure a Spark workload on Databricks and explain the driver and executor split that makes it run.
  • Build and debug DataFrame and Spark SQL transformations on datasets larger than memory.
  • Deploy a Structured Streaming job from a Kafka source and evaluate its fault tolerance and checkpointing behaviour.
  • Read, write and partition data across CSV, Parquet and Delta Lake, and justify the format choice.
  • Train and evaluate a distributed model with Spark MLlib, and explain when scaling out beats a single-machine library.

Time Commitment

The estimate below is total self-paced study across Modules 1 to 6: the module pages, the code you run on Databricks, the in-page exercises, and the linked readings and notebooks each module is built around. It comes to about 31 hours, weighted toward streaming and machine learning, where most of the hands-on work is.

  • Module 1: Introduction to Big Data and Apache Spark: 2 hours
  • Module 2: Spark Architecture and RDDs: 3 hours
  • Module 3: DataFrames and Spark SQL: 5 hours
  • Module 4: Data Sources and Sinks: 2 hours
  • Module 5: Spark Streaming: 9 hours (Modules 5.0 to 5.3)
  • Module 6: Machine Learning (MLlib): 10 hours (Modules 6.0 to 6.3)

Roughly 40 hours end to end: about 31 across Modules 1 to 6, plus 9 for the capstone project. The optional H2O Sparkling Water track (6.4 to 6.6) adds 3 to 4 hours and needs classic compute.

Getting Started

To begin, make sure you can install and configure PySpark. Familiarize yourself with the course materials, including the recommended readings and online resources. Each module might contain practical assignments and quizzes to reinforce your learning, so take your time to complete these exercises when available.

Course Outline: PySpark Fundamentals and Advanced Topics

Module 1: Introduction to Big Data and Apache Spark

  1. Introduction to Big Data
    • What is Big Data?
    • Challenges of Big Data
  2. Overview of Apache Spark
    • History and evolution
    • Components of the Spark ecosystem
  3. Installing PySpark
    • System requirements
    • Installing PySpark on a Linux environment (WSL)
    • Configuring PySpark and Jupyter Notebook

Module 2: Spark Architecture and RDDs

  1. Spark Architecture
    • Spark driver and executors
    • SparkContext and SparkSession
    • Cluster managers (Standalone, YARN, Mesos, Kubernetes)
  2. Resilient Distributed Datasets (RDDs)
    • What are RDDs?
    • Creating RDDs
    • Transformations and Actions
    • RDD Lineage and Persistence

Module 3: DataFrames and Spark SQL

  1. Introduction to DataFrames
    • Difference between RDDs and DataFrames
    • Creating DataFrames
    • Schema inference and manual schema definition
  2. DataFrame Operations
    • Selecting, filtering, and transforming data
    • Aggregations and Grouping
    • Joins and unions
    • Working with semi-structured data
  3. Spark SQL
    • SQL queries with Spark
    • Registering DataFrames as tables
    • Using SQL queries to manipulate DataFrames

Module 4: Data Sources and Sinks

  1. Reading Data
    • Reading from CSV, JSON, Parquet, and other formats
    • Reading from databases (JDBC)
  2. Writing Data
    • Writing to CSV, JSON, Parquet, and other formats
    • Writing to databases (JDBC)
  3. Data Sources API
    • Introduction to the Data Sources API
    • Custom data sources

Module 5: Spark Streaming

Module 6: Machine Learning with PySpark

Module 6 has two tracks. The MLlib track is the main path and runs on Free Edition with environment version 4. The H2O Sparkling Water track is optional and needs classic compute.

MLlib (Free Edition, environment version 4):

H2O Sparkling Water (optional, classic compute):

Module 7: Real-World Applications and Capstone Project

The capstone brief is available now. Pick a Data Science or Data Engineering track and apply the full stack to a real problem.

Learning Resources

  • Documentation and Tutorials
    • Apache Spark official documentation
    • PySpark API reference
  • Books and Online Courses
    • “Learning PySpark” by Tomasz Drabas and Denny Lee
    • Online courses on platforms like Coursera, Udacity, and edX
  • Community and Support
    • GitHub repositories for sample projects

Assessment

This is a pass or fail elective. You pass in one of two ways.

Option A: Certification. Pass the Databricks Certified Associate Developer for Apache Spark exam. This course is built to prepare you for it. The exam is optional and paid (about 200 USD), and a pass is a clear demonstration of mastery, so it earns a pass on its own.

Option B: Capstone plus one deliverable. Complete the capstone project on either the Data Science or Data Engineering track, together with one of these two pieces:

  • A cheat sheet that summarises the key concepts and methods from the course. Submit it as a Jupyter notebook, a Databricks notebook, or a PDF of at most two pages. If you submit a PDF, include the source notebook as well.
  • The AI-review exercise: fix the AI-generated pipeline and hand in the corrected notebook, the before and after run times, and a short diagnosis of each problem.

Option B costs nothing and is the standard route. Option A is there for anyone who also wants an industry credential.

Quizzes and the “try it yourself” exercises through the modules are there to help you learn and check your own understanding. They are not graded.

By the end of this material you will be able to answer the questions below:

  1. Describe the PySpark architecture.
  2. What are RDDs in PySpark?
  3. Explain the concept of lazy evaluation in PySpark.
  4. How does PySpark differ from Apache Hadoop?
  5. What are DataFrames in PySpark?
  6. How do you initialize a SparkSession?
  7. What is the significance of the SparkContext?
  8. Describe the types of transformations in PySpark.
  9. How do you read a CSV file into a PySpark DataFrame?
  10. What are actions in PySpark, and how do they differ from transformations?
  11. How can you filter rows in a DataFrame?
  12. Explain how to perform joins in PySpark.
  13. How do you aggregate data in PySpark?
  14. What are UDFs (User Defined Functions), and how are they used?
  15. How can you handle missing or null values in PySpark?
  16. How do you repartition a DataFrame, and why?
  17. Describe how to cache a DataFrame. Why is it useful?
  18. How do you save a DataFrame to a file?
  19. Explain the concept of partitioning in PySpark.
  20. How can broadcast variables improve performance?
  21. What are accumulators, and how are they used?
  22. How does PySpark handle data skewness?
  23. Explain how checkpointing works in PySpark.
  24. What is Delta Lake?
  25. What is the data lakehouse architecture?
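As a taste of question 3, lazy evaluation can be modelled in a few lines of plain Python. The LazyDataset class below is a hypothetical toy, not part of PySpark: its map and filter methods only record a plan, and nothing runs until the collect action is called. Spark's real planner does far more than this, so read it as a sketch of the idea only.

```python
class LazyDataset:
    """Toy stand-in for a Spark dataset: transformations only record a plan."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # tuple of ("map" | "filter", function) steps

    def map(self, fn):
        """Transformation: returns a new dataset with one more planned step."""
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, predicate):
        """Transformation: returns a new dataset with one more planned step."""
        return LazyDataset(self._source, self._plan + (("filter", predicate),))

    def collect(self):
        """Action: run the recorded plan over the source and return a list."""
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def double(x):
    calls.append(x)  # record each call so we can see when work happens
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: no transformation has run yet
result = ds.collect()             # the action triggers the whole plan
print(calls_before_action, result)
```

Building ds costs nothing, which is why a long chain of transformations in a notebook returns instantly and the time is spent only when an action asks for a result.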