PySpark - Module 2: Spark Architecture and RDDs

Spark Architecture and RDDs

A gentle intro to Spark Architecture

Watch the YouTube video below carefully.

The video above explains the architecture of Apache Spark, emphasizing the roles of the driver and the executors in both local and cluster mode applications. The driver manages the application and can run on the local machine or within a cluster. In client mode, the driver runs locally; in cluster mode, it starts in the Application Master container and requests containers from the cluster manager (such as Apache YARN) to launch executors. The video shows how Spark can run in local mode, where everything runs in a single JVM, or in cluster mode, where the driver and executors communicate via the Application Master and the cluster manager handles resource allocation. A demonstration starts a Spark application in cluster mode with a specific number of executors and uses dynamic allocation on YARN to release idle executors.
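As a rough illustration of the cluster-mode launch described above, the sketch below assembles a `spark-submit` command for YARN. The helper `build_spark_submit` and the script name `my_app.py` are hypothetical; the flags themselves (`--master`, `--deploy-mode`, `--num-executors`, and the `spark.dynamicAllocation.enabled` setting) are standard `spark-submit` options. A real cluster usually needs further settings for dynamic allocation, such as a shuffle service, so treat this as a starting point only.

```python
import shlex


def build_spark_submit(app_path, deploy_mode="cluster",
                       num_executors=None, dynamic_allocation=False):
    """Return the spark-submit argument list for an application on YARN.

    Hypothetical helper for illustration; it only builds the command,
    it does not run it.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")

    # client  -> the driver runs on the machine that submits the job
    # cluster -> the driver runs inside the Application Master container
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]

    if num_executors is not None:
        # Ask for a fixed number of executors up front.
        args += ["--num-executors", str(num_executors)]

    if dynamic_allocation:
        # Let Spark request and release executors as the workload changes.
        args += ["--conf", "spark.dynamicAllocation.enabled=true"]

    args.append(app_path)
    return args


# "my_app.py" is a placeholder script name.
cmd = build_spark_submit("my_app.py", deploy_mode="cluster", num_executors=4)
print(shlex.join(cmd))
```

Switching `deploy_mode` to `"client"` is the only change needed to keep the driver on your own machine while the executors still run on the cluster.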

The video below covers how Spark breaks an application down and executes it on the executors. It looks at the internal workings of Apache Spark, focusing on how it processes code in parallel: how Spark reads data, processes it, and writes the results back to a destination. Three key data structures are presented: DataFrames, Datasets, and RDDs (Resilient Distributed Datasets). Although the emphasis is on DataFrames and Datasets, RDDs are crucial because they form the underlying structure to which DataFrames and Datasets are compiled. RDDs are resilient, partitioned, distributed, and immutable collections of data that can be created from a source or transformed from another RDD.

The video demonstrates how to create an RDD from a file and control the number of partitions. It explains the difference between transformations and actions in Spark: transformations are lazy operations that define new distributed datasets without sending results back to the driver, while actions are eager, triggering execution and sending results back to the driver. It also touches on how Spark handles grouping and counting operations on distributed data, which involves repartitioning the data and triggers shuffle and sort activity. The number of parallel tasks in Spark depends on the number of partitions and the available executors. The concept of functional programming in Spark is briefly touched on as well.
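The difference between lazy transformations and eager actions can be pictured with a small single-process model. `ToyRDD` below is a hypothetical stand-in written in plain Python, not the PySpark class; its method names mirror the real `parallelize`, `map`, `filter`, and `collect`, but nothing here is distributed. It is a sketch of the idea only.

```python
class ToyRDD:
    """A tiny, single-process stand-in for an RDD (not the PySpark class)."""

    def __init__(self, partitions, steps=()):
        self._partitions = partitions   # list of lists; never mutated
        self._steps = tuple(steps)      # recorded transformations, not yet run

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = list(data)
        size = -(-len(data) // num_partitions)  # ceiling division
        # Cut the data into contiguous slices, one per partition.
        return cls([data[i * size:(i + 1) * size]
                    for i in range(num_partitions)])

    @property
    def num_partitions(self):
        return len(self._partitions)

    # Transformations: record the step and return a NEW ToyRDD (lazy).
    def map(self, fn):
        return ToyRDD(self._partitions, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._steps + (("filter", pred),))

    def _run_partition(self, part):
        # Apply every recorded step to one partition, in order.
        for kind, fn in self._steps:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    # Action: runs the recorded steps and returns results to the caller.
    def collect(self):
        out = []
        for part in self._partitions:   # in Spark, one task per partition
            out.extend(self._run_partition(part))
        return out


calls = []


def square(x):
    calls.append(x)   # lets us observe when the function really runs
    return x * x


rdd = ToyRDD.parallelize(range(1, 11), num_partitions=3)
evens = rdd.map(square).filter(lambda x: x % 2 == 0)

calls_before = len(calls)   # 0: the transformations have not run yet
result = evens.collect()    # the action triggers the work
print(calls_before, result, len(calls))
```

In this model `square` is not called until `collect()` runs, and the original `rdd` is left untouched by `map` and `filter`, which is the immutability property mentioned above.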

RDDs

We have seen RDDs and their properties. If you need a refresher, watch the YouTube video below.

The RDD notebooks in this module say “run it locally” on purpose. The RDD API needs sparkContext, which Databricks Free Edition serverless does not expose, because serverless runs on Spark Connect. So run these on a local Spark install or on classic Databricks compute. Everywhere else in the course, the DataFrame API and Spark SQL run fine on Free Edition, and that is what you will use day to day. RDDs are here so you understand what DataFrames compile down to, not because you will write them often.

As we have seen, RDDs are closely linked to the concepts of lazy evaluation and resilience. When you are done, download 01-Creating-RDDs.ipynb from this link, follow the instructions, and run it locally. Make sure you can run all the cells. The notebook contains comments that should give you some hints about the code.

The HTML version of the notebook is available here online.

Afterwards, download 02-RDD-Lineage-n-Persistence.ipynb from this link and run it locally. The HTML version of the notebook is available here online for immediate access.
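Lineage and persistence, the topics of the second notebook, can be sketched in plain Python before you open it. `LineageNode` below is a hypothetical toy, not a PySpark class: each node remembers its parent and how to recompute itself, which is the idea behind RDD resilience. In PySpark the corresponding calls are `rdd.persist()` or `rdd.cache()` and `rdd.toDebugString()`; the notebook may present them differently.

```python
class LineageNode:
    """One step in a toy lineage graph (hypothetical, not a PySpark class)."""

    def __init__(self, name, compute, parent=None):
        self.name = name
        self._compute = compute   # function: parent data -> this node's data
        self.parent = parent
        self._persist = False
        self._cached = None
        self.runs = 0             # how many times this node was computed

    def persist(self):
        # Ask for the result to be kept after the next computation.
        self._persist = True
        return self

    def drop_cache(self):
        # Simulates losing the cached data, e.g. an executor going away.
        self._cached = None

    def lineage(self):
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " <- ".join(names)

    def evaluate(self):
        if self._cached is not None:
            return self._cached
        # No cached copy: rebuild from the parent by following the lineage.
        self.runs += 1
        parent_data = self.parent.evaluate() if self.parent else None
        data = self._compute(parent_data)
        if self._persist:
            self._cached = data
        return data


source = LineageNode("source", lambda _: list(range(5)))
doubled = LineageNode("doubled", lambda rows: [r * 2 for r in rows],
                      parent=source)

doubled.evaluate()
doubled.evaluate()
runs_without_persist = source.runs   # the source was recomputed each time

doubled.persist()
doubled.evaluate()                   # computes once and keeps the result
doubled.evaluate()                   # served from the cached copy
runs_with_persist = source.runs

doubled.drop_cache()                 # lose the cached data ...
recovered = doubled.evaluate()       # ... and rebuild it from the lineage
print(doubled.lineage(), runs_without_persist, runs_with_persist, recovered)
```

Without persistence every evaluation walks the whole lineage again; with it, the work is done once, and a lost copy can still be rebuilt from the recorded steps.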

Transformations and Actions in Spark

Now watch the YouTube video below, entitled “Spark RDD Transformations and Actions”. It gives you a sample of the various transformations and actions in Spark.

Download 03-Transformations-n-Actions.ipynb from this link and run it locally. The HTML version of the notebook is available here online for immediate access.

Spark Connect

Everything above describes the classic setup: your Python process runs inside the Spark driver, right next to the JVM. Newer Spark, and all of Databricks serverless, runs a different way called Spark Connect.

With Spark Connect your code is a thin client. When you build a DataFrame, the client turns it into a query plan and sends that plan over the network (gRPC) to a Spark server, which runs it and sends the results back. The client never holds the JVM driver itself. This is what lets a platform run your notebook on shared serverless compute and still keep each user’s code isolated and governed.

For the work you do most, nothing changes. DataFrames, Spark SQL, and Structured Streaming all build query plans, so they run the same over Spark Connect. What you lose is anything that reaches straight into the JVM: sparkContext, the low-level RDD API, and libraries that need JVM access. That is exactly why the RDD notebooks in this module run locally, and why H2O Sparkling Water needs classic compute.

One behaviour to know for the certification: Spark Connect defers analysis and name resolution to execution time. A mistake the classic API would catch as you write the line may not surface until you run an action.
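The thin-client idea and the deferred name resolution can be pictured with a toy model. Everything below is hypothetical and written in plain Python: `ToyServer`, `ToyClientFrame`, and `ToyAnalysisError` are invented names, and JSON stands in for the real wire format (Spark Connect sends its plans over gRPC, not as JSON). The sketch only shows where the mistake surfaces.

```python
import json


class ToyAnalysisError(Exception):
    """Raised by the toy server when a column name cannot be resolved."""


class ToyServer:
    """Stands in for the Spark server: owns the data and resolves names."""

    def __init__(self, tables):
        self._tables = tables   # table name -> list of row dicts

    def execute(self, plan_json):
        plan = json.loads(plan_json)   # the plan arrives as text, not objects
        rows = self._tables[plan["table"]]
        for op in plan["ops"]:
            col = op["column"]
            # Name resolution happens here, on the server, at execution time.
            if rows and col not in rows[0]:
                raise ToyAnalysisError(f"unresolved column: {col}")
            if op["kind"] == "select":
                rows = [{col: r[col]} for r in rows]
        return rows


class ToyClientFrame:
    """Thin client: only describes the work, never touches the data."""

    def __init__(self, server, table, ops=()):
        self._server = server
        self._table = table
        self._ops = tuple(ops)

    def select(self, column):
        # Builds a bigger plan; nothing is sent or checked yet.
        op = {"kind": "select", "column": column}
        return ToyClientFrame(self._server, self._table, self._ops + (op,))

    def collect(self):
        # The action serializes the plan and ships it to the server.
        plan_json = json.dumps({"table": self._table, "ops": list(self._ops)})
        return self._server.execute(plan_json)


server = ToyServer({"people": [{"name": "Ada", "age": 36},
                               {"name": "Linus", "age": 54}]})

good = ToyClientFrame(server, "people").select("name")
bad = ToyClientFrame(server, "people").select("nmae")   # typo, no error yet

names = good.collect()
try:
    bad.collect()
    error = None
except ToyAnalysisError as exc:
    error = str(exc)   # the typo only surfaces when the action runs

print(names, error)
```

The misspelled column is accepted when the plan is built and is only rejected once `collect()` sends the plan to the server, which mirrors the behaviour described above.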

You can always tell which mode you are on:

print(type(spark).__module__)
# pyspark.sql.connect.session  -> Spark Connect (serverless)
# pyspark.sql.session          -> classic
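If you want to branch on the mode inside a notebook, the same check can be wrapped in a small helper. `session_mode` is a hypothetical function name, and the two stand-in classes exist only so the sketch runs without Spark installed; in a real notebook you would pass the `spark` object instead.

```python
def session_mode(session):
    """Return "connect", "classic", or "unknown" for a session object.

    Hypothetical helper: it looks only at the module that defines the
    session's class, as in the print statement above.
    """
    module = type(session).__module__
    if module.startswith("pyspark.sql.connect"):
        return "connect"
    if module.startswith("pyspark.sql"):
        return "classic"
    return "unknown"


# Stand-in classes that imitate the two module paths (no Spark needed).
FakeConnect = type("SparkSession", (),
                   {"__module__": "pyspark.sql.connect.session"})
FakeClassic = type("SparkSession", (),
                   {"__module__": "pyspark.sql.session"})

connect_mode = session_mode(FakeConnect())
classic_mode = session_mode(FakeClassic())
other_mode = session_mode(object())
print(connect_mode, classic_mode, other_mode)
```

A notebook could use the result to skip RDD cells when it returns `"connect"`, since sparkContext is not available there.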

Previous: Module 1: Big Data and Apache Spark | Next: Module 3: DataFrames and Spark SQL