PySpark - Module 2

Spark Architecture and RDDs

A gentle intro to Spark Architecture

Watch the YouTube video below carefully.

The video above explores the architecture of Apache Spark, emphasizing the roles of the driver and executors in both local-mode and cluster-mode applications. The driver manages the application and can run either on the local machine or within the cluster. In client mode the driver operates locally, while in cluster mode it starts inside the Application Master container and requests containers from the cluster manager (such as Apache YARN) to launch executors. The video also explains how Spark can run in local mode, where everything runs in a single JVM, or in cluster mode, where the driver and executors communicate via the Application Master and resource allocation is handled by the cluster manager. A demonstration shows starting a Spark application in cluster mode with a specific number of executors and using YARN's dynamic allocation feature to release idle executors.
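
To make this concrete, here is a minimal sketch of a PySpark session configured for YARN with dynamic allocation. The application name and the executor and timeout values are illustrative assumptions, and the deploy mode itself (client vs. cluster) is selected at spark-submit time rather than in code.

    from pyspark.sql import SparkSession

    # The client/cluster deploy mode is chosen when the application is submitted
    # (e.g., spark-submit --deploy-mode cluster); the settings below could also
    # be passed there as --conf flags.
    spark = (
        SparkSession.builder
        .appName("module2-architecture-demo")                # illustrative name
        .master("yarn")                                      # cluster manager: Apache YARN
        .config("spark.executor.instances", "2")             # executors requested up front
        .config("spark.dynamicAllocation.enabled", "true")   # release idle executors
        .config("spark.dynamicAllocation.minExecutors", "1")
        .config("spark.dynamicAllocation.maxExecutors", "4")
        .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
        .config("spark.shuffle.service.enabled", "true")     # needed for dynamic allocation on YARN
        .getOrCreate()
    )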

The video below covers how Spark breaks applications down and executes them on the executors. It walks through the internal workings of Apache Spark, focusing on how code is processed in parallel: how Spark reads data, processes it, and writes results back to a destination. Three key data structures are presented, namely DataFrames, Datasets, and RDDs (Resilient Distributed Datasets). Although the emphasis is on DataFrames and Datasets, RDDs are crucial because they form the underlying structure to which DataFrames and Datasets are compiled. RDDs are resilient, partitioned, distributed, and immutable collections of data that can be created from a source or transformed from another RDD.
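
As a quick illustration of these ideas, the following sketch (assuming a local PySpark installation) creates one RDD from a source collection, derives a second RDD by transformation, and peeks at the RDD underneath a DataFrame:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("rdd-basics").getOrCreate()
    sc = spark.sparkContext

    # An RDD created from a source (here, an in-memory collection).
    numbers = sc.parallelize([1, 2, 3, 4, 5])

    # An RDD derived by transforming another RDD; the parent is untouched (immutability).
    squares = numbers.map(lambda x: x * x)

    # DataFrames compile down to RDDs; .rdd exposes the underlying RDD of Row objects.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
    print(df.rdd.getNumPartitions())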

The video demonstrates how to create an RDD from a file and control the number of partitions. It explains the difference between transformations and actions in Spark: transformations are lazy operations that create new distributed datasets without sending results back to the driver, while actions are non-lazy and return results to the driver. It also covers how Spark handles grouping and counting operations on distributed data, which involves repartitioning the data and triggers shuffle and sort activity. The number of parallel tasks in Spark is influenced by the number of partitions and the available executors. The concept of functional programming in Spark is briefly touched on as well.
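
The word-count pattern below is a minimal sketch of those points (the file name data.txt is a placeholder): the partition count is set when the RDD is created, the transformations build up lazily, and the grouping step shuffles data only when an action finally runs.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.master("local[*]").appName("partitions-demo").getOrCreate().sparkContext

    lines = sc.textFile("data.txt", minPartitions=4)   # control the partition count at creation
    print(lines.getNumPartitions())

    words = lines.flatMap(lambda line: line.split())   # transformation: lazy, nothing runs yet
    pairs = words.map(lambda w: (w, 1))                # still lazy
    counts = pairs.reduceByKey(lambda a, b: a + b)     # grouping repartitions and shuffles data

    print(counts.take(5))   # action: tasks are scheduled now, results return to the driver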

RDDs

We have seen RDDs and their properties. If you need a refresher, watch the YouTube video below.

As we have seen, RDDs are closely linked to the concepts of lazy evaluation and resilience. When you are done, download 01-Creating-RDDs.ipynb from this link, follow the instructions, and run it locally. Make sure you can run all of the cells. The notebook has sufficient comments to give you hints about the code.
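
One way to see lazy evaluation for yourself (a small sketch under the same local setup as above) is to hide an error inside a transformation; nothing fails until an action forces the computation:

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate().sparkContext

    rdd = sc.parallelize(range(10))
    risky = rdd.map(lambda x: 1 / (x - 5))   # returns immediately: no computation happens here

    # The ZeroDivisionError surfaces only when an action forces evaluation:
    # risky.collect()   # raises here, not at the map() call above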

The HTML version of the notebook is available here online.

Afterwards, download 02-RDD-Lineage-n-Persistence.ipynb from this link and run it locally. The HTML version of the notebook is available here online for immediate access.
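
For orientation before running the notebook, here is a minimal sketch of lineage and persistence (assuming the same local setup): toDebugString shows the recorded lineage that makes RDDs resilient, and cache() keeps computed partitions around for reuse.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.master("local[*]").appName("lineage-demo").getOrCreate().sparkContext

    rdd = sc.parallelize(range(1_000_000))
    evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)

    # Lineage: Spark records how 'evens' was derived, so lost partitions can be recomputed.
    print(evens.toDebugString().decode())

    # Persistence: cache the RDD so later actions reuse it instead of replaying the lineage.
    evens.cache()
    print(evens.count())   # first action computes and caches the partitions
    print(evens.count())   # second action reads from the cache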

Transformations and Actions in Spark

Now continue with the YouTube video below, entitled “Spark RDD Transformations and Actions”. The video gives you a sample of various transformations and actions in Spark.
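
As a preview (a small sketch with made-up data), the snippet below lines up a few common transformations against a few common actions; only the actions actually trigger computation and ship results to the driver.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.master("local[*]").appName("trans-actions").getOrCreate().sparkContext

    data = sc.parallelize([3, 1, 2, 3, 4])

    # Transformations (lazy): each returns a new RDD.
    doubled = data.map(lambda x: x * 2)
    even = data.filter(lambda x: x % 2 == 0)
    unique = data.distinct()

    # Actions (non-lazy): each sends results back to the driver.
    print(doubled.collect())                 # [6, 2, 4, 6, 8]
    print(even.count())                      # 2
    print(unique.take(3))                    # three distinct values (order not guaranteed)
    print(data.reduce(lambda a, b: a + b))   # 13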

Download 03-Transformations-n-Actions.ipynb from this link and run it locally. The HTML version of the notebook is available here online for immediate access.