PySpark - Module 1-1

In this guide, we’ll walk you through the steps to set up Apache Spark, Python, and Jupyter Notebook on a freshly installed WSL Ubuntu 22.04. You’ll also learn how to make Jupyter Notebook accessible from the host machine.

Prerequisites

  1. A Windows machine with WSL installed.
  2. WSL set to run Ubuntu 22.04.
  3. Basic knowledge of command-line operations.

Let’s get started!


Step 1: Install Python and Pip

Ubuntu 22.04 on WSL ships with Python 3.10 pre-installed, but pip is usually missing. Update the system packages to their latest versions and install pip:

sudo apt update && sudo apt upgrade -y
sudo apt install python3-pip -y

Verify that Python and pip are installed correctly:

python3 --version
pip3 --version
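
The same check can be done from inside Python, which is handy later in a notebook. This is a minimal sketch, assuming Python 3.8 as the minimum version for Spark 3.5.x; the helper name python_is_supported is made up for illustration:

```python
import sys

# Assumption: Spark 3.5.x documents Python 3.8 as its minimum version.
MIN_PYTHON = (3, 8)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # sys.version starts with the dotted version number, e.g. "3.10.12 (main, ...)"
    print("Python", sys.version.split()[0], "supported:", python_is_supported())
```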

Step 2: Install Java

Apache Spark requires Java to run. We’ll install OpenJDK:

sudo apt install openjdk-11-jdk -y

Verify the Java installation:

java -version

You should see output indicating that Java 11 is installed.
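
If you would rather check the Java version programmatically, the sketch below parses the banner that java -version prints. The function names are hypothetical, and the regular expression assumes the usual banner format (for example: openjdk version "11.0.24"):

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_392").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Run `java -version`; the JVM writes its version banner to stderr."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return None  # java is not on PATH
    return parse_java_major(result.stderr)


if __name__ == "__main__":
    print("Java major version:", installed_java_major())
```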

Step 3: Download and Install Apache Spark

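Now, let’s download and install Apache Spark. We’ll use Spark 3.5.2 pre-built for Hadoop 3. The downloads.apache.org site only carries current releases, so if the command below returns a 404 error, fetch the same file from https://archive.apache.org/dist/spark/spark-3.5.2/ instead: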

wget https://downloads.apache.org/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz

Extract the downloaded file and move it to /opt/spark:

tar xvf spark-3.5.2-bin-hadoop3.tgz
sudo mv spark-3.5.2-bin-hadoop3 /opt/spark
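
If you later want a different Spark version, the download URL follows a predictable pattern. The helper below is a small sketch of that pattern; spark_download_urls is a made-up name, and the assumption that the Apache archive uses the same directory layout should be checked for the version you pick:

```python
def spark_download_urls(version="3.5.2", hadoop="3"):
    """Return (current-release URL, archive URL) for a Spark binary tarball."""
    filename = f"spark-{version}-bin-hadoop{hadoop}.tgz"
    primary = f"https://downloads.apache.org/spark/spark-{version}/{filename}"
    # Assumption: older releases are kept in the Apache archive, same layout.
    archive = f"https://archive.apache.org/dist/spark/spark-{version}/{filename}"
    return primary, archive


if __name__ == "__main__":
    for url in spark_download_urls():
        print(url)
```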

Step 4: Set Up Environment Variables

To make Spark available globally, we need to set up the necessary environment variables. Open your .bashrc file:

vim ~/.bashrc

Add the following lines at the end of the file:

# Spark environment variables
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save and exit the file. To load the changes, run source ~/.bashrc, or close the WSL terminal and open it again.

To verify that Spark is set up correctly, run:

spark-shell

This should launch the Spark shell, a Scala REPL with a scala> prompt. Type :quit to exit.
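
If spark-shell is not found, the environment variables are the first thing to check. The following sketch inspects SPARK_HOME and PATH from Python; check_spark_home is a hypothetical helper, and it compares PATH entries as plain strings without resolving symlinks:

```python
import os
from pathlib import Path


def check_spark_home(env=None):
    """Return a list of problems found with SPARK_HOME (empty means OK)."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    problems = []
    home = Path(spark_home)
    # spark-submit ships in the bin directory of every Spark binary release.
    if not (home / "bin" / "spark-submit").is_file():
        problems.append(f"{home}/bin/spark-submit not found")
    # PATH entries are compared as plain strings; symlinks are not resolved.
    path_entries = env.get("PATH", "").split(os.pathsep)
    if str(home / "bin") not in path_entries:
        problems.append(f"{home}/bin is not on PATH")
    return problems


if __name__ == "__main__":
    issues = check_spark_home()
    print("OK" if not issues else "\n".join(issues))
```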

Step 5: Install Jupyter Notebook

Now that we have Python and Spark installed, we can set up Jupyter Notebook:

pip3 install notebook

Step 6: Configure Jupyter Notebook for Remote Access

To access Jupyter Notebook from the host machine, you’ll need to set up a configuration that allows remote access.

Generate the Jupyter Notebook configuration file:

jupyter notebook --generate-config

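If the jupyter command is not found, pip most likely installed it into ~/.local/bin, which Ubuntu adds to PATH only when a new login shell starts; close and reopen the WSL terminal, then rerun the command. Now, edit the config file: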

vim ~/.jupyter/jupyter_notebook_config.py

Add the following lines. Binding to 0.0.0.0 makes Jupyter listen on all network interfaces rather than only on localhost:

c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888

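To protect Jupyter Notebook with a password, generate a hashed password with the following command. On Notebook 7 and later the passwd helper lives in jupyter_server.auth; on Notebook 6 and earlier, import it from notebook.auth instead:

python3 -c "from jupyter_server.auth import passwd; print(passwd())"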

Copy the output and add it to your Jupyter configuration file as follows:

c.NotebookApp.password = '<your-hashed-password>'

Save the changes.
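
The NotebookApp option names used in this step come from the classic Notebook. Notebook 7 and later run on Jupyter Server, which exposes the same settings under ServerApp. Below is a sketch of the equivalent configuration, assuming Jupyter Server 2.x; the deprecation warnings Jupyter prints at startup show which names your installed version expects:

```python
# Configuration fragment, not a standalone script: Jupyter provides
# get_config() when it loads the config file.
c = get_config()  # noqa: F821

# Listen on all interfaces so the Windows host can reach the server.
c.ServerApp.ip = "0.0.0.0"
# WSL has no desktop browser to open, so skip the automatic launch.
c.ServerApp.open_browser = False
c.ServerApp.port = 8888

# Assumption: Jupyter Server 2.x, where the hashed password is set on
# PasswordIdentityProvider rather than on the app itself.
c.PasswordIdentityProvider.hashed_password = "<your-hashed-password>"
```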

Step 7: Launch Jupyter Notebook

Start Jupyter Notebook using the following command:

jupyter notebook

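You can now access Jupyter Notebook from your host machine by visiting http://localhost:8888 in your browser. If you did not set a password, Jupyter asks for a login token; copy it from the URL printed in the terminal where you started the server.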

Step 8: Verify the PySpark Setup

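To confirm that PySpark works, create a new notebook in Jupyter and run the following code. The pyspark package is pinned to 3.5.2 so that it matches the Spark installation in /opt/spark and runs on Java 11:

# Install the PySpark package, pinned to the Spark version installed above
!pip install pyspark==3.5.2
import pyspark
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("Test").getOrCreate()
# Build a two-row DataFrame and print it
df = spark.createDataFrame([(1, 'foo'), (2, 'bar')], ['id', 'value'])
df.show()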

If everything is set up correctly, you should see a small DataFrame printed in the notebook:

+---+-----+
| id|value|
+---+-----+
|  1|  foo|
|  2|  bar|
+---+-----+
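
If the session fails to start, a version mismatch between the pyspark package and the Spark installation is one possible cause. The sketch below compares the two version strings; the helper names are illustrative, and matching on major.minor is an assumption about what counts as compatible:

```python
def major_minor(version):
    """Return (major, minor) as integers from a version string like '3.5.2'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_compatible(pyspark_version, spark_version):
    """True when the Python package and the Spark runtime share major.minor."""
    return major_minor(pyspark_version) == major_minor(spark_version)


# In the notebook, once a SparkSession named `spark` exists, one might run:
#   import pyspark
#   versions_compatible(pyspark.__version__, spark.version)
```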

Troubleshooting

  • If Jupyter Notebook does not start or is not accessible, ensure that your firewall allows access to port 8888.
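  • If http://localhost:8888 does not respond, localhost forwarding between WSL and the Windows host may not be working. Run hostname -I inside WSL and use the first address it prints in place of localhost.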
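
To tell a networking problem apart from a Jupyter problem, you can test whether anything is listening on the port at all. This is a minimal sketch using only the standard library; port_is_open is a hypothetical helper, and the defaults assume the port 8888 configured in Step 6:

```python
import socket


def port_is_open(host="localhost", port=8888, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    print("Jupyter port reachable:", port_is_open())
```

Run it inside WSL first. If it reports True there but the browser on Windows still cannot connect, the issue is more likely in the forwarding between WSL and the host than in Jupyter itself.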