PySpark - Module 1-1
In this guide, we’ll walk you through the steps to set up Apache Spark, Python, and Jupyter Notebook on a freshly installed WSL Ubuntu 22.04. You’ll also learn how to make Jupyter Notebook accessible from the host machine.
Prerequisites
- A Windows machine with WSL installed.
- WSL set to run Ubuntu 22.04.
- Basic knowledge of command-line operations.
Let’s get started!
Step 1: Install Python and Pip
Ubuntu 22.04 on WSL comes with Python 3 pre-installed, but pip may not be. Update your packages to get the latest versions available in the Ubuntu repositories, then install pip:
sudo apt update && sudo apt upgrade -y
sudo apt install python3-pip -y
Verify that Python and pip are installed correctly:
python3 --version
pip3 --version
Step 2: Install Java
Apache Spark requires Java to run. We’ll install OpenJDK:
sudo apt install openjdk-11-jdk -y
Verify the Java installation:
java -version
You should see output indicating that Java 11 is installed.
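The same check can be scripted. The following is a minimal sketch, assuming the usual first line of java -version output (for example: openjdk version "11.0.x"); the helper names java_major_version and installed_java_major are hypothetical and not part of any tool.

```python
import re
import subprocess


def java_major_version(version_output: str) -> int:
    """Return the major Java version found in `java -version` output.

    Assumes a first line such as 'openjdk version "11.0.24" 2024-07-16'.
    Legacy releases report themselves as "1.8.0_x", which maps to major 8.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a version string in the output")
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> int:
    """Run `java -version` and parse it (the JVM prints this to stderr)."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return java_major_version(result.stderr or result.stdout)
```

Calling installed_java_major() after the installation above should return 11 if OpenJDK 11 is the active Java.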
Step 3: Download and Install Apache Spark
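Now, let’s download and install Apache Spark. We’ll use Spark 3.5.2 built for Hadoop 3. Note that downloads.apache.org only hosts current releases; if the link below returns a 404 error, the same file should be available under https://archive.apache.org/dist/spark/spark-3.5.2/: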
wget https://downloads.apache.org/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz
Extract the downloaded file:
tar xvf spark-3.5.2-bin-hadoop3.tgz
sudo mv spark-3.5.2-bin-hadoop3 /opt/spark
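If extraction fails, the download may be incomplete. One way to check, sketched here under the assumption that you have copied the SHA-512 digest published next to the archive on the Apache download site, is to hash the file locally and compare. The helper names sha512_of_file and matches_published_digest are made up for illustration.

```python
import hashlib


def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, published: str) -> bool:
    """Compare a file's digest with a published hex digest.

    Case and whitespace in the published value are ignored, since
    digests are often printed in upper case or split across lines.
    """
    cleaned = "".join(published.split()).lower()
    return sha512_of_file(path) == cleaned
```

Pass only the hexadecimal digest as the second argument, not the file name that may accompany it in the published checksum file.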
Step 4: Set Up Environment Variables
To make Spark available globally, we need to set up the necessary environment variables. Open your .bashrc file:
vim ~/.bashrc
Add the following lines at the end of the file:
# Spark environment variables
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save and exit the file. To load the changes, run source ~/.bashrc, or close and reopen the WSL terminal.
To verify that Spark is set up correctly, run:
spark-shell
This should launch the Spark shell, which is a Scala prompt. Type :quit to exit it.
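If the spark-shell command is not found, the environment variables are the usual suspect. The following small sketch reports what is missing; the function name spark_env_problems is hypothetical, and it only checks the two variables set in this step.

```python
import os


def spark_env_problems(env: dict) -> list:
    """Return a list of human-readable problems with the Spark environment."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    path_entries = env.get("PATH", "").split(os.pathsep)
    # Both directories are appended to PATH in the .bashrc lines above.
    for sub in ("bin", "sbin"):
        expected = os.path.join(spark_home, sub)
        if expected not in path_entries:
            problems.append(f"{expected} is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in spark_env_problems(dict(os.environ)) or ["no problems found"]:
        print(problem)
```

Run it from the same terminal in which spark-shell failed, so that it sees the same environment.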
Step 5: Install Jupyter Notebook
Now that we have Python and Spark installed, we can set up Jupyter Notebook:
pip3 install notebook
Step 6: Configure Jupyter Notebook for Remote Access
To access Jupyter Notebook from the host machine, you’ll need to set up a configuration that allows remote access.
Generate the Jupyter Notebook configuration file:
jupyter notebook --generate-config
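Now edit the config file. If the jupyter command was not found in the previous step, close and reopen the WSL terminal so that ~/.local/bin, where pip installs user-level commands, is picked up on your PATH, then generate the config again: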
vim ~/.jupyter/jupyter_notebook_config.py
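Add the following lines. The c.NotebookApp names below are the ones used by Notebook 6; Notebook 7 runs on Jupyter Server, where the equivalent options are c.ServerApp.ip, c.ServerApp.open_browser, and c.ServerApp.port: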
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
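If you want to add password protection for your Jupyter Notebook, you can generate a hashed password using the following command (on Notebook 7, where the notebook.auth module is no longer available, import passwd from jupyter_server.auth instead):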
python3 -c "from notebook.auth import passwd; print(passwd())"
Copy the output and add it to your Jupyter configuration file as follows:
c.NotebookApp.password = '<your-hashed-password>'
Save the changes.
Step 7: Launch Jupyter Notebook
Start Jupyter Notebook using the following command:
jupyter notebook
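You can now access Jupyter Notebook from your host machine by visiting http://localhost:8888 in your browser. If you did not set a password, use the full URL, including the token, that Jupyter prints in the terminal at startup.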
Step 8: Verify the PySpark Setup
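To ensure PySpark is set up correctly, create a new notebook in Jupyter and run the following code. The pyspark package is pinned to the same version as the Spark installation in /opt/spark so that the two do not conflict:
# Install the PySpark Python package, matching the Spark 3.5.2 installation
!pip install pyspark==3.5.2
import pyspark
from pyspark.sql import SparkSession
# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("Test").getOrCreate()
# Build a two-row DataFrame and print it
df = spark.createDataFrame([(1, 'foo'), (2, 'bar')], ['id', 'value'])
df.show()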
If everything is set up correctly, you should see a small DataFrame printed in the notebook.
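One possible cause of failures at this step is a version mismatch between the pip-installed pyspark package and the Spark distribution under SPARK_HOME. The sketch below assumes that the RELEASE file at the top of the distribution starts with a line such as "Spark 3.5.2 ... built for Hadoop ..."; treat that format as an assumption and check your own file. The helper names are hypothetical.

```python
import re


def spark_version_from_release(release_text: str) -> str:
    """Extract the version from the first line of a Spark RELEASE file."""
    lines = release_text.splitlines()
    first_line = lines[0] if lines else ""
    match = re.match(r"Spark (\d+\.\d+\.\d+)", first_line)
    if match is None:
        raise ValueError("unrecognized RELEASE file format")
    return match.group(1)


def same_minor_version(version_a: str, version_b: str) -> bool:
    """True when two versions share major and minor numbers, e.g. 3.5.x."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]
```

In a notebook you could read /opt/spark/RELEASE, pass its contents to spark_version_from_release, and compare the result with pyspark.__version__ using same_minor_version.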
Troubleshooting
- If Jupyter Notebook does not start or is not accessible, ensure that your firewall allows access to port 8888.
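- If http://localhost:8888 is unreachable from Windows, make sure networking between WSL and your Windows host is working. Restarting WSL by running wsl --shutdown from a Windows prompt and then reopening Ubuntu may restore it.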
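To tell a networking problem from a Jupyter problem, it helps to test whether anything is listening on the port at all. This is a minimal sketch using only the Python standard library; the function name port_is_open is hypothetical, and the default host and port simply mirror the configuration used in this guide.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 8888, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run it inside WSL first. If it returns True there but the browser on Windows still cannot connect, the problem most likely lies between Windows and WSL rather than in Jupyter itself.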