PySpark - Module 4

Data Sources and Sinks

By now, you should have a foundational understanding of Spark and its key concepts, including RDDs and lazy evaluation. We also learned how to use PySpark in Databricks to create and manipulate Spark DataFrames and to query them with Spark SQL. We performed various DataFrame operations such as selecting, filtering, and joining data. Additionally, we explored ways to handle semi-structured data.

In this module we will study how to read and write CSV, JSON, Parquet, and other formats. We will then continue by establishing JDBC connections for reading from and writing to databases.

Reading and writing CSV, JSON, Parquet, and other formats

First, watch the video below.

Now clone this Databricks notebook into your workspace. Put it into a new folder, say Module 4, and rename it to 01 - Reading and writing from-into for consistency. Follow the steps and cells; the material should be self-explanatory. A short sketch of the core read/write patterns is shown below.
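
To preview what the notebook covers, here is a minimal sketch of reading and writing the main file formats with the DataFrameReader/DataFrameWriter API. The file paths are hypothetical placeholders, not paths from the notebook; replace them with locations in your own workspace.

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession already exists as `spark`; this line is for running elsewhere
spark = SparkSession.builder.appName("read-write-formats").getOrCreate()

# Read a CSV file, treating the first row as a header and inferring column types
df_csv = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("/tmp/data/customers.csv"))          # hypothetical path

# Read a JSON file (by default, one JSON object per line)
df_json = spark.read.json("/tmp/data/events.json")  # hypothetical path

# Read a Parquet file (the schema is stored inside the file, so nothing to infer)
df_parquet = spark.read.parquet("/tmp/data/sales.parquet")  # hypothetical path

# Write the CSV DataFrame back out as Parquet, overwriting any previous output
(df_csv.write
 .mode("overwrite")
 .parquet("/tmp/output/customers_parquet"))          # hypothetical path
```

Note how the reader and writer use the same builder pattern: you pick the format, set options, and point at a path. Parquet is usually the preferred output format in Spark because it is columnar and keeps its schema.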

Reading from and writing to databases using JDBC

Watch the YouTube video below.

Now clone this Databricks notebook into your workspace. Rename it to 02 - JDBC connector for consistency. Follow the steps and cells; the material should be self-explanatory. A small example of the JDBC read/write pattern follows.
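
As a quick orientation before the notebook, here is a minimal sketch of reading a table from a relational database and writing a DataFrame back over JDBC. The URL, table names, credentials, and driver class are hypothetical assumptions (a PostgreSQL database is assumed); substitute the details of your own database, and make sure the matching JDBC driver is installed on the cluster.

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession already exists as `spark`; this line is for running elsewhere
spark = SparkSession.builder.appName("jdbc-example").getOrCreate()

# Hypothetical connection details -- replace with your own database settings
jdbc_url = "jdbc:postgresql://dbserver.example.com:5432/shop"
connection_props = {
    "user": "spark_user",
    "password": "spark_password",
    "driver": "org.postgresql.Driver",
}

# Read an entire table into a DataFrame
orders_df = spark.read.jdbc(url=jdbc_url,
                            table="public.orders",
                            properties=connection_props)

# Write a DataFrame back to the database, appending to an existing table
(orders_df.write
 .mode("append")
 .jdbc(url=jdbc_url,
       table="public.orders_copy",
       properties=connection_props))
```

In practice you would avoid hard-coding credentials in a notebook (for example by using Databricks secrets), and for large tables you would partition the read so it does not run on a single connection; the notebook walks through these options.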