PySpark - Module 5.3

While we experimented with setting up a Kafka service and accessing it programmatically, the solution provided in Module 5.2 does not leverage Spark's distributed computing capabilities: it relies on plain Python producers and consumers rather than integrating with Spark.

Integration of Kafka with Apache Spark Structured Streaming:

Reading Data from Kafka in Spark: Using PySpark for Streaming

  • Configure Kafka Consumer: Use PySpark’s spark.readStream method to configure a Kafka consumer. Specify the Kafka bootstrap servers and the topic you want to read from.

  • Stream Data Processing: Define the schema for the incoming data and apply any necessary transformations using PySpark DataFrame operations.

  • Start Streaming: Use writeStream to output the results to a desired sink, such as the console, HDFS, or another Kafka topic, and start the streaming process (a combined sketch of all three steps follows this list).
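
The three steps above might look like the following minimal sketch. The broker address (localhost:9092), the topic name (events), and the JSON schema are illustrative assumptions; adapt them to your own Kafka setup. Running it also requires the Spark Kafka connector (spark-sql-kafka), which is typically available on Databricks.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("KafkaReadExample").getOrCreate()

# Step 1: configure the Kafka consumer via spark.readStream
raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                         # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Step 2: define a schema for the message value and parse the raw bytes
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])
parsed = (
    raw_stream
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Step 3: write the results to a sink (the console here) and start the query
query = (
    parsed.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```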

Import this notebook to the same folder, rename it to Kafka-to-Delta Streaming Pipeline, and follow the instructions.

Writing Data to Kafka from Spark
How do you write processed data back to Kafka topics using PySpark? I encourage motivated students to study this topic on their own; a brief sketch follows as a starting point.
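
This minimal sketch assumes the `parsed` streaming DataFrame from the earlier example; the output topic, broker address, and checkpoint path are placeholders.

```python
from pyspark.sql.functions import to_json, struct, col

# The Kafka sink expects string/binary `key` and `value` columns, so each row
# is serialized to JSON before being written out.
output_query = (
    parsed
    .select(
        col("device_id").cast("string").alias("key"),   # message key (optional)
        to_json(struct("*")).alias("value"),            # full row as a JSON string
    )
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")          # placeholder broker
    .option("topic", "processed-events")                          # placeholder output topic
    .option("checkpointLocation", "/tmp/checkpoints/kafka-out")   # required by the Kafka sink
    .start()
)
```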

Windowed Aggregations: Performing windowed aggregations on streaming data from Kafka using PySpark

Import this notebook to the same folder and rename it to Windowed Aggregations on Streaming Data. Work through the content and try to understand what is going on there.
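
For reference, a minimal sketch of a windowed aggregation is shown below. It assumes the `parsed` streaming DataFrame from the earlier sketch, with `event_time` and `temperature` columns; the window length and watermark are illustrative choices.

```python
from pyspark.sql.functions import window, avg, col

windowed = (
    parsed
    .withWatermark("event_time", "10 minutes")    # tolerate up to 10 minutes of late data
    .groupBy(
        window(col("event_time"), "5 minutes"),   # 5-minute tumbling windows
        col("device_id"),
    )
    .agg(avg("temperature").alias("avg_temperature"))
)

# In append mode each window is emitted once the watermark passes its end.
(
    windowed.writeStream
    .format("console")
    .outputMode("append")
    .option("truncate", "false")
    .start()
)
```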

Assignment

Import this notebook to the same folder and look into the results. Can you explain the results you see in the windowed_aggregations table?

What's next? Module 5.4 (for the year 2025-2026) will be dedicated to best practices for Kafka in Databricks! Do you absolutely want to have it this year? Write Mohsen an email.