PySpark - Module 5.3: Kafka and Structured Streaming | Your Data Science Mentor

PySpark - Module 5.3: Kafka and Structured Streaming

While we experimented with setting up a Kafka service and accessing it programmatically, the solution provided in Module 5.2 does not leverage the capabilities of Spark and distributed computing as it focuses primarily on Python-based producers and consumers, without integrating Spark’s distributed computing power.

Integration of Kafka with Apache Spark Structured Streaming:

Reading Data from Kafka in Spark: Using PySpark for Streaming

Configure Kafka Consumer: Use PySpark’s spark.readStream method to configure a Kafka consumer. Specify the Kafka bootstrap servers and the topic you want to read from.
Stream Data Processing: Define the schema for the incoming data and apply any necessary transformations using PySpark DataFrame operations.
Start Streaming: Use writeStream to output the results to a desired sink, like console, HDFS, or another Kafka topic, and start the streaming process.

Import this notebook to the same folder, rename it to Kafka-to-Delta Streaming Pipeline, and follow the instructions.

Writing Data to Kafka from Spark!

How to write processed data back to Kafka topics using PySpark? This is a topic that I encourage motivated students to study it by themselves.

Windowed Aggregations: Performing windowed aggregations on streaming data from Kafka using PySpark

Import this notebook to the same folder and rename in to Windowed Aggregations on Streaming Data. Follow the content and try to understand what is going on there.

Try it yourself

Import this notebook to the same folder and look into the results. Can you explain the results you see in the windowed_aggregations table? This is practice to check your understanding, not a graded hand-in.

This wraps up the streaming track. Next we move into machine learning with Spark.

Previous: Module 5.2: Kafka Producers and Consumers | Next: Module 6.0: MLlib Fundamentals

Last updated on Jun 17, 2026

← PySpark - Module 5.2: Kafka Producers and Consumers Jun 7, 2026

PySpark - Module 6.0: MLlib Fundamentals Jun 2, 2026 →