PySpark - Module 5.3: Kafka and Structured Streaming
While we experimented with setting up a Kafka service and accessing it programmatically, the solution provided in Module 5.2 does not leverage the capabilities of Spark and distributed computing as it focuses primarily on Python-based producers and consumers, without integrating Spark’s distributed computing power.
Integration of Kafka with Apache Spark Structured Streaming:
Reading Data from Kafka in Spark: Using PySpark for Streaming
Configure Kafka Consumer: Use PySpark’s spark.readStream method to configure a Kafka consumer. Specify the Kafka bootstrap servers and the topic you want to read from.
Stream Data Processing: Define the schema for the incoming data and apply any necessary transformations using PySpark DataFrame operations.
Start Streaming: Use writeStream to output the results to a desired sink, like console, HDFS, or another Kafka topic, and start the streaming process.
Import this notebook to the same folder, rename it to Kafka-to-Delta Streaming Pipeline, and follow the instructions.
Writing Data to Kafka from Spark!
Windowed Aggregations: Performing windowed aggregations on streaming data from Kafka using PySpark
Import this notebook to the same folder and rename in to Windowed Aggregations on Streaming Data. Follow the content and try to understand what is going on there.
Try it yourself
Import this notebook to the same folder and look into the results. Can you explain the results you see in the windowed_aggregations table? This is practice to check your understanding, not a graded hand-in.
This wraps up the streaming track. Next we move into machine learning with Spark.
Previous: Module 5.2: Kafka Producers and Consumers | Next: Module 6.0: MLlib Fundamentals