PySpark - Module 7: Capstone Project
Capstone Project
The capstone is where the separate pieces become one piece of work. You take a real problem and carry it the whole way: get the data in, shape it, model it or stream it, and produce something a colleague could use. It is an individual or pair deliverable, scoped to about one day. The point is not size. It is whether you can join the stages together and explain your choices.
Pick one of the two tracks.
Track A: Data Science
Goal. Build a distributed machine learning pipeline on a dataset of at least one million rows, end to end, in Databricks.
What you deliver. One Databricks notebook that:
- reads the raw data,
- engineers features with the MLlib pipeline tools from Module 6.0,
- trains and evaluates a model, with the metric chosen to fit the problem,
- and scores a fresh batch with the saved model.
Platform. This track uses MLlib, so run it on serverless environment version 4 and save your model to a Unity Catalog Volume, as shown in Modules 6.0 to 6.3.
Suggested datasets. NYC Taxi trips, OpenAQ air quality, Yelp reviews, or data from your own company if you can share it. Pick something where a prediction would actually be useful.
You are assessed on: whether the pipeline runs end to end, whether the evaluation is a fair measure of how good the model really is, the clarity of the code, and how well you explain each choice.
Track B: Data Engineering
Goal. Build an end-to-end streaming pipeline that moves data from a source into a Delta Lake table, continuously.
What you deliver. A Databricks notebook with the streaming job, plus the producer or source setup, showing live data flowing in and landing in Delta.
Source. A Kafka topic is the target shape, set up as in Module 5. If you cannot run a Kafka broker, a file source or the built-in rate source into Delta is an acceptable stand-in, as long as the pipeline is genuinely streaming: it reads with readStream and writes with writeStream rather than rerunning a batch job. The Kafka path may need classic compute; see the platform notes on the Getting Started page.
Suggested scenarios. A simulated IoT sensor stream, clickstream events, or financial tick data.
You are assessed on: whether the pipeline keeps running, how it handles the schema and bad records, whether you can show it recovering from a restart, and the clarity of the code.
A few real examples
A short look at where this shape shows up in production.
- Logistics. A parcel carrier scores every scan event as it arrives, predicts which shipments will miss their delivery window, and flags them before they are late. Streaming ingestion feeds a model that updates a dashboard. What Spark components made this possible?
- Retail. An online shop rebuilds product recommendations each night from the day’s clicks and purchases, joining behaviour against a catalogue of millions of items. Batch DataFrame work at scale, then a model. What Spark components made this possible?
- Media. A streaming service counts plays per track across hundreds of millions of events to drive its charts and royalty reports, with late events arriving for hours. Structured Streaming with windowing and watermarks. What Spark components made this possible?
Questions to answer in your write-up
- What would you do differently if your dataset were ten times larger?
- How did you decide between batch and streaming for your problem?
- What would a monitoring strategy look like for your pipeline in production?
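As a starting point for the monitoring question, a streaming query reports metrics after each micro-batch through `query.lastProgress`. The sketch below is a plain-Python check over one such report. It assumes the report behaves like a dictionary with the `inputRowsPerSecond`, `processedRowsPerSecond`, and `durationMs` fields; the function name and the 60-second threshold are hypothetical.

```python
def check_progress(progress, max_batch_ms=60_000):
    """Return a list of warning strings for one streaming progress report."""
    if progress is None:
        # lastProgress is None until the first micro-batch completes.
        return ["no progress reported yet"]

    warnings = []
    input_rate = progress.get("inputRowsPerSecond") or 0.0
    processed_rate = progress.get("processedRowsPerSecond") or 0.0
    if input_rate > processed_rate:
        # Data is arriving faster than it is being processed.
        warnings.append(
            f"falling behind: input {input_rate:.1f} rows/s, "
            f"processed {processed_rate:.1f} rows/s"
        )

    batch_ms = (progress.get("durationMs") or {}).get("triggerExecution", 0)
    if batch_ms > max_batch_ms:
        warnings.append(f"slow batch: {batch_ms} ms (limit {max_batch_ms} ms)")
    return warnings
```

In a notebook this might be called as `check_progress(query.lastProgress)` on a schedule, with any warnings sent to whatever alerting channel your team already uses.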