MLlib | Your Data Science Mentor - Mohsen Davarynejad

PySpark - Module 6.0: MLlib Fundamentals

Tue, 02 Jun 2026 00:00:00 +0000

MLlib Fundamentals

This is where the course turns from moving and shaping data to learning from it. MLlib is Spark’s machine learning library. The difference from scikit-learn is the same one you have met all course: scikit-learn trains on a single machine and the data has to fit in its memory, while MLlib trains across the cluster on a Spark DataFrame, so it keeps working when the data does not fit on one machine.

If you already know scikit-learn, the ideas carry over. You still split the data, fit a model, and score it. What changes is that every step becomes a stage in a Spark pipeline, and the work runs distributed.

Before you start: switch to environment version 4

MLlib uses Spark’s JVM engine, and on Free Edition serverless you have to opt in to it by selecting environment version 4.

Open a notebook.
On the right, open the Environment panel.
Set the version to 4 and apply.

Without this, importing an MLlib transformer fails with a Py4J security error. With it, the code below runs. On serverless a trained model is capped at 100 MB, which is plenty for everything in this module. The serverless environment version 4 notes have the details.

Transformers, Estimators, and the Pipeline

MLlib has three building blocks, and almost everything you do is one of them.

A Transformer takes a DataFrame and returns a new one with extra columns. VectorAssembler is a Transformer: it reads several columns and writes one features column.
An Estimator has a fit method. It studies data and produces a Transformer. LogisticRegression is an Estimator: calling fit returns a trained model that can transform new data.
A Pipeline chains stages in order. Calling fit on the Pipeline fits each Estimator in turn and passes the result down the chain. The output is a PipelineModel, which is itself a Transformer you apply to new data.

Why this matters: the same sequence of steps you fit on training data is applied, identically, to test data and to production data. You cannot accidentally prepare features one way in training and another way at scoring time.
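To make the contract concrete, here is a toy version in plain Python. It is only a sketch of the idea, not how MLlib is implemented: the class names (MeanCenterer, MeanCentererModel, ToyPipeline, ToyPipelineModel) are made up for illustration, and rows are plain dicts instead of a DataFrame.

```python
# Toy illustration of the Estimator -> Transformer contract.
# All classes here are hypothetical; none of them exist in MLlib.

class MeanCentererModel:
    """A Transformer: holds what was learned and adds a column."""
    def __init__(self, col, mean):
        self.col, self.mean = col, mean

    def transform(self, rows):
        # Return new rows with an extra column; the input rows are left untouched.
        return [{**r, self.col + "_centered": r[self.col] - self.mean} for r in rows]


class MeanCenterer:
    """An Estimator: fit() studies the data and returns a Transformer."""
    def __init__(self, col):
        self.col = col

    def fit(self, rows):
        mean = sum(r[self.col] for r in rows) / len(rows)
        return MeanCentererModel(self.col, mean)


class ToyPipelineModel:
    """The fitted pipeline is itself a Transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Fits each stage in order, feeding each stage the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; plain Transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


train_rows = [{"x": 1.0}, {"x": 3.0}]
test_rows = [{"x": 5.0}]

toy_model = ToyPipeline([MeanCenterer("x")]).fit(train_rows)
# The test row is centred with the mean learned from the training rows (2.0).
print(toy_model.transform(test_rows))
```

The point to notice is that the test row is shifted by the training mean. Nothing is re-learned at scoring time, which is the guarantee a fitted pipeline gives you.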

Feature engineering tools

Three Transformers cover most tabular feature work:

StringIndexer turns a string category column into numeric indexes (for example "a" and "b" become 0.0 and 1.0).
OneHotEncoder turns those indexes into one-hot vectors, so the model does not read the categories as ordered numbers.
VectorAssembler collects your numeric columns and encoded vectors into the single features column that every MLlib model expects.
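If it helps to see what these three steps do to actual values, here is a small plain-Python imitation. It is a sketch under assumptions, not MLlib code: the helper names string_index and one_hot are made up, and it mirrors the defaults as I understand them (StringIndexer orders categories by descending frequency, and OneHotEncoder drops the last category so it becomes the all-zeros vector). MLlib also stores the result as a sparse vector rather than a Python list.

```python
from collections import Counter

def string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list. With drop_last, the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

rows = [
    {"x1": 2.5, "x2": 1.0, "cat": "b"},
    {"x1": 7.0, "x2": 4.0, "cat": "a"},
    {"x1": 9.1, "x2": 0.5, "cat": "b"},
    {"x1": 4.2, "x2": 3.3, "cat": "c"},
]

# "b" appears twice, so it gets index 0.0; "a" and "c" tie and are ordered alphabetically.
mapping = string_index([r["cat"] for r in rows])
for r in rows:
    idx = mapping[r["cat"]]
    ohe = one_hot(idx, len(mapping))
    features = [r["x1"], r["x2"]] + ohe  # the concatenation the assembler step performs
    print(r["cat"], idx, ohe, features)
```

With three categories the encoded vector has two slots, so the features list has four numbers per row: x1, x2, and the two one-hot slots.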

Hands-on: a binary classification pipeline

We generate a small dataset so you can run this immediately, with no file to download: two numeric columns, one category column, and a binary label.

from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# A small synthetic dataset
df = (spark.range(2000)
      .withColumn("x1", F.rand() * 10)
      .withColumn("x2", F.rand() * 5)
      .withColumn("cat", F.when(F.rand() > 0.5, "a").otherwise("b"))
      .withColumn("label", F.when(F.col("x1") + F.col("x2") + F.rand() > 8, 1).otherwise(0)))

# Split first, so the test set stays unseen
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Feature engineering stages
indexer = StringIndexer(inputCol="cat", outputCol="cat_idx")
encoder = OneHotEncoder(inputCols=["cat_idx"], outputCols=["cat_ohe"])
assembler = VectorAssembler(inputCols=["x1", "x2", "cat_ohe"], outputCol="features")

# The model
lr = LogisticRegression(featuresCol="features", labelCol="label")

# One pipeline, fit once
pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
model = pipeline.fit(train)

# Score the held-out test set
predictions = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC").evaluate(predictions)
print("AUC:", round(auc, 4))

Run it. You should see an AUC close to 1.0, because the label is an easy function of the features. On real data you will fight for every point of AUC, but the shape of the code does not change.
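If AUC is new to you, one way to read it is as the chance that a randomly chosen positive row gets a higher score than a randomly chosen negative row. The plain-Python sketch below computes it that way on four made-up rows. The function name pairwise_auc is hypothetical, and this is not how MLlib computes the metric: MLlib works from a binned ROC curve, so on large data its figure may differ slightly from the exact pairwise value.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # made-up model scores
print(pairwise_auc(labels, scores))  # 0.75: three of the four pairs are ranked correctly
```

A value of 0.5 means the scores rank rows no better than chance, and 1.0 means every positive outranks every negative.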

Notice what you did not do: you did not convert the DataFrame to pandas, and you did not loop over rows. The encoders, the assembler, and the model all ran as Spark stages across the cluster. Swap spark.range(2000) for a fifty million row table and the same code still runs.

Try it yourself

Change the label rule and watch how the AUC moves.
Add a second category column, and extend the StringIndexer and OneHotEncoder stages to cover it.
Print model.stages and look at what each fitted stage is.

Key questions you can now answer

What is the difference between a Transformer and an Estimator in MLlib?
How does a Pipeline stop training and scoring from preparing features differently?
Why do you one-hot encode a category instead of feeding its string index straight to the model?
What does VectorAssembler produce, and why does every MLlib model need it?
Why can this pipeline train on data that would not fit in scikit-learn?


PySpark - Module 6.1: Supervised Learning at Scale

Mon, 01 Jun 2026 00:00:00 +0000

Supervised Learning at Scale

In Module 6.0 you built one pipeline with one model. Here you swap models, measure them properly, and let Spark search for good settings. The pipeline you already know does not change. You only replace the final stage and add a way to compare results.

This module assumes the feature pipeline from 6.0: a StringIndexer, a OneHotEncoder, and a VectorAssembler that produce a features column. Keep the same notebook on environment version 4.

Swapping the model

Every MLlib classifier takes the same featuresCol and labelCol, so moving from one to another is a one-line change to the last stage.

from pyspark.ml import Pipeline
from pyspark.ml.classification import (
    LogisticRegression, DecisionTreeClassifier,
    RandomForestClassifier, GBTClassifier,
)

models = {
    "logreg": LogisticRegression(featuresCol="features", labelCol="label"),
    "tree": DecisionTreeClassifier(featuresCol="features", labelCol="label"),
    "forest": RandomForestClassifier(featuresCol="features", labelCol="label"),
    "gbt": GBTClassifier(featuresCol="features", labelCol="label", maxIter=10),
}

fitted = {name: Pipeline(stages=[indexer, encoder, assembler, clf]).fit(train)
          for name, clf in models.items()}

A short guide to when each one earns its place:

Logistic Regression is fast, and its coefficients are easy to explain. Reach for it first as a baseline.
Decision Tree captures non-linear splits and is readable, but a single tree overfits.
Random Forest averages many trees. It is the dependable default for tabular data.
Gradient Boosted Trees often score highest, at the cost of longer training and more careful tuning.

Measuring a model

A model is only as good as the metric you judge it by. MLlib gives you one evaluator per problem type.

from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator, MulticlassClassificationEvaluator,
)

auc = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
f1 = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")

for name, model in fitted.items():
    pred = model.transform(test)
    print(f"{name:7s} AUC={auc.evaluate(pred):.3f} F1={f1.evaluate(pred):.3f}")

Use BinaryClassificationEvaluator (area under ROC or PR) for two-class problems, MulticlassClassificationEvaluator (f1, accuracy) when there are more than two classes, and RegressionEvaluator (RMSE, R2) when the label is a number. The multiclass evaluator also accepts a two-class label, which is why the loop above can report F1 next to AUC.
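The f1 metric deserves one clarification. As far as I understand it, MulticlassClassificationEvaluator reports a weighted F1: it computes F1 per class and averages them by how many rows each class has. The plain-Python sketch below shows that calculation on four made-up rows. The function name weighted_f1 is hypothetical and this is an illustration of the definition, not the library's code.

```python
def weighted_f1(labels, preds):
    """Per-class F1, averaged with each class weighted by its share of the rows."""
    total = len(labels)
    score = 0.0
    for c in sorted(set(labels)):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        support = tp + fn  # number of rows whose true label is c
        score += f1 * support / total
    return score

labels = [1, 1, 1, 0]
preds = [1, 1, 0, 0]   # made-up predictions with one mistake
# Class 1 has F1 = 0.8 and three rows; class 0 has F1 = 2/3 and one row.
print(round(weighted_f1(labels, preds), 4))
```

Because the average is weighted by class size, a model can post a healthy F1 while doing poorly on a rare class. On imbalanced data, check the per-class numbers as well.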

Which features matter

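Tree models report how much each feature contributed. The importances line up with the slots of the features vector, in the order the columns went into the VectorAssembler. A one-hot encoded column can occupy more than one slot, one per encoded category.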

forest = fitted["forest"].stages[-1]
print(forest.featureImportances)
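The printed vector is just numbers, so it is easier to read once each slot has a name. Here is a small helper for that, in plain Python. Everything in it is an assumption for illustration: the function name rank_importances, the slot label "cat_slot_0", and the importance values are made up. In the notebook you could try passing forest.featureImportances.toArray().tolist() as the second argument.

```python
def rank_importances(slot_names, importances):
    """Pair each feature-vector slot with its importance, largest first."""
    if len(slot_names) != len(importances):
        raise ValueError("need exactly one name per slot in the features vector")
    return sorted(zip(slot_names, importances), key=lambda pair: pair[1], reverse=True)

# Slot order follows the assembler inputs: x1, x2, then the one-hot slot(s) for cat.
slot_names = ["x1", "x2", "cat_slot_0"]
importances = [0.62, 0.35, 0.03]   # made-up numbers, not real output

for name, value in rank_importances(slot_names, importances):
    print(f"{name:12s} {value:.2f}")
```

The length check is the useful part: if it fails, your list of names does not match the real width of the features vector, usually because a one-hot column expanded into more slots than you expected.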

Tuning with CrossValidator

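Picking numTrees or maxDepth by hand is guessing. CrossValidator trains every combination in a ParamGridBuilder grid on each of several folds, averages the metric per combination, and keeps the best. It then refits the winning combination on the full training set.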

On serverless, CrossValidator needs a scratch location, so create a Unity Catalog Volume once and point Spark ML at it. This is also the Volume you will reuse in 6.3 to save models.

import os

cat = spark.sql("SELECT current_catalog()").first()[0]
sch = spark.sql("SELECT current_schema()").first()[0]
spark.sql(f"CREATE VOLUME IF NOT EXISTS {cat}.{sch}.ml_models")
os.environ["SPARKML_TEMP_DFS_PATH"] = f"/Volumes/{cat}/{sch}/ml_models"

Now run the search:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

forest = RandomForestClassifier(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(forest.numTrees, [10, 30])
        .addGrid(forest.maxDepth, [3, 6])
        .build())

cv = CrossValidator(
    estimator=Pipeline(stages=[indexer, encoder, assembler, forest]),
    estimatorParamMaps=grid,
    evaluator=auc,
    numFolds=3,
)
cv_model = cv.fit(train)
print("Test AUC of best model:", round(auc.evaluate(cv_model.transform(test)), 3))

cv_model.bestModel is a normal PipelineModel, fitted with the winning settings, ready to score new data.
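Before you widen a grid, it is worth counting how much training you are asking for. The sketch below expands the same grid in plain Python and counts the fits, assuming one fit per combination per fold plus one final refit of the winner on the full training set. The variable names are made up for illustration, and each fit here means fitting the whole pipeline.

```python
from itertools import product

# The same grid as in the search above, written as a plain dict.
grid_spec = {"numTrees": [10, 30], "maxDepth": [3, 6]}
num_folds = 3

names = list(grid_spec)
combos = [dict(zip(names, values))
          for values in product(*(grid_spec[n] for n in names))]

fits_during_search = len(combos) * num_folds   # every combination on every fold
total_fits = fits_during_search + 1            # plus the final refit of the winner

for combo in combos:
    print(combo)
print("combinations:", len(combos), "| total fits:", total_fits)
```

Adding a third parameter with three values would triple the count, so grow a grid one parameter at a time.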

A note on tuning libraries

You will read about Hyperopt and Optuna for hyperparameter search. Here is what works where:

CrossValidator is Spark’s own distributed search. It runs on Free Edition serverless and is what this module uses.
Hyperopt can distribute trials with SparkTrials, but that path needs the Spark JVM context, which serverless does not expose, so it runs only on classic compute.
Optuna runs single-machine on the driver. It is fine for a small search, but it does not distribute the trials across the cluster.

For this course, stay with CrossValidator. It is distributed, it is built in, and it runs where you do.

Try it yourself

Add forest.minInstancesPerNode to the grid and see whether it changes the winner.
Switch the evaluator metric from areaUnderROC to areaUnderPR and compare.
Sort the feature importances and name the two columns that drive the prediction.

Key questions you can now answer

How do you change which algorithm a pipeline uses, and what stays the same?
Which evaluator do you use for two classes, many classes, and a numeric target?
How does CrossValidator choose its best model, and why does it need a Volume on serverless?
Why can Hyperopt’s distributed mode not run on Free Edition, and what do you use instead?
What do feature importances tell you, and what do they not tell you?


PySpark - Module 6.2: Unsupervised Learning and Recommendation

Sun, 31 May 2026 00:00:00 +0000

Unsupervised Learning and Recommendation

Not every problem comes with a label. Sometimes you want to group rows that look alike, and sometimes you want to predict what a person will like from what other people liked. MLlib covers both, and both keep the same pipeline shape you already know.

Keep working on environment version 4.

K-Means clustering

Clustering puts each row into one of k groups based on how close its features are. There is no label. You pick k, and K-Means finds the group centres.

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

df = spark.range(1000).withColumn("x", F.rand() * 10).withColumn("y", F.rand() * 10)
data = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=3, seed=42, featuresCol="features").fit(data)
clustered = model.transform(data) # adds a 'prediction' column: the cluster id

silhouette = ClusteringEvaluator(featuresCol="features").evaluate(clustered)
print("Silhouette:", round(silhouette, 3))
for centre in model.clusterCenters():
    print(centre)

The silhouette score runs from -1 to 1 and tells you how well separated the clusters are. Higher is better. The usual way to choose k is to fit a few values and keep the one with the best silhouette, with an eye on whether the groups mean anything to you.
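To see where that number comes from, here is the silhouette definition written out in plain Python on four made-up points. For each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the nearest other cluster, and the score is (b - a) / max(a, b). The sketch uses squared Euclidean distance, which I believe is the ClusteringEvaluator default. The function names are hypothetical, and MLlib computes the score in a distributed way rather than with these nested loops.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def silhouette(points, cluster_ids):
    """Mean silhouette over all points, straight from the definition."""
    scores = []
    for i, (p, c) in enumerate(zip(points, cluster_ids)):
        own = [sq_dist(p, q) for j, (q, d) in enumerate(zip(points, cluster_ids))
               if d == c and j != i]
        if not own:            # a cluster with a single point scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        others = []
        for other in set(cluster_ids) - {c}:
            dists = [sq_dist(p, q) for q, d in zip(points, cluster_ids) if d == other]
            others.append(sum(dists) / len(dists))
        b = min(others)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Two tight groups that sit far apart (made-up points).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])   # groups match the geometry: about 0.99
bad = silhouette(points, [0, 1, 0, 1])    # groups cut across the geometry: negative
print(round(good, 3), round(bad, 3))
```

The second call shows the other end of the scale: when the assigned groups ignore the real structure, each point is closer to the other cluster than to its own and the score goes negative.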

Cluster on Spark, or in scikit-learn?

If the data fits comfortably in memory, scikit-learn’s K-Means is simpler and faster to iterate on. Reach for Spark’s K-Means when the data does not fit on one machine, or when it already lives in a Spark DataFrame and you do not want to pull it down. The decision is the same one you have made all course: scale out only when scaling up runs out.

Recommendation with ALS

Collaborative filtering predicts how a user would rate an item from the ratings everyone has given so far. MLlib’s ALS (Alternating Least Squares) learns a small vector for each user and each item, and a predicted rating is the dot product of the user’s vector and the item’s vector.
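The dot product is simple enough to do by hand. The sketch below uses made-up factor vectors of length 2 (these are not values ALS would learn from any real data) and a hypothetical helper, predict_rating. It returns NaN for an id it has no vector for, which is the situation a cold-start strategy has to deal with.

```python
import math

# Made-up factors: one small vector per user and per item.
user_factors = {0: [1.2, 0.4], 1: [0.3, 1.5]}
item_factors = {0: [2.0, 0.5], 1: [0.5, 2.0]}

def predict_rating(user_id, item_id):
    """Dot product of the user and item vectors; NaN if either id was never seen."""
    if user_id not in user_factors or item_id not in item_factors:
        return float("nan")
    u, v = user_factors[user_id], item_factors[item_id]
    return sum(a * b for a, b in zip(u, v))

print(round(predict_rating(0, 0), 2))      # 1.2*2.0 + 0.4*0.5
print(round(predict_rating(1, 1), 2))      # 0.3*0.5 + 1.5*2.0
print(math.isnan(predict_rating(9, 0)))    # user 9 has no vector, so no prediction
```

Training is the part ALS does for you: it searches for the vectors that make these dot products match the ratings people actually gave.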

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0),
     (2, 0, 1.0), (2, 2, 4.0), (0, 2, 3.0), (1, 0, 2.0)],
    ["userId", "itemId", "rating"])

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          coldStartStrategy="drop", rank=4, maxIter=5, seed=42)
model = als.fit(ratings)

predictions = model.transform(ratings)
rmse = RegressionEvaluator(labelCol="rating", predictionCol="prediction",
                           metricName="rmse").evaluate(predictions)
print("RMSE:", round(rmse, 3))

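coldStartStrategy="drop" tells ALS to drop predictions for users or items it has never seen, so they do not turn into NaN and poison your RMSE. That matters as soon as you evaluate on a held-out split. Here the model scores the same ratings it was trained on, so every user and item is known and the RMSE is a training error, which flatters the model.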

On Free Edition serverless, model.transform(...) predicts ratings for user and item pairs you give it, and that is what we use here. The batch top-N methods recommendForAllUsers and recommendForAllItems rely on Spark higher-order functions that Unity Catalog does not allow on serverless, so they run only on classic compute.

Try it yourself

Fit K-Means for k in 2, 3, 4, 5 and plot the silhouette against k.
Pull the clustered data into pandas with .toPandas() and draw a scatter plot coloured by cluster (it is small enough here).
Add a few more ratings to the ALS data and watch the RMSE change.

Key questions you can now answer

What does K-Means optimise, and what does the silhouette score measure?
How do you decide between Spark K-Means and scikit-learn K-Means?
What does ALS learn for each user and item, and how does it turn that into a rating?
Why set coldStartStrategy="drop"?
Which ALS method does not run on Free Edition serverless, and where would you run it instead?


PySpark - Module 6.3: ML Pipelines in Production

Sat, 30 May 2026 00:00:00 +0000

ML Pipelines in Production

A model that lives in the notebook where you trained it is a prototype. Production means four more things: save the model, load it somewhere else, score new data with it, and keep an eye on it as the world changes. MLlib and MLflow cover all four, and none of it leaves Databricks.

Keep working on environment version 4, and reuse the Unity Catalog Volume from 6.1. On serverless, saving and loading models always goes through a Volume.

import os

cat = spark.sql("SELECT current_catalog()").first()[0]
sch = spark.sql("SELECT current_schema()").first()[0]
spark.sql(f"CREATE VOLUME IF NOT EXISTS {cat}.{sch}.ml_models")
volume = f"/Volumes/{cat}/{sch}/ml_models"
os.environ["SPARKML_TEMP_DFS_PATH"] = volume

Saving and loading a model

A fitted PipelineModel saves to the Volume as a folder. The whole pipeline goes with it: the indexer, the encoder, the assembler, and the trained model. Whoever loads it gets the exact same feature steps, so there is no way to score with different preparation than you trained with.

from pyspark.ml import PipelineModel

# model is the fitted PipelineModel from Module 6.0; refit it if 6.2 reused the name
model.write().overwrite().save(f"{volume}/credit_model")

loaded = PipelineModel.load(f"{volume}/credit_model")

Batch inference

Batch inference is the everyday production job: load the saved model, point it at a fresh batch of data, and write the scores out. Because the model is a Transformer, scoring a million rows is the same call as scoring ten.

# new_data is any DataFrame with the training input columns (x1, x2, cat)
scored = loaded.transform(new_data)
scored.select("prediction", "probability").show(5)

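The new batch only needs the same input columns the pipeline was trained on. The encoders and the assembler rebuild the features column on their own. One caveat: by default StringIndexer raises an error on a category value it did not see during fit. If new categories can appear, set handleInvalid="keep" on the StringIndexer before fitting.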
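A cheap guard before scoring is to check that the batch really has those input columns, so a missing one fails with a clear message instead of a long Spark stack trace. This is a plain-Python sketch: the helper name missing_columns and the example column lists are made up, and in the notebook you would pass new_data.columns as the second argument.

```python
def missing_columns(required, available):
    """Return the required column names that are absent from the batch, in order."""
    available = set(available)
    return [c for c in required if c not in available]

# The pipeline from 6.0 reads these three inputs; the label is not needed for scoring.
required_inputs = ["x1", "x2", "cat"]
batch_columns = ["id", "x1", "cat"]   # hypothetical batch that lost x2 upstream

gaps = missing_columns(required_inputs, batch_columns)
if gaps:
    print("Cannot score, missing columns:", gaps)
```

Extra columns in the batch are harmless. The pipeline ignores them and they pass through to the scored output.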

Tracking experiments with MLflow

Once you train more than one model, you need a record of what you tried and how it did. MLflow keeps that record. Each run stores the parameters you logged, the metrics you logged, and the model itself, and they show up in the Experiments tab next to your notebook.

import mlflow

with mlflow.start_run() as run:
    mlflow.log_param("model", "logreg")
    # auc must be the float score from Module 6.0, not the evaluator object from 6.1
    mlflow.log_metric("auc", round(auc, 4))
    mlflow.spark.log_model(model, "model", dfs_tmpdir=volume)

The dfs_tmpdir=volume argument is the serverless detail: MLflow stages the Spark model through the Volume, the same place SparkML writes to. Without it, logging a Spark model on serverless fails.

Watching for drift

A model is trained on a snapshot of the world, and the world moves. Data drift is when the incoming data starts to look different from the training data: a feature’s average shifts, a new category appears, a range widens. The model still runs, but its predictions quietly get worse.

You do not need a special tool to start. Compare the new batch against the training data with the Spark aggregations you already know.

from pyspark.sql import functions as F

for col in ["x1", "x2"]:
    t = train.select(F.mean(col).alias("mean"), F.stddev(col).alias("std")).first()
    n = new_data.select(F.mean(col).alias("mean"), F.stddev(col).alias("std")).first()
    print(f"{col}: train mean={t['mean']:.2f} std={t['std']:.2f} | new mean={n['mean']:.2f} std={n['std']:.2f}")

When the new numbers drift away from the training numbers, that is your signal to look closer, and often to retrain. The same idea scales up to proper monitoring: track these summaries over time, and alert when they move past a threshold.
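One simple way to turn those summaries into a yes-or-no signal is to measure how far the new mean has moved, in units of the training standard deviation, and flag anything past a threshold. The sketch below does that in plain Python. The function name drift_report, the threshold of 0.5, and the example statistics are all assumptions for illustration. Pick a threshold that suits your data, and feed in the means and standard deviations you collected with Spark.

```python
def drift_report(train_stats, new_stats, threshold=0.5):
    """Flag columns whose mean moved more than `threshold` training standard deviations."""
    report = {}
    for col, t in train_stats.items():
        n = new_stats[col]
        gap = abs(n["mean"] - t["mean"])
        if t["std"]:
            shift = gap / t["std"]
        else:
            # A constant training column: any movement at all counts as drift.
            shift = 0.0 if gap == 0 else float("inf")
        report[col] = {"shift_in_std": shift, "drifted": shift > threshold}
    return report

# Made-up summaries in the shape the Spark loop produces.
train_stats = {"x1": {"mean": 5.0, "std": 2.9}, "x2": {"mean": 2.5, "std": 1.45}}
new_stats = {"x1": {"mean": 7.1, "std": 2.8}, "x2": {"mean": 2.55, "std": 1.4}}

report = drift_report(train_stats, new_stats)
for col, result in report.items():
    print(col, round(result["shift_in_std"], 2), "DRIFT" if result["drifted"] else "ok")
```

This only watches the mean. A column can keep its mean and still change shape, so for anything important, compare the standard deviation and a few quantiles in the same way.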

Try it yourself

Save a model, load it in a fresh cell, and confirm the loaded model scores the same rows the same way.
Log two models to MLflow with different settings and compare their AUC in the Experiments tab.
Shift new_data (add a constant to x1) and watch the drift check pick it up.

Key questions you can now answer

Why does saving the whole PipelineModel, not just the final estimator, matter?
What is batch inference, and why is scoring a million rows the same code as scoring ten?
What does MLflow record per run, and why does logging a Spark model on serverless need dfs_tmpdir?
What is data drift, and how can you detect it with plain Spark aggregations?
Where do saved models and staged artifacts have to live on Free Edition serverless, and why?
