PySpark - Module 6.2: Unsupervised Learning and Recommendation

Unsupervised Learning and Recommendation

Not every problem comes with a label. Sometimes you want to group rows that look alike, and sometimes you want to predict what a person will like from what other people liked. MLlib covers both, and both keep the same pipeline shape you already know.

Keep working on environment version 4.

K-Means clustering

Clustering assigns each row to one of k groups so that rows in the same group have similar features. There is no label. You pick k, and K-Means places k group centres so as to minimise the total squared distance from each row to its nearest centre.
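
To make "finds the group centres" concrete, here is a minimal pure-Python sketch of the two steps K-Means alternates between: assign every point to its nearest centre, then move each centre to the mean of its points. The helper names (assign, update, cost), the six sample points and the starting centres are all made up for illustration; MLlib's implementation is distributed and picks its starting centres more carefully, so treat this as a picture of the idea rather than of Spark's code.

```python
from math import dist

def assign(points, centres):
    """Label each point with the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda j: dist(p, centres[j]))
            for p in points]

def update(points, labels, centres):
    """Move each centre to the mean of the points assigned to it."""
    new_centres = []
    for j, old in enumerate(centres):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centres.append(old)  # an empty cluster keeps its old centre
            continue
        new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centres

def cost(points, labels, centres):
    """Total squared distance from each point to its assigned centre."""
    return sum(dist(p, centres[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical toy data: two obvious blobs
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centres = [(0.0, 0.0), (5.0, 5.0)]  # arbitrary starting guesses

costs = []
for _ in range(5):
    labels = assign(points, centres)
    costs.append(cost(points, labels, centres))
    centres = update(points, labels, centres)

print("labels:", labels)
print("cost per iteration:", [round(c, 2) for c in costs])
```

The cost never goes up from one iteration to the next, which is why the loop settles. It settles on a local optimum, though, so different starting centres can give different clusters.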

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

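# Synthetic data: 1,000 points spread uniformly over a 10 x 10 square (seeded for repeatability)
df = spark.range(1000).withColumn("x", F.rand(seed=1) * 10).withColumn("y", F.rand(seed=2) * 10)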
data = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=3, seed=42, featuresCol="features").fit(data)
clustered = model.transform(data)   # adds a 'prediction' column: the cluster id

silhouette = ClusteringEvaluator(featuresCol="features").evaluate(clustered)
print("Silhouette:", round(silhouette, 3))
for centre in model.clusterCenters():
    print(centre)

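The silhouette score runs from -1 to 1. It compares how close each row is to its own cluster with how close it is to the nearest other cluster: values near 1 mean tight, well-separated clusters, values near 0 mean the clusters overlap, and negative values suggest rows are sitting in the wrong cluster. Higher is better. The usual way to choose k is to fit a few values and keep the one with the best silhouette, with an eye on whether the groups mean anything to you.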
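
To see where the number comes from, here is a small pure-Python sketch of the silhouette calculation. It rests on two assumptions. It uses plain Euclidean distance, whereas Spark's ClusteringEvaluator defaults to squared Euclidean distance, so the values will not match Spark's exactly. It also expects at least two clusters. The silhouette function and the four sample points are hypothetical.

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette coefficient, using plain Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a one-point cluster scores 0
            continue
        # a: mean distance to the other members of p's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(dist(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: two pairs of points, far apart
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # each pair is its own cluster
bad = silhouette(points, [0, 1, 0, 1])   # each cluster straddles both pairs
print("sensible grouping:", round(good, 3))
print("scrambled grouping:", round(bad, 3))
```

The sensible grouping scores close to 1 and the scrambled one comes out negative, which is the behaviour the score is designed to have.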

Cluster on Spark, or in scikit-learn?

If the data fits comfortably in memory, scikit-learn’s K-Means is simpler and faster to iterate on. Reach for Spark’s K-Means when the data does not fit on one machine, or when it already lives in a Spark DataFrame and you do not want to pull it down. The decision is the same one you have made all course: scale out only when scaling up runs out.

Recommendation with ALS

Collaborative filtering predicts how a user would rate an item from the ratings everyone has given so far. MLlib’s ALS (Alternating Least Squares) learns a small vector of factors for each user and each item, and the predicted rating for a user and item pair is the dot product of their two vectors.
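
The dot product step is easy to check by hand. In the sketch below the factor vectors are invented for illustration; a fitted ALS model learns its own and exposes them as the userFactors and itemFactors DataFrames. The predict helper is hypothetical and is not part of MLlib.

```python
# Hypothetical rank-2 factors; ALS would learn these from the ratings.
user_factors = {0: [1.2, 0.3], 1: [0.4, 1.5]}
item_factors = {0: [2.0, 0.5], 1: [0.5, 1.8]}

def predict(user_id, item_id):
    """Predicted rating = dot product of the user and item vectors."""
    u = user_factors[user_id]
    v = item_factors[item_id]
    return sum(a * b for a, b in zip(u, v))

for user_id in user_factors:
    for item_id in item_factors:
        print(user_id, item_id, round(predict(user_id, item_id), 2))
```

User 0 leans on the first factor and item 0 is strong in it, so that pair scores higher than user 0 with item 1. The rank parameter in ALS sets how many such factors each vector has.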

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0),
     (2, 0, 1.0), (2, 2, 4.0), (0, 2, 3.0), (1, 0, 2.0)],
    ["userId", "itemId", "rating"])

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          coldStartStrategy="drop", rank=4, maxIter=5, seed=42)
model = als.fit(ratings)

# Scored on the training ratings themselves: fine for a demo, but it flatters the RMSE
predictions = model.transform(ratings)
rmse = RegressionEvaluator(labelCol="rating", predictionCol="prediction",
                           metricName="rmse").evaluate(predictions)
print("RMSE:", round(rmse, 3))

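coldStartStrategy="drop" tells ALS to drop prediction rows for users or items it never saw during training. Without it, those rows get a NaN prediction, and a single NaN is enough to turn the whole RMSE into NaN. This matters as soon as you evaluate on a held-out split, where some users or items may not appear in the training data.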
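
A few lines of plain Python show why one NaN is enough to ruin the metric. The rmse helper and the three (actual, predicted) pairs are made up for illustration; the filter at the end stands in for what coldStartStrategy="drop" does to the prediction rows.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((actual - predicted) ** 2
                         for actual, predicted in pairs) / len(pairs))

# Hypothetical scored ratings; the last pair is for an unseen user or item
scored = [(4.0, 3.5), (2.0, 2.5), (5.0, math.nan)]

with_nan = rmse(scored)  # the NaN propagates through the sum

# Mimic "drop": discard rows whose prediction is NaN before scoring
kept = [(a, p) for a, p in scored if not math.isnan(p)]
without_nan = rmse(kept)

print("with NaN row:", with_nan)
print("after dropping it:", without_nan)
```

Dropping rows means the RMSE describes only the users and items the model knows, so it is worth checking how many rows were dropped before trusting the figure.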

On Free Edition serverless, model.transform(...) predicts ratings for user and item pairs you give it, and that is what we use here. The batch top-N methods recommendForAllUsers and recommendForAllItems rely on Spark higher-order functions that Unity Catalog does not allow on serverless, so they run only on classic compute.

Try it yourself

  • Fit K-Means for k in 2, 3, 4, 5 and plot the silhouette against k.
  • Pull the clustered data into pandas with .toPandas() and draw a scatter plot coloured by cluster (it is small enough here).
  • Add a few more ratings to the ALS data and watch the RMSE change.

Key questions you can now answer

  • What does K-Means optimise, and what does the silhouette score measure?
  • How do you decide between Spark K-Means and scikit-learn K-Means?
  • What does ALS learn for each user and item, and how does it turn that into a rating?
  • Why set coldStartStrategy="drop"?
  • Which ALS method does not run on Free Edition serverless, and where would you run it instead?
