PySpark - Module 6.2: Unsupervised Learning and Recommendation
Unsupervised Learning and Recommendation
Not every problem comes with a label. Sometimes you want to group rows that look alike, and sometimes you want to predict what a person will like from what other people liked. MLlib covers both, and both keep the same pipeline shape you already know.
Keep working on environment version 4.
K-Means clustering
Clustering puts each row into one of k groups based on how close its features are. There is no label. You pick k, and K-Means finds the group centres.
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
df = spark.range(1000).withColumn("x", F.rand() * 10).withColumn("y", F.rand() * 10)
data = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)
model = KMeans(k=3, seed=42, featuresCol="features").fit(data)
clustered = model.transform(data) # adds a 'prediction' column: the cluster id
silhouette = ClusteringEvaluator(featuresCol="features").evaluate(clustered)
print("Silhouette:", round(silhouette, 3))
for centre in model.clusterCenters():
print(centre)
The silhouette score runs from -1 to 1 and tells you how well separated the clusters are. Higher is better. The usual way to choose k is to fit a few values and keep the one with the best silhouette, with an eye on whether the groups mean anything to you.
Cluster on Spark, or in scikit-learn?
If the data fits comfortably in memory, scikit-learn’s K-Means is simpler and faster to iterate on. Reach for Spark’s K-Means when the data does not fit on one machine, or when it already lives in a Spark DataFrame and you do not want to pull it down. The decision is the same one you have made all course: scale out only when scaling up runs out.
Recommendation with ALS
Collaborative filtering predicts how a user would rate an item from the ratings everyone has given so far. MLlib’s ALS (Alternating Least Squares) learns a small vector for each user and each item, and a rating is their dot product.
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
ratings = spark.createDataFrame(
[(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0),
(2, 0, 1.0), (2, 2, 4.0), (0, 2, 3.0), (1, 0, 2.0)],
["userId", "itemId", "rating"])
als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
coldStartStrategy="drop", rank=4, maxIter=5, seed=42)
model = als.fit(ratings)
predictions = model.transform(ratings)
rmse = RegressionEvaluator(labelCol="rating", predictionCol="prediction",
metricName="rmse").evaluate(predictions)
print("RMSE:", round(rmse, 3))
coldStartStrategy="drop" tells ALS to drop predictions for users or items it has never seen, so they do not turn into NaN and poison your RMSE.
model.transform(...) predicts ratings for user and item pairs you give it, and that is what we use here. The batch top-N methods recommendForAllUsers and recommendForAllItems rely on Spark higher-order functions that Unity Catalog does not allow on serverless, so they run only on classic compute.Try it yourself
- Fit K-Means for
kin 2, 3, 4, 5 and plot the silhouette againstk. - Pull the clustered data into pandas with
.toPandas()and draw a scatter plot coloured by cluster (it is small enough here). - Add a few more ratings to the ALS data and watch the RMSE change.
Key questions you can now answer
- What does K-Means optimise, and what does the silhouette score measure?
- How do you decide between Spark K-Means and scikit-learn K-Means?
- What does ALS learn for each user and item, and how does it turn that into a rating?
- Why set
coldStartStrategy="drop"? - Which ALS method does not run on Free Edition serverless, and where would you run it instead?
Previous: Module 6.1: Supervised Learning at Scale | Next: Module 6.3: ML Pipelines in Production