<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>MLlib | Your Data Science Mentor - Mohsen Davarynejad</title><link>https://dataqubed.io/tags/mllib/</link><atom:link href="https://dataqubed.io/tags/mllib/index.xml" rel="self" type="application/rss+xml"/><description>MLlib</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 02 Jun 2026 00:00:00 +0000</lastBuildDate><image><url>https://dataqubed.io/media/icon_hu_913fe0962b0e757d.png</url><title>MLlib</title><link>https://dataqubed.io/tags/mllib/</link></image><item><title>PySpark - Module 6.0: MLlib Fundamentals</title><link>https://dataqubed.io/pyspark-mllib-fundamentals/</link><pubDate>Tue, 02 Jun 2026 00:00:00 +0000</pubDate><guid>https://dataqubed.io/pyspark-mllib-fundamentals/</guid><description>&lt;h2 id="mllib-fundamentals">MLlib Fundamentals&lt;/h2>
&lt;p>This is where the course turns from moving and shaping data to learning from it. MLlib is Spark&amp;rsquo;s machine learning library. The difference from scikit-learn is the one you have met throughout the course: scikit-learn trains on a single machine and the data has to fit in that machine&amp;rsquo;s memory, while MLlib trains across the cluster on a Spark DataFrame, so it keeps working when the data does not fit on one machine.&lt;/p>
&lt;p>If you already know scikit-learn, the ideas carry over. You still split the data, fit a model, and score it. What changes is that every step becomes a stage in a Spark pipeline, and the work runs distributed.&lt;/p>
&lt;h3 id="before-you-start-switch-to-environment-version-4">Before you start: switch to environment version 4&lt;/h3>
&lt;p>MLlib uses Spark&amp;rsquo;s JVM engine, and on Free Edition serverless you have to opt in to it by selecting &lt;strong>environment version 4&lt;/strong>.&lt;/p>
&lt;ol>
&lt;li>Open a notebook.&lt;/li>
&lt;li>On the right, open the &lt;strong>Environment&lt;/strong> panel.&lt;/li>
&lt;li>Set the version to &lt;strong>4&lt;/strong> and apply.&lt;/li>
&lt;/ol>
&lt;p>Without this, importing an MLlib transformer fails with a Py4J security error. With it, the code below runs. On serverless a trained model is capped at 100 MB, which is plenty for everything in this module. The &lt;a href="https://docs.databricks.com/aws/en/release-notes/serverless/environment-version/four" target="_blank" rel="noopener">serverless environment version 4 notes&lt;/a> have the details.&lt;/p>
&lt;h3 id="transformers-estimators-and-the-pipeline">Transformers, Estimators, and the Pipeline&lt;/h3>
&lt;p>MLlib has three building blocks, and almost everything you do is one of them.&lt;/p>
&lt;ul>
&lt;li>A &lt;strong>Transformer&lt;/strong> takes a DataFrame and returns a new one with extra columns. &lt;code>VectorAssembler&lt;/code> is a Transformer: it reads several columns and writes one &lt;code>features&lt;/code> column.&lt;/li>
&lt;li>An &lt;strong>Estimator&lt;/strong> has a &lt;code>fit&lt;/code> method. It studies data and produces a Transformer. &lt;code>LogisticRegression&lt;/code> is an Estimator: calling &lt;code>fit&lt;/code> returns a trained model that can transform new data.&lt;/li>
&lt;li>A &lt;strong>Pipeline&lt;/strong> chains stages in order. Calling &lt;code>fit&lt;/code> on the Pipeline runs the stages in sequence: each Estimator is fit on the data as transformed by the stages before it, and each fitted stage transforms the data before handing it to the next. The output is a &lt;code>PipelineModel&lt;/code>, which is itself a Transformer you apply to new data.&lt;/li>
&lt;/ul>
&lt;p>Why this matters: the same sequence of steps you fit on training data is applied, identically, to test data and to production data. You cannot accidentally prepare features one way in training and another way at scoring time.&lt;/p>
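&lt;p>The contract described above can be sketched in plain Python. The classes below (&lt;code>AddColumn&lt;/code>, &lt;code>MeanCenter&lt;/code>, &lt;code>ToyPipeline&lt;/code>, &lt;code>ToyPipelineModel&lt;/code>) are hypothetical stand-ins written for this illustration, not MLlib classes, and they work on lists of dicts rather than DataFrames. The only point is the shape: whatever &lt;code>fit&lt;/code> learns from the training rows is frozen and then reused on any other rows.&lt;/p>

```python
# Toy stand-ins for the Transformer / Estimator / Pipeline contract.
# These are NOT pyspark classes; all names here are made up for illustration.

class AddColumn:
    """Transformer: rows in, rows with one extra column out."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def transform(self, rows):
        return [{**row, self.name: self.fn(row)} for row in rows]


class MeanCenter:
    """Estimator: fit() learns a mean and returns a Transformer that uses it."""
    def __init__(self, col):
        self.col = col

    def fit(self, rows):
        col = self.col
        mean = sum(row[col] for row in rows) / len(rows)
        return AddColumn(col + "_centered", lambda row: row[col] - mean)


class ToyPipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Fits Estimators in order, passing transformed rows down the chain."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


train = [{"x": 1.0}, {"x": 3.0}]
test = [{"x": 10.0}]

model = ToyPipeline([MeanCenter("x")]).fit(train)
print(model.transform(test))  # [{'x': 10.0, 'x_centered': 8.0}]
```

&lt;p>The test row is centered with the mean learned from the training rows (2.0), not with its own mean. That is the guarantee a fitted pipeline gives you.&lt;/p>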
&lt;h3 id="feature-engineering-tools">Feature engineering tools&lt;/h3>
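&lt;p>Three feature stages cover most tabular feature work. Strictly, &lt;code>StringIndexer&lt;/code> and &lt;code>OneHotEncoder&lt;/code> are Estimators, because they learn the set of categories when fit, while &lt;code>VectorAssembler&lt;/code> is a plain Transformer:&lt;/p>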
&lt;ul>
&lt;li>&lt;strong>&lt;code>StringIndexer&lt;/code>&lt;/strong> turns a string category column into numeric indexes (for example &lt;code>&amp;quot;a&amp;quot;&lt;/code> and &lt;code>&amp;quot;b&amp;quot;&lt;/code> become &lt;code>0.0&lt;/code> and &lt;code>1.0&lt;/code>). By default the most frequent category gets index &lt;code>0.0&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>OneHotEncoder&lt;/code>&lt;/strong> turns those indexes into one-hot vectors, so the model does not read the categories as ordered numbers.&lt;/li>
&lt;li>&lt;strong>&lt;code>VectorAssembler&lt;/code>&lt;/strong> collects your numeric columns and encoded vectors into the single &lt;code>features&lt;/code> column that every MLlib model expects.&lt;/li>
&lt;/ul>
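&lt;p>To make the first two stages concrete, here is a plain-Python sketch of what they compute under their default settings, as far as I understand them: indexes assigned by descending frequency, and a one-hot vector that drops the last category. The helper names &lt;code>fit_string_index&lt;/code> and &lt;code>one_hot&lt;/code> are made up for this illustration, and the real encoder returns sparse vectors rather than Python lists.&lt;/p>

```python
from collections import Counter

# Hypothetical helpers that mimic the default behaviour of the MLlib stages.
# They are an illustration only, not part of pyspark.

def fit_string_index(values):
    """Most frequent label gets 0.0; ties are broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """One-hot vector; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) in range(size):
        vec[int(index)] = 1.0
    return vec


cats = ["b", "a", "b", "c", "b", "a"]
mapping = fit_string_index(cats)
encoded = {label: one_hot(idx, len(mapping)) for label, idx in mapping.items()}

print(mapping)  # {'b': 0.0, 'a': 1.0, 'c': 2.0}
print(encoded)  # {'b': [1.0, 0.0], 'a': [0.0, 1.0], 'c': [0.0, 0.0]}
```

&lt;p>Dropping the last category keeps the encoded columns from summing to one on every row, which matters for linear models. The index alone would tell the model that &lt;code>c&lt;/code> is &amp;ldquo;twice&amp;rdquo; &lt;code>a&lt;/code>, which is exactly the false ordering that one-hot encoding removes.&lt;/p>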
&lt;h3 id="hands-on-a-binary-classification-pipeline">Hands-on: a binary classification pipeline&lt;/h3>
&lt;p>We generate a small dataset so you can run this immediately, with no file to download: two numeric columns, one category column, and a binary label.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.sql&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">functions&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">Pipeline&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.feature&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">StringIndexer&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">OneHotEncoder&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">VectorAssembler&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.classification&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">LogisticRegression&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.evaluation&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">BinaryClassificationEvaluator&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># A small synthetic dataset&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">2000&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;x1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rand&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">10&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;x2&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rand&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cat&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">when&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rand&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">&amp;gt;&lt;/span> &lt;span class="mf">0.5&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;a&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">otherwise&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;b&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;label&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">when&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;x1&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;x2&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rand&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">&amp;gt;&lt;/span> &lt;span class="mi">8&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">otherwise&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">)))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Split first, so the test set stays unseen&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">train&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">randomSplit&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="mf">0.8&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">0.2&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">seed&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">42&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Feature engineering stages&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">indexer&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">StringIndexer&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">inputCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cat&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">outputCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cat_idx&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">encoder&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">OneHotEncoder&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">inputCols&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;cat_idx&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">outputCols&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;cat_ohe&amp;#34;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">assembler&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">VectorAssembler&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">inputCols&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;x1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;x2&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;cat_ohe&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">outputCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;features&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># The model&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lr&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">LogisticRegression&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">featuresCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;features&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labelCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;label&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># One pipeline, fit once&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">pipeline&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Pipeline&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">stages&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">indexer&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">encoder&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">assembler&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">lr&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pipeline&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">train&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Score the held-out test set&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">predictions&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">auc&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">BinaryClassificationEvaluator&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">labelCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;label&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">metricName&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;areaUnderROC&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">evaluate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">predictions&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;AUC:&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">auc&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">4&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Run it. You should see an AUC close to 1.0, because the label is just &lt;code>x1 + x2&lt;/code> plus a little noise compared against a threshold, which a linear model separates almost perfectly. On real data you will fight for every point of AUC, but the shape of the code does not change.&lt;/p>
&lt;p>Notice what you did not do: you did not convert the DataFrame to pandas, and you did not loop over rows. The encoders, the assembler, and the model all ran as Spark stages across the cluster. Swap &lt;code>spark.range(2000)&lt;/code> for a fifty-million-row table and the same code still runs.&lt;/p>
&lt;h3 id="try-it-yourself">Try it yourself&lt;/h3>
&lt;ul>
&lt;li>Change the label rule and watch how the AUC moves.&lt;/li>
&lt;li>Add a second category column, and extend the &lt;code>StringIndexer&lt;/code> and &lt;code>OneHotEncoder&lt;/code> stages to cover it.&lt;/li>
&lt;li>Print &lt;code>model.stages&lt;/code> and look at what each fitted stage is.&lt;/li>
&lt;/ul>
&lt;h3 id="key-questions-you-can-now-answer">Key questions you can now answer&lt;/h3>
&lt;ul>
&lt;li>What is the difference between a Transformer and an Estimator in MLlib?&lt;/li>
&lt;li>How does a Pipeline stop training and scoring from preparing features differently?&lt;/li>
&lt;li>Why do you one-hot encode a category instead of feeding its string index straight to the model?&lt;/li>
&lt;li>What does &lt;code>VectorAssembler&lt;/code> produce, and why does every MLlib model need it?&lt;/li>
&lt;li>Why can this pipeline train on data that would not fit in scikit-learn?&lt;/li>
&lt;/ul>
&lt;hr>
&lt;!-- nav -->
&lt;p>&lt;strong>Previous:&lt;/strong> &lt;a href="https://dataqubed.io/pyspark-kafka-spark-structured-streaming/" target="_blank" rel="noopener">Module 5.3: Kafka and Structured Streaming&lt;/a> | &lt;strong>Next:&lt;/strong> &lt;a href="https://dataqubed.io/pyspark-supervised-learning-at-scale/" target="_blank" rel="noopener">Module 6.1: Supervised Learning at Scale&lt;/a>&lt;/p></description></item><item><title>PySpark - Module 6.1: Supervised Learning at Scale</title><link>https://dataqubed.io/pyspark-supervised-learning-at-scale/</link><pubDate>Mon, 01 Jun 2026 00:00:00 +0000</pubDate><guid>https://dataqubed.io/pyspark-supervised-learning-at-scale/</guid><description>&lt;h2 id="supervised-learning-at-scale">Supervised Learning at Scale&lt;/h2>
&lt;p>In Module 6.0 you built one pipeline with one model. Here you swap models, measure them properly, and let Spark search for good settings. The pipeline you already know does not change. You only replace the final stage and add a way to compare results.&lt;/p>
&lt;p>This module assumes the feature pipeline from 6.0: a &lt;code>StringIndexer&lt;/code>, a &lt;code>OneHotEncoder&lt;/code>, and a &lt;code>VectorAssembler&lt;/code> that produce a &lt;code>features&lt;/code> column. Keep the same notebook on &lt;strong>environment version 4&lt;/strong>.&lt;/p>
&lt;h3 id="swapping-the-model">Swapping the model&lt;/h3>
&lt;p>Every MLlib classifier takes the same &lt;code>featuresCol&lt;/code> and &lt;code>labelCol&lt;/code>, so moving from one to another is a one-line change to the last stage.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">Pipeline&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.classification&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">LogisticRegression&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">DecisionTreeClassifier&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">RandomForestClassifier&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">GBTClassifier&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">models&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;logreg&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">LogisticRegression&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">featuresCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;features&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labelCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;label&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;tree&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">DecisionTreeClassifier&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">featuresCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;features&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labelCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;label&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;forest&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">RandomForestClassifier&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">featuresCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;features&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labelCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;label&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;gbt&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">GBTClassifier&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">featuresCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;features&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labelCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;label&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">maxIter&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fitted&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">{&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Pipeline&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">stages&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">indexer&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">encoder&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">assembler&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">clf&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">train&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">name&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">clf&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">models&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">items&lt;/span>&lt;span class="p">()}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>A short guide to when each one earns its place:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Logistic Regression&lt;/strong> is fast, and its coefficients are easy to explain. Reach for it first as a baseline.&lt;/li>
&lt;li>&lt;strong>Decision Tree&lt;/strong> captures non-linear splits and is readable, but a single deep tree tends to overfit.&lt;/li>
&lt;li>&lt;strong>Random Forest&lt;/strong> averages many trees. It is the dependable default for tabular data.&lt;/li>
&lt;li>&lt;strong>Gradient Boosted Trees&lt;/strong> often score highest, at the cost of longer training and more careful tuning.&lt;/li>
&lt;/ul>
&lt;h3 id="measuring-a-model">Measuring a model&lt;/h3>
&lt;p>A model is only as good as the metric you judge it by. MLlib gives you one evaluator per problem type.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.evaluation&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">BinaryClassificationEvaluator&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">MulticlassClassificationEvaluator&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">auc&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">BinaryClassificationEvaluator&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">labelCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;label&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">metricName&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;areaUnderROC&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">f1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">MulticlassClassificationEvaluator&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">labelCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;label&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">metricName&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;f1&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">name&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">model&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">fitted&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">items&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">pred&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">7s&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> AUC=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">auc&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">evaluate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">pred&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.3f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> F1=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">f1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">evaluate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">pred&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.3f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Use &lt;code>BinaryClassificationEvaluator&lt;/code> (area under ROC or PR) for two-class problems, &lt;code>MulticlassClassificationEvaluator&lt;/code> (weighted F1, accuracy) when there are more than two classes or when you want those metrics on a two-class problem, as in the loop above, and &lt;code>RegressionEvaluator&lt;/code> (RMSE, R2) when the label is a number.&lt;/p>
&lt;h3 id="which-features-matter">Which features matter&lt;/h3>
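&lt;p>Tree models report how much each feature contributed. The importances line up with the slots of the &lt;code>features&lt;/code> vector, in the order of the columns that went into the &lt;code>VectorAssembler&lt;/code>; a one-hot encoded column can take up more than one slot.&lt;/p>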
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">forest&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">fitted&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;forest&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">stages&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">forest&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">featureImportances&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="tuning-with-crossvalidator">Tuning with CrossValidator&lt;/h3>
&lt;p>Picking &lt;code>numTrees&lt;/code> or &lt;code>maxDepth&lt;/code> by hand is guessing. &lt;code>CrossValidator&lt;/code> trains every combination in a &lt;code>ParamGridBuilder&lt;/code> grid across several folds, averages the evaluator score over the folds, and keeps the best combination, refit on the full training data.&lt;/p>
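&lt;p>Before launching a search it helps to count how many models it will train. The plain-Python sketch below expands a grid the way a full grid search does and multiplies by the number of folds. The &lt;code>maxDepth&lt;/code> values and the fold count of 3 are assumptions chosen for illustration, and the extra fit at the end accounts for retraining the winning combination on all the training data.&lt;/p>

```python
from itertools import product

# Hypothetical grid: the maxDepth values and the fold count are assumptions.
grid = {"numTrees": [10, 30], "maxDepth": [3, 5]}
num_folds = 3

# Every combination of the listed values, one dict per candidate model.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each candidate is trained once per fold, then the best one is refit once.
total_fits = len(combos) * num_folds + 1

for combo in combos:
    print(combo)
print("model fits:", total_fits)  # model fits: 13
```

&lt;p>The count grows multiplicatively with every parameter you add to the grid, so start with a small grid and widen it only around the values that score well.&lt;/p>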
&lt;p>On serverless, CrossValidator needs a scratch location, so create a Unity Catalog Volume once and point Spark ML at it. This is also the Volume you will reuse in 6.3 to save models.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">os&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">cat&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sql&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;SELECT current_catalog()&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">first&lt;/span>&lt;span class="p">()[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">sch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sql&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;SELECT current_schema()&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">first&lt;/span>&lt;span class="p">()[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sql&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;CREATE VOLUME IF NOT EXISTS &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">cat&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">.&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">sch&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">.ml_models&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">environ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;SPARKML_TEMP_DFS_PATH&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;/Volumes/&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">cat&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">sch&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/ml_models&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now run the search:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.tuning&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">CrossValidator&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ParamGridBuilder&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">forest&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">RandomForestClassifier&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">featuresCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;features&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labelCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;label&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">grid&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">ParamGridBuilder&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">addGrid&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">forest&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">numTrees&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">30&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">addGrid&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">forest&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">maxDepth&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">6&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">build&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">cv&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">CrossValidator&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">estimator&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">Pipeline&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">stages&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">indexer&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">encoder&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">assembler&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">forest&lt;/span>&lt;span class="p">]),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">estimatorParamMaps&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">grid&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">evaluator&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">auc&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">numFolds&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">cv_model&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cv&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">train&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Best AUC:&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">auc&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">evaluate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">cv_model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>cv_model.bestModel&lt;/code> is a normal &lt;code>PipelineModel&lt;/code>, fitted with the winning settings, ready to score new data.&lt;/p>
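&lt;p>To get a feel for what the search costs, here is a rough sketch in plain Python (no Spark needed) that enumerates the same two-by-two grid and counts the fits. The variable names are made up for illustration. It assumes the usual CrossValidator behaviour: one fit per parameter combination per fold, then one refit of the winner on the full training set.&lt;/p>

```python
from itertools import product

# Same values as the ParamGridBuilder grid in this module.
num_trees_options = [10, 30]
max_depth_options = [3, 6]
num_folds = 3

# Every combination of the two lists, which is what the grid builder produces.
combos = list(product(num_trees_options, max_depth_options))

# One fit per combination per fold, plus one final refit of the winner.
cv_fits = len(combos) * num_folds
total_fits = cv_fits + 1

for num_trees, max_depth in combos:
    print(f"numTrees={num_trees}, maxDepth={max_depth}")
print("Fits during cross-validation:", cv_fits)
print("Total fits including the final refit:", total_fits)
```

&lt;p>Every value you add to the grid multiplies the count, so keep the grid small while you experiment.&lt;/p>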
&lt;h3 id="a-note-on-tuning-libraries">A note on tuning libraries&lt;/h3>
&lt;p>You will read about Hyperopt and Optuna for hyperparameter search. Here is what works where:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>CrossValidator&lt;/strong> is Spark&amp;rsquo;s own distributed search. It runs on Free Edition serverless and is what this module uses.&lt;/li>
&lt;li>&lt;strong>Hyperopt&lt;/strong> can distribute trials with &lt;code>SparkTrials&lt;/code>, but that path needs the Spark JVM context, which serverless does not expose, so it runs only on classic compute.&lt;/li>
&lt;li>&lt;strong>Optuna&lt;/strong> runs single-machine on the driver. It is fine for a small search, but it does not distribute the trials across the cluster.&lt;/li>
&lt;/ul>
&lt;p>For this course, stay with CrossValidator. It is distributed, it is built in, and it runs on the same Free Edition serverless compute you are using.&lt;/p>
&lt;h3 id="try-it-yourself">Try it yourself&lt;/h3>
&lt;ul>
&lt;li>Add &lt;code>forest.minInstancesPerNode&lt;/code> to the grid and see whether it changes the winner.&lt;/li>
&lt;li>Switch the evaluator metric from &lt;code>areaUnderROC&lt;/code> to &lt;code>areaUnderPR&lt;/code> and compare.&lt;/li>
&lt;li>Sort the feature importances and name the two columns that drive the prediction.&lt;/li>
&lt;/ul>
&lt;h3 id="key-questions-you-can-now-answer">Key questions you can now answer&lt;/h3>
&lt;ul>
&lt;li>How do you change which algorithm a pipeline uses, and what stays the same?&lt;/li>
&lt;li>Which evaluator do you use for two classes, many classes, and a numeric target?&lt;/li>
&lt;li>How does CrossValidator choose its best model, and why does it need a Volume on serverless?&lt;/li>
&lt;li>Why can Hyperopt&amp;rsquo;s distributed mode not run on Free Edition, and what do you use instead?&lt;/li>
&lt;li>What do feature importances tell you, and what do they not tell you?&lt;/li>
&lt;/ul>
&lt;hr>
&lt;!-- nav -->
&lt;p>&lt;strong>Previous:&lt;/strong> &lt;a href="https://dataqubed.io/pyspark-mllib-fundamentals/" target="_blank" rel="noopener">Module 6.0: MLlib Fundamentals&lt;/a> | &lt;strong>Next:&lt;/strong> &lt;a href="https://dataqubed.io/pyspark-unsupervised-learning-and-recommendation/" target="_blank" rel="noopener">Module 6.2: Unsupervised Learning and Recommendation&lt;/a>&lt;/p></description></item><item><title>PySpark - Module 6.2: Unsupervised Learning and Recommendation</title><link>https://dataqubed.io/pyspark-unsupervised-learning-and-recommendation/</link><pubDate>Sun, 31 May 2026 00:00:00 +0000</pubDate><guid>https://dataqubed.io/pyspark-unsupervised-learning-and-recommendation/</guid><description>&lt;h2 id="unsupervised-learning-and-recommendation">Unsupervised Learning and Recommendation&lt;/h2>
&lt;p>Not every problem comes with a label. Sometimes you want to group rows that look alike, and sometimes you want to predict what a person will like from what other people liked. MLlib covers both, and both keep the same pipeline shape you already know.&lt;/p>
&lt;p>Keep working on &lt;strong>environment version 4&lt;/strong>.&lt;/p>
&lt;h3 id="k-means-clustering">K-Means clustering&lt;/h3>
&lt;p>Clustering puts each row into one of &lt;code>k&lt;/code> groups based on how close its features are. There is no label. You pick &lt;code>k&lt;/code>, and K-Means finds the group centres.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.sql&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">functions&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.feature&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">VectorAssembler&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.clustering&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">KMeans&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.evaluation&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">ClusteringEvaluator&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1000&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;x&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rand&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">10&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;y&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rand&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">10&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">VectorAssembler&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">inputCols&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;x&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;y&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">outputCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;features&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">df&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">KMeans&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">k&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">seed&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">42&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">featuresCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;features&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">clustered&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># adds a &amp;#39;prediction&amp;#39; column: the cluster id&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">silhouette&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">ClusteringEvaluator&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">featuresCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;features&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">evaluate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">clustered&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Silhouette:&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">silhouette&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">centre&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">clusterCenters&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">centre&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;strong>silhouette score&lt;/strong> runs from -1 to 1 and tells you how well separated the clusters are. Higher is better. The usual way to choose &lt;code>k&lt;/code> is to fit a few values and keep the one with the best silhouette, with an eye on whether the groups mean anything to you.&lt;/p>
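&lt;p>To see what the score measures, here is a minimal sketch of the textbook silhouette on four hand-labelled points, in plain Python. For each point, &lt;code>a&lt;/code> is the mean distance to the other points in its own cluster, &lt;code>b&lt;/code> is the mean distance to the nearest other cluster, and the silhouette is &lt;code>(b - a) / max(a, b)&lt;/code>. The points, labels, and helper names are invented for illustration, and the sketch uses plain Euclidean distance, so treat it as the idea behind the metric rather than a copy of how Spark computes it.&lt;/p>

```python
from math import dist

# Four toy points in two obvious groups; cluster ids are assigned by hand.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]


def mean_distance(p, others):
    """Average Euclidean distance from p to a list of points."""
    return sum(dist(p, q) for q in others) / len(others)


def silhouette(i):
    """Silhouette of point i: (b - a) / max(a, b)."""
    # a: how close the point is to the rest of its own cluster.
    same = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
    a = mean_distance(points[i], same)
    # b: how close the point is to the nearest other cluster.
    b = min(
        mean_distance(points[i], [q for j, q in enumerate(points) if labels[j] == other])
        for other in set(labels) - {labels[i]}
    )
    return (b - a) / max(a, b)


score = sum(silhouette(i) for i in range(len(points))) / len(points)
print("Mean silhouette:", round(score, 3))
```

&lt;p>The two groups are far apart compared with their own spread, so the score comes out close to 1. Move the groups closer together and it drops.&lt;/p>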
&lt;h3 id="cluster-on-spark-or-in-scikit-learn">Cluster on Spark, or in scikit-learn?&lt;/h3>
&lt;p>If the data fits comfortably in memory, scikit-learn&amp;rsquo;s K-Means is simpler and faster to iterate on. Reach for Spark&amp;rsquo;s K-Means when the data does not fit on one machine, or when it already lives in a Spark DataFrame and you do not want to pull it down. The decision is the same one you have made all course: scale out only when scaling up runs out.&lt;/p>
&lt;h3 id="recommendation-with-als">Recommendation with ALS&lt;/h3>
&lt;p>Collaborative filtering predicts how a user would rate an item from the ratings everyone has given so far. MLlib&amp;rsquo;s &lt;code>ALS&lt;/code> (Alternating Least Squares) learns a small vector for each user and each item, and a rating is their dot product.&lt;/p>
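&lt;p>The dot-product step can be sketched in a few lines of plain Python. The vectors here are invented for illustration and are not values a fitted model produced. A real ALS model learns them from the ratings.&lt;/p>

```python
# Hypothetical rank-4 factor vectors, keyed by user id and item id.
user_factors = {0: [1.0, 0.5, 0.0, 2.0]}
item_factors = {0: [2.0, 1.0, 1.0, 0.5], 1: [0.5, 0.0, 3.0, 0.5]}


def predict_rating(user_id, item_id):
    """Predicted rating = dot product of the user vector and the item vector."""
    u = user_factors[user_id]
    v = item_factors[item_id]
    return sum(a * b for a, b in zip(u, v))


print(predict_rating(0, 0))
print(predict_rating(0, 1))
```

&lt;p>The length of each vector is the &lt;code>rank&lt;/code> you pass to &lt;code>ALS&lt;/code>. A user and an item whose vectors point the same way get a high predicted rating.&lt;/p>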
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.recommendation&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">ALS&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.evaluation&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">RegressionEvaluator&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ratings&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">createDataFrame&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">[(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">4.0&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">2.0&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">3.0&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">5.0&lt;/span>&lt;span class="p">),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">(&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">1.0&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">4.0&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">3.0&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">2.0&lt;/span>&lt;span class="p">)],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;userId&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;itemId&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;rating&amp;#34;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">als&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">ALS&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">userCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;userId&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">itemCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;itemId&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ratingCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;rating&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">coldStartStrategy&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;drop&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">rank&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">4&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">maxIter&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">seed&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">42&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">als&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ratings&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">predictions&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ratings&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">rmse&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">RegressionEvaluator&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">labelCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;rating&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">predictionCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;prediction&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">metricName&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;rmse&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">evaluate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">predictions&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;RMSE:&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">rmse&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">))&lt;/span>
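&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>coldStartStrategy=&amp;quot;drop&amp;quot;&lt;/code> tells ALS to drop predictions for users or items it did not see during training, so they do not turn into &lt;code>NaN&lt;/code> and poison your RMSE. That matters once you score a held-out split. This toy example scores the same ratings it was trained on, so its RMSE is optimistic.&lt;/p>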
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">On Free Edition serverless, &lt;code>model.transform(...)&lt;/code> predicts ratings for user and item pairs you give it, and that is what we use here. The batch top-N methods &lt;code>recommendForAllUsers&lt;/code> and &lt;code>recommendForAllItems&lt;/code> rely on Spark higher-order functions that Unity Catalog does not allow on serverless, so they run only on classic compute.&lt;/span>
&lt;/div>
&lt;h3 id="try-it-yourself">Try it yourself&lt;/h3>
&lt;ul>
&lt;li>Fit K-Means for &lt;code>k&lt;/code> in 2, 3, 4, 5 and plot the silhouette against &lt;code>k&lt;/code>.&lt;/li>
&lt;li>Pull the clustered data into pandas with &lt;code>.toPandas()&lt;/code> and draw a scatter plot coloured by cluster (it is small enough here).&lt;/li>
&lt;li>Add a few more ratings to the ALS data and watch the RMSE change.&lt;/li>
&lt;/ul>
&lt;h3 id="key-questions-you-can-now-answer">Key questions you can now answer&lt;/h3>
&lt;ul>
&lt;li>What does K-Means optimise, and what does the silhouette score measure?&lt;/li>
&lt;li>How do you decide between Spark K-Means and scikit-learn K-Means?&lt;/li>
&lt;li>What does ALS learn for each user and item, and how does it turn that into a rating?&lt;/li>
&lt;li>Why set &lt;code>coldStartStrategy=&amp;quot;drop&amp;quot;&lt;/code>?&lt;/li>
&lt;li>Which ALS method does not run on Free Edition serverless, and where would you run it instead?&lt;/li>
&lt;/ul>
&lt;hr>
&lt;!-- nav -->
&lt;p>&lt;strong>Previous:&lt;/strong> &lt;a href="https://dataqubed.io/pyspark-supervised-learning-at-scale/" target="_blank" rel="noopener">Module 6.1: Supervised Learning at Scale&lt;/a> | &lt;strong>Next:&lt;/strong> &lt;a href="https://dataqubed.io/pyspark-ml-pipelines-in-production/" target="_blank" rel="noopener">Module 6.3: ML Pipelines in Production&lt;/a>&lt;/p></description></item><item><title>PySpark - Module 6.3: ML Pipelines in Production</title><link>https://dataqubed.io/pyspark-ml-pipelines-in-production/</link><pubDate>Sat, 30 May 2026 00:00:00 +0000</pubDate><guid>https://dataqubed.io/pyspark-ml-pipelines-in-production/</guid><description>&lt;h2 id="ml-pipelines-in-production">ML Pipelines in Production&lt;/h2>
&lt;p>A model that lives in the notebook where you trained it is a prototype. Production means four more things: save the model, load it somewhere else, score new data with it, and keep an eye on it as the world changes. MLlib and MLflow cover all four, and none of it leaves Databricks.&lt;/p>
&lt;p>Keep working on &lt;strong>environment version 4&lt;/strong>, and reuse the Unity Catalog Volume from 6.1. On serverless, saving and loading models always go through a Volume.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">os&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">cat&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sql&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;SELECT current_catalog()&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">first&lt;/span>&lt;span class="p">()[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">sch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sql&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;SELECT current_schema()&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">first&lt;/span>&lt;span class="p">()[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sql&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;CREATE VOLUME IF NOT EXISTS &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">cat&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">.&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">sch&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">.ml_models&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">volume&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;/Volumes/&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">cat&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">sch&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/ml_models&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">environ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;SPARKML_TEMP_DFS_PATH&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">volume&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="saving-and-loading-a-model">Saving and loading a model&lt;/h3>
&lt;p>A fitted &lt;code>PipelineModel&lt;/code> saves to the Volume as a folder. The whole pipeline goes with it: the indexers, the encoder, the assembler, and the trained model. Whoever loads it gets the exact same feature steps, so there is no way to score with different preparation than you trained with.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">PipelineModel&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">overwrite&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">save&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">volume&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/credit_model&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">loaded&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PipelineModel&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">volume&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/credit_model&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="batch-inference">Batch inference&lt;/h3>
&lt;p>Batch inference is the everyday production job: load the saved model, point it at a fresh batch of data, and write the scores out. Because the model is a Transformer, scoring a million rows is the same call as scoring ten.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">scored&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">loaded&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">new_data&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">scored&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;prediction&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;probability&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The new batch only needs the same input columns the pipeline was trained on. The encoders and the assembler rebuild the &lt;code>features&lt;/code> column on their own.&lt;/p>
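&lt;p>A cheap guard before scoring is to check that the batch really has those input columns. The sketch below works on plain lists of column names. The helper and the column names are hypothetical, not part of MLlib.&lt;/p>

```python
def missing_columns(required, available):
    """Return the required column names that are absent, in their original order."""
    present = set(available)
    return [name for name in required if name not in present]


# Hypothetical column names, for illustration only.
trained_on = ["age", "income", "housing"]
new_batch_columns = ["age", "housing", "customer_id"]

missing = missing_columns(trained_on, new_batch_columns)
if missing:
    print("Cannot score, missing columns:", missing)
else:
    print("All input columns present")
```

&lt;p>With a Spark DataFrame, pass &lt;code>new_data.columns&lt;/code> as the second argument. Extra columns in the batch, such as an id, do no harm, because the pipeline only reads the columns it was built on.&lt;/p>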
&lt;h3 id="tracking-experiments-with-mlflow">Tracking experiments with MLflow&lt;/h3>
&lt;p>Once you train more than one model, you need a record of what you tried and how it did. MLflow keeps that record. Each run stores the parameters you logged, the metrics you logged, and the model itself, and they show up in the Experiments tab next to your notebook.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">mlflow&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">with&lt;/span> &lt;span class="n">mlflow&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">start_run&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">run&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="n">mlflow&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">log_param&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;model&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;logreg&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="n">mlflow&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">log_metric&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;auc&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">auc&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">4&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="n">mlflow&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">log_model&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;model&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dfs_tmpdir&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">volume&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>dfs_tmpdir=volume&lt;/code> argument is the serverless detail: MLflow stages the Spark model through the Volume, the same place SparkML writes to. Without it, logging a Spark model on serverless fails.&lt;/p>
&lt;h3 id="watching-for-drift">Watching for drift&lt;/h3>
&lt;p>A model is trained on a snapshot of the world, and the world moves. &lt;strong>Data drift&lt;/strong> is when the incoming data starts to look different from the training data: a feature&amp;rsquo;s average shifts, a new category appears, a range widens. The model still runs without errors, but its predictions can quietly get worse.&lt;/p>
&lt;p>You do not need a special tool to start. Compare the new batch against the training data with the Spark aggregations you already know.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.sql&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">functions&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">col&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;x1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;x2&amp;#34;&lt;/span>&lt;span class="p">]:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="n">t&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">train&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mean&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">alias&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;mean&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">stddev&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">alias&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;std&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">first&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="n">n&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">new_data&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mean&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">alias&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;mean&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">stddev&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">alias&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;std&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">first&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">: train mean=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">t&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;mean&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> std=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">t&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;std&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> | new mean=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;mean&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> std=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;std&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When a feature&amp;rsquo;s mean or standard deviation in the new batch moves well away from its training value, that is your signal to look closer, and often to retrain. The same idea scales up to proper monitoring: track these summaries over time, and alert when they move past a threshold.&lt;/p>
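&lt;p>One simple way to turn those printed summaries into an alert is to measure how far the new mean has moved in units of the training standard deviation, and flag the feature when that shift passes a cutoff. The sketch below is plain Python and rests on assumptions: the function names, the made-up statistics, and the &lt;code>0.5&lt;/code> threshold are illustrative only, and a sensible threshold depends on your data.&lt;/p>

```python
def mean_shift(train_mean, train_std, new_mean):
    """Shift of the new mean from the training mean, in training standard deviations."""
    if train_std == 0:
        # A constant training feature: any change at all counts as an infinite shift.
        return 0.0 if new_mean == train_mean else float("inf")
    return abs(new_mean - train_mean) / train_std


def drifted(train_mean, train_std, new_mean, threshold=0.5):
    """True when the standardized mean shift exceeds the (assumed) threshold."""
    return mean_shift(train_mean, train_std, new_mean) > threshold


# Made-up summary statistics, shaped like the values the drift loop prints.
summaries = {
    "x1": {"train_mean": 10.0, "train_std": 2.0, "new_mean": 10.4},
    "x2": {"train_mean": 5.0, "train_std": 1.0, "new_mean": 7.0},
}

for name, s in summaries.items():
    shift = mean_shift(s["train_mean"], s["train_std"], s["new_mean"])
    flag = drifted(s["train_mean"], s["train_std"], s["new_mean"])
    print(f"{name}: shift={shift:.2f} drifted={flag}")
```

&lt;p>With these numbers &lt;code>x1&lt;/code> moves by about 0.2 standard deviations and is left alone, while &lt;code>x2&lt;/code> moves by 2 and is flagged. A mean-only check misses changes in spread or new categories, so treat it as a first alarm rather than a full monitor.&lt;/p>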
&lt;h3 id="try-it-yourself">Try it yourself&lt;/h3>
&lt;ul>
&lt;li>Save a model, load it in a fresh cell, and confirm the loaded model scores the same rows the same way.&lt;/li>
&lt;li>Log two models to MLflow with different settings and compare their AUC in the Experiments tab.&lt;/li>
&lt;li>Shift &lt;code>new_data&lt;/code> (add a constant to &lt;code>x1&lt;/code>) and watch the drift check pick it up.&lt;/li>
&lt;/ul>
&lt;h3 id="key-questions-you-can-now-answer">Key questions you can now answer&lt;/h3>
&lt;ul>
&lt;li>Why does saving the whole &lt;code>PipelineModel&lt;/code>, not just the final estimator, matter?&lt;/li>
&lt;li>What is batch inference, and why is scoring a million rows the same code as scoring ten?&lt;/li>
&lt;li>What does MLflow record per run, and why does logging a Spark model on serverless need &lt;code>dfs_tmpdir&lt;/code>?&lt;/li>
&lt;li>What is data drift, and how can you detect it with plain Spark aggregations?&lt;/li>
&lt;li>Where do saved models and staged artifacts have to live on Free Edition serverless, and why?&lt;/li>
&lt;/ul>
</description></item></channel></rss>