<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>MLflow | Your Data Science Mentor - Mohsen Davarynejad</title><link>https://dataqubed.io/tags/mlflow/</link><atom:link href="https://dataqubed.io/tags/mlflow/index.xml" rel="self" type="application/rss+xml"/><description>MLflow</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 30 May 2026 00:00:00 +0000</lastBuildDate><image><url>https://dataqubed.io/media/icon_hu_913fe0962b0e757d.png</url><title>MLflow</title><link>https://dataqubed.io/tags/mlflow/</link></image><item><title>PySpark - Module 6.3: ML Pipelines in Production</title><link>https://dataqubed.io/pyspark-ml-pipelines-in-production/</link><pubDate>Sat, 30 May 2026 00:00:00 +0000</pubDate><guid>https://dataqubed.io/pyspark-ml-pipelines-in-production/</guid><description>&lt;h2 id="ml-pipelines-in-production">ML Pipelines in Production&lt;/h2>
&lt;p>A model that lives in the notebook where you trained it is a prototype. Production means four more things: save the model, load it somewhere else, score new data with it, and keep an eye on it as the world changes. MLlib and MLflow cover all four, and none of it leaves Databricks.&lt;/p>
&lt;p>Keep working on &lt;strong>environment version 4&lt;/strong>, and reuse the Unity Catalog Volume from 6.1. On serverless, saving and loading models always goes through a Volume.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">os&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">cat&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sql&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;SELECT current_catalog()&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">first&lt;/span>&lt;span class="p">()[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">sch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sql&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;SELECT current_schema()&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">first&lt;/span>&lt;span class="p">()[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sql&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;CREATE VOLUME IF NOT EXISTS &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">cat&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">.&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">sch&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">.ml_models&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">volume&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;/Volumes/&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">cat&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">sch&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/ml_models&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">environ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;SPARKML_TEMP_DFS_PATH&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">volume&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="saving-and-loading-a-model">Saving and loading a model&lt;/h3>
&lt;p>A fitted &lt;code>PipelineModel&lt;/code> saves to the Volume as a folder. The whole pipeline goes with it: the indexers, the encoder, the assembler, and the trained model. Whoever loads it gets the exact same feature steps, so scoring always runs the same preparation the model was trained with.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">PipelineModel&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">overwrite&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">save&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">volume&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/credit_model&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">loaded&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PipelineModel&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">volume&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">/credit_model&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="batch-inference">Batch inference&lt;/h3>
&lt;p>Batch inference is the everyday production job: load the saved model, point it at a fresh batch of data, and write the scores out. Because the model is a Transformer, scoring a million rows is the same call as scoring ten.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">scored&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">loaded&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">new_data&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">scored&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;prediction&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;probability&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The new batch only needs the same input columns the pipeline was trained on. The indexers, the encoder, and the assembler rebuild the &lt;code>features&lt;/code> column on their own.&lt;/p>
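&lt;p>One way to catch a mismatched batch before scoring is to compare column names up front. The sketch below is plain Python and assumes you keep a list of the training input columns yourself. The helper &lt;code>missing_columns&lt;/code> and the column names are hypothetical, not part of MLlib.&lt;/p>

```python
def missing_columns(required, available):
    """Return the required column names absent from `available`, in order."""
    present = set(available)
    return [name for name in required if name not in present]


# Hypothetical input columns; in a notebook you would pass new_data.columns.
required_cols = ["x1", "x2", "category"]
batch_cols = ["id", "x1", "category"]

missing = missing_columns(required_cols, batch_cols)
print(missing)
```

&lt;p>An empty list means the batch carries every input the pipeline expects. Anything else is worth stopping on before calling &lt;code>transform&lt;/code>.&lt;/p>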
&lt;h3 id="tracking-experiments-with-mlflow">Tracking experiments with MLflow&lt;/h3>
&lt;p>Once you train more than one model, you need a record of what you tried and how it did. MLflow keeps that record. Each run stores the parameters you logged, the metrics you logged, and the model itself, and they show up in the Experiments tab next to your notebook.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">mlflow&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">with&lt;/span> &lt;span class="n">mlflow&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">start_run&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">run&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mlflow&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">log_param&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;model&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;logreg&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mlflow&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">log_metric&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;auc&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">auc&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">4&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mlflow&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">log_model&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;model&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dfs_tmpdir&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">volume&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>dfs_tmpdir=volume&lt;/code> argument is the serverless detail: MLflow stages the Spark model through the Volume, the same place SparkML writes to. Without it, logging a Spark model on serverless fails.&lt;/p>
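&lt;p>To score with a logged model later, MLflow can address it by run. This is a minimal sketch, assuming the artifact path &lt;code>model&lt;/code> used above. The helper &lt;code>run_model_uri&lt;/code> is hypothetical and only builds the &lt;code>runs:/&lt;/code> URI string. The commented-out call shows where &lt;code>mlflow.spark.load_model&lt;/code> would take it, and passing &lt;code>dfs_tmpdir&lt;/code> there is an assumption that mirrors the logging call.&lt;/p>

```python
def run_model_uri(run_id, artifact_path="model"):
    """Build the runs:/ URI MLflow uses to address an artifact logged in a run."""
    return f"runs:/{run_id}/{artifact_path}"


# Placeholder run id for illustration; in the notebook use run.info.run_id.
uri = run_model_uri("abc123")
print(uri)

# In the notebook (assumes mlflow, a finished run, and the Volume from above):
# loaded = mlflow.spark.load_model(uri, dfs_tmpdir=volume)
```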
&lt;h3 id="watching-for-drift">Watching for drift&lt;/h3>
&lt;p>A model is trained on a snapshot of the world, and the world moves. &lt;strong>Data drift&lt;/strong> is when the incoming data starts to look different from the training data: a feature&amp;rsquo;s average shifts, a new category appears, a range widens. The model still runs, but its predictions can quietly get worse.&lt;/p>
&lt;p>You do not need a special tool to start. Compare the new batch against the training data with the Spark aggregations you already know.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.sql&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">functions&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">col&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;x1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;x2&amp;#34;&lt;/span>&lt;span class="p">]:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">t&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">train&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mean&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">alias&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;mean&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">stddev&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">alias&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;std&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">first&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">n&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">new_data&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mean&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">alias&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;mean&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">stddev&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">alias&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;std&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">first&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">: train mean=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">t&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;mean&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> std=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">t&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;std&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> | new mean=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;mean&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> std=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;std&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When the new numbers drift away from the training numbers, that is your signal to look closer, and often to retrain. The same idea scales up to proper monitoring: track these summaries over time, and alert when they move past a threshold.&lt;/p>
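&lt;p>The threshold idea can be sketched in plain Python on top of the numbers the loop above prints. It assumes one simple rule: flag a feature when its new mean sits more than a chosen number of training standard deviations away from the training mean. The function names, the threshold of 0.5, and the sample numbers are illustrative, not a standard.&lt;/p>

```python
def mean_shift(train_mean, train_std, new_mean):
    """Shift of the new mean, measured in training standard deviations."""
    if train_std == 0:
        # Constant training feature: any change at all counts as an infinite shift.
        return 0.0 if new_mean == train_mean else float("inf")
    return abs(new_mean - train_mean) / train_std


def drifted(train_mean, train_std, new_mean, threshold=0.5):
    """True when the shift exceeds the (assumed) threshold."""
    return mean_shift(train_mean, train_std, new_mean) > threshold


# Made-up summaries: (train mean, train std, new mean) per feature.
summaries = {"x1": (10.0, 2.0, 10.4), "x2": (5.0, 1.0, 6.0)}
flags = {name: drifted(*stats) for name, stats in summaries.items()}
print(flags)
```

&lt;p>In the notebook, the three inputs would come from the &lt;code>t&lt;/code> and &lt;code>n&lt;/code> rows computed in the loop. A mean-only rule misses changes in spread or new categories, so treat it as a starting point.&lt;/p>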
&lt;h3 id="try-it-yourself">Try it yourself&lt;/h3>
&lt;ul>
&lt;li>Save a model, load it in a fresh cell, and confirm the loaded model scores the same rows the same way.&lt;/li>
&lt;li>Log two models to MLflow with different settings and compare their AUC in the Experiments tab.&lt;/li>
&lt;li>Shift &lt;code>new_data&lt;/code> (add a constant to &lt;code>x1&lt;/code>) and watch the drift check pick it up.&lt;/li>
&lt;/ul>
&lt;h3 id="key-questions-you-can-now-answer">Key questions you can now answer&lt;/h3>
&lt;ul>
&lt;li>Why does saving the whole &lt;code>PipelineModel&lt;/code>, not just the final estimator, matter?&lt;/li>
&lt;li>What is batch inference, and why is scoring a million rows the same code as scoring ten?&lt;/li>
&lt;li>What does MLflow record per run, and why does logging a Spark model on serverless need &lt;code>dfs_tmpdir&lt;/code>?&lt;/li>
&lt;li>What is data drift, and how can you detect it with plain Spark aggregations?&lt;/li>
&lt;li>Where do saved models and staged artifacts have to live on Free Edition serverless, and why?&lt;/li>
&lt;/ul>
&lt;hr>
&lt;!-- nav -->
&lt;p>&lt;strong>Previous:&lt;/strong> &lt;a href="https://dataqubed.io/pyspark-unsupervised-learning-and-recommendation/" target="_blank" rel="noopener">Module 6.2: Unsupervised Learning and Recommendation&lt;/a> | &lt;strong>Next:&lt;/strong> &lt;a href="https://dataqubed.io/pyspark-capstone-project/" target="_blank" rel="noopener">Module 7: Capstone Project&lt;/a>&lt;/p></description></item></channel></rss>