PySpark - Module 6.6: Running H2O Sparkling Water

Setting Up H2O Sparkling Water

H2O Sparkling Water does not run on Databricks Free Edition. Free Edition gives you serverless compute (Spark Connect), which does not expose the SparkContext and locks down the low-level JVM access that Sparkling Water relies on. We tested this directly: H2OContext.getOrCreate() fails on serverless. To follow this module you need classic Databricks compute: a trial or paid workspace where you can create a standard cluster. The rest of the course runs on Free Edition, so treat this H2O track as an optional, advanced extra.

Sparkling Water lets H2O run inside a Spark cluster, so you prepare data at Spark scale and train H2O models without moving it anywhere. On a workspace with classic compute, the setup is short.

Step 1: Create a classic cluster

Create a standard (non-serverless) cluster and note its Databricks Runtime version. You will match the Sparkling Water build to its Spark version.

Step 2: Install the Sparkling Water library

On the cluster, go to Libraries → Install new → PyPI and install the build that matches your Spark version, for example h2o-pysparkling-3.5 for a Spark 3.5 runtime.
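One way to avoid a version mismatch is to derive the package name from the Spark version string. The sketch below uses a hypothetical helper, sparkling_water_package, and assumes the naming pattern implied by the example above (h2o-pysparkling-<major>.<minor>). It only builds the name; it does not check that a build for that Spark version is actually published, so confirm on PyPI before installing.

```python
def sparkling_water_package(spark_version: str) -> str:
    """Build a Sparkling Water PyPI package name from a Spark version string.

    Hypothetical helper: it assumes the 'h2o-pysparkling-<major>.<minor>'
    naming pattern and does not verify that such a build exists.
    """
    parts = spark_version.split(".")
    if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
        raise ValueError(f"Unexpected Spark version string: {spark_version!r}")
    major, minor = parts[0], parts[1]
    return f"h2o-pysparkling-{major}.{minor}"


# In a notebook you would pass spark.version instead of a literal.
print(sparkling_water_package("3.5.0"))  # h2o-pysparkling-3.5
```

In a notebook attached to the cluster, spark.version returns the version string to pass in.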

Step 3: Start the H2O context

In a notebook attached to the cluster:

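from pysparkling import H2OContext

hc = H2OContext.getOrCreate()  # start H2O on the cluster, or reuse a running context
hc                             # as the last line of a cell, displays the cluster summary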

This starts an H2O node inside the Spark executors and prints a summary of the H2O cluster: its name, the number of nodes, and the memory available.

Step 4: Move data between Spark and H2O

h2o_frame = hc.asH2OFrame(spark_df)         # Spark DataFrame to H2O Frame
spark_df_again = hc.asSparkFrame(h2o_frame) # H2O Frame back to Spark DataFrame

Prepare and clean your data with Spark, convert it to an H2O Frame, and train with H2O AutoML or its algorithms. Convert the predictions back to a Spark DataFrame when you want to write them out.
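The workflow above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function name train_and_score, the default label column "label", the 80/20 split, and the AutoML limits are all illustrative assumptions. It also assumes a classification problem and a Spark DataFrame that is already cleaned. It needs a classic cluster with Sparkling Water installed and a running H2OContext.

```python
def train_and_score(hc, spark_df, label_col="label", max_models=5):
    """Train H2O AutoML on a Spark DataFrame and return predictions as a Spark DataFrame.

    hc is a running H2OContext; label_col and max_models are illustrative
    defaults (assumptions), not values taken from the course.
    """
    # Imported here so the function can be defined before H2O is available.
    from h2o.automl import H2OAutoML

    # Spark DataFrame to H2O Frame
    frame = hc.asH2OFrame(spark_df)

    # Assumption: classification, so the label column becomes a factor (categorical).
    frame[label_col] = frame[label_col].asfactor()

    # Hold out 20% of the rows for scoring.
    train, test = frame.split_frame(ratios=[0.8], seed=42)

    # Every column except the label is used as a feature.
    features = [c for c in frame.columns if c != label_col]

    aml = H2OAutoML(max_models=max_models, max_runtime_secs=300, seed=42)
    aml.train(x=features, y=label_col, training_frame=train)

    # Score with the best model, then convert back to a Spark DataFrame.
    predictions = aml.leader.predict(test)
    return hc.asSparkFrame(predictions)
```

A call such as train_and_score(hc, spark_df) returns a Spark DataFrame that can be written out with the usual Spark writers.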

For the machine learning you can do on Free Edition itself, see the MLlib modules (6.0 onward), which run on serverless compute.

You might also want to explore H2O Wave, a web app framework for building AI web applications. If you are interested in building ML web applications, look into the Wave Git repo.

Previous: Module 6.5: Distributed ML with H2O