PySpark - Module 0.1: Getting Started on Databricks

Getting Started: Your Databricks Account

This course runs on Databricks Free Edition. It is free, runs in your browser, and needs no local installation. There is no Java or PATH to configure. Do this before the first session so we can start with Spark, not setup.

Step 1: Sign up and choose Free Edition

Start the sign-up at databricks.com. You will be asked what you will use Databricks for. Pick “For personal use” and click “Get Free Edition”.

This is the one step where it is easy to go wrong. Do not pick “For work”: that starts a 14-day trial that expires. Free Edition does not expire.

Choose For personal use, then Get Free Edition

Step 2: Skip the onboarding questions

Databricks asks what you want to do first. Answer it or click Skip; either way, your access is the same.

Onboarding question, safe to skip

Step 3: You are in

You land in your own workspace, with notebooks, compute, and the Jobs and Pipelines tools we use through the course.

The Databricks Free Edition workspace

Get to know the workspace

This course teaches the distributed computing ideas behind Spark. It does not spend time on where the buttons are, because Databricks already teaches that well and keeps it current.

Before the first session, spend twenty minutes in Databricks’ own onboarding. In your workspace, open Learn in the left sidebar, or the Get started panel on the home page. You will find short courses and interactive tours:

  • A Workspace Tour of the home page, navigation, Catalog, and Compute.
  • Get Started with Databricks Free Edition, which walks through Unity Catalog, uploading data, and the SQL editor.
  • Short paths for Machine Learning, Data Engineering, and AI/BI, plus interactive demos.

Think of it as a division of labour: Databricks shows you how to drive the workspace, and this course explains what is happening underneath when your code runs across a cluster.

Step 4: Confirm it works

Create a new notebook (New → Notebook) and run this in the first cell:

print(spark.range(5).count())

If it prints 5, Spark is running on serverless compute and you are ready. You never start a cluster yourself; Free Edition gives you serverless compute on demand.
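If you would like the same check in a reusable form, here is a minimal sketch. The helper name `smoke_test` is made up for illustration; `spark` is the session that Databricks notebooks predefine, and the guard at the bottom simply skips the call when the code runs somewhere without it.

```python
def smoke_test(session, n=5):
    """Return True if the session can build an n-row DataFrame and count it."""
    # range(n) creates a single-column DataFrame with ids 0..n-1;
    # count() forces Spark to actually run a job.
    return session.range(n).count() == n


# `spark` is predefined in Databricks notebooks. The guard keeps this cell
# from failing if you paste it somewhere that has no session.
if "spark" in globals():
    print(smoke_test(spark))
```

Passing the session in as an argument, rather than reading the global inside the function, keeps the helper easy to reuse with any session object.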

Stuck on any step? Bring it to the first session and we will sort it out together.

What Free Edition runs, and what it does not

Free Edition gives you serverless compute, which runs on Spark Connect. That powers most of this course well, but a few things need a different setup. Here is a clear map of what runs where, so you do not lose time on something that cannot work here.
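Before the map, one way to see the Spark Connect point for yourself is to look at where the session class is defined. This is a rough heuristic, not an official check: it assumes that Spark Connect sessions are defined under the `pyspark.sql.connect` package, and the helper name `uses_spark_connect` is hypothetical.

```python
def uses_spark_connect(session):
    """Heuristic: True if the session class is defined in a Spark Connect module."""
    # Assumption: Spark Connect sessions live under pyspark.sql.connect,
    # while classic sessions live under pyspark.sql.session.
    module_name = type(session).__module__
    return module_name.startswith("pyspark.sql.connect")


# `spark` is predefined in Databricks notebooks; skip the call elsewhere.
if "spark" in globals():
    print(type(spark).__module__)
    print(uses_spark_connect(spark))
```

If the printed module name looks different in your workspace, trust what you see there over this sketch.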

Runs well on Free Edition serverless: DataFrames, Spark SQL, and Structured Streaming. This is the core of the course and of the Spark certification.

Capability | On Free Edition | What to do
RDD API (sparkContext, .rdd) | Not available | Use the DataFrame API. RDDs are not exposed on Spark Connect.
MLlib (pyspark.ml) | Works on environment version 4 | Switch the notebook to environment version 4. See Module 6 for the setup and its limits.
H2O Sparkling Water | Not available | It needs JVM access on the executors, which serverless blocks. Use classic compute (a Databricks trial or paid workspace with a standard cluster), or use MLlib instead.
Anything reaching the Spark JVM | Not available | Use DataFrame and Spark SQL, or move to classic compute.

If you need the full Spark engine (RDDs, Sparkling Water, large models), create classic compute in a Databricks trial or paid workspace.
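To make the RDD limitation concrete, the sketch below does two things. It probes whether `sparkContext` is reachable on the current session, and it shows a DataFrame version of a job that older tutorials often write with an RDD `map` followed by `sum`. Both helper names (`rdd_api_available`, `doubled_sum`) are made up for illustration, and the probe catches any exception because the exact error type raised on serverless is not something this page specifies.

```python
def rdd_api_available(session):
    """Return True if session.sparkContext can be reached, else False."""
    try:
        # On Spark Connect sessions this attribute access is expected to raise.
        session.sparkContext
        return True
    except Exception:
        return False


def doubled_sum(session, n=5):
    """DataFrame version of an RDD-style map(lambda x: x * 2) followed by sum()."""
    # selectExpr takes a SQL expression string, so no extra imports are needed.
    row = session.range(n).selectExpr("sum(id * 2) AS total").first()
    return row["total"]


# `spark` is predefined in Databricks notebooks; skip the calls elsewhere.
if "spark" in globals():
    print(rdd_api_available(spark))
    print(doubled_sum(spark))
```

With n=5 the ids are 0 through 4, so `doubled_sum` returns 20. The same pattern applies more widely: express the per-row logic as a column expression and let the DataFrame API run it.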

Working toward the certification

If you want a credential, this elective maps closely to the Databricks Certified Associate Developer for Apache Spark (Python, DataFrame-focused). The course covers Spark architecture, the DataFrame API and Spark SQL, data sources and sinks, and Structured Streaming, which is most of the exam. To be fully exam-ready, also work through the Spark Connect section in Module 2 and the Tuning Essentials page.