PySpark - Tuning Essentials

PySpark - Tuning Essentials

Tuning Essentials

Most Spark code that is slow is not slow because Spark is slow. It is slow because of how the work is laid out: a shuffle you did not need, a partition count that does not fit the cluster, or a DataFrame recomputed three times. This page covers the few tools that catch most of it. The certification asks about all of them.

Read the plan with explain()

Before you guess why something is slow, look at what Spark will actually do. explain() prints the physical plan.

df.groupBy("country").count().explain()

Two things to look for:

  • Exchange is a shuffle: data moving across the network to line up keys. Shuffles are the usual reason a job is slow. A join or a groupBy that you expected to be cheap but shows an Exchange over a large input is worth a second look.
  • BroadcastHashJoin means Spark broadcast the smaller side instead of shuffling it. That is what you want for a join against a small table. If you see a plain SortMergeJoin against a tiny table, a broadcast hint will help (see the AI-review exercise).

On serverless you also have the query profile, opened from a cell’s output or the Query History. It shows each stage’s time, rows, and memory, and flags spills to disk and skew. It is the first place to look when a query is slow.

Partitioning

A DataFrame is split into partitions, and Spark runs one task per partition. Too few partitions and the cluster sits idle; too many and you drown in task overhead.

df.repartition(8)     # full shuffle to exactly 8 partitions, evens out skew
df.coalesce(2)        # merge down to 2 partitions without a full shuffle

Use repartition to increase partitions or rebalance skew (it shuffles). Use coalesce to reduce partitions cheaply, for example before writing a small result so you do not produce hundreds of tiny files.

When you write, partition the output by a column you filter on often:

df.write.mode("overwrite").partitionBy("country").parquet(path)

A later query that filters country = 'nl' then reads only that folder, not the whole dataset. This is partition pruning, and on large tables it is one of the biggest wins available.

Caching, and why serverless does it for you

If you use the same DataFrame in several actions, Spark recomputes it from scratch each time unless you cache it. On classic compute you cache it yourself:

df.cache()        # keep df in memory for reuse
df.count()        # first action materialises the cache
df.count()        # second action reads the cache
df.unpersist()    # release it when done
On Free Edition serverless, manual cache() and persist() are not available (the call returns NOT_SUPPORTED_WITH_SERVERLESS). Serverless manages caching for you, so you do not call it by hand. Reach for manual caching on classic compute, where you control the memory.

Key questions you can now answer

  • What does an Exchange in a query plan tell you, and why does it matter?
  • When do you use repartition versus coalesce?
  • What is partition pruning, and how does partitionBy on write enable it?
  • Why would you cache a DataFrame, and where can you actually do it?
  • Where do you look first when a query on serverless is slow?