PySpark - Tuning Essentials
Tuning Essentials
Most Spark code that is slow is not slow because Spark is slow. It is slow because of how the work is laid out: a shuffle you did not need, a partition count that does not fit the cluster, or a DataFrame recomputed three times. This page covers the few tools that catch most of it. The certification asks about all of them.
Read the plan with explain()
Before you guess why something is slow, look at what Spark will actually do. explain() prints the physical plan.
df.groupBy("country").count().explain()
Two things to look for:
Exchangeis a shuffle: data moving across the network to line up keys. Shuffles are the usual reason a job is slow. Ajoinor agroupBythat you expected to be cheap but shows anExchangeover a large input is worth a second look.BroadcastHashJoinmeans Spark broadcast the smaller side instead of shuffling it. That is what you want for a join against a small table. If you see a plainSortMergeJoinagainst a tiny table, a broadcast hint will help (see the AI-review exercise).
On serverless you also have the query profile, opened from a cell’s output or the Query History. It shows each stage’s time, rows, and memory, and flags spills to disk and skew. It is the first place to look when a query is slow.
Partitioning
A DataFrame is split into partitions, and Spark runs one task per partition. Too few partitions and the cluster sits idle; too many and you drown in task overhead.
df.repartition(8) # full shuffle to exactly 8 partitions, evens out skew
df.coalesce(2) # merge down to 2 partitions without a full shuffle
Use repartition to increase partitions or rebalance skew (it shuffles). Use coalesce to reduce partitions cheaply, for example before writing a small result so you do not produce hundreds of tiny files.
When you write, partition the output by a column you filter on often:
df.write.mode("overwrite").partitionBy("country").parquet(path)
A later query that filters country = 'nl' then reads only that folder, not the whole dataset. This is partition pruning, and on large tables it is one of the biggest wins available.
Caching, and why serverless does it for you
If you use the same DataFrame in several actions, Spark recomputes it from scratch each time unless you cache it. On classic compute you cache it yourself:
df.cache() # keep df in memory for reuse
df.count() # first action materialises the cache
df.count() # second action reads the cache
df.unpersist() # release it when done
cache() and persist() are not available (the call returns NOT_SUPPORTED_WITH_SERVERLESS). Serverless manages caching for you, so you do not call it by hand. Reach for manual caching on classic compute, where you control the memory.Key questions you can now answer
- What does an
Exchangein a query plan tell you, and why does it matter? - When do you use
repartitionversuscoalesce? - What is partition pruning, and how does
partitionByon write enable it? - Why would you cache a DataFrame, and where can you actually do it?
- Where do you look first when a query on serverless is slow?