Data Partitioning Best Practices
Intermediate
Partition Spark DataFrames correctly — choose partition columns wisely, avoid data skew, set appropriate partition counts, and use repartition vs coalesce correctly.
File Patterns
**/*.scala, **/*.py, **/spark-defaults.conf
This rule applies to files matching the patterns above.
Rule Content
rule-content.md
# Data Partitioning Best Practices
## Rule
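Every Spark job MUST have intentional partitioning. Aim for a partition count of 2-4x the total number of executor cores. Avoid skewed partition keys. Use coalesce to reduce the partition count without a full shuffle, and repartition to increase it or to rebalance data (repartition always triggers a full shuffle).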
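To make the coalesce/repartition distinction concrete, here is a toy model in plain Python (no Spark required). It treats a dataset as a list of partitions. The functions `coalesce` and `repartition` below are hypothetical stand-ins named after the Spark methods, not Spark's implementation: real Spark also considers data locality and hashes rows when columns are given. The sketch only illustrates that coalesce merges existing partitions without redistributing rows, while repartition moves every row.

```python
from typing import List

Partitions = List[List[int]]


def coalesce(partitions: Partitions, n: int) -> Partitions:
    """Toy coalesce: merge neighbouring partitions, never split one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    current = len(partitions)
    if n >= current:
        # Like Spark's coalesce, this cannot increase the partition count.
        return [list(p) for p in partitions]
    merged = []
    for i in range(n):
        start = i * current // n
        end = (i + 1) * current // n
        merged.append([row for p in partitions[start:end] for row in p])
    return merged


def repartition(partitions: Partitions, n: int) -> Partitions:
    """Toy repartition: a full shuffle that deals rows out round-robin."""
    if n < 1:
        raise ValueError("n must be at least 1")
    out: Partitions = [[] for _ in range(n)]
    rows = [row for p in partitions for row in p]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# Four uneven partitions, e.g. left over after a selective filter.
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]

print([len(p) for p in coalesce(parts, 2)])     # sizes stay uneven
print([len(p) for p in repartition(parts, 2)])  # sizes are balanced
```

The trade-off this suggests: coalesce is cheap but inherits whatever imbalance already exists, so it fits best when the incoming partitions are roughly even.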
## Partition Count Guidelines
| Cluster Size | Recommended Partitions | Formula |
|-------------|----------------------|---------|
| 10 cores | 20-40 partitions | cores * 2-4 |
| 100 cores | 200-400 partitions | cores * 2-4 |
| 1000 cores | 2000-4000 partitions | cores * 2-4 |
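The table's formula can be written as a small helper. The sketch below is an illustration, not part of the rule: `partition_range` and `choose_partitions` are hypothetical names, and the second function adds one assumption, namely that the count should be raised when the core-based minimum would leave partitions above the 1 GB flag threshold listed under Enforcement.

```python
from typing import Tuple

GIB = 1024 ** 3


def partition_range(total_cores: int) -> Tuple[int, int]:
    """Return the (low, high) partition counts from the cores * 2-4 rule."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return 2 * total_cores, 4 * total_cores


def choose_partitions(total_cores: int, data_bytes: int,
                      max_partition_bytes: int = GIB) -> int:
    """Start at the low end of the range; go higher if partitions would be too big."""
    low, _high = partition_range(total_cores)
    # Ceiling division: the fewest partitions that keep each one under the cap.
    needed_for_size = -(-data_bytes // max_partition_bytes)
    return max(low, needed_for_size)


print(partition_range(100))
print(choose_partitions(total_cores=10, data_bytes=1024 * GIB))
```

For a small cluster reading a large dataset, the size cap dominates and the result lands well above cores * 4, which is why core count alone may not be enough to pick a number.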
## Good Examples
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Repartition before expensive operations
df = (
    spark.read.parquet("s3://data/events/")
    .repartition(200, "user_id")  # Hash partition on join key
)

# Coalesce before writing (reduce small files)
(
    df.filter(col("date") == "2024-01-01")
    .coalesce(10)  # At most 10 output files, no full shuffle
    .write
    .mode("overwrite")
    .parquet("s3://output/filtered/")
)

# Partition on write for query performance
# (assumes the data has "year" and "month" columns)
(
    df.write
    .partitionBy("year", "month")
    .mode("append")
    .parquet("s3://data/events_partitioned/")
)

# Check partition count
print(f"Partitions: {df.rdd.getNumPartitions()}")
```
```scala
import org.apache.spark.sql.functions.col

// Repartition before join to align partitions
// (assumes a SparkSession named `spark` is in scope)
val users = spark.read.parquet("users/").repartition(200, col("user_id"))
val orders = spark.read.parquet("orders/").repartition(200, col("user_id"))
val joined = users.join(orders, "user_id")
```
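For the partitionBy write shown in the Python example above, it helps to estimate how many files a write may produce. The sketch below models the default behavior as I understand it: each write task emits one file per distinct partition value among its rows (this ignores settings such as a per-file record limit). `count_output_files` and the sample data are hypothetical.

```python
from typing import Iterable, List


def count_output_files(tasks: Iterable[List[str]]) -> int:
    """Each task writes one file per distinct partition value it holds."""
    return sum(len(set(values)) for values in tasks)


months = [f"{m:02d}" for m in range(1, 13)]

# Case 1: 4 tasks, each holding rows from every month.
scattered = [list(months) for _ in range(4)]

# Case 2: the same number of rows, but clustered so each month lives in
# exactly one task (the effect of repartitioning by the partitionBy columns
# before writing).
clustered = [months[i::4] * 4 for i in range(4)]

print(count_output_files(scattered))
print(count_output_files(clustered))
```

In the worst case the file count is tasks times distinct values, so clustering rows by the partition columns before the write can cut it sharply, at the cost of larger individual tasks.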
## Bad Examples
```python
# BAD: Default shuffle partitions (spark.sql.shuffle.partitions = 200) regardless of data size
df = spark.read.parquet("s3://huge-dataset/")
df.groupBy("user_id").count()  # 200 shuffle partitions for 1TB?

# BAD: repartition(1) for small output; use coalesce
df.repartition(1).write.parquet("output/")  # Full shuffle!
# Use: df.coalesce(1).write.parquet("output/")

# BAD: Skewed partition key
df.repartition("country")  # US partition is 100x larger than others

# BAD: Too many partitions on write
df.write.partitionBy("user_id").parquet("output/")
# Millions of user_ids = millions of tiny directories
```
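One common mitigation for a skewed key such as `country` above is salting: appending a small rotating suffix so the hot key is split into several keys that hash to different partitions. The simulation below is a minimal sketch in plain Python, not Spark code. It uses CRC32 as a stand-in hash (Spark uses Murmur3), and the helper names, partition count, salt count, and data are all hypothetical.

```python
import zlib
from collections import Counter
from typing import List

NUM_PARTITIONS = 8
NUM_SALTS = 8


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stand-in for a hash partitioner (Spark itself uses Murmur3, not CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys: List[str]) -> List[int]:
    """Count how many rows land in each partition."""
    counts = Counter(partition_of(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


def add_salt(keys: List[str], num_salts: int = NUM_SALTS) -> List[str]:
    """Append a rotating salt so one hot key becomes num_salts distinct keys."""
    return [f"{key}_{i % num_salts}" for i, key in enumerate(keys)]


# Hypothetical data: "US" has 100x the rows of each other country.
keys = ["US"] * 800 + ["DE"] * 8 + ["FR"] * 8 + ["JP"] * 8

unsalted = partition_sizes(keys)
salted = partition_sizes(add_salt(keys))

print("unsalted max partition:", max(unsalted))
print("salted max partition:  ", max(salted))
```

Without salt, every US row hashes to the same partition. Salting spreads them out, but any aggregation on the salted key then needs a second step that merges the per-salt results, and a join needs the other side replicated once per salt value.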
## Enforcement
- Monitor partition sizes in Spark UI — flag partitions > 1GB
- Alert on tasks taking 10x longer than average (skew indicator)
- Set spark.sql.shuffle.partitions based on cluster size
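The first two checks can be expressed as simple threshold functions. This is a minimal sketch under stated assumptions: the per-task durations and partition sizes are supplied as plain lists (in practice they would have to be collected from the Spark UI or another metrics source), and `slow_tasks` and `oversized_partitions` are hypothetical helper names.

```python
from statistics import mean
from typing import List, Sequence

GIB = 1024 ** 3


def slow_tasks(durations_s: Sequence[float], factor: float = 10.0) -> List[int]:
    """Indices of tasks that ran longer than factor x the average duration."""
    if not durations_s:
        return []
    threshold = factor * mean(durations_s)
    return [i for i, d in enumerate(durations_s) if d > threshold]


def oversized_partitions(sizes_bytes: Sequence[int],
                         limit_bytes: int = GIB) -> List[int]:
    """Indices of partitions larger than the size limit (1 GiB by default)."""
    return [i for i, size in enumerate(sizes_bytes) if size > limit_bytes]


# Hypothetical metrics for one stage: 99 quick tasks and one straggler.
durations = [1.0] * 99 + [50.0]
sizes = [200 * 1024 ** 2] * 99 + [3 * GIB]

print("slow tasks:", slow_tasks(durations))
print("oversized partitions:", oversized_partitions(sizes))
```

One caveat with the average-based rule: the straggler itself raises the average, so a stage with ten or fewer tasks can never exceed 10x the mean. Comparing against the median instead may be a more robust choice for small stages.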