Spark Configuration Standards

Advanced

Configure Spark applications explicitly (memory allocation, executor sizing, shuffle settings, and serialization) to maximize performance and prevent out-of-memory failures.

File Patterns

**/*.scala, **/*.py, **/spark-defaults.conf, **/spark-submit*

This rule applies to files matching the patterns above.

Rule Content

rule-content.md

# Spark Configuration Standards

## Rule
Every Spark application MUST have explicit memory, executor, and shuffle configuration. Never rely on defaults for production workloads. Size executors based on data volume and cluster capacity.
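
The rule can be checked mechanically by reporting which required properties a configuration leaves unset. The following is a minimal sketch, not a Spark API: `REQUIRED_KEYS` and `missing_required_keys` are hypothetical names, and the key list is an assumption drawn from the memory, executor, and shuffle properties this rule names.

```python
# Hypothetical helper: these names are illustrative and not part of Spark.
REQUIRED_KEYS = (
    "spark.executor.memory",
    "spark.executor.cores",
    "spark.executor.memoryOverhead",
    "spark.sql.shuffle.partitions",
)


def missing_required_keys(conf):
    """Return the required keys that `conf` (a dict of Spark properties) leaves unset."""
    return [key for key in REQUIRED_KEYS if not str(conf.get(key, "")).strip()]
```

In PySpark, the input dict could plausibly be built with `dict(spark.sparkContext.getConf().getAll())`. A non-empty result means the job is relying on at least one default.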

## Key Configuration
| Property | Purpose | Guideline |
|----------|---------|-----------|
| spark.executor.memory | Executor heap | 4-8GB typical |
| spark.executor.cores | Cores per executor | 4-5 cores |
| spark.executor.memoryOverhead | Non-heap memory per executor | At least 10% of executor.memory (384 MiB minimum) |
| spark.sql.shuffle.partitions | Shuffle partitions | 2-4x total cores |
| spark.serializer | Serialization | KryoSerializer |
| spark.sql.adaptive.enabled | AQE | true (always) |
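
The guidelines in the table can be turned into starting values with a little arithmetic. The sketch below assumes whole-GiB executor memory, a 10% overhead with a 384 MiB floor, and a shuffle partition count of `partitions_per_core` times total cores (2 by default, the low end of the 2-4x range). The function name and parameters are hypothetical, and the output is a starting point for tuning rather than a definitive sizing.

```python
MIN_OVERHEAD_MIB = 384  # floor for executor memory overhead


def suggest_conf(num_executors, executor_cores, executor_memory_gib, partitions_per_core=2):
    """Derive starting Spark properties from executor count, cores, and heap size (whole GiB)."""
    total_cores = num_executors * executor_cores
    # 10% of the heap in MiB, rounded up with integer arithmetic.
    ten_percent_mib = (executor_memory_gib * 1024 + 9) // 10
    overhead_mib = max(MIN_OVERHEAD_MIB, ten_percent_mib)
    return {
        "spark.executor.instances": str(num_executors),
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": f"{executor_memory_gib}g",
        "spark.executor.memoryOverhead": f"{overhead_mib}m",
        "spark.sql.shuffle.partitions": str(total_cores * partitions_per_core),
    }
```

For example, `suggest_conf(50, 4, 8)` gives 200 total cores and therefore 400 shuffle partitions, with an 820 MiB overhead.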

## Good Examples
```python
# SparkSession config (the same keys can also be set in spark-defaults.conf)
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily-etl")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memoryOverhead", "1g")
    .config("spark.sql.shuffle.partitions", "400")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .config("spark.sql.files.maxPartitionBytes", "128mb")
    .getOrCreate()
)
```

```bash
# spark-submit with explicit configuration
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 50 \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.adaptive.enabled=true \
  etl_job.py
```
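
When submissions are scripted, generating the command from a single dictionary of properties keeps the configuration explicit and easy to log. This is a minimal sketch: `build_spark_submit` is a hypothetical helper, and it passes every property as `--conf key=value` instead of using the dedicated flags such as `--executor-memory`.

```python
import shlex


def build_spark_submit(app, conf, master="yarn", deploy_mode="cluster"):
    """Return the spark-submit argument list for `app` with every property passed via --conf."""
    args = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key in sorted(conf):  # sorted so the generated command is deterministic
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


if __name__ == "__main__":
    command = build_spark_submit(
        "etl_job.py",
        {"spark.executor.memory": "8g", "spark.executor.cores": "4"},
    )
    print(shlex.join(command))  # shell-quoted form, suitable for logs
```

Returning a list instead of a string means it can be handed to `subprocess.run` without shell quoting issues.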

## Bad Examples
```python
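from pyspark.sql import SparkSession

# BAD: No configuration, all defaults
spark = SparkSession.builder.appName("my-job").getOrCreate()
# Default 1g executor memory, 200 shuffle partitions: wrong for most workloads

# Executor settings are fixed at launch, so they belong on the builder (or on
# spark-submit); spark.conf.set() on a running session does not resize executors.

# BAD: Too much memory per executor
builder = SparkSession.builder.config("spark.executor.memory", "64g")
# Very large heaps risk long GC pauses; use more executors with less memory each

# BAD: 1 core per executor
builder = SparkSession.builder.config("spark.executor.cores", "1")
# No task parallelism within the executor; per-executor overhead is wasted
```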
```
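
The two sizing anti-patterns above can be flagged automatically during review. In this sketch the thresholds (`MAX_EXECUTOR_MEMORY_GIB`, `MIN_EXECUTOR_CORES`) are assumed review limits, not Spark limits, and the helper names are hypothetical. The parser handles only the `m`/`mb`, `g`/`gb`, and `t`/`tb` suffixes.

```python
import re

MAX_EXECUTOR_MEMORY_GIB = 32  # assumed review threshold, not a Spark limit
MIN_EXECUTOR_CORES = 2        # assumed review threshold, not a Spark limit

_UNIT_TO_GIB = {"m": 1 / 1024, "mb": 1 / 1024, "g": 1, "gb": 1, "t": 1024, "tb": 1024}


def memory_to_gib(value):
    """Convert a size string such as '8g' or '2048m' to GiB."""
    match = re.fullmatch(r"(\d+)\s*([a-z]+)", value.strip().lower())
    if match is None or match.group(2) not in _UNIT_TO_GIB:
        raise ValueError(f"unsupported memory string: {value!r}")
    return int(match.group(1)) * _UNIT_TO_GIB[match.group(2)]


def flag_antipatterns(conf):
    """Return labels for executor sizing anti-patterns found in a dict of Spark properties."""
    findings = []
    memory = conf.get("spark.executor.memory")
    if memory is not None and memory_to_gib(memory) > MAX_EXECUTOR_MEMORY_GIB:
        findings.append("oversized_executor_memory")
    cores = conf.get("spark.executor.cores")
    if cores is not None and int(cores) < MIN_EXECUTOR_CORES:
        findings.append("too_few_executor_cores")
    return findings
```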

## Enforcement
- Require configuration review for production job submissions
- Monitor GC time in the Spark UI (more than 10% of task time suggests memory pressure)
- Alert on spill to disk (usually indicates insufficient execution memory or too few shuffle partitions)
- Log and audit configuration for all production jobs
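
The GC and spill checks can be automated once the relevant metrics are exported. The sketch below assumes the run time, GC time, and spilled bytes for a stage have already been collected, for instance from the Spark UI or its monitoring REST API. The function name and finding labels are hypothetical. The 10% threshold is the one given in the list above.

```python
GC_FRACTION_THRESHOLD = 0.10  # GC time above 10% of run time is treated as memory pressure


def audit_stage_metrics(run_time_ms, gc_time_ms, disk_bytes_spilled):
    """Return a list of finding labels for one stage's aggregated metrics."""
    findings = []
    if run_time_ms > 0 and gc_time_ms / run_time_ms > GC_FRACTION_THRESHOLD:
        findings.append("gc_pressure")
    if disk_bytes_spilled > 0:
        findings.append("disk_spill")
    return findings
```

A stage that ran for 1000 ms with 150 ms of GC and no spill would yield `["gc_pressure"]`. An empty list means neither condition was triggered.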
