Apache Spark Rules
Apache Spark commands for distributed data processing, SQL queries, streaming, and large-scale analytics.
3 rules
DataFrame Caching and Persistence Rules
Intermediate
Cache Spark DataFrames strategically: cache only DataFrames that are reused, choose the right storage level, unpersist when done, and monitor memory usage to prevent spills.
globs: **/*.scala, **/*.py, **/spark-defaults.conf
caching, persistence, memory-management, storage-levels
Spark Configuration Standards
Advanced
Configure Spark applications properly: memory allocation, executor sizing, shuffle settings, and serialization to maximize performance and prevent out-of-memory failures.
globs: **/*.scala, **/*.py, **/spark-defaults.conf, **/spark-submit*
configuration, memory, executors, shuffle
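A spark-defaults.conf fragment touching each area this rule names (memory, executor sizing, shuffle, serialization). Every value below is an assumption to tune for your cluster and workload, not a recommendation.

```
# Illustrative values only -- size to your cluster before use.
spark.executor.memory            8g
spark.executor.memoryOverhead    1g
spark.executor.cores             4
spark.executor.instances         10
spark.sql.shuffle.partitions     200
spark.serializer                 org.apache.spark.serializer.KryoSerializer
```

The same keys can be passed to spark-submit with `--conf key=value`, which takes precedence over spark-defaults.conf.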
Data Partitioning Best Practices
Intermediate
Partition Spark DataFrames correctly: choose partition columns wisely, avoid data skew, set appropriate partition counts, and know when to repartition versus coalesce.
globs: **/*.scala, **/*.py, **/spark-defaults.conf
partitioning, repartition, coalesce, data-skew