Apache Spark Rules
Apache Spark commands for distributed data processing, SQL queries, streaming, and large-scale analytics.
3 rules
DataFrame Caching and Persistence Rules
Intermediate
Cache Spark DataFrames strategically: cache only DataFrames that are reused, choose the right storage level, unpersist when done, and monitor memory usage to prevent spills.
globs: **/*.scala, **/*.py, **/spark-defaults.conf
caching, persistence, memory-management, storage-levels
Spark Configuration Standards
Advanced
Configure Spark applications properly: memory allocation, executor sizing, shuffle settings, and serialization to maximize performance and prevent out-of-memory failures.
globs: **/*.scala, **/*.py, **/spark-defaults.conf, **/spark-submit*
configuration, memory, executors, shuffle
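A spark-defaults.conf fragment touching each area this rule names (memory, executor sizing, shuffle, serialization). Every value below is an assumption to tune for your cluster and workload, not a recommendation.

```
# Illustrative values only -- size to your cluster before use.
spark.executor.memory            8g
spark.executor.memoryOverhead    1g
spark.executor.cores             4
spark.executor.instances         10
spark.sql.shuffle.partitions     200
spark.serializer                 org.apache.spark.serializer.KryoSerializer
```

The same keys can be passed to spark-submit with `--conf key=value`, which takes precedence over spark-defaults.conf.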
Data Partitioning Best Practices
Intermediate
Partition Spark DataFrames correctly: choose partition columns wisely, avoid data skew, set appropriate partition counts, and know when to repartition versus coalesce.
globs: **/*.scala, **/*.py, **/spark-defaults.conf
partitioning, repartition, coalesce, data-skew