Apache Spark
Apache Spark commands for distributed data processing, SQL queries, streaming, and large-scale analytics.
42 commands
Install Apache Spark (macOS)
Install Apache Spark using Homebrew on macOS.
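On macOS, the standard route is the Homebrew formula:

```shell
# Install Apache Spark (and its Java dependency) via Homebrew
brew install apache-spark
```

Homebrew puts `spark-submit`, `spark-shell`, and the other launcher scripts on your `PATH`.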
Check Spark version
Display the installed Apache Spark version.
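Any of the launcher scripts accepts `--version`; `spark-submit` is the usual choice:

```shell
# Print the Spark version banner, Scala version, and build info
spark-submit --version
```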
Launch PySpark shell
Start an interactive PySpark shell for running Spark with Python.
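With no arguments, the shell starts against a local master and pre-creates a `SparkSession` named `spark`:

```shell
# Interactive Python shell with a ready-made SparkSession
pyspark
```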
Submit Python application
Submit a PySpark application to the cluster.
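A minimal submission looks like this; `app.py` is a placeholder for your script:

```shell
# Run a Python application with spark-submit (local master by default)
spark-submit app.py
```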
Submit with master
Submit to a specific Spark standalone master.
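The standalone master URL uses the `spark://` scheme; the host name below is a placeholder:

```shell
# Target a specific standalone master (default master port is 7077)
spark-submit --master spark://master-host:7077 app.py
```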
Submit to YARN cluster
Submit to YARN with driver running on cluster.
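With `--deploy-mode cluster` the driver runs inside the YARN cluster instead of on the submitting machine:

```shell
# Driver runs as a YARN container; the client can disconnect after submission
spark-submit --master yarn --deploy-mode cluster app.py
```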
Configure executor resources
Set executor memory, cores, and count.
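A typical sizing might look like the following; note that `--num-executors` applies to YARN (on standalone you would cap with `--total-executor-cores` instead):

```shell
# 10 executors, each with 4 GiB of heap and 2 cores
spark-submit --master yarn \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 10 \
  app.py
```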
Submit JAR application
Submit Scala/Java application with main class.
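For a JAR, `--class` names the entry point; the class and JAR names below are placeholders:

```shell
# Run the main() of com.example.Main packaged in app.jar
spark-submit --class com.example.Main --master yarn app.jar
```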
Add dependencies
Include Maven packages as dependencies.
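`--packages` takes Maven coordinates (`groupId:artifactId:version`) and resolves them at launch; the Kafka connector here is just an example:

```shell
# Pull the Kafka SQL connector from Maven Central at submit time
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 app.py
```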
Add Python files
Include additional Python files or archives.
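`--py-files` ships extra modules, zips, or eggs to the executors; the file names are placeholders:

```shell
# Distribute helper modules alongside the main script
spark-submit --py-files deps.zip,utils.py app.py
```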
Enable dynamic allocation
Auto-scale executors based on workload.
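A sketch of the relevant settings; on standalone/YARN, dynamic allocation traditionally also requires the external shuffle service:

```shell
# Scale executors between 1 and 20 based on pending tasks
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.shuffle.service.enabled=true \
  app.py
```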
Submit to Kubernetes
Submit to Kubernetes cluster.
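The master URL prefixes the API server address with `k8s://`; the API server host, image name, and JAR path are placeholders:

```shell
# Driver and executors run as pods built from the given container image
spark-submit \
  --master k8s://https://k8s-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=my-spark-image:latest \
  local:///opt/spark/app.jar
```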
Start Scala shell
Launch interactive Scala REPL with Spark context.
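The Scala REPL comes with `spark` (a `SparkSession`) and `sc` (the `SparkContext`) predefined:

```shell
# Interactive Scala shell with Spark context initialized
spark-shell
```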
Start PySpark shell
Launch interactive Python shell with Spark context.
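As above, the Python shell is a single command and accepts the same flags as `spark-submit`:

```shell
# Interactive Python shell; pass --master, --conf, etc. as needed
pyspark
```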
PySpark with Jupyter
Launch PySpark with Jupyter notebook interface.
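A common pattern is to point PySpark's driver Python at Jupyter via environment variables:

```shell
# Open a Jupyter notebook whose kernels have a SparkSession available
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook pyspark
```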
Shell with custom master
Connect shell to specific Spark master.
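The shell takes the same `--master` flag as `spark-submit`; the host name is a placeholder:

```shell
# Attach the REPL to a running standalone cluster
spark-shell --master spark://master-host:7077
```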
Shell with packages
Start shell with additional Maven dependencies.
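Dependencies resolve at startup and are available on the REPL classpath; the Avro connector is illustrative:

```shell
# Fetch the spark-avro connector from Maven Central for this session
spark-shell --packages org.apache.spark:spark-avro_2.12:3.5.0
```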
Shell with memory config
Start shell with custom memory settings.
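Memory flags work the same way as for submitted applications:

```shell
# 4 GiB for the driver (the REPL itself) and each executor
spark-shell --driver-memory 4g --executor-memory 4g
```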
Start SparkR shell
Launch interactive R shell with Spark context.
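SparkR ships with its own launcher (note that SparkR is deprecated in recent Spark releases):

```shell
# Interactive R shell with a SparkSession initialized
sparkR
```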
Start Spark SQL CLI
Launch interactive SQL command line interface.
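The SQL CLI gives you a `spark-sql>` prompt for running statements directly:

```shell
# Interactive Spark SQL command-line interface
spark-sql
```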
Execute SQL file
Execute SQL statements from a file.
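`-f` runs a script of semicolon-separated statements; the file name is a placeholder:

```shell
# Run all statements in queries.sql and exit
spark-sql -f queries.sql
```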
Execute inline SQL
Execute a SQL statement directly.
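`-e` takes a quoted statement; the table name is a placeholder:

```shell
# Run one statement non-interactively
spark-sql -e "SELECT COUNT(*) FROM my_table"
```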
Connect to Hive metastore
Enable Hive metastore support for table metadata.
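Hive support is usually configured by placing `hive-site.xml` in `$SPARK_HOME/conf`; the catalog implementation can also be set explicitly:

```shell
# Use the Hive catalog for table metadata (reads hive-site.xml if present)
spark-sql --conf spark.sql.catalogImplementation=hive
```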
Set database
Start CLI with specific database selected.
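The CLI accepts a `--database` option; the database name is a placeholder:

```shell
# Equivalent to running USE sales_db on startup
spark-sql --database sales_db
```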
Enable adaptive query
Enable Adaptive Query Execution for optimization.
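AQE re-optimizes query plans at runtime using shuffle statistics (it is on by default since Spark 3.2, so this mainly matters on older versions or where it was disabled):

```shell
# Turn on Adaptive Query Execution
spark-sql --conf spark.sql.adaptive.enabled=true
```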
Set shuffle partitions
Configure number of shuffle partitions.
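The default is 200 partitions; tune it up for large shuffles or down for small data:

```shell
# Use 400 partitions for shuffles in SQL/DataFrame operations
spark-submit --conf spark.sql.shuffle.partitions=400 app.py
```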
Set config via CLI
Pass configuration options via command line.
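Any Spark property can be set with repeated `--conf key=value` flags; the values here are illustrative:

```shell
# Set arbitrary Spark properties at submit time
spark-submit \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.network.timeout=300s \
  app.py
```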
Use properties file
Load configuration from a properties file.
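The file uses the same `key value`/`key=value` format as `conf/spark-defaults.conf`; the file name is a placeholder:

```shell
# Load default properties from a custom file instead of spark-defaults.conf
spark-submit --properties-file my-spark.conf app.py
```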
Enable Kryo serializer
Use the faster Kryo serializer instead of default Java serialization.
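Kryo is enabled by pointing `spark.serializer` at the Kryo implementation:

```shell
# Serialize shuffled/cached data with Kryo instead of Java serialization
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  app.py
```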
Configure memory fraction
Tune execution vs storage memory balance.
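The values below are the documented defaults, shown explicitly as a starting point for tuning:

```shell
# fraction: share of heap for execution+storage; storageFraction: storage's share of that
spark-submit \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.5 \
  app.py
```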
Enable speculation
Enable speculative execution for straggler mitigation.
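Speculation relaunches slow tasks on other executors and keeps whichever copy finishes first:

```shell
# Re-run straggler tasks speculatively
spark-submit --conf spark.speculation=true app.py
```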
Set driver memory
Configure driver resource allocation.
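A sketch of driver sizing; note `--driver-cores` applies in cluster deploy mode:

```shell
# Give the driver 8 GiB of memory and 4 cores (cluster mode)
spark-submit --deploy-mode cluster --driver-memory 8g --driver-cores 4 app.py
```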
Configure broadcast threshold
Set threshold for automatic broadcast joins.
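Tables smaller than the threshold are broadcast to all executors for joins; the default is 10 MB, and `-1` disables broadcasting entirely:

```shell
# Broadcast join tables up to 50 MB
spark-submit --conf spark.sql.autoBroadcastJoinThreshold=50MB app.py
```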
Start standalone master
Start Spark standalone cluster master.
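The master is started from the `sbin` scripts; it prints its `spark://host:7077` URL and serves a web UI on port 8080 by default:

```shell
# Start the standalone master on this machine
$SPARK_HOME/sbin/start-master.sh
```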
Start worker
Start worker and connect to master.
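The worker takes the master's URL as its argument; the host name is a placeholder (older releases call this script `start-slave.sh`):

```shell
# Register this machine as a worker with the given master
$SPARK_HOME/sbin/start-worker.sh spark://master-host:7077
```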
Start all workers
Start workers on all machines listed in the conf/workers file.
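This relies on passwordless SSH to each host listed in `conf/workers` (older releases name the script `start-slaves.sh` and the file `conf/slaves`):

```shell
# Launch a worker on every host in conf/workers
$SPARK_HOME/sbin/start-workers.sh
```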
Stop cluster
Stop master and all workers.
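The matching stop script tears down the whole cluster:

```shell
# Stop the master and all workers started via the sbin scripts
$SPARK_HOME/sbin/stop-all.sh
```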
Submit with supervise
Enable automatic driver restart on failure.
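On a standalone cluster, `--supervise` requires cluster deploy mode; the master and JAR names are placeholders:

```shell
# Restart the driver automatically if it exits with a non-zero status
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --supervise \
  app.jar
```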
Kill application
Kill a running application on standalone cluster.
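For cluster-mode drivers on standalone, `spark-submit --kill` takes the driver/submission ID (the ID and the REST port 6066 below are placeholders; the ID is shown in the master UI):

```shell
# Kill a driver by its submission ID via the standalone REST gateway
spark-submit --master spark://master-host:6066 --kill driver-20240101000000-0000
```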
YARN queue submission
Submit to specific YARN queue.
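The queue name is a placeholder; `--queue` maps the application onto the YARN scheduler queue:

```shell
# Submit into the "production" YARN queue
spark-submit --master yarn --queue production app.py
```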
K8s with service account
Submit to Kubernetes with service account.
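The service account grants the driver pod permission to create executor pods; names below are placeholders:

```shell
# Run the driver under the "spark" Kubernetes service account
spark-submit \
  --master k8s://https://k8s-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=my-spark-image:latest \
  local:///opt/spark/app.jar
```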
Check cluster status
Query Spark master REST API for cluster status.
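The standalone master's web UI exposes a JSON view of workers and applications; the host name is a placeholder:

```shell
# Fetch cluster state (workers, running/completed apps) as JSON
curl http://master-host:8080/json/
```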