Apache Spark

SQL Commands

Execute SQL queries with the spark-sql CLI. Connect to a Hive metastore, query data lakes, and run distributed SQL analytics from the command line.

7 commands

Pro Tips

Use 'SET spark.sql.shuffle.partitions=200' to tune shuffle parallelism for your data size. 200 is the default, so raise it for large shuffles and lower it for small ones.

Enable Hive support with '--conf spark.sql.catalogImplementation=hive'.

Use EXPLAIN to understand query plans before running expensive operations.
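As a minimal sketch of the EXPLAIN tip, the snippet below prefixes a query with EXPLAIN FORMATTED (available in Spark 3.0 and later) and passes it to the CLI with -e. The table name mydb.mytable is a placeholder, and the guard only prints the statement when spark-sql is not on the PATH.

```shell
# Placeholder query; mydb.mytable is a hypothetical table name.
QUERY='SELECT count(*) FROM mydb.mytable'

# Prefix the query so Spark prints the plan instead of running the query.
EXPLAIN_SQL="EXPLAIN FORMATTED ${QUERY}"

if command -v spark-sql >/dev/null 2>&1; then
  spark-sql -e "$EXPLAIN_SQL"
else
  echo "spark-sql not found; would run: $EXPLAIN_SQL"
fi
```

Reading the plan first shows whether a query involves full scans or wide shuffles before any cluster time is spent on it.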

Common Mistakes

SELECT * without LIMIT on a large table pulls every row back to the driver, which can exhaust driver memory or take hours.

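With ANSI mode off (spark.sql.ansi.enabled=false), Spark follows Hive-compatible semantics that differ from ANSI SQL. For example, an invalid cast returns NULL instead of raising an error.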

Commands

Start Spark SQL CLI

$ spark-sql

Launch interactive SQL command line interface.

Execute SQL file

$ spark-sql -f query.sql

Execute SQL statements from a file.
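The sketch below shows one way to prepare a small SQL file and hand it to -f. The file location, database name, and table name are all placeholders chosen for illustration. The query keeps a LIMIT so a preview of a large table stays bounded.

```shell
# Write the statements to a file; the path and names are hypothetical.
SQL_FILE="${TMPDIR:-/tmp}/query.sql"
cat > "$SQL_FILE" <<'SQL'
USE mydb;
SELECT * FROM mytable LIMIT 10;
SQL

# Run the file only when the CLI is installed.
if command -v spark-sql >/dev/null 2>&1; then
  spark-sql -f "$SQL_FILE"
else
  echo "spark-sql not found; would run: spark-sql -f $SQL_FILE"
fi
```

Statements in the file are separated by semicolons and run in order.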

Execute inline SQL

$ spark-sql -e "SELECT count(*) FROM mydb.mytable"

Execute a SQL statement directly.
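Inline execution is convenient in scripts. The following sketch captures a query result in a shell variable, assuming the -S (silent) option is available in your build and that log output goes to stderr, which is the default logging setup. The table name is a placeholder.

```shell
# Placeholder query; mydb.mytable is a hypothetical table name.
COUNT_SQL='SELECT count(*) FROM mydb.mytable'

if command -v spark-sql >/dev/null 2>&1; then
  # Discard stderr so only the query result lands in the variable
  # (assumes logs are written to stderr).
  ROW_COUNT=$(spark-sql -S -e "$COUNT_SQL" 2>/dev/null)
  echo "rows: $ROW_COUNT"
else
  echo "spark-sql not found; would run: spark-sql -S -e \"$COUNT_SQL\""
fi
```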

Connect to Hive metastore

$ spark-sql --conf spark.sql.catalogImplementation=hive

Enable Hive metastore support for table metadata. Spark reads the metastore connection settings from hive-site.xml in its conf directory.

Set database

$ spark-sql --database mydb

Start the CLI with a specific database selected.

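Enable adaptive query execution

$ spark-sql --conf spark.sql.adaptive.enabled=true

Enable Adaptive Query Execution (AQE), which re-optimizes the query plan at runtime using shuffle statistics. AQE is on by default since Spark 3.2, so this flag matters mainly on older versions or where it has been turned off.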

Set shuffle partitions

$ spark-sql --conf spark.sql.shuffle.partitions=200

Configure number of shuffle partitions.
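Since 200 is already the default, the flag is only useful with a value chosen for your workload. The sketch below derives one from an estimated shuffle size. Both inputs are assumptions for illustration: the 50 GB shuffle volume is hypothetical, and the 128 MB per-partition target is a commonly used rule of thumb, not a Spark setting.

```shell
# Assumed inputs, not taken from any real workload.
SHUFFLE_MB=51200   # hypothetical: about 50 GB of shuffled data
TARGET_MB=128      # rule-of-thumb target size per partition

# Ceiling division so a trailing partial partition is still counted.
PARTITIONS=$(( (SHUFFLE_MB + TARGET_MB - 1) / TARGET_MB ))

# Print the command line that would apply the computed value.
echo "spark-sql --conf spark.sql.shuffle.partitions=$PARTITIONS"
```

With these inputs the computed value is 400. Treat it as a starting point and adjust after checking task sizes in the Spark UI.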