Interface to Spark

Airflow provides an interface to Spark through provider packages.

For provider examples for Spark, see <airflow_home>/build/env/lib/python3.9/site-packages/airflow/providers/ezmeral/spark/example_dags/.

Hooks

EzSparkSubmitHook
Python path: airflow.providers.ezmeral.spark.hooks.ezspark_submit
Description: Launches Spark applications. This hook is a wrapper around the spark-submit binary to run a spark-submit job.
EzSparkSqlHook
Python path: airflow.providers.ezmeral.spark.hooks.ezspark_sql
Description: Enables interaction with binary tables through spark-sql. This hook is a wrapper around the spark-sql binary.
EzSparkJDBCHook
Python path: airflow.providers.ezmeral.spark.hooks.ezspark_jdbc
Description: Enables data transfers between JDBC databases and Apache Spark.
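Each of these hooks wraps a Spark command-line binary. As a rough illustration of what that wrapping involves (a sketch only; the function and parameter names below are hypothetical and are not the EzSparkSqlHook API), a spark-sql wrapper typically assembles the command line before executing the binary:

```python
def build_spark_sql_cmd(sql, conf=None, spark_sql_binary="spark-sql"):
    """Sketch: assemble a spark-sql invocation as an argument list.

    Hypothetical helper for illustration; -e and --conf are standard
    spark-sql flags, but this is not the hook's real implementation.
    """
    cmd = [spark_sql_binary, "-e", sql]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    return cmd

# Example: a query with one Spark configuration property.
print(build_spark_sql_cmd("SELECT 1", {"spark.master": "yarn"}))
```

A hook built this way only needs the binary's fully qualified path (or the binary on PATH) to hand the list to a subprocess call.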

Operators

EzSparkSubmitOperator
Python path: airflow.providers.ezmeral.spark.operators.ezspark_submit
Description: This operator is a wrapper around the spark-submit binary to run a spark-submit job.
EzSparkSqlOperator
Python path: airflow.providers.ezmeral.spark.operators.ezspark_sql
Description: Executes a Spark SQL query.
EzSparkJDBCOperator
Python path: airflow.providers.ezmeral.spark.operators.ezspark_jdbc
Description: Extends the SparkSubmitOperator to enable data transfers between JDBC databases and Apache Spark.

Spark binary path requirement

As of EEP release 9.1.0, you must define the fully qualified path to the spark-submit binary when configuring the Airflow Spark connection. To do this, add the spark-binary key to the extra JSON field of the spark_default connection:

"spark-binary": "/opt/mapr/spark/spark-<spark_version>/bin/spark-submit"

If you need to override this path for a particular task, pass the spark_binary parameter to the operator in your DAG:

spark_submit = SparkSubmitOperator(
    task_id='run_spark_job',
    application='/PATH/spark_application.jar',
    conn_id='spark_default',
    spark_binary='/opt/mapr/spark/spark-<spark_version>/bin/spark-submit',
)

Note that whichever method you use (defining spark-binary in the extra JSON field or passing spark_binary as an operator parameter), the value must be the correct fully qualified path of the spark-submit binary on the Airflow node. The path can differ depending on which version of Spark is installed.
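Because the path embeds the Spark version, it is easy to get wrong after an upgrade. The sketch below (a hypothetical helper, not part of the provider; it only assumes the /opt/mapr/spark/spark-<version>/bin/spark-submit layout described above) shows one way to discover the installed binary on the Airflow node before wiring it into the connection:

```python
import os

def find_spark_submit(base="/opt/mapr/spark"):
    """Return the fully qualified spark-submit path found under *base*,
    or None if no spark-<version>/bin/spark-submit exists there.

    Hypothetical helper; assumes the <base>/spark-<version>/bin layout.
    """
    if not os.path.isdir(base):
        return None
    # Prefer the highest version directory if several are present.
    for entry in sorted(os.listdir(base), reverse=True):
        candidate = os.path.join(base, entry, "bin", "spark-submit")
        if entry.startswith("spark-") and os.path.isfile(candidate):
            return candidate
    return None
```

Running this on the Airflow node and pasting the result into the connection's extra field (or the operator's spark_binary parameter) avoids typing the version by hand.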