Interface to Spark
Airflow provides an interface to Spark through provider packages.
See the Spark provider example DAGs at <airflow_home>/build/env/lib/python3.9/site-packages/airflow/providers/ezmeral/spark/example_dags/.
Hooks
- EzSparkSubmitHook
  - Python path: airflow.providers.ezmeral.spark.hooks.ezspark_submit
- EzSparkSqlHook
  - Python path: airflow.providers.ezmeral.spark.hooks.ezspark_sql
- EzSparkJDBCHook
  - Python path: airflow.providers.ezmeral.spark.hooks.ezspark_jdbc
Operators
- EzSparkSubmitOperator
  - Python path: airflow.providers.ezmeral.spark.operators.ezspark_submit
- EzSparkSqlOperator
  - Python path: airflow.providers.ezmeral.spark.operators.ezspark_sql
- EzSparkJDBCOperator
  - Python path: airflow.providers.ezmeral.spark.operators.ezspark_jdbc
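As a sketch of how these pieces fit together, the following minimal DAG submits a Spark application with EzSparkSubmitOperator. This is an illustration only: it assumes the Ezmeral provider package is installed and that the operator accepts the same core arguments as Airflow's standard SparkSubmitOperator; the DAG id and jar path are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
# Assumes the Ezmeral Spark provider package is installed
from airflow.providers.ezmeral.spark.operators.ezspark_submit import EzSparkSubmitOperator

with DAG(
    dag_id="ezspark_example",      # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,        # run only when triggered manually
    catchup=False,
) as dag:
    # Submit a Spark application through the spark_default connection;
    # the jar path below is a placeholder for illustration only.
    submit_job = EzSparkSubmitOperator(
        task_id="submit_spark_job",
        application="/path/to/spark_application.jar",
        conn_id="spark_default",
    )
```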
Spark binary path requirement
As of EEP release 9.1.0, you must define the fully qualified path to the spark-submit binary when configuring the Airflow Spark connection. You can do this by adding the Spark binary path to the extra JSON field of the spark_default connection:
"spark-binary": "/opt/mapr/spark/spark-<spark_version>/bin/spark-submit"If you need to overwrite this path, you can do so by adding the path to the DAG parameters:
spark_submit = SparkSubmitOperator(
    task_id='run_spark_job',
    application='/PATH/spark_application.jar',
    conn_id='spark_default',
    spark_binary='/opt/mapr/spark/spark-<spark_version>/bin/spark-submit',
)

Note that either method of specifying the full path (defining spark-binary in the extra JSON field or defining spark_binary as a parameter of SparkSubmitOperator) must reference the correct fully qualified path of the spark-submit binary on the Airflow node. The path can differ depending on which version of Spark is installed.
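One way to set up the connection with this extra field is the Airflow CLI. The sketch below is a configuration example, not a definitive command for your cluster: it assumes Spark 3.3.2 is installed under /opt/mapr/spark (substitute your installed version), that a local Spark master is acceptable, and that spark_default does not already exist.

```shell
# Sketch: create the spark_default connection with the spark-binary
# path in the extra JSON field. The Spark version (3.3.2) and the
# host value (local[*]) are hypothetical; adjust for your cluster.
airflow connections add spark_default \
    --conn-type spark \
    --conn-host "local[*]" \
    --conn-extra '{"spark-binary": "/opt/mapr/spark/spark-3.3.2/bin/spark-submit"}'
```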