Using Airflow to Schedule Spark Applications
This topic describes how to use Airflow to schedule Spark applications on HPE Ezmeral Runtime Enterprise.
To get started with Airflow on HPE Ezmeral Runtime Enterprise, see Airflow.
Run DAGs with SparkKubernetesOperator
To launch Spark jobs, you must select the Enable Spark Operator check box during Kubernetes cluster creation.
For more information, see the Apache Airflow documentation.
The following configuration changes have been made to the Airflow SparkKubernetesOperator provided by Hewlett Packard Enterprise, compared to the open-source Airflow SparkKubernetesOperator:
- The Airflow SparkKubernetesOperator provided by Hewlett Packard Enterprise has three additional positional parameters at the end of the constructor:
enable_impersonation_from_ldap_user: bool = True, api_group: str = 'sparkoperator.k8s.io', api_version: str = 'v1beta2'
Where:
enable_impersonation_from_ldap_user
: Launches the Spark job with the autoticket-generator.
api_group
: Specifies the Spark API group.
api_version
: Specifies the Spark API version.
- The API group of the open-source SparkKubernetesOperator and that of the SparkKubernetesOperator offered by Hewlett Packard Enterprise are different. You must set enable_impersonation_from_ldap_user to False.
See the DAG Example and Spark Job Example in the Hewlett Packard Enterprise GitHub repository.
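The following minimal DAG is a sketch, not the official example: it assumes the Hewlett Packard Enterprise build of the operator is importable at the standard cncf.kubernetes provider path, and the DAG id and the spark-pi.yaml manifest name are placeholders. It only illustrates where the additional parameters appear in a task definition.

from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

with DAG(
    dag_id="spark_pi_example",            # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,               # trigger manually
    catchup=False,
) as dag:
    # Submit the SparkApplication described in spark-pi.yaml (a placeholder
    # manifest stored next to the DAG file) to the tenant namespace.
    submit_spark_job = SparkKubernetesOperator(
        task_id="submit_spark_job",
        namespace="sampletenant",
        application_file="spark-pi.yaml",
        # The three parameters added by Hewlett Packard Enterprise; the values
        # shown are the defaults listed above.
        enable_impersonation_from_ldap_user=True,
        api_group="sparkoperator.k8s.io",
        api_version="v1beta2",
    )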
To generate the appropriate ticket for a Spark job, log in to the tenantcli pod in the tenant namespace as follows:
kubectl exec -it tenantcli-0 -n sampletenant -- bash
Execute the following script. For the ticket name, specify a Secret name that will be used in the Spark application YAML file.
ticketcreator.sh
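The script prompts for a ticket name; the resulting Secret is then referenced from the Spark application YAML. As an illustration only (not the Hewlett Packard Enterprise example), the fragment below mounts a placeholder Secret named user-ticket-secret into the driver and executor pods using the open-source Spark Operator's generic v1beta2 secrets field; the Spark Job Example in the Hewlett Packard Enterprise GitHub repository shows the exact fields expected on HPE Ezmeral Runtime Enterprise.

spec:
  driver:
    secrets:
      - name: user-ticket-secret             # placeholder: the Secret name entered in ticketcreator.sh
        path: /var/run/secrets/user-ticket   # placeholder mount path inside the pod
        secretType: Generic
  executor:
    secrets:
      - name: user-ticket-secret
        path: /var/run/secrets/user-ticket
        secretType: Generic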