Using Airflow to Schedule Spark Applications

This topic describes how to use Airflow to schedule Spark applications on HPE Ezmeral Runtime Enterprise.

To get started with Airflow on HPE Ezmeral Runtime Enterprise, see Airflow.

Run DAGs with SparkKubernetesOperator

To launch Spark jobs, you must select the Enable Spark Operator check box during Kubernetes cluster creation.

For more information, see the Apache Airflow documentation.

The following configuration changes have been made to the Airflow SparkKubernetesOperator provided by Hewlett Packard Enterprise in comparison to the open-source Airflow SparkKubernetesOperator:

  • The Airflow SparkKubernetesOperator provided by Hewlett Packard Enterprise has three additional positional parameters at the end of the constructor:
    enable_impersonation_from_ldap_user: bool = True,
    api_group: str = 'sparkoperator.k8s.io',
    api_version: str = 'v1beta2',

    Where:

    • enable_impersonation_from_ldap_user: When set to True, launches the Spark job with the autoticket-generator
    • api_group: Specifies the Spark API group
    • api_version: Specifies the Spark API version
  • The API group of the open-source SparkKubernetesOperator differs from that of the SparkKubernetesOperator offered by Hewlett Packard Enterprise.

    You must set enable_impersonation_from_ldap_user to False.

See the DAG Example and Spark Job Example in the Hewlett Packard Enterprise GitHub repository.
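
The following is a minimal DAG sketch that submits a Spark job with the SparkKubernetesOperator provided by Hewlett Packard Enterprise. The import path, DAG name, tenant namespace, and the spark-pi.yaml application file are illustrative assumptions; adjust them to match your environment and the examples in the GitHub repository.

from datetime import datetime

from airflow import DAG
# Import path assumes the operator is packaged under the standard
# CNCF Kubernetes provider; adjust it to match your Airflow image.
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)

with DAG(
    dag_id="spark_pi_example",            # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,               # trigger manually
    catchup=False,
) as dag:
    submit_spark_job = SparkKubernetesOperator(
        task_id="submit_spark_pi",
        namespace="sampletenant",             # tenant namespace, as in the example below
        application_file="spark-pi.yaml",     # hypothetical SparkApplication manifest
        # Parameters added by the Hewlett Packard Enterprise operator:
        enable_impersonation_from_ldap_user=True,  # launch with autoticket-generator
        api_group="sparkoperator.k8s.io",
        api_version="v1beta2",
    )

Because enable_impersonation_from_ldap_user defaults to True, the job is submitted through the autoticket-generator; the ticket itself is created as described below.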

To generate the appropriate ticket for a Spark job, log in to the tenantcli pod in the tenant namespace as follows:

kubectl exec -it tenantcli-0 -n sampletenant -- bash

Execute the following script. For the ticket name, specify a Secret name that will be used in the Spark application YAML file.

ticketcreator.sh