Airflow DAGs Git Repository

Describes how HPE Ezmeral Unified Analytics Software reads DAGs and how to configure a GitHub repository in Airflow.

Airflow DAGs are pulled from the GitHub repository that you specify when you configure Airflow. HPE Ezmeral Unified Analytics Software supports both private and public GitHub repositories. It reads DAGs only from the branch and subdirectory of the repository that you specify. If the GitHub repository is located behind a proxy, you can configure a proxy for the GitHub repository in Airflow.

In an air-gapped environment where there is no pre-configured proxy to forward outgoing cluster connections to the internet, the installation of Airflow will not function properly. To resolve this issue, the HPE Ezmeral Unified Analytics Software administrator must either manually set up an HTTP proxy or configure Airflow with an internal Git repository.
IMPORTANT
Best practice is to use Git submodules if multiple users have DAGs in their own repositories. To manage multiple users within the same GitHub repository, the HPE Ezmeral Unified Analytics Software administrator can create a root GitHub repository and then add all user GitHub repositories as submodules. As owner of the root GitHub repository, the administrator can update the Git submodules after users add, remove, or modify files. For example, when a user modifies files, the user can ask the platform administrator to update that user's Git submodule in the root repository to the latest commit hash. For additional information, refer to GitHub - About code owners and Working with submodules.

Configuring a Git Repository for Airflow

To configure Airflow with the GitHub repository where DAGs are stored:

  1. Sign in to HPE Ezmeral Unified Analytics Software as Administrator.
  2. In the left navigation bar, click Tools & Frameworks.
  3. Select the Data Engineering tab.
  4. On the Airflow tile, click the three-dots menu and then select Configure. The YAML file editor opens.
  5. In the editor, find the git: section.
  6. Configure the following parameters in the git: section:
    repo:

    The URL of the private or public Git repository that stores the DAGs. If you are using an air-gapped system without a proxy, specify your internal Git repository here.

    branch:

    The name of the branch within the repository to use.

    subDir:

    The path to the directory where the DAGs are located.



    If you are using an air-gapped system and Git cannot be accessed without a proxy, configure the following fields:
    http

    The address of the HTTP proxy.

    https

    The address of the HTTPS proxy.

    If the Git repository is private, configure the following fields:
    username

    The username of the user who has access to the private Git repository.

    password

    The token or password of the user who has access to the private Git repository.

    Alternatively, if you have created a secret in the airflow-hpe namespace that stores the password or token under the key password, you can specify the name of that secret in the secretName field of the cred section instead of setting the password field directly.



  7. Click Configure and wait until Airflow is configured.
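
The parameters from step 6 might look like the following sketch of the git: section. This is an illustration under stated assumptions, not the exact schema: the repository URL, branch name, directory, proxy addresses, username, and secret name are hypothetical placeholders, and the nesting of the http and https fields under a proxy: key is an assumption. Only the field names and the cred section come from the text above. Follow the layout that the YAML editor actually shows.

```yaml
git:
  # Hypothetical repository URL; in an air-gapped system without a proxy,
  # point this at your internal Git repository instead.
  repo: "https://github.com/example-org/airflow-dags.git"
  branch: "main"       # hypothetical branch name
  subDir: "dags/"      # hypothetical directory that contains the DAG files
  # Assumption: the proxy fields are grouped under a "proxy" key.
  proxy:
    http: "http://proxy.example.com:3128"    # hypothetical HTTP proxy address
    https: "http://proxy.example.com:3128"   # hypothetical HTTPS proxy address
  cred:
    username: "dag-reader"                   # hypothetical user with repository access
    # Set either password or secretName, not both.
    # password: "REPLACE_WITH_TOKEN"
    secretName: "airflow-git-cred"           # hypothetical secret in the airflow-hpe namespace
```

If you use the secretName option, the secret it refers to could be defined with a standard Kubernetes Secret manifest such as the one below. The secret name is again hypothetical; the namespace and the password key follow the description in step 6.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: airflow-git-cred      # hypothetical name; must match secretName in the cred section
  namespace: airflow-hpe
type: Opaque
stringData:
  password: "REPLACE_WITH_TOKEN"   # token or password for the private Git repository
```

Keeping the token in a secret avoids storing it in plain text in the Airflow configuration.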