Accessing Data From Outside Airflow DAGs with DataTap

Describes how to access data in tenant storage from Airflow DAGs using DataTap, with either the Airflow BashOperator or the Airflow PythonOperator.

DataTap can be used to access data stored outside your DAGs. For example, large datasets, binaries, and other files that are too large to commit to Git repositories can instead be uploaded to DataTap tenant storage and then accessed from any Airflow DAG.

There are two ways to read data from and write data to tenant storage in Airflow DAGs using DataTap:

Airflow BashOperator

You can learn about Airflow BashOperator on the Apache site: Airflow BashOperator.

To access tenant storage data in DAGs using DataTap with the Airflow BashOperator, proceed as follows:

  1. Enter a hadoop CLI command to perform operations on tenant data. Example of the bash_command argument in a DAG (a minimal DAG sketch follows this list):
    bash_command='hadoop fs -ls dtap://TenantStorage/' + path
    NOTE
    The full path to tenant storage is:
    dtap://TenantStorage/
  2. For a full example, see this page.
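For illustration, here is a minimal DAG sketch that lists tenant storage with a BashOperator. It assumes Airflow 2 import paths, that the hadoop CLI is available on the worker, and a hypothetical subdirectory name data/; adapt the dag_id, schedule, and path to your environment.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator  # Airflow 2 import path

    # Hypothetical subdirectory under tenant storage
    path = 'data/'

    with DAG(
        dag_id='datatap_bash_example',   # hypothetical name
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,          # trigger manually only
        catchup=False,
    ) as dag:

        # List the contents of the tenant storage path with the hadoop CLI
        list_tenant_storage = BashOperator(
            task_id='list_tenant_storage',
            bash_command='hadoop fs -ls dtap://TenantStorage/' + path,
        )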

Airflow PythonOperator

You can learn about Airflow PythonOperator on the Apache site: Airflow PythonOperator.

To access tenant storage data in DAGs using DataTap with the Airflow PythonOperator, proceed as follows:

  1. Use the pyarrow Python library to access data in tenant storage (a minimal DAG sketch follows this list):
    from pyarrow import fs
    See: Pyarrow Python library.
  2. For a full example, see this page.
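
For illustration, here is a minimal DAG sketch using a PythonOperator with pyarrow. The connection arguments to fs.HadoopFileSystem and the subdirectory data/ are assumptions; how DataTap URIs resolve depends on how the cluster's Hadoop configuration exposes DataTap.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator  # Airflow 2 import path
    from pyarrow import fs


    def list_tenant_storage():
        # Connect through pyarrow's Hadoop bindings. Passing the DataTap
        # endpoint as the host is an assumption; the exact arguments depend
        # on the cluster's Hadoop configuration.
        dtap = fs.HadoopFileSystem(host='dtap://TenantStorage', port=0)
        # List files under a hypothetical subdirectory in tenant storage
        for info in dtap.get_file_info(fs.FileSelector('data/')):
            print(info.path, info.size)


    with DAG(
        dag_id='datatap_python_example',  # hypothetical name
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,           # trigger manually only
        catchup=False,
    ) as dag:

        read_tenant_storage = PythonOperator(
            task_id='read_tenant_storage',
            python_callable=list_tenant_storage,
        )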