Accessing Data From Outside Airflow DAGs with DataTap
Describes how to access data from tenant storage in Airflow DAGs using DataTap. This can be done with either Airflow BashOperator or Airflow PythonOperator.
DataTap can be used to access data stored outside DAGs. For example, large datasets, binaries, or other large files that cannot be uploaded to Git repositories can instead be uploaded to DataTap tenant storage. These files can then be accessed in any Airflow DAG.
Airflow BashOperator
You can learn about the Airflow BashOperator on the Apache site: Airflow BashOperator.
To access data from tenant storage in DAGs using DataTap with Airflow BashOperator, proceed as follows:
- Enter the hadoop CLI command to perform operations with tenant data. Example of the bash_command argument in the DAG:

  ```python
  bash_command='hadoop fs -ls dtap://TenantStorage/' + path
  ```

  NOTE: The full path to tenant storage is dtap://TenantStorage/.
- For a full example, see this page.
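The following is a minimal sketch of a complete DAG built around this bash_command. Only the dtap://TenantStorage/ URI comes from the steps above; the DAG ID, schedule, and the data/ subpath are illustrative assumptions, the import paths assume Airflow 2.x, and the hadoop CLI is assumed to be available in the Airflow worker environment.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical subpath under tenant storage; adjust for your tenant.
path = "data/"

with DAG(
    dag_id="dtap_bash_example",    # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,        # run on demand only
    catchup=False,
) as dag:
    # Lists the contents of tenant storage through the dtap:// scheme.
    list_tenant_storage = BashOperator(
        task_id="list_tenant_storage",
        bash_command='hadoop fs -ls dtap://TenantStorage/' + path,
    )
```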
Airflow PythonOperator
You can learn about the Airflow PythonOperator on the Apache site: Airflow PythonOperator.
To access data from tenant storage in DAGs using DataTap with Airflow PythonOperator, proceed as follows:
- Use the pyarrow Python library to access data in tenant storage. See: Pyarrow Python library.

  ```python
  from pyarrow import fs
  ```
- For a full example, see this page.
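The following sketch shows one way a PythonOperator task might use pyarrow against tenant storage. The HadoopFileSystem connection arguments (the dtap://TenantStorage host string and port=0), the sample.csv file name, and the DAG settings are assumptions rather than a documented DataTap API; they rely on the cluster's Hadoop configuration resolving the dtap:// scheme.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from pyarrow import fs


def read_from_tenant_storage():
    # Assumption: the cluster's Hadoop configuration resolves the
    # dtap:// scheme, so HadoopFileSystem can reach tenant storage.
    dtap = fs.HadoopFileSystem(host="dtap://TenantStorage", port=0)
    # "sample.csv" is a hypothetical file previously uploaded to
    # tenant storage.
    with dtap.open_input_stream("/sample.csv") as stream:
        print(stream.read())


with DAG(
    dag_id="dtap_python_example",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,        # run on demand only
    catchup=False,
) as dag:
    read_task = PythonOperator(
        task_id="read_from_tenant_storage",
        python_callable=read_from_tenant_storage,
    )
```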