About DataTaps

DataTaps expand access to shared data by specifying a named path to a specified storage resource. Applications running within virtual clusters that can use the HDFS filesystem protocols can then access paths within that resource using that name. DataTaps implement the Hadoop File System API. This allows you to run jobs using your existing data systems without the need to make time-consuming copies or transfers of your data. Tenant/Project Administrator users can quickly and easily build, edit, and remove DataTaps using the DataTaps screen, as described in The DataTaps Screen (Admin). Tenant Member users can access DataTaps by name.

Each DataTap requires the following properties to be configured, depending on the type of storage being connected to (MapR, HDFS, HDFS with Kerberos, NFS, or GCS):

  • Name: A unique name for each DataTap. This name may contain letters (A-Z or a-z), digits (0-9), and hyphens (-), but may not contain spaces. You can use the name of a valid DataTap to compose DataTap URIs that you pass to applications as arguments. Each such URI maps to some path on the storage system that the DataTap points to. The path indicated by a URI might or might not exist at the time you start a job, depending on what the application wants to do with that path. Sometimes the path must indicate a directory or file that already exists, because the application intends to use it as input. Sometimes the path must not currently exist, because the application expects to create it. The semantics of these paths are entirely application-dependent, and are identical to their behavior when running the application on a physical Hadoop or Spark platform.
  • Description: Brief description of the DataTap, such as the type of data or the purpose of the DataTap.
  • Type: Type of file system used by the shared storage resource associated with the DataTap (MapR, HDFS, NFS, or GCS). This is completely transparent to the end job or other process using the DataTap.
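
The naming rule above can be checked before a DataTap is created. The following is a minimal sketch, not part of the product: the function name is_valid_datatap_name is hypothetical, and it assumes that a name must be non-empty and may contain only the characters listed for the Name property. It does not check uniqueness.

```python
import re

# Assumption: only the characters listed for the Name property are allowed,
# and an empty name is rejected.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+")


def is_valid_datatap_name(name: str) -> bool:
    """Return True if name contains only letters, digits, and hyphens."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

For example, is_valid_datatap_name("sales-data-2") returns True, while a name containing a space, such as "sales data", is rejected.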

The following fields depend on the DataTap type:

MapR

NOTE All of the links to MapR articles in this section will open in a new browser tab/window.

A MapR DataTap is configured as follows:

  • Cluster Name: Name of the MapR cluster. See the MapR articles Creating the Cluster and Creating a Volume.
  • CLDB Hosts: DNS name or address of the container location database of a MapR cluster. See the MapR article Viewing CLDB Information.
  • Port: Port for the namenode service on the host used to access the MapR file system. See the MapR article Specifying Ports.
  • Mount Path: Complete path to the directory containing the data within the specified MapR file system. You can leave this field blank if you intend the DataTap to point at the root of the MapR cluster. See the MapR articles Viewing Volume Details and Creating a Volume.
  • MapR Secure: Select this check box if the MapR cluster is secured. When the MapR cluster is secured, all network connections require authentication, and data in motion is protected with wire-level encryption. MapR applies security protection directly to data as it enters and leaves the platform, without requiring an external security manager server or a separate security plug-in for each ecosystem component. The security semantics are applied automatically to data being retrieved or stored by any ecosystem component, application, or user. See the MapR article Security.
  • Ticket Source: Select the ticket source. This will be one of the following:
    • Upload Ticket File: Upload a new ticket file.
    • Use the Existing One: Use the details of an existing ticket.
  • Ticket file: This will be one of the following:
    • When Upload Ticket File is selected, the Browse button is enabled so that you can select the ticket file.
    • When Use the Existing One is selected, this is the name of the existing ticket file.
  • Enable Impersonation: When impersonation is enabled and a user signs into the container and creates a file in the MapR cluster through the DataTap connection, ownership of that file is assigned to that user. If the user does not exist in the MapR cluster, then the connection between the DataTap and the MapR cluster is rejected. Typically, administrators ensure that the same users exist in both the container and the MapR cluster by configuring both with the same AD/LDAP settings.
  • Select Ticket Type: Select the ticket type. This will be one of the following:
    • User: Grants access to individual users with no impersonation support. The ticket UID is used as the identity of the entity using this ticket.
    • Service: Accesses services running on client nodes with no impersonation support. The ticket UID is used as the identity of the entity using this ticket.
    • Service (with impersonation): Accesses services running on client nodes to run jobs on behalf of any user. The ticket cannot be used to impersonate the root or mapr users.
    • Tenant: Allows tenant users to access tenant volumes in a multi-tenant environment. The ticket can impersonate any user.
  • Ticket User: Username to be included in the ticket for authentication.
  • MapR Tenant Volume: Indicates whether or not the mount path is a MapR tenant volume. See the MapR article Setting Up a Tenant.
  • Enable Passthrough: Select this box to enable Passthrough mode.


HDFS

An HDFS DataTap is configured as follows:

  • Host: DNS name or IP address of the server providing access to the storage resource. For example, this could be the host running the namenode service of an HDFS cluster.
  • Standby NameNode: DNS name or IP address of a standby namenode host that an HDFS DataTap will try to reach if it cannot contact the primary host. This field is optional; when used, it provides high-availability access to the specified HDFS DataTap.
  • Port: For HDFS DataTaps, this is the port for the namenode server on the host used to access the HDFS file system.
  • Path: Complete path to the directory containing the data within the specified HDFS file system. You can leave this field blank if you intend the DataTap to point at the root of the specified file system.
  • Kerberos parameters: If the HDFS DataTap has Kerberos enabled, then you will need to specify additional parameters. HPE Ezmeral Runtime Enterprise supports two modes of user access/authentication.
    • Proxy mode permits a “proxy user” to be configured to have access to the remote HDFS cluster. Individual users are granted access to the remote HDFS cluster by the proxy user configuration. Mixing and matching distributions is permitted between the compute Hadoop cluster and the remote HDFS.
    • Passthrough mode passes the credentials of the current user to the remote HDFS cluster for authentication.
  • HDFS file systems configured with TDE encryption as well as cross-realm Kerberos authentication are supported. See HDFS DataTap TDE Configuration and HDFS DataTap Cross-Realm Kerberos Authentication for additional configuration instructions.
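
The fallback behavior described for the Standby NameNode field can be pictured as follows. This is only an illustrative sketch of "try the primary host, then the standby"; it is not the product's implementation. The function choose_namenode and its is_reachable callback are hypothetical names introduced for this example.

```python
from typing import Callable, Optional


def choose_namenode(
    primary: str,
    standby: Optional[str],
    is_reachable: Callable[[str], bool],
) -> str:
    """Return the first reachable host, preferring the primary host.

    is_reachable is a caller-supplied probe (hypothetical), so this
    sketch performs no network I/O of its own.
    """
    candidates = [primary]
    if standby:  # The standby host is optional.
        candidates.append(standby)
    for host in candidates:
        if is_reachable(host):
            return host
    raise ConnectionError(f"no reachable namenode among {candidates}")
```

If no standby host is configured, the sketch simply fails when the primary host cannot be contacted, which matches the optional nature of the field.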

NFS

NOTE This option is not available for Kubernetes tenants.

An NFS DataTap is configured as follows:

  • Host: DNS name or IP address of the server providing access to the storage resource.
  • Share: This is the exported share on the selected host.
  • Path: Complete path to the directory containing the data within the specified NFS share. You can leave this field blank if you intend the DataTap to point at the root of the specified share.

GCS

A GCS DataTap is configured as follows:

  • Bucket Name: Specify the bucket name for GCS.
  • Credential File Source: This will be one of the following:
    • When Upload Ticket File is selected, the Browse button is enabled so that you can select the credential file. The credential file is a JSON file that contains the service account key.
    • When Use the Existing One is selected, enter the name of the previously uploaded credential file. The credential file is a JSON file that contains the service account key.
  • Proxy: This is optional. Specify an HTTP proxy to use when accessing GCS.
  • Mount Path: Enter a path within the bucket that will serve as the starting point for the DataTap. If the path is not specified, the starting point will default to the bucket.
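
Before uploading a credential file, it can help to confirm that the file is valid JSON and looks like a service account key. The sketch below is an assumption-laden sanity check, not a product feature. The function name is hypothetical, and the fields it checks ("type", "private_key", "client_email") are those commonly present in Google service account key files. It does not verify that the key is actually accepted by GCS.

```python
import json

# Assumption: fields commonly present in a Google service account key file.
REQUIRED_KEYS = {"type", "private_key", "client_email"}


def looks_like_service_account_key(text: str) -> bool:
    """Return True if text is a JSON object shaped like a service account key."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return data.get("type") == "service_account" and REQUIRED_KEYS.issubset(data)
```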

Using a DataTap

The storage pointed to by a DataTap can be accessed via a URI that includes the name of the DataTap.

A DataTap points to the top of the “path” configured for the given DataTap. The URI has the following form:

dtap://datatap_name/

In this example, datatap_name is the name of the DataTap that you wish to use. You can access files and directories further in the hierarchy by appending path components to the URI:

dtap://datatap_name/some_subdirectory/another_subdirectory/some_file

For example, the URI dtap://mydatatap/home/mydirectory means that the data is located within the /home/mydirectory directory in the storage that the DataTap named mydatatap points to.
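
The mapping from a DataTap URI to a location on the underlying storage can be sketched as follows. This is a simplified illustration under stated assumptions, not the product's resolution logic: resolve_dtap_uri is a hypothetical helper, and configured_paths is a hypothetical dictionary that maps each DataTap name to the path configured for it (an empty string when the DataTap points at the root). The sketch does not guard against ".." components in the URI.

```python
import posixpath
from urllib.parse import urlsplit


def resolve_dtap_uri(uri: str, configured_paths: dict) -> str:
    """Map dtap://name/relative/path to a path on the DataTap's storage."""
    parts = urlsplit(uri)
    if parts.scheme != "dtap" or not parts.netloc:
        raise ValueError(f"not a DataTap URI: {uri!r}")
    # The DataTap name sits in the authority position of the URI.
    base = configured_paths[parts.netloc]  # KeyError if the name is unknown
    relative = parts.path.lstrip("/")
    # Join the configured path and the path components appended to the URI.
    return posixpath.normpath(posixpath.join("/", base, relative))
```

With configured_paths = {"mydatatap": ""}, the URI dtap://mydatatap/home/mydirectory resolves to /home/mydirectory. If the same DataTap were instead configured with the path /data/shared, the URI dtap://mydatatap/logs would resolve to /data/shared/logs.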

DataTaps exist on a per-tenant basis. This means that a DataTap created for Tenant A cannot be used by Tenant B. You may, however, create a DataTap for Tenant B with the exact same properties as its counterpart for Tenant A, thus allowing both tenants to access the same storage resource. Further, multiple jobs within a tenant may use a given DataTap simultaneously. While such sharing can be useful, be aware that the same cautions and restrictions apply to these use cases as for other types of shared storage: multiple jobs modifying files at the same location may lead to file access errors and/or unexpected job results.
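
Per-tenant scoping can be thought of as a lookup keyed by both the tenant and the DataTap name. The class below is a toy model for illustration only. DataTapRegistry, its methods, and the example target string are all hypothetical and do not reflect how the platform stores DataTaps.

```python
from typing import Dict, Tuple


class DataTapRegistry:
    """Toy model: a DataTap is identified by (tenant, name), not by name alone."""

    def __init__(self) -> None:
        self._taps: Dict[Tuple[str, str], str] = {}

    def create(self, tenant: str, name: str, target: str) -> None:
        """Register a DataTap for one tenant; other tenants are unaffected."""
        self._taps[(tenant, name)] = target

    def lookup(self, tenant: str, name: str) -> str:
        """Return the target for this tenant's DataTap, or raise KeyError."""
        return self._taps[(tenant, name)]


registry = DataTapRegistry()
# Hypothetical target; both tenants point at the same storage resource.
registry.create("tenant-a", "shared", "hdfs://namenode.example.com/data")
registry.create("tenant-b", "shared", "hdfs://namenode.example.com/data")
```

In this model, a lookup for "shared" by a tenant that has not created its own DataTap with that name raises KeyError, even though another tenant has a DataTap of the same name.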

Users who have a Tenant Administrator role can view and modify detailed DataTap information. Members can only view general DataTap information and are unable to create, edit, or remove a DataTap.

CAUTION Data conflicts can occur if more than one DataTap points to a location being used by multiple jobs at once.
CAUTION Editing or deleting a DataTap while it is being used by one or more running jobs can cause errors in the affected jobs.