Using the File System JAR to Connect to the Cluster

The file system JAR file includes the Data Fabric client libraries required to connect to the cluster. While this is strongly discouraged, application developers can bundle the file system JAR file in Data Fabric file system, HPE Ezmeral Data Fabric Database, and HPE Ezmeral Data Fabric Streams applications instead of installing the Data Fabric client on the edge node (node that runs the application). Applications should not bundle the file system JAR file unless the application meets certain requirements.

IMPORTANT
When bundling the file system JAR file, if there is a binary mismatch between the bundled JAR file and the version that the cluster expects, this can result in failures. In release 5.2.2 and later, the system detects the mismatch and prevents the application from starting. In releases earlier than 5.2.2, nodes running applications may run out of memory or shut down unexpectedly.

Requirements

You can bundle the file system JAR (maprfs-<version>-mapr.jar) with applications that meet all of the following requirements:
  • The application communicates directly with the file system, HPE Ezmeral Data Fabric Database, or HPE Ezmeral Data Fabric Streams
  • The application does not run as a MapReduce or YARN job/application on the cluster.
  • The application does not include file system JARs on the local machine in its classpath.
  • The application accesses a cluster that is not secure.

Configuring the Cluster Connection

When you include the file system JAR in an application instead of installing the Data Fabric client on the edge node, you must create and configure a mapr-clusters.conf file on node that runs the application.

  1. Set a MAPR_HOME environment variable to a location such as /opt/mapr.
  2. Create the mapr-clusters.conf file in the $MAPR_HOME/conf directory.
  3. Configure the mapr-clusters.conf file with the cluster name and the list of CLDB nodes.

    For example, the mapr-clusters.conf on an edge node would contain the following content if it was connecting to a cluster named my.cluster with CLDB nodes on centos765, centos234, and centos123:

    my.cluster secure=false centos765 centos234 centos123

    For more information about how to configure mapr-clusters.conf, see mapr-clusters.conf.

For more information about how the Data Fabric client connects to the cluster, see How Data Fabric Clients Connect to the Cluster.

Using Maven to Include file system JAR as a Dependency

If you use Maven to bundle the file system JAR file with an application and you plan to run the application on a Data Fabric cluster where a patch has been applied, ensure that you specify both a system scope and a local system path to the file.

For example, to bundle the file system JAR file, the pom.xml file may include the following:
...
 <groupId>com.mapr.hadoop</groupId>
        <artifactId>maprfs</artifactId>
        <version>${mapr.core.version}</version>
        <scope>system</scope>
        <systemPath>/opt/mapr/lib/maprfs-5.2.0-mapr.jar</systemPath>
...

By default, the Data Fabric Maven repository includes JAR files from https://repository.mapr.com/maven/. This default Maven repository includes JAR files associated with the GA packages for each Data Fabric release. Therefore, when a patch has been applied to the cluster, failure to specify a system scope may result in errors due to a binary mismatch between the file system JAR files used by the application and the cluster.

Known Issues

Nodes running applications with a bundled file system JAR file may run out of memory or shut down unexpectedly in the following scenarios:
The version of the file system JAR included in the application differs from the version that is available on the cluster.
This may occur when a patch was applied to some, but not all the nodes in the cluster. It can also occur when Maven is bundling the GA version of the JAR file when the cluster expects a newer, patched version.
Two versions of the JAR are available on the node.
For YARN applications, the NodeManager nodes that run the tasks or containers store local versions of the dependencies included with the application. In this scenario, since both the cluster’s file system JAR and the version included in the application are available on the node, it is unknown which JAR will be used when processing the application.