Integrate Spark with HBase

Integrate Spark with HBase when you want to run Spark jobs on HBase or HPE Ezmeral Data Fabric Database tables.

About this task

If you installed Spark with the MapR Installer, these steps are not required.

Procedure

  1. Configure the HBase version in the /opt/mapr/spark/spark-<version>/mapr-util/compatibility.version file:
    hbase_versions=<version>
    The HBase version to specify depends on the EEP and MapR core versions that you are running.
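    For example, if your EEP ships HBase 1.4.14 (an illustrative version; confirm the correct value in the release notes for your EEP), the entry is:
    hbase_versions=1.4.14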
  2. If you want to create HBase tables with Spark, add the following property to hbase-site.xml:
    <property>
    <name>hbase.table.sanity.checks</name>
    <value>false</value>
    </property>
  3. On each Spark node, copy the hbase-site.xml file to the SPARK_HOME/conf/ directory.
    TIP
    Starting in the EEP 7.0.0 release, this step is not required; running configure.sh automatically copies the hbase-site.xml file to the Spark directory.
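    For example, on a node with Spark 2.4.7 and HBase 1.4.14 (illustrative versions), the copy command might look like this:
    cp /opt/mapr/hbase/hbase-1.4.14/conf/hbase-site.xml /opt/mapr/spark/spark-2.4.7/conf/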
  4. Specify the hbase-site.xml file in the SPARK_HOME/conf/spark-defaults.conf file:
    spark.yarn.dist.files SPARK_HOME/conf/hbase-site.xml
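    Replace SPARK_HOME with the actual Spark installation path. For example, with Spark installed under /opt/mapr/spark/spark-2.4.7 (an illustrative version), the entry is:
    spark.yarn.dist.files /opt/mapr/spark/spark-2.4.7/conf/hbase-site.xml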
  5. To verify the integration, complete the following steps:
    1. Create an HBase or HPE Ezmeral Data Fabric Database table:
      create '<table_name>', '<column_family>'
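      For example, in the HBase shell, using the hypothetical names spark_test for the table and cf1 for the column family:
      hbase(main):001:0> create 'spark_test', 'cf1'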
    2. Run the following command as the mapr user or as a user that mapr impersonates:
      /opt/mapr/spark/spark-<spark_version>/bin/spark-submit --master <master> [--deploy-mode <deploy-mode>] --class org.apache.hadoop.hbase.spark.example.rdd.HBaseBulkPutExample /opt/mapr/hbase/hbase-<hbase_version>/lib/hbase-spark-<hbase_version>-mapr.jar <table_name> <column_family>
      The master URL is spark://<host>:7077 for a standalone cluster, yarn for YARN, or local for local mode (omit --deploy-mode with local). The deploy mode is either client or cluster.
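      For example, a run on YARN in client deploy mode, using illustrative versions and the hypothetical table created above, might look like this:
      /opt/mapr/spark/spark-2.4.7/bin/spark-submit --master yarn --deploy-mode client --class org.apache.hadoop.hbase.spark.example.rdd.HBaseBulkPutExample /opt/mapr/hbase/hbase-1.4.14/lib/hbase-spark-1.4.14-mapr.jar spark_test cf1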
    3. Check the data in the HBase or HPE Ezmeral Data Fabric Database table:
      hbase(main):001:0> scan '<table_name>'
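
After the integration is verified, you can write to the table from your own Spark application by using the hbase-spark connector that ships in the HBase lib directory. The following Scala sketch mirrors what the bundled HBaseBulkPutExample does; it assumes the hypothetical spark_test table and cf1 column family from the steps above, and the HBaseContext.bulkPut API of the hbase-spark module (verify the API against the hbase-spark version in your EEP):

  import org.apache.hadoop.hbase.client.Put
  import org.apache.hadoop.hbase.spark.HBaseContext
  import org.apache.hadoop.hbase.util.Bytes
  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.spark.{SparkConf, SparkContext}

  object BulkPutSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("BulkPutSketch"))

      // Reads hbase-site.xml from the classpath (the file distributed in steps 3 and 4).
      val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

      // Rows to write: (rowKey, value) pairs.
      val rows = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))

      // Convert each pair to a Put against the cf1 column family of the spark_test table.
      hbaseContext.bulkPut[(String, String)](
        rows,
        TableName.valueOf("spark_test"),
        { case (rowKey, value) =>
          new Put(Bytes.toBytes(rowKey))
            .addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("c1"), Bytes.toBytes(value))
        }
      )

      sc.stop()
    }
  }

Submit the application with the same spark-submit pattern shown in step 5, adding the hbase-spark JAR with --jars if it is not already on the executor classpath.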