Integrate Spark with HBase
Integrate Spark with HBase or HPE Data Fabric Database when you want to run Spark jobs on HBase or HPE Data Fabric Database tables.
Procedure
-
Configure the HBase version in the
/opt/mapr/spark/spark-<version>/mapr-util/compatibility.version file:
hbase_versions=<version>
The HBase version depends on the current EEP and MapR version that you are running.
-
If you want to create HBase tables with Spark, add the following property to
hbase-site.xml:
<property>
  <name>hbase.table.sanity.checks</name>
  <value>false</value>
</property>
-
On each Spark node, copy the
hbase-site.xml file to the {SPARK_HOME}/conf/ directory.
TIP: Starting in the EEP 7.0.0 release, you do not have to complete this step. Running configure.sh copies the hbase-site.xml file to the Spark directory automatically.
-
Specify the
hbase-site.xml file in the SPARK_HOME/conf/spark-defaults.conf file:
spark.yarn.dist.files SPARK_HOME/conf/hbase-site.xml
-
To verify the integration, complete the following steps:
-
Create an HBase or HPE Data Fabric Database table:
create '<table_name>', '<column_family>'
-
Run the following command as the
mapr user or as a user that mapr impersonates:
/opt/mapr/spark/spark-<spark_version>/bin/spark-submit --master <master> [--deploy-mode <deploy-mode>] --class org.apache.hadoop.hbase.spark.example.rdd.HBaseBulkPutExample /opt/mapr/hbase/hbase-<hbase_version>/lib/hbase-spark-<hbase_version>-mapr.jar <table_name> <column_family>
The master URL for the cluster is either spark://<host>:7077, yarn, or local (without deploy-mode). The deploy-mode is either client or cluster.
-
Check the data in the HBase or HPE Data Fabric Database table:
hbase(main):001:0> scan '<table_name>'
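The copy and spark-defaults.conf steps above can be scripted when there are many Spark nodes to prepare. The following is a minimal sketch, not a MapR-provided tool: the function names stage_hbase_site and register_hbase_site are hypothetical, and the script assumes you pass in the location of your hbase-site.xml and your Spark home directory. It refuses to touch an existing spark.yarn.dist.files entry that lists other files, because merging that value is best done by hand.

```shell
# Hypothetical helpers sketching the "copy hbase-site.xml" and
# "spark.yarn.dist.files" steps. Both paths are supplied by the caller.

# Copy hbase-site.xml into the Spark conf directory.
stage_hbase_site() {
  local hbase_site="$1"
  local spark_home="$2"
  mkdir -p "$spark_home/conf"
  cp "$hbase_site" "$spark_home/conf/hbase-site.xml"
}

# Register the copied file in spark-defaults.conf. Safe to run repeatedly.
register_hbase_site() {
  local spark_home="$1"
  local defaults="$spark_home/conf/spark-defaults.conf"
  local target="$spark_home/conf/hbase-site.xml"
  touch "$defaults"
  if grep -q '^spark\.yarn\.dist\.files[[:space:]]' "$defaults"; then
    # Property already present: accept it only if it already lists our file.
    if grep '^spark\.yarn\.dist\.files[[:space:]]' "$defaults" | grep -qF "$target"; then
      return 0
    fi
    echo "spark.yarn.dist.files is already set; add $target to it by hand" >&2
    return 1
  fi
  # Make sure the file ends with a newline before appending.
  if [ -s "$defaults" ] && [ -n "$(tail -c1 "$defaults")" ]; then
    echo >> "$defaults"
  fi
  printf 'spark.yarn.dist.files %s\n' "$target" >> "$defaults"
}
```

To use it, source the script on each Spark node, then call stage_hbase_site with the path to your hbase-site.xml and your Spark home, followed by register_hbase_site with the same Spark home. On EEP 7.0.0 and later, configure.sh already performs the copy, so only the second call would be relevant.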