Integrate Spark-SQL (Spark 2.0.1 and later) with Hive
You integrate Spark-SQL with Hive when you want to run Spark-SQL queries on Hive tables. This information is for Spark 2.0.1 or later users.
About this task
For information about Spark-SQL and Hive support, see Spark Feature Support.
NOTE
If you installed Spark with the MapR Installer, the following steps are not required.
Procedure
-
Copy the hive-site.xml file into the SPARK_HOME/conf directory so that Spark and Spark-SQL recognize the Hive Metastore configuration. Do not create a symbolic link instead of copying the file. You may need to edit the file with settings that are specific to the Spark Thrift server.
-
Set 644 permissions on the hive-site.xml file using the following command:
sudo chmod 644 /opt/mapr/spark/spark-<sparkVersion>/conf/hive-site.xml
-
If Hive is configured to run on Tez (not on MapReduce), you must remove the Tez property from the hive-site.xml file in the Spark conf directory. Delete this entry:
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
-
If Hive is configured to use PAM, set hive.metastore.sasl.enabled to true in the hive-site.xml file located in the Spark conf directory.
-
Add the following additional properties to the /opt/mapr/spark/spark-<version>/conf/spark-defaults.conf file:
Property                            Configuration Requirements
spark.yarn.dist.files               For Spark on YARN, specify the location of the hive-site.xml file: /opt/mapr/spark/spark-<spark-version>/conf/hive-site.xml
spark.sql.hive.metastore.version    Specify the Hive version that you are using. NOTE: If you are using Hive Metastore 2.1, set the version to 1.2.1.
-
Depending on whether you plan to run with impersonation, perform one of the following:
- Configure user impersonation. See Hive User Impersonation for the steps to configure impersonation in the Spark Thrift server.
- Set hive.server2.enable.doAs to false in the hive-site.xml file.
-
To verify the integration, run the following command as the mapr user or as a user
that mapr impersonates:
<spark-home>/bin/run-example --master <master> [--deploy-mode <deploy-mode>] sql.hive.SparkHiveExample
The master URL for the cluster is either spark://<host>:7077 or yarn. The deploy-mode is either client or cluster.
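The file-handling steps of the procedure above (copying hive-site.xml, setting its permissions, and adding properties to spark-defaults.conf) could be scripted roughly as follows. This is a minimal sketch, not MapR tooling: the function names install_hive_site and set_spark_default are hypothetical, and the Hive conf path in the usage comment is an assumption that you should replace with the location of your own hive-site.xml.

```shell
#!/usr/bin/env bash
# Hypothetical helpers sketching steps 1, 2, and 5 of the procedure.
# The function names and arguments are illustrative only.

# Copy hive-site.xml into the Spark conf directory and set mode 644.
# The file is copied to a temporary name and then renamed, so an existing
# symlink at the destination is replaced rather than written through.
install_hive_site() {
  local src="$1" dest="$2/hive-site.xml"
  cp "$src" "$dest.tmp" || return 1
  chmod 644 "$dest.tmp" || return 1
  mv -f "$dest.tmp" "$dest"
}

# Append "key value" to spark-defaults.conf unless the key is already present.
set_spark_default() {
  local conf_file="$1" key="$2" value="$3"
  touch "$conf_file" || return 1
  # awk exits 0 only if some line has the key as its first field.
  if ! awk -v k="$key" '$1 == k {found=1} END {exit !found}' "$conf_file"; then
    printf '%s %s\n' "$key" "$value" >> "$conf_file"
  fi
}

# Example usage (paths are assumptions; substitute your own versions and run
# as a user who can write to the Spark conf directory):
#   SPARK_CONF=/opt/mapr/spark/spark-<version>/conf
#   install_hive_site /opt/mapr/hive/hive-<version>/conf/hive-site.xml "$SPARK_CONF"
#   set_spark_default "$SPARK_CONF/spark-defaults.conf" spark.yarn.dist.files "$SPARK_CONF/hive-site.xml"
#   set_spark_default "$SPARK_CONF/spark-defaults.conf" spark.sql.hive.metastore.version 1.2.1
#   "${SPARK_CONF%/conf}"/bin/run-example --master yarn --deploy-mode client sql.hive.SparkHiveExample
```

Because set_spark_default skips keys that already exist, rerunning the script does not duplicate entries. It also does not update an existing value, so edit spark-defaults.conf by hand if a property is already set incorrectly.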
What to do next
NOTE
The default port for both HiveServer 2 and the Spark Thrift server is 10000. Therefore, before you start the Spark Thrift server on a node where HiveServer 2 is running, verify that there is no port conflict.
NOTE
If you plan to access Hive tables that store data in HPE Ezmeral Data Fabric Database, you need to copy the Hive HBase handler jar into the Spark jars directory. For example:
cp /opt/mapr/hive/hive-2.1/lib/hive-hbase-handler-2.1.1-mapr-1707.jar /opt/mapr/spark/spark-2.1.0/jars/
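For the port-conflict note above, one way to check whether port 10000 is already taken is to scan the listening TCP sockets before starting the Spark Thrift server. The sketch below assumes a Linux node where the ss utility is available. The function name listening_on and the alternative port 10001 are illustrative choices, not values from this documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ss -ltn` style output on stdin and succeeds
# (exit status 0) if any socket is listening on the given port.
listening_on() {
  local port="$1"
  # Column 4 of `ss -ltn` is "Local Address:Port"; NR > 1 skips the header.
  awk -v p=":${port}" 'NR > 1 && $4 ~ (p "$") {found=1} END {exit !found}'
}

# Example usage (the Spark path and port 10001 are assumptions):
#   if ss -ltn | listening_on 10000; then
#     echo "Port 10000 is in use; start the Thrift server on another port, for example:"
#     echo "  /opt/mapr/spark/spark-<version>/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001"
#   fi
```

The helper reads from stdin rather than calling ss itself, so you can also feed it saved output from another node.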