You integrate Spark-SQL with Hive when you want to run Spark-SQL queries on Hive
tables. This information is for Spark 2.0.1 or later users.
About this task
For information about Spark-SQL and Hive support, see Spark Feature Support.
NOTE
If you installed Spark with the MapR Installer, the following steps are not required.
Procedure
-
Copy the
hive-site.xml file into the
SPARK_HOME/conf directory so that Spark and Spark-SQL recognize
the Hive Metastore configuration. Do not create a symbolic link instead of copying
the file. You may need to edit the file with settings that are specific to the Spark
Thrift server.
-
Add
644 permission to the hive-site.xml using the following command:
sudo chmod 644 /opt/mapr/spark/spark-<sparkVersion>/conf/hive-site.xml
-
If Hive is configured on Tez (not on MR), you must remove the Tez property from the
Spark conf directory hive-site.xml. Delete this entry:
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
-
If Hive is configured on PAM, set
"hive.metastore.sasl.enabled =
true" in the hive-site.xml located in the Spark conf directory.
-
Add the following additional properties to the
/opt/mapr/spark/spark-<version>/conf/spark-defaults.conf
file:
| Property |
Configuration Requirements |
| spark.yarn.dist.files |
For Spark on YARN, specify the location of the hive-site.xml file:
/opt/mapr/spark/spark-<spark-version>/conf/hive-site.xml
|
| spark.sql.hive.metastore.version |
Specify the Hive version that you are using.
NOTE If you are using
Hive Metastore 2.1, set the version to 1.2.1. |
-
Depending on whether you plan to run with impersonation, perform one of the
following:
- Configure user impersonation. See Hive User
Impersonation for the steps to configure impersonation in the Spark
Thrift server.
- Set
hive.server2.enable.doAs to false in
the hive-site.xml file.
-
To verify the integration, run the following command as the mapr user or as a user
that mapr impersonates:
<spark-home>/bin/run-example --master <master> [--deploy-mode <deploy-mode>] sql.hive.SparkHiveExample
The master URL for the cluster is either spark://<host>:7077 or yarn. The
deploy-mode is either client or cluster.
What to do next
NOTE
The default port for both HiveServer 2 and the Spark Thrift server is 10000. Therefore,
before you start the Spark Thrift server on a node where HiveServer 2 is running, verify
that there is no port conflict.
NOTE
If you plan to access Hive tables that store data in
HPE Data Fabric Database, you need to copy the
Hive HBase handler jar into the Spark jars directory. For
example:
cp /opt/mapr/hive/hive-2.1/lib/hive-hbase-handler-2.1.1-mapr-1707.jar /opt/mapr/spark/spark-2.1.0/jars/