Installing Spark Standalone
This topic describes how to use package managers to download and install Spark Standalone from the EEP repository.
Prerequisites
About this task
| Package | Description |
|---|---|
| `mapr-spark` | Install this package on any node where you want to install Spark. This package is dependent on the `mapr-client`, `mapr-hadoop-client`, `mapr-hadoop-util`, and `mapr-librdkafka` packages. |
| `mapr-spark-master` | Install this package on Spark master nodes. Spark master nodes must be able to communicate with Spark worker nodes over SSH without using passwords. This package is dependent on the `mapr-spark` and `mapr-core` packages. |
| `mapr-spark-historyserver` | Install this optional package on Spark History Server nodes. This package is dependent on the `mapr-spark` and `mapr-core` packages. |
| `mapr-spark-thriftserver` | Install this optional package on Spark Thrift Server nodes. This package is available starting in the EEP 4.0 release. It is dependent on the `mapr-spark` and `mapr-core` packages. |
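To see which of these packages are already present on a node, you can query the node's package manager; this is only a quick check, not a required step:

```
# On RedHat/CentOS/SLES nodes
rpm -qa | grep mapr-spark

# On Ubuntu nodes
dpkg -l | grep mapr-spark
```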
Run the commands in this procedure as root or using sudo.

Procedure
- Create the `/apps/spark` directory on the cluster filesystem, and set the correct permissions on the directory:

  ```
  hadoop fs -mkdir /apps/spark
  hadoop fs -chmod 777 /apps/spark
  ```

  NOTE: Beginning with EEP 6.2.0, the `configure.sh` script creates the `/apps/spark` directory automatically.
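  If you want to confirm the directory and its permissions, one quick check (a sketch, assuming the `hadoop` client is on your `PATH`) is:

  ```
  # List /apps and verify that /apps/spark exists with mode drwxrwxrwx
  hadoop fs -ls /apps
  ```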
- Install Spark using the appropriate commands for your operating system:

  On CentOS 8.x / Red Hat 8.x:

  ```
  dnf install mapr-spark mapr-spark-master mapr-spark-historyserver mapr-spark-thriftserver
  ```

  On Ubuntu:

  ```
  apt-get install mapr-spark mapr-spark-master mapr-spark-historyserver mapr-spark-thriftserver
  ```

  On SLES:

  ```
  zypper install mapr-spark mapr-spark-master mapr-spark-historyserver mapr-spark-thriftserver
  ```

  NOTE: The `mapr-spark-historyserver`, `mapr-spark-master`, and `mapr-spark-thriftserver` packages are optional.

  Spark is installed into the `/opt/mapr/spark` directory.
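  To confirm what was installed, you can list the versioned Spark home directory; the exact version string depends on your EEP release:

  ```
  # The Spark home is a versioned subdirectory, for example /opt/mapr/spark/spark-3.x.x
  ls /opt/mapr/spark
  ```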
- For Spark 2.x: Copy `/opt/mapr/spark/spark-<version>/conf/slaves.template` into `/opt/mapr/spark/spark-<version>/conf/slaves`, and add the hostnames of the Spark worker nodes. Put one worker node hostname on each line.

  For Spark 3.x: Copy `/opt/mapr/spark/spark-<version>/conf/workers.template` into `/opt/mapr/spark/spark-<version>/conf/workers`, and add the hostnames of the Spark worker nodes. Put one worker node hostname on each line.

  For example:

  ```
  localhost
  worker-node-1
  worker-node-2
  ```
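  A minimal sketch of that copy-and-edit for Spark 3.x, using the example hostnames above (substitute your own version and worker hostnames):

  ```
  # Create the workers file from the template, then append one hostname per line
  cp /opt/mapr/spark/spark-<version>/conf/workers.template /opt/mapr/spark/spark-<version>/conf/workers
  echo "worker-node-1" >> /opt/mapr/spark/spark-<version>/conf/workers
  echo "worker-node-2" >> /opt/mapr/spark/spark-<version>/conf/workers
  ```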
- Set up passwordless SSH for the `mapr` user so that the Spark master node has access to all secondary nodes defined in the `conf/slaves` file for Spark 2.x or the `conf/workers` file for Spark 3.x.
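  One common way to set this up (a sketch, assuming `ssh-copy-id` is available and using the example worker hostnames from the previous step):

  ```
  # Run as the mapr user on the Spark master node
  ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa    # skip if the mapr user already has a key
  ssh-copy-id mapr@worker-node-1
  ssh-copy-id mapr@worker-node-2
  ```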
- As the `mapr` user, start the worker nodes by running the following command on the master node. Because the master daemon is managed by the Warden daemon, do not use the `start-all.sh` or `stop-all.sh` commands.

  For Spark 2.x:

  ```
  /opt/mapr/spark/spark-<version>/sbin/start-slaves.sh
  ```

  For Spark 3.x:

  ```
  /opt/mapr/spark/spark-<version>/sbin/start-workers.sh
  ```
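  To check that the worker daemons came up, one option (a sketch, assuming a JDK with `jps` is available on the worker nodes) is:

  ```
  # On each worker node, look for the Spark Worker JVM
  jps | grep -i worker
  ```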
- If you want to integrate Spark with HPE Data Fabric Streams, install the Streams Client on each Spark node:

  On Ubuntu:

  ```
  apt-get install mapr-kafka
  ```

  On RedHat/CentOS:

  ```
  yum install mapr-kafka
  ```
- If you want to use a Streaming Producer, add the `spark-streaming-kafka-producer_2.12.jar` from the HPE Data Fabric Maven repository to the Spark classpath (`/opt/mapr/spark/spark-<version>/jars/`); see the sketch after these steps.

- After installing Spark Standalone but before running your Spark jobs, follow the steps outlined at Configuring Spark Standalone.
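A minimal sketch of adding the producer jar to the Spark classpath; the download location and exact jar file name (shown here with placeholder versions) depend on your EEP release and the HPE Data Fabric Maven repository layout:

```
# Copy the downloaded producer jar into the Spark jars directory (placeholder names)
cp spark-streaming-kafka-producer_2.12-<jar-version>.jar /opt/mapr/spark/spark-<version>/jars/
```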