Installing Spark Standalone
This topic describes how to use package managers to download and install Spark Standalone from the EEP repository.
Prerequisites
About this task
Package | Description
---|---
`mapr-spark` | Install this package on any node where you want to install Spark. This package is dependent on the `mapr-client`, `mapr-hadoop-client`, `mapr-hadoop-util`, and `mapr-librdkafka` packages.
`mapr-spark-master` | Install this package on Spark master nodes. Spark master nodes must be able to communicate with Spark worker nodes over SSH without using passwords. This package is dependent on the `mapr-spark` and `mapr-core` packages.
`mapr-spark-historyserver` | Install this optional package on Spark History Server nodes. This package is dependent on the `mapr-spark` and `mapr-core` packages.
`mapr-spark-thriftserver` | Install this optional package on Spark Thrift Server nodes. This package is available starting in the EEP 4.0 release. It is dependent on the `mapr-spark` and `mapr-core` packages.
Run the following commands as `root` or using `sudo`.

Procedure
- Create the `/apps/spark` directory on the cluster filesystem, and set the correct permissions on the directory:

  ```shell
  hadoop fs -mkdir /apps/spark
  hadoop fs -chmod 777 /apps/spark
  ```

  NOTE: Beginning with EEP 6.2.0, the `configure.sh` script creates the `/apps/spark` directory automatically.
- Install Spark using the appropriate commands for your operating system:

  On CentOS 8.x / Red Hat 8.x:

  ```shell
  dnf install mapr-spark mapr-spark-master mapr-spark-historyserver mapr-spark-thriftserver
  ```

  On Ubuntu:

  ```shell
  apt-get install mapr-spark mapr-spark-master mapr-spark-historyserver mapr-spark-thriftserver
  ```

  On SLES:

  ```shell
  zypper install mapr-spark mapr-spark-master mapr-spark-historyserver mapr-spark-thriftserver
  ```

  NOTE: The `mapr-spark-historyserver`, `mapr-spark-master`, and `mapr-spark-thriftserver` packages are optional.

  Spark is installed into the `/opt/mapr/spark` directory.
- For Spark 2.x, copy `/opt/mapr/spark/spark-<version>/conf/slaves.template` to `/opt/mapr/spark/spark-<version>/conf/slaves`, and add the hostnames of the Spark worker nodes. Put one worker node hostname on each line.

  For Spark 3.x, copy `/opt/mapr/spark/spark-<version>/conf/workers.template` to `/opt/mapr/spark/spark-<version>/conf/workers`, and add the hostnames of the Spark worker nodes. Put one worker node hostname on each line.

  For example:

  ```
  localhost
  worker-node-1
  worker-node-2
  ```
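The copy-and-append step above can be scripted. The sketch below uses a scratch directory in place of `/opt/mapr/spark/spark-<version>/conf` and the example hostnames from this topic, so it can run anywhere; on a real node, point `CONF_DIR` at the actual Spark `conf` directory (and use `slaves.template`/`slaves` for Spark 2.x):

```shell
# Scratch directory standing in for /opt/mapr/spark/spark-<version>/conf.
CONF_DIR=$(mktemp -d)
# Stand-in for the workers.template shipped with Spark (it contains "localhost").
printf 'localhost\n' > "$CONF_DIR/workers.template"

# Copy the template, then append one worker hostname per line.
cp "$CONF_DIR/workers.template" "$CONF_DIR/workers"
printf 'worker-node-1\nworker-node-2\n' >> "$CONF_DIR/workers"

cat "$CONF_DIR/workers"
# → localhost, worker-node-1, worker-node-2 (one per line)
```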
- Set up passwordless SSH for the `mapr` user so that the Spark master node has access to all secondary nodes defined in the `conf/slaves` file (Spark 2.x) or the `conf/workers` file (Spark 3.x).
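One common way to set this up is `ssh-keygen` plus `ssh-copy-id`, run on the master node as the `mapr` user. In this sketch the key path is a scratch file (the usual location is `~/.ssh/id_rsa`), the worker hostnames are the examples from this topic, and the `ssh-copy-id` commands are only printed rather than executed, since the example hosts do not exist:

```shell
# Generate a passphrase-less RSA key pair (scratch path; normally ~/.ssh/id_rsa).
KEY=$(mktemp -u)
ssh-keygen -q -t rsa -b 2048 -N "" -f "$KEY"

# Push the public key to every worker listed in conf/workers.
# Printed here instead of run, because the hostnames are illustrative:
for host in worker-node-1 worker-node-2; do
  echo "ssh-copy-id -i $KEY.pub mapr@$host"
done
```

After copying the key, `ssh worker-node-1` from the master as the `mapr` user should succeed without a password prompt.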
As the
mapr
user, start the worker nodes by running the following command in the master node. Since the Master daemon is managed by the Warden daemon, do not use thestart-all.sh
orstop-all.sh
command.For Spark 2.x:/opt/mapr/spark/spark-<version>/sbin/start-slaves.sh
For Spark 3.x:/opt/mapr/spark/spark-<version>/sbin/start-workers.sh
- If you want to integrate Spark with HPE Ezmeral Data Fabric Streams, install the Streams Client on each Spark node:

  On Ubuntu:

  ```shell
  apt-get install mapr-kafka
  ```

  On RedHat/CentOS:

  ```shell
  yum install mapr-kafka
  ```
- If you want to use a Streaming Producer, add the `spark-streaming-kafka-producer_2.12.jar` from the HPE Ezmeral Data Fabric Maven repository to the Spark classpath (`/opt/mapr/spark/spark-<version>/jars/`).
- After installing Spark Standalone but before running your Spark jobs, follow the steps outlined in Configuring Spark Standalone.