Integrate Spark with R
You integrate Spark with R when you want to run R programs as Spark jobs.
Procedure
- On each node that will submit Spark jobs, install R 3.2.2 or greater:
  - On Ubuntu:
    apt-get install r-base-dev
  - On CentOS/RedHat:
    yum install R
For more information about installing R, see the R documentation.
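The version requirement in this step can be checked from the shell after installation. The following is a minimal sketch, not part of the product: the function names `version_at_least` and `check_r_version` are hypothetical, and it assumes `Rscript` is on the PATH and that `sort -V` (GNU coreutils version sort) is available on the node.

```shell
#!/bin/sh
# Minimum R version stated in the installation step.
REQUIRED_R="3.2.2"

# Returns 0 when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) from GNU coreutils.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Reads the installed R version through Rscript and compares it
# with the required minimum. Returns non-zero if R is missing or too old.
check_r_version() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "R is not installed (Rscript not found on PATH)"
        return 1
    fi
    installed="$(Rscript -e 'cat(as.character(getRversion()))')"
    if version_at_least "$installed" "$REQUIRED_R"; then
        echo "R $installed satisfies the minimum of $REQUIRED_R"
    else
        echo "R $installed is older than the required $REQUIRED_R"
        return 1
    fi
}
```

Running `check_r_version` on each node that submits Spark jobs reports whether the installed R meets the minimum; the comparison is done by version sort rather than string comparison, so a release such as 3.10.0 is correctly treated as newer than 3.2.2.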
- To verify the integration, run the following commands as the mapr user or as a user that mapr impersonates:
  - Start SparkR:
    - On Spark 2.0.1, 2.1.0, and later:
      /opt/mapr/spark/spark-<version>/bin/sparkR --master <master> [--deploy-mode <deploy-mode>]
    - On Spark 1.6.1:
      /opt/mapr/spark/spark-<version>/bin/sparkR --master <master-url>
  - Run the following command to create a DataFrame using sample data:
    - On Spark 1.6.1:
      people <- read.df(sqlContext, "file:///opt/mapr/spark/spark-<version>/examples/src/main/resources/people.json", "json")
    - On Spark 2.0.1, 2.1.0, and later:
      people <- read.df(spark, "file:///opt/mapr/spark/spark-<version>/examples/src/main/resources/people.json", "json")
  - Run the following command to display the data from the DataFrame that you just created:
    head(people)
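The two launch commands in the verification steps differ only in whether the optional --deploy-mode flag is accepted. As an illustration, the sketch below assembles the command line for a given Spark version. The helper name `sparkr_command` is hypothetical and only prints the command; it assumes the /opt/mapr/spark/spark-<version> layout shown in this procedure and does not check that the launcher exists.

```shell
#!/bin/sh
# Print the sparkR launch command for a given Spark version.
# Arguments: <spark-version> <master> [deploy-mode]
sparkr_command() {
    version="$1"
    master="$2"
    deploy_mode="${3:-}"
    launcher="/opt/mapr/spark/spark-${version}/bin/sparkR"

    case "$version" in
        1.6.1)
            # Spark 1.6.1 form from the procedure: --master only.
            echo "${launcher} --master ${master}"
            ;;
        *)
            # Spark 2.0.1, 2.1.0, and later: --deploy-mode is optional.
            if [ -n "$deploy_mode" ]; then
                echo "${launcher} --master ${master} --deploy-mode ${deploy_mode}"
            else
                echo "${launcher} --master ${master}"
            fi
            ;;
    esac
}
```

For example, `sparkr_command 2.1.0 yarn client` prints `/opt/mapr/spark/spark-2.1.0/bin/sparkR --master yarn --deploy-mode client`, which can then be run as the mapr user before entering the read.df and head commands at the SparkR prompt.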