Enabling GPU Support for Spark Operator

Describes how to enable GPU support and allocate GPU resources on Spark Operator.

To enable GPU processing and allocate GPU resources on Spark Operator, follow these steps:
  1. Set the image option within the spec property of the Spark application YAML file to gcr.io/mapr-252711/spark-gpu-<spark-version>:<image-tag>. For the list of Spark GPU images, see List of Spark Images.
  2. Add the following configuration options to the sparkConf section within the spec property.
    • To enable the RAPIDS plugin and allocate GPU resources, add:
      # Enable the RAPIDS plugin
      spark.plugins: "com.nvidia.spark.SQLPlugin"
      spark.rapids.sql.enabled: "true"
      spark.rapids.force.caller.classloader: "false"
       
      # GPU allocation and discovery settings
      spark.task.resource.gpu.amount: "1"
      spark.executor.resource.gpu.amount: "1"
      spark.executor.resource.gpu.vendor: "nvidia.com"
      
    • To set the path to the GPU discovery script, add the following option; a sketch of what this script produces is shown after this list:
      spark.executor.resource.gpu.discoveryScript: "/opt/mapr/spark/spark-<spark-version>/examples/src/main/scripts/getGpusResources.sh"
    • To set the RAPIDS shim layer used for the run, add:
      spark.rapids.shims-provider-override: "com.nvidia.spark.rapids.shims.<spark-identifier>.SparkShimServiceProvider"
      The Spark version distributed by Hewlett Packard Enterprise is compatible with its corresponding open-source version. The RAPIDS JAR includes shim layer provider classes named com.nvidia.spark.rapids.shims.<spark-identifier>.SparkShimServiceProvider, where <spark-identifier> depends on the Spark version distributed by Hewlett Packard Enterprise. For example:
      • For spark-3.5.0, the identifier is spark350, so for spark-gpu-3.5.0 set the RAPIDS shim layer as follows:
        spark.rapids.shims-provider-override: "com.nvidia.spark.rapids.shims.spark350.SparkShimServiceProvider"
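
The GPU discovery script tells Spark which GPU addresses are available on each executor. For reference, the following is a minimal sketch of an equivalent script, not the literal contents of the bundled getGpusResources.sh: it enumerates NVIDIA GPU indexes with nvidia-smi and prints them in the JSON resource format that Spark expects.
#!/usr/bin/env bash
# Sketch of a GPU discovery script: list NVIDIA GPU indexes and print
# them as a JSON resource descriptor for Spark, for example:
# {"name": "gpu", "addresses":["0","1"]}
ADDRS=$(nvidia-smi --query-gpu=index --format=csv,noheader | paste -sd "," - | sed 's/,/","/g')
echo "{\"name\": \"gpu\", \"addresses\":[\"$ADDRS\"]}"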

Verifying Spark Applications Are Running on GPU

To verify that Spark applications are running on GPU, use Spark's explain method, which prints the physical plan.

Run the following PySpark application:
from pyspark.sql import SparkSession

# Create or reuse a SparkSession; the GPU settings come from sparkConf
spark = SparkSession.builder.getOrCreate()

# Build a small single-column DataFrame and register it as a temp view
df = spark.createDataFrame([1, 2, 3], "int").toDF("value")
df.createOrReplaceTempView("df")

# explain() prints the physical plan; GPU stages appear as Gpu* operators
spark.sql("SELECT * FROM df WHERE value<>1").explain()
spark.sql("SELECT * FROM df WHERE value<>1").show()

spark.stop()
If the explain method prints GPU-related stages, as in the following output, your Spark application is running on GPU:
== Physical Plan ==
GpuColumnarToRow false
+- GpuFilter NOT (value#2 = 1), true
   +- GpuRowToColumnar targetsize(2147483647)
      +- *(1) SerializeFromObject [input[0, int, false] AS value#2]
         +- Scan[obj#1]
However, if you get the following output, your Spark application is running on CPU instead of GPU. Ensure that the Spark application is configured correctly for GPU, as described in the preceding steps.
== Physical Plan ==
*(1) Filter NOT (value#2 = 1)
+- *(1) SerializeFromObject [input[0, int, false] AS value#2]
   +- Scan[obj#1]
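
In addition to the explain output, you can check the driver pod logs for RAPIDS activity. A minimal sketch, assuming the driver pod is named spark-eep-gpu-350-driver (a hypothetical name; list pods with kubectl get pods -n spark) and runs in the spark namespace:
# Search the driver log for RAPIDS plugin class names
kubectl logs spark-eep-gpu-350-driver -n spark | grep -i rapids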

Spark Operator YAML Example Using GPU for Spark 3.5.0

Example:
apiVersion: "sparkoperator.hpe.com/v1beta2"
kind: SparkApplication
metadata:
  name: spark-eep-gpu-350
  namespace: spark
spec:
  sparkConf:
    # Enable the RAPIDS plugin
    spark.plugins: "com.nvidia.spark.SQLPlugin"
    spark.rapids.sql.enabled: "true"
    spark.rapids.force.caller.classloader: "false"
 
    # GPU allocation and discovery settings
    spark.task.resource.gpu.amount: "1"
    spark.executor.resource.gpu.amount: "1"
    spark.executor.resource.gpu.vendor: "nvidia.com"
    spark.executor.resource.gpu.discoveryScript: "/opt/mapr/spark/spark-3.5.0/examples/src/main/scripts/getGpusResources.sh"
    spark.rapids.shims-provider-override: "com.nvidia.spark.rapids.shims.spark350.SparkShimServiceProvider"
 
  type: Python
  sparkVersion: 3.5.0
  mode: cluster
  image: gcr.io/mapr-252711/spark-gpu-3.5.0:v3.5.0
  imagePullPolicy: Always
  mainApplicationFile: .../path/to/application.py
  restartPolicy:
    type: Never
  imagePullSecrets:
    - imagepull
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "1024m"
    labels:
      version: 3.5.0
  executor:
    cores: 1
    coreLimit: "1000m"
    instances: 1
    memory: "2G"
    labels:
      version: 3.5.0
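
After saving the manifest, you can submit it with kubectl and watch the application status. A minimal sketch, assuming the manifest is saved as spark-eep-gpu-350.yaml (a hypothetical file name) and the SparkApplication CRD is installed:
# Submit the SparkApplication to the cluster
kubectl apply -f spark-eep-gpu-350.yaml

# Watch the application status until it reports RUNNING or COMPLETED
kubectl get sparkapplication spark-eep-gpu-350 -n spark -w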