Configuring a Spark Application to Directly Access Data in an External S3 Data Source

Describes how to configure a Spark application to connect directly to an external S3 data source.

To connect a Spark application directly to an external S3 data source, you must provide the following information in the sparkConf section of the Spark application YAML file:
  • Access credentials (access key, secret key)
  • Secret (to securely pass configuration values)
  • Endpoint URL
  • Bucket name (name of the bucket in the S3 data source)
  • Region (domain)
How you configure the Spark application depends on the type of Spark image used: HPE-Curated Spark or Spark OSS. Follow the steps in the section that applies to your image.

HPE-Curated Spark

Use these instructions if you are configuring a Spark application using the HPE-Curated Spark image.

Using the EzSparkAWSCredentialProvider option in the configuration automatically generates the secret for you.

The following example shows the required configuration options for a Spark application that uses the HPE-Curated Spark image:
sparkConf:
    spark.hadoop.fs.s3a.access.key: <S3-ACCESS-KEY>
    spark.hadoop.fs.s3a.secret.key: <S3-SECRET-KEY>
    spark.hadoop.fs.s3a.endpoint: <S3-endpoint>
    spark.hadoop.fs.s3a.connection.ssl.enabled: "true"
    spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.spark.s3a.EzSparkAWSCredentialProvider"
    spark.hadoop.fs.s3a.path.style.access: "true"
(AWS S3 Only) If you are connecting the Spark application to an AWS S3 data source, you must also include the following options in the sparkConf section:
spark.driver.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
spark.executor.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
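For reference, the following minimal sketch shows where the sparkConf section sits within a complete Spark application YAML file. It assumes the sparkoperator.k8s.io/v1beta2 SparkApplication resource; the application name, namespace, image reference, Spark version, resource sizes, and service account shown here are illustrative placeholders, not values prescribed by this guide:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: s3-read-example                     # hypothetical application name
  namespace: <namespace>
spec:
  type: Python
  mode: cluster
  image: <HPE-Curated-Spark-image>          # image reference for your deployment
  sparkVersion: <Spark-version>
  mainApplicationFile: s3a://<S3-bucket-name>/<path-to-application-file>
  sparkConf:
    spark.hadoop.fs.s3a.access.key: <S3-ACCESS-KEY>
    spark.hadoop.fs.s3a.secret.key: <S3-SECRET-KEY>
    spark.hadoop.fs.s3a.endpoint: <S3-endpoint>
    spark.hadoop.fs.s3a.connection.ssl.enabled: "true"
    spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.spark.s3a.EzSparkAWSCredentialProvider"
    spark.hadoop.fs.s3a.path.style.access: "true"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: <spark-service-account>
  executor:
    cores: 1
    instances: 1
    memory: 512m
Adjust the driver and executor settings and the mainApplicationFile path to match your application; the s3a:// path illustrates how the bucket name listed in the prerequisites is used.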

Spark OSS

Use these instructions if you are configuring a Spark application using the Spark OSS image.

Complete the following steps:
  1. Generate a secret.
    Use either of the following methods to generate a secret:
    Use a Notebook to Generate the Secret
    Use a notebook to create a Kubernetes secret with Base64-encoded values for the AWS_ACCESS_KEY_ID (username) and AWS_SECRET_ACCESS_KEY (password). A shell sketch showing one way to produce and verify these values appears after these steps.
    For example, run kubectl apply -f on the following YAML:
    apiVersion: v1
    kind: Secret
    data:
      AWS_ACCESS_KEY_ID: <Base64-encoded value; example: dXNlcg==>
      AWS_SECRET_ACCESS_KEY: <Base64-encoded value; example: cGFzc3dvcmQ=>
    metadata:
      name: <K8s-secret-name-for-S3>
    type: Opaque
    See Creating and Managing Notebook Servers.
    Use a Configuration File to Generate the Secret
    Create a spark-defaults.conf file to generate the secret. Provide the object store access key and secret key as values for the spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key properties in the file.
    1. Create a spark-defaults.conf file with the following properties:
      spark.hadoop.fs.s3a.access.key EXAMPLE_ACCESS_KEY
      spark.hadoop.fs.s3a.secret.key EXAMPLE_SECRET_KEY
    2. Create a secret from the spark-defaults.conf file:
      kubectl create secret generic <k8s-secret-name> --from-file=spark-defaults.conf
  2. Configure the Spark application.
    The following example demonstrates how to add the required fields to the sparkConf section of the Spark application YAML file:
    sparkConf:
        spark.hadoop.fs.s3a.access.key: <S3-ACCESS-KEY>
        spark.hadoop.fs.s3a.secret.key: <S3-SECRET-KEY>
        spark.hadoop.fs.s3a.connection.ssl.enabled: "true"
        spark.hadoop.fs.s3a.endpoint: <S3-endpoint>
        spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
        spark.hadoop.fs.s3a.path.style.access: "true"
        
(AWS S3 Only) If you are connecting the Spark application to an AWS S3 data source, you must also include the following options in the sparkConf section:
spark.driver.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
spark.executor.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
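As a minimal sketch of the secret workflow from step 1, the following commands show one way to produce the Base64-encoded values and to confirm that the secret was created. The example credential values and the s3-secret.yaml file name are hypothetical:
# Base64-encode the S3 access key and secret key (example values only).
echo -n 'user' | base64        # prints dXNlcg==
echo -n 'password' | base64    # prints cGFzc3dvcmQ=

# Apply the Secret manifest from step 1 (saved here as a hypothetical s3-secret.yaml),
# then confirm that the secret exists in the target namespace.
kubectl apply -f s3-secret.yaml
kubectl get secret <K8s-secret-name-for-S3> -o yaml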

(Optional) Setting Environment Variables for the Access Key and Secret Key

Regardless of the type of Spark image used, you can set environment variables for the access key and secret key.

Set environment variables for the access key and secret key by adding the following properties to the sparkConf section:
spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID: <ACCESS_KEY>
spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY: <SECRET_KEY>
spark.executorEnv.AWS_ACCESS_KEY_ID: <ACCESS_KEY>
spark.executorEnv.AWS_SECRET_ACCESS_KEY: <SECRET_KEY>
IMPORTANT
When you set these environment variables, the user access token (JWT) is not automatically refreshed if the endpoint URL changes. To refresh the token, you must run %update_token.
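If you prefer not to hard-code the keys in the YAML file, Spark on Kubernetes also supports populating these environment variables from a Kubernetes secret through its secretKeyRef options (available in Spark 2.4 and later). The following sketch assumes the secret created earlier uses AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as its data keys; verify the option names against the Spark version in your image:
spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID: <K8s-secret-name-for-S3>:AWS_ACCESS_KEY_ID
spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY: <K8s-secret-name-for-S3>:AWS_SECRET_ACCESS_KEY
spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID: <K8s-secret-name-for-S3>:AWS_ACCESS_KEY_ID
spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY: <K8s-secret-name-for-S3>:AWS_SECRET_ACCESS_KEY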