Configuring a Spark Application to Directly Access Data in an External S3 Data Source

Describes how to configure a Spark application to connect directly to an external S3 data source.

To connect a Spark application directly to an external S3 data source, you must provide the following information in the sparkConf section of the Spark application YAML file:
  • Access credentials (access key, secret key)
  • Secret (to securely pass configuration values)
  • Endpoint URL
  • Bucket name (name of the bucket in the S3 data source)
  • Region (domain)
How you configure the Spark application depends on the type of Spark image used: HPE-Curated Spark or Spark OSS. Follow the steps in the section that applies to your image.

HPE-Curated Spark

Use these instructions if you are configuring a Spark application using the HPE-Curated Spark image.

Using the EzSparkAWSCredentialProvider option in the configuration automatically generates the secret for you.

The following example shows the required configuration options for a Spark application that uses the HPE-Curated Spark image:
sparkConf:
    spark.hadoop.fs.s3a.access.key: <S3-ACCESS-KEY>
    spark.hadoop.fs.s3a.secret.key: <S3-SECRET-KEY>
    spark.hadoop.fs.s3a.endpoint: <S3-endpoint>
    spark.hadoop.fs.s3a.connection.ssl.enabled: "true"
    spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.spark.s3a.EzSparkAWSCredentialProvider"
    spark.hadoop.fs.s3a.path.style.access: "true"
(AWS S3 Only) If you are connecting the Spark application to an AWS S3 data source, you must also include the following options in the sparkConf section:
spark.driver.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
spark.executor.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
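For reference, the following minimal sketch shows where the sparkConf section sits within a complete Spark application YAML file. It assumes the sparkoperator.k8s.io/v1beta2 SparkApplication resource; the application name, namespace, image reference, Spark version, resource sizes, and service account shown here are illustrative placeholders, not values prescribed by this guide:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: s3-read-example                     # hypothetical application name
  namespace: <namespace>
spec:
  type: Python
  mode: cluster
  image: <HPE-Curated-Spark-image>          # image reference for your deployment
  sparkVersion: <Spark-version>
  mainApplicationFile: s3a://<S3-bucket-name>/<path-to-application-file>
  sparkConf:
    spark.hadoop.fs.s3a.access.key: <S3-ACCESS-KEY>
    spark.hadoop.fs.s3a.secret.key: <S3-SECRET-KEY>
    spark.hadoop.fs.s3a.endpoint: <S3-endpoint>
    spark.hadoop.fs.s3a.connection.ssl.enabled: "true"
    spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.spark.s3a.EzSparkAWSCredentialProvider"
    spark.hadoop.fs.s3a.path.style.access: "true"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: <spark-service-account>
  executor:
    cores: 1
    instances: 1
    memory: 512m
Adjust the driver and executor settings and the mainApplicationFile path to match your application; the s3a:// path illustrates how the bucket name listed in the prerequisites is used.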

Spark OSS

Use these instructions if you are configuring a Spark application using the Spark OSS image.

Complete the following steps:
  1. Generate a secret.
    Use either of the following methods to generate a secret:
    Use a Notebook to Generate the Secret
    Use a notebook to create a Kubernetes secret with Base64-encoded values for the AWS_ACCESS_KEY_ID (username) and AWS_SECRET_ACCESS_KEY (password). A shell sketch showing one way to produce and verify these values appears after these steps.
    For example, run kubectl apply -f on the following YAML:
    apiVersion: v1
    kind: Secret
    data:
      AWS_ACCESS_KEY_ID: <Base64-encoded value; example: dXNlcg==>
      AWS_SECRET_ACCESS_KEY: <Base64-encoded value; example: cGFzc3dvcmQ=>
    metadata:
      name: <K8s-secret-name-for-S3>
    type: Opaque
    See Creating and Managing Notebook Servers.
    Use a Configuration File to Generate the Secret
    Create a spark-defaults.conf file to generate the secret. Provide the object store access key and secret key as values for the spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key properties in the file.
    1. Create a spark-defaults.conf file with the following properties:
      spark.hadoop.fs.s3a.access.key EXAMPLE_ACCESS_KEY
      spark.hadoop.fs.s3a.secret.key EXAMPLE_SECRET_KEY
    2. Create a secret from the spark-defaults.conf file:
      kubectl create secret generic <k8s-secret-name> --from-file=spark-defaults.conf
  2. Configure the Spark application.
    The following example demonstrates how to add the required fields to the sparkConf section of the Spark application YAML file:
    sparkConf:
        spark.hadoop.fs.s3a.access.key: <S3-ACCESS-KEY>
        spark.hadoop.fs.s3a.secret.key: <S3-SECRET-KEY>
        spark.hadoop.fs.s3a.connection.ssl.enabled: "true"
        spark.hadoop.fs.s3a.endpoint: <S3-endpoint>
        spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
        spark.hadoop.fs.s3a.path.style.access: "true"
        
(AWS S3 Only) If you are connecting the Spark application to an AWS S3 data source, you must also include the following options in the sparkConf section:
spark.driver.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
spark.executor.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
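As a minimal sketch of the secret workflow from step 1, the following commands show one way to produce the Base64-encoded values and to confirm that the secret was created. The example credential values and the s3-secret.yaml file name are hypothetical:
# Base64-encode the S3 access key and secret key (example values only).
echo -n 'user' | base64        # prints dXNlcg==
echo -n 'password' | base64    # prints cGFzc3dvcmQ=

# Apply the Secret manifest from step 1 (saved here as a hypothetical s3-secret.yaml),
# then confirm that the secret exists in the target namespace.
kubectl apply -f s3-secret.yaml
kubectl get secret <K8s-secret-name-for-S3> -o yaml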

(Optional) Setting Environment Variables for the Access Key and Secret Key

Regardless of the type of Spark image used, you can set environment variables for the access key and secret key.

Set environment variables for the access key and secret key by adding the following properties to the sparkConf section:
spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID: <ACCESS_KEY>
spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY: <SECRET_KEY>
spark.executorEnv.AWS_ACCESS_KEY_ID: <ACCESS_KEY>
spark.executorEnv.AWS_SECRET_ACCESS_KEY: <SECRET_KEY>
IMPORTANT
When you set these environment variables, the user access token (JWT) is not automatically refreshed if the endpoint URL changes. To refresh the token, you must run %update_token.
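If you prefer not to hard-code the keys in the YAML file, Spark on Kubernetes also supports populating these environment variables from a Kubernetes secret through its secretKeyRef options (available in Spark 2.4 and later). The following sketch assumes the secret created earlier uses AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as its data keys; verify the option names against the Spark version in your image:
spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID: <K8s-secret-name-for-S3>:AWS_ACCESS_KEY_ID
spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY: <K8s-secret-name-for-S3>:AWS_SECRET_ACCESS_KEY
spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID: <K8s-secret-name-for-S3>:AWS_ACCESS_KEY_ID
spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY: <K8s-secret-name-for-S3>:AWS_SECRET_ACCESS_KEY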