Configuring a Spark Application to Directly Access Data in an External S3 Data Source
Describes how to configure a Spark application to connect directly to an external S3 data source.
To connect a Spark application directly to an external S3 data source, you must provide the following information in the sparkConf section of the Spark application YAML file:
- Access credentials (access key, secret key)
- Secret (to securely pass configuration values)
- Endpoint URL
- Bucket name (name of the bucket in the S3 data source)
- Region (domain)
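The sparkConf options described in this topic go in the Spark application YAML file. The exact manifest layout depends on how your applications are submitted; the following is a minimal sketch assuming the Kubernetes Spark Operator SparkApplication resource, with placeholder names and paths:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: s3-access-example          # hypothetical application name
  namespace: <your-namespace>
spec:
  type: Python                     # or Java/Scala/R, depending on your application
  mainApplicationFile: <path-to-application-file>
  sparkConf:
    # S3 connection options from the sections below go here, for example:
    spark.hadoop.fs.s3a.endpoint: <S3-endpoint>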
HPE-Curated Spark
Use these instructions if you are configuring a Spark application using the HPE-Curated Spark image. Using the EzSparkAWSCredentialProvider option in the configuration automatically generates the secret for you.
The following example shows the required configuration options for a Spark application that uses the HPE-Curated Spark image:
sparkConf:
  spark.hadoop.fs.s3a.access.key: <S3-ACCESS_KEY>
  spark.hadoop.fs.s3a.secret.key: <S3-SECRET-KEY>
  spark.hadoop.fs.s3a.endpoint: <S3-endpoint>
  spark.hadoop.fs.s3a.connection.ssl.enabled: "true"
  spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
  spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.spark.s3a.EzSparkAWSCredentialProvider"
  spark.hadoop.fs.s3a.path.style.access: "true"
(AWS S3 Only) If you are connecting the Spark application to an AWS S3 data source, you must also include the following options in the sparkConf section:
spark.driver.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
spark.executor.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
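These extraJavaOptions point the driver and executor JVMs at the Java CA trust store bundled in the image so that the TLS certificate of the AWS endpoint can be verified. If you want to confirm that the trust store is present, one quick check from inside the container is the following (assuming keytool is on the PATH and the store uses the default changeit password):
keytool -list -keystore /etc/pki/java/cacerts -storepass changeit | head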
Spark OSS
Use these instructions if you are configuring a Spark application using the Spark OSS image.
Complete the following steps:
- Generate a secret. Use either of the following methods to generate a secret:
- Use a Notebook to Generate the Secret
Use a notebook to create a Kubernetes secret with Base64-encoded values for the AWS_ACCESS_KEY_ID (username) and AWS_SECRET_ACCESS_KEY (password). See Creating and Managing Notebook Servers. For example, run kubectl apply -f for the following YAML:
apiVersion: v1
kind: Secret
metadata:
  name: <K8s-secret-name-for-S3>
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <Base64-encoded value; example: dXNlcg==>
  AWS_SECRET_ACCESS_KEY: <Base64-encoded value; example: cGFzc3dvcmQ=>
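You can produce the Base64-encoded values on any machine with the base64 utility before pasting them into the manifest, and then apply the manifest; the credentials and file name below are placeholders:
# -n prevents a trailing newline from being included in the encoded value
echo -n '<ACCESS_KEY>' | base64
echo -n '<SECRET_KEY>' | base64
kubectl apply -f <secret-manifest>.yaml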
- Use a Configuration File to Generate the Secret
Create a spark-defaults.conf file to generate the secret. Provide the object store access key and secret key as values for the spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key properties in the file.
- Create a spark-defaults.conf file with the following properties:
spark.hadoop.fs.s3a.access.key EXAMPLE_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key EXAMPLE_SECRET_KEY
- Create a secret from the spark-defaults.conf file:
kubectl create secret generic <k8s-secret-name> --from-file=spark-defaults.conf
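To confirm that the secret exists and contains the configuration file, you can inspect it with kubectl; describe lists the keys and their sizes without exposing the values:
kubectl describe secret <k8s-secret-name>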
- Configure the Spark application. The following example demonstrates how to add the required fields to the sparkConf section of the Spark application YAML file:
sparkConf:
  spark.hadoop.fs.s3a.access.key: <S3-ACCESS_KEY>
  spark.hadoop.fs.s3a.secret.key: <S3-SECRET-KEY>
  spark.hadoop.fs.s3a.connection.ssl.enabled: "true"
  spark.hadoop.fs.s3a.endpoint: <S3-endpoint>
  spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
  spark.hadoop.fs.s3a.path.style.access: "true"
(AWS S3 Only) If you are connecting the Spark application to an AWS S3 data source, you must also include the following options in the sparkConf section:
spark.driver.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
spark.executor.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
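With the s3a options in place (using either image), the application itself refers to the bucket with s3a:// URIs and does not need credentials in code. The following is a minimal PySpark sketch; the bucket name and object path are placeholders:
from pyspark.sql import SparkSession

# Credentials and endpoint come from the sparkConf section of the application YAML,
# so only the s3a:// path is needed here.
spark = SparkSession.builder.appName("s3-read-example").getOrCreate()

df = spark.read.csv("s3a://<bucket-name>/<path-to-data>.csv", header=True)
df.show(5)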
(Optional) Setting Environment Variables for the Access Key and Secret Key
Regardless of the type of Spark image used, you can set environment variables for the access key and secret key, as shown:
spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID: <ACCESS_KEY>
spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY: <SECRET_KEY>
spark.executorEnv.AWS_ACCESS_KEY_ID: <ACCESS_KEY>
spark.executorEnv.AWS_SECRET_ACCESS_KEY: <SECRET_KEY>
IMPORTANT
When you set these environment variables, the user access token (JWT) is not
automatically refreshed if the endpoint URL changes. To refresh the token, you must run
%update_token.