Configuring a Spark Application to Access Data in an External S3 Data Source through the S3 Proxy Layer
Describes how to configure a Spark application to connect to an external S3 data source through the S3 proxy layer in HPE Ezmeral Unified Analytics Software.
To connect a Spark application to an external S3 data source through the S3 proxy layer, specify the following information in the sparkConf section of the Spark application YAML file:
- Data source name
- Endpoint URL
- Secret (to securely pass configuration values)
- Bucket that you want the client to access
Once connected, the Spark application can:
- Read and download files from a bucket
- Upload files to a bucket
- Create buckets
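For example, once the configuration described in the sections below is in place, application code addresses objects with s3a:// paths. The following PySpark snippet is a minimal sketch; the bucket name (my-bucket) and object paths are placeholders, not values defined in this topic:
from pyspark.sql import SparkSession

# The s3a endpoint, credentials, and related options come from the sparkConf
# section of the Spark application YAML (see the sections below), so the code
# itself only needs s3a:// paths. Bucket and object names here are placeholders.
spark = SparkSession.builder.appName("s3-proxy-example").getOrCreate()

# Read a CSV file from a bucket exposed through the S3 proxy
df = spark.read.option("header", "true").csv("s3a://my-bucket/input/data.csv")
df.show(5)

# Write the result back to the bucket as Parquet
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")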
Getting the Data Source Name and S3 Proxy Endpoint URL
To get the data source name and S3 proxy endpoint URL:
- Sign in to HPE Ezmeral Unified Analytics Software.
- In the left navigation bar, select Data Engineering > Data Sources.
- On the Data Sources page, find the tile for the S3 data source that you want the Spark application to connect to.
The following image shows an example of a tile for an AWS S3 data source with the name (aws-s3) and the endpoint URL (http://aws-s3-service-ezdata.svc.cluster.local:30000):
NOTE: By default, a local-s3 Ezmeral Data Fabric tile also displays on the screen. This Ezmeral Data Fabric instance of S3 is used internally by HPE Ezmeral Unified Analytics Software. Do not connect to this data source.
- Note the data source name and endpoint URL, and then use them in the Spark configuration.
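If you want to verify the endpoint URL and your credentials before configuring Spark, one option is a quick check with boto3 from a notebook. This is only a sketch; it assumes boto3 is available in your environment, and the access key and secret key values are placeholders:
import boto3

# Placeholders: use the endpoint URL from the data source tile and the
# credentials that you use for the S3 data source.
s3 = boto3.client(
    "s3",
    endpoint_url="http://aws-s3-service-ezdata.svc.cluster.local:30000",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
)

# List the buckets visible through the S3 proxy to confirm connectivity
for bucket in s3.list_buckets().get("Buckets", []):
    print(bucket["Name"])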
Configuring Spark
- If you used the HPE-Curated Spark image to create the Spark application, see HPE-Curated Spark.
- If you used the Spark OSS image to create the Spark application, see Spark OSS.
HPE-Curated Spark
Use these instructions if you are configuring a Spark application using the HPE-Curated Spark image.
Add the following fields to the sparkConf section of the Spark application YAML file. Using the EzSparkAWSCredentialProvider option in the configuration automatically generates the secret for you.
sparkConf:
  spark.hadoop.fs.s3a.endpoint: <S3-endpoint>
  spark.hadoop.fs.s3a.connection.ssl.enabled: "true"
  spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
  spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.spark.s3a.EzSparkAWSCredentialProvider"
  spark.hadoop.fs.s3a.path.style.access: "true"
(AWS S3 Only) If you are connecting the Spark application to an AWS S3 data source, you must also include the following options in the sparkConf section:
spark.driver.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
spark.executor.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
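For interactive testing from a notebook, the same options can also be set programmatically when the Spark session is built. This is only a sketch; it assumes the HPE-Curated Spark image (which provides the EzSparkAWSCredentialProvider class), and <S3-endpoint> is the placeholder used in the example above:
from pyspark.sql import SparkSession

# Mirrors the sparkConf example above; <S3-endpoint> is a placeholder for the
# endpoint URL shown on the data source tile.
spark = (
    SparkSession.builder.appName("s3-proxy-curated")
    .config("spark.hadoop.fs.s3a.endpoint", "<S3-endpoint>")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.spark.s3a.EzSparkAWSCredentialProvider")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)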
Spark OSS
Use these instructions if you are configuring a Spark application using the Spark OSS image.
- Generate a secret.
Use either of the following methods to generate a secret:
- Apply a YAML
- Use a notebook to create a Kubernetes secret with Base64-encoded values for the AWS_ACCESS_KEY_ID (username) and AWS_SECRET_ACCESS_KEY (password). For example, run kubectl apply -f on a file that contains the following YAML. See Creating and Managing Notebook Servers. For a quick way to generate the Base64-encoded values, see the encoding sketch at the end of this topic.
apiVersion: v1
kind: Secret
data:
  AWS_ACCESS_KEY_ID: <Base64-encoded value; example: dXNlcg==>
  AWS_SECRET_ACCESS_KEY: <Base64-encoded value; example: cGFzc3dvcmQ=>
metadata:
  name: <K8s-secret-name-for-S3>
type: Opaque
- Run a Script in a Notebook
- Run the following script in a notebook to generate the secret. You can also access a sample script from your notebook server in the shared/ezua-tutorials/Data-Analytics/ directory.
import os
import subprocess

def deploy_s3_secret(namespace, spark_secret):
    try:
        # Remove any existing secret, then create it with kubectl via subprocess
        subprocess.run(['kubectl', 'delete', 'secret', spark_secret, '-n', namespace], check=False)
        subprocess.run(['kubectl', 'create', 'secret', 'generic', spark_secret, '-n', namespace,
                        '--from-file=spark-defaults.conf'], check=True)
        print("Secret creation successful!")
    except subprocess.CalledProcessError as e:
        print(f"Secret creation failed. Error: {e}")

# Build the spark-defaults.conf content: the access key is set from the
# AUTH_TOKEN environment variable and the secret key to the literal string "s3"
s3_access_data = "spark.hadoop.fs.s3a.access.key EXAMPLE_ACCESS_KEY"
s3_secret_data = "spark.hadoop.fs.s3a.secret.key EXAMPLE_SECRET_KEY"
s3_data = s3_access_data.replace('EXAMPLE_ACCESS_KEY', os.environ['AUTH_TOKEN'])
s3_data += "\n" + s3_secret_data.replace("EXAMPLE_SECRET_KEY", "s3")

namespace = os.environ['USER']
spark_secret = "spark-s3-secret"

# Save the data to a file named spark-defaults.conf
with open('spark-defaults.conf', 'w') as file:
    file.write(s3_data)

# Call the function to deploy the Kubernetes secret
deploy_s3_secret(namespace, spark_secret)
- Configure the Spark application. The following example demonstrates how to add the required fields to the sparkConf section of the Spark application YAML file:
sparkConf:
  spark.hadoop.fs.s3a.connection.ssl.enabled: "true"
  spark.hadoop.fs.s3a.endpoint: <S3-endpoint>
  spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
  spark.hadoop.fs.s3a.path.style.access: "true"
(AWS S3 Only) If you are connecting the Spark application to an AWS S3 data source, you must also include the following options in the sparkConf section:
spark.driver.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
spark.executor.extraJavaOptions: -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts
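If you use the apply-a-YAML method in step 1 of the Spark OSS instructions, the values in the data section of the secret must be Base64 encoded. The following sketch uses Python's standard base64 module; the username and password shown are placeholders that decode to the example values in the YAML above:
import base64

# Placeholders: replace with the credentials for your S3 data source.
access_key_id = "user"          # encodes to dXNlcg==
secret_access_key = "password"  # encodes to cGFzc3dvcmQ=

# Base64-encode the values for the data section of the Kubernetes Secret YAML
print("AWS_ACCESS_KEY_ID:", base64.b64encode(access_key_id.encode()).decode())
print("AWS_SECRET_ACCESS_KEY:", base64.b64encode(secret_access_key.encode()).decode())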