Accessing Data on Amazon S3 Using Spark Operator

This topic describes how to access data in an Amazon S3 bucket using the Hadoop S3A client.

Amazon Web Services (AWS) offers Amazon Simple Storage Service (Amazon S3), which provides storage and retrieval of objects through a web service interface.

Your Spark job can access data stored in an Amazon S3 bucket by using the Hadoop S3A client. For the full list of Hadoop S3A client configuration options, see Hadoop-AWS module: Integration with Amazon Web Services.

Adding S3A Credentials through YAML

Add the following configuration options to the sparkConf section of the SparkApplication, and then submit the Spark job using Spark Operator.

spark.hadoop.fs.s3a.access.key <YOURACCESSKEY>
spark.hadoop.fs.s3a.secret.key <YOURSECRETKEY>
For example:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-s3-example
spec:
  sparkConf:
    # ...
    spark.hadoop.fs.s3a.access.key: <YOURACCESSKEY>
    spark.hadoop.fs.s3a.secret.key: <YOURSECRETKEY>
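With these credentials in place, the application can address objects by their s3a:// URI. The following sketch assumes a hypothetical bucket and object layout; substitute your own paths:
spec:
  sparkConf:
    spark.hadoop.fs.s3a.access.key: <YOURACCESSKEY>
    spark.hadoop.fs.s3a.secret.key: <YOURSECRETKEY>
  # Both the application JAR and the job input can be referenced
  # with s3a:// URIs once the S3A credentials are configured.
  mainApplicationFile: "s3a://<your-bucket>/jobs/spark-s3-example.jar"
  arguments:
    - "s3a://<your-bucket>/data/input.csv"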

Adding S3A Credentials Using a Kubernetes Secret

A Secret is an object that contains sensitive data, such as a password, a token, or a key. See Secrets.

Your Spark job can access data stored in an Amazon S3 bucket by using a Kubernetes Secret.
Creating a Secret
Create a Kubernetes Secret with Base64-encoded values for AWS_ACCESS_KEY_ID (username) and AWS_SECRET_ACCESS_KEY (password).
For example, run kubectl apply -f on the following YAML:
apiVersion: v1
kind: Secret
metadata:
  name: <K8s-secret-name-for-S3>
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <Base64-encoded value; example: dXNlcg==>
  AWS_SECRET_ACCESS_KEY: <Base64-encoded value; example: cGFzc3dvcmQ=>
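To produce the Base64-encoded values, encode each literal yourself, or let kubectl do it for you. A sketch using hypothetical values:
# Encode values manually (matches the examples above):
echo -n 'user' | base64        # dXNlcg==
echo -n 'password' | base64    # cGFzc3dvcmQ=

# Or create the same Secret in one step; kubectl Base64-encodes
# --from-literal values automatically:
kubectl create secret generic <K8s-secret-name-for-S3> \
  --from-literal=AWS_ACCESS_KEY_ID=<YOURACCESSKEY> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<YOURSECRETKEY>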
Configuring Spark Applications with a Secret
You can configure Spark applications with a Secret either manually in the YAML, or by adding the Secret during the Configure Spark Applications step in the HPE Ezmeral Runtime Enterprise new UI.
Using YAML: Configure the secretRef property in the envFrom section for both the driver and the executor in the Spark application. Set the name option to the name of the Kubernetes Secret.
driver:
  coreLimit: "1000m"
  cores: 1
  labels:
    version: 2.4.7
  envFrom:
    - secretRef:
        name: <K8s-secret-name-for-S3>
executor:
  cores: 1
  coreLimit: "1000m"
  instances: 2
  memory: "512m"
  envFrom:
    - secretRef:
        name: <K8s-secret-name-for-S3>
Using the HPE Ezmeral Runtime Enterprise new UI: Enter a Secret when you select S3 as the Source during the Configure Spark Applications step of creating a Spark application. This automatically adds the secretRef option to the YAML. See Creating Spark Applications.
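With either method, the envFrom entry injects the Secret's keys into the driver and executor pods as the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, which the S3A client's default credential chain can read. If you want to pin this behavior explicitly, the following is a sketch, assuming the S3A connector is backed by the AWS SDK v1:
sparkConf:
  # Restrict S3A to the provider that reads AWS_ACCESS_KEY_ID and
  # AWS_SECRET_ACCESS_KEY from the environment. In recent Hadoop
  # releases this provider is already part of the default chain,
  # so this setting is optional.
  spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.EnvironmentVariableCredentialsProvider"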

Additional Configuration Options in SSL Environment

To access Amazon S3 buckets over SSL, in addition to the previous configurations, you must also add the following configuration options to the sparkConf section of the Spark application configuration.
spark.driver.extraJavaOptions: "-Dcom.amazonaws.sdk.disableCertChecking=true"
spark.executor.extraJavaOptions: "-Dcom.amazonaws.sdk.disableCertChecking=true"
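Note that disableCertChecking turns off TLS certificate validation in the AWS SDK, so it is appropriate only when the S3 endpoint presents a certificate that cannot be verified (for example, a self-signed certificate). In the SparkApplication YAML, the combined sparkConf section would look like this sketch:
sparkConf:
  spark.hadoop.fs.s3a.access.key: <YOURACCESSKEY>
  spark.hadoop.fs.s3a.secret.key: <YOURSECRETKEY>
  # Disable certificate validation in the driver and executor JVMs.
  spark.driver.extraJavaOptions: "-Dcom.amazonaws.sdk.disableCertChecking=true"
  spark.executor.extraJavaOptions: "-Dcom.amazonaws.sdk.disableCertChecking=true"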
If you are using the HPE Ezmeral Runtime Enterprise new UI, add these configuration options by clicking Edit YAML in the Review step, or by selecting Edit YAML from the Actions menu on the Spark Applications screen. See Managing Spark Applications.