Accessing Data on Amazon S3 Using Spark Operator
This topic describes how to access data in an Amazon S3 bucket using the Hadoop S3A client.
Amazon Web Services (AWS) offers Amazon Simple Storage Service (Amazon S3). Amazon S3 provides storage and retrieval of objects through a web service interface.
You can access the data stored in an Amazon S3 bucket from your Spark job by using the Hadoop S3A client. For the full list of Hadoop S3A client configuration options, see Hadoop-AWS module: Integration with Amazon Web Services.
Adding S3A Credentials through YAML
Add the following configuration options to the sparkConf section of the SparkApplication specification, and submit the Spark job using Spark Operator:

spark.hadoop.fs.s3a.access.key: <YOURACCESSKEY>
spark.hadoop.fs.s3a.secret.key: <YOURSECRETKEY>

For example:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-s3-example
spec:
sparkConf:
# ...
spark.hadoop.fs.s3a.access.key: <YOURACCESSKEY>
spark.hadoop.fs.s3a.secret.key: <YOURSECRETKEY>
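With these options set, the application can read and write data through s3a:// URIs. As a sketch, assuming a hypothetical bucket named my-bucket and a Spark image that includes the hadoop-aws module and its AWS SDK dependency, the application file itself can also be pulled from S3:

spec:
  # ...
  mainApplicationFile: "s3a://my-bucket/jars/spark-s3-example.jar"

Note that credentials placed directly in sparkConf are visible to anyone who can read the SparkApplication object; the Secret-based approach described next avoids this.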
Adding S3A Credentials Using a Kubernetes Secret
A Secret is an object that contains sensitive data, such as a password, a token, or a key. See Secrets.
- Creating a Secret: Create the Kubernetes Secret with Base64-encoded values for AWS_ACCESS_KEY_ID (username) and AWS_SECRET_ACCESS_KEY (password). A sketch of this step follows this list.
- Configuring Spark Applications with a Secret: You can configure Spark applications with a Secret manually in YAML (see the sketch after this list), or add the Secret during the Configure Spark Applications step in the HPE Ezmeral Runtime Enterprise new UI.
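A minimal sketch of both steps, assuming a hypothetical Secret named aws-s3-creds in the application's namespace, and an operator version whose driver and executor specs support envFrom (older releases use the deprecated envSecretKeyRefs field instead). The kubectl command Base64-encodes the literal values for you:

kubectl create secret generic aws-s3-creds \
  --from-literal=AWS_ACCESS_KEY_ID=<YOURACCESSKEY> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<YOURSECRETKEY>

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-s3-secret-example
spec:
  sparkConf:
    # Tell S3A to read credentials from the AWS_ACCESS_KEY_ID and
    # AWS_SECRET_ACCESS_KEY environment variables rather than from sparkConf.
    spark.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.EnvironmentVariableCredentialsProvider
  driver:
    envFrom:
      - secretRef:
          name: aws-s3-creds
  executor:
    envFrom:
      - secretRef:
          name: aws-s3-creds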
Additional Configuration Options in an SSL Environment
In an SSL environment, add the following configuration options to the sparkConf section of the Spark application configuration:
spark.driver.extraJavaOptions: "-Dcom.amazonaws.sdk.disableCertChecking=true"
spark.executor.extraJavaOptions: "-Dcom.amazonaws.sdk.disableCertChecking=true"
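For example, in context (a sketch; surrounding fields are abbreviated):

spec:
  sparkConf:
    # Disables certificate validation in the AWS SDK. Use this only in test
    # environments with self-signed certificates, as it weakens security.
    spark.driver.extraJavaOptions: "-Dcom.amazonaws.sdk.disableCertChecking=true"
    spark.executor.extraJavaOptions: "-Dcom.amazonaws.sdk.disableCertChecking=true"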
If you are using the HPE Ezmeral Runtime Enterprise new UI, add these configuration options by clicking Edit YAML in the Review step, or by selecting Edit YAML from the Actions menu on the Spark Applications screen. See Managing Spark Applications.