Accessing Data on Amazon S3 Using Livy
This topic describes how to configure Amazon S3 access for Livy on HPE Ezmeral Runtime Enterprise. To access S3 storage through Livy, you must configure your Amazon S3 credentials.
You can configure your S3 access credentials in the following ways:
- Configuring access to Amazon S3 for all Livy sessions created by a Livy instance.

  To configure S3 credentials for tenants, add the following configuration options to the `spark-defaults.conf` section under the `extraConfigs` section of the `values.yaml` file of the Helm chart in a tenant namespace:

  ```yaml
  extraConfigs:
    spark-defaults.conf: |
      spark.hadoop.fs.s3a.access.key <access-key>
      spark.hadoop.fs.s3a.secret.key <secret-key>
      spark.hadoop.fs.s3a.path.style.access true
  ```
  The sensitive data provided in the `extraConfigs` section is added to the `spark-defaults.conf` file using a Kubernetes secret. The secret has a key-value format, where the key is the `spark-defaults.conf` file name and the value is the sensitive data.

  You must also add the following properties to the `spark-defaults.conf` section of the `values.yaml` file in a tenant namespace:

  ```yaml
  extraConfigs:
    spark-defaults.conf: |
      # Environment variables here would be replaced by their values
      # ...
      spark.driver.extraJavaOptions -Dcom.amazonaws.sdk.disableCertChecking=true
      spark.executor.extraJavaOptions -Dcom.amazonaws.sdk.disableCertChecking=true
  ```
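  The key-value secret format described above can be sketched in Python. This is an illustrative snippet, not the platform's own implementation; the secret name `livy-spark-defaults` and namespace `my-tenant` are hypothetical placeholders. It builds a Kubernetes Secret manifest whose single data key is the `spark-defaults.conf` file name and whose value is the (base64-encoded) sensitive configuration:

  ```python
  import base64
  import json

  # Sensitive Spark configuration that becomes the secret value.
  # The access and secret keys below are placeholders, not real credentials.
  spark_defaults = "\n".join([
      "spark.hadoop.fs.s3a.access.key <access-key>",
      "spark.hadoop.fs.s3a.secret.key <secret-key>",
      "spark.hadoop.fs.s3a.path.style.access true",
  ])

  # Kubernetes Secret in key-value form: the key is the spark-defaults.conf
  # file name, the value is the base64-encoded sensitive data.
  secret = {
      "apiVersion": "v1",
      "kind": "Secret",
      "metadata": {"name": "livy-spark-defaults", "namespace": "my-tenant"},
      "type": "Opaque",
      "data": {
          "spark-defaults.conf": base64.b64encode(spark_defaults.encode()).decode(),
      },
  }

  print(json.dumps(secret, indent=2))
  ```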
- Configuring access to Amazon S3 for a specific Livy session.

  Add the following configuration options when creating the Livy session:

  ```
  spark.hadoop.fs.s3a.access.key <YOURACCESSKEY>
  spark.hadoop.fs.s3a.secret.key <YOURSECRETKEY>
  ```

  For example, to configure a Livy session to access S3 storage using the REST API:

  ```shell
  curl \
    -k \
    -s \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
      "kind": "spark",
      "conf": {
        "spark.hadoop.fs.s3a.access.key": "<YOURACCESSKEY>",
        "spark.hadoop.fs.s3a.secret.key": "<YOURSECRETKEY>"
      }
    }' \
    -u username:password \
    https://hcp-lb1.qa.lab:10075/sessions | jq
  ```
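  The same session-creation request can be issued from Python. The sketch below is illustrative and uses only the standard library; the host `hcp-lb1.qa.lab:10075` is taken from the curl example, and the credentials are placeholders. It builds the JSON payload and prepares a `POST` to the Livy `/sessions` endpoint:

  ```python
  import json
  import ssl
  import urllib.request

  # Session-creation payload mirroring the curl example above;
  # the access and secret keys are placeholders, not real credentials.
  payload = {
      "kind": "spark",
      "conf": {
          "spark.hadoop.fs.s3a.access.key": "<YOURACCESSKEY>",
          "spark.hadoop.fs.s3a.secret.key": "<YOURSECRETKEY>",
      },
  }
  body = json.dumps(payload).encode()

  # Livy gateway URL from the example; replace with your own host and port.
  req = urllib.request.Request(
      "https://hcp-lb1.qa.lab:10075/sessions",
      data=body,
      headers={"Content-Type": "application/json"},
      method="POST",
  )

  # Equivalent of curl -k: skip certificate verification (not for production).
  ctx = ssl._create_unverified_context()

  # Basic authentication (matching -u username:password) would also be needed,
  # e.g. via urllib.request.HTTPBasicAuthHandler. Uncomment to send the request:
  # resp = urllib.request.urlopen(req, context=ctx)
  # print(json.loads(resp.read()))
  ```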
- Configuring access to Amazon S3 at runtime.

  Set the options on `spark.sparkContext.hadoopConfiguration` at runtime and then submit the Spark jobs. For example:

  ```scala
  val hadoopConf = spark.sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3a.access.key", "<YOURACCESSKEY>")
  hadoopConf.set("fs.s3a.secret.key", "<YOURSECRETKEY>")
  hadoopConf.set("fs.s3a.path.style.access", "true")

  val path = "s3a://bucket/path/to/dest/"
  val data = Seq(
    ("banana", "yellow"),
    ("orange", "orange"),
    ("tomato", "red"),
    ("potato", "white"),
    ("plum", "purple")
  )
  val df = data.toDF

  println(s"Writing DataFrame to $path")
  df.write.parquet(path)
  println("Write complete")

  println(s"Reading DataFrame from $path")
  spark.read.parquet(path).show()
  println("Read complete")
  ```
  The output of the submitted code block is as follows:

  ```
  hadoopConf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, org.apache.hadoop.conf.CoreDefaultProperties, core-site.xml, mapred-default.xml, org.apache.hadoop.mapreduce.conf.MapReduceDefaultProperties, mapred-site.xml, yarn-default.xml, org.apache.hadoop.yarn.conf.YarnDefaultProperties, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml, file:/opt/mapr/spark/spark-2.4.7/conf/hive-site.xml
  path: String = s3a://bucket/path/to/dest/
  data: Seq[(String, String)] = List((banana,yellow), (orange,orange), (tomato,red), (potato,white), (plum,purple))
  df: org.apache.spark.sql.DataFrame = [_1: string, _2: string]
  Writing DataFrame to s3a://bucket/path/to/dest/
  Write complete
  Reading DataFrame from s3a://bucket/path/to/dest/
  +------+------+
  |    _1|    _2|
  +------+------+
  |tomato|   red|
  |potato| white|
  |  plum|purple|
  |banana|yellow|
  |orange|orange|
  +------+------+
  Read complete
  ```