Accessing Data on Amazon S3 Using Livy
This topic describes how to configure Amazon S3 access for Livy on HPE Ezmeral Runtime Enterprise. To access S3 storage through Livy, you must configure your Amazon S3 credentials.
You can configure your S3 access credentials in the following ways:
- Configuring access to Amazon S3 for all Livy sessions created by a Livy instance.

  To configure S3 credentials for tenants, add the following configuration options to the `spark-defaults.conf` section under the `extraConfigs` section of the `values.yaml` file of the Helm chart in a tenant namespace:

  ```yaml
  extraConfigs:
    spark-defaults.conf: |
      spark.hadoop.fs.s3a.access.key <access-key>
      spark.hadoop.fs.s3a.secret.key <secret-key>
      spark.hadoop.fs.s3a.path.style.access true
  ```
  The sensitive data provided in the `extraConfigs` section is added to the `spark-defaults.conf` file using a Kubernetes secret. The secret has a key-value format, where the key is the `spark-defaults.conf` file name and the value is the sensitive data.

  You must also add the following properties to the `spark-defaults.conf` section of the `values.yaml` file in a tenant namespace:

  ```yaml
  extraConfigs:
    spark-defaults.conf: |
      # Environment variables here would be replaced by their values
      # ...
      spark.driver.extraJavaOptions -Dcom.amazonaws.sdk.disableCertChecking=true
      spark.executor.extraJavaOptions -Dcom.amazonaws.sdk.disableCertChecking=true
  ```
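  The key-value secret format described above can be sketched in Python. This is an illustrative snippet, not the platform's own implementation; the secret name `livy-spark-defaults` and namespace `my-tenant` are hypothetical placeholders. It builds a Kubernetes Secret manifest whose single data key is the `spark-defaults.conf` file name and whose value is the (base64-encoded) sensitive configuration:

  ```python
  import base64
  import json

  # Sensitive Spark configuration that becomes the secret value.
  # The access and secret keys below are placeholders, not real credentials.
  spark_defaults = "\n".join([
      "spark.hadoop.fs.s3a.access.key <access-key>",
      "spark.hadoop.fs.s3a.secret.key <secret-key>",
      "spark.hadoop.fs.s3a.path.style.access true",
  ])

  # Kubernetes Secret in key-value form: the key is the spark-defaults.conf
  # file name, the value is the base64-encoded sensitive data.
  secret = {
      "apiVersion": "v1",
      "kind": "Secret",
      "metadata": {"name": "livy-spark-defaults", "namespace": "my-tenant"},
      "type": "Opaque",
      "data": {
          "spark-defaults.conf": base64.b64encode(spark_defaults.encode()).decode(),
      },
  }

  print(json.dumps(secret, indent=2))
  ```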
- Configuring access to Amazon S3 for a specific Livy session.

  Add the following configuration options when creating the Livy session:

  ```
  spark.hadoop.fs.s3a.access.key <YOURACCESSKEY>
  spark.hadoop.fs.s3a.secret.key <YOURSECRETKEY>
  ```

  For example, to configure a Livy session to access S3 storage using the REST API:

  ```shell
  curl \
    -k \
    -s \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
      "kind": "spark",
      "conf": {
        "spark.hadoop.fs.s3a.access.key": "<YOURACCESSKEY>",
        "spark.hadoop.fs.s3a.secret.key": "<YOURSECRETKEY>"
      }
    }' \
    -u username:password \
    https://hcp-lb1.qa.lab:10075/sessions | jq
  ```
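  The same session-creation request can be issued from Python. The sketch below is illustrative and uses only the standard library; the host `hcp-lb1.qa.lab:10075` is taken from the curl example, and the credentials are placeholders. It builds the JSON payload and prepares a `POST` to the Livy `/sessions` endpoint:

  ```python
  import json
  import ssl
  import urllib.request

  # Session-creation payload mirroring the curl example above;
  # the access and secret keys are placeholders, not real credentials.
  payload = {
      "kind": "spark",
      "conf": {
          "spark.hadoop.fs.s3a.access.key": "<YOURACCESSKEY>",
          "spark.hadoop.fs.s3a.secret.key": "<YOURSECRETKEY>",
      },
  }
  body = json.dumps(payload).encode()

  # Livy gateway URL from the example; replace with your own host and port.
  req = urllib.request.Request(
      "https://hcp-lb1.qa.lab:10075/sessions",
      data=body,
      headers={"Content-Type": "application/json"},
      method="POST",
  )

  # Equivalent of curl -k: skip certificate verification (not for production).
  ctx = ssl._create_unverified_context()

  # Basic authentication (matching -u username:password) would also be needed,
  # e.g. via urllib.request.HTTPBasicAuthHandler. Uncomment to send the request:
  # resp = urllib.request.urlopen(req, context=ctx)
  # print(json.loads(resp.read()))
  ```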
- Configuring access to Amazon S3 at runtime.

  Set the options on `spark.sparkContext.hadoopConfiguration` at runtime and then submit the Spark jobs. For example:

  ```scala
  val hadoopConf = spark.sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3a.access.key", "<YOURACCESSKEY>")
  hadoopConf.set("fs.s3a.secret.key", "<YOURSECRETKEY>")
  hadoopConf.set("fs.s3a.path.style.access", "true")

  val path = "s3a://bucket/path/to/dest/"
  val data = Seq(
    ("banana", "yellow"),
    ("orange", "orange"),
    ("tomato", "red"),
    ("potato", "white"),
    ("plum", "purple")
  )
  val df = data.toDF

  println(s"Writing DataFrame to $path")
  df.write.parquet(path)
  println("Write complete")

  println(s"Reading DataFrame from $path")
  spark.read.parquet(path).show()
  println("Read complete")
  ```
  The output of the submitted code block is as follows:

  ```
  hadoopConf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, org.apache.hadoop.conf.CoreDefaultProperties, core-site.xml, mapred-default.xml, org.apache.hadoop.mapreduce.conf.MapReduceDefaultProperties, mapred-site.xml, yarn-default.xml, org.apache.hadoop.yarn.conf.YarnDefaultProperties, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml, file:/opt/mapr/spark/spark-2.4.7/conf/hive-site.xml
  path: String = s3a://bucket/path/to/dest/
  data: Seq[(String, String)] = List((banana,yellow), (orange,orange), (tomato,red), (potato,white), (plum,purple))
  df: org.apache.spark.sql.DataFrame = [_1: string, _2: string]
  Writing DataFrame to s3a://bucket/path/to/dest/
  Write complete
  Reading DataFrame from s3a://bucket/path/to/dest/
  +------+------+
  |    _1|    _2|
  +------+------+
  |tomato|   red|
  |potato| white|
  |  plum|purple|
  |banana|yellow|
  |orange|orange|
  +------+------+
  Read complete
  ```