Accessing Data on Amazon S3 Using Livy
This topic describes how to configure access to data on Amazon S3 using Livy on HPE Ezmeral Runtime Enterprise.
You must configure Amazon S3 credentials before Livy can access the S3 storage.
You can configure your S3 access credentials in the following ways:
- Configuring access to Amazon S3 for all Livy sessions created by a Livy instance.

  To configure S3 credentials for tenants, add the following configuration options to the `spark-defaults.conf` section in the `extraConfigs` section of the `values.yaml` file of the Helm chart in a tenant namespace:

  ```yaml
  extraConfigs:
    spark-defaults.conf: |
      spark.hadoop.fs.s3a.access.key <access-key>
      spark.hadoop.fs.s3a.secret.key <secret-key>
      spark.hadoop.fs.s3a.path.style.access true
  ```

  The sensitive data provided in the `extraConfigs` section is added to the `spark-defaults.conf` file using a Kubernetes secret. The secret has a key-value format, where the key is the `spark-defaults.conf` file name and the value is the sensitive data.

  You must also add the following properties to the `spark-defaults.conf` section of the `values.yaml` file in a tenant namespace:

  ```yaml
  extraConfigs:
    spark-defaults.conf: |
      # Environment variables here would be replaced by their values
      # ...
      spark.driver.extraJavaOptions -Dcom.amazonaws.sdk.disableCertChecking=true
      spark.executor.extraJavaOptions -Dcom.amazonaws.sdk.disableCertChecking=true
  ```

- Configuring access to Amazon S3 for a specific Livy session.

  Add the following configuration options when creating the Livy session:
  ```
  spark.hadoop.fs.s3a.access.key <YOURACCESSKEY>
  spark.hadoop.fs.s3a.secret.key <YOURSECRETKEY>
  ```

  For example: configuring a Livy session to access S3 storage using the REST API.

  ```bash
  curl \
    -k \
    -s \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
      "kind": "spark",
      "conf": {
        "spark.hadoop.fs.s3a.access.key": "<YOURACCESSKEY>",
        "spark.hadoop.fs.s3a.secret.key": "<YOURSECRETKEY>"
      }
    }' \
    -u username:password \
    https://hcp-lb1.qa.lab:10075/sessions | jq
  ```

- Configuring access to Amazon S3 during runtime.
  Set the `spark.sparkContext.hadoopConfiguration` options at runtime and submit the Spark jobs. For example:

  ```scala
  val hadoopConf = spark.sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3a.access.key", "<YOURACCESSKEY>")
  hadoopConf.set("fs.s3a.secret.key", "<YOURSECRETKEY>")
  hadoopConf.set("fs.s3a.path.style.access", "true")

  val path = "s3a://bucket/path/to/dest/"
  val data = Seq(
    ("banana", "yellow"),
    ("orange", "orange"),
    ("tomato", "red"),
    ("potato", "white"),
    ("plum", "purple")
  )
  val df = data.toDF

  println(s"Writing DataFrame to $path")
  df.write.parquet(path)
  println("Write complete")

  println(s"Reading DataFrame from $path")
  spark.read.parquet(path).show()
  println("Read complete")
  ```

  The output of the submitted code block is as follows:

  ```
  hadoopConf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, org.apache.hadoop.conf.CoreDefaultProperties, core-site.xml, mapred-default.xml, org.apache.hadoop.mapreduce.conf.MapReduceDefaultProperties, mapred-site.xml, yarn-default.xml, org.apache.hadoop.yarn.conf.YarnDefaultProperties, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml, file:/opt/mapr/spark/spark-2.4.7/conf/hive-site.xml
  path: String = s3a://bucket/path/to/dest/
  data: Seq[(String, String)] = List((banana,yellow), (orange,orange), (tomato,red), (potato,white), (plum,purple))
  df: org.apache.spark.sql.DataFrame = [_1: string, _2: string]
  Writing DataFrame to s3a://bucket/path/to/dest/
  Write complete
  Reading DataFrame from s3a://bucket/path/to/dest/
  +------+------+
  |    _1|    _2|
  +------+------+
  |tomato|   red|
  |potato| white|
  |  plum|purple|
  |banana|yellow|
  |orange|orange|
  +------+------+
  Read complete
  ```