Accessing Data on Amazon S3 Using Livy

This topic describes how to configure access to data on Amazon S3 using Livy on HPE Ezmeral Runtime Enterprise.

You must configure Amazon S3 credentials to access the S3 storage using Livy.

You can configure your S3 access credentials in the following ways:
  1. Configuring access to Amazon S3 for all the Livy sessions created by the Livy instance.
    To configure S3 credentials for all sessions in a tenant, add the following configuration options to the spark-defaults.conf section within the extraConfigs section of the values.yaml file of the Helm chart in the tenant namespace.
      spark-defaults.conf: |
        spark.hadoop.fs.s3a.access.key <access-key>
        spark.hadoop.fs.s3a.secret.key <secret-key>

    The sensitive data provided in the extraConfigs section is added to the spark-defaults.conf file using a Kubernetes secret. The secret has a key-value format, where the key is the spark-defaults.conf file name and the value is the sensitive data.

    You must also add the following properties to the spark-defaults.conf section of the values.yaml file in the tenant namespace.
      spark-defaults.conf: |
        # Environment variables here are replaced by their values
        # ...
        spark.driver.extraJavaOptions -Dcom.amazonaws.sdk.disableCertChecking=true
        spark.executor.extraJavaOptions -Dcom.amazonaws.sdk.disableCertChecking=true
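    Putting the pieces together, the relevant portion of the tenant's values.yaml might look like the following sketch. The exact nesting around extraConfigs depends on your Helm chart version, so treat the surrounding structure as an assumption:

    ```yaml
    # Sketch only: the placement of extraConfigs within values.yaml
    # may differ between Helm chart versions.
    extraConfigs:
      spark-defaults.conf: |
        spark.hadoop.fs.s3a.access.key <access-key>
        spark.hadoop.fs.s3a.secret.key <secret-key>
        spark.driver.extraJavaOptions -Dcom.amazonaws.sdk.disableCertChecking=true
        spark.executor.extraJavaOptions -Dcom.amazonaws.sdk.disableCertChecking=true
    ```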
  2. Configuring access to Amazon S3 for a specific Livy session.
    To configure S3 access for a single session, add the following configuration options when creating the session.
    spark.hadoop.fs.s3a.access.key <YOURACCESSKEY>
    spark.hadoop.fs.s3a.secret.key <YOURSECRETKEY>
    For example, to configure a Livy session to access S3 storage using the REST API:
    curl \
        -k \
        -s \
        -X POST \
        -H "Content-Type: application/json" \
        -d '{
            "kind": "spark",
            "conf": {
                "spark.hadoop.fs.s3a.access.key": "<YOURACCESSKEY>",
                "spark.hadoop.fs.s3a.secret.key": "<YOURSECRETKEY>"
            }
        }' \
        -u username:password \
        "https://<livy-url>/sessions" | jq
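    If you script session creation rather than typing the curl command, the request body can be assembled programmatically. The following is a minimal Python sketch; the Livy endpoint, authentication, and TLS settings are environment-specific and omitted here, and build_session_payload is a hypothetical helper, not part of any Livy client library:

    ```python
    import json

    def build_session_payload(access_key: str, secret_key: str, kind: str = "spark") -> str:
        """Build the JSON body for Livy's POST /sessions request.

        The conf keys mirror the spark.hadoop.fs.s3a.* options shown above.
        """
        payload = {
            "kind": kind,
            "conf": {
                "spark.hadoop.fs.s3a.access.key": access_key,
                "spark.hadoop.fs.s3a.secret.key": secret_key,
            },
        }
        return json.dumps(payload)

    # Placeholders, matching the convention used in this topic
    body = build_session_payload("<YOURACCESSKEY>", "<YOURSECRETKEY>")
    print(body)
    ```

    POST this body to the Livy sessions endpoint (for example, https://<livy-url>/sessions) with your usual HTTP client and credentials.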
  3. Configuring access to Amazon S3 during runtime.

    Set the required options on spark.sparkContext.hadoopConfiguration at runtime before submitting the Spark jobs.

    For example:
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", "<YOURACCESSKEY>")
    hadoopConf.set("fs.s3a.secret.key", "<YOURSECRETKEY>")

    val path = "s3a://bucket/path/to/dest/"
    val data = Seq(
        ("banana", "yellow"),
        ("orange", "orange"),
        ("tomato", "red"),
        ("potato", "white"),
        ("plum", "purple"))

    // toDF requires the Spark session implicits
    import spark.implicits._
    val df = data.toDF

    println(s"Writing DataFrame to $path")
    // Parquet is used here for illustration; any supported format works
    df.write.mode("overwrite").parquet(path)
    println("Write complete")

    println(s"Reading DataFrame from $path")
    spark.read.parquet(path).show()
    println("Read complete")
    The output of the submitted code block is as follows:
    hadoopConf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, org.apache.hadoop.conf.CoreDefaultProperties, core-site.xml, mapred-default.xml, org.apache.hadoop.mapreduce.conf.MapReduceDefaultProperties, mapred-site.xml, yarn-default.xml, org.apache.hadoop.yarn.conf.YarnDefaultProperties, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml, file:/opt/mapr/spark/spark-2.4.7/conf/hive-site.xml
    path: String = s3a://bucket/path/to/dest/
    data: Seq[(String, String)] = List((banana,yellow), (orange,orange), (tomato,red), (potato,white), (plum,purple))
    df: org.apache.spark.sql.DataFrame = [_1: string, _2: string]
    Writing DataFrame to s3a://bucket/path/to/dest/
    Write complete
    Reading DataFrame from s3a://bucket/path/to/dest/
    +------+------+
    |    _1|    _2|
    +------+------+
    |banana|yellow|
    |orange|orange|
    |tomato|   red|
    |potato| white|
    |  plum|purple|
    +------+------+
    Read complete