Delta Lake with Apache Spark

This section describes Delta Lake, which provides ACID transactions for Apache Spark 3.x.x on HPE Ezmeral Runtime Enterprise.

Delta Lake is an open-source storage layer that supports ACID transactions to provide reliability, consistency, and scalability to Apache Spark applications. Delta Lake runs on top of your existing storage and is compatible with the Apache Spark APIs. For more details, see the Delta Lake documentation.

ACID Transactions with Delta Lake

ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID transactions for Spark applications are supported out of the box with Delta Lake on HPE Ezmeral Runtime Enterprise.

You can use any of the Apache Spark APIs to read and write data with Delta Lake. Delta Lake stores the data as versioned Parquet files. Delta Lake has a well-defined open protocol, the Delta Transaction Protocol, that provides ACID transactions to Apache Spark applications.
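For example, the following PySpark sketch writes and reads a small Delta table with the standard DataFrame writer and reader APIs. It assumes the Delta Lake JAR is already on the Spark classpath (as it is in the HPE Ezmeral Runtime Enterprise Spark image); the application name and the /tmp/delta/events path are illustrative.

from pyspark.sql import SparkSession

# Enable Delta Lake support on the session; these are the same two settings
# listed for the Python API in the steps later in this section.
spark = (
    SparkSession.builder
    .appName("delta-read-write-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Each successful write is recorded as a commit in the table's _delta_log directory.
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Read the table back with the standard DataFrame reader.
spark.read.format("delta").load("/tmp/delta/events").show()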

Delta Lake records every successful transaction (Spark job) as a commit in the DeltaLog, also called the Delta Lake transaction log.

For example, you can view these commit logs in the MinIO Browser by navigating to /<table_name>/_delta_log/.

Commits in the transaction log:
/<table_name>/_delta_log/00000000000000000000.json
/<table_name>/_delta_log/00000000000000000001.json
/<table_name>/_delta_log/00000000000000000002.json
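You can also list these commits from a running PySpark session with the DeltaTable history API, as in the sketch below. It assumes the delta-spark Python bindings are available in the image and reuses the spark session from the earlier sketch; the table path is illustrative.

from delta.tables import DeltaTable

# Each row returned by history() corresponds to one commit JSON file under _delta_log/.
delta_table = DeltaTable.forPath(spark, "/tmp/delta/events")
delta_table.history().select("version", "timestamp", "operation").show(truncate=False)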

Delta Lake uses optimistic concurrency control to provide ACID transactions between concurrent write operations. See Concurrency Control.
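As an illustration only, the sketch below retries an append when Delta Lake rejects the transaction because a concurrent writer committed a conflicting change first. It assumes the delta-spark Python bindings expose the delta.exceptions module in your image; the exception choice, retry count, and table path are illustrative.

from delta.exceptions import ConcurrentAppendException

def append_with_retry(df, path, max_attempts=3):
    """Re-attempt an append if a concurrent transaction wins the commit race."""
    for attempt in range(1, max_attempts + 1):
        try:
            df.write.format("delta").mode("append").save(path)
            return
        except ConcurrentAppendException:
            # A concurrent operation added files that conflict with this transaction;
            # Delta rejected the commit, so retry against the updated table version.
            if attempt == max_attempts:
                raise

append_with_retry(spark.range(5, 10), "/tmp/delta/events")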

Examples of Enabling Delta Lake for Apache Spark 3.x.x

Perform the following steps to enable Delta Lake support for Spark applications that use S3 storage:
  1. Add the following Delta Lake JAR to the deps option of the spec property of the Spark job.
    deps:
        jars:
          - local:///tmp/delta-core_2.12-1.0.0.jar
    
    
    NOTE
    In HPE Ezmeral Runtime Enterprise, the Delta Lake JAR is included in the Apache Spark image.
  2. Add the following security configurations to the sparkConf option of the spec property of the Spark job.
    "spark.hadoop.fs.s3a.endpoint": #minio endpoint
    "spark.hadoop.fs.s3a.access.key": #minio access key
    "spark.hadoop.fs.s3a.secret.key": #minio secret key
    "spark.hadoop.fs.s3a.impl": org.apache.hadoop.fs.s3a.S3AFileSystem
    NOTE
    If you are using the Python API, add the following additional configurations to the sparkConf option of the spec property of the Spark job. A combined PySpark sketch that uses these settings follows these steps.
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"

For examples of Delta Lake-enabled Spark jobs in HPE Ezmeral Runtime Enterprise 5.6, see Delta Lake Spark Examples.

To locate Delta Lake examples for other versions of HPE Ezmeral Runtime Enterprise, navigate to the release branch of your choice in the Spark on K8s GitHub repository and find the examples in the examples folder.

Learn more about supported Spark versions in the Interoperability Matrix for Spark.