Delta Lake with Apache Spark
This section describes Delta Lake, which provides ACID transactions for Apache Spark 3.x.x on HPE Ezmeral Runtime Enterprise.
Delta Lake is an open-source storage layer that supports ACID transactions to provide reliability, consistency, and scalability to Apache Spark applications. Delta Lake runs on top of the existing storage and is compatible with Apache Spark APIs. For more details, see the Delta Lake documentation.
ACID Transactions with Delta Lake
ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID transactions for Spark applications are supported out of the box with Delta Lake on HPE Ezmeral Runtime Enterprise.
You can use any Apache Spark APIs to read and write data with Delta Lake. Delta Lake stores the data as versioned Parquet files. Delta Lake has a well-defined open protocol called the Delta Transaction Protocol that provides ACID transactions to Apache Spark applications.
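For example, a basic write and read through the DataFrame API looks like the following minimal sketch; the bucket name and table path are hypothetical, and it assumes the S3A and Delta Lake configurations described later in this section are already set on the job:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-write-read").getOrCreate()

    # Hypothetical S3A path to a Delta table backed by MinIO.
    table_path = "s3a://my-bucket/events"

    # Write a small DataFrame as a Delta table (Parquet data files plus _delta_log).
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.format("delta").mode("overwrite").save(table_path)

    # Read the table back with the same DataFrame API.
    spark.read.format("delta").load(table_path).show()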
Delta Lake stores the commits of every successful transaction (Spark job) as a DeltaLog or a Delta Lake transaction log.
For example, you can view these commit logs in the MinIO Browser by navigating to /<table_name>/_delta_log/:

    /<table_name>/_delta_log/00000000000000000000.json
    /<table_name>/_delta_log/00000000000000000001.json
    /<table_name>/_delta_log/00000000000000000002.json
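Because every commit is versioned, you can also read a table as of an earlier version. The following minimal sketch assumes the same hypothetical table path as the example above and at least one earlier committed version:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

    # Read the table as of its first commit (version 0) using Delta time travel.
    # The path is the same hypothetical path used in the earlier sketch.
    old_df = (
        spark.read.format("delta")
        .option("versionAsOf", 0)
        .load("s3a://my-bucket/events")
    )
    old_df.show()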
Delta Lake uses optimistic concurrency control to provide ACID transactions between write operations. See Concurrency Control.
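When two jobs write to the same table concurrently, one of them can fail with a concurrent-modification error and can usually just be retried. The retry loop below is a minimal, hedged sketch rather than an HPE-provided pattern; the string check on the error message is a simplification:

    import time

    def write_with_retry(df, path, retries=3):
        # Retry a Delta append that may fail under optimistic concurrency control.
        for attempt in range(retries):
            try:
                df.write.format("delta").mode("append").save(path)
                return
            except Exception as err:
                # Conflicting commits typically surface as a Concurrent...Exception
                # in the underlying error message; this check is a simplification.
                if "Concurrent" in str(err) and attempt < retries - 1:
                    time.sleep(2 ** attempt)  # back off before retrying
                    continue
                raise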
Examples of Enabling Delta Lake for Apache Spark 3.x.x
- Add the following Delta Lake jar to the deps option of the spec property of the Spark job:

    deps:
      jars:
        - local:///tmp/delta-core_2.12-1.0.0.jar
  NOTE: In HPE Ezmeral Runtime Enterprise, the Delta Lake jar is included in the Apache Spark image.

- Add the following security configurations to the sparkConf option of the spec property of the Spark job:

    "spark.hadoop.fs.s3a.endpoint": #minio endpoint
    "spark.hadoop.fs.s3a.access.key": #minio access key
    "spark.hadoop.fs.s3a.secret.key": #minio secret key
    "spark.hadoop.fs.s3a.impl": org.apache.hadoop.fs.s3a.S3AFileSystem
  NOTE: If you are using a Python API, you must add the following additional configurations to the sparkConf option of the spec property of the Spark job (see the Python sketch after this list):

    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
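For reference, the same settings can also be applied when building a SparkSession directly in Python code. The following is a minimal sketch rather than product-provided code; the application name, MinIO endpoint, and credentials are hypothetical placeholders:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("delta-lake-example")
        # Delta Lake SQL extension and catalog, as required for the Python API.
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        # S3A settings for MinIO; the endpoint and keys below are placeholders.
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.local:9000")
        .config("spark.hadoop.fs.s3a.access.key", "minio-access-key")
        .config("spark.hadoop.fs.s3a.secret.key", "minio-secret-key")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .getOrCreate()
    )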
For examples of Delta Lake-enabled Spark jobs in HPE Ezmeral Runtime Enterprise 5.6, see Delta Lake Spark Examples.
To locate Delta Lake examples for other versions of HPE Ezmeral Runtime Enterprise, navigate to the release branch of your choice at the Spark on K8s GitHub location and find the examples in the examples folder.
Learn more about supported Spark versions at Interoperability Matrix for Spark.