Apache Spark Feature Support
HPE Ezmeral Data Fabric supports most Apache Spark features. However, there are some exceptions.
- GPU Aware Scheduling Support on Spark
- Starting from EEP 9.0.0, you can use the RAPIDS Accelerator for Apache Spark by NVIDIA to accelerate Spark processing with GPUs.
To use the RAPIDS Accelerator on HPE Ezmeral Data Fabric:
- Follow the setup instructions in the official RAPIDS documentation: Getting Started (link opens an external site in a new browser tab or window).
- Set the Apache Spark version with the following option:
      spark.rapids.shims-provider.override
  The value for this option must be the name of a corresponding shim. You can find a list of available shims here.
  For example, to use the RAPIDS plugin with Spark version 3.3.2.100-eep-912, you can set the version to 332 as follows:
      spark.rapids.shims-provider.override=com.nvidia.spark.rapids.shims.spark332.SparkShimServiceProvider
For examples, limitations, and a full list of configuration details for RAPIDS, see RAPIDS.
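The same shim override can also be applied when the SparkSession is constructed. The following Scala sketch is illustrative only: it assumes the RAPIDS Accelerator jar is already on the driver and executor classpath (for example, through --jars), the application name is hypothetical, and the spark.plugins and spark.rapids.sql.enabled settings are the standard RAPIDS Accelerator options.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: register the RAPIDS SQL plugin and pin the shim for a
    // 3.3.2.100-eep-912 build of Spark. Assumes the RAPIDS jar is already on the classpath.
    val spark = SparkSession.builder()
      .appName("rapids-shim-override-example") // hypothetical application name
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
      .config("spark.rapids.sql.enabled", "true")
      .config("spark.rapids.shims-provider.override",
        "com.nvidia.spark.rapids.shims.spark332.SparkShimServiceProvider")
      .getOrCreate()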
- Delta Lake Support on Spark
Starting from EEP 8.x.x, Apache Spark 3 provides Delta Lake support on HPE Ezmeral Data Fabric.
Delta Lake is an open-source storage layer that supports ACID (Atomicity, Consistency, Isolation, and Durability) transactions to provide reliability, consistency, and scalability to Apache Spark applications. Delta Lake runs on top of existing storage and is compatible with Apache Spark APIs. For more details, see the Delta Lake documentation.
You can use any Apache Spark APIs to read and write data with Delta Lake. Delta Lake stores the data in Parquet format as versioned Parquet files. Delta Lake has a well-defined open protocol called Delta Transaction Protocol that provides ACID transactions to Apache Spark applications.
To enable Delta Lake:
- Download the Delta Lake library from the Maven repository.
- Add the Delta Lake library and set the following configuration options. For example:
      /opt/mapr/spark/spark-3.1.2/bin/spark-shell --jars ~/delta-core_2.12-1.0.0.jar \
        --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
        --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
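Once spark-shell starts with these options, Delta tables can be written and read through the regular DataFrame API. The following Scala sketch assumes that session; the path /tmp/delta-example and the table contents are hypothetical.

    // Write a small DataFrame as a Delta table; each successful write adds a
    // JSON commit under /tmp/delta-example/_delta_log/.
    val data = spark.range(0, 5).toDF("id")
    data.write.format("delta").mode("overwrite").save("/tmp/delta-example")

    // Read the Delta table back with the same API.
    spark.read.format("delta").load("/tmp/delta-example").show()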
Delta Lake stores the commits of every successful transaction (Spark job) as a DeltaLog or a Delta Lake transaction log.
For example, you can view these commit logs in the MinIO Browser by navigating to /<table_name>/_delta_log/.
Commits in the transaction log:
    /<table_name>/_delta_log/00000000000000000000.json
    /<table_name>/_delta_log/00000000000000000001.json
    /<table_name>/_delta_log/00000000000000000003.json
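Because every commit is versioned, an earlier state of a table can be read back with Delta Lake's versionAsOf read option. A minimal sketch in Scala, assuming the hypothetical /tmp/delta-example table above already has more than one committed version:

    // Load the table as it looked at version 0; choose any version number that
    // has a matching JSON commit in the table's _delta_log directory.
    val firstVersion = spark.read
      .format("delta")
      .option("versionAsOf", 0)
      .load("/tmp/delta-example")
    firstVersion.show()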
Delta Lake uses optimistic concurrency control to provide ACID transactions between write operations. See Concurrency Control.
To accelerate data lake operations, use the optimizations provided by Delta Lake. The Z-Ordering method co-locates related information in the same set of files. Delta Lake automatically maintains the minimum and maximum values for each column in a Delta table and stores these values as part of the metadata. Delta Lake uses this co-location of related information in its data-skipping algorithm, which optimizes performance by reducing the amount of data that Apache Spark needs to read. To learn more, see Optimizations.
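As a rough illustration of Z-Ordering, the sketch below issues the OPTIMIZE command through Spark SQL. This assumes a Delta Lake release that ships the OPTIMIZE command (open-source Delta Lake 2.0.0 or later) and reuses the hypothetical table path from the earlier examples; check the Delta Lake documentation for the versions available in your EEP.

    // Compact the table's files and cluster them on the id column so data skipping
    // can prune files for queries that filter on id. Requires a Delta Lake release
    // that includes the OPTIMIZE command.
    spark.sql("OPTIMIZE delta.`/tmp/delta-example` ZORDER BY (id)")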
See Setup Apache Spark with Delta Lake and Advanced Dependency Management to start using Delta Lake.
- Spark SQL and Apache Derby Support on Spark
- If you are using Spark SQL with a Derby database without Hive or a Hive Metastore installation, you will see the following exception:
      java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
  To use Spark SQL with a Derby database without Hive or a Hive Metastore installation, add the hive-service-2.3.*.jar and log4j2 jars to the /opt/mapr/spark/spark-3.x.x/jars directory. The log4j2 jars are located at /opt/mapr/lib/log4j2/log4j-*.jar. Spark 3.1.2 and Spark 3.2.0 do not support log4j 1.2 logging on HPE Ezmeral Data Fabric.
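After the jars are in place, a quick way to confirm that Spark SQL works against the embedded Derby metastore is to run a simple statement from spark-shell. A minimal sketch in Scala; the table name is hypothetical:

    // Creates a table in the Derby-backed metastore and lists it back; if the
    // SessionHiveMetaStoreClient exception above no longer appears, the jars were picked up.
    spark.sql("CREATE TABLE IF NOT EXISTS derby_check (id INT, name STRING)")
    spark.sql("SHOW TABLES").show()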
- Spark Thrift JDBC/ODBC Server Support
- Running the Spark Thrift JDBC/ODBC Server on a secure cluster is supported only on Spark 2.1.0 or later.
- Spark SQL and Hive Support for Spark 2.1.0
- Spark 2.1.0 is able to connect to Hive 2.1 Metastore; however, only features of Hive 1.2 are supported.
- Spark SQL and Hive Support for Spark 2.0.1
- Spark SQL is supported, but it is not fully compatible with Hive. For details, see the
Apache Spark documentation.
The following Hive functions are not supported in Spark SQL:
- Tables with buckets
- UNION type
- Unique join
- Column statistics collecting
- Output formats: File format (for CLI), Hadoop Archive
- Block-level bitmap indexes and virtual columns
- Automatic determination of the number of reducers for JOIN and GROUP BY
- Metadata-only query
- Skew data flag
- STREAMTABLE hint in JOIN
- Merging of multiple small files for query results
- Spark SQL and Hive Support for Spark 1.6.1
- Spark SQL is supported, but it is not fully compatible with Hive. For details, see the
Apache Spark documentation. The following table shows which Spark SQL operations are supported with each Hive 1.2 table format:
| Spark SQL Operation | AVRO | ORC | Parquet | RC  | default |
|---------------------|------|-----|---------|-----|---------|
| create              | Yes  | Yes | Yes     | Yes | Yes     |
| drop                | Yes  | Yes | Yes     | Yes | Yes     |
| insert into         | Yes  | Yes | Yes     | Yes | Yes     |
| insert overwrite    | Yes  | Yes | Yes     | Yes | Yes     |
| select              | Yes  | Yes | Yes     | Yes | Yes     |
| load data           | Yes  | Yes | Yes     | Yes | Yes     |
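As a rough illustration of a few of these operations on an ORC-format Hive table, the Scala sketch below uses the Spark 1.6.1 HiveContext entry point; the table and column names are hypothetical.

    import org.apache.spark.sql.hive.HiveContext

    // spark-shell on Spark 1.6.1 already provides sc (SparkContext).
    val hiveContext = new HiveContext(sc)

    // create
    hiveContext.sql("CREATE TABLE IF NOT EXISTS sales_orc (id INT, amount DOUBLE) STORED AS ORC")

    // insert into, fed from a temporary staging table
    val staging = hiveContext.createDataFrame(Seq((1, 10.5), (2, 20.0))).toDF("id", "amount")
    staging.registerTempTable("sales_staging")
    hiveContext.sql("INSERT INTO TABLE sales_orc SELECT id, amount FROM sales_staging")

    // select
    hiveContext.sql("SELECT * FROM sales_orc").show()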