Optimizing HPE Ezmeral Data Fabric Database Lookups in Spark Jobs
The lookupFromMapRDB() API utilizes the primary and secondary indexes on an HPE Ezmeral Data Fabric Database table to optimize table lookups and outputs the results to an Apache Spark DataFrame.
Note: The lookupFromMapRDB() API functionality requires a patch. The patch works with EEP 6.2.0 (Core 6.1.0, Spark 2.4.0.0) and EEP 6.3.0 (Core 6.1.0, Spark 2.4.4.0). To install patches, see Applying a Patch.

The loadFromMapRDB() API in MapR Database Connectors for Apache Spark is optimized to load massive amounts of data from HPE Ezmeral Data Fabric Database tables with high throughput. In cases where a Spark job needs to look up a small number of documents based on an equality (or short-range) condition on a primary or secondary key, use the lookupFromMapRDB() API instead.
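To illustrate the difference, here is a minimal Scala sketch, assuming a spark-shell session (where spark is predefined) and a table at the hypothetical path /tbl:

import com.mapr.db.spark.sql._

// loadFromMapRDB: full, high-throughput load of an entire table.
val fullLoad = spark.loadFromMapRDB("/tbl")

// lookupFromMapRDB: index-assisted retrieval of a small number of documents.
val lookups = spark.lookupFromMapRDB("/tbl")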
Invoke the lookupFromMapRDB() API when the filter conditions in short-range and equality queries reference primary and secondary keys. If the filter condition references any non-primary keys (fields other than the _id field), a secondary index must exist on those secondary keys. Indexes on the filtering keys are essential to achieving reasonable performance of lookup queries on HPE Ezmeral Data Fabric Database tables.
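For example, the following Scala sketch (the field name mystate and the filter values are hypothetical) contrasts a primary-key lookup, which needs no secondary index, with a lookup on a non-primary key, which performs well only if a secondary index exists on that field:

import com.mapr.db.spark.sql._
import spark.implicits._

val df = spark.lookupFromMapRDB("/tbl")

// _id is the primary key, so no secondary index is needed here.
df.filter($"_id" === "user0001").show()

// mystate is a non-primary key; this lookup is fast only if a
// secondary index exists on mystate.
df.filter($"mystate" === "CA").show()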
The lookupFromMapRDB() API uses the secondary keys in indexes to look up values in the primary table. For example, if a query contains the filter conditions mydate = '2012-03-26' and myid = '120026015', a composite secondary index created on the mydate and myid fields must exist for the query to return results quickly.
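Such a composite index can be created with the maprcli table index add command. The following is a sketch, not taken from this page; the index name idx_mydate_myid is an assumption, and the exact options may vary by release (see the maprcli table index documentation):

maprcli table index add -path /tbl -index idx_mydate_myid -indexedfields mydate,myid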
The following examples invoke the lookupFromMapRDB() API to perform a lookup in an HPE Ezmeral Data Fabric Database table and output the results to an Apache Spark DataFrame.

Scala:

import com.mapr.db.spark.sql._
import spark.implicits._

val df = spark.lookupFromMapRDB("/tbl")
df.filter($"mydate" === "2012-03-26" && $"myid" === "120026015").show()
Java:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import com.mapr.db.spark.sql.api.java.MapRDBJavaSession;

SparkSession sparkSession = SparkSession.builder().getOrCreate();
MapRDBJavaSession mapRDBJavaSession = new MapRDBJavaSession(sparkSession);
Dataset<Row> df2 = mapRDBJavaSession.lookupFromMapRDB("/tbl");
df2.filter("mydate = '2012-03-26' and myid = '120026015'").show();
Python:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.lookupFromMapRDB("/tbl")
df.filter("mydate = '2012-03-26' and myid = '120026015'").show()