HPE Ezmeral Data Fabric Database Binary Connector for Apache Spark Integration with Spark Streaming
Spark Streaming is a micro-batching, stream-processing framework built on top of Spark. HBase APIs and Spark Streaming make great companions. When used alongside Spark Streaming, HBase APIs can serve as:
- A place to grab reference data or profile data on the fly.
- A place to store counts or aggregates in a way that supports the Spark Streaming promise of exactly-once processing.
The Spark Streaming integration points of the HPE Ezmeral Data Fabric Database Binary Connector for Apache Spark are similar to its regular Spark integration points. You can use the following commands directly on a Spark Streaming DStream:
- bulkPut: Enables massively parallel sending of puts to HBase APIs.
- bulkDelete: Enables massively parallel sending of deletes to HBase APIs.
- bulkGet: Enables massively parallel sending of gets to HBase APIs to create a new RDD.
- mapPartition: Enables the Spark Map function with a Connection object to allow full access to HBase APIs.
- hBaseRDD: Simplifies a distributed scan to create an RDD.
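The other commands follow the same pattern as the bulkPut example below. As a minimal sketch (not taken from the source), assuming the hbaseContext and tableName from that example plus a hypothetical DStream[Array[Byte]] of row keys named rowKeyDStream, hbaseBulkDelete and hbaseBulkGet can be invoked as follows; imports for Delete, Get, and Result from org.apache.hadoop.hbase.client and Bytes from org.apache.hadoop.hbase.util are assumed:

// Delete each row key in the stream; 4 is the batch size per round trip.
rowKeyDStream.hbaseBulkDelete(
  hbaseContext,
  TableName.valueOf(tableName),
  (deleteRecord) => new Delete(deleteRecord),
  4)

// Get each row key in the stream and convert each Result to a String,
// producing a new DStream[String]; 2 is the batch size per round trip.
val getDStream = rowKeyDStream.hbaseBulkGet[String](
  hbaseContext,
  TableName.valueOf(tableName),
  2,
  (record) => new Get(record),
  (result: Result) => Bytes.toString(result.getRow))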
bulkPut Example with DStreams
The following example shows a bulkPut with DStreams. It is similar to the RDD bulkPut.

NOTE: To invoke the hbaseBulkPut method, make sure you import the HBaseDStreamFunctions class:

import org.apache.hadoop.hbase.spark.HBaseDStreamFunctions._
// Additional imports needed to make the snippet self-contained.
import scala.collection.mutable
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

val sc = new SparkContext("local", "test")
val config = new HBaseConfiguration()
val hbaseContext = new HBaseContext(sc, config)
val ssc = new StreamingContext(sc, Milliseconds(200))

val rdd1 = ...
val rdd2 = ...

// Queue of RDDs whose records are (rowKey, Array((columnFamily, qualifier, value)))
// tuples of byte arrays.
val queue = mutable.Queue[RDD[(Array[Byte],
  Array[(Array[Byte], Array[Byte], Array[Byte])])]]()
queue += rdd1
queue += rdd2
val dStream = ssc.queueStream(queue)

// tableName is a String defined elsewhere in the test suite.
dStream.hbaseBulkPut(
  hbaseContext,
  TableName.valueOf(tableName),
  (putRecord) => {
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) =>
      put.addColumn(putValue._1, putValue._2, putValue._3))
    put
  })
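As with any Spark Streaming job, nothing executes until the streaming context is started. A minimal sketch of driving the stream, using the ssc from the example above (the timeout value is illustrative):

ssc.start()                            // start processing the queued RDDs as micro-batches
ssc.awaitTerminationOrTimeout(10000)   // block until the job stops or 10 seconds elapse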
The hbaseBulkPut function has three inputs:
- The hbaseContext, which carries the broadcast configuration information and the links to the HBase Connections in the executors.
- The name of the table you are putting data into.
- A function that converts a record in the DStream into an HBase Put object.

The code snippet above has been extracted from https://github.com/mapr/hbase/blob/1.1.8-mapr-1703/hbase-spark/src/test/scala/org/apache/hadoop/hbase/spark/HBaseDStreamFunctionsSuite.scala.
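For reference, each record in the queued RDDs (rdd1 and rdd2 above) is a (rowKey, Array((columnFamily, qualifier, value))) tuple of byte arrays. A minimal sketch of building one such record follows; the row key, column family, qualifier, and value are illustrative, not taken from the test suite:

import org.apache.hadoop.hbase.util.Bytes

// One stream record: row key "1" with a single cell in column family "c",
// qualifier "a", value "foo". The shape matches the queue's RDD element type.
val record: (Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])]) =
  (Bytes.toBytes("1"),
    Array((Bytes.toBytes("c"), Bytes.toBytes("a"), Bytes.toBytes("foo"))))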