Configure Spark 2.2.1 and later to Consume HPE Ezmeral Data Fabric Streams Messages
Using the Kafka 0.10 API, you can configure a Spark application to query HPE Ezmeral Data Fabric Streams for new messages at a given interval. This information applies to Spark 2.2.1 and later.
Procedure
- Install the Data Fabric core Kafka package, if you have not already done so.
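For example, on a node that installs packages with yum, this step might look like the following; the package name mapr-kafka is an assumption, so confirm the package name for your release:
# Assumed package name for the Data Fabric core Kafka package
yum install mapr-kafka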
- Copy the Kafka client jar into the Spark jars directory, as shown below:
cp /opt/mapr/lib/kafka-clients-<version>.jar $SPARK_HOME/jars
- Add the following dependency (a sample sbt entry follows the note below):
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = <spark_version>-mapr-<mapr_eco_version>
NOTE: If you want to use the Streaming Producer examples, you must add the appropriate Spark streaming Kafka producer jar from the MapR Maven repository to the Spark classpath (/opt/mapr/spark/spark-<spark_version>/jars/).
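For reference, a sample sbt entry for this dependency might look like the following; the resolver URL is assumed to be the standard MapR Maven repository, and the version placeholders match the Maven coordinates above:
// Sketch of an sbt build entry (placeholders as in the coordinates above)
resolvers += "mapr-releases" at "https://repository.mapr.com/maven/"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "<spark_version>-mapr-<mapr_eco_version>"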
- Consider the following when you write the Spark application:
Example: https://github.com/mapr/spark/blob/3.5.1.0-eep-930/examples/src/main/scala/org/apache/spark/examples/streaming/V010DirectKafkaWordCount.scala is a sample consumer program.
The KafkaUtils.createDirectStream method creates an input stream that reads HPE Ezmeral Data Fabric Streams messages. The ConsumerStrategies.Subscribe method creates the consumerStrategy, which limits the set of topics the stream subscribes to; this set is derived from the topics parameter passed into the program. Using LocationStrategies.PreferConsistent distributes partitions evenly across the available executors.
val consumerStrategy = ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
val messages = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent, consumerStrategy)
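Putting these pieces together, a minimal self-contained consumer might look like the sketch below. The object name, batch interval, and kafkaParams values are illustrative assumptions; only the createDirectStream pattern is taken from the example program above.
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object StreamsConsumerSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: a comma-separated topic list and a consumer group id
    val Array(topics, groupId) = args

    val sparkConf = new SparkConf().setAppName("StreamsConsumerSketch")
    // Assumed 2-second batch interval; tune for your workload
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Minimal consumer configuration; add further properties as your application requires
    val kafkaParams = Map[String, Object](
      ConsumerConfig.GROUP_ID_CONFIG -> groupId,
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])

    // Subscribe only to the topics passed into the program
    val topicsSet = topics.split(",").toSet
    val consumerStrategy = ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
    val messages = KafkaUtils.createDirectStream[String, String](
      ssc, LocationStrategies.PreferConsistent, consumerStrategy)

    // Print the message values from each batch
    messages.map(_.value).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
You could submit this sketch with spark-submit, passing the topic list and a consumer group id as arguments; for HPE Ezmeral Data Fabric Streams, topic names take the /stream-path:topic-name form.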