Converting an Apache Spark RDD to an Apache Spark DataFrame
When an API is available only on an Apache Spark RDD and not on an Apache Spark DataFrame, you can operate on the RDD and then convert it to a DataFrame.
You can convert an RDD to a DataFrame in one of two ways:
- Use the toDF helper function.
- Convert the RDD to a DataFrame using the createDataFrame call on a SparkSession object.
Using the toDF Helper Function
The toDF method is available through MapRDBTableScanRDD. The following example loads an RDD that filters on first_name equal to "Peter" and projects the _id and first_name fields, and then converts the RDD to a DataFrame:
import com.mapr.db.spark._       // provides loadFromMapRDB and field
import com.mapr.db.spark.sql._   // provides toDF on the scan RDD

val df = sc.loadFromMapRDB(<table-name>)
  .where(field("first_name") === "Peter")
  .select("_id", "first_name")
  .toDF()
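Continuing the example, the result is an ordinary DataFrame, so the standard DataFrame API applies; the output below depends on the contents of your table:

df.printSchema()   // two columns: _id and first_name
df.show()          // displays the rows that matched the filter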
Using SparkSession.createDataFrame
With this approach, you can convert an RDD[Row] to a DataFrame by calling createDataFrame on a SparkSession object. The API for the call is as follows:
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
You might need to first convert an RDD[OJAIDocument] to an RDD[Row]. The following example shows how to do this:
import com.mapr.db.spark.MapRDBSpark

val df = sparkSession.createDataFrame(
  rdd.map(doc => MapRDBSpark.docToRow(doc, schema)), schema)
In this example, rdd is of type RDD[OJAIDocument]. The docToRow call converts rdd to an RDD[Row], which is then passed to createDataFrame.
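Putting the pieces together, the following is a minimal end-to-end sketch of this approach. The table path /tmp/user_profiles and the two-field schema are assumptions for illustration only; replace both with values that match your table:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.mapr.db.spark._
import com.mapr.db.spark.MapRDBSpark

val sparkSession = SparkSession.builder().appName("rdd-to-df").getOrCreate()
val sc = sparkSession.sparkContext

// Hypothetical schema describing the projected fields
val schema = StructType(Seq(
  StructField("_id", StringType, nullable = false),
  StructField("first_name", StringType, nullable = true)))

// Load the table as an RDD[OJAIDocument], projecting only the schema fields
val rdd = sc.loadFromMapRDB("/tmp/user_profiles")
  .select("_id", "first_name")

// Convert each OJAI document to a Row, then build the DataFrame
val df = sparkSession.createDataFrame(
  rdd.map(doc => MapRDBSpark.docToRow(doc, schema)), schema)
df.show()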