Converting an Apache Spark RDD to an Apache Spark DataFrame
When an API is available only on an Apache Spark RDD and not on an Apache Spark DataFrame, you can operate on the RDD and then convert the result to a DataFrame.
You can convert an RDD to a DataFrame in one of two ways:

- Use the toDF helper function.
- Call createDataFrame on a SparkSession object.
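To see the shape of both conversions side by side before the connector-specific details, here is a minimal plain-Spark sketch. It uses Spark's own toDF implicit on an RDD of tuples rather than the MapR-DB helper, and the sample data, application name, and column names are invented for illustration; the MapR-DB versions follow in the sections below.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("rdd-to-df-sketch").getOrCreate()
import spark.implicits._

// Sample data standing in for documents with _id and first_name fields
val pairs = spark.sparkContext.parallelize(Seq(("u1", "Peter"), ("u2", "Paula")))

// Approach 1: the toDF helper (here, Spark's own implicit on an RDD of tuples)
val df1 = pairs.toDF("_id", "first_name")

// Approach 2: createDataFrame with an RDD[Row] and an explicit schema
val schema = StructType(Seq(
  StructField("_id", StringType, nullable = false),
  StructField("first_name", StringType, nullable = true)))
val df2 = spark.createDataFrame(pairs.map { case (id, name) => Row(id, name) }, schema)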
Using the toDF Helper Function
The toDF method is available through MapRDBTableScanRDD. The following example loads an RDD that filters on first_name equal to "Peter", projects the _id and first_name fields, and then converts the RDD to a DataFrame:
import com.mapr.db.spark.sql._

val df = sc.loadFromMapRDB(<table-name>)
  .where(field("first_name") === "Peter")
  .select("_id", "first_name").toDF()

Using SparkSession.createDataFrame
With this approach, you can convert an RDD[Row] to a DataFrame by calling createDataFrame on a SparkSession object. The API for the call is as follows:
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame

You might need to first convert an RDD[OJAIDocument] to an RDD[Row]. The following example shows how to do this:
val df = sparkSession.createDataFrame(
  rdd.map(doc => MapRDBSpark.docToRow(doc, schema)), schema)

Here, rdd is of type RDD[OJAIDocument]. The docToRow call converts each OJAI document to a Row, so the map produces the RDD[Row] that is then passed to createDataFrame.
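Putting the pieces together, the following is a minimal end-to-end sketch of the createDataFrame path. The schema (a string _id and a string first_name) is an assumption chosen to match the earlier projection, /tables/users is a hypothetical table path, and the com.mapr.db.spark._ import is assumed to provide loadFromMapRDB, field, and MapRDBSpark alongside the com.mapr.db.spark.sql._ import shown earlier.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.mapr.db.spark._       // assumed to provide loadFromMapRDB, field, and MapRDBSpark
import com.mapr.db.spark.sql._

val sparkSession = SparkSession.builder().appName("rdd-to-df").getOrCreate()
val sc = sparkSession.sparkContext

// Explicit schema matching the fields projected below (an assumption for this sketch)
val schema = StructType(Seq(
  StructField("_id", StringType, nullable = false),
  StructField("first_name", StringType, nullable = true)))

// Load and filter the table as an RDD of OJAI documents (hypothetical table path)
val rdd = sc.loadFromMapRDB("/tables/users")
  .where(field("first_name") === "Peter")
  .select("_id", "first_name")

// Map each OJAI document to a Row, then build the DataFrame with the explicit schema
val df = sparkSession.createDataFrame(
  rdd.map(doc => MapRDBSpark.docToRow(doc, schema)), schema)

df.printSchema()

Passing the schema explicitly means Spark performs no inference pass over the data, and the resulting column types are guaranteed to match what downstream DataFrame code expects.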