Apache Spark: RDD vs. DataFrame vs. Dataset

Author: Joseph Macwan

Developers love good APIs because they make the work easy and seamless. Apache Spark has long appealed to developers with easy-to-use APIs for operating on very large datasets across its supported languages: Scala, Java, Python, and R.

Here, we will explore three sets of APIs (RDDs, DataFrames, and Datasets) that are available in Apache Spark 2.2 and later. This article is a guide to which of the three APIs to use, when to use it, and why.

Note that the DataFrame and Dataset APIs were unified in Apache Spark 2.0.

  • Resilient Distributed Dataset – The RDD has been the primary user-facing API in Apache Spark since its inception. At its core, an RDD is an immutable, distributed collection of data elements, partitioned across the nodes of a cluster, that can be operated on in parallel through a low-level API of transformations and actions. RDDs are a good fit when:
  1. You need low-level transformations, actions, and control over the dataset.
  2. The data is unstructured, such as media streams or streams of text.
  3. You want to manipulate data with functional programming constructs rather than domain-specific expressions.
  4. You do not need to impose a schema, such as a columnar format, while processing data or accessing attributes by name or column.
  5. You can forgo the optimization and performance benefits that DataFrames and Datasets offer for structured and semi-structured data.
  • DataFrames – Like an RDD, a DataFrame is an immutable, distributed collection of data. Unlike an RDD, its data is organized into named columns, like a table in a relational database. It is designed to make processing large datasets even easier: developers can impose a schema on their data, which gives them a higher-level abstraction. It also provides a domain-specific language (DSL) API for operating on distributed data, and it makes Spark accessible to a wider audience beyond data scientists and engineers.
  • Datasets – Starting with Apache Spark 2.0, the Dataset takes on two distinct API characteristics: an untyped API and a strongly typed API. Conceptually, a DataFrame can be considered an alias for Dataset[Row], a collection of generic objects, where a Row is a generic untyped JVM object. By contrast, a Dataset is a collection of strongly typed JVM objects, dictated by a case class you define in Scala or a class in Java. The benefits of the Dataset API include static typing (with type errors caught at compile time rather than at run time), a high-level abstraction and custom view of structured and semi-structured data, ease of use through APIs with structure, and performance and optimization gains.
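
To make the contrast between the first two APIs concrete, here is a minimal PySpark sketch that computes the same per-department average twice: once in the RDD style (positional tuples and lambdas) and once in the DataFrame style (named columns and a declarative aggregation). The sample data and the names parse_line, add_pairs, avg_salary_rdd, avg_salary_df, and run_demo are hypothetical and chosen only for illustration. The strongly typed Dataset API is available only in Scala and Java, so it is not shown here.

```python
# Sketch only: the sample lines and all function names are hypothetical.
# PySpark must be installed before run_demo() is called.

SAMPLE_LINES = ["eng,100", "eng,200", "ops,90"]


def parse_line(line):
    """Turn 'dept,salary' text into a (dept, salary) tuple."""
    dept, salary = line.split(",")
    return dept.strip(), float(salary)


def add_pairs(a, b):
    """Combine two (sum, count) pairs; passed to reduceByKey below."""
    return a[0] + b[0], a[1] + b[1]


def avg_salary_rdd(spark, lines):
    """RDD style: no schema, fields addressed by position, logic in lambdas."""
    rdd = spark.sparkContext.parallelize(lines).map(parse_line)
    totals = rdd.mapValues(lambda s: (s, 1)).reduceByKey(add_pairs)
    return dict(totals.mapValues(lambda t: t[0] / t[1]).collect())


def avg_salary_df(spark, lines):
    """DataFrame style: named columns and a declarative aggregation."""
    from pyspark.sql import functions as F

    records = [parse_line(line) for line in lines]
    df = spark.createDataFrame(records, ["dept", "salary"])
    rows = df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).collect()
    return {row["dept"]: row["avg_salary"] for row in rows}


def run_demo():
    # Imported here so the pure-Python helpers above work without Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]").appName("rdd-vs-df").getOrCreate()
    )
    try:
        print(avg_salary_rdd(spark, SAMPLE_LINES))
        print(avg_salary_df(spark, SAMPLE_LINES))
    finally:
        spark.stop()
```

Calling run_demo() on a machine with PySpark installed should produce the same averages from both functions (150.0 for eng and 90.0 for ops). The RDD version has to carry (sum, count) pairs by hand, while the DataFrame version states the aggregation by column name and leaves the execution plan to Spark.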

When to use DataFrames or Datasets

  1. If you want rich semantics, high-level abstractions, and domain-specific APIs, use either one.
  2. If your processing calls for high-level expressions, filters, maps, aggregations, averages, sums, SQL queries, or lambda functions on semi-structured data, use either one.
  3. If you want unified, simplified APIs across Spark libraries, use either one.
  4. If you use R, go for DataFrames.
  5. If you are a Python user, use DataFrames and fall back to RDDs if you need more control.
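
As an illustration of the filters, aggregations, and SQL queries mentioned in the list above, the following PySpark sketch expresses one query in two ways: with the DataFrame domain-specific API and as SQL against a temporary view. The sample rows, the "employees" view name, and the names build_query, totals_with_dsl, totals_with_sql, and run_demo are assumptions made for this example, not part of any Spark API.

```python
# Sketch only: the sample rows, the "employees" view name, and all function
# names are hypothetical. PySpark must be installed before run_demo() is called.

EMPLOYEES = [
    ("Ana", "eng", 120.0),
    ("Ben", "eng", 80.0),
    ("Cy", "ops", 150.0),
]


def build_query(min_salary):
    """SQL text: total and average salary per dept at or above a threshold."""
    # float() keeps the spliced value numeric; do not splice untrusted text into SQL.
    threshold = float(min_salary)
    return (
        "SELECT dept, SUM(salary) AS total, AVG(salary) AS average "
        "FROM employees "
        f"WHERE salary >= {threshold} "
        "GROUP BY dept ORDER BY dept"
    )


def totals_with_dsl(df, min_salary):
    """The query written with the DataFrame domain-specific API."""
    from pyspark.sql import functions as F

    return (
        df.filter(F.col("salary") >= min_salary)
        .groupBy("dept")
        .agg(F.sum("salary").alias("total"), F.avg("salary").alias("average"))
        .orderBy("dept")
    )


def totals_with_sql(spark, df, min_salary):
    """The same query written as SQL against a temporary view."""
    df.createOrReplaceTempView("employees")
    return spark.sql(build_query(min_salary))


def run_demo():
    # Imported here so build_query() can be used without a Spark installation.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]").appName("dsl-vs-sql").getOrCreate()
    )
    try:
        df = spark.createDataFrame(EMPLOYEES, ["name", "dept", "salary"])
        totals_with_dsl(df, 100).show()
        totals_with_sql(spark, df, 100).show()
    finally:
        spark.stop()
```

Both forms describe what to compute rather than how to compute it, so the choice between them is largely a matter of taste. With PySpark installed, run_demo() prints the two result tables so they can be compared.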

To wrap up, your choice of RDD, DataFrame, or Dataset will be obvious once your requirements are clear. RDDs offer low-level functionality and finer control, while DataFrames and Datasets offer a custom view and structure, high-level and domain-specific operations, and faster processing and execution.