Apache Spark: RDD vs. DataFrame vs. Dataset
Developers love APIs because good APIs make their work easier. Apache Spark has long appealed to developers with its easy-to-use APIs for operating on very large datasets across its supported languages: Scala, Java, Python, and R.
Here, we explore three sets of APIs (RDDs, DataFrames, and Datasets) that are available in Apache Spark 2.2 and later. This article is a guide to what each API offers and when to use it.
The DataFrame and Dataset APIs were unified in Apache Spark 2.0.
- Resilient Distributed Dataset (RDD) – The RDD has been the primary user-facing API in Apache Spark since its inception. An RDD is an immutable, distributed collection of data elements, partitioned across the nodes of a cluster, that can be operated on in parallel through a low-level API of transformations and actions. RDDs are useful when:
  - You need low-level transformations, actions, and control over the dataset.
  - The data is unstructured, such as media streams or streams of text.
  - You want to manipulate data with functional programming constructs rather than domain-specific expressions.
  - You do not need to impose a schema, such as a columnar format, when processing data or accessing data attributes by name or column.
  - You can forgo the optimization and performance benefits that DataFrames and Datasets provide for structured and semi-structured data.
- DataFrames – Like an RDD, a DataFrame is an immutable distributed collection of data, but unlike an RDD, its data is organized into named columns, like a table in a relational database. It is designed to make processing large datasets easier: developers can impose a schema on their data, which allows a higher-level abstraction. It also provides a domain-specific language API for operating on the distributed data and makes Spark accessible to a wider audience beyond data scientists and engineers.
- Datasets – Starting with Apache Spark 2.0, the Dataset API takes on two distinct characteristics: an untyped API and a strongly-typed API. Conceptually, a DataFrame can be considered an alias for Dataset[Row], a collection of generic objects, where "Row" is a generic untyped JVM object. In contrast, a Dataset is a collection of strongly-typed JVM objects defined by a case class in Scala or a class in Java. Benefits of the Dataset API include static typing (syntax and analysis errors are caught at compile time rather than at runtime), a high-level abstraction and custom view of structured and semi-structured data, easy-to-use APIs that work with structure, and performance and optimization gains.
When to use DataFrames or Datasets
- Use either when you want rich semantics, high-level abstractions, and domain-specific APIs.
- Use either if your processing calls for high-level expressions, filters, maps, aggregations, averages, sums, SQL queries, or lambda functions on semi-structured data.
- Use either if you want unified and simplified APIs across Spark libraries.
- If you use R, use DataFrames.
- If you use Python, use DataFrames and fall back to RDDs if you need more control.
To wrap up, your choice among RDD, DataFrame, and Dataset should become clear once your requirements are. RDDs offer low-level functionality and finer control, while DataFrames and Datasets offer custom views and structure, high-level and domain-specific operations, and faster processing and execution.