Pentaho doubles down on Hadoop and Spark

Author: Dhrumit Shukla

Pentaho is a business intelligence (BI) software company that offers Pentaho Business Analytics, a suite of open source products providing data integration, reporting, OLAP services, dashboarding, ETL capabilities and data mining. The company was founded in 2004 by five founders and is headquartered in Orlando, Florida, in the United States. It was acquired by Hitachi Data Systems in 2015.

WHY PENTAHO

Pentaho has a shorter integration time and lower infrastructure cost than many other business intelligence tools on the market, and it processes data quickly. A large support community is available around the clock, along with several support forums. The platform scales easily and can handle very large volumes of data. It supports a practically unlimited range of visualizations and data sources and can work with any data type, so companies can bring any amount of new or existing data to it. All core engines are stand-alone open projects, each with its own development plan and community. Moreover, its tool set has broad applicability beyond the base product.

Today, Pentaho is doubling down by adding improved support for several key technologies from the Hadoop ecosystem to its well-known data preparation system.

SPARK

First on the list is Spark, which Pentaho Data Integration (PDI) users can now use to execute SQL queries. The addition lets business analysts apply their existing SQL skills to interact with the data-crunching engine rather than having to learn its more complex native execution model. In the same spirit, today’s update also aims to ease management by extending PDI’s orchestration capabilities to more components of the analytics framework. Organizations can now use Pentaho to control Spark’s SQL, stream processing and machine learning modules in addition to their own custom applications. The functionality lends itself to a wide range of use cases.
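To illustrate why SQL access lowers the barrier for analysts, the minimal sketch below computes the same per-customer total twice: once as a declarative SQL query and once as hand-written grouping logic. It is not Spark or PDI code. Python's built-in sqlite3 module serves only as a stand-in for a SQL endpoint, and the `transactions` table, its columns and the sample rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, amount) pairs.
ROWS = [
    ("alice", 120.0),
    ("bob", 35.5),
    ("alice", 80.0),
    ("carol", 10.0),
    ("bob", 64.5),
]


def totals_with_sql(rows):
    """Declarative version: describe the result and let the engine plan it."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real SQL endpoint
    try:
        conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT customer, SUM(amount) FROM transactions "
            "GROUP BY customer ORDER BY customer"
        )
        return cursor.fetchall()
    finally:
        conn.close()


def totals_by_hand(rows):
    """Imperative version: spell out the grouping and summing steps."""
    sums = defaultdict(float)
    for customer, amount in rows:
        sums[customer] += amount
    return sorted(sums.items())


if __name__ == "__main__":
    print(totals_with_sql(ROWS))
    print(totals_by_hand(ROWS))
```

Both functions return the same totals. The SQL version only states what is wanted, which is the skill analysts typically already have, while the second version has to encode how the grouping is carried out.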

For example, a bank could use PDI to run a fraud detection algorithm on Spark, feed customer data from its Hadoop cluster to the model for processing and then push the results to a third system where analysts can review them. The aim is to flatten the learning curve for users, which is also the main motivation behind the other new integrations being rolled out.
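The shape of such an orchestrated job can be sketched in a few lines of plain Python. This is only an illustration of the source, model and sink pattern described above, not PDI or Spark code: every function name is hypothetical, the in-memory list stands in for the Hadoop cluster, and a simple amount threshold stands in for a real fraud detection model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Transaction:
    customer: str
    amount: float


def read_from_cluster() -> List[Transaction]:
    """Hypothetical source step; a real job would read from Hadoop."""
    return [
        Transaction("alice", 42.0),
        Transaction("bob", 9800.0),
        Transaction("carol", 15.5),
    ]


def flag_suspicious(
    txns: Iterable[Transaction], threshold: float = 5000.0
) -> List[Transaction]:
    """Toy stand-in for a fraud model: flag amounts above a threshold."""
    return [t for t in txns if t.amount > threshold]


def run_pipeline(
    source: Callable[[], List[Transaction]],
    model: Callable[[Iterable[Transaction]], List[Transaction]],
    sink: Callable[[Transaction], None],
) -> int:
    """Orchestrate the three steps and return how many records were flagged."""
    flagged = model(source())
    for txn in flagged:
        sink(txn)  # hand each result to the downstream review system
    return len(flagged)


if __name__ == "__main__":
    review_queue: List[Transaction] = []  # stand-in for the third system
    count = run_pipeline(read_from_cluster, flag_suspicious, review_queue.append)
    print(count, review_queue)
```

Keeping the three steps as separate callables mirrors the idea of an orchestration layer: each stage can be swapped out without touching the others.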

Spark is a powerful open source processing engine built for speed, ease of use and machine learning. It was engineered for performance and is a next-generation Big Data technology used for storing, blending and governing data with new levels of speed, simplicity and scalability. Pentaho was able to innovate early with this emerging technology because the platform is built on modern open source foundations.

For two years, Pentaho experimented with prospective use scenarios based on its big data blueprints and sized Spark’s enterprise market opportunity. Customers benefit from that work through real-time analytics and simplified capabilities.

HADOOP

Big data technologies are evolving at a rapid pace, and the team at Pentaho Labs continues to drive innovation in analytics and integration so that users can deploy big data with less risk.

One of Pentaho’s great passions is empowering organizations to use innovations in Big Data to solve new challenges with the skill sets they already have. Pentaho Labs’ prototyping and innovation efforts around natively integrating data engineering and analytics with Big Data platforms such as Hadoop and Storm have already led dozens of customers to deploy next-generation Big Data solutions. Examples include optimizing data warehousing architectures, using Hadoop as a cost-effective data refinery and running advanced analytics on diverse data sources to achieve a broader, 360-degree view of customers.

Not since the early days of Hadoop has there been so much excitement around a new Big Data technology as there is now with Spark. Spark is a Hadoop-compatible computing system that makes big data analysis dramatically faster through in-memory computation and simpler to write through easy-to-use Java APIs.

PENTAHO IS INTEGRATED WITH HADOOP AT NUMEROUS LEVELS

Pentaho is integrated with Hadoop at many levels, including:

  1. Traditional ETL
  2. Data Orchestration
  3. Pentaho MapReduce Execution
  4. Traditional Reporting
  5. Web-based interactive reporting
  6. Pentaho Analyzer

ENHANCEMENTS TO PDI’s CORE DATA PROCESSING

Pentaho has also enhanced PDI’s core data processing capabilities. The platform now allows analysts to execute transformations against their data while an analytics workflow runs, instead of having to hard-code the operations beforehand. According to the company, the functionality could make the process up to ten times more efficient than before.

EXPERIMENT TODAY WITH PENTAHO AND SPARK

You can experiment with Pentaho and Spark today for both ETL and reporting. The following use cases apply when combining the two:

  • Reading data from Spark as part of an ETL workflow, using Pentaho Data Integration’s Table Input step with Apache Shark.
  • Reporting on Spark data using Pentaho Reporting against Apache Shark.

With Pentaho doubling down on Hadoop and Spark, organizations can experiment with the combination and apply it wherever these use cases fit.