MapReduce can be easy

Contributed by Khurshidali Shaikh on 22 Mar 2013

MapReduce is a programming model for processing large data sets on top of the Hadoop framework. MapReduce jobs can be written in Java, or in other languages (Ruby, Python, Perl, etc.) using the Hadoop Streaming API.
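To make the model concrete, here is a minimal word-count sketch in Python. It only simulates the map, sort and reduce phases in a single process; the function names (mapper, reducer, run_local) are illustrative and not part of any Hadoop API. In a real streaming job the mapper and reducer would typically be separate scripts that read lines from standard input and write tab-separated key/value lines to standard output.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(pairs):
    """Reduce phase: sum the counts of each word.

    Expects pairs sorted by key, which mirrors what the framework's
    shuffle-and-sort step provides between the map and reduce phases.
    """
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort -> reduce in one process (no Hadoop involved)."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    sample = ["Hadoop runs MapReduce", "MapReduce runs on Hadoop"]
    for word, count in sorted(run_local(sample).items()):
        print(f"{word}\t{count}")
```

Even this toy job needs three separate pieces of logic for what is conceptually a single "group and count" operation.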

Although Hadoop is a favorite framework for processing large data sets, anyone who has written or seen a MapReduce program will quickly agree that such programs are tedious to write. In a real-life use case, you often have to write many MapReduce programs and chain them together to meet the application requirements.

What this means is:

1. Writing MapReduce programs needs specialized skills and a lot of effort to model, develop and maintain.

2. Data analysts and similar users, who may be very good at working with data warehouses using SQL and related tools, cannot easily interact with the Hadoop system.

3. Many organizations have also invested in good BI tools that work with RDBMSs but not with Hadoop, although some of these tools have developed connectors for working with Hadoop.

These challenges are well understood by many players in the Hadoop and big data industry, and a few have tried to solve them. This has given rise to some useful tools and frameworks that allow users to interact with Hadoop using SQL, SQL-like languages and other high-level abstractions in place of MapReduce. Below are some popular examples.

Hive

Hive is a data warehouse system built on top of Hadoop that allows easy summarization and ad hoc querying of data stored in Hadoop using an SQL-like language called HiveQL. Hive also allows traditional MapReduce jobs to be run in cases where that makes more sense than HiveQL scripts.
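To illustrate why a SQL-like layer is attractive, the sketch below expresses a word count as a single declarative GROUP BY query. It uses Python's built-in sqlite3 module purely as a local stand-in so the example can run anywhere; it does not connect to Hive, and the table name (words) and function name (word_counts) are assumptions made for illustration. The SELECT statement itself is plain SQL of the kind HiveQL also accepts.

```python
import sqlite3


def word_counts(words):
    """Count words with one declarative GROUP BY query (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany(
        "INSERT INTO words (word) VALUES (?)", [(w,) for w in words]
    )
    # The whole "job" is this one query; the engine decides how to run it.
    query = """
        SELECT word, COUNT(*) AS n
        FROM words
        GROUP BY word
        ORDER BY n DESC, word
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for word, n in word_counts(["hadoop", "hive", "hadoop", "pig"]):
        print(f"{word}\t{n}")
```

With a SQL-like engine on top of Hadoop, the user writes only the query and the engine takes care of planning and running the underlying work.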

Cloudera Impala

Cloudera Impala provides fast, interactive query capabilities on top of data stored in HDFS. The metadata, SQL syntax and user interface are the same as those of Apache Hive.

Pivotal HD from Greenplum

Pivotal HD is Greenplum’s distribution of Hadoop. One of the highlights of this distribution is HAWQ, a SQL-compliant database engine that runs on top of Hadoop. It claims to deliver a high-performance, “True SQL” database on top of Hadoop. During the launch, a live demo was given in which Tableau was used to query data in the Hadoop system using SQL, thereby hiding the underlying MapReduce complexity from the user.

Pig

Apache Pig is a framework for building data analysis applications over Hadoop using a high-level language called Pig Latin. Pig Latin is not truly SQL, but it adopts some constructs from SQL, such as querying, grouping and sorting, which SQL users may find easy to adapt to. Pig Latin scripts are compiled into MapReduce jobs, which makes the approach very productive. At Yahoo, a major portion of the Hadoop system is accessed using Pig.

Cascading Lingual

Lingual is another framework that executes ANSI SQL queries on top of Apache Hadoop. Lingual is built on top of Cascading, a Java API for creating complex data processing jobs.

In the future, we can expect to see more such frameworks and Hadoop distributions that try to provide SQL or a similar mechanism for querying data stored in Hadoop.


Tags: Bigdata, cascading, cloudera, greenplum, hadoop, hive, impala, mapreduce, pig, pivotalhd
