Hadoop Tutorial

MapReduce vs. Apache Pig vs. Hive

Once big data is loaded into Hadoop, the first question is how to process it. Collecting vast amounts of unstructured data does not help unless there is an effective way to draw meaningful insights from it. There are several compelling options for analyzing the data: Hadoop MapReduce itself, or higher-level components such as Apache Pig and Hive. Each has its own way of processing data, and each works effectively. I have summarized their properties below:

  • MapReduce is a programming model whose jobs are typically written in a compiled language such as Java, whereas Pig uses a high-level scripting language (Pig Latin), and Hive uses a SQL-like query language (HiveQL).

  • Pig and Hive provide a higher level of abstraction, whereas Hadoop MapReduce works at a low level of abstraction.

  • Hadoop MapReduce requires more lines of code than Pig or Hive. Hive usually requires the fewest, because a few lines of SQL-like queries can replace a longer Pig script or MapReduce program.

  • MapReduce requires more development effort than Apache Pig and Hive.

  • Jobs written in Pig and Hive generally run slower than a fully tuned Hadoop MapReduce program.

  • When executing jobs in Pig and Hive, Hadoop developers need not worry about version mismatches.

  • Developers are much less likely to introduce Java-level bugs when coding in Pig or Hive.

  • Apache Pig has problems dealing with unstructured data such as images, video, audio, ambiguously delimited text, and log data.

  • Pig cannot deal with poorly designed XML or JSON, or with flexible schemas.
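
To make the difference in abstraction level more concrete, here is a minimal sketch of the classic word count, written as plain Python that imitates the map, shuffle, and reduce steps on a single machine. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration. The comments at the top show roughly how the same job could look in Pig Latin and HiveQL, assuming a hypothetical input file 'input.txt' and a hypothetical Hive table docs with a string column line.

```python
# Illustrative equivalents (hypothetical file and table names):
#
# Pig Latin:
#   lines   = LOAD 'input.txt' AS (line:chararray);
#   words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
#   grouped = GROUP words BY word;
#   counts  = FOREACH grouped GENERATE group, COUNT(words);
#
# HiveQL:
#   SELECT word, COUNT(*)
#   FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
#   GROUP BY word;

from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, the step a framework
    performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the list of counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["pig hive mapreduce", "hive pig", "pig"]
    print(word_count(sample))
```

Even in this simplified form, the MapReduce-style version has to spell out each phase by hand, while the Pig and Hive versions in the comments state only what result is wanted. That is the trade-off the list above describes: more code and effort at the lower level, in exchange for more control over how the job runs.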