
Hadoop Tutorial

Hadoop Ecosystem

The Hadoop ecosystem is a platform or framework that helps solve big data problems. It includes various components and services (for ingesting, processing, storing, and maintaining data). Most of the utilities in the Hadoop ecosystem complement Hadoop's core components: HDFS, YARN, and MapReduce.

                                               Fig: Hadoop Eco-System

Apache Pig:

Pig is a high-level programming language for analyzing large data sets, usually in a Hadoop environment. Pig began as a development effort at Yahoo!. In the MapReduce framework, programs need to be translated into a series of Map and Reduce phases, which is not a programming model that data analysts are familiar with. To bridge this gap, an abstraction called Pig was built on top of Hadoop. Pig includes Pig Latin, a high-level scripting language, and the Pig runtime translates Pig Latin scripts into MapReduce jobs that run on YARN and process data stored in HDFS.

  • To analyze data, programmers write Pig scripts using the Pig Latin language. The Pig engine accepts Pig Latin scripts as input and internally converts them into Map and Reduce jobs.

  • Apache Pig enables us to focus more on analyzing huge data sets and spend less time writing Map-Reduce programs. The Pig language is designed to work on any kind of data; that is why it is named Pig!

  • Pig makes development faster: its scripts are short and very easy to program compared to raw MapReduce code.

  • The Pig compiler internally converts Pig Latin scripts into a sequence of MapReduce jobs; that sequence is hidden from the programmer, which is the abstraction Pig provides.

  • It provides a platform for building data-flow pipelines for ETL (Extract, Transform, and Load) and for processing and analyzing huge data sets.
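
As a minimal sketch of this workflow (assuming Pig is installed and on the PATH, and that a tab-separated file named student_data.txt exists locally; the file name and schema are made up for illustration), the snippet below writes a tiny Pig Latin script from Python and runs it in Pig's local mode:

```python
import subprocess

# A tiny Pig Latin script: load a tab-separated file (the default delimiter),
# keep the rows with a high GPA, and print them.
# The file name and schema are hypothetical.
pig_script = """
students = LOAD 'student_data.txt' AS (name:chararray, city:chararray, gpa:float);
toppers  = FILTER students BY gpa >= 3.5;
DUMP toppers;
"""

with open("toppers.pig", "w") as f:
    f.write(pig_script)

# Run the script with the Pig CLI in local mode;
# "-x mapreduce" would submit it to the cluster instead.
subprocess.run(["pig", "-x", "local", "toppers.pig"], check=True)
```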

Hive:

Facebook created Hive for people who are fluent in SQL, so Hive makes them feel at home while working in the Hadoop ecosystem. Hive is a data warehousing component that performs reading, writing, and managing of large data sets in a distributed environment using an SQL-like interface. The language of Hive is called Hive Query Language (Hive + SQL = HQL), which is very similar to SQL.

  • It supports all basic data types of SQL.

  • Its primary components are the JDBC/ODBC driver and the Hive command line.

  • HQL commands are executed in the Hive command-line interface; they can also be submitted from client programs (a sketch follows this list).

  • Hive is highly scalable. It can serve both interactive (near real-time) query processing and large-scale batch processing.

  • We can use predefined functions or write tailored user-defined functions (UDFs) to accomplish specific needs.
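
As a minimal sketch of submitting HQL from a client program (assuming a HiveServer2 instance on localhost:10000 and the third-party PyHive package; the table name web_logs is hypothetical):

```python
from pyhive import hive  # third-party package: PyHive

# Connect to a HiveServer2 instance (host, port, and database are assumptions).
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HQL looks very much like SQL; 'web_logs' is a hypothetical table.
cursor.execute("SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status")
for status, hits in cursor.fetchall():
    print(status, hits)

cursor.close()
conn.close()
```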

Apache Drill:

Apache Drill is a schema-free SQL query engine that works on top of Hadoop, NoSQL databases, and cloud storage. Its primary purpose is large-scale processing of data with low latency. It is a distributed query processing engine: we can query petabytes of data with Drill, and it can scale to several thousands of nodes. It supports a variety of data stores, such as HBase, MongoDB, Azure Blob Storage, Google Cloud Storage, Amazon S3, and so on.
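
As a rough sketch, Drill can be queried over its REST interface (assuming a Drill instance with its web UI on localhost:8047; the queried file path is hypothetical):

```python
import requests

# Submit a SQL query to Drill's REST endpoint.
# The host/port and the queried file path are assumptions for this sketch.
resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL",
          "query": "SELECT * FROM dfs.`/data/orders.json` LIMIT 5"},
)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)
```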

Apache HBase:

HBase is a NoSQL database that supports all kinds of data and is thus capable of handling almost anything inside a Hadoop ecosystem. It provides capabilities similar to Google's BigTable, enabling us to work effectively on big data sets.

On occasions when a small piece of data needs to be looked up and retrieved from a very large database, the query must be handled within a short period of time. HBase is handy at such times, as it gives us a fault-tolerant way of storing sparse data.

Example: Suppose we have millions of customer emails, and we have to find out the number of customers who are interested in Hadoop. HBase was designed to solve these kinds of problems.
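
A minimal sketch of that kind of lookup, using the third-party happybase client against an HBase Thrift gateway (the host, table name, and column family are assumptions, and the table is assumed to already exist):

```python
import happybase  # third-party client that talks to the HBase Thrift server

# Connect to the HBase Thrift gateway (the host is an assumption for this sketch).
connection = happybase.Connection("localhost")
table = connection.table("customer_emails")  # hypothetical table with column family 'cf'

# Store one email under a row key, then read it back.
table.put(b"customer-001", {b"cf:body": b"I would like to learn Hadoop ..."})
print(table.row(b"customer-001"))

# Scan the row-key range and count emails that mention Hadoop.
hits = sum(1 for _, data in table.scan(row_prefix=b"customer-")
           if b"Hadoop" in data.get(b"cf:body", b""))
print("customers mentioning Hadoop:", hits)
```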

Apache Mahout:

Mahout offers a framework for developing scalable machine-learning applications. It performs collaborative filtering, classification, and clustering; frequent item-set mining is another function of Mahout. Let us understand them individually:

  • Collaborative filtering: Mahout mines user behaviors, their patterns, and their characteristics, and based on that, it predicts and makes recommendations to the users. The typical example is an E-commerce website.

  • Classification: It means categorizing data into various sub-groups; for example, a bank loan can be classified as good or bad.

  • Clustering: It groups similar data together, such as articles, which may include blogs, news, research papers, etc.

  • Frequent item-set mining: Here, Mahout checks which objects are likely to appear together and makes suggestions if one of them is missing. For example, cell phones and screen protectors are generally bought together, so if you are looking at a cell phone, a screen protector is also suggested (a tiny illustration follows this list).
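
Mahout itself is a Java/Scala framework, so the snippet below is not Mahout's API; it is only a tiny plain-Python illustration of the frequent item-set idea from the last bullet, counting which pairs of items appear together in some made-up shopping baskets:

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets, purely for illustration.
baskets = [
    {"cell phone", "screen protector", "charger"},
    {"cell phone", "screen protector"},
    {"laptop", "mouse"},
    {"cell phone", "charger"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs seen at least twice are "frequent" and can drive suggestions.
for pair, count in pair_counts.items():
    if count >= 2:
        print(pair, count)
```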

Apache Spark:

Apache Spark is an application framework for real-time data analytics in a distributed computing environment. Spark is written in Scala and offers the following capabilities:

  • It executes in-memory computations to increase the speed of data processing over Map-Reduce.

  • Using in-memory computations and other optimizations, it can be up to a hundred times faster than Hadoop MapReduce for large-scale data processing. It therefore requires more memory and processing power than MapReduce.

  • Spark comes with high-level libraries and supports Java, SQL, Python, Scala, R, and more. These standard libraries enable seamless integration in complex workflows.

  • It also allows various components to integrate with it, such as GraphX, MLlib, Spark SQL + DataFrames, and streaming services, to increase its capabilities.

Apache Spark is best suited for real-time processing, while Hadoop was designed to store unstructured data and run batch processing over it. When we combine Apache Spark's high processing speed, advanced analytics, and multiple integration support with Hadoop's low-cost operation on commodity hardware, we get the best results. That's why many companies use Spark and Hadoop together to process and analyze their Big Data stored in HDFS.
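
A minimal PySpark sketch (assuming a Spark installation; the HDFS path and the column names amount and country are hypothetical):

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("orders-demo").getOrCreate()

# Read a hypothetical CSV file from HDFS into a DataFrame; the path is an assumption.
orders = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)

# In-memory transformations: filter, aggregate, and show the result.
orders.filter(orders.amount > 100) \
      .groupBy("country") \
      .count() \
      .show()

spark.stop()
```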

Apache Sqoop:

Sqoop works as a front-end loader for Big Data. It is a tool for transferring bulk data between Hadoop and structured data stores such as relational databases and data marts. It primarily helps transfer data from an enterprise database into the Hadoop cluster for the ETL process.

  • Sqoop fulfils the growing need to transfer data from the mainframe to HDFS.

  • Sqoop helps in achieving improved compression and light-weight indexing for advanced query performance.

  • It can transfer data in parallel for effective performance and optimal system utilization.

  • Sqoop creates high-speed data copies from an external source into Hadoop. 

  • It helps balance the load by offloading extra storage and processing work to other systems.
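
Sqoop is driven from the command line. As a sketch (the JDBC URL, credentials, table, and target directory are assumptions), the import below pulls a relational table into HDFS using four parallel map tasks, invoked from Python for consistency with the other examples:

```python
import subprocess

# Import a relational table into HDFS with Sqoop.
# The JDBC URL, credentials, table, and target directory are assumptions for this sketch.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",
    "--username", "etl_user",
    "--password", "secret",
    "--table", "customers",
    "--target-dir", "/user/hadoop/customers",
    "--num-mappers", "4",   # transfer data in parallel
], check=True)
```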

Oozie:

Apache Oozie is a tool in which you can pipeline all sorts of jobs in the required order so that they work together in Hadoop's distributed environment. Oozie runs and manages Hadoop jobs as a scheduler system. To achieve the desired output, Oozie allows multiple complex jobs to be run in sequential order. It is firmly integrated with the Hadoop stack, supporting various jobs such as Pig, Hive, and Sqoop, as well as system-specific jobs like Java and shell actions. Oozie is an open-source Java web application.

Oozie consists of two jobs:

1. Oozie workflow: It is a collection of actions arranged one after another to perform a job. It is like a relay race, where each action starts right after the previous one finishes.

2. Oozie Coordinator: It executes workflow jobs based on data availability and predefined schedules.
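
A workflow is described in an XML definition plus a job.properties file and submitted with the Oozie command-line client. As a sketch (the Oozie server URL and the properties file are assumptions), the submission step looks roughly like this:

```python
import subprocess

# Submit and start an Oozie workflow whose definition and settings live in job.properties.
# The Oozie server URL and the properties file are assumptions for this sketch.
subprocess.run([
    "oozie", "job",
    "-oozie", "http://localhost:11000/oozie",
    "-config", "job.properties",
    "-run",
], check=True)
```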

Apache Flume:

Flume gathers, aggregates, and transfers large amounts of data from their sources and moves them into HDFS. It works as a fault-tolerant mechanism and helps transmit data from a source into the Hadoop environment. Flume enables its users to stream data from multiple servers into Hadoop immediately.

Source – It accepts the data from the incoming stream and stores it in the channel.

Channel – It is a medium of temporary storage between the source of the data and persistent storage of HDFS.

Sink – This component collects the data from the channel and writes it permanently to the HDFS.
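
A Flume agent is wired together in a properties file that names its sources, channels, and sinks. The sketch below writes one such minimal configuration (the agent name, port, and HDFS path are assumptions) from Python, for consistency with the other examples:

```python
# Minimal Flume agent configuration: a netcat source feeding an HDFS sink
# through an in-memory channel. Names, ports, and paths are assumptions.
flume_conf = """
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events

a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
"""

with open("netcat-to-hdfs.conf", "w") as f:
    f.write(flume_conf)

# The agent would then be started with something like:
#   flume-ng agent --name a1 --conf-file netcat-to-hdfs.conf
```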

Apache Kafka:

Apache Kafka is a distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on the abstraction of a distributed commit log and has evolved rapidly from a messaging queue into a full-fledged event streaming platform.

Kafka was created at LinkedIn and open-sourced in 2011. The Confluent Platform extends Kafka with additional community and commercial features that significantly improve the streaming experience for both operators and developers.
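
A minimal sketch of producing and consuming events with the third-party kafka-python client (the broker address and topic name are assumptions):

```python
from kafka import KafkaProducer, KafkaConsumer  # third-party package: kafka-python

# Publish a few events to a topic (broker address and topic name are assumptions).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("clickstream", value=f"click-{i}".encode("utf-8"))
producer.flush()

# Read the events back from the beginning of the topic.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value.decode("utf-8"))
```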

Solr and Lucene:

Apache Solr and Apache Lucene are the two services used for searching and indexing in the Hadoop ecosystem. Apache Lucene is written in Java and provides Java libraries for searching and indexing, while Apache Solr is an open-source, blazingly fast search platform built around Lucene.

  • Solr is highly scalable, reliable, and fault-tolerant.

  • It provides distributed indexing, automated failover and recovery, load-balanced query, centralized configuration, and much more.

  • You can query Solr using HTTP GET and receive the result in JSON, binary, CSV, or XML (a rough sketch follows this list).

  • Solr provides matching capabilities like phrases, wildcards, grouping, joining, and much more.

  • It ships with a built-in administrative interface for managing Solr instances.

  • Solr takes advantage of Lucene’s near real-time indexing. It enables you to see your content when you want to see it.
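
As the bullets note, Solr answers plain HTTP requests. A rough sketch (the Solr URL and core name are assumptions):

```python
import requests

# Query a Solr core over HTTP GET and read the JSON response.
# The Solr URL and core name ("articles") are assumptions for this sketch.
resp = requests.get(
    "http://localhost:8983/solr/articles/select",
    params={"q": "title:hadoop", "wt": "json", "rows": 5},
)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc)
```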

Apache Zookeeper:

Zookeeper acts as the coordinator for any Hadoop job that involves a combination of services in the Hadoop ecosystem. Apache Zookeeper coordinates the different services of a distributed environment.

Before Zookeeper, it was tough and time-consuming to coordinate between the different services in the Hadoop ecosystem. The services had many interaction problems, such as maintaining a common configuration while synchronizing data. Even once services were configured, changes to their configurations were complicated and difficult to manage, and naming and grouping were also time-consuming. Zookeeper was implemented to address these issues, and it now saves a lot of time by handling configuration maintenance, synchronization, grouping, and naming.
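
A small sketch of the kind of shared configuration data that services keep in Zookeeper, using the third-party kazoo client (the ensemble address and znode paths are assumptions):

```python
from kazoo.client import KazooClient  # third-party package: kazoo

# Connect to a Zookeeper ensemble (the address is an assumption for this sketch).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Store a small piece of shared configuration under a znode path.
zk.ensure_path("/demo/config")
if not zk.exists("/demo/config/db_url"):
    zk.create("/demo/config/db_url", b"jdbc:mysql://dbhost/sales")

# Any service in the cluster can now read (or watch) the same value.
value, stat = zk.get("/demo/config/db_url")
print(value.decode("utf-8"), "version:", stat.version)

zk.stop()
```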

Apache Ambari:

Ambari is open-source software from the Apache Software Foundation. Hadoop is much more manageable with Ambari, which is capable of provisioning, managing, and monitoring Apache Hadoop clusters.

Hadoop cluster provisioning: It guides us on how to install Hadoop services across many hosts with a step-by-step procedure. Ambari handles the configuration of Hadoop services across all clusters. 

Hadoop cluster management: It is the central management system for starting, stopping, and reconfiguring Hadoop services across all clusters.

Hadoop cluster monitoring: To monitor the health and status of the cluster, Ambari provides us with a dashboard and acts as an alerting system when anything goes wrong.

For example, if a service goes down or a node runs low on disk space, Ambari notifies us.
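
Ambari also exposes what it manages through a REST API. A rough sketch of listing one cluster's services (the host, port, credentials, and cluster name are assumptions):

```python
import requests

# Query the Ambari REST API for the services of one cluster.
# Host, port, credentials, and the cluster name are assumptions for this sketch.
resp = requests.get(
    "http://ambari-host:8080/api/v1/clusters/demo_cluster/services",
    auth=("admin", "admin"),
    headers={"X-Requested-By": "ambari"},
)
resp.raise_for_status()
for item in resp.json()["items"]:
    print(item["ServiceInfo"]["service_name"])
```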

Summary of Hadoop Technologies:

Technology                               Working Domain
HDFS (Hadoop Distributed File System)    Storage layer of Hadoop (Big Data)
MapReduce                                Data processing using programming
YARN (Yet Another Resource Negotiator)   Resource management
Pig, Hive                                Data processing services using SQL-like queries
Spark                                    In-memory data processing
HBase                                    NoSQL database
Mahout, Spark MLlib                      Machine learning
Apache Drill                             SQL on Hadoop
Zookeeper                                Cluster coordination
Oozie                                    Job scheduling
Flume, Sqoop                             Data ingestion services
Solr & Lucene                            Searching and indexing
Ambari                                   Provisioning, monitoring, and maintaining clusters