Characteristics of Big Data
Application of Big Data Processing
Introduction to BIG DATA
Where to get Big Data?
Types of Big Data
Storage layer - HDFS (Hadoop Distributed File System)
MapReduce
YARN
How Hadoop works?
Hadoop Eco System
Hadoop Architecture
Hadoop Installation & Environment Setup
Setting Up A Single Node Hadoop Cluster
Ubuntu User Configuration
SSH Setup With Key Generation
Disable IPv6
Download and Install Hadoop 3.1.2
Working with Configuration Files
Start The Hadoop instances
Hadoop Distributed File System (HDFS)
HDFS Features and Goals
HDFS Architecture
Read Operations in HDFS
Write Operations In HDFS
HDFS Operations
YARN
YARN Features
YARN Architecture
Resource Manager
Node Manager
Application Master
Container
Application Workflow in Hadoop YARN
Hadoop MapReduce
How MapReduce Works?
MapReduce Examples with Python
Running The MapReduce Program & Storing The Data File To HDFS
Create A Python Script
Hadoop Environment Setup
Execute The Script
Apache Hive Definition
Why Apache Hive?
Features Of Apache Hive
Hive Architecture
Hive Metastore
Hive Query Language
SQL vs Hive
Hive Installation
Apache Pig Definition
MapReduce vs. Apache Pig vs. Hive
Apache Pig Architecture
Installation Process Of Apache Pig
Execute Apache Pig Script
Hadoop Eco Components
NoSQL Data Management
Apache HBase
Apache Cassandra
MongoDB
Introduction To Kafka
The Architecture of Apache Flume
Apache Spark Ecosystem
The Hadoop ecosystem is a platform, or framework, that helps to solve big data problems. It includes various components and services (for ingesting, processing, storing, and maintaining data). Most of the utilities in the Hadoop ecosystem complement Hadoop's core components: HDFS, YARN, and MapReduce.
Fig: Hadoop Eco-System
Apache Pig:
Pig is a high-level programming language for analyzing large data sets, usually in a Hadoop environment. Pig began as a development effort at Yahoo!. In the MapReduce framework, programs need to be translated into a series of Map and Reduce phases, which is not a programming model that data analysts are familiar with. To bridge this gap, an abstraction called Pig was built on top of Hadoop. Pig provides Pig Latin, a high-level scripting language, and translates Pig Latin scripts into MapReduce jobs that run on YARN and process data stored in the HDFS cluster.
To analyze data, programmers write Pig scripts in the Pig Latin language. The Pig engine accepts the Pig Latin scripts as input and internally converts them into Map and Reduce jobs.
Apache Pig lets us focus on analyzing huge data sets and spend less time writing MapReduce programs. Pig is designed to work on any kind of data, which is why it is named after the pig, an animal that eats anything.
Pig is faster to develop in and very easy to program.
The Pig compiler internally converts Pig Latin scripts into a sequential set of MapReduce jobs; this is the abstraction Pig provides.
It provides a platform for building data flows for ETL (Extract, Transform, and Load), and for processing and analyzing huge data sets.
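For a feel of Pig Latin, here is a minimal word-count sketch; the input path, output directory, and relation names are only illustrative:

-- Load lines of text, split them into words, and count each word (paths are hypothetical)
lines   = LOAD '/user/hadoop/input/sample.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO '/user/hadoop/output/wordcount';

Each statement defines a relation from the previous one; the compiler turns the whole script into MapReduce jobs only when a STORE (or DUMP) is reached.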
Hive:
Facebook created Hive for people who are fluent in SQL, so Hive makes them feel at home while working in the Hadoop ecosystem. Hive is a data warehousing component that performs reading, writing, and managing of large data sets in a distributed environment using an SQL-like interface. The language of Hive is called Hive Query Language (HIVE + SQL = HQL), and it is very similar to SQL.
It supports all basic data types of SQL.
Its primary components are the JDBC/ODBC driver and the Hive command line.
HQL commands are executed in the Hive command-line interface.
Hive is highly scalable. It can serve both real-time (interactive) querying and large batch data set processing.
We can use predefined functions or write tailored user-defined functions (UDFs) to accomplish specific needs.
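As a rough illustration of HQL, the sketch below creates a hypothetical table of customer emails and counts the customers whose emails mention Hadoop; the table and column names are made up for this example:

-- Define a tab-delimited table over text files (hypothetical schema)
CREATE TABLE IF NOT EXISTS customer_emails (
  customer_id INT,
  email_text  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Count distinct customers whose emails mention Hadoop
SELECT COUNT(DISTINCT customer_id)
FROM customer_emails
WHERE email_text LIKE '%Hadoop%';

Behind the scenes, Hive compiles such queries into MapReduce (or Tez/Spark) jobs that run over the files stored in HDFS.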
Apache Drill:
Apache Drill is a schema-free SQL query engine. It works on top of Hadoop, NoSQL stores, and cloud storage. Its primary purpose is large-scale processing of data with low latency. It is a distributed query processing engine: we can query petabytes of data using Drill, and it can scale to several thousand nodes. It supports a range of data sources, including Azure Blob Storage, Google Cloud Storage, Amazon S3, HBase, MongoDB, and so on.
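As a small illustrative sketch, the query below uses Drill's default dfs storage plugin to query a CSV file directly, with no schema defined up front; the file path and column meanings are hypothetical:

-- Query a raw CSV file in place; Drill exposes the fields of a header-less CSV as the columns array
SELECT columns[0] AS customer_id, columns[1] AS city
FROM dfs.`/data/customers.csv`
LIMIT 10;

The same SELECT syntax can be pointed at Parquet files, HBase tables, or MongoDB collections by changing the storage plugin and path.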
Apache HBase:
HBase is a NoSQL database that supports all kinds of data and is therefore capable of handling almost anything inside a Hadoop ecosystem. It provides capabilities similar to Google's BigTable, enabling us to work effectively on big data sets.
When a small piece of data has to be looked up and retrieved from a very large data set, the query must be answered within a short period of time. HBase is handy in such cases because it provides a fault-tolerant way of storing sparse data.
Example: Suppose we have millions of emails from customers and have to find out how many customers are interested in Hadoop. HBase was designed to solve these kinds of problems.
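The HBase shell sketch below illustrates the idea with a hypothetical emails table; the table, row key, and column names are made up for the example:

# Create a table with one column family, write one cell, and read it back
create 'emails', 'info'
put 'emails', 'cust001', 'info:interest', 'Hadoop'
get 'emails', 'cust001'
scan 'emails'

Lookups and writes address an exact row key, which is what keeps single-record access fast even when a table holds billions of rows.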
APACHE MAHOUT:
Mahout offers a framework for the development of scalable machine learning applications. It performs collaborative filtering, classification, and clustering. Frequent item-set mining is another function of Mahout. Let us understand them individually:
Collaborative filtering: Mahout mines user behaviors, their patterns, and their characteristics, and based on that, it predicts and makes recommendations to the users. The typical example is an E-commerce website.
Classification: It means categorizing data into various sub-groups; for example, a bank loan application can be classified as good or bad.
Clustering: It groups similar data together, such as articles that may include blogs, news, research papers, etc.
Frequent item-set mining: Here, Mahout checks which objects are likely to appear together and makes suggestions when one of them is missing. For example, cell phones and screen protectors are generally bought together, so if you are looking at a cell phone, a screen protector is also suggested.
APACHE SPARK:
Apache Spark is a framework for real-time data analytics in a distributed computing environment. Spark is written in Scala.
It executes in-memory computations to increase the speed of data processing over MapReduce.
Using in-memory computations and other optimizations, it can be up to a hundred times faster than Hadoop MapReduce for large-scale data processing. It therefore requires more memory and processing power than MapReduce.
Spark comes with high-level libraries and APIs for Java, SQL, Python, Scala, R, etc. These standard libraries allow seamless integration into complex workflows.
It also integrates with services such as GraphX, MLlib, SQL + DataFrames, and streaming, which further increase its capabilities.
Apache Spark is best suited for real-time processing, while Hadoop was designed to store unstructured data and process it in batches. The best results are achieved by combining Apache Spark's strengths, i.e. high processing speed, advanced analytics, and broad integration support, with Hadoop's low-cost operation on commodity hardware. That is why many companies use Spark and Hadoop together to process and analyze their big data stored in HDFS.
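To make this concrete, here is a minimal PySpark word-count sketch; the HDFS input path and application name are only placeholders:

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; the application name is arbitrary.
spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()

# Hypothetical input file stored in HDFS.
lines = spark.read.text("hdfs:///user/hadoop/input/sample.txt").rdd.map(lambda row: row[0])

# Classic word count: split into words, pair each with 1, then sum the counts per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, total in counts.take(10):
    print(word, total)

spark.stop()

The script would typically be submitted to the cluster with spark-submit so that the work is distributed across YARN containers.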
Apache Sqoop:
Sqoop works as a front-end loader for big data. It is a tool for moving bulk data between Hadoop and relational databases or structured data marts. It primarily helps to transfer data from an enterprise database into the Hadoop cluster so the ETL process can be performed there.
Sqoop fulfils the growing need to transfer data from the mainframe to HDFS.
Sqoop helps in achieving improved compression and light-weight indexing for advanced query performance.
It can transfer data in parallel for effective performance and optimal system utilization.
Sqoop creates high-speed data copies from an external source into Hadoop.
It helps to balance load by offloading excess storage and processing to other systems.
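A typical (illustrative) Sqoop import looks like the sketch below; the JDBC URL, credentials, table name, and target directory are all hypothetical:

# Import the 'orders' table from a MySQL database into HDFS using 4 parallel mappers
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4

Sqoop splits the table across the mappers (by primary key, by default), which is where the parallel transfer mentioned above comes from.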
Oozie:
Apache Oozie is a tool with which you can pipeline all sorts of programs, in the required order, to work in the distributed environment of Hadoop. Oozie is a scheduler system that runs and manages Hadoop jobs. To achieve the desired output, Oozie allows multiple complex jobs to be run in sequential order. It is firmly integrated with the Hadoop stack, supporting job types such as Pig, Hive, and Sqoop, as well as system-specific jobs like Java and shell. Oozie is an open-source Java web application.
Oozie consists of two jobs:
1. Oozie workflow: It is a collection of actions arranged one after another to perform a job. It's just like a relay race, where each runner starts right after the previous one finishes.
2. Oozie Coordinator: It executes workflow jobs based on data availability and predefined schedules.
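A minimal workflow.xml sketch is shown below; it chains a single Sqoop action into the flow, and every name, path, and parameter in it is only a placeholder:

<workflow-app name="orders-import" xmlns="uri:oozie:workflow:0.5">
    <start to="import-orders"/>
    <action name="import-orders">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect jdbc:mysql://dbserver/sales --table orders --target-dir /user/hadoop/orders</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop import failed</message>
    </kill>
    <end name="end"/>
</workflow-app>

A coordinator definition would then point at this workflow and state when, or on which input data, it should be triggered.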
Apache Flume:
Flume gathers, aggregates, and transfers large amounts of streaming data from their sources into HDFS. It works as a fault-tolerant mechanism and helps to transmit data from a source into a Hadoop environment. Flume enables its users to get data from multiple servers into Hadoop immediately. Its main components are the source, the channel, and the sink:
Source – It accepts data from the incoming stream and stores it in the channel.
Channel – It is a medium of temporary storage between the source of the data and the persistent storage of HDFS.
Sink – This component collects the data from the channel and writes it permanently to HDFS.
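These three components are wired together in a simple properties file. The sketch below is a minimal, illustrative agent configuration; the agent name, log path, and HDFS directory are made up:

# One agent with one source, one channel, and one sink (all names are hypothetical)
agent1.sources  = web-logs
agent1.channels = mem-channel
agent1.sinks    = hdfs-sink

# Source: tail a local web server log
agent1.sources.web-logs.type = exec
agent1.sources.web-logs.command = tail -F /var/log/httpd/access_log
agent1.sources.web-logs.channels = mem-channel

# Channel: buffer events in memory between source and sink
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 10000

# Sink: write the events into HDFS
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs:///user/hadoop/flume/weblogs
agent1.sinks.hdfs-sink.channel = mem-channel

The agent is then started with the flume-ng command, pointing it at this file and at the agent name agent1.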
Apache KAFKA
Apache Kafka is a distributed event streaming platform that handles trillions of events per day. Conceived initially as a messaging queue, Kafka is based on the abstraction of a distributed commit log and has since evolved from a messaging queue into a full-fledged event streaming platform.
Kafka was created at LinkedIn and open-sourced in 2011. Confluent Platform extends Kafka with new community and commercial features that significantly improve the streaming experience of both operators and developers.
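As a small sketch of the producer/consumer model, the Python snippet below assumes the third-party kafka-python client is installed and a broker is running on localhost:9092; the topic name is hypothetical:

from kafka import KafkaProducer, KafkaConsumer

# Publish one event to a (hypothetical) topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b"user42 viewed product 1001")
producer.flush()

# Read events back from the beginning of the topic; stop after 5 seconds of inactivity.
consumer = KafkaConsumer("clickstream",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)

Producers and consumers are decoupled: the broker keeps the events in its log, so consumers can read them at their own pace or replay them later.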
Solr and Lucene:
Apache Solr and Apache Lucene are two services used for searching and indexing in the Hadoop ecosystem. Apache Solr is an application built around Apache Lucene. Lucene is written in Java and provides Java libraries for searching and indexing. Apache Solr is an open-source, blazing fast search platform.
Solr is highly scalable, reliable, and fault-tolerant.
It provides distributed indexing, automated failover and recovery, load-balanced query, centralized configuration, and much more.
You can query Solr using HTTP GET and receive the result in JSON, binary, CSV, and XML.
Solr provides matching capabilities like phrases, wildcards, grouping, joining, and much more.
It ships with a built-in administrative interface that enables management of Solr instances.
Solr takes advantage of Lucene’s near real-time indexing. It enables you to see your content when you want to see it.
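As a rough illustration of the HTTP interface, the sketch below indexes one document and then queries it; the collection name articles is hypothetical, and a local Solr instance on the default port 8983 is assumed:

# Add (and commit) one JSON document to the collection
curl "http://localhost:8983/solr/articles/update?commit=true" \
     -H "Content-Type: application/json" \
     -d '[{"id": "1", "title": "Getting started with Hadoop"}]'

# Query the collection over HTTP GET and receive the result as JSON
curl "http://localhost:8983/solr/articles/select?q=title:hadoop&rows=5&wt=json"

Because the interface is plain HTTP, any language or tool that can issue a GET request can query Solr.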
APACHE ZOOKEEPER:
Zookeeper acts as the coordinator for any Hadoop job that involves a combination of services in the Hadoop ecosystem. Apache Zookeeper coordinates different services in a distributed environment.
Before Zookeeper, it was tough and time-consuming to coordinate the different services in the Hadoop ecosystem. The services had many problems with interaction, such as maintaining a common configuration while synchronizing data. Even once the services are configured, changes to their configurations are complicated and difficult to manage, and naming and grouping were also time-consuming. Zookeeper was implemented to address these issues, and it now saves a lot of time through configuration maintenance, synchronization, grouping, and naming.
Apache Ambari:
Ambari is open-source software from the Apache Software Foundation. Hadoop is much more manageable with Ambari, which is capable of provisioning, managing, and monitoring Apache Hadoop clusters.
Hadoop cluster provisioning: It guides us on how to install Hadoop services across many hosts with a step-by-step procedure. Ambari handles the configuration of Hadoop services across all clusters.
Hadoop cluster management: It is the central management system for starting, stopping, and reconfiguring Hadoop services across all clusters.
Hadoop cluster monitoring: Ambari provides a dashboard for monitoring health and status, and acts as an alerting system when anything goes wrong.
For example, if a service goes down or a node runs low on disk space, Ambari notifies us.
Summary of Hadoop Technologies:
Technology | Working Domain
HDFS (Hadoop Distributed File System) | Storage layer of Hadoop (Big Data)
MapReduce | Data processing using programming
YARN (Yet Another Resource Negotiator) | Resource management
PIG, HIVE | Data processing services using SQL-like queries
Spark | In-memory data processing
HBase | NoSQL database
Mahout, Spark MLlib | Machine learning applications
Apache Drill | SQL on Hadoop
Zookeeper | Cluster coordination
Oozie | Job scheduling
Flume, Sqoop | Data ingestion services
Solr & Lucene | Searching and indexing
Ambari | Provisioning, monitoring, and maintaining the cluster