Introduction to Big Data
Where to get Big Data?
Characteristics of Big Data
Types of Big Data
Application of Big Data Processing
Storage layer - HDFS (Hadoop Distributed File System)
MapReduce
YARN
How Does Hadoop Work?
Hadoop Ecosystem
Hadoop Architecture
Hadoop Installation & Environment Setup
Setting Up A Single Node Hadoop Cluster
Ubuntu User Configuration
SSH Setup With Key Generation
Disable IPv6
Download and Install Hadoop 3.1.2
Working with Configuration Files
Start The Hadoop Instances
Hadoop Distributed File System (HDFS)
HDFS Features and Goals
HDFS Architecture
Read Operations in HDFS
Write Operations in HDFS
HDFS Operations
YARN
YARN Features
YARN Architecture
Resource Manager
Node Manager
Application Master
Container
Application Workflow in Hadoop YARN
Hadoop MapReduce
How Does MapReduce Work?
MapReduce Examples with Python
Running The MapReduce Program & Storing The Data File To HDFS
Create A Python Script
Hadoop Environment Setup
Execute The Script
Apache Hive Definition
Why Apache Hive?
Features Of Apache Hive
Hive Architecture
Hive Metastore
Hive Query Language
SQL vs Hive
Hive Installation
Apache Pig Definition
MapReduce vs. Apache Pig vs. Hive
Apache Pig Architecture
Installation Process Of Apache Pig
Execute Apache Pig Script
Hadoop Ecosystem Components
NoSQL Data Management
Apache HBase
Apache Cassandra
MongoDB
Introduction To Kafka
The Architecture of Apache Flume
Apache Spark Ecosystem
The following diagram describes the Hive architecture and the flow in which a query is submitted to Hive and finally processed using the MapReduce framework:
Fig: Hive Architecture
As shown in the above image, the Apache Hive Architecture can be categorized into the following components:
6.3.1 Hive Clients: Hive supports applications written in many languages, such as Java, C++, and Python, through its JDBC, Thrift, and ODBC drivers, so a Hive client application can be written in the language of one's choice. Apache Hive allows various types of client applications to perform Hive queries. We can categorize these clients into three types:
Thrift Clients: As the Hive server is based on Apache Thrift, it can serve requests from any programming language that supports Thrift.
JDBC Clients: Hive enables Java applications to connect to it via the JDBC driver, implemented by the org.apache.hadoop.hive.jdbc.HiveDriver class.
ODBC Clients: The Hive ODBC driver allows applications to connect to Hive using the ODBC protocol. (Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.)
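As a concrete illustration of the Thrift-based client path, here is a minimal Python sketch. It assumes the third-party PyHive library and a running HiveServer; the function name, hostname, and port 10000 (HiveServer's conventional port) are illustrative assumptions, not part of the text above:

```python
def fetch_databases(host="localhost", port=10000):
    """Connect to a Hive server over Thrift and list its databases.

    Assumes the third-party `pyhive` package is installed and a Hive
    server is reachable at host:port -- both are illustrative
    assumptions, not guaranteed by this text.
    """
    from pyhive import hive  # third-party Thrift-based client (assumed installed)

    conn = hive.Connection(host=host, port=port)
    try:
        cur = conn.cursor()
        cur.execute("SHOW DATABASES")  # a standard HiveQL command
        return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()
```

Any language with Thrift bindings could follow the same connect/execute/fetch shape; PyHive is just one convenient wrapper.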
6.3.2 Hive Services: Apache Hive provides various services, such as the CLI and the web interface, for performing queries. Hive offers many services, as shown in the diagram above; let's look at each of them:
Hive Command Line Interface (CLI): This is the default shell provided by Hive, where we can execute Hive queries and commands directly.
Hive Web Interface: In addition to the command-line interface, Hive also offers a web-based GUI for executing Hive queries and commands.
Hive Server: The Hive server is based on Apache Thrift and is therefore also referred to as the Thrift Server; it allows various clients to submit requests to Hive and retrieve the final results.
Apache Hive Driver: The driver receives the queries submitted by a client through the CLI, the web UI, or the Thrift, JDBC, or ODBC interfaces. It then steers each query through the compilation, optimization, and execution stages described next.
Compiler: The driver transfers the query to the compiler, where parsing, type checking, and semantic analysis take place with the aid of the schema available in the metastore.
Optimizer: The optimizer generates an optimized logical plan in the form of a Directed Acyclic Graph (DAG) of MapReduce and HDFS tasks.
Executor: Once compilation and optimization are complete, the execution engine runs these tasks in the order of their dependencies using Hadoop.
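"In the order of their dependencies" can be pictured as a topological walk over the task DAG. The sketch below is a simplified stand-in: the task names and the run function are invented for illustration, and a real execution engine would launch MapReduce or HDFS jobs instead of printing:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A toy DAG: each task maps to the tasks it depends on.
dag = {
    "scan_table":   [],
    "map_filter":   ["scan_table"],
    "reduce_group": ["map_filter"],
    "write_hdfs":   ["reduce_group"],
}

def run(task):
    # Stand-in for submitting a MapReduce or HDFS job to Hadoop.
    print(f"running {task}")

# static_order() yields every task only after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
for task in order:
    run(task)
```

For this chain-shaped DAG the order is fully determined; with a wider DAG, independent tasks could run in any relative order (or in parallel).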
Metastore: The metastore is a central repository that stores all of Hive's metadata. This metadata includes the structure of tables and partitions, along with column names, column types, and the serializers/deserializers needed to read and write the data stored in HDFS. The metastore consists of two basic units:
A service that provides metastore access to other Hive services.
Dedicated disk storage for the metadata, separate from HDFS storage.
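To make the kinds of metadata listed above concrete, here is a toy in-memory model of a single metastore entry. The field names, sample table, and SerDe string are illustrative only; they are not Hive's actual metastore schema:

```python
from dataclasses import dataclass

@dataclass
class TableMeta:
    name: str             # table name
    columns: dict         # column name -> column type
    partition_keys: list  # columns the table is partitioned by
    serde: str            # serializer/deserializer used for HDFS reads/writes
    location: str         # HDFS directory holding the table's data

# Illustrative entry, not real cluster metadata.
sales = TableMeta(
    name="sales",
    columns={"id": "int", "amount": "double"},
    partition_keys=["country"],
    serde="LazySimpleSerDe",
    location="/user/hive/warehouse/sales",
)
```

When the compiler type-checks a query against the metastore, it is consulting exactly this kind of record: column types for semantic analysis, and the SerDe and location for planning the HDFS reads and writes.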
6.3.3 Processing Framework and Resource Management: Internally, Hive uses the Hadoop MapReduce framework as its de facto engine to run queries.
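To get a rough sense of how a query becomes MapReduce work, the sketch below hand-simulates a query like `SELECT dept, COUNT(*) FROM emp GROUP BY dept` as a map phase (emit key/value pairs), a shuffle (group by key), and a reduce phase (aggregate per key). The rows and the query are invented for illustration; Hive generates the equivalent jobs automatically:

```python
from collections import defaultdict

# Toy rows standing in for an HDFS-backed Hive table (illustrative data).
rows = [("alice", "eng"), ("bob", "eng"), ("carol", "sales")]

# Map phase: emit (dept, 1) for every row.
mapped = [(dept, 1) for _name, dept in rows]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for dept, one in mapped:
    groups[dept].append(one)

# Reduce phase: COUNT(*) per dept.
counts = {dept: sum(ones) for dept, ones in groups.items()}
# counts == {"eng": 2, "sales": 1}
```

This is the same map/shuffle/reduce shape the chapter's earlier Python MapReduce examples use; Hive's contribution is translating declarative HiveQL into such jobs without the user writing them by hand.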
6.3.4 Distributed Storage: As Hive is built on top of Hadoop, it uses the underlying HDFS for distributed storage.