Characteristics of Big Data
Application of Big Data Processing
Introduction to BIG DATA
Where to get Big Data?
Types of Big Data
Storage layer - HDFS (Hadoop Distributed File System)
MapReduce
YARN
How Does Hadoop Work?
Hadoop Ecosystem
Hadoop Architecture
Hadoop Installation & Environment Setup
Setting Up A Single Node Hadoop Cluster
Ubuntu User Configuration
SSH Setup With Key Generation
Disable IPv6
Download and Install Hadoop 3.1.2
Working with Configuration Files
Start The Hadoop Instances
Hadoop Distributed File System (HDFS)
HDFS Features and Goals
HDFS Architecture
Read Operations in HDFS
Write Operations in HDFS
HDFS Operations
YARN
YARN Features
YARN Architecture
Resource Manager
Node Manager
Application Master
Container
Application Workflow in Hadoop YARN
Hadoop MapReduce
How Does MapReduce Work?
MapReduce Examples with Python
Running The MapReduce Program & Storing The Data File To HDFS
Create A Python Script
Hadoop Environment Setup
Execute The Script
Apache Hive Definition
Why Apache Hive?
Features Of Apache Hive
Hive Architecture
Hive Metastore
Hive Query Language
SQL vs Hive
Hive Installation
Apache Pig Definition
MapReduce vs. Apache Pig vs. Hive
Apache Pig Architecture
Installation Process Of Apache Pig
Execute Apache Pig Script
Hadoop Ecosystem Components
NoSQL Data Management
Apache HBase
Apache Cassandra
MongoDB
Introduction To Kafka
The Architecture of Apache Flume
Apache Spark Ecosystem
To write Pig scripts, we need the Pig Latin language, and to execute them, we need an execution environment. We will start with Pig Latin scripts.
7.2.1 Pig Latin Scripts
Initially, we submit Pig scripts, written in Pig Latin code using built-in operators, to the Pig execution environment. There are three different ways to execute a Pig script (a minimal sketch follows this list):
1. Grunt Shell: This is Pig's interactive shell, where we generally execute all Pig commands/scripts.
2. Script File: We write all the Pig commands in a script file using an editor (nano/vi/gedit) and execute the Pig script file from the local file system. The script is executed with the help of the Pig Server.
3. Embedded Script: This is, in fact, a way to use other languages' capabilities. We can create User Defined Functions (UDFs) in languages such as Java, Python, or Ruby to provide functionality that is not available in the built-in operators, embed them in the Pig Latin script file, and then execute that script file.
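As a minimal sketch of a script file (the HDFS path and the script name wordcount.pig are assumptions made for this example), a classic word count in Pig Latin looks like this:

-- load a hypothetical input file from HDFS (path is an assumption)
lines  = LOAD '/data/input.txt' AS (line:chararray);
-- split each line into words, one word per record
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- group identical words together and count each group
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
-- print the result; from a terminal this file could be run as:
--   pig -x mapreduce wordcount.pig
DUMP counts;

The same statements could also be typed line by line in the Grunt shell, which makes it convenient for trying things out interactively.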
Fig: Apache Pig Architecture
7.2.2 Parser
In this phase, Pig scripts are handed to the Parser, which performs type checking and checks the syntax of the Pig script. The parser's outcome is a DAG (directed acyclic graph) representing the Pig Latin statements and logical operators: the logical operators are represented as the nodes, and the data flows as the edges.
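To inspect this plan yourself, Pig's EXPLAIN statement prints the logical plan (the DAG described above) along with the physical and MapReduce plans derived from it. A small sketch, reusing the hypothetical word-count relation from the earlier example:

-- show the logical, physical, and MapReduce plans for a relation
EXPLAIN counts;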
7.2.3 Optimizer
From the Parser, the DAG is submitted to the optimizer, which performs automatic optimization activities such as splitting, merging, transforming, and reordering operators. The optimizer aims to reduce the amount of data in the pipeline at any point in time while processing the extracted data. It uses the functions below to perform these activities (a short example follows the list).
PushUpFilter: If a filter uses multiple conditions and can be split, Apache Pig splits the conditions and pushes each condition up separately. Early selection on these conditions helps reduce the number of data records remaining in the pipeline.
PushDownForEachFlatten: FLATTEN produces a cross product between a tuple or a bag (complex types) and the other fields in the data record. This rule applies the FLATTEN as late as possible in the plan, which helps keep the number of records in the pipeline low.
ColumnPruner: Omitting unused or no-longer-needed columns reduces the size of the record. This can be applied after each operator so that fields are pruned as aggressively as possible.
MapKeyPruner: This rule omits map keys that are never used, which reduces the size of the record.
LimitOptimizer: If the LIMIT operator is applied immediately after a load or sort operator, Pig converts the load or sort operator into a limit-sensitive implementation that does not require processing the whole data set. Applying the limit at an earlier phase reduces the number of records.
This is, in fact, only a flavor of the optimization process; beyond these rules, the optimizer also optimizes operations such as Order By, Join, and Group By.
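As a hedged sketch (the relation names, file path, and schema are invented for illustration), the following script contains two patterns the rules above can act on: a multi-condition FILTER that PushUpFilter can split and push toward the load, and a LIMIT right after an ORDER BY that LimitOptimizer can fold into the sort:

-- hypothetical sales data; the path and schema are assumptions
sales   = LOAD '/data/sales.csv' USING PigStorage(',')
          AS (region:chararray, product:chararray, amount:double);
-- PushUpFilter can split this compound condition and apply each
-- part as early as possible, shrinking the pipeline sooner
usa_big = FILTER sales BY region == 'US' AND amount > 1000.0;
-- LimitOptimizer turns ORDER BY followed by LIMIT into a top-N
-- style plan instead of a full sort of the whole data set
sorted  = ORDER usa_big BY amount DESC;
top10   = LIMIT sorted 10;
DUMP top10;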
7.2.4 Compiler
After the optimization finishes, the compiler compiles the optimized plan into a series of MapReduce jobs; this conversion of Pig jobs into MapReduce jobs happens automatically.
7.2.5 Execution engine
Finally, these MapReduce jobs are submitted to the execution engine for execution. The MapReduce jobs then run and produce the required outcome. The result can be displayed on the screen using the DUMP statement, or stored in HDFS using the STORE statement.
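Continuing the earlier sketch (the output path is an assumption), the same relation can either be printed or persisted:

-- print the relation to the console; this triggers execution
DUMP top10;
-- or write it back to HDFS instead of printing it
STORE top10 INTO '/output/top10_sales' USING PigStorage(',');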