
Apache Pig Architecture

For writing a Pig script, we need the Pig Latin language, and to execute it, we need an execution environment. First, we will start with the Pig Latin script.

7.2.1 Pig Latin Scripts

Initially, we submit Pig scripts, written in Pig Latin using built-in operators, to the Pig execution environment. There are three different ways to execute a Pig script:

1. Grunt Shell: This is Pig's interactive shell, where we generally execute all Pig commands/scripts.

2. Script File: We write all the Pig commands in a script file using an editor (nano/vi/gedit) and execute the Pig script file from the local file system. This is executed with the help of the Pig Server. A minimal example of such a script follows this list.

3. Embedded Script: This is, in fact, a way to use another language's capability. We can create User Defined Functions to bring in functionality that is not available through the built-in operators, using different languages like Java, Python, Ruby, etc., and embed them in the Pig Latin script file. Then we execute that script file.
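For example, a minimal Pig Latin script like the one below can either be typed line by line in the Grunt shell or saved into a file (say student.pig) and run with the pig command. The file name student_data.txt and the schema are assumed here purely for illustration.

    -- load student records (file name and schema are assumed for illustration)
    students = LOAD 'student_data.txt' USING PigStorage(',')
               AS (id:int, name:chararray, city:chararray);
    -- keep only the students from one city
    pune_students = FILTER students BY city == 'Pune';
    -- print the result on the screen
    DUMP pune_students;

An embedded script would, in addition, REGISTER the jar or Python file that contains the User Defined Function before calling it from the script.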

                                Fig: Apache Pig Architecture

7.2.2 Parser

In this phase, the Pig scripts are handed to the Parser, which does type checking and checks the syntax of the Pig script. The parser's output is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators: logical operators are represented as the nodes, and the data flows between them are represented as edges.
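We can inspect the plans that Pig builds from a script with the EXPLAIN command in the Grunt shell; the aliases and file name below are assumed for illustration.

    grunt> students = LOAD 'student_data.txt' AS (id:int, name:chararray, city:chararray);
    grunt> pune_students = FILTER students BY city == 'Pune';
    grunt> EXPLAIN pune_students;  -- prints the logical, physical, and MapReduce plans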

7.2.3 Optimizer

From the Parser, the DAG is passed to the optimizer, which performs automatic optimization activities such as splitting, merging, transforming, and reordering operators. The optimizer aims to reduce the amount of data in the pipeline at any instant while processing the extracted data. It uses the rules below to perform these activities.

PushUpFilter: If a filter uses multiple conditions and can be split, Apache Pig splits the conditions and pushes each condition up separately. Applying these conditions early reduces the number of data records remaining in the pipeline.
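A rough sketch of this rule (the aliases, file names, and schemas are assumed): a compound filter written after a join can be split so that each condition is applied to its own input before the join.

    users  = LOAD 'users'  AS (id:int, age:int);
    orders = LOAD 'orders' AS (uid:int, amount:double);
    joined = JOIN users BY id, orders BY uid;
    -- the two conditions touch different inputs, so PushUpFilter can split the
    -- filter and apply each condition before the join instead of after it
    result = FILTER joined BY users::age > 21 AND orders::amount > 100.0;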

PushDownForEachFlatten: FLATTEN produces a cross product between a tuple or a bag (complex type) and the other fields in the record, so this rule applies it as late as possible in the plan. This helps keep the number of records in the pipeline low.
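A sketch of the situation this rule targets (data and aliases are assumed): flattening a bag multiplies the number of records, so Pig tries to postpone it when the following operators do not need the flattened field.

    students = LOAD 'students' AS (id:int, subjects:bag{t:tuple(subject:chararray)});
    grades   = LOAD 'grades'   AS (sid:int, score:int);
    -- FLATTEN turns one student record into one record per subject; since the
    -- join only needs id, Pig may move this FOREACH below the join
    expanded = FOREACH students GENERATE id, FLATTEN(subjects);
    joined   = JOIN expanded BY id, grades BY sid;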

ColumnPruner: Omitting unused or no-longer-needed columns reduces the size of the record. This rule can be applied after each operator so that fields are pruned as aggressively as possible.
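For example (the schema below is assumed), if a script loads many columns but only ever references two of them, the pruner drops the remaining columns right after the load instead of carrying them down the pipeline.

    logs  = LOAD 'weblogs' AS (ip:chararray, ts:long, url:chararray, agent:chararray, referer:chararray);
    -- only ip and url are referenced, so ColumnPruner removes ts, agent and referer early
    pages = FOREACH logs GENERATE ip, url;
    DUMP pages;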

MapKeyPruner: This rule omits map keys that are never used, which reduces the size of the record.
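Similarly, when a record carries a map field (the schema below is assumed), only the keys the script actually looks up need to be kept.

    events = LOAD 'events' AS (id:int, props:map[chararray]);
    -- only the 'city' key of the map is used, so MapKeyPruner can drop all other keys
    cities = FOREACH events GENERATE id, props#'city' AS city;
    DUMP cities;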

LimitOptimizer: If the limit operator is applied immediately after a sort or load operator, Pig converts the sort or load operator into a limit-sensitive implementation that does not require processing the whole data set. Applying the limit at an earlier phase reduces the number of records.
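A sketch of the pattern this rule catches (data and aliases assumed): a LIMIT placed directly after an ORDER BY lets Pig compute only the top rows instead of sorting the whole data set and then truncating it.

    scores  = LOAD 'scores' AS (name:chararray, points:int);
    ranked  = ORDER scores BY points DESC;
    -- LimitOptimizer turns the full sort into a limit-aware (top-N) implementation
    top_ten = LIMIT ranked 10;
    DUMP top_ten;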

This is, in fact, only a flavor of the optimization process; on top of these rules, the optimizer also handles Order By, Join, and Group By operations.

7.2.4 Compiler

After optimization, the compiler compiles the optimized code into a series of MapReduce jobs. In other words, the compiler is responsible for converting Pig jobs automatically into MapReduce jobs.
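As a rough illustration (aliases and file names are assumed), a simple group-and-count script compiles into a single MapReduce job: the LOAD and the extraction of the grouping key roughly correspond to the map phase, and the COUNT over each group to the reduce phase.

    words   = LOAD 'words.txt' AS (word:chararray);
    grouped = GROUP words BY word;    -- grouping key becomes the shuffle key
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO 'word_counts';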

7.2.5 Execution Engine

Finally, these MapReduce jobs are submitted to the execution engine for execution. The MapReduce jobs then run and produce the required output. The result can be displayed on the screen using the "DUMP" statement or stored in HDFS using the "STORE" statement.
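For instance (the alias and the HDFS path are assumed), the same relation can either be printed or written out:

    DUMP counts;                                  -- display the result on the screen
    STORE counts INTO '/user/hadoop/word_counts'
          USING PigStorage(',');                  -- store the result in HDFS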