
Apache Spark Ecosystem

Apache Spark is an open-source, large-scale data processing engine with expressive development APIs that let data workers run streaming, machine learning, or SQL workloads requiring fast, repeated access to data sets. It is designed to handle both batch processing and stream processing (dealing with streaming data), and it serves as a platform for general-purpose cluster computing. Spark is designed to integrate with the wider Big Data tool ecosystem; for example, it can access any Hadoop data source and can run on Hadoop clusters. Spark takes Hadoop MapReduce to the next level by adding iterative queries and stream processing. Spark is highly accessible, offering simple APIs in Python, Java, Scala, and R.

There is a common but mistaken belief that Apache Spark is an extension of Hadoop. In fact, Spark is independent of Hadoop because it has its own cluster management system; Hadoop can serve purely as a storage layer for Spark.

Spark's key feature is its in-memory cluster computing capability, which increases the processing speed of an application. In-memory computing means that data is stored and processed in RAM across a cluster of machines rather than being repeatedly read from and written to disk.
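As a rough illustration (not from the original tutorial), the Scala sketch below loads a text file once, caches it in cluster memory, and reuses the in-memory copy for two separate computations; the file path is a made-up placeholder.

```scala
import org.apache.spark.sql.SparkSession

object InMemoryExample {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession, the entry point for a Spark application.
    val spark = SparkSession.builder()
      .appName("InMemoryExample")
      .master("local[*]") // run locally just for illustration
      .getOrCreate()

    // Hypothetical input path; any text file would do.
    val lines = spark.read.textFile("data/events.log")

    // cache() keeps the dataset in the cluster's memory pool,
    // so both computations below reuse the in-memory copy.
    lines.cache()

    println(s"Total lines: ${lines.count()}")
    println(s"Error lines: ${lines.filter(_.contains("ERROR")).count()}")

    spark.stop()
  }
}
```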

Apache Spark Ecosystem Components:

The Apache Spark Ecosystem has six major components that power Apache Spark: Spark Core, Spark Streaming, Spark SQL, Spark GraphX, Spark MLlib, and SparkR.

Fig: Apache Spark Ecosystem

1. Apache Spark Core:

The functions of Apache Spark are built on top of Spark Core. It delivers speed by providing in-memory computing capability. Thus, Spark Core is the foundation for parallel and distributed processing of large datasets. The primary features of Spark Core are the following:

1. In charge of essential I/O functionalities.

2. Essential for programming and monitoring the operation of the Spark cluster.

3. Task dispatching.

4. Fault recovery.

5. Overcomes the MapReduce bottleneck by performing computation in memory.

Spark Core is built around a special collection called the RDD (Resilient Distributed Dataset), one of Spark's core abstractions. An RDD partitions data across all the nodes in a cluster and holds it in the cluster's memory pool as a single logical unit. Two kinds of operations, transformations and actions, are performed on RDDs:

Transformation: An operation that produces a new RDD from an existing RDD (for example, map or filter).

Action: Transformations only define new RDDs from existing ones; when we want to materialize the actual result and work with it, we apply an action (for example, count or collect), as the sketch below illustrates.
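A minimal Scala sketch of this distinction, using an arbitrary local dataset: map and filter are transformations that only define new RDDs, while collect and count are actions that trigger the actual computation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddBasics").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Parallelize a small collection into an RDD partitioned across the cluster.
    val numbers = sc.parallelize(1 to 10)

    // Transformations: lazily define new RDDs from existing ones.
    val squares = numbers.map(n => n * n)
    val evenSquares = squares.filter(_ % 2 == 0)

    // Actions: trigger execution and return results to the driver.
    println(evenSquares.collect().mkString(", ")) // 4, 16, 36, 64, 100
    println(s"count = ${evenSquares.count()}")    // 5

    sc.stop()
  }
}
```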

2. Apache Spark SQL:

The Spark SQL component is a distributed framework for processing structured data. With Spark SQL, Spark has more information about the structure of the data and the computation being performed, and it uses this extra information to carry out additional optimizations. The same execution engine is used regardless of which API or language expresses the computation.

Spark SQL is used to access structured and semi-structured data. It also enables efficient, interactive, analytical applications over both streaming and historical data. Because Spark SQL is Spark's unit for structured data processing, it effectively serves as a distributed SQL query engine. Spark SQL's features include:

1. Cost-based optimizer. For more information, follow the Spark SQL Optimization tutorial.

2. Mid-query fault tolerance: Spark SQL scales to thousands of nodes and multi-hour queries using the Spark engine, which can recover from failures in the middle of a query.

3. Full compatibility with existing Hive data.

4. DataFrames and SQL provide a common way of accessing a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC.

5. Support for working with structured data inside Spark programs, using either SQL or the familiar DataFrame API, as the sketch below shows.
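As a rough sketch of these last two points, the Scala example below reads a hypothetical people.json file (with assumed name and age fields) into a DataFrame and runs the same query through the DataFrame API and through SQL.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlExample")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical JSON source with "name" and "age" fields.
    val people = spark.read.json("data/people.json")

    // DataFrame API version of the query.
    people.filter(col("age") > 21).select("name").show()

    // Equivalent SQL version against a temporary view.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 21").show()

    spark.stop()
  }
}
```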

3. Spark Streaming:

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from sources such as Kafka, Flume, Kinesis, or TCP sockets and processed with various algorithms; the results are then pushed out to file systems, databases, and live dashboards. Spark Streaming uses micro-batching for real-time streaming.

Micro-batching is a technique that lets a system treat a continuous flow as a series of small batches of data. Spark Streaming therefore groups live data into small batches and hands them to the batch processing engine, while also providing fault tolerance.

There are 3 phases of Spark Streaming:

a. GATHERING

Spark Streaming provides two categories of built-in streaming sources:

1. Basic sources: Sources available directly in the StreamingContext API, for example file systems and socket connections.

2. Advanced sources: Sources such as Kafka, Flume, and Kinesis, which are accessible through extra utility classes. This is how Spark Streaming ingests data from systems such as Kafka, Kinesis, Flume, or TCP sockets.

b. PROCESSING

The ingested data is processed using complex algorithms expressed with high-level functions such as map, reduce, join, and window.

c. DATA STORAGE

Processed data is pushed out to file systems, databases, and live dashboards.

Spark Streaming also offers a high-level abstraction known as the DStream, or Discretized Stream, which represents a continuous stream of data. A DStream can be created in two ways: either from input sources such as Kafka, Kinesis, and Flume, or by applying high-level operations to other DStreams. Internally, a DStream is a sequence of RDDs.
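The minimal Scala sketch below ties the three phases and the DStream abstraction together; the hostname and port of the socket source are placeholders. Lines are gathered from a TCP socket, processed with high-level operations over 5-second micro-batches, and the resulting counts are printed (in practice they might be written to a file system, database, or dashboard).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")

    // Micro-batch interval of 5 seconds: live data is grouped into 5-second batches.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Gathering: a basic source, here a TCP socket (hostname and port are placeholders).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Processing: high-level operations on the DStream (internally a sequence of RDDs).
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Data storage: here the results are simply printed; they could also be
    // written to a file system, a database, or a live dashboard.
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```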

4. Apache Spark MLlib:

MLlib is Spark's versatile machine learning library, offering high-quality, high-speed algorithms. The purpose behind MLlib's development is to make machine learning scalable and easy. It provides common machine learning algorithms such as clustering, regression, classification, and collaborative filtering, as well as low-level machine learning primitives such as the standard gradient descent optimization algorithm.

The reason MLlib has moved to a DataFrame-based API is that it is more user-friendly than the RDD API. Advantages of using DataFrames include access to Spark Data Sources, SQL/DataFrame queries, the Tungsten and Catalyst optimizations, and uniform APIs across languages. MLlib also uses the Breeze linear algebra package, a set of libraries for numerical computing and machine learning.
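A minimal Scala sketch of the DataFrame-based MLlib API; the tiny inline training set and its label/features columns are invented purely for illustration. It fits a logistic regression model and applies it back to the training data.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MllibExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MllibExample")
      .master("local[*]")
      .getOrCreate()

    // Tiny, made-up training set of (label, features) rows.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Fit a logistic regression model using the DataFrame-based API.
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)

    // Apply the model back to the training data and show the predictions.
    model.transform(training).select("features", "label", "prediction").show()

    spark.stop()
  }
}
```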

5. Apache Spark GraphX:

GraphX is Spark's API for graphs and graph-parallel computation. It is a network graph analytics engine and data store, and graphs can be used for clustering, classification, traversal, searching, and pathfinding. In addition, GraphX extends the Spark RDD by introducing a new graph abstraction: a directed multigraph with properties attached to each vertex and edge.

GraphX also optimizes how vertices and edges are represented when they hold primitive data types. To support graph computation, it provides a set of fundamental operators as well as an optimized variant of the Pregel API.
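A minimal Scala sketch of the property-graph abstraction, using made-up users and relationships: it builds a directed multigraph with attributes on vertices and edges and applies inDegrees, one of the fundamental graph operators.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphxExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GraphxExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Vertices: (id, property) pairs; here, made-up user names.
    val users = sc.parallelize(Seq(
      (1L, "alice"), (2L, "bob"), (3L, "carol")
    ))

    // Edges: directed relationships carrying a string property.
    val relationships = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(2L, 3L, "follows"),
      Edge(3L, 1L, "likes")
    ))

    // Build the property graph: a directed multigraph with vertex and edge attributes.
    val graph = Graph(users, relationships)

    // A fundamental operator: count how many edges point at each vertex.
    graph.inDegrees.collect().foreach { case (id, deg) =>
      println(s"vertex $id has in-degree $deg")
    }

    sc.stop()
  }
}
```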

6. Apache SparkR:

SparkR was introduced in Apache Spark 1.4, and its key component is the SparkR DataFrame. DataFrames are a fundamental data structure for storing and manipulating data in R, and the DataFrame concept has been generalized to other languages through libraries such as pandas.

R also includes tools for data manipulation, calculation, and graphical display. The main idea behind SparkR was therefore to explore different techniques for combining the usability of R with the scalability of Spark. SparkR is an R package that offers a lightweight frontend for using Apache Spark from R. SparkR has several advantages:

1. Data Sources API: By tying into Spark SQL's data sources API, SparkR can read data from a variety of sources, including JSON files, Hive tables, and Parquet files.

2. DataFrame Optimizations: SparkR DataFrames inherit all the optimizations made to the computation engine in terms of code generation and memory management.

3. Scalability: Operations on SparkR DataFrames are distributed across all the cores and machines available in the Spark cluster. As a result, SparkR DataFrames scale to terabytes of data and to clusters with thousands of machines.