Characteristics of Big Data
Application of Big Data Processing
Introduction to BIG DATA
Where to get Big Data?
Types of Big Data
Storage layer - HDFS (Hadoop Distributed File System)
MapReduce
YARN
How Hadoop works?
Hadoop Eco System
Hadoop Architecture
Hadoop Installation & Environment Setup
Setting Up A Single Node Hadoop Cluster
Ubuntu User Configuration
SSH Setup With Key Generation
Disable IPv6
Download and Install Hadoop 3.1.2
Working with Configuration Files
Start The Hadoop instances
Hadoop Distributed File System (HDFS)
HDFS Features and Goals
HDFS Architecture
Read Operations in HDFS
Write Operations In HDFS
HDFS Operations
YARN
YARN Features
YARN Architecture
Resource Manager
Node Manager
Application Master
Container
Application Workflow in Hadoop YARN
Hadoop MapReduce
How MapReduce Works?
MapReduce Examples with Python
Running The MapReduce Program & Storing The Data File To HDFS
Create A Python Script
Hadoop Environment Setup
Execute The Script
Apache Hive Definition
Why Apache Hive?
Features Of Apache Hive
Hive Architecture
Hive Metastore
Hive Query Language
SQL vs Hive
Hive Installation
Apache Pig Definition
MapReduce vs. Apache Pig vs. Hive
Apache Pig Architecture
Installation Process Of Apache Pig
Execute Apache Pig Script
Hadoop Eco Components
NoSQL Data Management
Apache Hbase
Apache Cassandra
Mongodb
Introduction To Kafka
The Architecture of Apache Flume
Apache Spark Ecosystem
Kafka is a publish-subscribe messaging system that is distributed across multiple servers so that it can handle large volumes of data flowing through it. A messaging system is an application that transfers data from one location to another, so that other applications can concentrate on their own work instead of on moving data.
Kafka Architecture
The diagram below shows a simple messaging system:
Fig: Conventional Messaging System
A producer is a sender that sends a message to the message queue. The message queue then transmits the message to the consumer. In this design, each message can be delivered to only one consumer. This is a point-to-point messaging system.
Kafka is a publish-subscribe messaging system, which differs from the conventional design. In Kafka, the producer publishes a message to a topic (similar to a message queue), and the message is stored on disk. To receive messages, multiple consumers can subscribe to the same topic. The diagram below shows the difference:
Fig: Kafka Messaging System
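The difference between the two models can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Kafka code: the class names `PointToPointQueue` and `PubSubTopic`, and the subscriber names `billing` and `shipping`, are hypothetical and chosen for this example.

```python
from collections import deque


class PointToPointQueue:
    """Conventional queue: each message is handed to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Taking the message removes it, so no other consumer can see it.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Publish-subscribe topic: messages are retained and every subscriber reads them."""

    def __init__(self):
        self._log = []        # retained messages (stands in for on-disk storage)
        self._positions = {}  # subscriber name -> index of the next message to read

    def publish(self, message):
        self._log.append(message)

    def subscribe(self, name):
        self._positions[name] = 0

    def poll(self, name):
        # Return everything this subscriber has not read yet, then advance its position.
        start = self._positions[name]
        self._positions[name] = len(self._log)
        return self._log[start:]


# Point-to-point: the second consumer finds nothing left.
queue = PointToPointQueue()
queue.send("order-1")
queue_first = queue.receive()
queue_second = queue.receive()

# Publish-subscribe: both subscribers receive the same message.
topic = PubSubTopic()
topic.subscribe("billing")
topic.subscribe("shipping")
topic.publish("order-1")
billing_view = topic.poll("billing")
shipping_view = topic.poll("shipping")

print(queue_first, queue_second)
print(billing_view, shipping_view)
```

The key design point is that the topic keeps the message after it is read and tracks a separate read position for each subscriber, whereas the queue discards the message as soon as one consumer takes it.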
Kafka Terminology
Broker: a server running Kafka that hosts topics.
Topic: a collection of data stored in the form of messages. A topic can be split into many partitions.
Partitions: subsets of a topic. Splitting a topic into multiple partitions increases throughput, because the partitions can be written and read in parallel, and it lets related messages be grouped together inside a topic. A topic can be viewed as a collection of partitions.
Offset: a unique sequential ID for a message within a partition.
Replicas: copies of a topic's partitions, kept to help prevent data loss.
Kafka Cluster: a group of brokers that work together to sustain throughput without downtime.
Producers: applications that transfer data from the source to the broker. The broker then appends each message to the appropriate topic and partition. A producer may write to a particular topic or even to a particular partition of a topic.
Consumers: applications that read data out of a Kafka topic.
Let's put all of those terms together. One or more Kafka brokers make up a Kafka cluster. A broker hosts one or more Kafka topics. Kafka topics can be split into partitions. Other brokers hold replicas to prevent data loss. A producer puts data into a particular partition of a topic on a specific broker, and a consumer retrieves that data.
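To make the relationship between brokers, topics, partitions, and offsets concrete, here is a minimal in-memory sketch in Python. It is a toy model rather than the Kafka client API: the `Topic` and `Broker` classes, their methods, and the `clicks` topic are hypothetical. The partition is chosen with a CRC32 hash of the message key purely as a stand-in; the real Kafka client uses its own partitioning logic. Replication is left out to keep the example short.

```python
import zlib


class Topic:
    """Toy topic: a list of partitions, each an append-only list of messages."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1  # offset: the message's position within this partition


class Broker:
    """Toy broker: hosts topics and serves producers and consumers."""

    def __init__(self):
        self.topics = {}

    def create_topic(self, name, num_partitions):
        self.topics[name] = Topic(name, num_partitions)

    def produce(self, topic_name, key, message, partition=None):
        topic = self.topics[topic_name]
        if partition is None:
            # Assumption for illustration: hash the key so that the same key
            # always lands in the same partition.
            partition = zlib.crc32(key.encode("utf-8")) % len(topic.partitions)
        offset = topic.append(partition, message)
        return partition, offset

    def consume(self, topic_name, partition, offset):
        # Return every message in the partition starting at the given offset.
        return self.topics[topic_name].partitions[partition][offset:]


broker = Broker()
broker.create_topic("clicks", num_partitions=3)

# Two messages with the same key go to the same partition, at consecutive offsets.
p1, o1 = broker.produce("clicks", "user-42", "page=/home")
p2, o2 = broker.produce("clicks", "user-42", "page=/cart")

# A producer may also name the partition explicitly.
p3, o3 = broker.produce("clicks", "user-7", "page=/help", partition=2)

print(p1, o1, p2, o2, p3, o3)
print(broker.consume("clicks", p1, o2))
```

Because an offset is only a position inside one partition, a consumer in this sketch must name both the partition and the offset to say where it wants to resume reading.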
Why Kafka?
1. Scalability: Apache Kafka scales horizontally, so brokers can be added without downtime.
2. Reliability: Since Kafka is spread across many brokers and data is replicated, it is highly reliable and fault-tolerant.
3. Durability: Kafka writes messages to disk as quickly as possible so that the data remains safe if a broker goes down.
4. Performance: Due to its ability to scale, Kafka sustains a high data throughput, making it a high-throughput messaging service.
With all of these advantages, Kafka is regarded as fast and able to run with very little downtime or data loss.