
Introduction To Kafka

Kafka is a publish-subscribe messaging system that is distributed across multiple servers so that it can handle the large volumes of data flowing through it. A messaging system is an application that transfers data from one location to another, so that other applications can concentrate on their own work instead of on moving data.

Kafka Architecture

The diagram below shows a simple messaging system:

                           Fig: Conventional Messaging System

A producer sends a message to the message queue, and the message queue then delivers the message to a consumer. In this model, each message is delivered to only one consumer. This is a point-to-point messaging system.

Kafka is a publish-subscribe messaging system, which differs from the conventional messaging system. In Kafka, the producer publishes a message to a topic (similar to a message queue), and the message is stored on disk. To receive messages, multiple consumers can subscribe to the same topic. The diagram below shows the difference:

                                                    Fig: Kafka Messaging System
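
The difference between the two models can be sketched in a few lines of Python. This is only a toy in-memory illustration, not Kafka code: the class names PointToPointQueue and PubSubTopic and the subscriber names are made up for this example, and a plain list stands in for the log that Kafka keeps on disk.

```python
from collections import deque


class PointToPointQueue:
    """Toy queue: each message is delivered to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # The message is removed on delivery, so no other consumer can see it.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Toy topic: messages are retained and each subscriber tracks its own position."""

    def __init__(self):
        self._log = []        # append-only list standing in for the on-disk log
        self._positions = {}  # subscriber name -> index of the next message to read

    def publish(self, message):
        self._log.append(message)

    def subscribe(self, name):
        self._positions[name] = 0

    def poll(self, name):
        # Return everything this subscriber has not read yet, then advance it.
        start = self._positions[name]
        self._positions[name] = len(self._log)
        return self._log[start:]


queue = PointToPointQueue()
queue.send("order-1")
first = queue.receive()   # the only consumer that gets "order-1"
second = queue.receive()  # nothing left for anyone else

topic = PubSubTopic()
topic.subscribe("billing")
topic.subscribe("shipping")
topic.publish("order-1")
billing_view = topic.poll("billing")
shipping_view = topic.poll("shipping")
```

In the queue, the second receive call finds nothing, while both subscribers of the topic see the same message because reading does not remove it from the log.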

Kafka Terminology

  • Broker: A server with Kafka installed that hosts topics.

  • Topic: A collection of data stored in the form of messages. A topic can be split into many partitions.

  • Partitions: A partition is a subset of a topic. Splitting a topic into multiple partitions increases throughput, because the partitions can be written and read in parallel. Partitions also let the data inside a topic be organized and consumed in different patterns. A topic can be viewed as a collection of partitions.

  • Offset: A unique sequence ID for a message within a partition of a topic.

  • Replicas: Backup copies of a topic's partitions, kept to help prevent data loss.

  • Kafka Cluster: A group of brokers that work together to sustain throughput without downtime.

  • Producers: Applications that transfer data from the source to the broker. The broker then appends each message to the associated topic and partition. A producer may write to a particular topic or even to a particular partition of a topic.

  • Consumers: Applications that read data out of a Kafka topic.

Let's put all of those terms together. One or more Kafka brokers make up a Kafka cluster. A broker hosts one or more Kafka topics. Kafka topics can be split into partitions. Other brokers hold replicas to prevent data loss. A producer puts data into a particular partition of a topic on a specific broker, and a consumer retrieves that data.
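
To make topics, partitions, and offsets more concrete, here is a minimal in-memory sketch in Python. It is not the Kafka client API: the Topic class and its methods are hypothetical, and hashing the key with CRC32 is an assumption chosen for simplicity (Kafka's own default partitioner uses a different hash function). In this sketch, offsets are counted separately within each partition.

```python
import zlib


class Topic:
    """Toy topic made of several partitions, each an append-only list."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumption for this sketch: CRC32 of the key modulo the partition count.
        # The same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value, partition=None):
        # A producer may name a partition explicitly or let the key decide.
        if partition is None:
            partition = self.partition_for(key)
        log = self.partitions[partition]
        offset = len(log)  # the offset is the message's position in its partition
        log.append((key, value))
        return partition, offset

    def read(self, partition, offset):
        # A consumer reads one partition, starting from a given offset.
        return self.partitions[partition][offset:]


topic = Topic("orders", num_partitions=3)
p1, o1 = topic.append("customer-42", "created")
p2, o2 = topic.append("customer-42", "paid")
history = topic.read(p1, 0)
```

Both messages share a key, so they land in the same partition with consecutive offsets, and a consumer reading that partition from offset 0 sees them in the order they were written.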

Why Kafka?

1. Scalability: Apache Kafka scales horizontally without any downtime.

2. Reliability: Since Kafka is spread across many brokers and data is replicated, it is highly reliable and fault-tolerant.

3. Durability: Kafka writes messages to disk as quickly as possible so that the data remains safe if a broker goes down.

4. Performance: Because it scales well, Kafka sustains high data throughput, which makes it a high-throughput messaging service.

With all of these advantages, Kafka is regarded as fast and able to run with virtually no downtime or data loss.
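
The reliability point can be illustrated with a small Python sketch of replication. This is a deliberately simplified model, not how Kafka implements it: the Broker and ReplicatedPartition classes are hypothetical, every write is copied synchronously to all live replicas, and the new leader is simply the first surviving broker.

```python
class Broker:
    """Toy broker holding its own copy of a single partition."""

    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.alive = True
        self.log = []


class ReplicatedPartition:
    """Toy partition replicated across several brokers, with one leader."""

    def __init__(self, brokers):
        self.replicas = list(brokers)
        self.leader = self.replicas[0]

    def append(self, message):
        # Simplification: copy the message to every broker that is still alive.
        for broker in self.replicas:
            if broker.alive:
                broker.log.append(message)

    def fail(self, broker):
        broker.alive = False
        if broker is self.leader:
            # Promote the first surviving replica to leader.
            survivors = [b for b in self.replicas if b.alive]
            if not survivors:
                raise RuntimeError("no live replica left")
            self.leader = survivors[0]

    def read_all(self):
        # Reads are served from the current leader's copy.
        return list(self.leader.log)


brokers = [Broker(i) for i in range(3)]
partition = ReplicatedPartition(brokers)
partition.append("m1")
partition.append("m2")
partition.fail(brokers[0])  # the original leader goes down
partition.append("m3")
recovered = partition.read_all()
```

After the first broker fails, the messages written before the failure are still readable from the new leader, and new writes continue to be accepted.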