Hadoop Tutorial

The Architecture of Apache Flume

Flume is a distributed, highly available, and reliable service for efficiently collecting, aggregating, and moving large amounts of streaming data. Its distributed design is what makes it available and dependable. In simple terms, it is a data ingestion tool that transfers data from one place to another and guarantees delivery.

The Architecture of Flume:

Let's discuss the Flume architecture shown in the diagram below:

Fig: Architecture of Flume

1. First, an event is the single unit of data that Flume transports.

2. A client is an entity that produces events and sends them to Flume.

3. A source is the component through which events enter Flume. There are two types of sources in Flume:

4. A passively waiting source waits for events to be pushed to it by the client.

5. An actively polling source repeatedly requests events from the client. In either case, the source forwards events to the channel.

6. The channel is the bridge between the source and the sink. Channels buffer events so that the sink is not overwhelmed by the incoming flow from the source.

7. Sinks are the components Flume uses to deliver data to its destination. While events sit in the channel, Flume groups them into transactions; a sink commits each transaction by writing it to a file system such as HDFS or by passing the data on to another Flume agent.
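As a concrete illustration, the source → channel → sink pipeline described above is wired together in a Flume agent's properties file. The following is a minimal sketch (the agent name `a1`, component names `r1`/`c1`/`k1`, port, and HDFS path are made up for this example), using the standard netcat source, memory channel, and HDFS sink:

```properties
# Name the components of this agent (agent name "a1" is arbitrary)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: a netcat source passively waits for events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: an in-memory buffer between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: write batches of events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.batchSize = 100

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

An agent started with this file reads events from the netcat port, buffers them in the memory channel, and commits them to HDFS in batches of up to 100 events.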

Apache Flume has a very straightforward architecture with few moving parts through which the data passes. It is well suited to taking streaming data from a source and writing it to HDFS in batches, and it can also write to HDFS in near real time.
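The buffering-and-batching idea behind the channel can be sketched in plain Python. This is only an illustration of the concept, not Flume's actual implementation: a bounded queue stands in for the channel, and the sink drains it in fixed-size transactions.

```python
from collections import deque

class Channel:
    """A bounded buffer between source and sink (stands in for a Flume channel)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        # The channel refuses new events once full, so a slow sink
        # exerts back-pressure on the source instead of losing data.
        if len(self.events) >= self.capacity:
            raise RuntimeError("channel full: sink is falling behind")
        self.events.append(event)

    def take(self, batch_size):
        """Drain up to batch_size events as one transaction."""
        batch = []
        while self.events and len(batch) < batch_size:
            batch.append(self.events.popleft())
        return batch

# The source pushes events into the channel...
channel = Channel(capacity=1000)
for i in range(250):
    channel.put(f"event-{i}")

# ...and the sink commits them downstream in fixed-size transactions.
delivered = []
while True:
    txn = channel.take(batch_size=100)
    if not txn:
        break
    delivered.append(txn)

print([len(t) for t in delivered])  # three transactions: 100, 100, 50
```

Running the sketch shows 250 buffered events leaving the channel as three transactions, which mirrors how a sink such as the HDFS sink writes batches rather than one event at a time.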