Hadoop Tutorial

Apache HBase

HBase is a column-oriented database that gives the user a dynamic database schema. It is called the Hadoop database because, although it is a NoSQL database, it runs on top of Hadoop. Because HBase runs on the Hadoop Distributed File System (HDFS), it blends Hadoop's scalability with real-time data access as a key/value store and the deep analytical capabilities of MapReduce. HBase also supports other high-level languages for data processing. Its distinguishing features include consistency and high availability, among others.

HBase can store very large datasets, from terabytes to petabytes. HBase tables can hold billions of rows and millions of columns. HBase is designed for low-latency operations, and its characteristics differ from those of traditional relational models.

Features of HBase:

  • HBase offers consistent reads and writes.

  • HBase provides atomic reads and writes at the row level: a write to a single row is applied entirely or not at all, and a concurrent reader never sees a partially written row.

  • HBase splits a region into smaller sub-regions automatically once the region reaches a threshold size, and regions can also be split manually. Splitting reduces I/O time and overhead.

  • It also works across LAN and WAN, which supports failover and recovery. At the core, a master server monitors the region servers and manages the metadata for the cluster.

  • HBase supports both linear and modular scalability.

  • In addition to Hadoop/HDFS integration, HBase can operate on top of other file systems.

  • HBase supports data replication across clusters.

  • HBase supports failover and load sharing.

  • HBase supports MapReduce, which enables parallel processing of large volumes of data. It also provides base classes for backing Hadoop MapReduce jobs with HBase tables.

  • Because HBase stores row keys in lexicographical order, searches over a range of rows are efficient. An optimized request can therefore be built using these sorted row keys and timestamps.

  • For real-time query processing, it supports a block cache and Bloom filters.

  • For faster lookups, HBase internally uses hash tables and offers random access, while storing the data in indexed HDFS files.

  • HBase supports both structured and semi-structured data.

  • As HBase is schema-less, there is no concept of a fixed column schema; a table definition specifies only column families.

  • For non-Java front-ends, HBase supports Thrift and REST APIs.
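
The row-key ordering mentioned in the list above can be illustrated with a small in-memory sketch. This is not the HBase client API: the `scan_range` helper and the sample keys are hypothetical, and a sorted Python list stands in for HBase's sorted storage. It suggests why a range scan only needs to locate the start key and read forward, and why numeric parts of row keys are often zero-padded.

```python
from bisect import bisect_left

def scan_range(sorted_keys, start_row, stop_row):
    """Return keys in [start_row, stop_row): start inclusive, stop exclusive."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

# Row keys are compared as byte strings, so the order is lexicographic, not numeric.
unpadded = sorted([b"user1", b"user2", b"user10", b"user20"])
padded = sorted([b"user01", b"user02", b"user10", b"user20"])

# With unpadded keys, "user10" sorts between "user1" and "user2".
unpadded_scan = scan_range(unpadded, b"user1", b"user2")

# With zero-padded keys, lexicographic order matches numeric order.
padded_scan = scan_range(padded, b"user01", b"user10")

print(unpadded_scan)
print(padded_scan)
```

Under these assumptions, the unpadded scan picks up `user10` even though the caller probably wanted only `user1`, which is the usual argument for designing row keys with the sort order in mind.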

Storage Mechanism in HBase:

HBase is a column-oriented database in which data is stored in tables, and each row is identified by a RowId. A row is a collection of the column families present in the table. Tables are sorted by RowId.

Each column family holds its data as key-value pairs and can contain multiple columns. Column values are stored on disk. Each cell of the table carries its own metadata, such as a timestamp.

Rowid | Column Family 1       | Column Family 2       | Column Family 3
      | Col 1 | Col 2 | Col 3 | Col 1 | Col 2 | Col 3 | Col 1 | Col 2 | Col 3
1     |       |       |       |       |       |       |       |       |
2     |       |       |       |       |       |       |       |       |
3     |       |       |       |       |       |       |       |       |
4     |       |       |       |       |       |       |       |       |

Fig: Storage Mechanism in HBase
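
The grid in the figure can also be read as a sparse, sorted map rather than a fixed-width table. The following is a toy in-memory model, not HBase code: the `put` and `sorted_cells` helpers, the family names `cf1` and `cf2`, and the sample values are all made up for illustration. It keeps one entry per non-empty cell and orders entries by row, column family, column, and then newest timestamp first, which mirrors the ordering HBase uses for its stored key-values.

```python
cells = {}  # (row, family, qualifier, timestamp) -> value

def put(row, family, qualifier, timestamp, value):
    """Record one cell version; empty cells are simply never stored."""
    cells[(row, family, qualifier, timestamp)] = value

def sorted_cells():
    """Order by row, family, qualifier ascending, then timestamp descending."""
    return sorted(
        cells.items(),
        key=lambda kv: (kv[0][0], kv[0][1], kv[0][2], -kv[0][3]),
    )

put("2", "cf1", "col1", 100, "a")
put("1", "cf2", "col3", 100, "b")
put("1", "cf1", "col1", 100, "c")
put("1", "cf1", "col1", 200, "d")  # newer version of the same cell

ordered_keys = [key for key, _ in sorted_cells()]
for key, value in sorted_cells():
    print(key, value)
```

Only four entries exist even though the grid has many more positions, which is one way to picture why wide, mostly empty tables are cheap in this model.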

The following are the key terms representing the table schema of HBase:

  • Table: A collection of rows.

  • Row: A collection of column families.

  • Column Family: Set of columns.

  • Column: Set of key-value pairs.

  • Namespace: Logical grouping of tables.

  • Cell: A {row, column, version} tuple precisely specifies a cell definition in HBase.
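
The key terms above can be sketched as nested dictionaries. This is a simplified in-memory model under stated assumptions, not the HBase API: the `put` and `get` helpers, the `info` family, and the sample values are hypothetical. A column is addressed here as a family plus a qualifier, and the version is a timestamp, so a cell is looked up by {row, column, version}; when no version is given, the sketch returns the newest one.

```python
def put(table, row, family, qualifier, timestamp, value):
    """Store one version of a cell at table[row][family][qualifier][timestamp]."""
    versions = (
        table.setdefault(row, {})
        .setdefault(family, {})
        .setdefault(qualifier, {})
    )
    versions[timestamp] = value

def get(table, row, family, qualifier, version=None):
    """Return the value at the given version, or the newest version if none is given."""
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    if version is None:
        version = max(versions)  # newest timestamp
    return versions.get(version)

table = {}
put(table, "row1", "info", "email", 1, "old@example.com")
put(table, "row1", "info", "email", 2, "new@example.com")

latest = get(table, "row1", "info", "email")
first = get(table, "row1", "info", "email", version=1)
missing = get(table, "row1", "info", "phone")

print(latest, first, missing)
```

Note that `row1` never declared an `email` column in advance; in this model, as in the schema description above, only the column family needs to exist up front.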