HBase

Mohit Sharma
8 min read · Dec 26, 2023

HBase is an open-source, distributed, NoSQL database that provides a way to store and manage large amounts of semi-structured or unstructured data. It is a column-oriented database: within a table, data is grouped into column families rather than a fixed set of columns. HBase is built on top of Hadoop and is designed to run on the Hadoop Distributed File System (HDFS). It provides low-latency access to large data sets, making it suitable for real-time data processing and analytics.

An HBase table (HTable) consists of rows and columns, much like a traditional relational database table. However, unlike a traditional database, an HTable can have a virtually unlimited number of columns, and rows do not all need to have the same columns.

HBase is composed of multiple components that work together to store and manage large datasets. The main components, and how data moves through them, are the following:

  1. HMaster: The HMaster is the main component of HBase that manages the cluster metadata, including region assignment, load balancing, and failover handling. It communicates with the ZooKeeper service to coordinate cluster-wide activities.
  2. RegionServer: The RegionServer is responsible for serving data to clients and managing the storage of data in HDFS. Each RegionServer is responsible for a set of HBase regions, which are subsets of the total dataset. Each region contains a contiguous range of row keys and is served by a single RegionServer. The RegionServer manages read and write requests from clients and communicates with the HDFS service to store and retrieve data.
  3. ZooKeeper: ZooKeeper is a distributed coordination service that is used by HBase to manage cluster metadata and provide coordination among HBase nodes. HBase uses ZooKeeper to manage HMaster election, region server coordination, and distributed lock management.
  4. HDFS: HDFS is the distributed file system that provides the storage layer for HBase. HBase stores data in HDFS files, which are split into multiple HDFS blocks and distributed across the HDFS data nodes.
  5. HFile: HFiles are the underlying on-disk storage format used by HBase. An HFile consists of a sequence of key-value pairs, where both the key and the value are byte arrays. An HFile is immutable once written, which makes it simple to read and cache. When data is written to an HTable, it is first recorded in a write-ahead log (WAL) and later flushed to an HFile.
  6. Cell: A cell is the smallest unit of data in HBase. Each cell consists of a row key, a column family, a column qualifier, a timestamp, and a value. The row key uniquely identifies the row, the column family and qualifier identify the column, the timestamp records when the cell was written, and the value is the actual data being stored.
  7. Namespace: A namespace is a logical container for a set of HTables. It is used to group related tables together and provides a way to organize them.
  8. Write path: When data is written to an HTable, it is first appended to the WAL for durability and then written to the in-memory MemStore. Once the MemStore reaches a certain threshold, it is flushed to disk as an HFile. HFiles are periodically compacted to reduce the number of files and improve read performance.
  9. Read path: When data is read from an HTable, HBase uses block indexes (and bloom filters) to locate the HFiles that contain the relevant data. The data read from those HFiles is merged with the contents of the MemStore to provide the most up-to-date view of the data. A minimal client-side write/read sketch follows this list.

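To make the cell model and the write/read path above concrete, here is a minimal sketch using the HBase Java client. The table name "users", column family "profile", and row key are illustrative assumptions; it presumes a reachable cluster and the hbase-client library on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write: one cell = row key + column family + qualifier + (timestamp) + value.
            // The RegionServer appends the edit to the WAL and updates its MemStore.
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                          Bytes.toBytes("alice@example.com"));
            table.put(put);

            // Read: the latest version of the cell is merged from the MemStore and HFiles.
            Get get = new Get(Bytes.toBytes("user#1001"));
            Result result = table.get(get);
            byte[] email = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```
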
Data flows through the HBase components in the following way:

  1. When a client wants to read or write data, it first contacts the ZooKeeper service to find the RegionServer hosting the hbase:meta table.
  2. The client reads hbase:meta to determine which RegionServer serves the region containing the requested row key, and caches that location for later requests.
  3. The client then sends the read or write request directly to that RegionServer; the HMaster is not on the data path and instead handles region assignment, load balancing, and failover.
  4. The RegionServer serves reads from its MemStore and the HFiles stored in HDFS, and for writes appends to the WAL and updates the MemStore, before returning the response to the client.
  5. Throughout this process, ZooKeeper coordinates the HBase components (master election, server liveness) and ensures that cluster-wide activities are handled correctly. A minimal client-side scan sketch follows this list.
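
From the client's point of view, this flow is mostly invisible: the application only configures the ZooKeeper quorum, and the client library resolves and caches region locations before talking to RegionServers directly. In the sketch below, the host names and the "users" table are placeholder assumptions, and an HBase 2.x client is assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseScanExample {
    public static void main(String[] args) throws Exception {
        // The client only needs the ZooKeeper quorum; region locations are looked up
        // via hbase:meta and cached, so requests go straight to the right RegionServer.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Scan a contiguous row-key range; it may span several regions, and the
            // client library switches between RegionServers transparently.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user#1000"))
                    .withStopRow(Bytes.toBytes("user#2000"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```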

HBase vs HDFS

HBase and HDFS are often mentioned together, but they sit at different layers and serve different purposes: HDFS is a distributed file system, while HBase is a database that stores its data on HDFS.

HDFS (Hadoop Distributed File System) is primarily designed to store and process large files across a cluster of commodity hardware. It is optimized for batch processing and sequential reads, making it a good fit for Hadoop MapReduce jobs. HDFS uses a write-once-read-many architecture and is optimized for high throughput rather than low latency.

On the other hand, HBase is a NoSQL database built on top of Hadoop and is designed to store and manage large amounts of sparse, semi-structured or unstructured data. It is optimized for random read/write access, making it a good fit for real-time applications. HBase uses a write-many-read-many architecture and is optimized for low latency rather than high throughput.

In summary, HDFS is good for batch processing (scans over big files), but it is:

  • Not good for record lookup
  • Not good for incremental addition of small batches
  • Not good for updates

HBase addresses these gaps:

  • Fast record lookup
  • Support for record-level insertion
  • Support for updates

HFile

An HFile is laid out in sections: a header, a series of data blocks, and a trailer. The header section contains metadata about the file, such as the compression algorithm used, the block size, and key and value lengths. The HBase RegionServer uses this information to read and write data to the file efficiently.

The data section contains the actual data stored in the HFile. Data in HFiles is stored in the form of Key-Value pairs. Each Key-Value pair is referred to as an HBase cell. Each cell contains a row key, column family, column qualifier, timestamp, and value. The row key is used to identify a unique row in HBase, while the column family and column qualifier are used to identify the specific column of that row. The timestamp represents the time at which the data was written to HBase, and the value is the actual data stored in that cell.

The data section is divided into multiple blocks, with each block containing multiple Key-Value pairs. Blocks are compressed to reduce their size and improve the overall performance of HBase. Compression algorithms such as gzip, snappy, or LZO are typically used for this purpose.

The trailer section contains additional metadata about the file, such as the offset of the last data block and the bloom filter. Bloom filters are used to speed up lookups by providing a way to quickly determine if a key exists in the HFile or not.
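
Block compression and bloom filters are configured per column family when a table is created (or altered), and they govern how the underlying HFiles are written and looked up. The sketch below uses the HBase 2.x Java admin API; the "users" table and "profile" family are placeholder assumptions.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTableWithCompression {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {

            // Compression and bloom filter settings apply to the HFiles of this family.
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("profile"))
                    .setCompressionType(Compression.Algorithm.SNAPPY) // compress HFile data blocks
                    .setBloomFilterType(BloomType.ROW)                // row-level bloom filter in the HFile
                    .build();

            TableDescriptor table = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("users"))
                    .setColumnFamily(cf)
                    .build();
            admin.createTable(table);
        }
    }
}
```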

In addition to the HFile format, HBase also uses a few other components for storing and managing data. The HTable is the main client-facing abstraction for accessing HBase data, while the namespace provides a way to logically group tables and qualify their names. Overall, the HBase architecture is designed to provide a scalable and reliable way to store and manage large amounts of data while ensuring fast and efficient access to that data.

HRegion

The architecture of HRegion can be explained as follows:

  • HRegionServer: Each HRegion is hosted by an HRegionServer process, and a single RegionServer typically hosts many regions. The HRegionServer handles client requests and manages the storage and retrieval of data for its regions.
  • HLog (Write-Ahead-Log): The RegionServer maintains a Write-Ahead-Log, an append-only log stored in HDFS. All data modifications are appended to the HLog before being applied to the MemStore, so in case of a crash HBase can replay the HLog to recover modifications that had not yet been flushed to HFiles.
  • MemStore: Each HRegion maintains a MemStore, an in-memory structure holding the modifications made since the last flush to an HFile. The MemStore keeps cells in a sorted map (a concurrent skip list in current versions), ordered by row key, column family, qualifier, and timestamp, so a flush can write them out in sorted order; a toy sketch of this idea follows the list.
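
To illustrate the idea (this is a toy model, not HBase's actual implementation), the sketch below keeps edits in a sorted in-memory map and "flushes" them in key order once a threshold is reached, mirroring how a MemStore is flushed to an immutable, sorted HFile.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Toy illustration only: a MemStore-like buffer that keeps edits sorted by row key
// and "flushes" them in order once a size threshold is reached.
public class ToyMemStore {
    private final ConcurrentNavigableMap<String, String> buffer = new ConcurrentSkipListMap<>();
    private final int flushThreshold;

    public ToyMemStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public void put(String rowKey, String value) {
        buffer.put(rowKey, value);              // kept sorted by row key
        if (buffer.size() >= flushThreshold) {
            flush();
        }
    }

    private void flush() {
        // A real flush writes a sorted, immutable HFile to HDFS; here we just print.
        for (Map.Entry<String, String> e : buffer.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        buffer.clear();
    }

    public static void main(String[] args) {
        ToyMemStore memStore = new ToyMemStore(3);
        memStore.put("user#2", "bob");
        memStore.put("user#1", "alice");
        memStore.put("user#3", "carol");        // third put triggers the sorted flush
    }
}
```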

Hive

Hive is a data warehousing tool built on top of Hadoop. It provides a SQL-like interface to interact with large datasets stored in Hadoop Distributed File System (HDFS) or other compatible distributed storage systems such as Amazon S3 or Azure Data Lake Storage.

The main components of Hive are:

  1. Metastore: It is a relational database that stores metadata information about Hive tables such as schema, location, partition information, and other relevant details. It provides a central repository for metadata information that can be used by different Hive clients.
  2. Driver: The Hive Driver is responsible for receiving SQL-like queries submitted by users, optimizing them, and submitting them to the appropriate execution engine (e.g., MapReduce, Spark, or Tez).
  3. Query Compiler: It translates the HiveQL (Hive Query Language) statements into a series of MapReduce or Spark jobs.
  4. Execution Engine: It executes the MapReduce or Spark jobs generated by the Query Compiler.
  5. HCatalog: It provides a unified metadata management service for Hadoop that enables Pig, Hive, and MapReduce jobs to share metadata information.

The architecture of Hive is designed to allow users to interact with big data using SQL-like queries. The HiveQL statements are translated into MapReduce or Spark jobs, which are then executed on a Hadoop cluster. Hive supports various file formats such as Apache Avro, ORC, Parquet, and SequenceFile, and can read data from different sources such as HDFS, HBase, and Amazon S3.
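
As an illustration of the SQL-like interface, the sketch below submits a HiveQL query through the HiveServer2 JDBC driver. The host name, the default port 10000, and the "web_logs" table are placeholder assumptions, and the hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL; behind the scenes the compiler turns it into
            // MapReduce, Tez, or Spark jobs that run on the cluster.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```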

When a Hive query is submitted, the query goes through the following stages:

  1. Parsing: Hive parses the HiveQL statement and checks it for syntax and semantic errors.
  2. Compilation: After parsing, Hive compiles the query into an execution plan, which is a set of MapReduce or Spark jobs.
  3. Optimization: Hive applies various optimizations such as query pruning, predicate pushdown, and partition pruning to reduce the data processed by the query.
  4. Execution: Finally, Hive submits the compiled MapReduce or Spark jobs to the Hadoop cluster for execution.

Hive uses a schema-on-read approach, which means that the schema is applied to the data during query execution, rather than when the data is loaded. This approach allows users to work with a variety of data formats and structures without the need to predefine a schema.
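
To illustrate schema-on-read, the sketch below declares an external table over files that already exist in HDFS; the schema is only applied when queries read the data. The HDFS path, delimiter, and column names are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveSchemaOnReadExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default");
             Statement stmt = conn.createStatement()) {
            // The files under /data/web_logs were written before this DDL runs; the
            // table definition just layers a schema on top of them at query time.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (" +
                "  page STRING, user_id BIGINT, ts TIMESTAMP) " +
                "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' " +
                "STORED AS TEXTFILE " +
                "LOCATION '/data/web_logs'");
        }
    }
}
```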

Hope you enjoyed reading. I’m always open to suggestions and new ideas. Please write to me :)
