Friday, August 30, 2013

Let's Discuss HBase - Row Key Design

HBase: Row-Key Design (Design Solutions and Pros/Cons)


{{{Row-Key Design}}}

Try to keep row keys short: the row key is stored with every cell in an HBase table, so a noticeably shorter row key reduces the amount of storage HBase needs for the data. This advice also applies to column family names.
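To get a feel for how much the key length matters, here is a rough back-of-the-envelope sketch. It assumes the classic KeyValue layout (length fields, row key, family, qualifier, timestamp, type, value) and ignores compression, block encoding and tags, so treat the numbers as an illustration only. The function names and the sample keys are made up for this example.

```python
# Rough per-cell size estimate, assuming the classic KeyValue layout:
# 4-byte key length + 4-byte value length + 2-byte row length + row key
# + 1-byte family length + family + qualifier + 8-byte timestamp + 1-byte type.
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1  # 20 bytes per cell


def cell_size(row_key: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate bytes for one cell, before compression or encoding."""
    return FIXED_OVERHEAD + len(row_key) + len(family) + len(qualifier) + len(value)


def table_size(rows: int, columns_per_row: int, row_key: bytes, family: bytes,
               qualifier: bytes, value: bytes) -> int:
    """The row key and family name are repeated in every cell of every row."""
    return rows * columns_per_row * cell_size(row_key, family, qualifier, value)


# Hypothetical table: 1 million rows, 10 columns each, 8-byte values.
long_key = b"customer-" + b"0" * 16   # 25-byte row key
short_key = b"\x00" * 8               # 8-byte row key

long_design = table_size(1_000_000, 10, long_key, b"details", b"q", b"12345678")
short_design = table_size(1_000_000, 10, short_key, b"d", b"q", b"12345678")
saved = long_design - short_design
print(long_design, short_design, saved)
```

In this made-up case the shorter row key and family name cut the raw size from about 610 MB to about 380 MB, even though the stored values are identical.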

A common problem is choosing between sequential row keys and randomly distributed row keys. Each option has its own trade-offs:

Some mixed-design approaches allow fast range scans while still distributing writes across all region servers in the cluster when the data is sequential by nature.

Design Solution: Using sequential row keys (e.g. time-series data with a row key built from a timestamp)
Pros: Makes it possible to perform fast range scans by setting start/stop keys on the scanner
Cons: Creates hotspotting on a single RegionServer when writing data (as row keys arrive in sequence, all new records end up written into a single region at a time)
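
Both sides of this trade-off can be seen in a small sketch that does not touch HBase at all. It assumes row keys are timestamps encoded as fixed-width big-endian bytes (so byte order matches numeric order) and models a table as a sorted list of keys. `ts_key` and `range_scan` are names invented for the illustration.

```python
import bisect
import struct


def ts_key(epoch_ms: int) -> bytes:
    """Fixed-width big-endian encoding keeps byte order equal to numeric order."""
    return struct.pack(">Q", epoch_ms)


def range_scan(sorted_keys, start_key: bytes, stop_key: bytes):
    """Return keys in [start_key, stop_key): start inclusive, stop exclusive."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


keys = sorted(ts_key(t) for t in (1000, 1500, 2000, 2500, 3000))

# Pro: a time range is one contiguous slice of the sorted key space.
hits = range_scan(keys, ts_key(1500), ts_key(3000))
print([struct.unpack(">Q", k)[0] for k in hits])

# Con: every new key sorts after all existing ones, so all new writes
# land at the tail of the key space, i.e. in the same (last) region.
new_key = ts_key(3001)
print(new_key > keys[-1])
```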


Design Solution: Using randomly distributed row keys (e.g. UUIDs)
Pros: Aims for the fastest write performance by distributing new records over random regions
Cons: Does not allow fast range scans over the written data
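
One mixed-design approach mentioned earlier is often called salting or bucketing: put a small bucket prefix in front of an otherwise sequential key. The sketch below is only an illustration under assumptions of my own: the bucket count, the MD5-based bucket choice, the key layout and the function names are all hypothetical, not something HBase prescribes.

```python
import hashlib
import struct

NUM_BUCKETS = 4  # hypothetical; often chosen close to the number of region servers


def salted_key(source_id: str, epoch_ms: int) -> bytes:
    """One-byte bucket prefix, then a big-endian timestamp, then the source id."""
    digest = hashlib.md5(source_id.encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS
    return bytes([bucket]) + struct.pack(">Q", epoch_ms) + source_id.encode("utf-8")


def scan_ranges(start_ms: int, stop_ms: int):
    """A time-range read turns into one (start_key, stop_key) pair per bucket."""
    return [
        (bytes([b]) + struct.pack(">Q", start_ms),
         bytes([b]) + struct.pack(">Q", stop_ms))
        for b in range(NUM_BUCKETS)
    ]


key = salted_key("sensor-1", 1000)
ranges = scan_ranges(1000, 2000)
print(key[0], len(ranges))
```

Writes from different sources spread over the buckets, while keys inside each bucket stay in time order. The price is that a time-range read needs one scan per bucket, with the results merged on the client side.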


{{{Column Families}}}
Currently, HBase does not do well with anything above two or three column families per table. With that said, keep the number of column families in your schema low. Try to make do with one column family in your schemata if you can. Only introduce a second and third column family in the case where data access is usually column-scoped; i.e. you usually query no more than a single column family at one time.

You can also set a TTL (time to live, in seconds) for a column family. HBase automatically deletes rows once they reach the expiration time.

{{{Versions}}}
The maximum number of row versions that can be stored is configured per column family (the default is 3; HBase 0.96 and later lower it to 1). This is an important parameter because HBase does not overwrite row values, but rather stores different values per row by time (and qualifier). Setting the maximum number of versions to an exceedingly high level (e.g., hundreds or more) is not a good idea because it will greatly increase StoreFile size.

The minimum number of row versions to keep can also be configured per column family (the default is 0, meaning the feature is disabled). This parameter is used together with the TTL and maximum row versions parameters to allow configurations such as “keep the last T minutes worth of data of at least M versions, and at most N versions.” This parameter should only be set when TTL is enabled for a column family, and it must be less than the maximum number of row versions.
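
The way TTL, minimum versions and maximum versions combine is easier to see in a toy model. The function below is a simplified sketch of the rule as I read it (a version survives if it is among the newest N and is either younger than the TTL or among the newest M). It is not HBase code, and the name `retained_versions` is made up.

```python
def retained_versions(timestamps_ms, now_ms, ttl_seconds, min_versions, max_versions):
    """Return the cell timestamps that survive, newest first.

    A version is kept when it is among the newest `max_versions` and is
    either younger than the TTL or among the newest `min_versions`.
    """
    newest_first = sorted(timestamps_ms, reverse=True)
    cutoff_ms = now_ms - ttl_seconds * 1000
    kept = []
    for rank, ts in enumerate(newest_first):
        if rank >= max_versions:
            break  # never keep more than the maximum
        if ts >= cutoff_ms or rank < min_versions:
            kept.append(ts)
    return kept


# TTL of 60 s, at least 1 version, at most 3 versions, "now" at 100 s.
recent = retained_versions([95_000, 90_000, 50_000, 30_000, 10_000],
                           now_ms=100_000, ttl_seconds=60,
                           min_versions=1, max_versions=3)
stale = retained_versions([30_000, 10_000],
                          now_ms=100_000, ttl_seconds=60,
                          min_versions=1, max_versions=3)
print(recent, stale)
```

In the second call every version is older than the TTL, yet the newest one is still kept because of the minimum of 1.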

{{{Data Types}}}
HBase supports a “bytes-in/bytes-out” interface via Put and Result, so anything that can be converted to an array of bytes can be stored as a value. Input can be strings, numbers, complex objects, or even images, as long as they can be rendered as bytes.
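
As a small illustration of "anything that can be rendered as bytes", the sketch below round-trips a string, a long, a double and a small object using only the Python standard library. The encodings (UTF-8, big-endian 8-byte numbers, JSON) are choices made for this example; HBase itself does not care which encoding you pick, as long as readers and writers agree.

```python
import json
import struct

# Encode: every value becomes a plain byte string.
encoded_name = "row-42".encode("utf-8")
encoded_count = struct.pack(">q", 1234)     # 8-byte big-endian signed long
encoded_price = struct.pack(">d", 19.99)    # 8-byte IEEE 754 double
encoded_obj = json.dumps({"a": 1, "b": [2, 3]}, sort_keys=True).encode("utf-8")

# Decode: the reader has to know which encoding the writer used,
# because the stored bytes carry no type information of their own.
name = encoded_name.decode("utf-8")
count = struct.unpack(">q", encoded_count)[0]
price = struct.unpack(">d", encoded_price)[0]
obj = json.loads(encoded_obj.decode("utf-8"))

print(name, count, price, obj)
```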

One supported data type that deserves special mention is the “counters” type. This type enables atomic increments of numbers.......
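
To show what "atomic increment" buys you, here is a toy stand-in for a counter cell. It assumes the counter is stored as an 8-byte big-endian long and uses a plain lock to make the read-modify-write step atomic. The `CounterCell` class is hypothetical and only models the idea, not how HBase implements it.

```python
import struct
import threading


class CounterCell:
    """Toy counter cell: an 8-byte big-endian long guarded by a lock."""

    def __init__(self):
        self._value = struct.pack(">q", 0)
        self._lock = threading.Lock()

    def increment(self, amount: int = 1) -> int:
        # Read, add and write back under one lock, so no update is lost.
        with self._lock:
            current = struct.unpack(">q", self._value)[0]
            self._value = struct.pack(">q", current + amount)
            return current + amount


cell = CounterCell()


def worker():
    for _ in range(1000):
        cell.increment()


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = cell.increment(0)  # incrementing by 0 just reads the current value
print(total)
```

Without the atomic step, two writers could read the same old value and one of the two increments would be lost.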

Some Case study for designing the row key-
