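To solve our database multi-write problem, a colleague recommended HBase and ran an HBase training session. I also saw that our boss Tim, after attending a conference, said Taobao has replaced some of its core MySQL applications with HBase, so I am studying it to see whether it fits our case.
Fallacies of distributed computing: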
1 The network is reliable.
2 Latency is zero.
3 Bandwidth is infinite.
4 The network is secure.
5 Topology doesn't change.
6 There is one administrator.
7 Transport cost is zero.
8 The network is homogeneous.
Downloaded version 0.92.1: 889 Java files (find . -name '*.java' | wc -l), 285,749 lines of Java code.
Summary of the table of contents of "HBase: The Definitive Guide":
- HBase history
November 2006
Google releases paper on BigTable
February 2007
Initial HBase prototype created as Hadoop contrib
October 2007
First “usable” HBase (Hadoop 0.15.0)
January 2008
Hadoop becomes an Apache top-level project, HBase becomes subproject
October 2008
HBase 0.18.1 released
January 2009
HBase 0.19.0 released
September 2009
HBase 0.20.0 released, the performance release
May 2010
HBase becomes an Apache top-level project
June 2010
HBase 0.89.20100621, first developer release
January 2011
HBase 0.90.0 released, the durability and stability release
Mid 2011
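HBase 0.92.0 released, tagged as coprocessor and security release
- Limitations of RDBMS
Take the example application “Hush, the HBase URL Shortener”: as traffic grows you have to add slaves and add caches, you are limited to simple queries, and you keep optimizing and scaling reads and writes by splitting tables and databases, changing the application to do sharding, and buying better hardware, with endless nightmares to follow.
- HBase's column-oriented tables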
the most basic unit is a column. One or more columns form a
row that is addressed uniquely by a row key. A number of rows, in turn, form a table,
and there can be many of them. Each column may have multiple versions, with each
distinct value contained in a separate cell.
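(Table, RowKey, Family, Column, Timestamp) → Value, which can be expressed in a programming language as: SortedMap<RowKey, List<SortedMap<Column, List<Value, Timestamp>>>> (p19)
The same row key can hold data with different timestamps, each corresponding to a different version. Data is stored in HFiles, whose block index sits at the end of the file and is kept in memory (the default block size is 64 KB). The HFiles are in turn stored in the Hadoop Distributed File System (HDFS), which ensures that data written across servers is not lost.
- HBase auto-sharding
Regions are the units that manage and monitor sharding. “Each region is served by exactly one region server, and each of these servers can serve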
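The nested-map view of the data model can be sketched in plain Java. This is only a toy in-memory model, not the HBase client API: the class `ToyTable` and its methods are hypothetical names, values are simplified to `String`, and the `List<Value, Timestamp>` of the formula is assumed to be a map from timestamp to value sorted newest first.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model of one table (hypothetical, not the HBase API).
class ToyTable {
    // row key -> family -> column -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    // Stores one version of a cell; an existing version with the same timestamp is overwritten.
    void put(String rowKey, String family, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .computeIfAbsent(column, k -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Returns all versions of a cell, newest first (empty if the cell does not exist).
    NavigableMap<Long, String> versions(String rowKey, String family, String column) {
        NavigableMap<Long, String> empty = new TreeMap<>();
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families = rows.get(rowKey);
        if (families == null) return empty;
        NavigableMap<String, NavigableMap<Long, String>> columns = families.get(family);
        if (columns == null) return empty;
        NavigableMap<Long, String> cellVersions = columns.get(column);
        return cellVersions == null ? empty : cellVersions;
    }

    // Returns the newest version of a cell, or null if the cell does not exist.
    String getLatest(String rowKey, String family, String column) {
        NavigableMap<Long, String> cellVersions = versions(rowKey, family, column);
        return cellVersions.isEmpty() ? null : cellVersions.firstEntry().getValue();
    }
}
```

Writing the same cell twice with different timestamps keeps both versions, and `getLatest` returns the one with the larger timestamp.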
many regions at any time”
- Data write path
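To make the "exactly one region server per region" rule concrete, here is a minimal sketch that treats each region as a contiguous range of row keys identified by its start key. The class `ToyRegionDirectory`, its methods, and the server names are hypothetical; a real cluster tracks regions in its own catalog and decides splits itself.

```java
import java.util.TreeMap;

// Toy region directory: maps the start key of each region to the server hosting it.
class ToyRegionDirectory {
    private final TreeMap<String, String> serverByStartKey = new TreeMap<>();

    ToyRegionDirectory(String initialServer) {
        // A new table starts as a single region covering every row key ("" sorts first).
        serverByStartKey.put("", initialServer);
    }

    // Splits the region containing splitKey; the upper half starts at splitKey
    // and is assigned to the given server.
    void split(String splitKey, String serverForUpperHalf) {
        serverByStartKey.put(splitKey, serverForUpperHalf);
    }

    // Every row key falls into exactly one region, hence exactly one server:
    // the region with the greatest start key that is <= rowKey.
    String serverFor(String rowKey) {
        return serverByStartKey.floorEntry(rowKey).getValue();
    }

    int regionCount() {
        return serverByStartKey.size();
    }
}
```

After `split("m", "rs2")`, keys below "m" still resolve to the original server while "m" and everything above resolve to "rs2"; one server may appear as the value for many regions.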
When data is updated it is first written to a commit log, called a write-ahead log (WAL)
in HBase, and then stored in the in-memory memstore. Once the data in memory has
exceeded a given maximum value, it is flushed as an HFile to disk. After the flush, the
commit logs can be discarded up to the last unflushed modification. While the system
is flushing the memstore to disk, it can continue to serve readers and writers without
having to block them. Since flushing memstores to disk causes more and more HFiles to be created, HBase
has a housekeeping mechanism that merges the files into larger ones using compaction.
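The write path described here (log first, then memstore, flush to a file, merge files by compaction) can be sketched as a toy single-threaded store. Everything in it is an assumption for illustration: the class `ToyStore`, the in-memory lists standing in for the WAL and the HFiles, and a flush threshold counted in entries rather than bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy store illustrating WAL -> memstore -> flush -> compaction (not HBase code).
class ToyStore {
    private final int flushThreshold;                        // max entries kept in memory
    private final List<String> wal = new ArrayList<>();      // stand-in for the write-ahead log
    private TreeMap<String, String> memstore = new TreeMap<>();
    private final List<TreeMap<String, String>> hfiles = new ArrayList<>(); // oldest first

    ToyStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);        // 1. record the edit in the log
        memstore.put(key, value);          // 2. then apply it in memory
        if (memstore.size() > flushThreshold) {
            flush();                       // 3. memory limit exceeded: write a new file
        }
    }

    private void flush() {
        hfiles.add(memstore);              // the sorted memstore becomes an immutable "HFile"
        memstore = new TreeMap<>();
        wal.clear();                       // every logged edit is now persisted in a file
    }

    // Reads check the memstore first, then the files from newest to oldest.
    String get(String key) {
        String value = memstore.get(key);
        for (int i = hfiles.size() - 1; value == null && i >= 0; i--) {
            value = hfiles.get(i).get(key);
        }
        return value;
    }

    // Merges all files into one; entries from newer files overwrite older ones.
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : hfiles) {
            merged.putAll(file);
        }
        hfiles.clear();
        if (!merged.isEmpty()) {
            hfiles.add(merged);
        }
    }

    int hfileCount() { return hfiles.size(); }

    int walSize() { return wal.size(); }
}
```

With a threshold of 2, the third `put` triggers a flush and empties the log; repeated flushes accumulate files until `compact()` merges them, and reads return the same values before and after the merge.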
There are two types of compaction: minor compactions and major compactions. (p24)
- HBase components
the client library, one master server, and many region servers. The HBase master server uses ZooKeeper to manage the region servers and handles load balancing, unloading busy servers. Compared with Google BigTable, HBase adds “push-down predicates, that is, filters, reducing data transferred over the network”
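The effect of a push-down predicate can be illustrated with a toy model in which the filter runs where the data lives, so only matching rows are returned to the caller. The class `ToyRegionServer` and the `rowsSent` counter are hypothetical and do not use the real HBase filter classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Toy region server that evaluates a filter before "sending" rows to the client.
class ToyRegionServer {
    private final TreeMap<String, String> rows = new TreeMap<>();
    private int rowsSent = 0;   // counts rows that would cross the network

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Scans all rows in key order; the predicate is applied on the server side,
    // so non-matching rows are never added to the result.
    List<String> scan(Predicate<String> valueFilter) {
        List<String> matchingKeys = new ArrayList<>();
        for (Map.Entry<String, String> row : rows.entrySet()) {
            if (valueFilter.test(row.getValue())) {
                matchingKeys.add(row.getKey());
                rowsSent++;
            }
        }
        return matchingKeys;
    }

    int rowsSent() {
        return rowsSent;
    }
}
```

Scanning three stored rows with a filter that matches one of them returns a single key and increments `rowsSent` by one, whereas filtering on the client would have required transferring all three.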