The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing, including:
Hadoop Core, our flagship sub-project, provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing metaphor.
HBase builds on Hadoop Core to provide a scalable, distributed database.
Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates for critical shared state.
Hive is a data warehouse infrastructure built on Hadoop Core that provides data summarization, ad hoc querying, and analysis of datasets.
Pig
Pig raises the level of abstraction for processing large datasets. With MapReduce, there is a map function and there is a reduce function, and working out how to fit your data processing into this pattern, which often requires multiple MapReduce stages, can be a challenge.
Pig is made up of two pieces:
- The language used to express data flows, called Pig Latin.
- The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.
A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output. Taken as a whole, the operations describe a data flow, which the Pig execution environment translates into an executable representation and then runs. Under the covers, Pig turns the transformations into a series of MapReduce jobs, but as a programmer you are mostly unaware of this, which allows you to focus on the data rather than the nature of the execution.
Pig is a scripting language for exploring large datasets. One criticism of MapReduce is that the development cycle is very long. Writing the mappers and reducers, compiling and packaging the code, submitting the job(s), and retrieving the results is a time-consuming business, and even with Streaming, which removes the compile and package step, the experience is still involved. Pig's sweet spot is its ability to process terabytes of data simply by issuing a half-dozen lines of Pig Latin from the console.
To sum up what Pig is and why the project exists, we have to start from the shortcomings of the MapReduce model.
When you actually use the MapReduce model, the first problem you run into is one of modeling: given a complex data-analysis task, working out how to abstract it and turn it into a series of MapReduce stages is often difficult and takes real skill.
And once the analysis has been abstracted into a series of MapReduce stages, you still have to implement them in Java, which in practice turns out to be quite tedious.
So is there a framework that lets users focus purely on the data flow of their processing while hiding the underlying MapReduce and Java? Enter Pig, a project Yahoo! donated to Apache, itself developed along the lines of a Google project.
Pig includes Pig Latin, a scripting language designed specifically for describing data flows. Look at the example below.
An Example
Let’s look at a simple example by writing the program to calculate the maximum recorded temperature by year for the weather dataset in Pig Latin. The complete program is only a few lines long:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
This short piece of code solves the max-temperature-by-year problem introduced earlier. Compare it with the Java version from before and you will never want to write that code in Java again.
What each statement means is explained below, step by step. Grunt is Pig's interactive command-line shell.
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);
For simplicity, the program assumes that the input is tab-delimited text, with each line having just year, temperature, and quality fields.
LOAD reads the file. By default it assumes that each line contains a number of tab-separated fields, and what each field means, i.e. the schema, is specified in the AS clause.
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
You can tell what the next step means just by reading it.
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
The third statement uses the GROUP function to group the records relation by the year field.
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
grunt> max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
grunt> DUMP max_temp;
(1949,111)
(1950,22)
If you want to know how Pig parses these scripts, read the following passage.
As a Pig Latin program is executed, each statement is parsed in turn. If there are syntax errors, or other (semantic) problems such as undefined aliases, the interpreter will halt and display an error message. The interpreter builds a logical plan for every relational operation, which forms the core of a Pig Latin program. The logical plan for the statement is added to the logical plan for the program so far, then the interpreter moves on to the next statement.
It’s important to note that no data processing takes place while the logical plan of the program is being constructed.
When the Pig Latin interpreter sees the first line containing the LOAD statement, it confirms that it is syntactically and semantically correct, and adds it to the logical plan, but it does not load the data from the file (or even check whether the file exists). Similarly, Pig validates the GROUP and FOREACH ... GENERATE statements, and adds them to the logical plan without executing them. The trigger for Pig to start processing is the DUMP statement (a STORE statement also triggers processing). At that point, the logical plan is compiled into a physical plan and executed.
The type of physical plan that Pig prepares depends on the execution environment. For local execution, Pig will create a physical plan that runs in a single local JVM, whereas for execution on Hadoop, Pig compiles the logical plan into a series of MapReduce jobs.
You can see the logical and physical plans created by Pig using the EXPLAIN command on a relation (EXPLAIN max_temp; for example). In MapReduce mode, EXPLAIN will also show the MapReduce plan, which shows how the physical operators are grouped into MapReduce jobs. This is a good way to find out how many MapReduce jobs Pig will run for your query.
To summarize: when Pig parses a script, it does not execute it statement by statement. Instead it checks each statement for correctness and adds it to a logical plan, and only later is the logical plan translated into a physical plan and executed. Whether the script runs locally or distributed is transparent to the user; the system only behaves differently when translating the logical plan into a physical plan.
And if you want to know what kind of MapReduce jobs Pig turns your script into, you can find out with the EXPLAIN command, which is quite interesting.
The Pig Latin relational operators are listed below; from them you can get a rough idea of what operations Pig can perform on data.
Pig Latin relational operators
Loading and storing
- LOAD: Loads data from the filesystem or other storage into a relation
- STORE: Saves a relation to the filesystem or other storage
- DUMP: Prints a relation to the console
Filtering
- FILTER: Removes unwanted rows from a relation
- DISTINCT: Removes duplicate rows from a relation
- FOREACH ... GENERATE: Adds or removes fields from a relation
- STREAM: Transforms a relation using an external program
Grouping and joining
- JOIN: Joins two or more relations
- COGROUP: Groups the data in two or more relations
- GROUP: Groups the data in a single relation
- CROSS: Creates the cross product of two or more relations
Sorting
- ORDER: Sorts a relation by one or more fields
- LIMIT: Limits the size of a relation to a maximum number of tuples
Combining and splitting
- UNION: Combines two or more relations into one
- SPLIT: Splits a relation into two or more relations
HBase
HBase is a distributed column-oriented database built on top of HDFS. HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.
Here is a canonical HBase use case.
The canonical HBase use case is the webtable, a table of crawled web pages and their attributes (such as language and MIME type) keyed by the web page URL. The webtable is large, with row counts that run into the billions. Batch analytic and parsing MapReduce jobs are continuously run against the webtable, deriving statistics and adding new columns of MIME type and parsed text content for later indexing by a search engine. Concurrently, the table is randomly accessed by crawlers running at various rates, updating random rows, while random web pages are served in real time as users click on a website's cached-page feature.
Concepts
Whirlwind Tour of the Data Model
Applications store data into labeled tables. Tables are made of rows and columns. Table cells—the intersection of row and column coordinates—are versioned. By default, their version is a timestamp auto-assigned by HBase at the time of cell insertion. A cell's content is an uninterpreted array of bytes.
Table row keys are also byte arrays, so theoretically anything can serve as a row key, from strings to binary representations of longs or even serialized data structures. Table rows are sorted by row key, the table's primary key.
Row columns are grouped into column families. All column family members have a common prefix, so, for example, the columns temperature:air and temperature:dew_point are both members of the temperature column family, whereas station:identifier belongs to the station family. The column family prefix must be composed of printable characters. The qualifying tail can be made of any arbitrary bytes.
A table's column families must be specified up front as part of the table schema definition, but new column family members can be added on demand.
Physically, all column family members are stored together on the filesystem. So, though earlier we described HBase as a column-oriented store, it would be more accurate if it were described as a column-family-oriented store.
In synopsis, HBase tables are like those in an RDBMS, only cells are versioned, rows are sorted, and columns can be added on the fly by the client as long as the column family they belong to preexists.
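To make the model concrete, here is a minimal, hedged sketch of writing and reading one cell with the classic Java HTable client; the table name, row key, and column names are illustrative assumptions, and newer HBase releases replace HTable with the Connection/Table API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class WebtableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");          // assumed table name

        // Write one versioned cell: row key plus column family "contents",
        // qualifier "html". Everything is just bytes to HBase.
        Put put = new Put(Bytes.toBytes("com.example.www"));
        put.add(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                Bytes.toBytes("<html>...</html>"));
        table.put(put);

        // Read the latest version of the same cell back.
        Get get = new Get(Bytes.toBytes("com.example.www"));
        Result result = table.get(get);
        byte[] html = result.getValue(Bytes.toBytes("contents"),
                Bytes.toBytes("html"));
        System.out.println(Bytes.toString(html));

        table.close();
    }
}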
To summarize: at an abstract level you can think of HBase as also using a table structure, but unlike a strict relational table it is designed for scalability. You do not have to declare a type for each column (everything is a byte array), the number of columns can differ from row to row, and sparse data does not waste large amounts of space the way it would in a relational table. So it is a more scalable and more flexible kind of table than a relational one.
But have you ever wondered why this table structure is so flexible? The answer is that the "table" is really just something you imagine: it is not a true table structure at all, even though Google calls it BigTable.
Reference: http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
Fortunately, Google's BigTable paper clearly explains what BigTable actually is. Here is the first sentence of the "Data Model" section:
A Bigtable is a sparse, distributed, persistent multidimensional sorted map.
The BigTable paper continues, explaining that:
The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
That's right: it is just a map sorted by row key. The qualifiers in front of "map" are fairly easy to understand; see the article above for details.
Regions
Tables are automatically partitioned horizontally by HBase into regions. Each region comprises a subset of a table's rows. A region is defined by its first row, inclusive, and last row, exclusive, plus a randomly generated region identifier.
HBase automatically splits large tables into regions; this is transparent to the user.
Implementation
Just as HDFS and MapReduce are built of clients, slaves, and a coordinating master—namenode and datanodes in HDFS and jobtracker and tasktrackers in MapReduce—so HBase is characterized by an HBase master node orchestrating a cluster of one or more regionserver slaves. The HBase master is responsible for bootstrapping a virgin install, for assigning regions to registered regionservers, and for recovering regionserver failures. The master node is lightly loaded. The regionservers carry zero or more regions and field client read/write requests. They also manage region splits, informing the HBase master about the new daughter regions so that it can manage the offlining of the parent region and the assignment of the replacement daughters. HBase depends on ZooKeeper, and by default it manages a ZooKeeper instance as the authority on cluster state.
HBase Versus RDBMS
HBase and other column-oriented databases are often compared to more traditional and popular relational databases or RDBMSs. As described previously, HBase is a distributed, column-oriented data storage system. It picks up where Hadoop left off by providing random reads and writes on top of HDFS.
Strictly speaking, an RDBMS is a database that follows Codd's 12 Rules. Typical RDBMSs are fixed-schema, row-oriented databases with ACID properties and a sophisticated SQL query engine. The emphasis is on strong consistency, referential integrity, abstraction from the physical layer, and complex queries through the SQL language. You can easily create secondary indexes, perform complex inner and outer joins, count, sum, sort, group, and page your data across a number of tables, rows, and columns.
For a majority of small- to medium-volume applications, there is no substitute for the ease of use, flexibility, maturity, and powerful feature set of available open source RDBMS solutions like MySQL and PostgreSQL. However, if you need to scale up in terms of dataset size, read/write concurrency, or both, you'll soon find that the conveniences of an RDBMS come at an enormous performance penalty and make distribution inherently difficult. The scaling of an RDBMS usually involves breaking Codd's rules, loosening ACID restrictions, forgetting conventional DBA wisdom, and on the way losing most of the desirable properties that made relational databases so convenient in the first place.
Countless applications, businesses, and websites have successfully achieved scalable, fault-tolerant, and distributed data systems built on top of RDBMSs and are likely using many of the previous strategies. But what you end up with is something that is no longer a true RDBMS, sacrificing features and conveniences for compromises and complexities.
HBase
Enter HBase, which has the following characteristics:
No real indexes
Rows are stored sequentially, as are the columns within each row. Therefore, no issues with index bloat, and insert performance is independent of table size.
Automatic partitioning
As your tables grow, they will automatically be split into regions and distributed across all available nodes.
Scale linearly and automatically with new nodes
Add a node, point it to the existing cluster, and run the regionserver. Regions will automatically rebalance and load will spread evenly.
Commodity hardware
Clusters are built on $1,000–$5,000 nodes rather than $50,000 nodes. RDBMSs are I/O hungry, and I/O is the most costly type of hardware.
Fault tolerance
Lots of nodes means each is relatively insignificant. No need to worry about individual node downtime.
Batch processing
MapReduce integration allows fully parallel, distributed jobs against your data with locality awareness.
If you stay up at night worrying about your database (uptime, scale, or speed), then you should seriously consider making a jump from the RDBMS world to HBase.
To summarize: relational databases have a lot going for them, being mature, stable, and rigorous, but they simply cannot handle data at large scale. There are many large-scale optimization solutions aimed at relational databases, but they all break the principles and properties the RDBMS was built on, so many of its advantages can no longer be realized.
So when you face large-scale data, give HBase a try; with all the advantages listed above, why hesitate? ^_^
NoSQL databases of the same kind:
There are other projects competing for the same position in the stack, in particular Facebook's Cassandra and LinkedIn's Project Voldemort.
ZooKeeper
In a distributed environment, most services can tolerate partial failure and even some inconsistency, but a few of the most fundamental services require high reliability and strong consistency, because other distributed services depend on them in order to function, for example a naming service or distributed locks. These fundamental distributed services have the following requirements:
- High availability
- Strong consistency
- High performance
How to design a service like this, which somewhat pushes against the CAP theorem, is both a challenge and an interesting research topic, and Apache's ZooKeeper may give us a good answer. ZooKeeper is a distributed, open-source coordination service for distributed applications; it exposes a simple set of primitives on top of which distributed applications can implement synchronization, configuration maintenance, naming services, and so on. (http://blog.csdn.net/cutesource/archive/2010/08/18/5822459.aspx)
For Hadoop, the namenode and jobtracker are exactly this kind of important, fundamental node whose high availability must be guaranteed, and ZooKeeper can be used to help maintain them. The same goes for HBase, which relies on ZooKeeper to coordinate its master server and regionservers.
So how does ZooKeeper actually work?
ZooKeeper is a highly available, high-performance coordination service.
Data Model
ZooKeeper maintains a hierarchical tree of nodes called znodes. A znode stores data and has an associated ACL. ZooKeeper is designed for coordination (which typically uses small data files), not high-volume data storage, so there is a limit of 1 MB on the amount of data that may be stored in any znode.
In other words, ZooKeeper stores coordination information, which is usually small; it is kept as a hierarchical tree, and the actual data lives in the tree's nodes, the znodes.
Data access is atomic: all reads and writes are atomic operations.
Znodes are referenced by paths, which in ZooKeeper are represented as slash-delimited Unicode character strings, like filesystem paths in Unix, for example /zoo/duck or /zoo/cow.
Ephemeral znodes
Znodes can be one of two types: ephemeral or persistent. A znode's type is set at creation time and may not be changed later. An ephemeral znode is deleted by ZooKeeper when the creating client's session ends.
Ephemeral znodes are ideal for building applications that need to know when certain distributed resources are available.
So there is a kind of znode, the ephemeral znode, that is deleted automatically when the client session ends; the lock service described later makes use of this.
Sequence numbers
A sequential znode is given a sequence number by ZooKeeper as a part of its name. If a znode is created with the sequential flag set, then the value of a monotonically increasing counter (maintained by the parent znode) is appended to its name.
Sequence numbers can be used to impose a global ordering on events in a distributed system, and may be used by the client to infer the ordering. In the lock service described later, you will see how sequential znodes are used to build a shared lock.
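As a minimal sketch of these znode types with the ZooKeeper Java client; the connect string and the paths /app and /app/worker- are illustrative assumptions, not taken from the book.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypesSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

        // A persistent znode outlives the client session that created it.
        zk.create("/app", "some-config".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // An ephemeral sequential znode: ZooKeeper appends a monotonically
        // increasing counter (kept by the parent) to the name, and deletes
        // the znode when this client's session ends.
        String path = zk.create("/app/worker-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println(path);   // e.g. /app/worker-0000000001

        zk.close();
    }
}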
Watches
Watches allow clients to get notifications when a znode changes in some way. Watches are set by operations on the ZooKeeper service, and are triggered by other operations on the service.
There is an example in “A Configuration Service” demonstrating how to use watches to update configuration across a cluster.
In other words, a watch is a trigger you attach to a znode for certain operations, such as an existence check or a data change; when such an operation happens, the watch sends a notification to the client.
Operations
- create: Creates a znode (the parent znode must already exist)
- delete: Deletes a znode (the znode must not have any children)
- exists: Tests whether a znode exists and retrieves its metadata
- getACL, setACL: Gets or sets the ACL for a znode
- getChildren: Gets a list of the children of a znode
- getData, setData: Gets or sets the data associated with a znode
- sync: Synchronizes a client's view of a znode with ZooKeeper
These are the operations available on znodes.
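As a short, hedged sketch, here is how a few of these operations look with the Java client, reusing the illustrative /app znode from the earlier sketch.

import java.util.List;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeOpsSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

        // /app is assumed to exist (see the previous sketch); exists()
        // would return null for a missing znode.
        Stat stat = zk.exists("/app", false);
        byte[] data = zk.getData("/app", false, null);          // read the data, no watch
        zk.setData("/app", "new-config".getBytes(), stat.getVersion());
        List<String> children = zk.getChildren("/app", false);  // list children, no watch
        System.out.println(children);

        zk.close();
    }
}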
Implementation
The ZooKeeper service can run in two modes. In standalone mode, there is a single ZooKeeper server, which is useful for testing due to its simplicity (it can even be embedded in unit tests), but provides no guarantees of high availability or resilience. In production, ZooKeeper runs in replicated mode, on a cluster of machines called an ensemble. ZooKeeper achieves high availability through replication, and can provide a service as long as a majority of the machines in the ensemble are up.
As mentioned above, ZooKeeper keeps its information in a hierarchical tree, but standalone mode has little practical value: it is a single point of failure, so if that one server dies the service is gone. Only replicated mode gives high availability; as long as a majority of the servers in the ensemble are healthy, the ZooKeeper service remains available.
But this raises a question: when you store data on several servers, i.e. an ensemble, how do you guarantee that the data on all of them stays consistent? ZooKeeper does it as follows.
ZooKeeper uses a protocol called Zab that runs in two phases, which may be repeated indefinitely:
Phase 1: Leader election
The machines in an ensemble go through a process of electing a distinguished member, called the leader. The other machines are termed followers. This phase is finished once a majority (or quorum) of followers have synchronized their state with the leader.
Phase 2: Atomic broadcast
All write requests are forwarded to the leader, which broadcasts the update to the followers. When a majority have persisted the change, the leader commits the update, and the client gets a response saying the update succeeded. The protocol for achieving consensus is designed to be atomic, so a change either succeeds or fails.
It resembles two-phase commit.
If the leader fails, the remaining machines hold another leader election and continue as before with the new leader. If the old leader later recovers, it then starts as a follower.
Does ZooKeeper Use Paxos?
No. ZooKeeper's Zab protocol is not the same as the well-known Paxos algorithm.
Google's Chubby Lock Service, which shares similar goals with ZooKeeper, is based on Paxos.
That said, you can think of ZooKeeper as an optimized realization of the same ideas as Paxos. Some Paxos-related material:
Paxos Made Simple (Chinese translation): http://blog.csdn.net/sparkliang/archive/2010/07/16/5740882.aspx
Common Paxos application scenarios in large systems: http://timyang.net/distributed/paxos-scenarios/
Consistency
The terms “leader” and “follower” for the machines in an ensemble are apt, for they make the point that a follower may lag the leader by a number of updates. This is a consequence of the fact that only a majority and not all of the ensemble needs to have persisted a change before it is committed.
A follower's view of the data will inevitably lag the leader's, and the leader commits a change as soon as a majority of followers have persisted it.
Every update made to the znode tree is given a globally unique identifier, called a zxid (which stands for "ZooKeeper transaction ID"). Updates are ordered, so if zxid z1 is less than z2, then z1 happened before z2, according to ZooKeeper, which is the single authority on ordering in the distributed system.
This is also why all updates must be forwarded to the leader: a single global id is needed to guarantee the ordering of updates.
The following guarantees for data consistency flow from ZooKeeper’s design:
Sequential consistency
Updates from any particular client are applied in the order that they are sent.
Atomicity
Updates either succeed or fail. This means that if an update fails, no client will ever see it.
Single system image
A client will see the same view of the system regardless of the server it connects to.
Durability
Once an update has succeeded, it will persist and will not be undone. This means updates will survive server failures.
Each znode update is written to persistent storage (disk) first, and memory is updated afterwards.
Timeliness
The lag in any client’s view of the system is bounded, so it will not be out of date by more than some multiple of tens of seconds. This means that rather than allow a client to see data that is very stale, a server will shut down, forcing the client to switch to a more up-to-date server.
Below are a few examples of how ZooKeeper can be used.
A Configuration Service
One of the most basic services that a distributed application needs is a configuration service so that common pieces of configuration information can be shared by machines in a cluster. At the simplest level, ZooKeeper can act as a highly available store for configuration, allowing application participants to retrieve or update configuration files. Using ZooKeeper watches, it is possible to create an active configuration service, where interested clients are notified of changes in configuration.
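A minimal sketch of such an active configuration client follows, assuming a single illustrative znode /config/timeout holds the value; the path, value format, and class name are assumptions, not from the book. Because watches are one-shot, the client re-registers after every notification.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcherSketch implements Watcher {
    private final ZooKeeper zk;

    public ConfigWatcherSketch(ZooKeeper zk) throws Exception {
        this.zk = zk;
        readAndWatch();
    }

    private void readAndWatch() throws KeeperException, InterruptedException {
        // Read the current value and register this object as the watcher;
        // the watch fires only once, so it must be set again after each read.
        byte[] data = zk.getData("/config/timeout", this, new Stat());
        System.out.println("config = " + new String(data, StandardCharsets.UTF_8));
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                readAndWatch();   // pick up the new value and reset the watch
            } catch (KeeperException | InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}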
A Lock Service
A distributed lock is a mechanism for providing mutual exclusion between a collection of processes. At any one time, only a single process may hold the lock. Distributed locks can be used for leader election in a large distributed system, where the leader is the process that holds the lock at any point in time.
The leader election here is not the same as ZooKeeper's own internal leader election; what is described here is a general-purpose technique.
The pseudocode for lock acquisition is as follows:
1. Create an ephemeral sequential znode named lock- under the lock znode and remember its actual path name (the return value of the create operation).
2. Get the children of the lock znode and set a watch.
3. If the path name of the znode created in 1 has the lowest number of the children returned in 2, then the lock has been acquired. Exit.
4. Wait for the notification from the watch set in 2 and go to step 2.
The idea is simple: first designate a lock znode, typically describing the entity being locked on, say /leader; then clients that want to acquire the lock create sequential ephemeral znodes as children of the lock znode. At any point in time, the client with the lowest sequence number holds the lock.
The process works like this:
When a client wants to acquire the lock, it establishes a session with ZooKeeper and creates an ephemeral sequential znode, so the znode names increase in the order in which clients attempt to acquire the lock; for example, if an earlier client already produced lock-1, your attempt will produce the znode lock-2.
The client whose znode has the lowest lock number holds the lock. When it is finished and needs to release the lock, it simply closes its client session; because the znode is ephemeral, it is deleted automatically when the session ends.
A waiting client will be notified that it has the lock by creating a watch that fires when znodes go away.
As you can see, a watch set by a client only fires once, so when a znode is deleted a notification goes out to every waiting client. On receiving it, each client checks whether it now holds the lowest-numbered znode: if so, it has acquired the lock; if not, it must set the watch again and recheck, which is step 4 above.
This notification pattern is inefficient when there are many clients: a single deletion triggers a flood of notifications of which only one matters, so it should be optimized so that each client watches only for the deletion of the single znode immediately ahead of its own.
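Here is a minimal, hedged Java sketch of the unoptimized pseudocode above (the variant that watches the whole children list); the lock root /leader and the class name are illustrative assumptions.

import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkLockSketch {
    private static final String LOCK_ROOT = "/leader";   // the lock znode, assumed to exist
    private final ZooKeeper zk;
    private String myNode;                                // path returned by create()

    public ZkLockSketch(ZooKeeper zk) { this.zk = zk; }

    public void acquire() throws KeeperException, InterruptedException {
        // Step 1: create an ephemeral sequential child named "lock-".
        myNode = zk.create(LOCK_ROOT + "/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        while (true) {
            CountDownLatch childrenChanged = new CountDownLatch(1);
            // Step 2: get the children of the lock znode and set a watch.
            List<String> children = zk.getChildren(LOCK_ROOT,
                    event -> childrenChanged.countDown());
            Collections.sort(children);
            // Step 3: if our znode has the lowest sequence number, we hold the lock.
            if (myNode.endsWith(children.get(0))) {
                return;
            }
            // Step 4: wait for the watch to fire, then go back to step 2.
            childrenChanged.await();
        }
    }

    public void release() throws KeeperException, InterruptedException {
        zk.delete(myNode, -1);   // or just close the session: the znode is ephemeral
    }
}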
Server Monitor
Imagine a group of servers that provide some service to clients.
We need to maintain a group membership list that clients can query to find out which servers are available; when a server fails it should be removed from the list, and when it recovers it should be added back automatically.
The membership list clearly cannot be stored on a single node in the network, as the failure of that node would mean the failure of the whole system (we would like the list to be highly available). Suppose for a moment that we had a robust way of storing the list. We would still have the problem of how to remove a server from the list if it failed. Some process needs to be responsible for removing failed servers, but note that it can’t be the servers themselves, since they are no longer running!
This problem is trickier than it looks. First, the list cannot live on a single node; otherwise, if that node fails, the whole service goes down with it. So it has to be stored as replicas on multiple servers, and then keeping those replicas consistent becomes a big problem. Even if that is solved, how do we monitor the servers and dynamically remove failed ones from the list? We could use a dedicated process for that, but what happens if the server that process runs on crashes? Getting a headache yet?
OK, ZooKeeper solves this problem rather nicely. (What follows is my own understanding; the book does not spell it out.)
Each server, when it starts, automatically runs a client that establishes a session with ZooKeeper and creates an ephemeral znode; the client keeps sending heartbeats to keep the session alive.
So once all the servers are up, each one has a znode representing it in ZooKeeper, and ZooKeeper's hierarchical tree itself becomes the group membership list of the service; and since ZooKeeper is replicated, there is no single point of failure to worry about.
When a server crashes, its client session ends, and the znode it created is deleted automatically because it is ephemeral.
So there is no need for a separate process to monitor the servers and explicitly remove failed ones from the list.
Likewise, when a server recovers, its client re-establishes the session automatically; no other process has to intervene.
Clients simply query ZooKeeper to find out which servers are currently available.
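A minimal sketch of this registration and lookup pattern with the ZooKeeper Java client; the parent znode /servers, the class, and the method names are illustrative assumptions.

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MembershipSketch {
    // Each server registers itself as an ephemeral child of /servers;
    // the znode disappears automatically when the server's session ends.
    public static String register(ZooKeeper zk, String serverId)
            throws KeeperException, InterruptedException {
        return zk.create("/servers/" + serverId, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Clients list the live servers and can pass a watcher to be notified
    // when the membership changes.
    public static List<String> liveServers(ZooKeeper zk, Watcher onChange)
            throws KeeperException, InterruptedException {
        return zk.getChildren("/servers", onChange);
    }
}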
Ah, impressive. ZooKeeper really does take some effort to understand...
Hive
Why do we need Hive?
When we started using Hadoop, we very quickly became impressed by its scalability and availability. However, we were worried about widespread adoption, primarily because of the complexity involved in writing MapReduce programs in Java (as well as the cost of training users to write them). We were aware that a lot of engineers and analysts in the company understood SQL as a tool to query and analyze data and that a lot of them were proficient in a number of scripting languages like PHP and Python. As a result, it was imperative for us to develop software that could bridge this gap between the languages that the users were proficient in and the languages required to program Hadoop .
Hadoop is excellent in terms of scalability and availability, but writing MapReduce programs in Java is cumbersome and difficult. Most engineers and analysts are familiar with SQL and with scripting languages such as PHP and Python, so being able to query and process the huge datasets stored in HDFS directly with a SQL-like language is very convenient, and that is exactly what Hive provides.
Hive's motivation is quite similar to Pig's: both aim to provide a unified programming layer on top of Hadoop that lowers the barrier to developing and using MapReduce, and the two overlap to some extent. The difference is that Pig uses the Pig Latin scripting language while Hive uses a SQL-like language; Pig Latin is procedural, where SQL is declarative.
So their use cases differ: Pig Latin is better suited to describing a data-flow style of processing, while Hive is better suited to querying and analyzing huge datasets.
http://developer.yahoo.com/blogs/hadoop/posts/2010/08/pig_and_hive_at_yahoo/
See the link above for the differences between the two; it explains them quite clearly.
Let me begin with a little background on processing and using large data. Data processing often splits into three separate tasks: data collection, data preparation, and data presentation.
The data preparation phase is often known as ETL (Extract Transform Load) or the data factory.
The data presentation phase is usually referred to as the data warehouse.
Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.
What is Hive?
Hive is a data warehouse infrastructure built on top of Hadoop and serves as the predominant tool that is used to query the data stored in Hadoop at Facebook.
It is a system that models data as tables and partitions and that also provides a SQL-like language for query and analysis. Also essential is the ability to plug customized MapReduce programs, written in the programming language of the user's choice, into a query.
Since Hive is described here as a data warehouse that models data as tables and partitions, it is natural to ask how Hive differs from HBase.
http://stackoverflow.com/questions/24179/how-does-hive-compare-to-hbase
Hive is an analytics tool. Just like Pig, it was designed for ad hoc batch processing of potentially enormous amounts of data by leveraging MapReduce. Think terabytes. Imagine trying to do that in a relational database...
HBase is a column-based key-value store based on BigTable. You can't do queries per se, though you can run MapReduce jobs over HBase. Its primary use case is fetching rows by key, or scanning ranges of rows. A major feature is being able to have data locality when scanning across ranges of row keys for a 'family' of columns.
From this you can see that HBase and Hive actually address different problem domains: HBase's main purpose is to give Hadoop low-latency random access, whereas Hive gives Hadoop a SQL-like toolset for analysis and querying; Hive makes no promise of low latency.
Hive is based on Hadoop, which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. For Hive queries, response times for even the smallest jobs can be on the order of 5-10 minutes, and for larger jobs this may even run into hours.
From one perspective, Hive consists of five main components: a SQL-like grammar and parser, a query planner, a query execution engine, a metadata repository, and a columnar storage layout. Its primary focus is data warehouse-style analytical workloads, so low latency retrieval of values by key is not necessary.
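To make the SQL-like interface concrete, here is a hypothetical sketch that issues the max-temperature-by-year query from the Pig example over Hive's JDBC interface; the driver class, connection URL, and the existence of a records table are assumptions (HiveServer1-era interface), not taken from the sources quoted here.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");   // assumed driver
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // The same filter/group/aggregate logic as the Pig Latin example,
        // expressed declaratively in a SQL-like query.
        ResultSet rs = stmt.executeQuery(
                "SELECT year, MAX(temperature) FROM records " +
                "WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9) " +
                "GROUP BY year");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
        }
        con.close();
    }
}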
Data organization
Data is organized consistently across all datasets and is stored compressed, partitioned, and sorted:
Compression
Almost all datasets are stored as sequence files using the gzip codec. Older datasets are recompressed with the bzip codec, which gives substantially more compression than gzip. Bzip is slower than gzip, but older data is accessed much less frequently, and this performance hit is well worth the savings in terms of disk space.
Partitioning
Most datasets are partitioned by date. Going forward, we are also going to be partitioning data on multiple attributes (for example, country and date).
Sorting
Each partition within a table is often sorted (and hash-partitioned) by unique ID (if one is present).