Actually, we're going to talk about the Apache Hadoop ecosystem.
Most of this comes from the official Apache docs.
1/Apache Kafka
It is a message queue. Other systems used as message queues: RabbitMQ, Redis.
What is Kafka?
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
Simply put, it is a log-based messaging system. It reminds me of RabbitMQ, which is also a messaging system.
So, google the differences between them.
TL;DR reference: http://www.quora.com/What-are-the-differences-between-Apache-Kafka-and-RabbitMQ
Also, Kafka has traditionally depended on ZooKeeper (newer releases can run without it in KRaft mode).
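To make the "partitioned commit log" idea a bit more concrete, here is a minimal sketch in plain Python. It is a toy, not the Kafka API: the class name `PartitionedLog`, its methods, and the CRC-based partitioning are all assumptions made for illustration, and replication is left out entirely.

```python
from zlib import crc32


class PartitionedLog:
    """Toy in-memory partitioned, append-only log (hypothetical, not the Kafka API)."""

    def __init__(self, num_partitions):
        # One append-only list of messages per partition.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a message; the key picks the partition. Returns (partition, offset)."""
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1

    def read(self, partition, offset):
        """Return every message in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=2)
p1, o1 = log.append("user-1", "login")
p2, o2 = log.append("user-1", "click")
```

Messages with the same key land in the same partition and get increasing offsets. Reading does not delete anything, so a consumer can re-read from any offset it remembers, which is the main way a log differs from a classic queue.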
2/Apache ZooKeeper
What is ZooKeeper?
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
What is its aim?
ZooKeeper aims at distilling the essence of these different services into a very simple interface to a centralized coordination service. The service itself is distributed and highly reliable. Consensus, group management, and presence protocols will be implemented by the service so that the applications do not need to implement them on their own. Application-specific uses of these will consist of a mixture of specific components of ZooKeeper and application-specific conventions. ZooKeeper Recipes shows how this simple service can be used to build much more powerful abstractions.
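As a rough illustration of the "shared configuration that everyone gets notified about" idea, here is a single-process sketch. It does not use ZooKeeper at all: `ToyCoordinator`, its `watch`/`set`/`get` methods, and the `/config/db_url` path are hypothetical names, and the real service is distributed and replicated, which this toy is not.

```python
class ToyCoordinator:
    """Toy in-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self._data = {}      # path -> value
        self._watchers = {}  # path -> list of callbacks

    def watch(self, path, callback):
        """Register a callback to run whenever the value at path changes."""
        self._watchers.setdefault(path, []).append(callback)

    def set(self, path, value):
        """Store a value and notify every watcher of that path."""
        self._data[path] = value
        for callback in self._watchers.get(path, []):
            callback(path, value)

    def get(self, path):
        """Return the current value at path, or None if nothing is stored."""
        return self._data.get(path)


coordinator = ToyCoordinator()
seen = []
coordinator.watch("/config/db_url", lambda path, value: seen.append((path, value)))
coordinator.set("/config/db_url", "db-1:5432")
coordinator.set("/config/db_url", "db-2:5432")
```

The point of the sketch is only the shape of the interface: applications read and write small pieces of shared state and react to changes, instead of each inventing its own coordination protocol.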
3/Apache Storm
What is Storm?
Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Where to use it?
Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
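To get a feel for processing a stream one tuple at a time, here is a small sketch using Python generators. It runs in a single process and does not use Storm. In Storm's vocabulary the source would be a spout and the processing steps would be bolts, but the function names below are made up for this example.

```python
from collections import Counter


def sentence_source(sentences):
    """Stand-in for a stream source: yields one sentence at a time."""
    for sentence in sentences:
        yield sentence


def split_words(stream):
    """Processing step 1: turn each sentence into individual words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def running_counts(stream):
    """Processing step 2: emit (word, count so far) after every word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


events = list(running_counts(split_words(sentence_source(["a b a", "b a"]))))
```

Each step consumes items as they arrive and never needs the whole input up front, so the same pipeline would work on a source that never ends. That is the difference from a batch job, which waits for the full data set.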
4/Apache Spark
Spark is really an ecosystem of its own. It contains components such as Spark SQL, which aims to improve on Hive.
What is Spark?
Apache Spark™ is a fast and general engine for large-scale data processing.
5/Apache Hive: SQL on Hadoop
What is Hive?
The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
So, Hive gives you a SQL-like language. IBM has a write-up on it: http://www-01.ibm.com/software/data/infosphere/hadoop/hive/ Their docs are always good.
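To see why a SQL-like layer is attractive compared to hand-written map/reduce, here is a small side-by-side sketch. SQLite (from Python's standard library) stands in for the SQL engine. This is not Hive or HiveQL, and the table name `click_log` and the sample rows are made up.

```python
import sqlite3
from collections import defaultdict

rows = [("alice", 3), ("bob", 5), ("alice", 4)]  # hypothetical (username, clicks) records

# SQL-style: declare the result you want and let the engine plan the work.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE click_log (username TEXT, n INTEGER)")
conn.executemany("INSERT INTO click_log VALUES (?, ?)", rows)
sql_result = conn.execute(
    "SELECT username, SUM(n) FROM click_log GROUP BY username ORDER BY username"
).fetchall()
conn.close()

# Map/reduce-style: spell out the grouping and summing by hand.
grouped = defaultdict(list)
for username, n in rows:  # "map": emit (key, value) pairs, grouped by key
    grouped[username].append(n)
# "reduce": collapse each key's values into one total
manual_result = sorted((name, sum(values)) for name, values in grouped.items())
```

Both approaches give the same totals, but the query is one statement while the manual version has to implement the grouping itself. On real data spread across a cluster, that gap is much bigger.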
6/Apache Pig
What is Pig?
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
Conclusion:
1. Most messaging systems are based on the producer-consumer pattern.
2. Pig and Hive both provide high-level query languages: HiveQL is SQL-like, while Pig has its own data-flow language (Pig Latin).
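For reference, the producer-consumer pattern from point 1 can be sketched with nothing but the Python standard library. This is an in-process toy with one producer and one consumer, not a real messaging system, and the `event-N` messages and the `STOP` sentinel are made up for the example.

```python
import queue
import threading

messages = queue.Queue()  # thread-safe FIFO shared by both sides
received = []
STOP = object()  # sentinel telling the consumer to finish


def producer():
    """Put three messages on the queue, then signal that no more are coming."""
    for i in range(3):
        messages.put(f"event-{i}")
    messages.put(STOP)


def consumer():
    """Take messages off the queue until the sentinel shows up."""
    while True:
        item = messages.get()  # blocks until something is available
        if item is STOP:
            break
        received.append(item)


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The producer and consumer never call each other directly. They only share the queue, which is what lets real messaging systems put the two sides on different machines and run them at different speeds.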
PLUS:
CDH: Cloudera's Distribution Including Apache Hadoop.
It's an open source distribution that includes Apache Hadoop, packaged and commercially supported by Cloudera.
Cloudera Impala:
The Leading Open Source Analytic Database for Apache Hadoop
Apache HBase:
HBase is a scalable, distributed data store that runs on top of HDFS.
For what?
For random, realtime read/write access to very large amounts of data.