In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.
—Grace Hopper
1. Data! Data!
==>We live in the data age. Mashups between different information sources make for unexpected and hitherto unimaginable applications. It has been said that “More data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms are, they can often be beaten simply by having more data (and a less sophisticated algorithm).
==>The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it.
2. Data Storage and Analysis
==>The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up.
==>At a transfer rate of around 100 MB/s, reading a full one-terabyte drive takes more than two and a half hours, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once: with 100 drives each holding one hundredth of the data and working in parallel, the same data can be read in under two minutes.
==>Using only one hundredth of each disk may seem wasteful. But we can store one hundred datasets, each one terabyte in size, and provide shared access to them.
There is more to being able to read and write data in parallel to or from multiple disks than this, though.
==>The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. (HDFS)
==>The second problem is that most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. (MapReduce)
This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.
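The storage half of that kernel can be made concrete with a small example. Below is a minimal sketch (not from the original text) of streaming a file out of HDFS through Hadoop's Java FileSystem API; the namenode address and file path are made-up placeholders.

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal sketch: stream a file stored in HDFS to standard output.
// The URI and path below are hypothetical, for illustration only.
public class HdfsCat {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/user/data/sample.txt"; // placeholder location
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                      // open the file for reading
            IOUtils.copyBytes(in, System.out, 4096, false);   // copy its bytes to stdout
        } finally {
            IOUtils.closeStream(in);                          // always release the stream
        }
    }
}
```

The point of the sketch is only that HDFS looks like an ordinary file system to application code, while the replication described above happens underneath.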
3. Comparison with Other Systems
The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset—or at least a good portion of it—is processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data, and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights.
3.1 RDBMS
Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?
==>The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate.
==>If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate.
==>On the other hand, for updating a small proportion of records in a database, a traditional B-Tree works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
The differences between the two systems are:
==>MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data.
==>MapReduce suits applications where the data is written once, and read many times, whereas a relational database is good for datasets that are continually updated.
==>Another difference between MapReduce and an RDBMS is the amount of structure in the datasets that they operate on. MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. Relational data is often normalized to retain its integrity and remove redundancy.
==>MapReduce is a linearly scalable programming model: the programmer writes a map function and a reduce function, each of which turns one set of key-value pairs into another, and neither function needs to change as the data or the cluster grows. A minimal word-count sketch of this model follows the table below.
|            | Traditional RDBMS         | MapReduce                   |
|------------|---------------------------|-----------------------------|
| Data size  | Gigabytes                 | Petabytes                   |
| Access     | Interactive and batch     | Batch                       |
| Updates    | Read and write many times | Write once, read many times |
| Structure  | Static schema             | Dynamic schema              |
| Integrity  | High                      | Low                         |
| Scaling    | Nonlinear                 | Linear                      |
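Here is the minimal word-count sketch referred to above, written against the standard Hadoop MapReduce Java API. The class names are illustrative; the logic is the textbook example rather than anything specific to this text.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each input line, emit a (word, 1) pair per word.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}

// Reduce: sum the counts emitted for each word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum)); // emit (word, total count)
    }
}
```

Because the map and reduce functions only ever see one key (or one key's values) at a time, the same code runs unchanged on a megabyte of input or on terabytes spread across a cluster, which is what "linearly scalable" means in practice.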
3.2 Grid Computing
==>Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared file system, hosted by a SAN. This works well for predominantly compute-intensive jobs, but becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which MapReduce really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.
MapReduce tries to collocate the data with the compute node, so data access is fast since it is local. This feature, known as data locality, is at the heart of MapReduce and is the reason for its good performance.
==>MPI gives great control to the programmer, but requires that he or she explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs, such as sockets, as well as the higher-level algorithm for the analysis.
MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit.
==>Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure—when you don’t know if a remote process has failed or not—and still making progress with the overall computation.
MapReduce spares the programmer from having to think about failure, since the implementation detects failed map or reduce tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one another.
By contrast, MPI programs have to explicitly manage their own checkpointing and recovery, which gives more control to the programmer but makes them more difficult to write.
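To make the contrast concrete, here is a sketch of a driver for the word-count example above. It contains no failure-handling code at all: the framework retries failed tasks on its own (governed by the mapreduce.map.maxattempts and mapreduce.reduce.maxattempts properties, typically four attempts by default). The input and output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver for the word-count sketch above. Note the absence of any
// failure-handling logic: the framework detects failed tasks and
// reschedules them on healthy machines.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations are placeholders for illustration.
        FileInputFormat.addInputPath(job, new Path("/user/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```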
3.3 Volunteer Computing
Volunteer computing projects work by breaking the problem they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed.
==>The SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world, since the time to transfer the work unit is dwarfed by the time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth.
==>MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.
4. Hadoop Ecosystem
==>Common
A set of components and interfaces for distributed file systems and general I/O (serialization, Java RPC, persistent data structures).
==>Avro
A serialization system for efficient, cross-language RPC and persistent data storage.
==>MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines.
==>HDFS
A distributed file system that runs on large clusters of commodity machines.
==>Pig
A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
==>Hive
A distributed data warehouse.
Hive manages data stored in HDFS and provides a query language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for querying the data.
==>HBase
A distributed, column-oriented database.
HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
==>ZooKeeper
A distributed, highly available coordination service.
ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications (a simplified lock sketch appears at the end of this list).
==>Sqoop
A tool for efficiently moving data between relational databases and HDFS.
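To illustrate the ZooKeeper entry above, here is a deliberately simplified sketch of the "ephemeral znode as lock" idea using the standard ZooKeeper Java client. Real applications would use the full lock recipe (ephemeral sequential znodes plus watches) or a client library such as Apache Curator; the connection string and znode path here are placeholders.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Simplified illustration: whoever creates the ephemeral znode "holds the lock".
// This is NOT the full lock recipe, just the primitive it is built on.
public class CrudeLockSketch {
    public static void main(String[] args) throws Exception {
        // Connection string and lock path are placeholders.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 15000, event -> { });
        try {
            zk.create("/app/lock", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("lock acquired; do the protected work here");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("another client holds the lock");
        } finally {
            zk.close();
        }
    }
}
```

The ephemeral flag is what makes this a useful building block: if the client that created the znode crashes or loses its session, ZooKeeper removes the znode automatically, so the "lock" cannot be orphaned.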