    The prototypical MapReduce example counts the appearance of each word in a set of documents:[14]

    function map(String name, String document):
      // name: document name
      // document: document contents
      for each word w in document:
        emit (w, 1)

    function reduce(String word, Iterator partialCounts):
      // word: a word
      // partialCounts: a list of aggregated partial counts
      sum = 0
      for each pc in partialCounts:
        sum += pc
      emit (word, sum)

    Here, each document is split into words, and each word is counted by the map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce. Thus, this function just needs to sum all of its input values to find the total appearances of that word.
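    The word-count pseudocode can be tried out in a single process. The following is a minimal Python sketch, not a distributed implementation: the driver run_word_count, the whitespace-based word splitting, and the sample documents are assumptions made for illustration, and the in-memory dictionary merely stands in for the framework's grouping of pairs by key.

```python
from collections import defaultdict


def map_fn(name, document):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for word in document.split():
        yield (word, 1)


def reduce_fn(word, partial_counts):
    """Sum the partial counts collected for one word."""
    return (word, sum(partial_counts))


def run_word_count(documents):
    """Hypothetical single-process driver standing in for the framework."""
    grouped = defaultdict(list)
    # Map phase, followed by grouping all emitted values by key.
    for name, text in documents.items():
        for word, count in map_fn(name, text):
            grouped[word].append(count)
    # Reduce phase: exactly one reduce call per distinct key.
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


counts = run_word_count({"a.txt": "to be or not to be", "b.txt": "to do"})
print(counts)  # {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```

    Because every pair with the same key reaches the same reduce call, reduce_fn never needs to know which document a count came from.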

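    As another example, imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age. In SQL, such a query could be expressed as:

      SELECT age, AVG(contacts)
        FROM social.person
    GROUP BY age
    ORDER BY age

    Using MapReduce, the same computation could be expressed with the following functions:

    function Map is
        input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
        for each social.person record in the K1 batch do
            let Y be the person's age
            let N be the number of contacts the person has
            produce one output record (Y,(N,1))
    end function

    function Reduce is
        input: age (in years) Y
        for each input record (Y,(N,C)) do
            Accumulate in S the sum of N*C
            Accumulate in Cnew the sum of C
        let A be S/Cnew
        produce one output record (Y,(A,Cnew))
    end function

    The count C carried in each record matters when the data is reduced in more than one step, because partial averages can only be combined correctly if each is weighted by the number of records behind it. For example:

    -- map output #1: age, quantity of contacts
    10, 9
    10, 9
    10, 9
    -- map output #2: age, quantity of contacts
    10, 9
    10, 9
    -- map output #3: age, quantity of contacts
    10, 10

    Reducing map outputs #1 and #2 yields an average of 9 contacts for a 10-year-old person ((9+9+9+9+9)/5):

    -- reduce step #1: age, average of contacts
    10, 9

    If this result is then reduced together with map output #3 without its record count, the average comes out as (9+10)/2 = 9.5, which is wrong. The correct answer is (9*3+9*2+10*1)/(3+2+1) = 55/6, or about 9.17.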
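    The role of the record count in the average-contacts example can be checked with a small script. This is a sketch under stated assumptions: records are modelled as (age, contacts) tuples, the three lists are the tiny map outputs used for illustration rather than real data, and the two-stage reduction is carried out by hand instead of by a framework.

```python
def map_fn(batch):
    """Emit (age, (contacts, 1)) for each (age, contacts) record in a batch."""
    for age, contacts in batch:
        yield (age, (contacts, 1))


def reduce_fn(age, records):
    """Combine (average, count) records into one count-weighted average."""
    records = list(records)  # iterated twice below
    total = sum(n * c for n, c in records)
    count = sum(c for _, c in records)
    return (age, (total / count, count))


# Illustrative map outputs for age 10 (assumed sample data).
file1 = [(10, 9), (10, 9), (10, 9)]
file2 = [(10, 9), (10, 9)]
file3 = [(10, 10)]

# Stage 1: reduce files #1 and #2 together, and file #3 on its own.
_, partial_12 = reduce_fn(10, [v for _, v in map_fn(file1 + file2)])
_, partial_3 = reduce_fn(10, [v for _, v in map_fn(file3)])

# Stage 2, keeping the counts: (9*5 + 10*1) / 6 = 55/6.
_, (weighted_avg, n) = reduce_fn(10, [partial_12, partial_3])

# Stage 2, discarding the counts: (9 + 10) / 2 = 9.5, which is wrong.
naive_avg = (partial_12[0] + partial_3[0]) / 2

print(partial_12, partial_3)          # (9.0, 5) (10.0, 1)
print(round(weighted_avg, 3), naive_avg)  # 9.167 9.5
```

    Emitting (average, count) rather than the average alone is what makes the reduce step safe to apply repeatedly to its own output.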


    The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:

    • an input reader
    • a Map function
    • a partition function
    • a compare function
    • a Reduce function
    • an output writer
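    How the six hot spots plug into the fixed sort-and-group core can be sketched in a single process. Everything named below (run_job, its signature, the num_reducers parameter, and the sample functions) is hypothetical and chosen for illustration; no real framework's API is being reproduced, and the in-memory lists only stand in for distributed partitions.

```python
from functools import cmp_to_key
from itertools import groupby


def run_job(source, input_reader, map_fn, partition_fn, compare_fn,
            reduce_fn, output_writer, num_reducers=2):
    """Hypothetical driver: the application supplies the six callables."""
    partitions = [[] for _ in range(num_reducers)]
    for k1, v1 in input_reader(source):
        for k2, v2 in map_fn(k1, v1):
            # The partition function picks the reducer for each key.
            partitions[partition_fn(k2, num_reducers)].append((k2, v2))
    key_order = cmp_to_key(compare_fn)
    for index, bucket in enumerate(partitions):
        # The fixed part: sort by key so that equal keys become adjacent.
        bucket.sort(key=lambda pair: key_order(pair[0]))
        for k2, group in groupby(bucket, key=lambda pair: pair[0]):
            for result in reduce_fn(k2, (v for _, v in group)):
                output_writer(index, result)


def read_lines(text):
    """Input reader: yield (line number, line) pairs."""
    for line_number, line in enumerate(text.splitlines()):
        yield (line_number, line)


def emit_words(_, line):
    """Map: emit (word, 1) for each word on the line."""
    for word in line.split():
        yield (word, 1)


def by_first_letter(key, n):
    """Partition: deterministic choice based on the first character."""
    return ord(key[0]) % n


def compare(a, b):
    """Compare: ordinary ascending order on keys."""
    return (a > b) - (a < b)


def sum_counts(word, counts):
    """Reduce: total the counts for one word."""
    yield (word, sum(counts))


results = []
run_job("b a\na c b", read_lines, emit_words, by_first_letter, compare,
        sum_counts, lambda index, record: results.append((index, record)))
print(results)  # [(0, ('b', 2)), (1, ('a', 2)), (1, ('c', 1))]
```

    The partition function uses ord rather than Python's built-in hash because string hashes vary between interpreter runs, which would make the assignment of keys to reducers non-reproducible.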



    MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.[1][2]

    A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
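    The student example can be made concrete with a few lines of Python. This is only a sketch: the student names are invented sample data, and the dictionary of lists plays the part of the per-name queues that a real system would build across many servers.

```python
from collections import defaultdict

# Hypothetical sample data.
students = ["Ana Silva", "Ben Okoye", "Ana Petrov", "Chen Wei", "Ben Hart"]


def map_fn(student):
    """Route a student to the queue named after their first name."""
    first_name = student.split()[0]
    return (first_name, student)


def reduce_fn(first_name, queue):
    """Summarize one queue as the number of students in it."""
    return (first_name, len(queue))


# Map step: one queue per first name.
queues = defaultdict(list)
for student in students:
    first_name, record = map_fn(student)
    queues[first_name].append(record)

# Reduce step: count each queue, yielding name frequencies.
frequencies = dict(reduce_fn(name, queue) for name, queue in sorted(queues.items()))
print(frequencies)  # {'Ana': 2, 'Ben': 2, 'Chen': 1}
```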

    The model is a specialization of the split-apply-combine strategy for data analysis.[3] It is inspired by the map and reduce functions commonly used in functional programming,[4] although their purpose in the MapReduce framework is not the same as in their original forms.[5] The key contributions of the MapReduce framework are not the actual map and reduce functions (which, for example, resemble the 1995 Message Passing Interface standard's[6] reduce[7] and scatter[8] operations), but the scalability and fault tolerance achieved for a variety of applications by optimizing the execution engine. As such, a single-threaded implementation of MapReduce is usually not faster than a traditional (non-MapReduce) implementation; any gains are usually seen only with multi-threaded implementations.[9] The model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and the fault-tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.[10]
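    The difference between the functional-programming primitives and their MapReduce counterparts can be illustrated in Python. The sketch below uses the built-in map and functools.reduce for the functional form; the keyed variant, with its parity keys and in-memory grouping, is an assumption made purely for illustration.

```python
from collections import defaultdict
from functools import reduce

numbers = [1, 2, 3, 4]

# Functional style: map transforms each element one-to-one,
# and reduce folds the whole sequence into a single value.
squares = list(map(lambda x: x * x, numbers))
total = reduce(lambda acc, x: acc + x, squares, 0)

# MapReduce style: map emits (key, value) pairs, the framework groups
# them by key, and reduce runs once per key rather than once overall.
pairs = [("even" if x % 2 == 0 else "odd", x * x) for x in numbers]
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
per_key = {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}

print(squares, total)  # [1, 4, 9, 16] 30
print(per_key)         # {'odd': 10, 'even': 20}
```

    The grouping step between the two phases has no counterpart in the functional primitives; in a distributed setting it corresponds to the shuffle whose cost the paragraph above singles out.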

    MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that supports distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been genericized. By 2014, Google was no longer using MapReduce as its primary big data processing model,[11] and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.[12]

