• Hama: BSP and Graph Tutorial


    1. BSP

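    Hama provides a pure BSP (Bulk Synchronous Parallel) model that supports message passing and global communication. A BSP program consists of a sequence of supersteps, and each superstep has three parts:

      1) Local computation

      2) Process communication

      3) Barrier synchronization

    For many scientific computing problems, the BSP model makes it possible to write high-performance parallel algorithms.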
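
    The three phases of a superstep can be sketched in plain Java without Hama. The class below is a hypothetical single-JVM simulation (SuperstepSketch and runOneSuperstep are illustrative names, not part of Hama): each thread plays one peer, computes a local value, posts it to every peer's inbox, and waits on a java.util.concurrent.CyclicBarrier before reading its own inbox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

// Hypothetical single-JVM simulation of one BSP superstep; this is not Hama code.
final class SuperstepSketch {

  // Runs one superstep on numPeers threads (numPeers must be at least 1) and
  // returns the sum each peer computes from its inbox after the barrier.
  static int[] runOneSuperstep(int numPeers) {
    List<Queue<Integer>> inboxes = new ArrayList<>();
    for (int p = 0; p < numPeers; p++) {
      inboxes.add(new ConcurrentLinkedQueue<>());
    }
    CyclicBarrier barrier = new CyclicBarrier(numPeers);
    int[] sums = new int[numPeers];
    Thread[] threads = new Thread[numPeers];

    for (int p = 0; p < numPeers; p++) {
      final int id = p;
      threads[p] = new Thread(() -> {
        int local = id + 1;                     // 1) local computation
        for (Queue<Integer> inbox : inboxes) {  // 2) communication: send to every peer
          inbox.add(local);
        }
        try {
          barrier.await();                      // 3) barrier synchronization
        } catch (InterruptedException | BrokenBarrierException e) {
          throw new IllegalStateException(e);
        }
        int sum = 0;                            // past the barrier, every message has arrived
        for (int value : inboxes.get(id)) {
          sum += value;
        }
        sums[id] = sum;
      });
      threads[p].start();
    }
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return sums;
  }
}
```

    With four peers, every peer ends up with 1 + 2 + 3 + 4 = 10. No peer reads its inbox before the barrier, which is why the sums are complete.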

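    Create your own BSP class by extending org.apache.hama.bsp.BSP.

    The subclass must implement the following method (M is a type parameter of the BSP class and must extend Writable):

      public abstract void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws IOException, SyncException, InterruptedException;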
    

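    Every BSP program consists of a sequence of supersteps, but the bsp() method is called only once, which is different from MapReduce. You can optionally implement setup() and cleanup() to do extra work before and after the computation. According to the Hama tutorial, cleanup() runs after the computation finishes or when it fails.

    Configure the job: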
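
    The call order can be sketched in plain Java. This is a hypothetical illustration, not Hama's runner: LifecycleSketch, Task, and demoTask are made-up names, and the try/finally layout is an assumption about how a framework could ensure that cleanup() runs on both the success path and the failure path.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the task lifecycle; the names below are not Hama classes.
final class LifecycleSketch {

  interface Task {
    void setup(List<String> log);
    void bsp(List<String> log);
    void cleanup(List<String> log);
  }

  // Calls setup() and bsp() once each, and always calls cleanup() afterwards.
  static List<String> run(Task task) {
    List<String> log = new ArrayList<>();
    try {
      task.setup(log);
      task.bsp(log); // called once; all supersteps happen inside this call
    } catch (RuntimeException e) {
      log.add("failed: " + e.getMessage());
    } finally {
      task.cleanup(log);
    }
    return log;
  }

  // Demo task that loops over its supersteps inside bsp(), optionally failing.
  static Task demoTask(int supersteps, boolean fail) {
    return new Task() {
      @Override public void setup(List<String> log) { log.add("setup"); }

      @Override public void bsp(List<String> log) {
        for (int i = 0; i < supersteps; i++) {
          if (fail) {
            throw new IllegalStateException("boom");
          }
          log.add("superstep " + i);
        }
      }

      @Override public void cleanup(List<String> log) { log.add("cleanup"); }
    };
  }
}
```

    Calling LifecycleSketch.run(LifecycleSketch.demoTask(2, false)) records setup, two supersteps, and cleanup, in that order. With fail set to true, the log still ends with cleanup.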

      HamaConfiguration conf = new HamaConfiguration();
      BSPJob job = new BSPJob(conf, MyBSP.class);
      job.setJobName("My BSP program");
      job.setBspClass(MyBSP.class);
      job.setInputFormat(NullInputFormat.class);
      job.setOutputKeyClass(Text.class);
      ...
      job.waitForCompletion(true);
    

    User interface

    Input and output

    When configuring a BSPJob, set the input and output paths as follows:

      job.setInputPath(new Path("/tmp/sequence.dat"));
      job.setInputFormat(org.apache.hama.bsp.SequenceFileInputFormat.class);
      // or,
      SequenceFileInputFormat.addInputPath(job, new Path("/tmp/sequence.dat"));
      // or,
      SequenceFileInputFormat.addInputPaths(job, "/tmp/seq1.dat,/tmp/seq2.dat,/tmp/seq3.dat");
      
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setOutputFormat(TextOutputFormat.class);
      FileOutputFormat.setOutputPath(job, new Path("/tmp/result"));
      
    

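    Any one of the three input forms above can be used.

    Next comes reading the input and writing the output. The bsp() method takes a BSPPeer as its parameter; BSPPeer bundles the communication, counter, and I/O interfaces. The following code reads from a file: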

    @Override
      public final void bsp(
          BSPPeer<LongWritable, Text, Text, LongWritable, Text> peer)
          throws IOException, InterruptedException, SyncException {
          
          // this method reads the next key value record from file
          KeyValuePair<LongWritable, Text> pair = peer.readNext();
    
          // the following lines do the same:
          LongWritable key = new LongWritable();
          Text value = new Text();
          peer.readNext(key, value);
          
          // write
          peer.write(value, key);
      }
    

    The input can be read repeatedly by reopening it:

    for(int i = 0; i < 5; i++){
        LongWritable key = new LongWritable();
        Text value = new Text();
        while (peer.readNext(key, value)) {
           // read everything
        }
        // reopens the input
        peer.reopenInput();
      }
    

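    Communication:

     Method                                   Description
     send(String peerName, BSPMessage msg)    Sends a message to another peer
     getCurrentMessage()                      Returns a received message
     getNumCurrentMessages()                  Returns the number of received messages
     sync()                                   Barrier synchronization
     getPeerName()                            Returns the name of this peer
     getAllPeerNames()                        Returns the names of all peers
     getSuperstepCount()                      Returns the superstep count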

      These methods are flexible. The following code sends a message to every peer:

      @Override
      public void bsp(
          BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, Text> peer)
          throws IOException, SyncException, InterruptedException {
        for (String peerName : peer.getAllPeerNames()) {
          peer.send(peerName,
            new Text("Hello from " + peer.getPeerName() + " at " + System.currentTimeMillis()));
        }
    
        peer.sync();
      }
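
    The relationship between send(), sync(), and getCurrentMessage() can be sketched with a toy in-memory mailbox. MailboxSketch is a hypothetical stand-in, not the Hama API: it borrows the method names from the table above, keys everything by peer name, and simplifies by letting unread messages carry over to the next superstep.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical in-memory stand-in for BSP messaging; this is not the Hama API.
final class MailboxSketch {
  private final Map<String, Queue<String>> outgoing = new LinkedHashMap<>();
  private final Map<String, Queue<String>> delivered = new LinkedHashMap<>();

  MailboxSketch(String... peerNames) {
    for (String name : peerNames) {
      outgoing.put(name, new ArrayDeque<>());
      delivered.put(name, new ArrayDeque<>());
    }
  }

  // Buffers a message for a known peer; the receiver cannot see it until sync().
  void send(String toPeer, String msg) {
    outgoing.get(toPeer).add(msg);
  }

  // Barrier: moves every buffered message into the receivers' inboxes.
  void sync() {
    for (Map.Entry<String, Queue<String>> entry : outgoing.entrySet()) {
      delivered.get(entry.getKey()).addAll(entry.getValue());
      entry.getValue().clear();
    }
  }

  // Returns the next delivered message for the peer, or null if none is left.
  String getCurrentMessage(String peer) {
    return delivered.get(peer).poll();
  }

  // Returns how many delivered messages the peer has not read yet.
  int getNumCurrentMessages(String peer) {
    return delivered.get(peer).size();
  }
}
```

    A message sent to peer "b" is invisible to getCurrentMessage("b") until sync() has been called. That is why the Hama example above ends with peer.sync().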
    

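    Synchronization:

    Once every process has reached the barrier, the next superstep begins. Note that calling sync() does not end the BSP job. As mentioned above, the communication methods are flexible: for example, you can call sync() inside a for loop to control the order of the iterations.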

     @Override
      public void bsp(
          BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, Text> peer)
          throws IOException, SyncException, InterruptedException {
        for (int i = 0; i < 100; i++) {
          // send some messages
          peer.sync();
        }
      }
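
    What a loop around sync() buys can be seen in a small sequential simulation. The sketch below is hypothetical plain Java, not Hama code (IterationSketch and rotate are illustrative names): peers form a ring, and in every superstep each peer sends its value to its right-hand neighbor. Swapping the arrays plays the role of the barrier, so a value moves exactly one hop per superstep.

```java
// Hypothetical sequential simulation of a sync() loop; this is not Hama code.
final class IterationSketch {

  // Rotates the values around a ring of peers, one hop per superstep.
  static int[] rotate(int[] initial, int supersteps) {
    int n = initial.length;
    int[] current = initial.clone();
    for (int step = 0; step < supersteps; step++) {
      int[] next = new int[n];           // messages in flight during this superstep
      for (int p = 0; p < n; p++) {
        next[(p + 1) % n] = current[p];  // "send" to the right-hand neighbor
      }
      current = next;                    // "sync": all messages are delivered at once
    }
    return current;
  }
}
```

    Starting from {1, 2, 3}, one superstep gives {3, 1, 2} and three supersteps return to {1, 2, 3}.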
    

    Finally, here is a complete example that estimates the value of Pi:

      private static Path TMP_OUTPUT = new Path("/tmp/pi-" + System.currentTimeMillis());
    
      public static class MyEstimator extends
          BSP<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> {
        public static final Log LOG = LogFactory.getLog(MyEstimator.class);
        private String masterTask;
        private static final int iterations = 10000;
    
        @Override
        public void bsp(
            BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
            throws IOException, SyncException, InterruptedException {
    
          int in = 0;
          for (int i = 0; i < iterations; i++) {
            double x = 2.0 * Math.random() - 1.0, y = 2.0 * Math.random() - 1.0;
            if ((Math.sqrt(x * x + y * y) < 1.0)) {
              in++;
            }
          }
    
          double data = 4.0 * in / iterations;
    
          peer.send(masterTask, new DoubleWritable(data));
          peer.sync();
        }
    
        @Override
        public void setup(
            BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
            throws IOException {
          // Choose one as a master
          this.masterTask = peer.getPeerName(peer.getNumPeers() / 2);
        }
    
        @Override
        public void cleanup(
            BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
            throws IOException {
          if (peer.getPeerName().equals(masterTask)) {
            double pi = 0.0;
            int numPeers = peer.getNumCurrentMessages();
            DoubleWritable received;
            while ((received = peer.getCurrentMessage()) != null) {
              pi += received.get();
            }
    
            pi = pi / numPeers;
            peer.write(new Text("Estimated value of PI is"), new DoubleWritable(pi));
          }
        }
      }
    
      static void printOutput(HamaConfiguration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        FileStatus[] files = fs.listStatus(TMP_OUTPUT);
        for (int i = 0; i < files.length; i++) {
          if (files[i].getLen() > 0) {
            FSDataInputStream in = fs.open(files[i].getPath());
            IOUtils.copyBytes(in, System.out, conf, false);
            in.close();
            break;
          }
        }
    
        fs.delete(TMP_OUTPUT, true);
      }
    
      public static void main(String[] args) throws InterruptedException,
          IOException, ClassNotFoundException {
        // BSP job configuration
        HamaConfiguration conf = new HamaConfiguration();
    
        BSPJob bsp = new BSPJob(conf, PiEstimator.class);
        // Set the job name
        bsp.setJobName("Pi Estimation Example");
        bsp.setBspClass(MyEstimator.class);
        bsp.setInputFormat(NullInputFormat.class);
        bsp.setOutputKeyClass(Text.class);
        bsp.setOutputValueClass(DoubleWritable.class);
        bsp.setOutputFormat(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(bsp, TMP_OUTPUT);
    
        BSPJobClient jobClient = new BSPJobClient(conf);
        ClusterStatus cluster = jobClient.getClusterStatus(true);
    
        if (args.length > 0) {
          bsp.setNumBspTask(Integer.parseInt(args[0]));
        } else {
          // Set to maximum
          bsp.setNumBspTask(cluster.getMaxTasks());
        }
    
        long startTime = System.currentTimeMillis();
        if (bsp.waitForCompletion(true)) {
          printOutput(conf);
          System.out.println("Job Finished in "
              + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");
        }
      }
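
    The arithmetic of the example can be checked without a cluster. The sketch below is a hypothetical plain-Java rendering (PiSketch and its methods are illustrative names): each simulated peer samples points in the square [-1, 1) x [-1, 1) and reports 4 * in / iterations, and the master averages the reports. Two choices differ from the Hama code and are assumptions of this sketch: a seeded java.util.Random makes runs reproducible, and the test x*x + y*y < 1 drops the square root, which does not change the outcome.

```java
import java.util.Random;

// Hypothetical plain-Java version of the Pi estimate; this is not Hama code.
final class PiSketch {

  // What one peer computes in bsp(): the share of random points inside the unit circle, times 4.
  static double estimateOnPeer(long seed, int iterations) {
    Random rnd = new Random(seed);
    int in = 0;
    for (int i = 0; i < iterations; i++) {
      double x = 2.0 * rnd.nextDouble() - 1.0;
      double y = 2.0 * rnd.nextDouble() - 1.0;
      if (x * x + y * y < 1.0) {
        in++;
      }
    }
    return 4.0 * in / iterations;
  }

  // What the master computes in cleanup(): the mean of all received estimates.
  static double average(double[] estimates) {
    double sum = 0.0;
    for (double e : estimates) {
      sum += e;
    }
    return sum / estimates.length;
  }

  // Simulates numPeers peers, each with its own seed, followed by the master's average.
  static double estimate(int numPeers, int iterations) {
    double[] estimates = new double[numPeers];
    for (int p = 0; p < numPeers; p++) {
      estimates[p] = estimateOnPeer(42L + p, iterations);
    }
    return average(estimates);
  }
}
```

    Averaging the per-peer estimates is valid here because every peer runs the same number of iterations. With unequal sample counts the master would need a weighted mean.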
    

    2. Graph

    Hama provides a Graph package for vertex-centric graph computation, so a Google Pregel-style application can be written with little code.

    Vertex API

    Writing a Hama Graph application involves subclassing the predefined Vertex class. Its template parameters cover three types: vertices, edges, and messages:

    public abstract class Vertex<V extends Writable, E extends Writable, M extends Writable>
          implements VertexInterface<V, E, M> {
    
        public abstract void compute(Iterator<M> messages) throws IOException;
        // ...
    
      }
    

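    You override the compute() method, which is executed on every active vertex in each superstep. compute() can query information about the current vertex and its edges, and it can send messages to other vertices.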
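
    The vertex-centric model can be sketched with a tiny sequential engine in plain Java. This is a hypothetical illustration, not the Hama Graph API (VertexCentricSketch and propagateMax are made-up names). It assumes the usual Pregel-style activity rule: a vertex is active in the first superstep and afterwards only when it has received messages. The inlined "compute" step propagates the maximum value through the graph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sequential engine for a vertex-centric computation; this is not Hama code.
final class VertexCentricSketch {

  private static List<List<Integer>> emptyInboxes(int n) {
    List<List<Integer>> inboxes = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      inboxes.add(new ArrayList<>());
    }
    return inboxes;
  }

  // adjacency[v] lists the out-neighbors of vertex v; values[v] is its start value.
  // Returns, for each vertex, the largest value that can reach it along the edges.
  static int[] propagateMax(int[][] adjacency, int[] values) {
    int n = values.length;
    int[] current = values.clone();
    List<List<Integer>> inbox = emptyInboxes(n);
    boolean firstStep = true;
    boolean anyMessages = true;
    while (anyMessages) {                       // one loop pass = one superstep
      List<List<Integer>> nextInbox = emptyInboxes(n);
      anyMessages = false;
      for (int v = 0; v < n; v++) {
        if (!firstStep && inbox.get(v).isEmpty()) {
          continue;                             // inactive vertex: no messages received
        }
        boolean changed = firstStep;            // "compute": fold messages into the value
        for (int msg : inbox.get(v)) {
          if (msg > current[v]) {
            current[v] = msg;
            changed = true;
          }
        }
        if (changed) {                          // send the new value along every out-edge
          for (int neighbor : adjacency[v]) {
            nextInbox.get(neighbor).add(current[v]);
            anyMessages = true;
          }
        }
      }
      inbox = nextInbox;                        // barrier: messages become visible
      firstStep = false;
    }
    return current;
  }
}
```

    On the ring 0 -> 1 -> 2 -> 0 with start values {3, 6, 2}, every vertex ends with 6. The loop stops once a superstep sends no messages.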

    VertexReader API

    Create a VertexReader for your own file format by extending org.apache.hama.graph.VertexInputReader. For example:

      public static class PagerankTextReader extends
          VertexInputReader<LongWritable, Text, Text, NullWritable, DoubleWritable> {
    
        /**
         * Parses one vertex per input line, as in Hadoop. Each line is split on
         * tabs into a String array and then converted into a Vertex instance.
         * <p>
         * The text file essentially should look like: <br/>
         * VERTEX_ID\t(n-tab separated VERTEX_IDs)<br/>
         * E.G:<br/>
         * 1\t2\t3\t4<br/>
         * 2\t3\t1<br/>
         * etc.
         */
        @Override
        public boolean parseVertex(LongWritable key, Text value,
            Vertex<Text, NullWritable, DoubleWritable> vertex) throws Exception {
          String[] split = value.toString().split("\t");
          for (int i = 0; i < split.length; i++) {
            if (i == 0) {
              vertex.setVertexID(new Text(split[i]));
            } else {
              vertex
                  .addEdge(new Edge<Text, NullWritable>(new Text(split[i]), null));
            }
          }
          return true;
        }
    
      }
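
    The parsing rule itself can be tried without Hama. The class below is a hypothetical plain-Java sketch (AdjacencyLineSketch is an illustrative name) that applies the same split: the first tab-separated field is the vertex ID and every remaining field is the target of an outgoing edge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java version of the line format handled by parseVertex(); not Hama code.
final class AdjacencyLineSketch {
  final String vertexId;
  final List<String> neighbors;

  private AdjacencyLineSketch(String vertexId, List<String> neighbors) {
    this.vertexId = vertexId;
    this.neighbors = neighbors;
  }

  // Splits "ID\tN1\tN2..." into the vertex ID and the list of neighbor IDs.
  static AdjacencyLineSketch parse(String line) {
    String[] split = line.split("\t");
    List<String> neighbors = new ArrayList<>(Arrays.asList(split).subList(1, split.length));
    return new AdjacencyLineSketch(split[0], neighbors);
  }
}
```

    For the line "1\t2\t3\t4" this yields vertex "1" with neighbors 2, 3, and 4. A line containing only an ID yields a vertex with no outgoing edges.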
    

    The PageRank example is short and largely self-explanatory:

    public static class PageRankVertex extends
          Vertex<Text, NullWritable, DoubleWritable> {
    
        @Override
        public void compute(Iterator<DoubleWritable> messages) throws IOException {
          if (this.getSuperstepCount() == 0) {
            this.setValue(new DoubleWritable(1.0 / (double) this.getNumVertices()));
          }
    
          if (this.getSuperstepCount() >= 1) {
            double sum = 0;
            while (messages.hasNext()) {
              DoubleWritable msg = messages.next();
              sum += msg.get();
            }
    
            double ALPHA = (1 - 0.85) / (double) this.getNumVertices();
            this.setValue(new DoubleWritable(ALPHA + (0.85 * sum)));
          }
    
          if (this.getSuperstepCount() < this.getMaxIteration()) {
            int numEdges = this.getOutEdges().size();
            sendMessageToNeighbors(new DoubleWritable(this.getValue().get()
                / numEdges));
          }
        }
      }
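
    The update rule in compute() can be checked on a small graph in plain Java. The sketch below is hypothetical and does not use Hama (PageRankSketch and rank are illustrative names). It applies the same formula as the vertex code: every vertex starts at 1/N, sends value / numEdges along each out-edge, and then sets its value to (1 - 0.85) / N + 0.85 * sum of the incoming messages. It assumes every vertex has at least one out-edge.

```java
import java.util.Arrays;

// Hypothetical array-based version of the PageRank update; this is not Hama code.
final class PageRankSketch {
  static final double DAMPING = 0.85;

  // outEdges[v] lists the targets of v's out-edges; every vertex needs at least one.
  static double[] rank(int[][] outEdges, int iterations) {
    int n = outEdges.length;
    double[] rank = new double[n];
    Arrays.fill(rank, 1.0 / n);                  // superstep 0: uniform start value
    for (int it = 0; it < iterations; it++) {
      double[] sum = new double[n];              // incoming messages, summed per vertex
      for (int v = 0; v < n; v++) {
        for (int target : outEdges[v]) {
          sum[target] += rank[v] / outEdges[v].length;
        }
      }
      for (int v = 0; v < n; v++) {
        rank[v] = (1 - DAMPING) / n + DAMPING * sum[v];
      }
    }
    return rank;
  }
}
```

    On a three-vertex cycle every rank stays at 1/3, because each vertex receives exactly what it sends. In any graph without dangling vertices the ranks keep summing to 1.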
    

      

      

      

    References:

    1. http://hama.apache.org/hama_bsp_tutorial.html

    2. http://hama.apache.org/hama_graph_tutorial.html

    If you repost this article, please keep this link: http://www.cnblogs.com/Deron/archive/2013/06/09/3128135.html

      

      

  • Original article: https://www.cnblogs.com/Deron/p/3128135.html