Indexing and Searching on a Hadoop Distributed File System


    FROM:http://www.drdobbs.com/parallel/indexing-and-searching-on-a-hadoop-distr/226300241?pgno=3

    In today's information-saturated world, the enormous growth of geographically distributed data calls for a system that facilitates fast parsing for the retrieval of meaningful results. Searchable indexes over distributed data go a long way toward speeding up this process. In this article, I demonstrate how to use Lucene and Java for basic data indexing and searching, how to use a RAM directory for indexing and searching, how to create an index over data residing on HDFS, and how to search those indexes. The development environment consisted of Eclipse 3.4.2, Java 1.6, Lucene 2.4.0, and Hadoop 0.19.1 running on Microsoft Windows XP SP3.

    To tackle this task, I turned to Hadoop. The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing, and the Hadoop Distributed File System (HDFS) is designed for storing and sharing files across wide area networks. HDFS is built to run on commodity hardware, and it provides fault tolerance, resource management, and, most importantly, high-throughput access to application data.

    Creating an Index on the Local File System

    The first step is to create an index of the data stored on the local file system. Start by creating an Eclipse project, creating a class, and then adding the required JAR files to the project. Take as an example this application data found in a web server log file:

    2010-04-21 02:24:01 GET /blank 200 120

    This data maps to the following fields:

    • 2010-04-21 - the date field
    • 02:24:01 - the time field
    • GET - the method field (GET or POST), which we will denote as "cs-method"
    • /blank - the requested URL field, which we will denote as "cs-uri"
    • 200 - the status code of the request, which we will denote as "sc-status"
    • 120 - the time-taken field (the time required to complete the request)
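    For illustration, here is a minimal sketch of my own showing how one log row splits into these fields (the variable names are hypothetical; the indexing code later in this article performs the same split):

    String row = "2010-04-21 02:24:01 GET /blank 200 120";
    String[] fields = row.split(" ");
    // fields[0] = "2010-04-21" (date), fields[1] = "02:24:01" (time),
    // fields[2] = "GET" (cs-method), fields[3] = "/blank" (cs-uri),
    // fields[4] = "200" (sc-status), fields[5] = "120" (time-taken)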
    The data in our sample file, named "test.txt" and located in "E:/DataFile", currently looks like this:
      2010-04-21 02:24:01 GET /blank 200 120
      2010-04-21 02:24:01 GET /US/registrationFrame 200 605
      2010-04-21 02:24:02 GET /US/kids/boys 200 785
      2010-04-21 02:24:02 POST /blank 304 56
      2010-04-21 02:24:04 GET /blank 304 233
      2010-04-21 02:24:04 GET /blank 500 567
      2010-04-21 02:24:04 GET /blank 200 897
      2010-04-21 02:24:04 POST /blank 200 567
      2010-04-21 02:24:05 GET /US/search 200 658
      2010-04-21 02:24:05 POST /US/shop 200 768
      2010-04-21 02:24:05 GET /blank 200 347

    We want to index the data present in this "test.txt" file and save the index to the local file system. The Java code below does this. (Note the comments, which give details about what each part of the code does.)

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Creating IndexWriter object and specifying the path where the index
    // files are to be stored.
    IndexWriter indexWriter = new IndexWriter("E://DataFile/IndexFiles", new StandardAnalyzer(), true);

    // Creating BufferedReader object and specifying the path of the file
    // whose data is to be indexed.
    BufferedReader reader = new BufferedReader(new FileReader("E://DataFile/Test.txt"));

    String row = null;

    // Reading each line present in the file.
    while ((row = reader.readLine()) != null)
    {
        // Getting each field present in a row into an array; the file delimiter is a space.
        String Arow[] = row.split(" ");

        // For each row, creating a document and adding data to the document with the associated fields.
        Document document = new Document();

        document.add(new Field("date", Arow[0], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("time", Arow[1], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("cs-method", Arow[2], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("cs-uri", Arow[3], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("sc-status", Arow[4], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("time-taken", Arow[5], Field.Store.YES, Field.Index.ANALYZED));

        // Adding the document to the index.
        indexWriter.addDocument(document);
    }
    indexWriter.optimize();
    indexWriter.close();
    reader.close();

    Once this Java code has executed, the index files will have been created and stored at the location "E://DataFile/IndexFiles".
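    As a quick sanity check, you could open the index and confirm the document count. This is a sketch of my own using the standard Lucene 2.4 org.apache.lucene.index.IndexReader API, not part of the original listings:

    IndexReader ir = IndexReader.open("E://DataFile/IndexFiles");
    System.out.println("Documents in index = " + ir.numDocs()); // expect 11 for our sample file
    ir.close();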

    Now we can search the index files we just created. Searching is done on the fields of the data. Lucene supports a variety of search semantics: you can perform a search on a particular field or on a combination of fields (a sketch of a combined-field query follows the output below). The following Java code searches the index:

    // Creating Searcher object and specifying the path where the index files are stored.
    Searcher searcher = new IndexSearcher("E://DataFile/IndexFiles");
    Analyzer analyzer = new StandardAnalyzer();

    // Printing the total number of documents or entries present in the index.
    System.out.println("Total Documents = " + searcher.maxDoc());

    // Creating the QueryParser object and specifying the field name on
    // which the search has to be done.
    QueryParser parser = new QueryParser("cs-uri", analyzer);

    // Creating the Query object and specifying the text for which the search has to be done.
    Query query = parser.parse("/blank");

    // The line below performs the search on the index and returns the matching hits.
    Hits hits = searcher.search(query);

    // Printing the number of documents or entries that match the search query.
    System.out.println("Number of matching documents = " + hits.length());

    // Printing the documents (or rows of the file) that matched the search criteria.
    for (int i = 0; i < hits.length(); i++)
    {
        Document doc = hits.doc(i);
        System.out.println(doc.get("date") + " " + doc.get("time") + " " +
            doc.get("cs-method") + " " + doc.get("cs-uri") + " " + doc.get("sc-status") + " " + doc.get("time-taken"));
    }

    In this example, the search is done on the field "cs-uri", and the text searched for within the "cs-uri" field is "/blank". So when the search code is run, all the documents (or rows) whose cs-uri field contains "/blank" are displayed in the output. The output looks like this:

    Total Documents = 11
    Number of matching documents = 7
    2010-04-21 02:24:01 GET /blank 200 120
    2010-04-21 02:24:02 POST /blank 304 56
    2010-04-21 02:24:04 GET /blank 304 233
    2010-04-21 02:24:04 GET /blank 500 567
    2010-04-21 02:24:04 GET /blank 200 897
    2010-04-21 02:24:04 POST /blank 200 567
    2010-04-21 02:24:05 GET /blank 200 347
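    As mentioned above, searches can also combine fields. Here is a minimal sketch of my own, using the standard Lucene 2.4 BooleanQuery API, that finds POST requests to /blank. Note that StandardAnalyzer lowercases terms and strips the leading slash at index time, so the raw term values are "post" and "blank":

    BooleanQuery combined = new BooleanQuery();
    combined.add(new TermQuery(new Term("cs-method", "post")), BooleanClause.Occur.MUST);
    combined.add(new TermQuery(new Term("cs-uri", "blank")), BooleanClause.Occur.MUST);
    Hits posts = searcher.search(combined);
    System.out.println("POST requests to /blank = " + posts.length()); // expect 2 for our sample file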

    Memory-Based Indexing on HDFS

    Now consider the case where the data is located on a distributed file system like Hadoop DFS. The code above will not work directly on distributed data, so to use it we would first have to complete several preliminary steps: copy the data from HDFS to the local file system, create the index from the data on the local file system, and finally store the index files back to HDFS. The same steps would be required for searching. This approach is time-consuming and far from ideal. Instead, let's index and search our data using the memory of the HDFS node where the data resides.
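    (For contrast, here is a rough sketch of my own of that copy-based workflow, assuming the Configuration/FileSystem setup shown in the next listing; the paths are the ones used elsewhere in this article.)

    // 1. Copy the data file from HDFS to the local file system.
    dfs.copyToLocalFile(new Path("/DataFile/test.txt"), new Path("E://DataFile/test.txt"));
    // 2. Index it locally using the code from the first section.
    // 3. Copy the resulting index files back to HDFS.
    dfs.copyFromLocalFile(new Path("E://DataFile/IndexFiles"), new Path("/IndexFiles"));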

    Assume that the data file "Test.txt" used earlier now resides on HDFS, inside a working-directory folder named "/DataFile/Test.txt". Create another folder called "/IndexFiles" inside the HDFS working directory, where our generated index files will be stored. The following Java code creates index files in memory from a file stored on HDFS:

    // Path where the index files will be stored.
    String Index_DIR = "/IndexFiles/";
    // Path where the data file is stored.
    String File_DIR = "/DataFile/test.txt";

    // Creating FileSystem object, to be able to work with HDFS.
    Configuration config = new Configuration();
    config.set("fs.default.name", "hdfs://127.0.0.1:9000/");
    FileSystem dfs = FileSystem.get(config);

    // Creating a RAMDirectory (memory) object, to be able to create the index in memory.
    RAMDirectory rdir = new RAMDirectory();

    // Creating IndexWriter object for the RAM directory.
    IndexWriter indexWriter = new IndexWriter(rdir, new StandardAnalyzer(), true);

    // Creating FSDataInputStream object, for reading the data from the "Test.txt" file residing on HDFS.
    FSDataInputStream filereader = dfs.open(new Path(dfs.getWorkingDirectory() + File_DIR));
    String row = null;

    // Reading each line present in the file.
    while ((row = filereader.readLine()) != null)
    {
        // Getting each field present in a row into an array; the file delimiter is a space.
        String Arow[] = row.split(" ");

        // For each row, creating a document and adding data to the document
        // with the associated fields.
        Document document = new Document();

        document.add(new Field("date", Arow[0], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("time", Arow[1], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("cs-method", Arow[2], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("cs-uri", Arow[3], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("sc-status", Arow[4], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("time-taken", Arow[5], Field.Store.YES, Field.Index.ANALYZED));

        // Adding the document to the index.
        indexWriter.addDocument(document);
    }
    indexWriter.optimize();
    indexWriter.close();
    filereader.close();

    So, for the "test.txt" file residing on HDFS, we now have the index files created in memory. The following code stores those index files into the HDFS folder:

    // Getting the names of the files present in memory into an array.
    String fileList[] = rdir.list();

    // Reading the index files from memory and storing them to HDFS.
    for (int i = 0; i < fileList.length; i++)
    {
        IndexInput indxfile = rdir.openInput(fileList[i].trim());
        long len = indxfile.length();
        int len1 = (int) len;

        // Reading data from the file into a byte array.
        byte[] bytarr = new byte[len1];
        indxfile.readBytes(bytarr, 0, len1);

        // Creating a file in the HDFS directory with the same name as the
        // index file.
        Path src = new Path(dfs.getWorkingDirectory() + Index_DIR + fileList[i].trim());
        dfs.createNewFile(src);

        // Writing data from the byte array to the file in HDFS.
        FSDataOutputStream fs = dfs.create(new Path(dfs.getWorkingDirectory() + Index_DIR + fileList[i].trim()), true);
        fs.write(bytarr);
        fs.close();
    }

    We now have the index files for the "Test.txt" data file created and stored in the HDFS directory.
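    As a quick check (a sketch of my own, using the standard Hadoop FileSystem API and the dfs and Index_DIR variables from above), you can list the index files now sitting in HDFS:

    FileStatus[] stats = dfs.listStatus(new Path(dfs.getWorkingDirectory() + Index_DIR));
    for (FileStatus stat : stats) {
        System.out.println(stat.getPath().getName() + " (" + stat.getLen() + " bytes)");
    }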

    Memory-Based Searching on HDFS

    We can now perform searches on the indexes stored in HDFS. First, we have to bring the HDFS index files into memory so they can be searched. The following code is used for this process:

    // Creating FileSystem object, to be able to work with HDFS.
    Configuration config = new Configuration();
    config.set("fs.default.name", "hdfs://127.0.0.1:9000/");
    FileSystem dfs = FileSystem.get(config);

    // Creating a RAMDirectory (memory) object, to be able to hold the index in memory.
    RAMDirectory rdir = new RAMDirectory();

    // Getting the list of index files present in the directory into an array.
    Path pth = new Path(dfs.getWorkingDirectory() + Index_DIR);
    FileSystemDirectory fsdir = new FileSystemDirectory(dfs, pth, false, config);
    String filelst[] = fsdir.list();
    FSDataInputStream filereader = null;
    for (int i = 0; i < filelst.length; i++)
    {
        // Reading data from the index files in the HDFS directory into the filereader object.
        filereader = dfs.open(new Path(dfs.getWorkingDirectory() + Index_DIR + filelst[i]));

        int size = filereader.available();

        // Reading data from the file into a byte array.
        byte[] bytarr = new byte[size];
        filereader.read(bytarr, 0, size);

        // Creating a file in the RAM directory with the same name as the
        // index file present in the HDFS directory.
        IndexOutput indxout = rdir.createOutput(filelst[i]);

        // Writing data from the byte array to the file in the RAM directory.
        indxout.writeBytes(bytarr, bytarr.length);
        indxout.flush();
        indxout.close();
    }
    filereader.close();

    Now that we have all the required index files in the RAM directory (in memory), we can perform searches on those index files directly. The search code is similar to the code used for searching on the local file system; the only change is that the searcher object is now created using the RAMDirectory object (rdir) instead of the local file system directory path:

    Searcher searcher = new IndexSearcher(rdir);
    Analyzer analyzer = new StandardAnalyzer();

    System.out.println("Total Documents = " + searcher.maxDoc());

    QueryParser parser = new QueryParser("time", analyzer);

    // The colons must be escaped in the query text (and the backslashes doubled
    // in the Java string literal) so the query parser does not treat them as
    // field separators.
    Query query = parser.parse("02\\:24\\:04");

    Hits hits = searcher.search(query);

    System.out.println("Number of matching documents = " + hits.length());

    for (int i = 0; i < hits.length(); i++)
    {
        Document doc = hits.doc(i);
        System.out.println(doc.get("date") + " " + doc.get("time") + " " +
            doc.get("cs-method") + " " + doc.get("cs-uri") + " " + doc.get("sc-status") + " " + doc.get("time-taken"));
    }

    In the output below, the search is on the field "time", and the text searched for within the "time" field is "02:24:04". So when the code is run, all the documents (or rows) whose "time" field contains "02:24:04" are displayed in the output:

    Total Documents = 11
    Number of matching documents = 4
    2010-04-21 02:24:04 GET /blank 304 233
    2010-04-21 02:24:04 GET /blank 500 567
    2010-04-21 02:24:04 GET /blank 200 897
    2010-04-21 02:24:04 POST /blank 200 567

    Conclusion

    Distributed file systems like HDFS are a powerful tool for storing and accessing the vast amounts of data available to us today. With memory-based indexing and searching, getting to the data you really want to find amid the mountains of data you don't care about becomes a little bit easier.
