Stanford NLP tools include three tools for processing Chinese, among them a word segmenter and a parser.
For details, see:
http://nlp.stanford.edu/software/parser-faq.shtml#o
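1. Chinese word segmentation
The segmenter package is fairly large and needs a lot of memory at run time, so when running it from Eclipse you have to raise the JVM's maximum heap size:
Run -> Arguments -> VM arguments -> -Xmx800m (maximum heap size of 800 MB)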
demo
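To confirm that the -Xmx setting actually reached the JVM, a small check like the one below can be run with the same launch configuration. This is only a sketch; the class name `HeapCheck` is made up for illustration and is not part of the Stanford tools.

```java
// Minimal check that the -Xmx setting reached the JVM.
// HeapCheck is a hypothetical helper name, not a Stanford class.
class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // With -Xmx800m this should report a value in the region of 800;
        // the exact figure depends on the JVM and its garbage collector.
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

If the reported value is still the JVM default, the VM argument was most likely added to a different run configuration.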
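2. Dependency parsing
See http://nlp.stanford.edu/software/parser-faq.shtml#o
http://blog.csdn.NET/leeharry/archive/2008/03/06/2153583.aspx
Depending on the trained model that is loaded, the parser can handle English or Chinese. The input is an already segmented sentence; the output is the part-of-speech tags and the sentence's parse tree (with its dependency relations).
English demo (included in the downloaded archive):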
    import java.util.Arrays;
    import java.util.Collection;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
    import edu.stanford.nlp.trees.*;

    // Load the English PCFG model and set the parser options
    LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
    lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});
    // The input sentence is already tokenized
    String[] sent = { "This", "is", "an", "easy", "sentence", "." };
    Tree parse = (Tree) lp.apply(Arrays.asList(sent));
    parse.pennPrint();
    System.out.println();
    // Convert the phrase-structure tree into typed dependencies
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection tdl = gs.typedDependenciesCollapsed();
    System.out.println(tdl);
    System.out.println();
    // Print the tree and the dependencies in one go
    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
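The tree is printed in Penn Treebank bracket notation, where every leaf appears as "(TAG word)". As a rough illustration of how the part-of-speech tags can be read back out of such a string, here is a standard-library-only sketch. The class name `PennLeaves` and the sample tree are hand-written for this example and are not taken from actual parser output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// PennLeaves is a hypothetical helper, not part of the Stanford parser.
class PennLeaves {
    // A leaf looks like "(TAG word)": two tokens and no nested parentheses.
    private static final Pattern PRETERMINAL =
            Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    /** Returns "word/TAG" for every leaf, in sentence order. */
    static List<String> wordsAndTags(String pennTree) {
        List<String> result = new ArrayList<>();
        Matcher m = PRETERMINAL.matcher(pennTree);
        while (m.find()) {
            result.add(m.group(2) + "/" + m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written sample in Penn bracket notation
        String tree = "(ROOT (S (NP (DT This)) (VP (VBZ is) "
                + "(NP (DT an) (JJ easy) (NN sentence))) (. .)))";
        System.out.println(wordsAndTags(tree));
    }
}
```

Inside a real program it is simpler to work with the Tree object directly; the string approach is only useful when nothing but the printed output is available.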
For Chinese, the parser demo differs slightly:
    // Additionally needs:
    // import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;

    // English model, replaced by the Chinese (Xinhua) model:
    //LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
    LexicalizedParser lp = new LexicalizedParser("xinhuaFactored.ser.gz");
    //lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});
    //String[] sent = { "This", "is", "an", "easy", "sentence", "." };
    // The input must already be segmented into words
    String[] sent = { "他", "和", "我", "在", "学校", "里", "常", "打", "桌球", "。" };
    String sentence = "他和我在学校里常打台球。";
    Tree parse = (Tree) lp.apply(Arrays.asList(sent));
    //Tree parse = (Tree) lp.apply(sentence);
    parse.pennPrint();
    System.out.println();
    /*
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection tdl = gs.typedDependenciesCollapsed();
    System.out.println(tdl);
    System.out.println();
    */
    // English only:
    //TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    // Chinese: pass a ChineseTreebankLanguagePack
    TreePrint tp = new TreePrint("wordsAndTags,penn,typedDependenciesCollapsed", new ChineseTreebankLanguagePack());
    tp.printTree(parse);
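Sometimes, however, printing the dependency relations is not enough and we want the dependency structure (graph) itself as objects. In that case use a program like the following: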
    String[] sent = { "他", "和", "我", "在", "学校", "里", "常", "打", "桌球", "。" };
    // ParserSentence is not a Stanford class; presumably it is the author's
    // own wrapper around LexicalizedParser (not shown here).
    ParserSentence ps = new ParserSentence();
    Tree parse = ps.parserSentence(sent);
    parse.pennPrint();
    TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection tdl = gs.typedDependenciesCollapsed();
    System.out.println(tdl);
    System.out.println();
    // Each element is a TypedDependency(GrammaticalRelation reln, TreeGraphNode gov, TreeGraphNode dep)
    for (Object o : tdl) {
        TypedDependency td = (TypedDependency) o;
        System.out.println(td.toString());
    }
// The GrammaticalStructure method getGrammaticalRelation(TreeGraphNode gov, TreeGraphNode dep) returns the grammatical dependency relation between two words.
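Each printed dependency has the form reln(governor-index, dependent-index). When only that printed text is at hand (for example, copied from a log), it can be split back into its parts with a regular expression. The sketch below uses only the standard library; the class `DependencyLine` and the sample line are made up for illustration, and it assumes plain numeric indices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// DependencyLine is a hypothetical holder class, not a Stanford type.
class DependencyLine {
    final String relation;
    final String governor;
    final int governorIndex;
    final String dependent;
    final int dependentIndex;

    // relation "(" governor "-" index ", " dependent "-" index ")"
    private static final Pattern FORMAT =
            Pattern.compile("([^(]+)\\((.+)-(\\d+), (.+)-(\\d+)\\)");

    private DependencyLine(String relation, String governor, int governorIndex,
                           String dependent, int dependentIndex) {
        this.relation = relation;
        this.governor = governor;
        this.governorIndex = governorIndex;
        this.dependent = dependent;
        this.dependentIndex = dependentIndex;
    }

    /** Parses one line such as "nsubj(sentence-5, This-1)". */
    static DependencyLine parse(String line) {
        Matcher m = FORMAT.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a dependency line: " + line);
        }
        return new DependencyLine(m.group(1),
                m.group(2), Integer.parseInt(m.group(3)),
                m.group(4), Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        // Hand-written sample in the printed format
        DependencyLine d = parse("nsubj(sentence-5, This-1)");
        System.out.println(d.relation + ": " + d.governor + " -> " + d.dependent);
    }
}
```

When the TypedDependency objects themselves are available, reading the relation, governor and dependent from them directly is the better route; parsing text is a fallback.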
----------------------------------------------------------------------
I tried Stanford's dependency parsing tool online:
Chinese sentence as input ------> word segmentation --------> POS tagging --------> dependency parsing
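The three steps form a simple pipeline in which each stage consumes the output of the previous one. The sketch below shows only that chaining, using `java.util.function.Function`. All three stages are deliberately trivial placeholders invented for this example (splitting on spaces, a dummy "X" tag, attaching each word to the previous one); in a real setup they would be replaced by calls into the segmenter, tagger and parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// ParsePipeline and its stages are hypothetical placeholders that only
// illustrate the order of the steps; they do no real linguistic analysis.
class ParsePipeline {
    // Stage 1 placeholder: assumes the words are already separated by spaces.
    static final Function<String, List<String>> SEGMENT =
            text -> Arrays.asList(text.trim().split("\\s+"));

    // Stage 2 placeholder: gives every word the dummy tag "X".
    static final Function<List<String>, List<String>> TAG = words -> {
        List<String> tagged = new ArrayList<>();
        for (String w : words) {
            tagged.add(w + "/X");
        }
        return tagged;
    };

    // Stage 3 placeholder: links each word to the one before it.
    static final Function<List<String>, List<String>> PARSE = tagged -> {
        List<String> deps = new ArrayList<>();
        for (int i = 1; i < tagged.size(); i++) {
            deps.add("dep(" + tagged.get(i - 1) + ", " + tagged.get(i) + ")");
        }
        return deps;
    };

    // Segmentation, then tagging, then parsing.
    static final Function<String, List<String>> PIPELINE =
            SEGMENT.andThen(TAG).andThen(PARSE);

    public static void main(String[] args) {
        System.out.println(PIPELINE.apply("他 和 我"));
    }
}
```

Keeping each stage behind a `Function` makes it possible to swap in a real component for one step without touching the others.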