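Introduction
Ansj is a Java implementation of Chinese word segmentation based on n-Gram + CRF + HMM. Its segmentation accuracy can exceed 96%, which makes it suitable for natural language processing and other scenarios that demand high segmentation quality.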
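Supported features:
- Chinese word segmentation
- Chinese personal name recognition
- User-defined dictionaries
- Keyword extraction
- Automatic summarization
- Keyword tagging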
GitHub: https://github.com/NLPchina/ansj_seg
Documentation: https://github.com/NLPchina/ansj_seg/wiki
Usage
Maven dependencies
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.ansj/ansj_seg -->
    <dependency>
        <groupId>org.ansj</groupId>
        <artifactId>ansj_seg</artifactId>
        <version>5.1.6</version>
    </dependency>
    <dependency>
        <groupId>org.nlpcn</groupId>
        <artifactId>nlp-lang</artifactId>
        <version>1.7.9</version>
    </dependency>
    <dependency>
        <groupId>org.postgresql</groupId>
        <artifactId>postgresql</artifactId>
        <version>42.5.0</version>
    </dependency>
</dependencies>
Code examples
Loading a dictionary from a file
package com.example.ansjseg;

import org.ansj.domain.Result;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.DicAnalysis;
import org.ansj.splitWord.analysis.NlpAnalysis;
import org.nlpcn.commons.lang.tire.domain.Forest;
import org.nlpcn.commons.lang.tire.library.Library;

import java.net.URL;
import java.util.List;

public class Test {
    public static void main(String[] args) {
        try {
            // Print where the dictionary was found on the classpath
            URL resource = Test.class.getResource("/library/default.dic");
            System.err.println(resource.getPath());

            // Load the dictionary file into a Forest
            Forest forest = Library.makeForest(Test.class.getResourceAsStream("/library/default.dic"));

            String str = "献血者预约献血时间";

            // Dictionary-first segmentation, passing in the custom forest
            Result result = DicAnalysis.parse(str, forest);
            List<Term> termList = result.getTerms();
            for (Term term : termList) {
                System.out.println(term.getName() + ":" + term.getNatureStr());
            }

            System.err.println("-------------------");

            // NLP segmentation, passing in the same forest
            result = NlpAnalysis.parse(str, forest);
            termList = result.getTerms();
            for (Term term : termList) {
                System.out.println(term.getName() + ":" + term.getNatureStr());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
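The contents of default.dic are not shown in this post. As a rough sketch, assuming each line holds a word, a part-of-speech tag, and a frequency separated by tabs (the same three columns, name, nature, and freq, that the JDBC query in the next example selects), entries could be read as follows. The class DicLineSketch, the Entry type, and the sample values are hypothetical and not part of Ansj.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DicLineSketch {

    /** One dictionary entry: word, part-of-speech tag, frequency. */
    static final class Entry {
        final String name;
        final String nature;
        final int freq;

        Entry(String name, String nature, int freq) {
            this.name = name;
            this.nature = nature;
            this.freq = freq;
        }
    }

    /**
     * Parses tab-separated lines (assumed layout: name, nature, freq).
     * Lines that do not have exactly three fields, or whose frequency
     * is not an integer, are skipped.
     */
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\t");
            if (parts.length != 3) {
                continue;
            }
            try {
                entries.add(new Entry(parts[0], parts[1], Integer.parseInt(parts[2].trim())));
            } catch (NumberFormatException e) {
                // Malformed frequency: ignore this line
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines; the real default.dic is not shown in this post
        List<String> lines = Arrays.asList("献血者\tn\t1000", "预约\tv\t1000", "", "broken line");
        for (Entry e : parse(lines)) {
            System.out.println(e.name + ":" + e.nature + ":" + e.freq);
        }
    }
}
```

Checking a dictionary file with a small parser like this before loading it can help spot lines whose columns are missing or out of order.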
Loading a dictionary over JDBC
package com.example.ansjseg;

import org.ansj.dic.PathToStream;
import org.ansj.domain.Result;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.DicAnalysis;
import org.ansj.splitWord.analysis.NlpAnalysis;
import org.nlpcn.commons.lang.tire.domain.Forest;
import org.nlpcn.commons.lang.tire.library.Library;

import java.util.List;

public class Test {
    public static void main(String[] args) {
        try {
            // Load the dictionary from a database table; the path has the form
            // jdbc://<JDBC URL>|<user>|<password>|<SQL query>
            Forest forest = Library.makeForest(PathToStream.stream(
                    "jdbc://jdbc:postgresql://127.0.0.1:5432/myapp|postgres|123456|select name as name,nature,freq from dic_table"));

            String str = "献血者预约献血时间";

            // Dictionary-first segmentation, passing in the custom forest
            Result result = DicAnalysis.parse(str, forest);
            List<Term> termList = result.getTerms();
            for (Term term : termList) {
                System.out.println(term.getName() + ":" + term.getNatureStr());
            }

            System.err.println("-------------------");

            // NLP segmentation, passing in the same forest
            result = NlpAnalysis.parse(str, forest);
            termList = result.getTerms();
            for (Term term : termList) {
                System.out.println(term.getName() + ":" + term.getNatureStr());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
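The path string passed to PathToStream.stream in the example above packs four pieces of information into one value. Judging only from that string, the layout appears to be a jdbc:// prefix followed by the JDBC URL, user name, password, and SQL query separated by | characters. The sketch below splits such a string so that each part can be checked on its own. JdbcPathSketch and JdbcDicSource are hypothetical names, and this is not Ansj's own parsing code.

```java
public class JdbcPathSketch {

    static final String PREFIX = "jdbc://";

    /** The four parts assumed to be packed into the path string. */
    static final class JdbcDicSource {
        final String url;
        final String user;
        final String password;
        final String sql;

        JdbcDicSource(String url, String user, String password, String sql) {
            this.url = url;
            this.user = user;
            this.password = password;
            this.sql = sql;
        }
    }

    /** Splits "jdbc://<url>|<user>|<password>|<sql>" into its parts. */
    static JdbcDicSource parse(String path) {
        if (!path.startsWith(PREFIX)) {
            throw new IllegalArgumentException("path must start with " + PREFIX);
        }
        // A limit of 4 keeps any further '|' characters inside the SQL part.
        // That is a choice made for this sketch, not confirmed Ansj behavior.
        String[] parts = path.substring(PREFIX.length()).split("\\|", 4);
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected <url>|<user>|<password>|<sql>");
        }
        return new JdbcDicSource(parts[0], parts[1], parts[2], parts[3]);
    }

    public static void main(String[] args) {
        JdbcDicSource source = parse(
                "jdbc://jdbc:postgresql://127.0.0.1:5432/myapp|postgres|123456|select name as name,nature,freq from dic_table");
        System.out.println("url  = " + source.url);
        System.out.println("user = " + source.user);
        System.out.println("sql  = " + source.sql);
    }
}
```

Since | acts as the separator, a password that contains | would presumably be misread under this layout.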
Output
13:49:13.709 [main] INFO org.ansj.util.MyStaticValue - init userLibrary to env value is : /library/default.dic
13:49:13.713 [main] INFO org.ansj.dic.impl.File2Stream - path to stream library/ambiguity.dic
13:49:13.714 [main] ERROR org.ansj.library.AmbiguityLibrary - Init ambiguity library error :org.ansj.exception.LibraryException: path :library/ambiguity.dic file:D:\pro\screw-demo\library\ambiguity.dic not found or can not to read, path: library/ambiguity.dic
13:49:13.715 [main] DEBUG org.ansj.library.DicLibrary - begin init dic !
13:49:13.715 [main] INFO org.ansj.dic.impl.File2Stream - path to stream library/default.dic
13:49:13.722 [main] INFO org.ansj.library.DicLibrary - load dic use time:7 path is : library/default.dic
13:49:14.389 [main] INFO org.ansj.library.DATDictionary - init core library ok use time : 580
13:49:14.749 [main] INFO org.ansj.library.NgramLibrary - init ngram ok use time :357
献血者预约:0
献血:78
时间:n
13:49:14.755 [main] DEBUG org.ansj.library.CrfLibrary - begin init crf model!
-------------------
13:49:15.883 [main] INFO org.ansj.app.crf.Model - load crf model ok ! use time :1125
13:49:15.883 [main] INFO org.ansj.library.CrfLibrary - load crf use time:1128 path is : jar://crf.model
献血者:n
预约:v
献血:v
时间:n

Process finished with exit code 0
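In the output above, DicAnalysis keeps 献血者预约 as a single term, while NlpAnalysis splits the same text into 献血者 and 预约. One way to build intuition for how a custom dictionary entry can change token boundaries is a toy forward longest-match segmenter. This is a deliberately simplified stand-in and not the algorithm Ansj uses (which combines n-Gram, CRF, and HMM). The class LongestMatchSketch and both dictionaries are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LongestMatchSketch {

    /**
     * Forward longest match: at each position take the longest dictionary
     * word that starts there, falling back to a single character.
     */
    static List<String> segment(String text, Set<String> dictionary) {
        int maxLen = 1;
        for (String word : dictionary) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink the window until it matches a word or is one character long
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "献血者预约献血时间";

        // Hypothetical dictionary that contains the long custom entry
        Set<String> withCustomWord = new HashSet<>(Arrays.asList("献血者预约", "献血", "时间"));
        // Hypothetical dictionary with only the shorter words
        Set<String> withoutCustomWord = new HashSet<>(Arrays.asList("献血者", "预约", "献血", "时间"));

        System.out.println(segment(text, withCustomWord));    // [献血者预约, 献血, 时间]
        System.out.println(segment(text, withoutCustomWord)); // [献血者, 预约, 献血, 时间]
    }
}
```

With the long entry present, the toy segmenter produces the same three tokens as the DicAnalysis output. Without it, the result has the same four tokens as the NlpAnalysis output, although Ansj arrives at its results by different means.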