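When converting Chinese characters to pinyin, HanLP resolves polyphonic characters and can output the tone, initial (声母, shengmu), and final (韵母, yunmu) of each syllable. It does this with a model trained from a corpus.
The code in this article comes from HanLP-1.7.5, the version that accompanies the book 《自然语言处理入门》 (Introduction to Natural Language Processing).
In HanLP, both Chinese-to-pinyin conversion and Simplified/Traditional conversion rely on the double-array trie (Double-array Trie) and on the Aho-Corasick DoubleArrayTrie algorithm (ACDAT), an AC automaton built on a double-array trie. Familiarity with both is assumed here.
Converting 重载不是重任 to pinyin gives the result below. The labels read, in order: original text, pinyin with tone numbers, pinyin with tone marks, pinyin without tones, tone, initial, final, input-method head.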
    原文:重载不是重任
    拼音(数字音调):chong2,zai3,bu2,shi4,zhong4,ren4,
    拼音(符号音调):chóng,zǎi,bú,shì,zhòng,rèn,
    拼音(无音调):chong,zai,bu,shi,zhong,ren,
    声调:2,3,2,4,4,4,
    声母:ch,z,b,sh,zh,r,
    韵母:ong,ai,u,i,ong,en,
    输入法头:ch,z,b,sh,zh,r,
Corpus
pinyin.txt
    一丁点儿=yi1,ding1,dian3,er5
    一不小心=yi1,bu4,xiao3,xin1
    一丘之貉=yi1,qiu1,zhi1,he2
    一丝不差=yi4,si1,bu4,cha1
    一丝不苟=yi1,si1,bu4,gou3
    一个=yi1,ge4
    一个半个=yi1,ge4,ban4,ge4
    一个巴掌拍不响=yi1,ge4,ba1,zhang3,pai1,bu4,xiang3
    一个萝卜一个坑=yi1,ge4,luo2,bo5,yi1,ge4,keng1
    一举两得=yi1,ju3,liang3,de2
    一之为甚=yi1,zhi1,wei2,shen4
Training the model
Training generates pinyin.txt.bin.
Loading the corpus
HanLP-1.7.5\src\main\java\com\hankcs\hanlp\corpus\dictionary\SimpleDictionary.java
The corpus is read line by line. Each line is split on `=`, and the resulting entry is put into the dictionary's trie.
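To make the line format concrete, here is a minimal sketch of the splitting step in plain Java. It is not HanLP's SimpleDictionary code: the class name `PinyinCorpusSketch` and its methods are hypothetical, and a `TreeMap` stands in for the dictionary's trie.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of HanLP: parses "word=syllable,syllable,..." lines.
final class PinyinCorpusSketch {

    // Splits one corpus line at the first '=' into the word and its syllables.
    static Map.Entry<String, String[]> parseLine(String line) {
        int eq = line.indexOf('=');
        if (eq <= 0) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        String word = line.substring(0, eq);
        String[] syllables = line.substring(eq + 1).split(",");
        return new AbstractMap.SimpleImmutableEntry<>(word, syllables);
    }

    // Collects all non-empty lines into a sorted map (word -> syllables).
    // The TreeMap is an assumption standing in for the dictionary trie.
    static TreeMap<String, String[]> load(List<String> lines) {
        TreeMap<String, String[]> map = new TreeMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            Map.Entry<String, String[]> entry = parseLine(line);
            map.put(entry.getKey(), entry.getValue());
        }
        return map;
    }

    public static void main(String[] args) {
        TreeMap<String, String[]> map = load(Arrays.asList(
                "一个=yi1,ge4",
                "一丁点儿=yi1,ding1,dian3,er5"));
        for (Map.Entry<String, String[]> e : map.entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

Splitting at the first `=` only (rather than calling `split("=")`) keeps the sketch safe if a value ever contained the separator; whether HanLP does the same is not shown in this article.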
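For each syllable on the right-hand side of `=`, a call such as Pinyin.valueOf("yi1") returns the matching enum constant, which carries the initial, the final, the tone, the string form with a tone mark, and the string form without a tone.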
    public enum Pinyin {
        a1(Shengmu.none, Yunmu.a, 1, "ā", "a", Head.a, 'a'),
        a2(Shengmu.none, Yunmu.a, 2, "á", "a", Head.a, 'a'),
        a3(Shengmu.none, Yunmu.a, 3, "ǎ", "a", Head.a, 'a'),
        a4(Shengmu.none, Yunmu.a, 4, "à", "a", Head.a, 'a'),
        a5(Shengmu.none, Yunmu.a, 5, "a", "a", Head.a, 'a'),
        ai1(Shengmu.none, Yunmu.ai, 1, "āi", "ai", Head.a, 'a'),
        ai2(Shengmu.none, Yunmu.ai, 2, "ái", "ai", Head.a, 'a'),
        ai3(Shengmu.none, Yunmu.ai, 3, "ǎi", "ai", Head.a, 'a'),
        ai4(Shengmu.none, Yunmu.ai, 4, "ài", "ai", Head.a, 'a'),
        ......
    }
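The lookup by name relies on the standard `valueOf` method that every Java enum gets. The sketch below shows the idea with a tiny stand-in enum; `MiniPinyin`, its three constants, its string-typed fields, and `parseOrNone` are all hypothetical and much simpler than HanLP's `Pinyin`, which uses the `Shengmu`, `Yunmu`, and `Head` types shown above.

```java
// Hypothetical miniature of the Pinyin enum; field values are illustrative only.
enum MiniPinyin {
    yi1("y", "i", 1, "yī", "yi"),
    ge4("g", "e", 4, "gè", "ge"),
    none5("none", "none", 5, "none", "none");

    final String shengmu;      // initial
    final String yunmu;        // final
    final int tone;            // tone number, 1 to 5
    final String withToneMark; // e.g. "yī"
    final String withoutTone;  // e.g. "yi"

    MiniPinyin(String shengmu, String yunmu, int tone,
               String withToneMark, String withoutTone) {
        this.shengmu = shengmu;
        this.yunmu = yunmu;
        this.tone = tone;
        this.withToneMark = withToneMark;
        this.withoutTone = withoutTone;
    }

    // Enum.valueOf throws IllegalArgumentException for unknown names,
    // so this helper falls back to none5 instead (an assumption of this sketch).
    static MiniPinyin parseOrNone(String name) {
        try {
            return MiniPinyin.valueOf(name);
        } catch (IllegalArgumentException e) {
            return none5;
        }
    }
}

final class MiniPinyinDemo {
    public static void main(String[] args) {
        MiniPinyin p = MiniPinyin.valueOf("yi1");
        System.out.println(p.shengmu + " " + p.yunmu + " " + p.tone
                + " " + p.withToneMark + " " + p.withoutTone);
        System.out.println(MiniPinyin.parseOrNone("xx9"));
    }
}
```

Because each constant already holds every derived form, no string manipulation is needed at conversion time: one name lookup yields all the fields.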
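Training the model
The map is built into a double-array trie with `trie.build(map)`. For background, see the article "HanLP: Double-array Trie (双数组字典树) implementation principles, with code and diagrams".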
    public void build(TreeMap<String, V> map) {
        // Keep the values
        v = (V[]) map.values().toArray();
        l = new int[v.length];
        Set<String> keySet = map.keySet();
        // Build the binary trie
        addAllKeyword(keySet);
        // Build the double-array trie on top of the binary trie
        buildDoubleArrayTrie(keySet);
        used = null;
        // Build the failure table and merge the output table
        constructFailureStates();
        rootState = null;
        loseWeight();
    }
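The first lines of build() only work because a `TreeMap` iterates its keys and its values in the same sorted order, so the i-th key always pairs with `v[i]`. The sketch below illustrates that alignment. It assumes that `l[i]` holds the length of the i-th key, which is my reading of the snippet rather than something it states; the class `SortedValueTableSketch` and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeMap;

// Hypothetical illustration of the value table and key-length table.
final class SortedValueTableSketch {

    // Values in sorted-key order, analogous to v = map.values().toArray().
    static String[] valueTable(TreeMap<String, String> map) {
        return map.values().toArray(new String[0]);
    }

    // Length of each key in the same order (assumed meaning of the array l).
    static int[] keyLengths(TreeMap<String, String> map) {
        int[] lengths = new int[map.size()];
        int i = 0;
        for (String key : map.keySet()) {
            lengths[i++] = key.length();
        }
        return lengths;
    }

    static TreeMap<String, String> sample() {
        TreeMap<String, String> map = new TreeMap<>();
        map.put("一个", "yi1,ge4");
        map.put("一丁点儿", "yi1,ding1,dian3,er5");
        map.put("一不小心", "yi1,bu4,xiao3,xin1");
        return map;
    }

    public static void main(String[] args) {
        TreeMap<String, String> map = sample();
        System.out.println(map.keySet());
        System.out.println(Arrays.toString(valueTable(map)));
        System.out.println(Arrays.toString(keyLengths(map)));
    }
}
```

With this layout a match only needs to report an integer index: the value comes from the value table and the start of the match can be recovered from the key length.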
Saving the model
The model file is generated by calling saveDat(path, trie, map.entrySet());
    static boolean saveDat(String path, AhoCorasickDoubleArrayTrie<Pinyin[]> trie, Set<Map.Entry<String, Pinyin[]>> entrySet) {
        try {
            DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(path + Predefine.BIN_EXT)));
            // Number of entries, then each value as a length-prefixed list of enum ordinals
            out.writeInt(entrySet.size());
            for (Map.Entry<String, Pinyin[]> entry : entrySet) {
                Pinyin[] value = entry.getValue();
                out.writeInt(value.length);
                for (Pinyin pinyin : value) {
                    out.writeInt(pinyin.ordinal());
                }
            }
            // The trie arrays follow the values
            trie.save(out);
            out.close();
        } catch (Exception e) {
            // Message reads: "failed to cache value dat <path>"
            logger.warning("缓存值dat" + path + "失败");
            return false;
        }
        return true;
    }
    /**
     * Persist the trie.
     *
     * @param out a DataOutputStream
     * @throws Exception possible I/O errors and the like
     */
    public void save(DataOutputStream out) throws Exception {
        out.writeInt(size);
        for (int i = 0; i < size; i++) {
            out.writeInt(base[i]);
            out.writeInt(check[i]);
            out.writeInt(fail[i]);
            int output[] = this.output[i];
            if (output == null) {
                out.writeInt(0);
            } else {
                out.writeInt(output.length);
                for (int o : output) {
                    out.writeInt(o);
                }
            }
        }
        out.writeInt(l.length);
        for (int length : l) {
            out.writeInt(length);
        }
    }
Prediction
Loading the model
    // path = data/dictionary/pinyin/pinyin.txt
    static boolean loadDat(String path) {
        ByteArray byteArray = ByteArray.createByteArray(path + Predefine.BIN_EXT);
        if (byteArray == null) return false;
        int size = byteArray.nextInt();
        Pinyin[][] valueArray = new Pinyin[size][];
        for (int i = 0; i < valueArray.length; ++i) {
            int length = byteArray.nextInt();
            valueArray[i] = new Pinyin[length];
            for (int j = 0; j < length; ++j) {
                // Each stored int is an enum ordinal
                valueArray[i][j] = pinyins[byteArray.nextInt()];
            }
        }
        if (!trie.load(byteArray, valueArray)) return false;
        return true;
    }

    // Called above as trie.load(byteArray, valueArray)
    public boolean load(ByteArray byteArray, V[] value) {
        if (byteArray == null) return false;
        size = byteArray.nextInt();
        base = new int[size + 65535]; // reserve extra room to avoid out-of-bounds access
        check = new int[size + 65535];
        fail = new int[size + 65535];
        output = new int[size + 65535][];
        int length;
        for (int i = 0; i < size; ++i) {
            base[i] = byteArray.nextInt();
            check[i] = byteArray.nextInt();
            fail[i] = byteArray.nextInt();
            length = byteArray.nextInt();
            if (length == 0) continue;
            output[i] = new int[length];
            for (int j = 0; j < output[i].length; ++j) {
                output[i][j] = byteArray.nextInt();
            }
        }
        length = byteArray.nextInt();
        l = new int[length];
        for (int i = 0; i < l.length; ++i) {
            l[i] = byteArray.nextInt();
        }
        v = value;
        return true;
    }
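The value section of the file (an entry count, then each value as a length followed by enum ordinals) can be reproduced with the JDK alone. The following round-trip sketch uses `DataOutputStream` and `DataInputStream` over an in-memory buffer; the enum `Syl` and the class `OrdinalCodecSketch` are hypothetical stand-ins, and HanLP itself reads through its own `ByteArray` class rather than `DataInputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical round trip of the "count, length, ordinals" layout.
final class OrdinalCodecSketch {

    enum Syl { yi1, ge4, ban4 } // stand-in for the Pinyin enum

    private static final Syl[] SYLS = Syl.values();

    // Writes: entry count, then for each entry its length and the ordinals.
    static byte[] encode(Syl[][] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.length);
            for (Syl[] value : values) {
                out.writeInt(value.length);
                for (Syl s : value) {
                    out.writeInt(s.ordinal());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the same layout back, mapping each ordinal to its constant.
    static Syl[][] decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            Syl[][] values = new Syl[in.readInt()][];
            for (int i = 0; i < values.length; i++) {
                values[i] = new Syl[in.readInt()];
                for (int j = 0; j < values[i].length; j++) {
                    values[i][j] = SYLS[in.readInt()];
                }
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Syl[][] original = {
                {Syl.yi1, Syl.ge4},
                {Syl.yi1, Syl.ge4, Syl.ban4, Syl.ge4}
        };
        byte[] data = encode(original);
        System.out.println(data.length + " bytes");
        System.out.println(Arrays.deepToString(decode(data)));
    }
}
```

Storing ordinals keeps the file compact, but it ties the file to the declaration order of the enum: if constants are added or reordered, a previously written file decodes to different values and has to be regenerated.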
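Computation
The pinyin of the input characters is found with the AC automaton built on a double-array trie; see the article "HanLP: Aho-Corasick DoubleArrayTrie algorithm (ACDAT), an AC automaton based on a double-array trie".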
    // HanLP-1.7.5\src\main\java\com\hankcs\hanlp\dictionary\py\PinyinDictionary.java
    protected static List<Pinyin> segLongest(char[] charArray, AhoCorasickDoubleArrayTrie<Pinyin[]> trie, boolean remainNone) {
        final Pinyin[][] wordNet = new Pinyin[charArray.length][];
        trie.parseText(charArray, new AhoCorasickDoubleArrayTrie.IHit<Pinyin[]>() {
            @Override
            public void hit(int begin, int end, Pinyin[] value) {
                int length = end - begin;
                if (wordNet[begin] == null || length > wordNet[begin].length) {
                    wordNet[begin] = length == 1 ? new Pinyin[]{value[0]} : value;
                }
            }
        });
        List<Pinyin> pinyinList = new ArrayList<Pinyin>(charArray.length);
        for (int offset = 0; offset < wordNet.length; ) {
            if (wordNet[offset] == null) {
                if (remainNone) {
                    pinyinList.add(Pinyin.none5);
                }
                ++offset;
                continue;
            }
            for (Pinyin pinyin : wordNet[offset]) {
                pinyinList.add(pinyin);
            }
            offset += wordNet[offset].length;
        }
        return pinyinList;
    }
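The longest-match logic in segLongest can be tried without HanLP. The sketch below keeps the same `wordNet` bookkeeping but swaps the automaton for a brute-force scan over all substrings against a `HashMap`, which is far slower than ACDAT and is used only to keep the example self-contained. The class name, the string syllables, and the small dictionary (including the two readings listed for 重) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: same selection rule as segLongest, no automaton.
final class LongestMatchSketch {

    static List<String> segLongest(String text, Map<String, String[]> dict, boolean remainNone) {
        String[][] wordNet = new String[text.length()][];
        // Brute-force replacement for trie.parseText: report every dictionary hit.
        for (int begin = 0; begin < text.length(); begin++) {
            for (int end = begin + 1; end <= text.length(); end++) {
                String[] value = dict.get(text.substring(begin, end));
                if (value == null) {
                    continue;
                }
                int length = end - begin;
                // Keep only the longest hit starting at this position.
                if (wordNet[begin] == null || length > wordNet[begin].length) {
                    // A single character keeps just its first listed reading.
                    wordNet[begin] = length == 1 ? new String[]{value[0]} : value;
                }
            }
        }
        List<String> result = new ArrayList<>(text.length());
        for (int offset = 0; offset < wordNet.length; ) {
            if (wordNet[offset] == null) {
                if (remainNone) {
                    result.add("none5"); // placeholder for unknown characters
                }
                offset++;
                continue;
            }
            result.addAll(Arrays.asList(wordNet[offset]));
            offset += wordNet[offset].length; // jump past the matched word
        }
        return result;
    }

    // Illustrative entries only; not copied from pinyin.txt.
    static Map<String, String[]> sampleDict() {
        Map<String, String[]> dict = new HashMap<>();
        dict.put("重", new String[]{"zhong4", "chong2"});
        dict.put("载", new String[]{"zai3"});
        dict.put("不", new String[]{"bu4"});
        dict.put("是", new String[]{"shi4"});
        dict.put("任", new String[]{"ren4"});
        dict.put("重载", new String[]{"chong2", "zai3"});
        dict.put("不是", new String[]{"bu2", "shi4"});
        dict.put("重任", new String[]{"zhong4", "ren4"});
        return dict;
    }

    public static void main(String[] args) {
        System.out.println(segLongest("重载不是重任", sampleDict(), true));
    }
}
```

With these sample entries, 重 is read as chong2 inside 重载 and as zhong4 inside 重任, because the two-character words win over the single-character entry at the same start position. That is how longest matching resolves the polyphone.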
Usage
    public static void main(String[] args) {
        String text = "重载不是重任";
        List<Pinyin> pinyinList = HanLP.convertToPinyinList(text);

        // Original text
        System.out.print("原文:");
        for (char c : text.toCharArray()) {
            System.out.printf("%c", c);
        }
        System.out.println();

        // Pinyin with tone numbers
        System.out.print("拼音(数字音调):");
        for (Pinyin pinyin : pinyinList) {
            System.out.printf("%s,", pinyin);
        }
        System.out.println();

        // Pinyin with tone marks
        System.out.print("拼音(符号音调):");
        for (Pinyin pinyin : pinyinList) {
            System.out.printf("%s,", pinyin.getPinyinWithToneMark());
        }
        System.out.println();

        // Pinyin without tones
        System.out.print("拼音(无音调):");
        for (Pinyin pinyin : pinyinList) {
            System.out.printf("%s,", pinyin.getPinyinWithoutTone());
        }
        System.out.println();

        // Tone
        System.out.print("声调:");
        for (Pinyin pinyin : pinyinList) {
            System.out.printf("%s,", pinyin.getTone());
        }
        System.out.println();

        // Initial (shengmu)
        System.out.print("声母:");
        for (Pinyin pinyin : pinyinList) {
            System.out.printf("%s,", pinyin.getShengmu());
        }
        System.out.println();

        // Final (yunmu)
        System.out.print("韵母:");
        for (Pinyin pinyin : pinyinList) {
            System.out.printf("%s,", pinyin.getYunmu());
        }
        System.out.println();

        // Input-method head
        System.out.print("输入法头:");
        for (Pinyin pinyin : pinyinList) {
            System.out.printf("%s,", pinyin.getHead());
        }
        System.out.println();
    }
Output:
    原文:重载不是重任
    拼音(数字音调):chong2,zai3,bu2,shi4,zhong4,ren4,
    拼音(符号音调):chóng,zǎi,bú,shì,zhòng,rèn,
    拼音(无音调):chong,zai,bu,shi,zhong,ren,
    声调:2,3,2,4,4,4,
    声母:ch,z,b,sh,zh,r,
    韵母:ong,ai,u,i,ong,en,
    输入法头:ch,z,b,sh,zh,r,