NLP Project (2): Spelling Correction

2023-05-08


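Preface

Text processed in NLP pipelines often contains misspelled words. Spelling correction detects these words and replaces them with the most likely intended ones.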

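1. Datasets

1-1. spell-errors.txt

This file lists correct words together with their common misspellings. Each line has the form "correct: mistake1, mistake2", for example "raining: rainning, raning".

1-2. vocab.txt

This file is the vocabulary, with one word per line.

1-3. testdata.txt

This file is the test set. Each line is tab-separated, and the sentence to be corrected is the third field.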
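To make the test-set format concrete, here is a minimal sketch that splits one line into fields, strips periods and commas, and flags words that are not in the vocabulary. The sample line is shortened from the example shown in the Part 5 comments, and the tiny `vocab` set is a hypothetical stand-in for vocab.txt.

```python
import re

# Hypothetical miniature vocabulary; the real one is loaded from vocab.txt.
vocab = {"Move", "against", "Japan", "might", "boost", "sentiment"}

# One tab-separated test line: two leading fields, then the sentence.
line = "1\t1\tMove against Japan might boost protectionst sentiment.\n"

fields = line.rstrip().split("\t")
sentence = fields[2]

# Remove periods and commas, then split on whitespace.
tokens = re.sub(r"[.,]", "", sentence).split()

# Words missing from the vocabulary are treated as spelling errors.
unknown = [t for t in tokens if t not in vocab]
print(unknown)
```

Only out-of-vocabulary words are considered for correction, so a misspelling that happens to be another valid word would pass through unnoticed.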

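2. Spelling Correction Code

Part 0: Building the vocabulary

import re
import numpy as np
# Build the vocabulary.
# Iterating over the file object yields one line (one word) at a time.
# A set comprehension collects the words and makes later membership tests fast.
with open('./vocab.txt', 'r', encoding='utf8') as f:
    word_dic = {word.rstrip() for word in f}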

Part 1: Generating all candidates

import string
def generate_candidates(word=''):
    """
    word: the (misspelled) input word
    Returns the set of all strings at edit distance 1 from word,
    produced by three operations:
    1. insert
    2. delete
    3. replace
    """
    # string.ascii_lowercase: all lowercase letters
    letters = string.ascii_lowercase
    # Split the word at every position into a (left, right) pair.
    # For 'abcd': [('', 'abcd'), ('a', 'bcd'), ('ab', 'cd'), ('abc', 'd'), ('abcd', '')]
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # insert: put every letter at every possible position
    inserts = [L + c + R for L, R in splits for c in letters]
    # delete: drop the first character of R (when R is non-empty)
    deletes = [L + R[1:] for L, R in splits if R]
    # replace: a delete and an insert at the same position
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    return set(inserts + deletes + replaces)
def generate_edit_two(word=''):
    """
    Given a string, return all strings at edit distance at most 2.
    """
    # Step 1: generate the candidates at edit distance 1.
    # Step 2: apply generate_candidates again to each of them.
    # The explicit loop version would be:
    #     all_lis = []
    #     for i in generate_candidates(word):
    #         all_lis.extend(generate_candidates(i))
    # The same thing written as a single set comprehension:
    return {j for i in generate_candidates(word) for j in generate_candidates(i)}
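As a quick check of what the three operations can and cannot reach, the sketch below re-implements the edit-distance-1 generator in compact form and probes a few cases. The names `edits1` and `edits2` are mine, chosen so the block runs on its own; the sample words are illustrative.

```python
import string

def edits1(word):
    """All strings one insert, delete, or replace away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = [L + c + R for L, R in splits for c in letters]
    deletes = [L + R[1:] for L, R in splits if R]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    return set(inserts + deletes + replaces)

def edits2(word):
    """All strings reachable by applying edits1 twice."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

print("apple" in edits1("appl"))    # missing letter: fixed by one insert
print("apple" in edits1("applee"))  # extra letter: fixed by one delete
print("the" in edits1("hte"))       # swapped letters: not reachable in one edit
print("the" in edits2("hte"))       # reachable with two replaces
```

Because there is no transpose operation, swapped adjacent letters (a very common typo) only show up in the distance-2 set. The distance-2 set is also much larger, which is why Part 5 only falls back to it when distance 1 yields nothing.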

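Part 2: Reading the corpus for the language model

# Tip: in Jupyter, Shift+Tab shows a function's documentation.
# Read a collection of sentences to build the language model from.
# The Reuters corpus ships with NLTK's data packages and has to be downloaded
# once beforehand, e.g. with nltk.download('reuters').
from nltk.corpus import reuters
# The categories contained in the corpus
categories = reuters.categories()
# corpus: a collection of sentences.
# Each sentence is a list of words: ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE']
corpus = reuters.sents(categories=categories)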

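Part 3: Building the bigram language model

# term_count: maps each word to the number of times it starts a bigram
term_count = {}
# bigram_count: maps each pair of adjacent words to its count
bigram_count = {}
for doc in corpus:
    # Prepend the start-of-sentence marker to every sentence
    doc = ['<s>'] + doc
    # Walk over every word that has a successor.
    # The last word of a sentence never starts a bigram, so it is not counted here.
    for i in range(len(doc) - 1):
        # term: the current word
        term = doc[i]
        # bigram: the current word and the word after it
        bigram = doc[i:i + 2]
        if term in term_count:
            term_count[term] += 1
        else:
            term_count[term] = 1
        # Join the two words into one string to use as the dictionary key
        bigram = ' '.join(bigram)
        if bigram in bigram_count:
            bigram_count[bigram] += 1
        else:
            bigram_count[bigram] = 1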
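To see how these counts turn into probabilities, here is a small sketch on a toy corpus with add-one (Laplace) smoothing. The three toy sentences, the tuple keys, and the helper name `bigram_logprob` are assumptions for illustration; the post's code uses space-joined string keys and takes V from `len(term_count)`, while this sketch counts every distinct token.

```python
import math
from collections import Counter

# Toy corpus standing in for the Reuters sentences (illustrative only).
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

term_count = Counter()    # how often a word appears as the first word of a bigram
bigram_count = Counter()  # how often a (previous, current) pair appears
vocab = set()

for sent in corpus:
    tokens = ["<s>"] + sent
    vocab.update(tokens)
    for prev, cur in zip(tokens, tokens[1:]):
        term_count[prev] += 1
        bigram_count[(prev, cur)] += 1

V = len(vocab)

def bigram_logprob(prev, cur):
    """Add-one smoothed log P(cur | prev)."""
    return math.log((bigram_count[(prev, cur)] + 1.0) / (term_count[prev] + V))

for nxt in ["cat", "dog", "sat"]:
    print(nxt, round(math.exp(bigram_logprob("the", nxt)), 3))
```

Smoothing keeps unseen pairs such as ("the", "sat") from getting probability zero, which would otherwise turn into log(0) when the scores are summed in log space.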

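Part 4: Building the dictionary of misspelling probabilities for each word

# How likely a user is to type a given misspelling: the channel probability
channel_prob = {}
# Open the spelling-error file
with open("./spell-errors.txt", 'r', encoding='utf8') as f:
    # Go through the file line by line
    for line in f:
        # Split on the colon:
        # 'raining: rainning, raning' becomes ['raining', ' rainning, raning\n']
        temp = line.split(":")
        # The correct word is the first part, with surrounding whitespace removed
        correct = temp[0].strip()
        # The misspellings are the second part, separated by commas
        mistakes = [sub_mis.strip() for sub_mis in temp[1].strip().split(",")]
        # Nested dictionary: the key is the correct word,
        # the value is a dictionary from misspelling to probability
        channel_prob[correct] = {}
        for mis in mistakes:
            # Each misspelling gets an equal share of the probability mass
            channel_prob[correct][mis] = 1.0 / len(mistakes)
# The result looks like this:
# {'raining': {'rainning': 0.5, 'raning': 0.5}}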
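The same parsing can be tried without the data file. This sketch feeds the sample line from the comment above through an in-memory file and adds a lookup helper with the same 0.00001 floor that Part 5 uses for unseen pairs; the helper name `channel_logprob` is hypothetical.

```python
import io
import math

# In-memory stand-in for spell-errors.txt, using the sample line from the post.
sample = io.StringIO("raining: rainning, raning\n")

channel_prob = {}
for line in sample:
    correct, _, rest = line.partition(":")
    mistakes = [m.strip() for m in rest.split(",") if m.strip()]
    # Uniform split: every listed misspelling is treated as equally likely.
    channel_prob[correct.strip()] = {m: 1.0 / len(mistakes) for m in mistakes}

def channel_logprob(mistake, correct, floor=0.00001):
    """log P(mistake | correct), falling back to a small floor for unseen pairs."""
    return math.log(channel_prob.get(correct, {}).get(mistake, floor))

print(channel_prob)
print(channel_logprob("raning", "raining") > channel_logprob("rainin", "raining"))
```

The uniform split is a simplification: the file records which misspellings occur but, as parsed here, not how often, so every listed variant of a word receives the same probability.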

Part 5: Correcting spelling on the test data

V = len(term_count)
# Open the test data
with open("./testdata.txt", 'r', encoding='utf8') as f:
    # Go through the file line by line
    for line in f:
        # Strip trailing whitespace and split the line on tabs
        items = line.rstrip().split('\t')
        # items:
        # ['1', '1', 'They told Reuter correspondents in Asian capitals a U.S.
        # Move against Japan might boost protectionst sentiment in the  U.S. And lead to curbs on
        # American imports of their products.']
        # The sentence is at index 2. Remove periods and commas, then split into words.
        # All later lookups use this one token list, so the indices stay consistent.
        tokens = re.sub(r'[.,]', '', items[2]).split()
        # tokens: ['They', 'told', 'Reuter', 'correspondents', 'in', 'Asian',
        # 'capitals', 'a', 'US', 'Move', 'against', 'Japan', 'might', 'boost', 'protectionst',
        # 'sentiment', 'in', 'the', 'US', 'And', 'lead', 'to', 'curbs', 'on', 'American', 'imports', 'of', 'their', 'products']
        for idx, word in enumerate(tokens):
            # Only words missing from the vocabulary need to be corrected
            if word in word_dic:
                continue
            # Step 1: generate the valid candidates, i.e. strings at
            # edit distance 1 that are in the vocabulary
            candidates = [c for c in generate_candidates(word) if c in word_dic]
            # If there are none, fall back to edit distance 2
            if not candidates:
                candidates = [c for c in generate_edit_two(word) if c in word_dic]
                if not candidates:
                    continue
            # Neighbouring words; the first word of a sentence follows '<s>'
            prev_word = tokens[idx - 1] if idx > 0 else '<s>'
            next_word = tokens[idx + 1] if idx + 1 < len(tokens) else None
            # Score every candidate and return the one with the highest score:
            #   score     = p(correct) * p(mistake|correct)
            #   log score = log p(correct) + log p(mistake|correct)
            # p(correct) comes from the bigram model, so the score takes both
            # the preceding and the following word into account.
            probs = []
            for candi in candidates:
                prob = 0
                # a. Channel probability: use the table entry if this misspelling
                #    is listed for the candidate, otherwise a small constant
                if candi in channel_prob and word in channel_prob[candi]:
                    prob += np.log(channel_prob[candi][word])
                else:
                    prob += np.log(0.00001)
                # b. Language model, [previous word, candidate], add-one smoothed;
                #    unseen bigrams and unseen words simply count as 0
                bigram_1 = ' '.join([prev_word, candi])
                prob += np.log((bigram_count.get(bigram_1, 0) + 1.0) / (
                        term_count.get(prev_word, 0) + V))
                # c. Language model, [candidate, next word], if there is a next word
                if next_word is not None:
                    bigram_2 = ' '.join([candi, next_word])
                    prob += np.log((bigram_count.get(bigram_2, 0) + 1.0) / (
                            term_count.get(candi, 0) + V))
                # Record the score of every candidate, including for the last word
                probs.append(prob)
            # The index of the highest score is also the index of the best candidate.
            # Print the misspelled word and its correction.
            max_idx = probs.index(max(probs))
            print(word, candidates[max_idx])
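The scoring rule sketched in the code comments is the usual noisy-channel formulation. Written out in my own notation (w is the typed word, c a candidate, w_prev and w_next its neighbours, C(·) a count from the corpus, V the vocabulary size), it reads roughly as follows:

```latex
\hat{c}
  = \arg\max_{c} P(c \mid w)
  = \arg\max_{c} \frac{P(w \mid c)\, P(c)}{P(w)}
  = \arg\max_{c} \bigl[ \log P(w \mid c) + \log P(c) \bigr]

% P(w) is the same for every candidate, so it drops out of the argmax.
% The code replaces the prior \log P(c) with the bigram context of c:
\log P(c) \;\approx\; \log P(c \mid w_{\mathrm{prev}}) + \log P(w_{\mathrm{next}} \mid c),
\qquad
P(b \mid a) = \frac{C(a, b) + 1}{C(a) + V}
```

Working in log space turns the product into a sum and avoids numerical underflow when several small probabilities are multiplied.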

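Summary

Follow, like, and message me to get the datasets! The code comes from the internet and is used here for learning purposes only; if it infringes any rights, please contact me and I will remove it.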
