计算两个字符串相(或句子)似度的方法

简介: 主要方法有:编辑距离、余弦相似度、模糊相似度百分比

主要方法有:编辑距离、余弦相似度、模糊相似度百分比


1 编辑距离



def levenshtein(first, second):
        ''' 编辑距离算法(LevD) 
            Args: 两个字符串
            returns: 两个字符串的编辑距离 int
        '''
        if len(first) > len(second):
            first, second = second, first
        if len(first) == 0:
            return len(second)
        if len(second) == 0:
            return len(first)
        first_length = len(first) + 1
        second_length = len(second) + 1
        distance_matrix = [list(range(second_length)) for x in range(first_length)]
        # print distance_matrix
        for i in range(1, first_length):
            for j in range(1, second_length):
                deletion = distance_matrix[i - 1][j] + 1
                insertion = distance_matrix[i][j - 1] + 1
                substitution = distance_matrix[i - 1][j - 1]
                if first[i - 1] != second[j - 1]:
                    substitution += 1
                distance_matrix[i][j] = min(insertion, deletion, substitution)
                # print distance_matrix
        return distance_matrix[first_length - 1][second_length - 1]


str1="hello,good moring"
str2="hi,good moring"
edit_distance=levenshtein(str1,str2)
edit_distance


4


2 余弦相似度


60.png


61.png


import math
import re
import datetime
import time
text1 = "This game is one of the very best. games ive  played. the  ;pictures? " \
        "cant descripe the real graphics in the game."
text2 = "this game have/ is3 one of the very best. games ive  played. the  ;pictures? " \
        "cant descriPe now the real graphics in the game."
text3 = "So in the picture i saw a nice size detailed metal puzzle. Eager to try since I enjoy 3d wood puzzles, i ordered it. Well to my disappointment I got in the mail a small square about 4 inches around. And to add more disappointment when I built it it was smaller than the palm of my hand. For the price it should of been much much larger. Don't be fooled. It's only worth $5.00.Update 4/15/2013I have bought and completed 13 of these MODELS from A.C. Moore for $5.99 a piece, so i stand by my comment that thiss one is overpriced. It was still fun to build just like all the others from the maker of this brand.Just be warned, They are small."
text4 = "I love it when an author can bring you into their made up world and make you feel like a friend, confidant, or family. Having a special child of my own I could relate to the teacher and her madcap class. I've also spent time in similar classrooms and enjoyed the uniqueness of each and every child. Her story drew me into their world and had me laughing so hard my family thought I had lost my mind, so I shared the passage so they could laugh with me. Read this book if you enjoy a book with strong women, you won't regret it."
def compute_cosine(text_a, text_b):
    # 找单词及词频
    words1 = text_a.split(' ')
    words2 = text_b.split(' ')
    # print(words1)
    words1_dict = {}
    words2_dict = {}
    for word in words1:
        # word = word.strip(",.?!;")
        word = re.sub('[^a-zA-Z]', '', word)
        word = word.lower()
        # print(word)
        if word != '' and word in words1_dict: # 这里改动了
            num = words1_dict[word]
            words1_dict[word] = num + 1
        elif word != '':
            words1_dict[word] = 1
        else:
            continue
    for word in words2:
        # word = word.strip(",.?!;")
        word = re.sub('[^a-zA-Z]', '', word)
        word = word.lower()
        if word != '' and word in words2_dict:
            num = words2_dict[word]
            words2_dict[word] = num + 1
        elif word != '':
            words2_dict[word] = 1
        else:
            continue
    print(words1_dict)
    print(words2_dict)
    # 排序
    dic1 = sorted(words1_dict.items(), key=lambda asd: asd[1], reverse=True)
    dic2 = sorted(words2_dict.items(), key=lambda asd: asd[1], reverse=True)
    print(dic1)
    print(dic2)
    # 得到词向量
    words_key = []
    for i in range(len(dic1)):
        words_key.append(dic1[i][0])  # 向数组中添加元素
    for i in range(len(dic2)):
        if dic2[i][0] in words_key:
            # print 'has_key', dic2[i][0]
            pass
        else:  # 合并
            words_key.append(dic2[i][0])
    # print(words_key)
    vect1 = []
    vect2 = []
    for word in words_key:
        if word in words1_dict:
            vect1.append(words1_dict[word])
        else:
            vect1.append(0)
        if word in words2_dict:
            vect2.append(words2_dict[word])
        else:
            vect2.append(0)
    print(vect1)
    print(vect2)
    # 计算余弦相似度
    sum = 0
    sq1 = 0
    sq2 = 0
    for i in range(len(vect1)):
        sum += vect1[i] * vect2[i]
        sq1 += pow(vect1[i], 2)
        sq2 += pow(vect2[i], 2)
    try:
        result = round(float(sum) / (math.sqrt(sq1) * math.sqrt(sq2)), 2)
    except ZeroDivisionError:
        result = 0.0
    # print(result)
    return result
if __name__ == '__main__':
    result=compute_cosine(text1, text2)
    print(result)


{'this': 1, 'game': 2, 'is': 1, 'one': 1, 'of': 1, 'the': 4, 'very': 1, 'best': 1, 'games': 1, 'ive': 1, 'played': 1, 'pictures': 1, 'cant': 1, 'descripe': 1, 'real': 1, 'graphics': 1, 'in': 1}
{'this': 1, 'game': 2, 'have': 1, 'is': 1, 'one': 1, 'of': 1, 'the': 4, 'very': 1, 'best': 1, 'games': 1, 'ive': 1, 'played': 1, 'pictures': 1, 'cant': 1, 'descripe': 1, 'now': 1, 'real': 1, 'graphics': 1, 'in': 1}
[('the', 4), ('game', 2), ('this', 1), ('is', 1), ('one', 1), ('of', 1), ('very', 1), ('best', 1), ('games', 1), ('ive', 1), ('played', 1), ('pictures', 1), ('cant', 1), ('descripe', 1), ('real', 1), ('graphics', 1), ('in', 1)]
[('the', 4), ('game', 2), ('this', 1), ('have', 1), ('is', 1), ('one', 1), ('of', 1), ('very', 1), ('best', 1), ('games', 1), ('ive', 1), ('played', 1), ('pictures', 1), ('cant', 1), ('descripe', 1), ('now', 1), ('real', 1), ('graphics', 1), ('in', 1)]
[4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
[4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
0.97


3 FuzzyWuzzy



from fuzzywuzzy import fuzz
fuzz.ratio("this is a test", "this is a test!")


97


4 difflib 字符串比较


ratio()返回两个字符串之间的相似度,范围是[0,1]


62.png


def get_diffs(text1,text2):
    """获取diff 距离"""
    print("get diffs ratios...")
    seq=difflib.SequenceMatcher()
    diffs=[]
    for wid1,wid2 in tqdm(zip(text1,text2)):
        seq.set_seqs(wid1.lower(),wid2.lower())
        diffs.append(seq.ratio())
    return diffs


difflib模块的其他用法请看:

相关文章
markdown常用语法--花括号(超详细)
markdown常用语法--花括号(超详细)
|
消息中间件 存储 监控
云消息队列 RocketMQ 版(原ONS)体验
云消息队列 RocketMQ 版(原ONS)是阿里云基于 Apache RocketMQ 构建的低延迟、高并发、高可用、高可靠的分布式“消息、事件、流”统一处理平台。它在阿里集团内部业务、阿里云以及开源社区中得到广泛应用。最新的版本进一步优化了高可靠低延迟的特性,并提供了多场景容灾解决方案,使其成为金融级业务消息的首选方案。由于专业及能力问题,本次我只能从产品功能体验方面进行简单的一些分析。
1736 64
|
存储 安全 对象存储
OSS对接-STS认证模式接入参考文档
背景之前项目中用到文件上传的场景中,都是由服务端做转发到OSS,存在着性能损耗。我们在 高德文件直传能力建设 项目中需要探索使用客户端直连OSS的方式来做,了解到OSS提供了STS认证的方式,通过子账号生成的临时AK作为客户端短期访问OSS的凭证,也不同担心AK安全的问题。具体方案见官方文档:STS临时授权访问OSSOSS可以通过阿里云STS(Security Token Service)进行临时
2751 0
OSS对接-STS认证模式接入参考文档
|
8月前
|
测试技术
通义千问团队开源全新的过程奖励模型PRM!
近年来,大型语言模型(LLMs)在数学推理方面取得了显著进展,但它们仍可能在过程中犯错误,如计算错误或逻辑错误,导致得出不正确的结论;即使最终答案正确,这些强大的模型也可能编造看似合理的推理步骤,这削弱了 LLMs 推理过程的可靠性和可信度。
648 14
|
6月前
|
存储 人工智能 API
OWL:告别繁琐任务!开源多智能体系统实现自动化协作,效率提升10倍
OWL 是基于 CAMEL-AI 框架开发的多智能体协作系统,通过智能体之间的动态交互实现高效的任务自动化,支持角色分配、任务分解和记忆功能,适用于代码生成、文档撰写、数据分析等多种场景。
1378 13
OWL:告别繁琐任务!开源多智能体系统实现自动化协作,效率提升10倍
|
机器学习/深度学习 算法
【2023高教社杯】B题 多波束测线问题 问题分析、数学模型及参考文献
本文介绍了2023年高教社杯数学建模竞赛B题,聚焦于多波束测深系统的覆盖宽度和重叠率问题,包括问题分析、数学模型构建和参考文献,并针对不同场景下的测线设计提出了解决方案。
377 0
【2023高教社杯】B题 多波束测线问题 问题分析、数学模型及参考文献
|
10月前
|
API
全球天气预报1天-经纬度版免费API接口教程
该接口用于获取全球任意地区的天气信息,需提供经纬度参数。支持POST和GET请求,返回包括天气、气温、气压、湿度等详细信息。详情及示例参见API文档。
|
Linux 程序员 Perl
老程序员分享:Linux查看系统开机时间
老程序员分享:Linux查看系统开机时间
546 0
|
自然语言处理 算法 搜索推荐
字符串相似度算法完全指南:编辑、令牌与序列三类算法的全面解析与深入分析
在自然语言处理领域,人们经常需要比较字符串,这些字符串可能是单词、句子、段落甚至是整个文档。如何快速判断两个单词或句子是否相似,或者相似度是好还是差。这类似于我们使用手机打错一个词,但手机会建议正确的词来修正它,那么这种如何判断字符串相似度呢?本文将详细介绍这个问题。
677 1
|
消息中间件 安全 网络安全
【网络安全 | Kali】基于Docker的Vulhub安装教程指南
【网络安全 | Kali】基于Docker的Vulhub安装教程指南
490 1

热门文章

最新文章