Methods for computing the similarity of two strings (or sentences)

Overview: the main methods are edit distance, cosine similarity, and a fuzzy-match similarity percentage.


1 Edit distance



def levenshtein(first, second):
    '''Levenshtein edit distance (LevD).

    Args: two strings.
    Returns: the edit distance between the two strings, as an int.
    '''
    # Keep the shorter string as `first` so the matrix has fewer rows.
    if len(first) > len(second):
        first, second = second, first
    if len(first) == 0:
        return len(second)
    first_length = len(first) + 1
    second_length = len(second) + 1
    # Row 0 holds 0..len(second): the cost of inserting a prefix of `second`.
    distance_matrix = [list(range(second_length)) for _ in range(first_length)]
    # Column 0 must hold 0..len(first): the cost of deleting a prefix of `first`.
    # Without this step, leading characters of `first` could be dropped for free.
    for i in range(first_length):
        distance_matrix[i][0] = i
    for i in range(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i - 1][j] + 1
            insertion = distance_matrix[i][j - 1] + 1
            substitution = distance_matrix[i - 1][j - 1]
            if first[i - 1] != second[j - 1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length - 1][second_length - 1]


str1="hello,good moring"
str2="hi,good moring"
edit_distance=levenshtein(str1,str2)
edit_distance


4
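An edit distance is a raw count, so it is often convenient to turn it into a similarity score between 0 and 1. The sketch below is one common way to do that, not the only one: it divides the distance by the length of the longer string. The names `edit_distance` and `edit_similarity` are illustrative, and the distance is recomputed here with a two-row table so the block runs on its own.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("hello,good moring", "hi,good moring"))              # 4
print(round(edit_similarity("hello,good moring", "hi,good moring"), 2))  # 0.76
```

With this normalization, identical strings score 1.0 and strings with nothing in common score 0.0.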


2 Cosine similarity
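For reference, the standard definition of cosine similarity is written out here. The symbols are assumptions chosen for this note: A and B are the word-count vectors of the two texts over a shared vocabulary of n words, and A_i, B_i are the counts of the i-th word.

```latex
\cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

Because word counts are never negative, the value lies between 0 (no shared words) and 1 (identical word proportions).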




import math
import re
text1 = "This game is one of the very best. games ive  played. the  ;pictures? " \
        "cant descripe the real graphics in the game."
text2 = "this game have/ is3 one of the very best. games ive  played. the  ;pictures? " \
        "cant descriPe now the real graphics in the game."
text3 = "So in the picture i saw a nice size detailed metal puzzle. Eager to try since I enjoy 3d wood puzzles, i ordered it. Well to my disappointment I got in the mail a small square about 4 inches around. And to add more disappointment when I built it it was smaller than the palm of my hand. For the price it should of been much much larger. Don't be fooled. It's only worth $5.00.Update 4/15/2013I have bought and completed 13 of these MODELS from A.C. Moore for $5.99 a piece, so i stand by my comment that thiss one is overpriced. It was still fun to build just like all the others from the maker of this brand.Just be warned, They are small."
text4 = "I love it when an author can bring you into their made up world and make you feel like a friend, confidant, or family. Having a special child of my own I could relate to the teacher and her madcap class. I've also spent time in similar classrooms and enjoyed the uniqueness of each and every child. Her story drew me into their world and had me laughing so hard my family thought I had lost my mind, so I shared the passage so they could laugh with me. Read this book if you enjoy a book with strong women, you won't regret it."
def count_words(text):
    """Map each lowercased, letters-only word in `text` to its frequency."""
    counts = {}
    for word in text.split(' '):
        # Strip every non-letter character, then normalize the case.
        word = re.sub('[^a-zA-Z]', '', word).lower()
        if word != '':
            counts[word] = counts.get(word, 0) + 1
    return counts


def compute_cosine(text_a, text_b):
    # Find the words and their frequencies.
    words1_dict = count_words(text_a)
    words2_dict = count_words(text_b)
    print(words1_dict)
    print(words2_dict)
    # Sort by frequency, highest first.
    dic1 = sorted(words1_dict.items(), key=lambda item: item[1], reverse=True)
    dic2 = sorted(words2_dict.items(), key=lambda item: item[1], reverse=True)
    print(dic1)
    print(dic2)
    # Build the shared vocabulary: the words of text_a first,
    # then any words that appear only in text_b.
    words_key = [word for word, _ in dic1]
    for word, _ in dic2:
        if word not in words_key:
            words_key.append(word)
    # Turn each text into a count vector over the shared vocabulary.
    vect1 = [words1_dict.get(word, 0) for word in words_key]
    vect2 = [words2_dict.get(word, 0) for word in words_key]
    print(vect1)
    print(vect2)
    # Compute the cosine similarity.
    dot = sum(a * b for a, b in zip(vect1, vect2))
    sq1 = sum(a * a for a in vect1)
    sq2 = sum(b * b for b in vect2)
    try:
        result = round(float(dot) / (math.sqrt(sq1) * math.sqrt(sq2)), 2)
    except ZeroDivisionError:
        # One of the texts contains no words at all.
        result = 0.0
    return result
if __name__ == '__main__':
    result=compute_cosine(text1, text2)
    print(result)


{'this': 1, 'game': 2, 'is': 1, 'one': 1, 'of': 1, 'the': 4, 'very': 1, 'best': 1, 'games': 1, 'ive': 1, 'played': 1, 'pictures': 1, 'cant': 1, 'descripe': 1, 'real': 1, 'graphics': 1, 'in': 1}
{'this': 1, 'game': 2, 'have': 1, 'is': 1, 'one': 1, 'of': 1, 'the': 4, 'very': 1, 'best': 1, 'games': 1, 'ive': 1, 'played': 1, 'pictures': 1, 'cant': 1, 'descripe': 1, 'now': 1, 'real': 1, 'graphics': 1, 'in': 1}
[('the', 4), ('game', 2), ('this', 1), ('is', 1), ('one', 1), ('of', 1), ('very', 1), ('best', 1), ('games', 1), ('ive', 1), ('played', 1), ('pictures', 1), ('cant', 1), ('descripe', 1), ('real', 1), ('graphics', 1), ('in', 1)]
[('the', 4), ('game', 2), ('this', 1), ('have', 1), ('is', 1), ('one', 1), ('of', 1), ('very', 1), ('best', 1), ('games', 1), ('ive', 1), ('played', 1), ('pictures', 1), ('cant', 1), ('descripe', 1), ('now', 1), ('real', 1), ('graphics', 1), ('in', 1)]
[4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
[4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
0.97
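If the intermediate printouts are not needed, the same computation can be condensed with `collections.Counter`. This is a minimal sketch under a few assumptions: the names `tokenize` and `cosine_similarity` are illustrative, tokens are split on any whitespace with `str.split()` rather than on single spaces, and the result is returned unrounded.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split on whitespace, keep letters only, lowercase, drop empty tokens."""
    cleaned = (re.sub('[^a-zA-Z]', '', token).lower() for token in text.split())
    return [token for token in cleaned if token]


def cosine_similarity(text_a, text_b):
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text has no words
    return dot / (norm_a * norm_b)


a = "This game is one of the very best."
b = "this game is one of the best"
print(round(cosine_similarity(a, b), 2))
```

Skipping the shared vocabulary list works because a word missing from either text contributes zero to the dot product.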


3 FuzzyWuzzy



from fuzzywuzzy import fuzz
fuzz.ratio("this is a test", "this is a test!")


97
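To see where a score like 97 comes from, it can help to reproduce it with the standard library alone. As far as I know, when the optional python-Levenshtein package is not installed, fuzzywuzzy falls back to `difflib.SequenceMatcher`, and `fuzz.ratio` is that matcher's ratio scaled to 0-100 and rounded. Under that assumption, a stand-in looks like the sketch below; `simple_ratio` is a hypothetical helper, not part of fuzzywuzzy.

```python
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """Stdlib approximation of a 0-100 fuzzy score (assumed behaviour)."""
    matcher = SequenceMatcher(None, s1, s2)
    # ratio() is in [0, 1]; scale to a percentage and round to an int.
    return int(round(100 * matcher.ratio()))


print(simple_ratio("this is a test", "this is a test!"))  # 97
```

Here 14 characters match out of 14 + 15 = 29 in total, so the ratio is 2 * 14 / 29, or about 0.966, which rounds to 97.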


4 difflib string comparison


SequenceMatcher.ratio() returns the similarity between two strings as a float in the range [0, 1].
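The Python documentation defines this ratio as 2.0 * M / T, where T is the total number of elements in both sequences and M is the number of matches. The sketch below recomputes it from the matching blocks to make the formula concrete; `ratio_from_blocks` is an illustrative name.

```python
from difflib import SequenceMatcher


def ratio_from_blocks(a, b):
    """Recompute SequenceMatcher.ratio() as 2 * M / T."""
    matcher = SequenceMatcher(None, a, b)
    # Each block is a run of matching elements; the final block has size 0.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # difflib also reports 1.0 for two empty sequences
    return 2.0 * matched / total


print(ratio_from_blocks("abcd", "bcde"))               # 0.75
print(SequenceMatcher(None, "abcd", "bcde").ratio())   # 0.75
```

For "abcd" and "bcde" the only matching block is "bcd", so M = 3, T = 8, and the ratio is 0.75.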



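import difflib
from tqdm import tqdm  # third-party progress bar (pip install tqdm)

def get_diffs(text1, text2):
    """Return the difflib ratio for each pair of strings.

    text1 and text2 are sequences of strings that are compared pairwise.
    """
    print("get diffs ratios...")
    seq = difflib.SequenceMatcher()
    diffs = []
    for wid1, wid2 in tqdm(zip(text1, text2)):
        # Lowercase both sides so the comparison ignores case.
        seq.set_seqs(wid1.lower(), wid2.lower())
        diffs.append(seq.ratio())
    return diffs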
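A short usage sketch of the same pairwise idea follows, written without the tqdm progress bar so that it depends only on the standard library. The function name `pairwise_ratios` and the sample word lists are made up for illustration.

```python
import difflib


def pairwise_ratios(left, right):
    """Case-insensitive difflib ratio for each (left[i], right[i]) pair."""
    matcher = difflib.SequenceMatcher()  # one matcher reused for every pair
    ratios = []
    for s1, s2 in zip(left, right):  # zip stops at the shorter list
        matcher.set_seqs(s1.lower(), s2.lower())
        ratios.append(matcher.ratio())
    return ratios


print(pairwise_ratios(["Hello", "abcd"], ["hello", "bcde"]))  # [1.0, 0.75]
```

The first pair differs only in case and scores 1.0; the second pair shares the run "bcd" and scores 0.75.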


For other uses of the difflib module, see:
