计算两个字符串相(或句子)似度的方法

简介: 主要方法有:编辑距离、余弦相似度、模糊相似度百分比

主要方法有:编辑距离、余弦相似度、模糊相似度百分比


1 编辑距离



def levenshtein(first, second):
        ''' 编辑距离算法(LevD) 
            Args: 两个字符串
            returns: 两个字符串的编辑距离 int
        '''
        if len(first) > len(second):
            first, second = second, first
        if len(first) == 0:
            return len(second)
        if len(second) == 0:
            return len(first)
        first_length = len(first) + 1
        second_length = len(second) + 1
        distance_matrix = [list(range(second_length)) for x in range(first_length)]
        # print distance_matrix
        for i in range(1, first_length):
            for j in range(1, second_length):
                deletion = distance_matrix[i - 1][j] + 1
                insertion = distance_matrix[i][j - 1] + 1
                substitution = distance_matrix[i - 1][j - 1]
                if first[i - 1] != second[j - 1]:
                    substitution += 1
                distance_matrix[i][j] = min(insertion, deletion, substitution)
                # print distance_matrix
        return distance_matrix[first_length - 1][second_length - 1]


str1="hello,good moring"
str2="hi,good moring"
edit_distance=levenshtein(str1,str2)
edit_distance


4


2 余弦相似度


60.png


61.png


import math
import re
import datetime
import time
text1 = "This game is one of the very best. games ive  played. the  ;pictures? " \
        "cant descripe the real graphics in the game."
text2 = "this game have/ is3 one of the very best. games ive  played. the  ;pictures? " \
        "cant descriPe now the real graphics in the game."
text3 = "So in the picture i saw a nice size detailed metal puzzle. Eager to try since I enjoy 3d wood puzzles, i ordered it. Well to my disappointment I got in the mail a small square about 4 inches around. And to add more disappointment when I built it it was smaller than the palm of my hand. For the price it should of been much much larger. Don't be fooled. It's only worth $5.00.Update 4/15/2013I have bought and completed 13 of these MODELS from A.C. Moore for $5.99 a piece, so i stand by my comment that thiss one is overpriced. It was still fun to build just like all the others from the maker of this brand.Just be warned, They are small."
text4 = "I love it when an author can bring you into their made up world and make you feel like a friend, confidant, or family. Having a special child of my own I could relate to the teacher and her madcap class. I've also spent time in similar classrooms and enjoyed the uniqueness of each and every child. Her story drew me into their world and had me laughing so hard my family thought I had lost my mind, so I shared the passage so they could laugh with me. Read this book if you enjoy a book with strong women, you won't regret it."
def compute_cosine(text_a, text_b):
    # 找单词及词频
    words1 = text_a.split(' ')
    words2 = text_b.split(' ')
    # print(words1)
    words1_dict = {}
    words2_dict = {}
    for word in words1:
        # word = word.strip(",.?!;")
        word = re.sub('[^a-zA-Z]', '', word)
        word = word.lower()
        # print(word)
        if word != '' and word in words1_dict: # 这里改动了
            num = words1_dict[word]
            words1_dict[word] = num + 1
        elif word != '':
            words1_dict[word] = 1
        else:
            continue
    for word in words2:
        # word = word.strip(",.?!;")
        word = re.sub('[^a-zA-Z]', '', word)
        word = word.lower()
        if word != '' and word in words2_dict:
            num = words2_dict[word]
            words2_dict[word] = num + 1
        elif word != '':
            words2_dict[word] = 1
        else:
            continue
    print(words1_dict)
    print(words2_dict)
    # 排序
    dic1 = sorted(words1_dict.items(), key=lambda asd: asd[1], reverse=True)
    dic2 = sorted(words2_dict.items(), key=lambda asd: asd[1], reverse=True)
    print(dic1)
    print(dic2)
    # 得到词向量
    words_key = []
    for i in range(len(dic1)):
        words_key.append(dic1[i][0])  # 向数组中添加元素
    for i in range(len(dic2)):
        if dic2[i][0] in words_key:
            # print 'has_key', dic2[i][0]
            pass
        else:  # 合并
            words_key.append(dic2[i][0])
    # print(words_key)
    vect1 = []
    vect2 = []
    for word in words_key:
        if word in words1_dict:
            vect1.append(words1_dict[word])
        else:
            vect1.append(0)
        if word in words2_dict:
            vect2.append(words2_dict[word])
        else:
            vect2.append(0)
    print(vect1)
    print(vect2)
    # 计算余弦相似度
    sum = 0
    sq1 = 0
    sq2 = 0
    for i in range(len(vect1)):
        sum += vect1[i] * vect2[i]
        sq1 += pow(vect1[i], 2)
        sq2 += pow(vect2[i], 2)
    try:
        result = round(float(sum) / (math.sqrt(sq1) * math.sqrt(sq2)), 2)
    except ZeroDivisionError:
        result = 0.0
    # print(result)
    return result
if __name__ == '__main__':
    result=compute_cosine(text1, text2)
    print(result)


{'this': 1, 'game': 2, 'is': 1, 'one': 1, 'of': 1, 'the': 4, 'very': 1, 'best': 1, 'games': 1, 'ive': 1, 'played': 1, 'pictures': 1, 'cant': 1, 'descripe': 1, 'real': 1, 'graphics': 1, 'in': 1}
{'this': 1, 'game': 2, 'have': 1, 'is': 1, 'one': 1, 'of': 1, 'the': 4, 'very': 1, 'best': 1, 'games': 1, 'ive': 1, 'played': 1, 'pictures': 1, 'cant': 1, 'descripe': 1, 'now': 1, 'real': 1, 'graphics': 1, 'in': 1}
[('the', 4), ('game', 2), ('this', 1), ('is', 1), ('one', 1), ('of', 1), ('very', 1), ('best', 1), ('games', 1), ('ive', 1), ('played', 1), ('pictures', 1), ('cant', 1), ('descripe', 1), ('real', 1), ('graphics', 1), ('in', 1)]
[('the', 4), ('game', 2), ('this', 1), ('have', 1), ('is', 1), ('one', 1), ('of', 1), ('very', 1), ('best', 1), ('games', 1), ('ive', 1), ('played', 1), ('pictures', 1), ('cant', 1), ('descripe', 1), ('now', 1), ('real', 1), ('graphics', 1), ('in', 1)]
[4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
[4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
0.97


3 FuzzyWuzzy



from fuzzywuzzy import fuzz
fuzz.ratio("this is a test", "this is a test!")


97


4 difflib 字符串比较


ratio()返回两个字符串之间的相似度,范围是[0,1]


62.png


def get_diffs(text1,text2):
    """获取diff 距离"""
    print("get diffs ratios...")
    seq=difflib.SequenceMatcher()
    diffs=[]
    for wid1,wid2 in tqdm(zip(text1,text2)):
        seq.set_seqs(wid1.lower(),wid2.lower())
        diffs.append(seq.ratio())
    return diffs


difflib模块的其他用法请看:

相关文章
单词 sift 的含义和使用场景介绍
单词 sift 的含义和使用场景介绍
|
1月前
一个16位的数以4位为一组分割,然后将各部分相加获取最终结果。
一个16位的数以4位为一组分割,然后将各部分相加获取最终结果。
|
1月前
|
算法 测试技术 C#
【多数组合 数学 字符串】2514. 统计同位异构字符串数目
【多数组合 数学 字符串】2514. 统计同位异构字符串数目
|
1月前
|
算法 测试技术 编译器
【算法 | 实验18】在字符矩阵中查找给定字符串的所有匹配项
题目描述 题目 在字符矩阵中查找给定字符串的所有匹配项 给定一个M×N字符矩阵,以及一个字符串S,找到在矩阵中所有可能的连续字符组成的S的次数。所谓的连续字符,是指一个字符可以和位于其上下左右,左上左下,右上右下8个方向的字符组成字符串。用回溯法求解。
50 1
|
1月前
编译原理——构造预测分析表(判断某字符串是否是文法G(E)的句子)
编译原理——构造预测分析表(判断某字符串是否是文法G(E)的句子)
30 0
|
1月前
|
安全
单词 remediation 的含义和使用场合
单词 remediation 的含义和使用场合
|
安全 算法 索引
对字符串进行分割并且补位的算法解析
重点掌握StringBuilder和StringBuffer和String的区别
对字符串进行分割并且补位的算法解析
从单词嵌入到文档距离 :WMD一种有效的文档分类方法
从单词嵌入到文档距离 :WMD一种有效的文档分类方法
133 0
从单词嵌入到文档距离 :WMD一种有效的文档分类方法
|
算法 Java 索引
java实现编辑距离算法(levenshtein distance),计算字符串或者是文本之间的相似度【附代码】
java实现编辑距离算法(levenshtein distance),计算字符串或者是文本之间的相似度【附代码】
515 0
【计算理论】计算理论总结 ( 正则表达式转为非确定性有限自动机 NFA | 示例 ) ★★(一)
【计算理论】计算理论总结 ( 正则表达式转为非确定性有限自动机 NFA | 示例 ) ★★(一)
141 0
【计算理论】计算理论总结 ( 正则表达式转为非确定性有限自动机 NFA | 示例 ) ★★(一)