
A Comparison of Text Similarity Algorithms, with Python Implementations

Preface

A common requirement: checking how much two articles, or two product descriptions, overlap with each other.

To handle this kind of problem, this post lists several common similarity algorithms and implements each one in Python.

Five common similarity algorithms: cosine similarity (cosine_similarity), Jaccard similarity, edit distance (Levenshtein), MinHash, and SimHash + Hamming distance.

The code was left behind by a former colleague; I have tidied it up and am sharing it here. I won't rehash the theory behind each algorithm, you can easily look that up yourself. If you have any questions, feel free to leave a comment. Thanks!

Cosine similarity (cosine_similarity)

# -*- coding: utf-8 -*-
# regex
import re
# html utilities
import html
# NLP (word segmentation / keyword extraction)
import jieba
import jieba.analyse
# machine learning
from sklearn.metrics.pairwise import cosine_similarity


class CosineSimilarity(object):
    """Cosine similarity"""

    def __init__(self, content_x1, content_y2):
        self.s1 = content_x1
        self.s2 = content_y2

    @staticmethod
    def extract_keyword(content):  # extract keywords
        # strip html tags with a regex
        re_exp = re.compile(r'(<style>.*?</style>)|(<[^>]+>)', re.S)
        content = re_exp.sub(' ', content)
        # unescape html entities
        content = html.unescape(content)
        # segment
        seg = [i for i in jieba.cut(content, cut_all=True) if i != '']
        # extract keywords
        keywords = jieba.analyse.extract_tags("|".join(seg), topK=200, withWeight=False)
        return keywords

    @staticmethod
    def one_hot(word_dict, keywords):  # one-hot encoding
        # cut_code = [word_dict[word] for word in keywords]
        cut_code = [0] * len(word_dict)
        for word in keywords:
            cut_code[word_dict[word]] += 1
        return cut_code

    def main(self):
        # remove stop words
        jieba.analyse.set_stop_words('./files/stopwords.txt')

        # extract keywords
        keywords1 = self.extract_keyword(self.s1)
        keywords2 = self.extract_keyword(self.s2)
        # union of the two keyword sets
        union = set(keywords1).union(set(keywords2))
        # index every word in the union
        word_dict = {}
        i = 0
        for word in union:
            word_dict[word] = i
            i += 1
        # one-hot encode both keyword lists
        s1_cut_code = self.one_hot(word_dict, keywords1)
        s2_cut_code = self.one_hot(word_dict, keywords2)
        # cosine similarity
        sample = [s1_cut_code, s2_cut_code]
        # guard against division-by-zero errors
        try:
            sim = cosine_similarity(sample)
            return sim[1][0]
        except Exception as e:
            print(e)
            return 0.0


# test
if __name__ == '__main__':
    with open('./files/sample_x.txt', 'r') as x, open('./files/sample_y.txt', 'r') as y:
        content_x = x.read()
        content_y = y.read()
        similarity = CosineSimilarity(content_x, content_y)
        similarity = similarity.main()
        print('Similarity: %.2f%%' % (similarity * 100))

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.915 seconds.
Prefix dict has been built succesfully.
Similarity: 79.67%

Jaccard similarity

# -*- coding: utf-8 -*-
# regex
import re
# NLP (word segmentation / keyword extraction)
import jieba
import jieba.analyse
# html utilities
import html


class JaccardSimilarity(object):
    """Jaccard similarity"""

    def __init__(self, content_x1, content_y2):
        self.s1 = content_x1
        self.s2 = content_y2

    @staticmethod
    def extract_keyword(content):  # extract keywords
        # strip html tags with a regex
        re_exp = re.compile(r'(<style>.*?</style>)|(<[^>]+>)', re.S)
        content = re_exp.sub(' ', content)
        # unescape html entities
        content = html.unescape(content)
        # segment
        seg = [i for i in jieba.cut(content, cut_all=True) if i != '']
        # extract keywords
        keywords = jieba.analyse.extract_tags("|".join(seg), topK=200, withWeight=False)
        return keywords

    def main(self):
        # remove stop words
        jieba.analyse.set_stop_words('./files/stopwords.txt')

        # segmentation and keyword extraction
        keywords_x = self.extract_keyword(self.s1)
        keywords_y = self.extract_keyword(self.s2)

        # Jaccard similarity: |intersection| / |union|
        intersection = len(list(set(keywords_x).intersection(set(keywords_y))))
        union = len(list(set(keywords_x).union(set(keywords_y))))
        # guard against division by zero
        sim = float(intersection) / union if union != 0 else 0
        return sim


# test
if __name__ == '__main__':
    with open('./files/sample_x.txt', 'r') as x, open('./files/sample_y.txt', 'r') as y:
        content_x = x.read()
        content_y = y.read()
        similarity = JaccardSimilarity(content_x, content_y)
        similarity = similarity.main()
        print('Similarity: %.2f%%' % (similarity * 100))

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.893 seconds.
Prefix dict has been built succesfully.
Similarity: 66.20%

Edit distance (Levenshtein)

# -*- coding: utf-8 -*-
# regex
import re
# html utilities
import html
# NLP (word segmentation / keyword extraction)
import jieba
import jieba.analyse
# edit distance
import Levenshtein


class LevenshteinSimilarity(object):
    """Edit distance"""

    def __init__(self, content_x1, content_y2):
        self.s1 = content_x1
        self.s2 = content_y2

    @staticmethod
    def extract_keyword(content):  # extract keywords
        # strip html tags with a regex
        re_exp = re.compile(r'(<style>.*?</style>)|(<[^>]+>)', re.S)
        content = re_exp.sub(' ', content)
        # unescape html entities
        content = html.unescape(content)
        # segment
        seg = [i for i in jieba.cut(content, cut_all=True) if i != '']
        # extract keywords
        keywords = jieba.analyse.extract_tags("|".join(seg), topK=200, withWeight=False)
        return keywords

    def main(self):
        # remove stop words
        jieba.analyse.set_stop_words('./files/stopwords.txt')

        # extract keywords
        keywords1 = ', '.join(self.extract_keyword(self.s1))
        keywords2 = ', '.join(self.extract_keyword(self.s2))

        # ratio() returns a similarity between the two strings, based on the minimum edit distance
        distances = Levenshtein.ratio(keywords1, keywords2)
        return distances


# test
if __name__ == '__main__':
    with open('./files/sample_x.txt', 'r') as x, open('./files/sample_y.txt', 'r') as y:
        content_x = x.read()
        content_y = y.read()
        distance = LevenshteinSimilarity(content_x, content_y)
        distance = distance.main()
        print('Similarity: %.2f%%' % (distance * 100))

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.786 seconds.
Prefix dict has been built succesfully.
Similarity: 82.24%

MinHash

# -*- coding: utf-8 -*-
# regex
import re
# NLP (word segmentation / keyword extraction)
import jieba
import jieba.analyse
# html utilities
import html
# probabilistic data sketches
from datasketch import MinHash


class MinHashSimilarity(object):
    """MinHash"""

    def __init__(self, content_x1, content_y2):
        self.s1 = content_x1
        self.s2 = content_y2

    @staticmethod
    def extract_keyword(content):  # extract keywords
        # strip html tags with a regex
        re_exp = re.compile(r'(<style>.*?</style>)|(<[^>]+>)', re.S)
        content = re_exp.sub(' ', content)
        # unescape html entities
        content = html.unescape(content)
        # segment
        seg = [i for i in jieba.cut(content, cut_all=True) if i != '']
        # extract keywords
        keywords = jieba.analyse.extract_tags("|".join(seg), topK=200, withWeight=False)
        return keywords

    def main(self):
        # remove stop words
        jieba.analyse.set_stop_words('./files/stopwords.txt')

        # MinHash sketches
        m1, m2 = MinHash(), MinHash()
        # extract keywords
        s1 = self.extract_keyword(self.s1)
        s2 = self.extract_keyword(self.s2)

        for data in s1:
            m1.update(data.encode('utf8'))
        for data in s2:
            m2.update(data.encode('utf8'))

        return m1.jaccard(m2)


# test
if __name__ == '__main__':
    with open('./files/sample_x.txt', 'r') as x, open('./files/sample_y.txt', 'r') as y:
        content_x = x.read()
        content_y = y.read()
        similarity = MinHashSimilarity(content_x, content_y)
        similarity = similarity.main()
        print('Similarity: %.2f%%' % (similarity * 100))

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.901 seconds.
Prefix dict has been built succesfully.
Similarity: 64.84%

SimHash + Hamming distance

# -*- coding: utf-8 -*-
# regex
import re
# html utilities
import html
# math
import math
# NLP (word segmentation / keyword extraction)
import jieba
import jieba.analyse


class SimHashSimilarity(object):
    """SimHash"""

    def __init__(self, content_x1, content_y2):
        self.s1 = content_x1
        self.s2 = content_y2

    @staticmethod
    def get_bin_str(source):  # hash a string to a 64-bit binary string
        if source == "":
            return 0
        else:
            t = ord(source[0]) << 7
            m = 1000003
            mask = 2 ** 128 - 1
            for c in source:
                t = ((t * m) ^ ord(c)) & mask
            t ^= len(source)
            if t == -1:
                t = -2
            t = bin(t).replace('0b', '').zfill(64)[-64:]
            return str(t)

    @staticmethod
    def extract_keyword(content):  # extract keywords
        # strip html tags with a regex
        re_exp = re.compile(r'(<style>.*?</style>)|(<[^>]+>)', re.S)
        content = re_exp.sub(' ', content)
        # unescape html entities
        content = html.unescape(content)
        # segment
        seg = [i for i in jieba.cut(content, cut_all=True) if i != '']
        # extract keywords with weights
        keywords = jieba.analyse.extract_tags("|".join(seg), topK=200, withWeight=True)
        return keywords

    def run(self, keywords):
        ret = []
        for keyword, weight in keywords:
            bin_str = self.get_bin_str(keyword)
            key_list = []
            for c in bin_str:
                weight = math.ceil(weight)
                if c == "1":
                    key_list.append(int(weight))
                else:
                    key_list.append(-int(weight))
            ret.append(key_list)
        # collapse the weighted bit matrix column by column into a fingerprint
        rows = len(ret)
        cols = len(ret[0])
        result = []
        for i in range(cols):
            tmp = 0
            for j in range(rows):
                tmp += int(ret[j][i])
            if tmp > 0:
                tmp = "1"
            elif tmp <= 0:
                tmp = "0"
            result.append(tmp)
        return "".join(result)

    def main(self):
        # remove stop words
        jieba.analyse.set_stop_words('./files/stopwords.txt')

        # extract keywords
        s1 = self.extract_keyword(self.s1)
        s2 = self.extract_keyword(self.s2)

        sim_hash1 = self.run(s1)
        sim_hash2 = self.run(s2)
        # print(f'SimHash fingerprint 1: {sim_hash1}\nSimHash fingerprint 2: {sim_hash2}')
        length = 0
        for index, char in enumerate(sim_hash1):
            if char == sim_hash2[index]:
                continue
            else:
                length += 1
        return length


# test
if __name__ == '__main__':
    with open('./files/sample_x.txt', 'r') as x, open('./files/sample_y.txt', 'r') as y:
        content_x = x.read()
        content_y = y.read()
        similarity = SimHashSimilarity(content_x, content_y)
        similarity = similarity.main()
        # threshold
        threshold = 3
        print(f'Hamming distance: {similarity}  Threshold: {threshold}  Similar: {similarity <= threshold}')

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.903 seconds.
Prefix dict has been built succesfully.
Hamming distance: 17  Threshold: 3  Similar: False

Discussion

A comparison of the algorithms:

Cosine similarity

Relatively high computational cost.

In the related literature, item-based collaborative filtering systems commonly use cosine similarity as their similarity measure. In many real applications, however, the data is extremely sparse, and cosine similarity can then produce misleading results.
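To make the calculation concrete, here is a minimal sketch (the one-hot vectors are made up for the example) that feeds two hand-written vectors into sklearn's cosine_similarity, the same call the class above uses. With only a handful of non-zero dimensions, a single shared word already yields a score of 0.5, which is the kind of distortion sparsity causes.

from sklearn.metrics.pairwise import cosine_similarity

# Toy one-hot vectors over a six-word vocabulary (made-up data).
# The two vectors share only one dimension, yet the score is 0.5,
# because both vectors are themselves extremely sparse.
v1 = [1, 1, 0, 0, 0, 0]
v2 = [1, 0, 0, 0, 0, 1]
sim = cosine_similarity([v1, v2])
print('cosine similarity: %.4f' % sim[0][1])  # 0.5000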

Jaccard similarity

In product descriptions, many operations staff take shortcuts by copy-pasting and lightly editing existing text, so descriptions end up highly duplicated. Extracting the keywords from each description and then computing the intersection and union of the two keyword sets, i.e. the Jaccard similarity, is a very good fit for detecting duplicated product descriptions in this scenario.
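For intuition, a minimal sketch of the same intersection-over-union computation on two made-up keyword lists:

# Jaccard similarity on two toy keyword lists: |intersection| / |union|
keywords_x = ['price', 'cotton', 'shirt', 'blue']
keywords_y = ['price', 'cotton', 'shirt', 'red', 'slim']
x, y = set(keywords_x), set(keywords_y)
sim = len(x & y) / len(x | y) if (x | y) else 0.0
print('Jaccard similarity: %.2f' % sim)  # 3 shared / 6 total = 0.50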

Edit distance

Relatively high computational cost, and its results in this scenario are not ideal.
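For reference, a minimal usage sketch of the python-Levenshtein package on toy strings; distance() is the raw edit distance, while ratio() is the normalized similarity used by the class above:

import Levenshtein

# Toy strings, made up for the example.
print(Levenshtein.distance('kitten', 'sitting'))  # 3
print(Levenshtein.ratio('kitten', 'sitting'))     # ~0.62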

MinHash

A solution for estimating Jaccard similarity over large datasets: by reducing the dimensionality of the text data it greatly speeds up the computation.
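As a quick sanity check, the sketch below (toy word sets, num_perm chosen arbitrarily) compares the MinHash estimate from datasketch with the exact Jaccard value; raising num_perm improves the estimate at the cost of more hashing:

from datasketch import MinHash

# Made-up word sets for the example.
s1 = set('minhash is a probabilistic data structure'.split())
s2 = set('minhash is a probability data structure for jaccard'.split())

m1, m2 = MinHash(num_perm=256), MinHash(num_perm=256)
for w in s1:
    m1.update(w.encode('utf8'))
for w in s2:
    m2.update(w.encode('utf8'))

print('estimated Jaccard:', m1.jaccard(m2))
print('exact Jaccard:    ', len(s1 & s2) / len(s1 | s2))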

SimHash + Hamming distance

For articles with fewer than 500 words, the error is fairly large.
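As a side note, the Hamming distance can also be computed by XOR-ing the two fingerprints as integers and counting the set bits, instead of comparing the bit strings character by character; a minimal sketch with made-up fingerprints:

# Hamming distance via XOR on two 64-bit fingerprints (toy values):
# convert the bit strings to integers, XOR them, count the differing bits.
def hamming_distance(fp1, fp2):
    return bin(int(fp1, 2) ^ int(fp2, 2)).count('1')

fp_a = '1011' * 16   # made-up 64-bit fingerprint
fp_b = '1001' * 16   # differs in one bit per 4-bit group
print(hamming_distance(fp_a, fp_b))  # 16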

On the difference between MinHash and SimHash: /dm_ustc/article/details/45626907

Choose the algorithm that fits your scenario, or combine several. The full code is on GitHub: /downdawn/Similarity

These are my notes; if you have any questions, feel free to leave a comment. Thanks!
