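A small example of using an embedding extractor for parallel corpus alignment

Summary: Encode Chinese and English passages with the m3e-base sentence embedding model, then pair each English passage with its most similar Chinese passage by cosine similarity.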
from os import path

import numpy as np
from sentence_transformers import SentenceTransformer

# Use a local copy of the model if one exists; otherwise load it by its hub name.
model_path = (
    '/data/m3e-base'
    if path.isdir('/data/m3e-base')
    else 'moka-ai/m3e-base'
)
model = SentenceTransformer(model_path)
# Chinese passages.
zh_list = [
    "国际高等教育研究机构QS Quacquarelli Symonds于2023年6月28日正式发布第20版世界大学排名,首次将就业能力和可持续发展指标纳入排名体系,成为全球唯一一个同时包含这两项指标的排名。",
    "瑞典皇家科学院2022年10月10日在斯德哥尔摩宣布,将2022年诺贝尔经济学奖授予经济学家本·伯南克(Ben Bernanke)、道格拉斯·戴蒙德(Douglas Diamond)和菲利普·迪布维格(Philip Dybvig),以表彰他们在银行与金融危机研究领域的突出贡献。",
    "2022年11月10日,《福布斯》发布2022中国内地富豪榜。本次上榜者的财富总额从去年的1.48万亿美元下降至9,071亿美元,跌幅达到39%,并创下了《福布斯》调查中国内地富豪20多年以来的最大跌幅。",
    "新能源是指传统能源之外的各种能源形式。它的各种形式都是直接或者间接地来自于太阳或地球内部所产生的热能。包括太阳能、风能、生物质能、地热能、水能和海洋能以及由可再生能源衍生出来的生物燃料和氢所产生的能量。也可以说,新能源包括各种可再生能源和核能。相对于传统能源,新能源普遍具有污染少、储量大的特点,对于解决当今世界严重的环境污染问题和资源(特别是化石能源)枯竭问题具有重要意义。",
    "费曼学习法可以简化为四个单词:Concept (概念)、Teach (教给别人)、Review (评价)、Simplify (简化)。  费曼学习法的灵感源于诺贝尔物理奖获得者理查德•费曼(Richard Feynman),运用费曼技巧,你只需花上20分钟就能深入理解知识点,而且记忆深刻,难以遗忘。知识有两种类型,我们绝大多数人关注的都是错误的那类。第一类知识注重了解某个事物的名称。第二类知识注重了解某件事物。这可不是一回事儿。著名的诺贝尔物理学家理查德·费曼(Richard Feynman)能够理解这二者间的差别,这也是他成功最重要的原因之一。事实上,他创造了一种学习方法,确保他会比别人对事物了解的更透彻。",
]
# English translations of the same passages, deliberately listed in a different order.
en_list = [
    "On November 10th, 2022, Forbes published the 2022 China Mainland Rich List. The total wealth of the people on this list dropped from $1.48 trillion last year to $907.1 billion, a drop of 39%, which was the biggest drop since Forbes surveyed the richest people in mainland China for more than 20 years. " ,
    "New energy refers to various forms of energy other than traditional energy. All its forms come directly or indirectly from the heat energy generated by the sun or the earth. Including solar energy, wind energy, biomass energy, geothermal energy, water energy and ocean energy, as well as energy generated by biofuels and hydrogen derived from renewable energy. It can also be said that new energy includes all kinds of renewable energy and nuclear energy. Compared with traditional energy sources, new energy sources generally have the characteristics of less pollution and large reserves, which is of great significance to solve the serious environmental pollution problem and the depletion of resources (especially fossil energy) in the world today. " ,
    "QS Quacquarelli Symonds, an international higher education research institution, officially released the 20th edition of the World University Rankings on June 28th, 2023, which brought employability and sustainable development indicators into the ranking system for the first time, becoming the only ranking in the world that includes both indicators." ,
    "Feynman learning method can be simplified to four words: Concept, Teach, Review and Simplify. Feynman's learning method is inspired by Richard Feynman, the Nobel Prize winner in physics. With Feynman's skills, you can understand the knowledge points in depth in just 20 minutes, and it is memorable and hard to forget. There are two types of knowledge, and most of us pay attention to the wrong kind. The first kind of knowledge focuses on knowing the name of something. The second kind of knowledge focuses on understanding something. This is not the same thing. Richard Feynman, a famous Nobel physicist, can understand the difference between the two, which is one of the most important reasons for his success. In fact, he created a learning method to ensure that he would know things better than others. " ,
    "The Royal Swedish Academy of Sciences announced in Stockholm on October 10th, 2022 that it would award the 2022 Nobel Prize in Economics to economists Ben Bernanke, Douglas Diamond and Philip Dybvig in recognition of their outstanding contributions in the field of banking and financial crisis research." ,
]
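# Encode both sides into sentence embeddings (one row per passage).
zh_vecs = model.encode(zh_list)
en_vecs = model.encode(en_list)

def l2_norm(arr, axis=-1):
    """Return the L2 norm along `axis`, keeping dims so the result broadcasts."""
    return (arr ** 2).sum(axis=axis, keepdims=True) ** 0.5

# Scale every vector to unit length so that a dot product equals cosine similarity.
en_vecs /= l2_norm(en_vecs)
zh_vecs /= l2_norm(zh_vecs)
# sim_mat[i, j] is the cosine similarity between en_list[i] and zh_list[j].
sim_mat = en_vecs @ zh_vecs.T
# Sort each row in descending order of similarity, keeping both values and indices.
sims = np.sort(sim_mat, axis=-1)[:, ::-1]
idcs = np.argsort(sim_mat, axis=-1)[:, ::-1]
# Keep only the best-matching Chinese passage for each English passage.
idcs_top1 = idcs[:, 0].ravel()
sims_top1 = sims[:, 0].ravel()
# Print each English passage with its aligned Chinese passage ('相似度' means 'similarity').
for i, (j, sim) in enumerate(zip(idcs_top1, sims_top1)):
    print(en_list[i] + '\n' + zh_list[j] + f'\n相似度:{sim}\n' + '=' * 30)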
'''
On November 10th, 2022, Forbes published the 2022 China Mainland Rich List. The total wealth of the people on this list dropped from $1.48 trillion last year to $907.1 billion, a drop of 39%, which was the biggest drop since Forbes surveyed the richest people in mainland China for more than 20 years.
2022年11月10日,《福布斯》发布2022中国内地富豪榜。本次上榜者的财富总额从去年的1.48万亿美元下降至9,071亿美元,跌幅达到39%,并创下了《福布斯》调查中国内地富豪20多年以来的最大跌幅。
相似度:0.7973945736885071
==============================
New energy refers to various forms of energy other than traditional energy. All its forms come directly or indirectly from the heat energy generated by the sun or the earth. Including solar energy, wind energy, biomass energy, geothermal energy, water energy and ocean energy, as well as energy generated by biofuels and hydrogen derived from renewable energy. It can also be said that new energy includes all kinds of renewable energy and nuclear energy. Compared with traditional energy sources, new energy sources generally have the characteristics of less pollution and large reserves, which is of great significance to solve the serious environmental pollution problem and the depletion of resources (especially fossil energy) in the world today.
新能源是指传统能源之外的各种能源形式。它的各种形式都是直接或者间接地来自于太阳或地球内部所产生的热能。包括太阳能、风能、生物质能、地热能、水能和海洋能以及由可再生能源衍生出来的生物燃料和氢所产生的能量。也可以说,新能源包括各种可再生能源和核能。相对于传统能源,新能源普遍具有污染少、储量大的特点,对于解决当今世界严重的环境污染问题和资源(特别是化石能源)枯竭问题具有重要意义。
相似度:0.8789420127868652
==============================
QS Quacquarelli Symonds, an international higher education research institution, officially released the 20th edition of the World University Rankings on June 28th, 2023, which brought employability and sustainable development indicators into the ranking system for the first time, becoming the only ranking in the world that includes both indicators.
国际高等教育研究机构QS Quacquarelli Symonds于2023年6月28日正式发布第20版世界大学排名,首次将就业能力和可持续发展指标纳入排名体系,成为全球唯一一个同时包含这两项指标的排名。
相似度:0.8807516098022461
==============================
Feynman learning method can be simplified to four words: Concept, Teach, Review and Simplify. Feynman's learning method is inspired by Richard Feynman, the Nobel Prize winner in physics. With Feynman's skills, you can understand the knowledge points in depth in just 20 minutes, and it is memorable and hard to forget. There are two types of knowledge, and most of us pay attention to the wrong kind. The first kind of knowledge focuses on knowing the name of something. The second kind of knowledge focuses on understanding something. This is not the same thing. Richard Feynman, a famous Nobel physicist, can understand the difference between the two, which is one of the most important reasons for his success. In fact, he created a learning method to ensure that he would know things better than others.
费曼学习法可以简化为四个单词:Concept (概念)、Teach (教给别人)、Review (评价)、Simplify (简化)。  费曼学习法的灵感源于诺贝尔物理奖获得者理查德•费曼(Richard Feynman),运用费曼技巧,你只需花上20分钟就能深入理解知识点,而且记忆深刻,难以遗忘。知识有两种类型,我们绝大多数人关注的都是错误的那类。第一类知识注重了解某个事物的名称。第二类知识注重了解某件事物。这可不是一回事儿。著名的诺贝尔物理学家理查德·费曼(Richard Feynman)能够理解这二者间的差别,这也是他成功最重要的原因之一。事实上,他创造了一种学习方法,确保他会比别人对事物了解的更透彻。
相似度:0.8909085988998413
==============================
The Royal Swedish Academy of Sciences announced in Stockholm on October 10th, 2022 that it would award the 2022 Nobel Prize in Economics to economists Ben Bernanke, Douglas Diamond and Philip Dybvig in recognition of their outstanding contributions in the field of banking and financial crisis research.
瑞典皇家科学院2022年10月10日在斯德哥尔摩宣布,将2022年诺贝尔经济学奖授予经济学家本·伯南克(Ben Bernanke)、道格拉斯·戴蒙德(Douglas Diamond)和菲利普·迪布维格(Philip Dybvig),以表彰他们在银行与金融危机研究领域的突出贡献。
相似度:0.8677741289138794
==============================
'''
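
The script above takes the top-1 match in one direction only, so two English passages could in principle be paired with the same Chinese passage, and a passage with no real counterpart would still get a match. One possible refinement, which is not part of the original example, is to keep only pairs that are each other's nearest neighbour and that clear a similarity threshold. The sketch below uses toy 2-D unit vectors in place of the model embeddings; the function name `mutual_top1` and the cut-off `min_sim=0.5` are illustrative assumptions, not tuned values.

```python
import numpy as np


def mutual_top1(sim_mat, min_sim=0.5):
    """Return (row, col, sim) for pairs that are each other's best match.

    sim_mat[i, j] is the cosine similarity between source item i and
    target item j. min_sim is a hypothetical cut-off for illustration.
    """
    best_col = sim_mat.argmax(axis=1)  # best target for every source row
    best_row = sim_mat.argmax(axis=0)  # best source for every target column
    pairs = []
    for i, j in enumerate(best_col):
        # Keep the pair only if the choice is reciprocal and similar enough.
        if best_row[j] == i and sim_mat[i, j] >= min_sim:
            pairs.append((i, int(j), float(sim_mat[i, j])))
    return pairs


# Toy stand-ins for the embeddings: three unit vectors per side,
# described by their angle in degrees.
angles_src = np.deg2rad([0.0, 90.0, 30.0])
angles_tgt = np.deg2rad([85.0, 5.0, 200.0])
src = np.stack([np.cos(angles_src), np.sin(angles_src)], axis=1)
tgt = np.stack([np.cos(angles_tgt), np.sin(angles_tgt)], axis=1)

toy_sim = src @ tgt.T  # cosine similarity, since all rows have unit length
toy_pairs = mutual_top1(toy_sim)
for i, j, sim in toy_pairs:
    print(f'source {i} <-> target {j}  similarity: {sim:.3f}')
```

With the real data, `sim_mat` from the script could be passed in directly. Source item 2 in the toy data is dropped because its preferred target already has a closer partner, which is the situation this filter is meant to catch.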