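A Small Example of Using an Embedding Extractor for Parallel Corpus Alignment

Summary: A small example that aligns Chinese and English passages by encoding both sides with the m3e-base sentence-embedding model and pairing each English passage with its most similar Chinese passage by cosine similarity.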
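from os import path

import numpy as np
from sentence_transformers import SentenceTransformer

# Prefer a local copy of the model if one exists; otherwise fall back to
# the model ID 'moka-ai/m3e-base', which SentenceTransformer resolves by name.
model_path = (
    '/data/m3e-base'
    if path.isdir('/data/m3e-base')
    else 'moka-ai/m3e-base'
)
model = SentenceTransformer(model_path)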
# Chinese passages to align.
zh_list = [
    "国际高等教育研究机构QS Quacquarelli Symonds于2023年6月28日正式发布第20版世界大学排名,首次将就业能力和可持续发展指标纳入排名体系,成为全球唯一一个同时包含这两项指标的排名。",
    "瑞典皇家科学院2022年10月10日在斯德哥尔摩宣布,将2022年诺贝尔经济学奖授予经济学家本·伯南克(Ben Bernanke)、道格拉斯·戴蒙德(Douglas Diamond)和菲利普·迪布维格(Philip Dybvig),以表彰他们在银行与金融危机研究领域的突出贡献。",
    "2022年11月10日,《福布斯》发布2022中国内地富豪榜。本次上榜者的财富总额从去年的1.48万亿美元下降至9,071亿美元,跌幅达到39%,并创下了《福布斯》调查中国内地富豪20多年以来的最大跌幅。",
    "新能源是指传统能源之外的各种能源形式。它的各种形式都是直接或者间接地来自于太阳或地球内部所产生的热能。包括太阳能、风能、生物质能、地热能、水能和海洋能以及由可再生能源衍生出来的生物燃料和氢所产生的能量。也可以说,新能源包括各种可再生能源和核能。相对于传统能源,新能源普遍具有污染少、储量大的特点,对于解决当今世界严重的环境污染问题和资源(特别是化石能源)枯竭问题具有重要意义。",
    "费曼学习法可以简化为四个单词:Concept (概念)、Teach (教给别人)、Review (评价)、Simplify (简化)。  费曼学习法的灵感源于诺贝尔物理奖获得者理查德•费曼(Richard Feynman),运用费曼技巧,你只需花上20分钟就能深入理解知识点,而且记忆深刻,难以遗忘。知识有两种类型,我们绝大多数人关注的都是错误的那类。第一类知识注重了解某个事物的名称。第二类知识注重了解某件事物。这可不是一回事儿。著名的诺贝尔物理学家理查德·费曼(Richard Feynman)能够理解这二者间的差别,这也是他成功最重要的原因之一。事实上,他创造了一种学习方法,确保他会比别人对事物了解的更透彻。",
] 
# English translations of the same passages, listed in a different order than zh_list.
en_list = [
    "On November 10th, 2022, Forbes published the 2022 China Mainland Rich List. The total wealth of the people on this list dropped from $1.48 trillion last year to $907.1 billion, a drop of 39%, which was the biggest drop since Forbes surveyed the richest people in mainland China for more than 20 years. " ,
    "New energy refers to various forms of energy other than traditional energy. All its forms come directly or indirectly from the heat energy generated by the sun or the earth. Including solar energy, wind energy, biomass energy, geothermal energy, water energy and ocean energy, as well as energy generated by biofuels and hydrogen derived from renewable energy. It can also be said that new energy includes all kinds of renewable energy and nuclear energy. Compared with traditional energy sources, new energy sources generally have the characteristics of less pollution and large reserves, which is of great significance to solve the serious environmental pollution problem and the depletion of resources (especially fossil energy) in the world today. " ,
    "QS Quacquarelli Symonds, an international higher education research institution, officially released the 20th edition of the World University Rankings on June 28th, 2023, which brought employability and sustainable development indicators into the ranking system for the first time, becoming the only ranking in the world that includes both indicators." ,
    "Feynman learning method can be simplified to four words: Concept, Teach, Review and Simplify. Feynman's learning method is inspired by Richard Feynman, the Nobel Prize winner in physics. With Feynman's skills, you can understand the knowledge points in depth in just 20 minutes, and it is memorable and hard to forget. There are two types of knowledge, and most of us pay attention to the wrong kind. The first kind of knowledge focuses on knowing the name of something. The second kind of knowledge focuses on understanding something. This is not the same thing. Richard Feynman, a famous Nobel physicist, can understand the difference between the two, which is one of the most important reasons for his success. In fact, he created a learning method to ensure that he would know things better than others. " ,
    "The Royal Swedish Academy of Sciences announced in Stockholm on October 10th, 2022 that it would award the 2022 Nobel Prize in Economics to economists Ben Bernanke, Douglas Diamond and Philip Dybvig in recognition of their outstanding contributions in the field of banking and financial crisis research." ,
]
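# Encode each passage into one embedding vector (one row per passage).
zh_vecs = model.encode(zh_list)
en_vecs = model.encode(en_list)

def l2_norm(arr, axis=-1):
    # Euclidean length of each row; keepdims=True so it broadcasts in the division below.
    return (arr ** 2).sum(axis=axis, keepdims=True) ** 0.5

# Scale every vector to unit length so that a dot product equals cosine similarity.
en_vecs /= l2_norm(en_vecs)
zh_vecs /= l2_norm(zh_vecs)

# sim_mat[i, j] is the cosine similarity between en_list[i] and zh_list[j].
sim_mat = en_vecs @ zh_vecs.T

# Sort each row in descending order of similarity, keeping both scores and indices.
sims = np.sort(sim_mat, axis=-1)[:, ::-1]
idcs = np.argsort(sim_mat, axis=-1)[:, ::-1]

# Keep only the best-matching Chinese passage for each English passage.
idcs_top1 = idcs[:, 0].ravel()
sims_top1 = sims[:, 0].ravel()

# Print each English passage, its best Chinese match, and the score
# (the label '相似度' means 'similarity').
for i, (j, sim) in enumerate(zip(idcs_top1, sims_top1)):
    print(en_list[i] + '\n' + zh_list[j] + f'\n相似度:{sim}\n' + '=' * 30)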
'''
On November 10th, 2022, Forbes published the 2022 China Mainland Rich List. The total wealth of the people on this list dropped from $1.48 trillion last year to $907.1 billion, a drop of 39%, which was the biggest drop since Forbes surveyed the richest people in mainland China for more than 20 years.
2022年11月10日,《福布斯》发布2022中国内地富豪榜。本次上榜者的财富总额从去年的1.48万亿美元下降至9,071亿美元,跌幅达到39%,并创下了《福布斯》调查中国内地富豪20多年以来的最大跌幅。
相似度:0.7973945736885071
==============================
New energy refers to various forms of energy other than traditional energy. All its forms come directly or indirectly from the heat energy generated by the sun or the earth. Including solar energy, wind energy, biomass energy, geothermal energy, water energy and ocean energy, as well as energy generated by biofuels and hydrogen derived from renewable energy. It can also be said that new energy includes all kinds of renewable energy and nuclear energy. Compared with traditional energy sources, new energy sources generally have the characteristics of less pollution and large reserves, which is of great significance to solve the serious environmental pollution problem and the depletion of resources (especially fossil energy) in the world today.
新能源是指传统能源之外的各种能源形式。它的各种形式都是直接或者间接地来自于太阳或地球内部所产生的热能。包括太阳能、风能、生物质能、地热能、水能和海洋能以及由可再生能源衍生出来的生物燃料和氢所产生的能量。也可以说,新能源包括各种可再生能源和核能。相对于传统能源,新能源普遍具有污染少、储量大的特点,对于解决当今世界严重的环境污染问题和资源(特别是化石能源)枯竭问题具有重要意义。
相似度:0.8789420127868652
==============================
QS Quacquarelli Symonds, an international higher education research institution, officially released the 20th edition of the World University Rankings on June 28th, 2023, which brought employability and sustainable development indicators into the ranking system for the first time, becoming the only ranking in the world that includes both indicators.
国际高等教育研究机构QS Quacquarelli Symonds于2023年6月28日正式发布第20版世界大学排名,首次将就业能力和可持续发展指标纳入排名体系,成为全球唯一一个同时包含这两项指标的排名。
相似度:0.8807516098022461
==============================
Feynman learning method can be simplified to four words: Concept, Teach, Review and Simplify. Feynman's learning method is inspired by Richard Feynman, the Nobel Prize winner in physics. With Feynman's skills, you can understand the knowledge points in depth in just 20 minutes, and it is memorable and hard to forget. There are two types of knowledge, and most of us pay attention to the wrong kind. The first kind of knowledge focuses on knowing the name of something. The second kind of knowledge focuses on understanding something. This is not the same thing. Richard Feynman, a famous Nobel physicist, can understand the difference between the two, which is one of the most important reasons for his success. In fact, he created a learning method to ensure that he would know things better than others.
费曼学习法可以简化为四个单词:Concept (概念)、Teach (教给别人)、Review (评价)、Simplify (简化)。  费曼学习法的灵感源于诺贝尔物理奖获得者理查德•费曼(Richard Feynman),运用费曼技巧,你只需花上20分钟就能深入理解知识点,而且记忆深刻,难以遗忘。知识有两种类型,我们绝大多数人关注的都是错误的那类。第一类知识注重了解某个事物的名称。第二类知识注重了解某件事物。这可不是一回事儿。著名的诺贝尔物理学家理查德·费曼(Richard Feynman)能够理解这二者间的差别,这也是他成功最重要的原因之一。事实上,他创造了一种学习方法,确保他会比别人对事物了解的更透彻。
相似度:0.8909085988998413
==============================
The Royal Swedish Academy of Sciences announced in Stockholm on October 10th, 2022 that it would award the 2022 Nobel Prize in Economics to economists Ben Bernanke, Douglas Diamond and Philip Dybvig in recognition of their outstanding contributions in the field of banking and financial crisis research.
瑞典皇家科学院2022年10月10日在斯德哥尔摩宣布,将2022年诺贝尔经济学奖授予经济学家本·伯南克(Ben Bernanke)、道格拉斯·戴蒙德(Douglas Diamond)和菲利普·迪布维格(Philip Dybvig),以表彰他们在银行与金融危机研究领域的突出贡献。
相似度:0.8677741289138794
==============================
'''
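
The script above takes the top-1 Chinese match for every English passage, so two English passages could in principle pick the same Chinese passage, and a passage with no real counterpart would still be paired with something. One possible refinement, not part of the original example, is to keep only mutual best matches and drop low-scoring pairs. The sketch below is written in plain Python so it runs without the model. The function name `mutual_best_matches` and the `min_sim` cutoff are hypothetical, and the toy matrix is made up for illustration.

```python
def mutual_best_matches(sim_mat, min_sim=0.0):
    """Return (row, col, sim) triples where row and col are each other's best match.

    sim_mat: 2-D row-indexable matrix (nested lists or a NumPy array), where
             sim_mat[i][j] is the similarity of English passage i and Chinese passage j.
    min_sim: hypothetical cutoff; pairs scoring below it are dropped.
    """
    n_rows = len(sim_mat)
    n_cols = len(sim_mat[0]) if n_rows else 0
    if n_cols == 0:
        return []
    pairs = []
    for i in range(n_rows):
        # Best column for this row (the same top-1 choice as the script above).
        j = max(range(n_cols), key=lambda c: sim_mat[i][c])
        # Best row for that column, i.e. the top-1 in the reverse direction.
        i_back = max(range(n_rows), key=lambda r: sim_mat[r][j])
        if i_back == i and sim_mat[i][j] >= min_sim:
            pairs.append((i, j, float(sim_mat[i][j])))
    return pairs


# Toy similarity matrix (made-up values): rows 0 and 2 both prefer column 0,
# but column 0 prefers row 0, so row 2 is left unaligned.
toy = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.4],
]
print(mutual_best_matches(toy, min_sim=0.5))
# → [(0, 0, 0.9), (1, 1, 0.8)]
```

With the variables from the script, `mutual_best_matches(sim_mat, min_sim=0.5)` would return the surviving `(en_index, zh_index, similarity)` triples. The value 0.5 is only a placeholder; a suitable cutoff would have to be chosen by inspecting scores on real data.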