A Small Example of Using an Embedding Model for Parallel Corpus Alignment

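Summary: A small example that encodes Chinese and English paragraphs with the m3e-base sentence-embedding model and pairs each English paragraph with its most similar Chinese paragraph by cosine similarity.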
from os import path

import numpy as np
from sentence_transformers import SentenceTransformer

# Use a local copy of the model if one exists; otherwise load it by its Hugging Face Hub name.
model_path = (
    '/data/m3e-base'
    if path.isdir('/data/m3e-base')
    else 'moka-ai/m3e-base'
)
model = SentenceTransformer(model_path)
# Chinese paragraphs to be aligned.
zh_list = [
    "国际高等教育研究机构QS Quacquarelli Symonds于2023年6月28日正式发布第20版世界大学排名,首次将就业能力和可持续发展指标纳入排名体系,成为全球唯一一个同时包含这两项指标的排名。",
    "瑞典皇家科学院2022年10月10日在斯德哥尔摩宣布,将2022年诺贝尔经济学奖授予经济学家本·伯南克(Ben Bernanke)、道格拉斯·戴蒙德(Douglas Diamond)和菲利普·迪布维格(Philip Dybvig),以表彰他们在银行与金融危机研究领域的突出贡献。",
    "2022年11月10日,《福布斯》发布2022中国内地富豪榜。本次上榜者的财富总额从去年的1.48万亿美元下降至9,071亿美元,跌幅达到39%,并创下了《福布斯》调查中国内地富豪20多年以来的最大跌幅。",
    "新能源是指传统能源之外的各种能源形式。它的各种形式都是直接或者间接地来自于太阳或地球内部所产生的热能。包括太阳能、风能、生物质能、地热能、水能和海洋能以及由可再生能源衍生出来的生物燃料和氢所产生的能量。也可以说,新能源包括各种可再生能源和核能。相对于传统能源,新能源普遍具有污染少、储量大的特点,对于解决当今世界严重的环境污染问题和资源(特别是化石能源)枯竭问题具有重要意义。",
    "费曼学习法可以简化为四个单词:Concept (概念)、Teach (教给别人)、Review (评价)、Simplify (简化)。  费曼学习法的灵感源于诺贝尔物理奖获得者理查德•费曼(Richard Feynman),运用费曼技巧,你只需花上20分钟就能深入理解知识点,而且记忆深刻,难以遗忘。知识有两种类型,我们绝大多数人关注的都是错误的那类。第一类知识注重了解某个事物的名称。第二类知识注重了解某件事物。这可不是一回事儿。著名的诺贝尔物理学家理查德·费曼(Richard Feynman)能够理解这二者间的差别,这也是他成功最重要的原因之一。事实上,他创造了一种学习方法,确保他会比别人对事物了解的更透彻。",
]
# English translations of the same paragraphs, listed in a different order.
en_list = [
    "On November 10th, 2022, Forbes published the 2022 China Mainland Rich List. The total wealth of the people on this list dropped from $1.48 trillion last year to $907.1 billion, a drop of 39%, which was the biggest drop since Forbes surveyed the richest people in mainland China for more than 20 years. " ,
    "New energy refers to various forms of energy other than traditional energy. All its forms come directly or indirectly from the heat energy generated by the sun or the earth. Including solar energy, wind energy, biomass energy, geothermal energy, water energy and ocean energy, as well as energy generated by biofuels and hydrogen derived from renewable energy. It can also be said that new energy includes all kinds of renewable energy and nuclear energy. Compared with traditional energy sources, new energy sources generally have the characteristics of less pollution and large reserves, which is of great significance to solve the serious environmental pollution problem and the depletion of resources (especially fossil energy) in the world today. " ,
    "QS Quacquarelli Symonds, an international higher education research institution, officially released the 20th edition of the World University Rankings on June 28th, 2023, which brought employability and sustainable development indicators into the ranking system for the first time, becoming the only ranking in the world that includes both indicators." ,
    "Feynman learning method can be simplified to four words: Concept, Teach, Review and Simplify. Feynman's learning method is inspired by Richard Feynman, the Nobel Prize winner in physics. With Feynman's skills, you can understand the knowledge points in depth in just 20 minutes, and it is memorable and hard to forget. There are two types of knowledge, and most of us pay attention to the wrong kind. The first kind of knowledge focuses on knowing the name of something. The second kind of knowledge focuses on understanding something. This is not the same thing. Richard Feynman, a famous Nobel physicist, can understand the difference between the two, which is one of the most important reasons for his success. In fact, he created a learning method to ensure that he would know things better than others. " ,
    "The Royal Swedish Academy of Sciences announced in Stockholm on October 10th, 2022 that it would award the 2022 Nobel Prize in Economics to economists Ben Bernanke, Douglas Diamond and Philip Dybvig in recognition of their outstanding contributions in the field of banking and financial crisis research." ,
]
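# Encode every paragraph into a fixed-size embedding (one row per paragraph).
zh_vecs = model.encode(zh_list)
en_vecs = model.encode(en_list)

def l2_norm(arr, axis=-1):
    """Return the L2 norm along `axis`, keeping dimensions so the result broadcasts."""
    return (arr ** 2).sum(axis=axis, keepdims=True) ** 0.5

# Normalize to unit length so that the dot product equals cosine similarity.
en_vecs /= l2_norm(en_vecs)
zh_vecs /= l2_norm(zh_vecs)
# sim_mat[i, j] is the cosine similarity between en_list[i] and zh_list[j].
sim_mat = en_vecs @ zh_vecs.T
# Sort each row in descending order; column 0 then holds the best match.
sims = np.sort(sim_mat, axis=-1)[:, ::-1]
idcs = np.argsort(sim_mat, axis=-1)[:, ::-1]
idcs_top1 = idcs[:, 0].ravel()
sims_top1 = sims[:, 0].ravel()
# Print each English paragraph with its best Chinese match ('相似度' means "similarity").
for i, (j, sim) in enumerate(zip(idcs_top1, sims_top1)):
    print(en_list[i] + '\n' + zh_list[j] + f'\n相似度:{sim}\n' + '=' * 30)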
'''
On November 10th, 2022, Forbes published the 2022 China Mainland Rich List. The total wealth of the people on this list dropped from $1.48 trillion last year to $907.1 billion, a drop of 39%, which was the biggest drop since Forbes surveyed the richest people in mainland China for more than 20 years.
2022年11月10日,《福布斯》发布2022中国内地富豪榜。本次上榜者的财富总额从去年的1.48万亿美元下降至9,071亿美元,跌幅达到39%,并创下了《福布斯》调查中国内地富豪20多年以来的最大跌幅。
相似度:0.7973945736885071
==============================
New energy refers to various forms of energy other than traditional energy. All its forms come directly or indirectly from the heat energy generated by the sun or the earth. Including solar energy, wind energy, biomass energy, geothermal energy, water energy and ocean energy, as well as energy generated by biofuels and hydrogen derived from renewable energy. It can also be said that new energy includes all kinds of renewable energy and nuclear energy. Compared with traditional energy sources, new energy sources generally have the characteristics of less pollution and large reserves, which is of great significance to solve the serious environmental pollution problem and the depletion of resources (especially fossil energy) in the world today.
新能源是指传统能源之外的各种能源形式。它的各种形式都是直接或者间接地来自于太阳或地球内部所产生的热能。包括太阳能、风能、生物质能、地热能、水能和海洋能以及由可再生能源衍生出来的生物燃料和氢所产生的能量。也可以说,新能源包括各种可再生能源和核能。相对于传统能源,新能源普遍具有污染少、储量大的特点,对于解决当今世界严重的环境污染问题和资源(特别是化石能源)枯竭问题具有重要意义。
相似度:0.8789420127868652
==============================
QS Quacquarelli Symonds, an international higher education research institution, officially released the 20th edition of the World University Rankings on June 28th, 2023, which brought employability and sustainable development indicators into the ranking system for the first time, becoming the only ranking in the world that includes both indicators.
国际高等教育研究机构QS Quacquarelli Symonds于2023年6月28日正式发布第20版世界大学排名,首次将就业能力和可持续发展指标纳入排名体系,成为全球唯一一个同时包含这两项指标的排名。
相似度:0.8807516098022461
==============================
Feynman learning method can be simplified to four words: Concept, Teach, Review and Simplify. Feynman's learning method is inspired by Richard Feynman, the Nobel Prize winner in physics. With Feynman's skills, you can understand the knowledge points in depth in just 20 minutes, and it is memorable and hard to forget. There are two types of knowledge, and most of us pay attention to the wrong kind. The first kind of knowledge focuses on knowing the name of something. The second kind of knowledge focuses on understanding something. This is not the same thing. Richard Feynman, a famous Nobel physicist, can understand the difference between the two, which is one of the most important reasons for his success. In fact, he created a learning method to ensure that he would know things better than others.
费曼学习法可以简化为四个单词:Concept (概念)、Teach (教给别人)、Review (评价)、Simplify (简化)。  费曼学习法的灵感源于诺贝尔物理奖获得者理查德•费曼(Richard Feynman),运用费曼技巧,你只需花上20分钟就能深入理解知识点,而且记忆深刻,难以遗忘。知识有两种类型,我们绝大多数人关注的都是错误的那类。第一类知识注重了解某个事物的名称。第二类知识注重了解某件事物。这可不是一回事儿。著名的诺贝尔物理学家理查德·费曼(Richard Feynman)能够理解这二者间的差别,这也是他成功最重要的原因之一。事实上,他创造了一种学习方法,确保他会比别人对事物了解的更透彻。
相似度:0.8909085988998413
==============================
The Royal Swedish Academy of Sciences announced in Stockholm on October 10th, 2022 that it would award the 2022 Nobel Prize in Economics to economists Ben Bernanke, Douglas Diamond and Philip Dybvig in recognition of their outstanding contributions in the field of banking and financial crisis research.
瑞典皇家科学院2022年10月10日在斯德哥尔摩宣布,将2022年诺贝尔经济学奖授予经济学家本·伯南克(Ben Bernanke)、道格拉斯·戴蒙德(Douglas Diamond)和菲利普·迪布维格(Philip Dybvig),以表彰他们在银行与金融危机研究领域的突出贡献。
相似度:0.8677741289138794
==============================
'''
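
The script above keeps the top-1 Chinese match for every English paragraph, so nothing prevents two English paragraphs from selecting the same Chinese paragraph on noisier data. One possible safeguard, which is not part of the original example, is to keep only mutual best matches whose similarity is above a threshold. The sketch below illustrates this with toy 2-D vectors standing in for the output of `model.encode(...)`. The names `l2_normalize`, `mutual_best_matches`, and `min_sim` are hypothetical helpers introduced here for illustration, and the threshold value is an arbitrary assumption.

```python
import numpy as np


def l2_normalize(vecs):
    """Scale each row to unit length so dot products equal cosine similarities."""
    norms = np.linalg.norm(vecs, axis=-1, keepdims=True)
    return vecs / np.clip(norms, 1e-12, None)  # guard against all-zero rows


def mutual_best_matches(src_vecs, tgt_vecs, min_sim=0.0):
    """Return (src_idx, tgt_idx, sim) triples that are each other's top-1 match."""
    sim_mat = l2_normalize(src_vecs) @ l2_normalize(tgt_vecs).T
    best_tgt = sim_mat.argmax(axis=1)  # best target for every source row
    best_src = sim_mat.argmax(axis=0)  # best source for every target column
    pairs = []
    for i, j in enumerate(best_tgt):
        sim = float(sim_mat[i, j])
        # Keep the pair only if the choice is reciprocal and similar enough.
        if best_src[j] == i and sim >= min_sim:
            pairs.append((i, int(j), sim))
    return pairs


# Toy vectors (hypothetical data): rows 0 and 2 of toy_en both prefer toy_zh[1].
toy_en = np.array([[1.0, 0.1], [0.1, 1.0], [0.9, 0.3]])
toy_zh = np.array([[0.0, 2.0], [3.0, 0.0]])

for i, j, sim in mutual_best_matches(toy_en, toy_zh, min_sim=0.5):
    print(i, j, round(sim, 3))
```

With real embeddings, `en_vecs` and `zh_vecs` from the script could be passed in place of the toy arrays. Paragraphs that end up without a mutual match would then need manual review or a different alignment strategy.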