Introduction to seq2seq
Model architecture:
The Seq2Seq (Sequence-to-Sequence) model is an architecture widely used in natural language processing (NLP). Its core idea is to take one sequence as input and produce another sequence as output. This makes it particularly well suited to tasks such as machine translation, chatbots, and automatic summarization, where both the input and the output are of variable length.
- The embedding layer plays a key role in a seq2seq model: it converts discrete words into continuous vector representations, providing effective feature input for the downstream NLP task (see the sketch below).
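To make the embedding step concrete, here is a minimal sketch (not part of the original tutorial; the vocabulary size and embedding dimension are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Assumed toy sizes: a vocabulary of 10 tokens, 4-dimensional embeddings.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Token indices, such as those produced by a word2index lookup.
token_ids = torch.tensor([2, 3, 4])

vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([3, 4]): one dense vector per token
```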
Dataset
Download: https://download.pytorch.org/tutorial/data.zip
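If you prefer to fetch the archive from Python rather than a browser, a sketch like the following works (assumption: after extraction you point `data_path` below at wherever `eng-fra.txt` lands):

```python
import urllib.request
import zipfile

# Download the tutorial archive and unpack it into the current directory.
urllib.request.urlretrieve("https://download.pytorch.org/tutorial/data.zip", "data.zip")
with zipfile.ZipFile("data.zip") as zf:
    zf.extractall(".")
```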
🍸️ Steps:
The process of implementing translation with a GRU-based seq2seq architecture:
- Import the required packages.
- Process the data in the file so that it meets the model's training requirements.
- Build the GRU-based encoder and decoder (a minimal sketch follows this list).
- Build the model training function and run training.
- Build the model evaluation function, then test it and analyze the attention effect.
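As a preview of the encoder/decoder step, here is a minimal sketch of a GRU encoder, written under the usual PyTorch seq2seq tutorial conventions (one token per step, batch size 1). It is an outline, not the article's final implementation, which is covered in part two:

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    """Sketch of a GRU encoder: embed one token, feed it through the GRU."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        # input holds a single token index; embedded shape is (1, 1, hidden_size).
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)
```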
```python
from io import open
import unicodedata
import re
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```
Data preprocessing
Map the words of a given language to numeric indices 💫
```python
SOS_token = 0
EOS_token = 1

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.index2word[self.n_words] = word
            self.n_words += 1
```
- Test: instantiate with sample parameters:

```python
name = "eng"
sentence = "hello I am Jay"

engl = Lang(name)
engl.addSentence(sentence)
print("word2index:", engl.word2index)
print("index2word:", engl.index2word)
print("n_words:", engl.n_words)
```

Output:

```
word2index: {'hello': 2, 'I': 3, 'am': 4, 'Jay': 5}
index2word: {0: 'SOS', 1: 'EOS', 2: 'hello', 3: 'I', 4: 'am', 5: 'Jay'}
n_words: 6
```
Character normalization 💫

```python
def unicodeToAscii(s):
    # Strip accents: decompose characters (NFD) and drop combining marks (Mn).
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    # Put a space before . ! ? so punctuation becomes its own token.
    s = re.sub(r"([.!?])", r" \1", s)
    # Collapse everything that is not a letter or . ! ? into a single space.
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
```
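A quick check of the normalization (the input strings here are my own illustrative examples, not from the original tutorial):

```python
print(normalizeString("Ça va?"))             # -> "ca va ?"
print(normalizeString("Je suis déjà là !"))  # -> "je suis deja la !"
```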
Load the data from the file into memory and instantiate the Lang class 💫

```python
data_path = 'eng-fra.txt'

def readLangs(lang1, lang2):
    """Read the language file.
    lang1 is the name of the source language, lang2 the name of the target language.
    Returns the corresponding Lang objects and the list of sentence pairs."""
    lines = open(data_path, encoding='utf-8').read().strip().split('\n')
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    input_lang = Lang(lang1)
    output_lang = Lang(lang2)
    return input_lang, output_lang, pairs
```
- Test: input parameters:

```python
lang1 = "eng"
lang2 = "fra"

input_lang, output_lang, pairs = readLangs(lang1, lang2)
print("First five pairs:", pairs[:5])
```

Output:

```
First five pairs: [['go .', 'va !'], ['run !', 'cours !'], ['run !', 'courez !'], ['wow !', 'ca alors !'], ['fire !', 'au feu !']]
```
Filter out the language pairs that meet our requirements 💫

```python
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

def filterPair(p):
    # Keep a pair only if both sentences are shorter than MAX_LENGTH words
    # and the English sentence starts with one of the chosen prefixes.
    return len(p[0].split(' ')) < MAX_LENGTH and \
        p[0].startswith(eng_prefixes) and \
        len(p[1].split(' ')) < MAX_LENGTH

def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
```
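A quick check with toy pairs (both made up for illustration):

```python
print(filterPair(['i am cold .', 'j ai froid .']))         # True: short and matches "i am "
print(filterPair(['go home now .', 'rentre chez toi .']))  # False: no matching prefix
```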
Integrate the data-preparation functions above 💫

```python
def prepareData(lang1, lang2):
    # Read the raw pairs, filter them, then build both vocabularies
    # from the sentences that survived the filter.
    input_lang, output_lang, pairs = readLangs(lang1, lang2)
    pairs = filterPairs(pairs)
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    return input_lang, output_lang, pairs
```
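Running the integrated step populates `input_lang` and `output_lang`, which the tensor conversion below relies on. A usage sketch (the exact counts depend on the dataset, so none are claimed here):

```python
input_lang, output_lang, pairs = prepareData('eng', 'fra')
print("Counted words:")
print(input_lang.name, input_lang.n_words)
print(output_lang.name, output_lang.n_words)
print(random.choice(pairs))
```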
Convert the language pairs into the tensors the model needs as input 💫

```python
def tensorFromSentence(lang, sentence):
    # Map each word to its index, append EOS, and shape as a column vector.
    indexes = [lang.word2index[word] for word in sentence.split(' ')]
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)
```
- Test input:

```python
pair = pairs[0]
pair_tensor = tensorsFromPair(pair)
print(pair_tensor)
```

Output:

```
(tensor([[2], [3], [4], [1]]), tensor([[2], [3], [4], [5], [1]]))
```
Continued in part 2, Implementing English-to-French translation with the seq2seq architecture (2): https://developer.aliyun.com/article/1544784?spm=a2c6h.13148508.setting.28.22454f0eHFZZj3