使用神经网络为图像生成标题（下）-阿里云开发者社区

使用神经网络为图像生成标题（下）

2022-12-18 83

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： 使用神经网络为图像生成标题

上面的代码将生成一个字典，其中每个令牌都被编码为整数，反之亦然。示例输出如下所示

tokenizer.word_index {'a': 1,
  'end': 2,
  'start': 3,
  'in': 4,
  'the': 5,
  'on': 6,
  'is': 7,
  'and': 8,
  'dog': 9,
  'with': 10,
  'man': 11,
  'of': 12,
  'two': 13,
  'black': 14,
  'white': 15,
  'boy': 16,
  'woman': 17,
  'girl': 18,
  'wearing': 19,
  'are': 20,
  'brown': 21.....}

在此之后，我们需要找到词汇表的长度和最长标题的长度。让我们看看这两种方法在创建模型时的重要性。

词汇长度:词汇长度基本上是我们语料库中唯一单词的数量。此外，输出层中的神经元将等于词汇表长度+ 1(+ 1表示由于填充序列而产生的额外空白)，因为在每次迭代时，我们需要模型从语料库中生成一个新单词。

最大标题长度:因为在我们的数据集中，即使对于相同的图像，标题也是可变长度的。让我们试着更详细地理解这个

正如您所看到的，每个标题都有不同的长度，因此我们不能将它们用作我们的LSTM模型的输入。为了解决这个问题，我们填充填充每个标题到最大标题的长度。

注意，每个序列都有一组额外的0来增加它的长度到最大序列。

# compute length of vocabulary and maximum length of a caption (for padding)
vocab_len = len(tokenizer.word_counts) + 1
print(f"Vocabulary length - {vocab_len}")
max_caption_len = max([len(x.split(" ")) for x in all_captions])
print(f"Maximum length of caption - {max_caption_len}")

接下来，我们需要为指定输入和输出的模型创建训练数据集。对于我们的问题，我们有两个输入和一个输出。为了便于理解，让我们更详细地看看这个

对于每个图像我们都有

图像特征(X1)：利用ResNet50模型提取的形状的Numpy数组(18432，)

输入序列(X2)：这需要更多的解释。每个标题只是一个序列列表，我们的模型试图预测序列中下一个最好的元素。因此，对于每个标题，我们将首先从序列中的第一个元素开始，对该元素的相应输出将是下一个元素。在下一次迭代中，前一次迭代的输出将和前一次迭代的输入(内存)一起成为新的输入，这样一直进行，直到我们到达序列的末尾。

输出(y)：序列中的下一个单词。

下面的代码可以用来实现上面创建训练数据集的逻辑-

from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
# generator function to generate inputs for model
def create_trianing_data(captions, images, tokenizer, max_caption_length, vocab_len, photos_per_batch):
    X1, X2, y = list(), list(), list()
    n=0
    # loop through every image
    while 1:
        for key, cap in captions.items():
            n+=1
            # retrieve the photo feature
            image = images[key]
            for c in cap:
                # encode the sequence
                sequnece = [tokenizer.word_index[word] for word in c.split(' ') if word in list(tokenizer.word_index.keys())]
                # split one sequence into multiple X, y pairs
                for i in range(1, len(sequence)):
                    # creating input, output
                    inp, out = sequence[:i], sequence[i]
                    # padding input                    
                    input_seq = pad_sequences([inp], maxlen=max_caption_length)[0]
                    # encode output sequence
                    output_seq = to_categorical([out], num_classes=vocab_len)[0]
                    # store
                    X1.append(image)
                    X2.append(input_seq)
                    y.append(output_seq)
            # yield the batch data
            if n==photos_per_batch:
                yield ([np.array(X1), np.array(X2)], np.array(y))
                X1, X2, y = list(), list(), list()
                n=0

合并两个子网络

现在我们已经开发了两个子网络(用于生成字幕的图像特征提取器和LSTM)，让我们结合这两个网络来创建我们的最终模型。

对于任何一幅新图像(必须与训练中使用的图像相似)，我们的模型将根据它在训练相似的图像和字幕集时获得的知识生成标题。

下面的代码创建了最终的模型

import keras
def create_model(max_caption_length, vocab_length):
    # sub network for handling the image feature part
    input_layer1 = keras.Input(shape=(18432))
    feature1 = keras.layers.Dropout(0.2)(input_layer1)
    feature2 = keras.layers.Dense(max_caption_length*4, activation='relu')(feature1)
    feature3 = keras.layers.Dense(max_caption_length*4, activation='relu')(feature2)
    feature4 = keras.layers.Dense(max_caption_length*4, activation='relu')(feature3)
    feature5 = keras.layers.Dense(max_caption_length*4, activation='relu')(feature4)
    # sub network for handling the text generation part
    input_layer2 = keras.Input(shape=(max_caption_length,))
    cap_layer1 = keras.layers.Embedding(vocab_length, 300, input_length=max_caption_length)(input_layer2)
    cap_layer2 = keras.layers.Dropout(0.2)(cap_layer1)
    cap_layer3 = keras.layers.LSTM(max_caption_length*4, activation='relu', return_sequences=True)(cap_layer2)
    cap_layer4 = keras.layers.LSTM(max_caption_length*4, activation='relu', return_sequences=True)(cap_layer3)
    cap_layer5 = keras.layers.LSTM(max_caption_length*4, activation='relu', return_sequences=True)(cap_layer4)
    cap_layer6 = keras.layers.LSTM(max_caption_length*4, activation='relu')(cap_layer5)
    # merging the two sub network
    decoder1 = keras.layers.merge.add([feature5, cap_layer6])
    decoder2 = keras.layers.Dense(256, activation='relu')(decoder1)
    decoder3 = keras.layers.Dense(256, activation='relu')(decoder2)
    # output is the next word in sequence
    output_layer = keras.layers.Dense(vocab_length, activation='softmax')(decoder3)
    model = keras.models.Model(inputs=[input_layer1, input_layer2], outputs=output_layer)
    model.summary()
    return model

在编译模型之前，我们需要给嵌入层添加权重。这是通过为语料库(词汇表)中出现的每个标记创建单词嵌入(在高维向量空间中表示标记)来实现的。有一些非常流行的字嵌入模型可以用于这个目的(GloVe, Gensim嵌入模型等)。

我们将使用Spacy内建的“en_core_web_lg”模型来创建令牌的向量表示(即每个令牌将被表示为(300，)numpy数组)。

下面的代码可以用于创建单词嵌入，并将其添加到我们的模型嵌入层。

# create word embeddings
import spacy
nlp = spacy.load('en_core_web_lg')
# create word embeddings
embedding_dimension = 300
embedding_matrix = np.zeros((vocab_len, embedding_dimension))
# travel through every word in vocabulary and get its corresponding vector
for word, index in tokenizer.word_index.items():
    doc = nlp(word)
    embedding_vector = np.array(doc.vector)
    embedding_matrix[index] = embedding_vector
# adding embeddings to model
predictive_model.layers[2]
predictive_model.layers[2].set_weights([embedding_matrix])
predictive_model.layers[2].trainable = False

现在我们已经创建了所有的东西，我们只需要编译和训练我们的模型。

注意:由于我们任务的复杂性，这个网络的训练时间会非常长(具有大量的epoch)

# get training data
train_data = create_trianing_data(train_image_captions, train_image_features, tokenizer, max_caption_len, vocab_length, 32)
# initialize model
model = create_model(max_caption_len, vocab_len)
steps_per_epochs = len(train_image_captions)//32
# compile model
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit_generator(train_data, epochs=100, steps_per_epoch=steps_per_epochs)

为了生成新的标题，我们首先需要将一幅图像转换为与训练数据集(18432)图像相同维数的numpy数组，并使用<start>作为模型的输入。

在序列生成过程中，一旦在输出中遇到<end>，我们就会终止这个过程。

import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
%matplotlib inline
# method for generating captions
def generate_captions(model, image, tokenizer.word_index, max_caption_length, tokenizer.index_word):
    # input is <start>
    input_text = '<start>'
    # keep generating words till we have encountered <end>
    for i in range(max_caption_length):
        seq = [tokenizer.word_index[w] for w in in_text.split() if w in list(tokenizer.word_index.keys())]
        seq = pad_sequences([sequence], maxlen=max_caption_length)
        prediction = model.predict([photo,sequence], verbose=0)
        prediction = np.argmax(prediction)
        word = tokenizer.index_word[prediction]
        input_text += ' ' + word
        if word == '<end>':
            break
    # remove <start> and <end> from output and return string
    output = in_text.split()
    output = output[1:-1]
    output = ' '.join(output)
    return output
# traverse through testing images to generate captions
count = 0
for key, value in test_image_features.items():
    test_image = test_image_features[key]
    test_image = np.expand_dims(test_image, axis=0)
    final_caption = generate_captions(predictive_model, test_image, tokenizer.word_index, max_caption_len, tokenizer.index_word)
    plt.figure(figsize=(7,7))
    image = Image.open(image_path + "//" + key + ".jpg")
    plt.imshow(image)
    plt.title(final_caption)
    count = count + 1
    if count == 3:
        break