Generating Image Captions with Machine Learning (Part 2) - Alibaba Cloud Developer Community

2022-12-22

Summary: Generating image captions with machine learning

Data Preparation

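This is one of the most important parts of the project. For the images, we need to convert each one into a fixed-size vector using the Inception V3 model, as described earlier.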

import glob
import pickle

import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

# Directory that contains all the images
all_images_path = 'dataset/Flickr8k_Dataset/Flicker8k_Dataset/'
# List of all image paths in the directory
all_images = glob.glob(all_images_path + '*.jpg')


# Return the full paths of the images whose names are listed in file_path
def create_list_of_images(file_path):
    with open(file_path, 'r') as f:
        image_names = set(f.read().strip().split('\n'))
    images = []
    for img_path in all_images:
        if img_path[len(all_images_path):] in image_names:
            images.append(img_path)
    return images


train_images_path = 'dataset/Flickr8k_text/Flickr_8k.trainImages.txt'
test_images_path = 'dataset/Flickr8k_text/Flickr_8k.testImages.txt'
train_images = create_list_of_images(train_images_path)
test_images = create_list_of_images(test_images_path)


# Load an image at 299x299 and apply the InceptionV3 preprocessing
def preprocess(image_path):
    img = image.load_img(image_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    return x


# Load the InceptionV3 model
model = InceptionV3(weights='imagenet')
# Create a new model by removing the last (output) layer of InceptionV3
model_new = Model(model.input, model.layers[-2].output)


# Encode a given image into a vector of size (2048,)
def encode(image_path):
    x = preprocess(image_path)
    fea_vec = model_new.predict(x)
    fea_vec = np.reshape(fea_vec, fea_vec.shape[1])
    return fea_vec


encoding_train = {}
for img in train_images:
    encoding_train[img[len(all_images_path):]] = encode(img)

encoding_test = {}
for img in test_images:
    encoding_test[img[len(all_images_path):]] = encode(img)

# Save the bottleneck features to disk
with open("encoded_files/encoded_train_images.pkl", "wb") as encoded_pickle:
    pickle.dump(encoding_train, encoded_pickle)
with open("encoded_files/encoded_test_images.pkl", "wb") as encoded_pickle:
    pickle.dump(encoding_test, encoded_pickle)

# Reload the training features from disk
with open("encoded_files/encoded_train_images.pkl", "rb") as encoded_pickle:
    train_features = pickle.load(encoded_pickle)

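In the code above:
create_list_of_images loads the paths of the training and test images into separate lists.
preprocess and encode, together with the two loops, go through every image in the training and test sets: each image is loaded at a fixed size (299x299), preprocessed, passed through InceptionV3 to extract features, and finally reshaped into a vector of size 2048.
The pickle.dump calls save the extracted features to disk.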

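We will not predict the whole caption at once: we do not simply hand the image to the computer and ask it to produce the text. Instead, we give it the image's feature vector together with the first word of the caption and let it predict the second word. Then we give it the first two words and let it predict the third, and so on. Consider the image and caption given in the dataset section, "a girl going into a wooden building". After adding the tokens "startseq" and "endseq", each input (Xi) consists of the image feature vector plus the caption words seen so far, and the corresponding output (Yi) is the next word.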
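
To make the step-by-step prediction concrete, here is a minimal sketch of how one caption expands into (input, output) pairs. The helper name make_pairs is hypothetical and not part of the original project, and the image feature vector is left out because it is identical for every pair.

```python
# Hypothetical helper (not from the original project): expand one caption
# into (partial caption, next word) training pairs.
def make_pairs(caption):
    words = ['startseq'] + caption.split() + ['endseq']
    pairs = []
    for i in range(1, len(words)):
        # Input: all words seen so far; output: the word that follows them
        pairs.append((' '.join(words[:i]), words[i]))
    return pairs


pairs = make_pairs('a girl going into a wooden building')
for x_i, y_i in pairs:
    print(f'{x_i!r} -> {y_i!r}')
```

The first pair maps "startseq" to "a", and the last pair maps the full caption to "endseq", which is how the model learns when to stop.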

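Next, we use the "index" dictionary we created (wordtoix in the code) to replace every word in the inputs and outputs with its index. For batching, all sequences must have the same length, so each sequence is padded with zeros up to the maximum length (computed as 34, as described earlier). As you can see, this produces a large amount of data, and loading all of it into memory at once is not feasible. We therefore use a data generator that loads it in small chunks to reduce memory usage.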
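
The index mapping and zero padding can be sketched without any framework. The vocabulary below and the helper name encode_and_pad are assumptions made for illustration only. The zeros are placed in front of the sequence, which matches the default behaviour of the Keras pad_sequences function (padding='pre') that the generator code calls.

```python
# Toy vocabulary (assumption for illustration); index 0 is reserved for padding
toy_wordtoix = {'startseq': 1, 'a': 2, 'girl': 3, 'going': 4, 'endseq': 5}


def encode_and_pad(text, wordtoix, max_length):
    # Map each known word to its index, skipping words not in the vocabulary
    seq = [wordtoix[w] for w in text.split() if w in wordtoix]
    # Keep at most the last max_length indices, then left-pad with zeros
    seq = seq[-max_length:]
    return [0] * (max_length - len(seq)) + seq


padded = encode_and_pad('startseq a girl', toy_wordtoix, max_length=6)
print(padded)  # [0, 0, 0, 1, 2, 3]
```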

from numpy import array
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical


# Data generator, intended to be used in a call to model.fit_generator()
# Note: vocab_size is read from the enclosing (global) scope
def data_generator(descriptions, photos, wordtoix, max_length, num_photos_per_batch):
    X1, X2, y = list(), list(), list()
    n = 0
    # Loop forever over images
    while 1:
        for key, desc_list in descriptions.items():
            n += 1
            # Retrieve the photo feature
            photo = photos[key + '.jpg']
            for desc in desc_list:
                # Encode the sequence
                seq = [wordtoix[word] for word in desc.split(' ') if word in wordtoix]
                # Split one sequence into multiple X, y pairs
                for i in range(1, len(seq)):
                    # Split into input and output pair
                    in_seq, out_seq = seq[:i], seq[i]
                    # Pad input sequence
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    # Encode output sequence
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    # Store
                    X1.append(photo)
                    X2.append(in_seq)
                    y.append(out_seq)
            # Yield the batch data
            if n == num_photos_per_batch:
                yield ([array(X1), array(X2)], array(y))
                X1, X2, y = list(), list(), list()
                n = 0

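The code above iterates over all images and their descriptions and produces the (input, output) data items described earlier. Because of yield, the function resumes from the same line the next time it is called, which lets us load the data batch by batch.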
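
For readers less familiar with yield, the following toy generator mirrors the control flow of data_generator on a much smaller scale. The name batch_generator and the item names are hypothetical; only the batching pattern is the point here.

```python
# Hypothetical toy generator: yields items in batches and loops forever,
# mirroring the control flow of data_generator
def batch_generator(items, batch_size):
    batch = []
    while True:
        for item in items:
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                # Execution resumes here on the next call to next()
                batch = []


gen = batch_generator(['img1', 'img2', 'img3', 'img4'], batch_size=2)
first = next(gen)
second = next(gen)
third = next(gen)  # the generator has wrapped around to the start
print(first, second, third)
```

Only one batch exists in memory at a time, which is why this pattern keeps memory usage low even when the full set of training pairs is very large.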

Model Architecture and Training

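As mentioned earlier, our model has two inputs at every step: the image feature vector and the partial caption. We first apply a Dropout of 0.5 to the image vector and then connect it to a layer of 256 neurons. The partial caption is first fed into an embedding layer whose weights come from the GloVe-trained embedding matrix described earlier. We then apply a Dropout of 0.5 followed by an LSTM (Long Short-Term Memory) layer. Finally, we merge the two branches and connect them to another layer of 256 neurons, followed by a softmax layer that predicts the probability of every word in our vocabulary. In summary, the high-level architecture is: image vector, Dropout, Dense(256) on one branch; partial caption, Embedding, Dropout, LSTM(256) on the other; the two branches are merged and passed through Dense(256) and a softmax output layer.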

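The following hyperparameters were chosen for training: the loss is categorical cross-entropy and the optimizer is Adam. The model was trained for 30 epochs in total. For the first 20 epochs the learning rate was 0.001 with 3 pictures per batch; for the next 10 epochs the learning rate was 0.0001 with 6 pictures per batch.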

from tensorflow.keras.layers import LSTM, Dense, Dropout, Embedding, Input, add
from tensorflow.keras.models import Model

# Image feature branch
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
# Partial caption branch
inputs2 = Input(shape=(max_length1,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)
# Merge both branches and predict the next word
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)
model = Model(inputs=[inputs1, inputs2], outputs=outputs)

# Use the pre-trained embedding matrix and freeze the embedding layer
model.layers[2].set_weights([embedding_matrix])
model.layers[2].trainable = False
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Phase 1: 20 epochs, 3 pictures per batch, default Adam learning rate (0.001)
# Note: recent TensorFlow/Keras versions accept generators in model.fit directly
epochs = 20
number_pics_per_batch = 3
steps = len(train_descriptions) // number_pics_per_batch
generator = data_generator(train_descriptions, train_features, wordtoix, max_length1, number_pics_per_batch)
history = model.fit_generator(generator, epochs=epochs, steps_per_epoch=steps, verbose=1)

# Phase 2: 10 epochs, 6 pictures per batch, learning rate 0.0001
model.optimizer.lr = 0.0001
epochs = 10
number_pics_per_batch = 6
steps = len(train_descriptions) // number_pics_per_batch
generator = data_generator(train_descriptions, train_features, wordtoix, max_length1, number_pics_per_batch)
history1 = model.fit_generator(generator, epochs=epochs, steps_per_epoch=steps, verbose=1)

model.save('saved_model/model_' + str(30) + '.h5')

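Let's walk through the code:

The layer definitions up to the Model(...) call define the model architecture.

The two model.layers[2] statements set the weights of the embedding layer to the embedding matrix created earlier and set trainable = False, so that layer is not updated during training.

The remaining statements train the model in two separate phases, using the hyperparameters described above.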
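
The softmax output and the categorical cross-entropy loss used in the training code can be illustrated with a small, framework-free sketch. The 4-word vocabulary and the scores are made up for illustration; this is not how Keras implements the loss internally, only the underlying arithmetic for a single training pair.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs):
    # Only the term of the true class contributes: -log(p_true)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


# Toy 4-word vocabulary (assumption); the true next word has index 2
probs = softmax([1.0, 2.0, 3.0, 0.5])
loss = categorical_crossentropy([0, 0, 1, 0], probs)
print(round(sum(probs), 6), round(loss, 4))
```

The loss shrinks toward zero as the probability assigned to the true next word approaches 1, which is what training pushes the model toward.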

Inference

[Figures: training loss over the first 20 epochs, followed by training loss over the next 10 epochs.]

For inference, we write a function that predicts the next word as the one with the highest probability under our model (i.e. greedy search).

import matplotlib.pyplot as plt


# Generate a caption by repeatedly picking the most probable next word
def greedySearch(photo):
    in_text = 'startseq'
    for i in range(max_length1):
        sequence = [wordtoix[w] for w in in_text.split() if w in wordtoix]
        sequence = pad_sequences([sequence], maxlen=max_length1)
        yhat = model.predict([photo, sequence], verbose=0)
        yhat = np.argmax(yhat)
        word = ixtoword[yhat]
        in_text += ' ' + word
        if word == 'endseq':
            break
    # Drop the startseq and endseq tokens
    final = in_text.split()
    final = final[1:-1]
    final = ' '.join(final)
    return final


pic = list(encoding_test.keys())[999]
photo = encoding_test[pic].reshape((1, 2048))
# The image directory is all_images_path, defined in the data preparation code
x = plt.imread(all_images_path + pic)
plt.imshow(x)
plt.show()
print("Greedy:", greedySearch(photo))

The result is quite good.
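
To see the greedy strategy in isolation, here is a minimal sketch that replaces the trained network with a hand-written probability table. Everything prefixed with toy_ is hypothetical, and the probabilities are made up; only the decoding loop follows the same logic as greedySearch.

```python
# Hypothetical stand-in for model.predict: returns one probability per word
# index given the indices generated so far (made-up numbers for illustration)
toy_ixtoword = {0: 'startseq', 1: 'a', 2: 'dog', 3: 'runs', 4: 'endseq'}
toy_wordtoix = {w: i for i, w in toy_ixtoword.items()}
toy_table = {
    'startseq': [0.0, 0.7, 0.2, 0.05, 0.05],
    'a': [0.0, 0.05, 0.8, 0.1, 0.05],
    'dog': [0.0, 0.05, 0.05, 0.6, 0.3],
    'runs': [0.0, 0.05, 0.05, 0.1, 0.8],
}


def toy_predict(sequence):
    # Toy model that only looks at the most recent word
    return toy_table[toy_ixtoword[sequence[-1]]]


def greedy_decode(max_length):
    in_text = 'startseq'
    for _ in range(max_length):
        sequence = [toy_wordtoix[w] for w in in_text.split()]
        probs = toy_predict(sequence)
        # Greedy choice: take the index with the highest probability
        best = max(range(len(probs)), key=probs.__getitem__)
        word = toy_ixtoword[best]
        in_text += ' ' + word
        if word == 'endseq':
            break
    # Drop the startseq and endseq tokens
    return ' '.join(in_text.split()[1:-1])


caption = greedy_decode(max_length=10)
print(caption)  # a dog runs
```

Greedy decoding commits to the single best word at every step, so it is fast but can miss captions whose overall probability is higher.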

If you need the complete source code, see this link: https://github.com/Noumanmufc1/Image-Captioning
