[DSW Gallery] Multimodal CLIP Image-Text Retrieval with EasyNLP

Products covered in this article
Interactive Modeling (PAI-DSW), 5,000 CU*H for 3 months
Overview: EasyNLP provides training and prediction capabilities for a variety of models, aiming to help NLP developers build models and put them into production quickly and easily. Taking multimodal image-text retrieval as an example, this article shows how to use EasyNLP in PAI-DSW to quickly train, evaluate, and predict with CLIP for cross-modal image-text retrieval.

Quick start

Open "Multimodal CLIP Image-Text Retrieval with EasyNLP" and click "Open in DSW" in the upper-right corner.



Multimodal CLIP Image-Text Retrieval with EasyNLP

EasyNLP is an easy-to-use and feature-rich NLP algorithm framework ( https://github.com/alibaba/EasyNLP ) developed by the Alibaba Cloud Machine Learning PAI algorithm team on top of PyTorch. It supports common Chinese pretrained models and technology for deploying large models, and provides a one-stop NLP development experience from training to deployment. EasyNLP offers concise interfaces for developing NLP models, including the AppZoo of NLP applications and the ModelZoo of pretrained models, and provides the tooling to help users efficiently apply very large pretrained models to their business, with the goal of helping NLP developers build models and put them into production quickly and easily. As the demand for cross-modal understanding keeps growing, EasyNLP will also support a variety of cross-modal models, especially Chinese cross-modal models, in the hope of serving more NLP and multimodal algorithm developers and researchers.

Image-text retrieval is a mainstream cross-modal retrieval task that is widely used in web applications. This article shows how to quickly perform cross-modal image-text retrieval with CLIP in PAI-DSW, based on EasyNLP.

About CLIP

CLIP (Contrastive Language-Image Pre-training) is an image-text pretrained representation model based on contrastive learning, proposed by OpenAI in February 2021. It builds separate encoders for images and text and uses them to extract features from each modality. The image encoder's backbone can be a classic ResNet-series model or a more recent Transformer-based model such as ViT; the text encoder is typically a BERT-family model, including variants such as RoBERTa. CLIP is trained with contrastive learning on large-scale image-text data; its accuracy on multiple datasets shows that it outperforms various ImageNet-based models and has strong zero-shot learning capability.
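
To make the contrastive objective concrete, below is a minimal PyTorch sketch of the symmetric contrastive loss that CLIP-style models optimize over a batch of paired image and text embeddings. It is an illustrative sketch, not EasyNLP's or OpenAI's actual training code; the embeddings and the learnable logit_scale are placeholders.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """Symmetric cross-entropy (InfoNCE) loss over paired image/text embeddings."""
    # L2-normalize both modalities so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix, scaled by a learnable temperature.
    logits_per_image = logit_scale.exp() * image_emb @ text_emb.t()
    logits_per_text = logits_per_image.t()
    # The i-th image matches the i-th text, so the targets are the diagonal indices.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    return (F.cross_entropy(logits_per_image, targets) +
            F.cross_entropy(logits_per_text, targets)) / 2

# Example with random tensors standing in for encoder outputs:
# loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512),
#                              torch.nn.Parameter(torch.tensor(2.6592)))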

Runtime requirements

Recommended environment: a Python 3.6 / PyTorch 1.8 image, a P100 or V100 GPU instance, and at least 32 GB of memory.
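
As a quick sanity check of the environment, you can print the Python and PyTorch versions and confirm that a GPU is visible from within the notebook. This snippet is only a convenience and is not part of the original tutorial.

import sys
import torch

print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Name of the GPU visible to this DSW instance, e.g. a P100 or V100.
    print("GPU:", torch.cuda.get_device_name(0))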

Installing EasyNLP

We recommend installing EasyNLP from the source code on GitHub with the following commands:

! git clone https://github.com/alibaba/EasyNLP.git
! pip install -r EasyNLP/requirements.txt
! cd EasyNLP && python setup.py install

You can verify that the installation succeeded with the following command:

! which easynlp
/home/pai/bin/easynlp

If the easynlp CLI tool is found on your system, the EasyNLP library has been installed successfully.
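
In addition to the CLI check above, you can confirm that the Python package itself is importable. This is only a convenience check, assuming a standard package installation.

import easynlp
# Prints the location the package was imported from, e.g. the installed egg.
print("EasyNLP imported from:", easynlp.__file__)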

Data preparation

First, download the training and validation data for this example, along with the single-column test data used to extract embeddings for vector retrieval, into the current working directory. The commands are as follows:

! wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/CLIP/MUGE_MR_train_base64_part.tsv
! wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/CLIP/MUGE_MR_valid_base64_part.tsv
! wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/CLIP/MUGE_MR_test_base64_part_text.tsv
! wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/CLIP/MUGE_MR_test_base64_part_image.tsv
--2022-07-20 11:40:29--  https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/CLIP/MUGE_MR_train_base64_part.tsv
Resolving atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com (atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com)... 47.101.88.27
Connecting to atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com (atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com)|47.101.88.27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7466122 (7.1M) [text/tab-separated-values]
Saving to: ‘MUGE_MR_train_base64_part.tsv’
MUGE_MR_train_base6 100%[===================>]   7.12M  14.1MB/s    in 0.5s    
2022-07-20 11:40:30 (14.1 MB/s) - ‘MUGE_MR_train_base64_part.tsv’ saved [7466122/7466122]
--2022-07-20 11:40:31--  https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/CLIP/MUGE_MR_valid_base64_part.tsv
Resolving atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com (atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com)... 47.101.88.27
Connecting to atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com (atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com)|47.101.88.27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3783806 (3.6M) [text/tab-separated-values]
Saving to: ‘MUGE_MR_valid_base64_part.tsv’
MUGE_MR_valid_base6 100%[===================>]   3.61M  13.8MB/s    in 0.3s    
2022-07-20 11:40:31 (13.8 MB/s) - ‘MUGE_MR_valid_base64_part.tsv’ saved [3783806/3783806]
--2022-07-20 11:40:31--  https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/CLIP/MUGE_MR_test_base64_part_text.tsv
Resolving atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com (atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com)... 47.101.88.27
Connecting to atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com (atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com)|47.101.88.27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 168 [text/tab-separated-values]
Saving to: ‘MUGE_MR_test_base64_part_text.tsv’
MUGE_MR_test_base64 100%[===================>]     168  --.-KB/s    in 0s      
2022-07-20 11:40:32 (125 MB/s) - ‘MUGE_MR_test_base64_part_text.tsv’ saved [168/168]
--2022-07-20 11:40:32--  https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/CLIP/MUGE_MR_test_base64_part_image.tsv
Resolving atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com (atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com)... 47.101.88.27
Connecting to atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com (atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com)|47.101.88.27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104557 (102K) [text/tab-separated-values]
Saving to: ‘MUGE_MR_test_base64_part_image.tsv’
MUGE_MR_test_base64 100%[===================>] 102.11K  --.-KB/s    in 0.08s   
2022-07-20 11:40:32 (1.20 MB/s) - ‘MUGE_MR_test_base64_part_image.tsv’ saved [104557/104557]

Both the training and validation data are .tsv files. Each line is one example, split by a tab (\t) into two columns: the first column is the text and the second is the base64 encoding of the image. The test data used to extract embeddings for vector retrieval has a single column, containing only text or only the base64 encoding of an image.
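
If you want to build a .tsv file in this format from your own data, the sketch below (with hypothetical file names) base64-encodes each image and writes one tab-separated text/image pair per line. It is provided for illustration only and is not part of the tutorial's data pipeline.

import base64

def write_pairs(pairs, out_path):
    """pairs: iterable of (caption_text, image_file_path) tuples."""
    with open(out_path, "w", encoding="utf-8") as fout:
        for text, image_path in pairs:
            with open(image_path, "rb") as fimg:
                # Base64-encode the raw image bytes, matching the second column of the .tsv files.
                img_b64 = base64.b64encode(fimg.read()).decode("utf-8")
            fout.write(f"{text}\t{img_b64}\n")

# write_pairs([("一条红色连衣裙", "dress.jpg")], "my_train.tsv")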

Initialization

In the Python 3.6 environment, we first import the libraries needed to run the model from the newly installed EasyNLP package and perform some initialization. In this tutorial we use the CLIP model clip_chinese_roberta_large_with_vit_large, whose image encoder is a ViT and whose text encoder is a Chinese RoBERTa. EasyNLP integrates a rich library of pretrained models; if you want to try other pretrained models, or other encoder combinations for CLIP, modify user_defined_parameters accordingly. See the model list for the available model names.

# To avoid conflicts between EasyNLP's args and Jupyter's own command-line arguments,
# sys.argv must be set manually here; otherwise initialization fails.
# If you run this code from the command line or in a .py file, the two lines below can be skipped.
import sys
sys.argv = ['main.py']
import torch.cuda
from easynlp.appzoo import MultiModalDataset
from easynlp.appzoo import get_application_predictor, get_application_model, get_application_evaluator, get_application_model_for_evaluation
from easynlp.core import Trainer, PredictorManager
from easynlp.utils import initialize_easynlp, get_args, get_pretrain_model_path
from easynlp.utils.global_vars import parse_user_defined_parameters
initialize_easynlp()
args = get_args()
user_defined_parameters = parse_user_defined_parameters('pretrain_model_name_or_path=clip_chinese_roberta_large_with_vit_large fix_vision=True mode=finetune')
args.checkpoint_dir = "./clip_model/"
args.pretrained_model_name_or_path = "clip_chinese_roberta_large_with_vit_large"
/home/pai/lib/python3.6/site-packages/OpenSSL/crypto.py:12: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography and will be removed in a future release.
  from cryptography import x509
Please ignore the following import error if you are using tunnel table io.
No module named '_common_io'
No module named 'easy_predict'
------------------------ arguments ------------------------
  app_name ........................................ text_classify
  append_cols ..................................... None
  buckets ......................................... None
  checkpoint_dir .................................. None
  chief_hosts ..................................... 
  data_threads .................................... 10
  distributed_backend ............................. nccl
  do_lower_case ................................... False
  epoch_num ....................................... 3.0
  export_tf_checkpoint_type ....................... easytransfer
  first_sequence .................................. None
  gradient_accumulation_steps ..................... 1
  input_schema .................................... None
  is_chief ........................................ 
  is_master_node .................................. True
  job_name ........................................ None
  label_enumerate_values .......................... None
  label_name ...................................... None
  learning_rate ................................... 5e-05
  local_rank ...................................... None
  logging_steps ................................... 100
  master_port ..................................... 23456
  max_grad_norm ................................... 1.0
  micro_batch_size ................................ 2
  mode ............................................ train
  modelzoo_base_dir ............................... 
  n_cpu ........................................... 1
  n_gpu ........................................... 1
  odps_config ..................................... None
  optimizer_type .................................. AdamW
  output_schema ................................... 
  outputs ......................................... None
  predict_queue_size .............................. 1024
  predict_slice_size .............................. 4096
  predict_table_read_thread_num ................... 16
  predict_thread_num .............................. 2
  ps_hosts ........................................ 
  random_seed ..................................... 1234
  rank ............................................ 0
  read_odps ....................................... False
  restore_works_dir ............................... ./.easynlp_predict_restore_works_dir
  resume_from_checkpoint .......................... None
  save_all_checkpoints ............................ False
  save_checkpoint_steps ........................... None
  second_sequence ................................. None
  sequence_length ................................. 16
  skip_first_line ................................. False
  tables .......................................... None
  task_count ...................................... 1
  task_index ...................................... 0
  use_amp ......................................... False
  use_torchacc .................................... False
  user_defined_parameters ......................... None
  user_entry_file ................................. None
  user_script ..................................... None
  warmup_proportion ............................... 0.1
  weight_decay .................................... 0.0001
  worker_count .................................... 1
  worker_cpu ...................................... -1
  worker_gpu ...................................... -1
  worker_hosts .................................... None
  world_size ...................................... 1
-------------------- end of arguments ---------------------
> initializing torch distributed ...
Init dist done. World size: 1, rank 0, l_rank 0
> setting random seeds to 1234 ...

Note: if the code above raises an "Address already in use" error, run the following commands to terminate the process currently occupying the port.

netstat -tunlp|grep 6000

kill -9 PID (replace PID with the process ID shown in the output of the previous command)
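
If you prefer to do this from inside the notebook, the rough sketch below finds and kills the processes that netstat reports on a given port (assuming netstat is available in the image). It is a convenience helper, not part of EasyNLP.

import subprocess

def kill_port(port):
    """Kill processes that netstat reports as occupying the given port."""
    out = subprocess.run(["netstat", "-tunlp"],
                         stdout=subprocess.PIPE, universal_newlines=True).stdout
    for line in out.splitlines():
        if f":{port} " in line:
            # The last column looks like "PID/Program name".
            pid = line.split()[-1].split("/")[0]
            if pid.isdigit():
                subprocess.run(["kill", "-9", pid])

# kill_port(6000)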

Loading the data

We use EasyNLP's built-in MultiModalDataset to load the training and validation data. Its main parameters are:

  • pretrained_model_name_or_path: name or path of the pretrained model; here we use the helper function get_pretrain_model_path to resolve the model name "clip_chinese_roberta_large_with_vit_large" to a local path, downloading the model automatically if needed
  • max_seq_length: maximum text length; longer texts are truncated and shorter ones are padded
  • input_schema: format of the input tsv data; each comma-separated item corresponds to one tab-separated column of each line in the data file, and each item starts with its field name, e.g. label, sent1
  • first_sequence, second_sequence: which fields of input_schema are used as the first and second input columns
  • is_training: whether this is the training phase; True for train_dataset and False for valid_dataset
train_dataset = MultiModalDataset(
            pretrained_model_name_or_path=get_pretrain_model_path("clip_chinese_roberta_large_with_vit_large"),
            data_file="MUGE_MR_train_base64_part.tsv",
            max_seq_length=32,
            input_schema="text:str:1,image:str:1",
            first_sequence="text",
            second_sequence="image",
            is_training=True)
valid_dataset = MultiModalDataset(
            pretrained_model_name_or_path=get_pretrain_model_path("clip_chinese_roberta_large_with_vit_large"),
            data_file="MUGE_MR_valid_base64_part.tsv",
            max_seq_length=32,
            input_schema="text:str:1,image:str:1",
            first_sequence="text",
            second_sequence="image",
            is_training=False)
`/root/.easynlp/modelzoo/alibaba-pai/clip_chinese_roberta_large_with_vit_large.tgz` already exists
`/root/.easynlp/modelzoo/alibaba-pai/clip_chinese_roberta_large_with_vit_large.tgz` already exists
/root/.local/lib/python3.6/site-packages/pai_easynlp-0.0.6-py3.6.egg/easynlp/modelzoo/tokenization_utils_base.py:1632: FutureWarning: Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated and won't be possible anymore in v5. Use a model identifier or the path to a directory instead.
  FutureWarning,

Because we selected clip_chinese_roberta_large_with_vit_large earlier, the pretrained model is automatically downloaded and loaded here.

Model training

With the data prepared, we can build and train the model. We use EasyNLP's get_application_model function to construct the model for training. Its parameters are as follows:

  • app_name: the application name; here we use "clip"
  • pretrained_model_name_or_path: name or path of the pretrained model; as above, we use get_pretrain_model_path to resolve the model name "clip_chinese_roberta_large_with_vit_large" to a local path, downloading the model automatically if needed
  • user_defined_parameters: user-defined parameters; we pass in the user_defined_parameters object prepared above
model = get_application_model(app_name="clip",
                              pretrained_model_name_or_path=get_pretrain_model_path("clip_chinese_roberta_large_with_vit_large"),
                              user_defined_parameters=user_defined_parameters)
`/root/.easynlp/modelzoo/alibaba-pai/clip_chinese_roberta_large_with_vit_large.tgz` already exists
 Loaded weights of the model:
 [embeddings.position_ids, embeddings.word_embeddings.weight, embeddings.position_embeddings.weight, ..., encoder.layer.23.output.LayerNorm.bias, pooler.dense.weight, pooler.dense.bias].
All weights are initialized.
 Loaded weights of the model:
 [vision_model.embeddings.class_embedding, vision_model.embeddings.position_ids, vision_model.embeddings.patch_embedding.weight, ..., vision_model.encoder.layers.23.layer_norm2.bias, vision_model.post_layernorm.weight, vision_model.post_layernorm.bias].
All weights are initialized.

As the log shows, the pretrained model weights have been loaded. Next, we create a training instance with EasyNLP's Trainer class and start training.

trainer = Trainer(model=model, 
                  train_dataset=train_dataset,
                  user_defined_parameters=user_defined_parameters,
                  evaluator=get_application_evaluator(app_name="clip", 
                                                      valid_dataset=valid_dataset,
                                                      user_defined_parameters=user_defined_parameters,
                                                      eval_batch_size=32))
trainer.train()
/home/pai/lib/python3.6/site-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
[2022-07-20 17:10:55,821 INFO] ========== Initializing Tensorboard ==========
[2022-07-20 17:10:55,829 INFO] ========== Training Start ==========
[2022-07-20 17:10:55,832 INFO]   Num of GPUs (all)       = 1
[2022-07-20 17:10:55,833 INFO]   Num of CPUs per worker  = 1
[2022-07-20 17:10:55,833 INFO]   Num dataset examples    = 80
[2022-07-20 17:10:55,833 INFO]   Num training examples   = 80
[2022-07-20 17:10:55,834 INFO]   Num validation examples = 40
[2022-07-20 17:10:55,834 INFO]   Train. batch size       = 2
[2022-07-20 17:10:55,835 INFO]   Train. micro batch size = 2
[2022-07-20 17:10:55,835 INFO]   Train. batch no.        = 120
[2022-07-20 17:10:55,837 INFO]   Evaluation batch size   = 2
[2022-07-20 17:10:55,837 INFO]   Total training steps    = 120
[2022-07-20 17:10:55,838 INFO]   Sequence length         = 16
[2022-07-20 17:10:55,839 INFO]   Saving steps            = None
[2022-07-20 17:10:55,840 INFO]   Distributed_backend     = nccl
[2022-07-20 17:10:55,840 INFO]   Worker Count            = 1
[2022-07-20 17:10:55,841 INFO]   Worker CPU              = -1
[2022-07-20 17:10:55,841 INFO]   Worker data threads     = 10
[2022-07-20 17:10:55,846 INFO]   num model params        = 630,275,073
[2022-07-20 17:10:55,847 INFO]   num trainable params    = 327,095,297
[2022-07-20 17:10:55,847 INFO] 
[2022-07-20 17:10:55,851 INFO] ========== Model Config ==========
[2022-07-20 17:10:55,852 INFO] {"return_dict": true, "output_hidden_states": false, "output_attentions": false, "torchscript": false, "use_bfloat16": false, "pruned_heads": {}, "tie_word_embeddings": true, "is_encoder_decoder": false, "is_decoder": false, "add_cross_attention": false, "tie_encoder_decoder": false, "max_length": 20, "min_length": 0, "do_sample": false, "early_stopping": false, "num_beams": 1, "num_beam_groups": 1, "diversity_penalty": 0.0, "temperature": 1.0, "top_k": 50, "top_p": 1.0, "repetition_penalty": 1.0, "length_penalty": 1.0, "no_repeat_ngram_size": 0, "encoder_no_repeat_ngram_size": 0, "bad_words_ids": null, "num_return_sequences": 1, "chunk_size_feed_forward": 0, "output_scores": false, "return_dict_in_generate": false, "forced_bos_token_id": null, "forced_eos_token_id": null, "remove_invalid_values": false, "architectures": null, "finetuning_task": null, "id2label": {"0": "LABEL_0", "1": "LABEL_1"}, "label2id": {"LABEL_0": 0, "LABEL_1": 1}, "tokenizer_class": null, "prefix": null, "bos_token_id": null, "pad_token_id": null, "eos_token_id": null, "sep_token_id": null, "decoder_start_token_id": null, "task_specific_params": null, "problem_type": null, "_name_or_path": "", "easynlp_version": null, "text_config_dict": {"return_dict": true, "output_hidden_states": false, "output_attentions": false, "torchscript": false, "use_bfloat16": false, "pruned_heads": {}, "tie_word_embeddings": true, "is_encoder_decoder": false, "is_decoder": false, "add_cross_attention": false, "tie_encoder_decoder": false, "max_length": 20, "min_length": 0, "do_sample": false, "early_stopping": false, "num_beams": 1, "num_beam_groups": 1, "diversity_penalty": 0.0, "temperature": 1.0, "top_k": 50, "top_p": 1.0, "repetition_penalty": 1.0, "length_penalty": 1.0, "no_repeat_ngram_size": 0, "encoder_no_repeat_ngram_size": 0, "bad_words_ids": null, "num_return_sequences": 1, "chunk_size_feed_forward": 0, "output_scores": false, "return_dict_in_generate": false, "forced_bos_token_id": null, "forced_eos_token_id": null, "remove_invalid_values": false, "architectures": ["BertForMaskedLM"], "finetuning_task": null, "id2label": {"0": "LABEL_0", "1": "LABEL_1"}, "label2id": {"LABEL_0": 0, "LABEL_1": 1}, "tokenizer_class": null, "prefix": null, "bos_token_id": 0, "pad_token_id": 0, "eos_token_id": 2, "sep_token_id": null, "decoder_start_token_id": null, "task_specific_params": null, "problem_type": null, "_name_or_path": "", "easynlp_version": "0.0.3", "directionality": "bidi", "gradient_checkpointing": false, "model_type": "clip_text_model", "output_past": true, "pooler_fc_size": 768, "pooler_num_attention_heads": 12, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "position_embedding_type": "absolute", "use_cache": true, "vocab_size": 21128, "hidden_size": 1024, "intermediate_size": 4096, "dropout": 0.0, "num_hidden_layers": 24, "num_attention_heads": 16, "max_position_embeddings": 512, "layer_norm_eps": 1e-12, "hidden_act": "gelu", "initializer_range": 0.02, "initializer_factor": 1.0, "attention_dropout": 0.0, "type_vocab_size": 2, "hidden_dropout_prob": 0.1, "attention_probs_dropout_prob": 0.1}, "vision_config_dict": {"return_dict": true, "output_hidden_states": false, "output_attentions": false, "torchscript": false, "use_bfloat16": false, "pruned_heads": {}, "tie_word_embeddings": true, "is_encoder_decoder": false, "is_decoder": false, "add_cross_attention": false, "tie_encoder_decoder": false, "max_length": 20, "min_length": 0, "do_sample": false, 
"early_stopping": false, "num_beams": 1, "num_beam_groups": 1, "diversity_penalty": 0.0, "temperature": 1.0, "top_k": 50, "top_p": 1.0, "repetition_penalty": 1.0, "length_penalty": 1.0, "no_repeat_ngram_size": 0, "encoder_no_repeat_ngram_size": 0, "bad_words_ids": null, "num_return_sequences": 1, "chunk_size_feed_forward": 0, "output_scores": false, "return_dict_in_generate": false, "forced_bos_token_id": null, "forced_eos_token_id": null, "remove_invalid_values": false, "architectures": null, "finetuning_task": null, "id2label": {"0": "LABEL_0", "1": "LABEL_1"}, "label2id": {"LABEL_0": 0, "LABEL_1": 1}, "tokenizer_class": null, "prefix": null, "bos_token_id": null, "pad_token_id": null, "eos_token_id": null, "sep_token_id": null, "decoder_start_token_id": null, "task_specific_params": null, "problem_type": null, "_name_or_path": "", "easynlp_version": "0.0.3", "cross_attention_hidden_size": null, "model_type": "clip_vision_model", "torch_dtype": null, "transformers_version": "4.16.0.dev0", "hidden_size": 1024, "intermediate_size": 4096, "dropout": 0.0, "num_hidden_layers": 24, "num_attention_heads": 16, "patch_size": 14, "image_size": 224, "initializer_range": 0.02, "initializer_factor": 1.0, "attention_dropout": 0.0, "layer_norm_eps": 1e-05, "hidden_act": "quick_gelu"}, "text_config": {"return_dict": true, "output_hidden_states": false, "output_attentions": false, "torchscript": false, "use_bfloat16": false, "pruned_heads": {}, "tie_word_embeddings": true, "is_encoder_decoder": false, "is_decoder": false, "add_cross_attention": false, "tie_encoder_decoder": false, "max_length": 20, "min_length": 0, "do_sample": false, "early_stopping": false, "num_beams": 1, "num_beam_groups": 1, "diversity_penalty": 0.0, "temperature": 1.0, "top_k": 50, "top_p": 1.0, "repetition_penalty": 1.0, "length_penalty": 1.0, "no_repeat_ngram_size": 0, "encoder_no_repeat_ngram_size": 0, "bad_words_ids": null, "num_return_sequences": 1, "chunk_size_feed_forward": 0, "output_scores": false, "return_dict_in_generate": false, "forced_bos_token_id": null, "forced_eos_token_id": null, "remove_invalid_values": false, "architectures": ["BertForMaskedLM"], "finetuning_task": null, "id2label": {"0": "LABEL_0", "1": "LABEL_1"}, "label2id": {"LABEL_0": 0, "LABEL_1": 1}, "tokenizer_class": null, "prefix": null, "bos_token_id": 0, "pad_token_id": 0, "eos_token_id": 2, "sep_token_id": null, "decoder_start_token_id": null, "task_specific_params": null, "problem_type": null, "_name_or_path": "", "easynlp_version": "0.0.3", "directionality": "bidi", "gradient_checkpointing": false, "model_type": "clip_text_model", "output_past": true, "pooler_fc_size": 768, "pooler_num_attention_heads": 12, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "position_embedding_type": "absolute", "use_cache": true, "vocab_size": 21128, "hidden_size": 1024, "intermediate_size": 4096, "dropout": 0.0, "num_hidden_layers": 24, "num_attention_heads": 16, "max_position_embeddings": 512, "layer_norm_eps": 1e-12, "hidden_act": "gelu", "initializer_range": 0.02, "initializer_factor": 1.0, "attention_dropout": 0.0, "type_vocab_size": 2, "hidden_dropout_prob": 0.1, "attention_probs_dropout_prob": 0.1}, "vision_config": {"return_dict": true, "output_hidden_states": false, "output_attentions": false, "torchscript": false, "use_bfloat16": false, "pruned_heads": {}, "tie_word_embeddings": true, "is_encoder_decoder": false, "is_decoder": false, "add_cross_attention": false, "tie_encoder_decoder": false, "max_length": 
20, "min_length": 0, "do_sample": false, "early_stopping": false, "num_beams": 1, "num_beam_groups": 1, "diversity_penalty": 0.0, "temperature": 1.0, "top_k": 50, "top_p": 1.0, "repetition_penalty": 1.0, "length_penalty": 1.0, "no_repeat_ngram_size": 0, "encoder_no_repeat_ngram_size": 0, "bad_words_ids": null, "num_return_sequences": 1, "chunk_size_feed_forward": 0, "output_scores": false, "return_dict_in_generate": false, "forced_bos_token_id": null, "forced_eos_token_id": null, "remove_invalid_values": false, "architectures": null, "finetuning_task": null, "id2label": {"0": "LABEL_0", "1": "LABEL_1"}, "label2id": {"LABEL_0": 0, "LABEL_1": 1}, "tokenizer_class": null, "prefix": null, "bos_token_id": null, "pad_token_id": null, "eos_token_id": null, "sep_token_id": null, "decoder_start_token_id": null, "task_specific_params": null, "problem_type": null, "_name_or_path": "", "easynlp_version": "0.0.3", "cross_attention_hidden_size": null, "model_type": "clip_vision_model", "torch_dtype": null, "transformers_version": "4.16.0.dev0", "hidden_size": 1024, "intermediate_size": 4096, "dropout": 0.0, "num_hidden_layers": 24, "num_attention_heads": 16, "patch_size": 14, "image_size": 224, "initializer_range": 0.02, "initializer_factor": 1.0, "attention_dropout": 0.0, "layer_norm_eps": 1e-05, "hidden_act": "quick_gelu"}, "projection_dim": 512, "logit_scale_init_value": 2.6592, "initializer_factor": 1.0, "model_type": "clip"}
optimizer type: AdamW
/root/.local/lib/python3.6/site-packages/pai_easynlp-0.0.6-py3.6.egg/easynlp/core/optimizers.py:441: UserWarning: This overload of add_ is deprecated:
  add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
  add_(Tensor other, *, Number alpha) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:1005.)
  exp_avg.mul_(beta1).add_(1.0 - beta1, grad)
/home/pai/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  warnings.warn("To get the last learning rate computed by the scheduler, "
[2022-07-20 17:11:23,694 INFO] Epoch [ 2/ 3], step [100/120], lr 0.000010, 7.11 s
[2022-07-20 17:11:23,696 INFO]   loss      : 0.0004 
[2022-07-20 17:11:27,046 INFO] Saving best model to ./clip_model/pytorch_model.bin...
Training Time: 31.7812922000885, rank 0, gsteps 120
[2022-07-20 17:12:03,450 INFO] Training Time: 68.18616724014282
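At this point the fine-tuned weights have been written to the checkpoint directory named in the "Saving best model" log line above. As an optional sanity check, you can list what the run produced; the exact set of auxiliary files (config, vocab, etc.) depends on the EasyNLP version, so treat this as an illustrative sketch rather than a definitive file list:

import os
# Inspect what the training run left in the checkpoint directory ./clip_model/.
# pytorch_model.bin is the weight file mentioned in the training log above;
# other files may vary by EasyNLP version.
ckpt_dir = "./clip_model/"
for name in sorted(os.listdir(ckpt_dir)):
    size_mb = os.path.getsize(os.path.join(ckpt_dir, name)) / 1e6
    print(f"{name:<30} {size_mb:8.2f} MB")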

Model Evaluation

After training finishes, the fine-tuned model is saved to the checkpoint_dir specified at the beginning, i.e. the local path "./clip_model/". We can now evaluate the trained model. As before, we first use EasyNLP's get_application_model_for_evaluation method to build the evaluation model.

model = get_application_model_for_evaluation(app_name="clip",
                                             pretrained_model_name_or_path="./clip_model/", 
                                             user_defined_parameters=user_defined_parameters)
 Loaded weights of the model:
 [embeddings.position_ids, embeddings.word_embeddings.weight, embeddings.position_embeddings.weight, ..., encoder.layer.23.output.LayerNorm.bias, pooler.dense.weight, pooler.dense.bias].
All weights are initialized.
 Loaded weights of the model:
 [vision_model.embeddings.class_embedding, vision_model.embeddings.position_ids, vision_model.embeddings.patch_embedding.weight, ..., vision_model.encoder.layers.23.layer_norm2.bias, vision_model.post_layernorm.weight, vision_model.post_layernorm.bias].
All weights are initialized.

Next, we use EasyNLP's get_application_evaluator to initialize the evaluator, move the model to the current device, and run the evaluation.

evaluator = get_application_evaluator(app_name="clip",
                                      valid_dataset=valid_dataset,
                                      user_defined_parameters=user_defined_parameters,
                                      eval_batch_size=32)
model.to(torch.cuda.current_device())
evaluator.evaluate(model=model)
[2022-07-20 17:05:47,057 INFO] Inference time = 0.40s, [9.9234 ms / sample] 
r1_num:31 r5_num:37 r10_num:38 query_num:40
r1(%):77.5 r5(%):92.5 r10(%):95.0 mean_recall(%):88.33333333333334
[('mean_recall', 0.8833333333333334)]
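The metrics above follow directly from the hit counts in the log: Recall@K is the fraction of text queries whose matching image is ranked within the top K retrieved candidates, and mean_recall averages R@1, R@5 and R@10. A minimal sketch that reproduces the reported numbers (the variable names are ours, not EasyNLP's):

# Recompute the reported retrieval metrics from the raw hit counts above.
# r{k}_num is the number of queries whose ground-truth image appears in the top-k results.
r1_num, r5_num, r10_num, query_num = 31, 37, 38, 40
recall_at = {k: hits / query_num for k, hits in [(1, r1_num), (5, r5_num), (10, r10_num)]}
mean_recall = sum(recall_at.values()) / len(recall_at)
print({f"r{k}(%)": round(v * 100, 2) for k, v in recall_at.items()})  # {'r1(%)': 77.5, 'r5(%)': 92.5, 'r10(%)': 95.0}
print("mean_recall(%):", mean_recall * 100)                           # 88.333...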

Model Prediction

We can also use the trained model for prediction, i.e., to extract feature vectors from texts and images. We first create a predictor and use it to instantiate a PredictorManager. Taking text feature extraction as an example, we specify MUGE_MR_test_base64_part_text.tsv as the input file, write the predicted results to "text_feat.tsv", and set the output schema to "text_feat".

predictor = get_application_predictor(app_name="clip", 
                                      model_dir="./clip_model/",
                                      first_sequence="text",
                                      second_sequence="image",
                                      sequence_length=32,
                                      user_defined_parameters=user_defined_parameters)
predictor_manager = PredictorManager(predictor=predictor,
                                     input_file="MUGE_MR_test_base64_part_text.tsv",
                                     input_schema="text:str:1",
                                     output_file="text_feat.tsv",
                                     output_schema="text_feat",
                                     append_cols="text",
                                     batch_size=2)
predictor_manager.run()
exit()
 Loaded weights of the model:
 [embeddings.position_ids, embeddings.word_embeddings.weight, embeddings.position_embeddings.weight, ..., encoder.layer.23.output.LayerNorm.bias, pooler.dense.weight, pooler.dense.bias].
All weights are initialized.
 Loaded weights of the model:
 [vision_model.embeddings.class_embedding,vision_model.embeddings.position_ids,vision_model.embeddings.patch_embedding.weight,vision_model.embeddings.position_embedding.weight,vision_model.pre_layrnorm.weight,vision_model.pre_layrnorm.bias,vision_model.encoder.layers.0.self_attn.k_proj.weight,vision_model.encoder.layers.0.self_attn.k_proj.bias,vision_model.encoder.layers.0.self_attn.v_proj.weight,vision_model.encoder.layers.0.self_attn.v_proj.bias,vision_model.encoder.layers.0.self_attn.q_proj.weight,vision_model.encoder.layers.0.self_attn.q_proj.bias,vision_model.encoder.layers.0.self_attn.out_proj.weight,vision_model.encoder.layers.0.self_attn.out_proj.bias,vision_model.encoder.layers.0.layer_norm1.weight,vision_model.encoder.layers.0.layer_norm1.bias,vision_model.encoder.layers.0.mlp.fc1.weight,vision_model.encoder.layers.0.mlp.fc1.bias,vision_model.encoder.layers.0.mlp.fc2.weight,vision_model.encoder.layers.0.mlp.fc2.bias,vision_model.encoder.layers.0.layer_norm2.weight,vision_model.encoder.layers.0.layer_norm2.bias,vision_model.encoder.layers.1.self_attn.k_proj.weight,vision_model.encoder.layers.1.self_attn.k_proj.bias,vision_model.encoder.layers.1.self_attn.v_proj.weight,vision_model.encoder.layers.1.self_attn.v_proj.bias,vision_model.encoder.layers.1.self_attn.q_proj.weight,vision_model.encoder.layers.1.self_attn.q_proj.bias,vision_model.encoder.layers.1.self_attn.out_proj.weight,vision_model.encoder.layers.1.self_attn.out_proj.bias,vision_model.encoder.layers.1.layer_norm1.weight,vision_model.encoder.layers.1.layer_norm1.bias,vision_model.encoder.layers.1.mlp.fc1.weight,vision_model.encoder.layers.1.mlp.fc1.bias,vision_model.encoder.layers.1.mlp.fc2.weight,vision_model.encoder.layers.1.mlp.fc2.bias,vision_model.encoder.layers.1.layer_norm2.weight,vision_model.encoder.layers.1.layer_norm2.bias,vision_model.encoder.layers.2.self_attn.k_proj.weight,vision_model.encoder.layers.2.self_attn.k_proj.bias,vision_model.encoder.layers.2.self_attn.v_proj.weight,vision_model.encoder.layers.2.self_attn.v_proj.bias,vision_model.encoder.layers.2.self_attn.q_proj.weight,vision_model.encoder.layers.2.self_attn.q_proj.bias,vision_model.encoder.layers.2.self_attn.out_proj.weight,vision_model.encoder.layers.2.self_attn.out_proj.bias,vision_model.encoder.layers.2.layer_norm1.weight,vision_model.encoder.layers.2.layer_norm1.bias,vision_model.encoder.layers.2.mlp.fc1.weight,vision_model.encoder.layers.2.mlp.fc1.bias,vision_model.encoder.layers.2.mlp.fc2.weight,vision_model.encoder.layers.2.mlp.fc2.bias,vision_model.encoder.layers.2.layer_norm2.weight,vision_model.encoder.layers.2.layer_norm2.bias,vision_model.encoder.layers.3.self_attn.k_proj.weight,vision_model.encoder.layers.3.self_attn.k_proj.bias,vision_model.encoder.layers.3.self_attn.v_proj.weight,vision_model.encoder.layers.3.self_attn.v_proj.bias,vision_model.encoder.layers.3.self_attn.q_proj.weight,vision_model.encoder.layers.3.self_attn.q_proj.bias,vision_model.encoder.layers.3.self_attn.out_proj.weight,vision_model.encoder.layers.3.self_attn.out_proj.bias,vision_model.encoder.layers.3.layer_norm1.weight,vision_model.encoder.layers.3.layer_norm1.bias,vision_model.encoder.layers.3.mlp.fc1.weight,vision_model.encoder.layers.3.mlp.fc1.bias,vision_model.encoder.layers.3.mlp.fc2.weight,vision_model.encoder.layers.3.mlp.fc2.bias,vision_model.encoder.layers.3.layer_norm2.weight,vision_model.encoder.layers.3.layer_norm2.bias,vision_model.encoder.layers.4.self_attn.k_proj.weight,vision_model.encoder.layers.4.self_attn.k_proj.bias,vision_model.encoder.laye
rs.4.self_attn.v_proj.weight,vision_model.encoder.layers.4.self_attn.v_proj.bias,vision_model.encoder.layers.4.self_attn.q_proj.weight,vision_model.encoder.layers.4.self_attn.q_proj.bias,vision_model.encoder.layers.4.self_attn.out_proj.weight,vision_model.encoder.layers.4.self_attn.out_proj.bias,vision_model.encoder.layers.4.layer_norm1.weight,vision_model.encoder.layers.4.layer_norm1.bias,vision_model.encoder.layers.4.mlp.fc1.weight,vision_model.encoder.layers.4.mlp.fc1.bias,vision_model.encoder.layers.4.mlp.fc2.weight,vision_model.encoder.layers.4.mlp.fc2.bias,vision_model.encoder.layers.4.layer_norm2.weight,vision_model.encoder.layers.4.layer_norm2.bias,vision_model.encoder.layers.5.self_attn.k_proj.weight,vision_model.encoder.layers.5.self_attn.k_proj.bias,vision_model.encoder.layers.5.self_attn.v_proj.weight,vision_model.encoder.layers.5.self_attn.v_proj.bias,vision_model.encoder.layers.5.self_attn.q_proj.weight,vision_model.encoder.layers.5.self_attn.q_proj.bias,vision_model.encoder.layers.5.self_attn.out_proj.weight,vision_model.encoder.layers.5.self_attn.out_proj.bias,vision_model.encoder.layers.5.layer_norm1.weight,vision_model.encoder.layers.5.layer_norm1.bias,vision_model.encoder.layers.5.mlp.fc1.weight,vision_model.encoder.layers.5.mlp.fc1.bias,vision_model.encoder.layers.5.mlp.fc2.weight,vision_model.encoder.layers.5.mlp.fc2.bias,vision_model.encoder.layers.5.layer_norm2.weight,vision_model.encoder.layers.5.layer_norm2.bias,vision_model.encoder.layers.6.self_attn.k_proj.weight,vision_model.encoder.layers.6.self_attn.k_proj.bias,vision_model.encoder.layers.6.self_attn.v_proj.weight,vision_model.encoder.layers.6.self_attn.v_proj.bias,vision_model.encoder.layers.6.self_attn.q_proj.weight,vision_model.encoder.layers.6.self_attn.q_proj.bias,vision_model.encoder.layers.6.self_attn.out_proj.weight,vision_model.encoder.layers.6.self_attn.out_proj.bias,vision_model.encoder.layers.6.layer_norm1.weight,vision_model.encoder.layers.6.layer_norm1.bias,vision_model.encoder.layers.6.mlp.fc1.weight,vision_model.encoder.layers.6.mlp.fc1.bias,vision_model.encoder.layers.6.mlp.fc2.weight,vision_model.encoder.layers.6.mlp.fc2.bias,vision_model.encoder.layers.6.layer_norm2.weight,vision_model.encoder.layers.6.layer_norm2.bias,vision_model.encoder.layers.7.self_attn.k_proj.weight,vision_model.encoder.layers.7.self_attn.k_proj.bias,vision_model.encoder.layers.7.self_attn.v_proj.weight,vision_model.encoder.layers.7.self_attn.v_proj.bias,vision_model.encoder.layers.7.self_attn.q_proj.weight,vision_model.encoder.layers.7.self_attn.q_proj.bias,vision_model.encoder.layers.7.self_attn.out_proj.weight,vision_model.encoder.layers.7.self_attn.out_proj.bias,vision_model.encoder.layers.7.layer_norm1.weight,vision_model.encoder.layers.7.layer_norm1.bias,vision_model.encoder.layers.7.mlp.fc1.weight,vision_model.encoder.layers.7.mlp.fc1.bias,vision_model.encoder.layers.7.mlp.fc2.weight,vision_model.encoder.layers.7.mlp.fc2.bias,vision_model.encoder.layers.7.layer_norm2.weight,vision_model.encoder.layers.7.layer_norm2.bias,vision_model.encoder.layers.8.self_attn.k_proj.weight,vision_model.encoder.layers.8.self_attn.k_proj.bias,vision_model.encoder.layers.8.self_attn.v_proj.weight,vision_model.encoder.layers.8.self_attn.v_proj.bias,vision_model.encoder.layers.8.self_attn.q_proj.weight,vision_model.encoder.layers.8.self_attn.q_proj.bias,vision_model.encoder.layers.8.self_attn.out_proj.weight,vision_model.encoder.layers.8.self_attn.out_proj.bias,vision_model.encoder.layers.8.layer_norm1.weight,vision_model.encoder.laye
rs.8.layer_norm1.bias,vision_model.encoder.layers.8.mlp.fc1.weight,vision_model.encoder.layers.8.mlp.fc1.bias,vision_model.encoder.layers.8.mlp.fc2.weight,vision_model.encoder.layers.8.mlp.fc2.bias,vision_model.encoder.layers.8.layer_norm2.weight,vision_model.encoder.layers.8.layer_norm2.bias,vision_model.encoder.layers.9.self_attn.k_proj.weight,vision_model.encoder.layers.9.self_attn.k_proj.bias,vision_model.encoder.layers.9.self_attn.v_proj.weight,vision_model.encoder.layers.9.self_attn.v_proj.bias,vision_model.encoder.layers.9.self_attn.q_proj.weight,vision_model.encoder.layers.9.self_attn.q_proj.bias,vision_model.encoder.layers.9.self_attn.out_proj.weight,vision_model.encoder.layers.9.self_attn.out_proj.bias,vision_model.encoder.layers.9.layer_norm1.weight,vision_model.encoder.layers.9.layer_norm1.bias,vision_model.encoder.layers.9.mlp.fc1.weight,vision_model.encoder.layers.9.mlp.fc1.bias,vision_model.encoder.layers.9.mlp.fc2.weight,vision_model.encoder.layers.9.mlp.fc2.bias,vision_model.encoder.layers.9.layer_norm2.weight,vision_model.encoder.layers.9.layer_norm2.bias,vision_model.encoder.layers.10.self_attn.k_proj.weight,vision_model.encoder.layers.10.self_attn.k_proj.bias,vision_model.encoder.layers.10.self_attn.v_proj.weight,vision_model.encoder.layers.10.self_attn.v_proj.bias,vision_model.encoder.layers.10.self_attn.q_proj.weight,vision_model.encoder.layers.10.self_attn.q_proj.bias,vision_model.encoder.layers.10.self_attn.out_proj.weight,vision_model.encoder.layers.10.self_attn.out_proj.bias,vision_model.encoder.layers.10.layer_norm1.weight,vision_model.encoder.layers.10.layer_norm1.bias,vision_model.encoder.layers.10.mlp.fc1.weight,vision_model.encoder.layers.10.mlp.fc1.bias,vision_model.encoder.layers.10.mlp.fc2.weight,vision_model.encoder.layers.10.mlp.fc2.bias,vision_model.encoder.layers.10.layer_norm2.weight,vision_model.encoder.layers.10.layer_norm2.bias,vision_model.encoder.layers.11.self_attn.k_proj.weight,vision_model.encoder.layers.11.self_attn.k_proj.bias,vision_model.encoder.layers.11.self_attn.v_proj.weight,vision_model.encoder.layers.11.self_attn.v_proj.bias,vision_model.encoder.layers.11.self_attn.q_proj.weight,vision_model.encoder.layers.11.self_attn.q_proj.bias,vision_model.encoder.layers.11.self_attn.out_proj.weight,vision_model.encoder.layers.11.self_attn.out_proj.bias,vision_model.encoder.layers.11.layer_norm1.weight,vision_model.encoder.layers.11.layer_norm1.bias,vision_model.encoder.layers.11.mlp.fc1.weight,vision_model.encoder.layers.11.mlp.fc1.bias,vision_model.encoder.layers.11.mlp.fc2.weight,vision_model.encoder.layers.11.mlp.fc2.bias,vision_model.encoder.layers.11.layer_norm2.weight,vision_model.encoder.layers.11.layer_norm2.bias,vision_model.encoder.layers.12.self_attn.k_proj.weight,vision_model.encoder.layers.12.self_attn.k_proj.bias,vision_model.encoder.layers.12.self_attn.v_proj.weight,vision_model.encoder.layers.12.self_attn.v_proj.bias,vision_model.encoder.layers.12.self_attn.q_proj.weight,vision_model.encoder.layers.12.self_attn.q_proj.bias,vision_model.encoder.layers.12.self_attn.out_proj.weight,vision_model.encoder.layers.12.self_attn.out_proj.bias,vision_model.encoder.layers.12.layer_norm1.weight,vision_model.encoder.layers.12.layer_norm1.bias,vision_model.encoder.layers.12.mlp.fc1.weight,vision_model.encoder.layers.12.mlp.fc1.bias,vision_model.encoder.layers.12.mlp.fc2.weight,vision_model.encoder.layers.12.mlp.fc2.bias,vision_model.encoder.layers.12.layer_norm2.weight,vision_model.encoder.layers.12.layer_norm2.bias,vision_model.encoder.layers.1
3.self_attn.k_proj.weight,vision_model.encoder.layers.13.self_attn.k_proj.bias,vision_model.encoder.layers.13.self_attn.v_proj.weight,vision_model.encoder.layers.13.self_attn.v_proj.bias,vision_model.encoder.layers.13.self_attn.q_proj.weight,vision_model.encoder.layers.13.self_attn.q_proj.bias,vision_model.encoder.layers.13.self_attn.out_proj.weight,vision_model.encoder.layers.13.self_attn.out_proj.bias,vision_model.encoder.layers.13.layer_norm1.weight,vision_model.encoder.layers.13.layer_norm1.bias,vision_model.encoder.layers.13.mlp.fc1.weight,vision_model.encoder.layers.13.mlp.fc1.bias,vision_model.encoder.layers.13.mlp.fc2.weight,vision_model.encoder.layers.13.mlp.fc2.bias,vision_model.encoder.layers.13.layer_norm2.weight,vision_model.encoder.layers.13.layer_norm2.bias,vision_model.encoder.layers.14.self_attn.k_proj.weight,vision_model.encoder.layers.14.self_attn.k_proj.bias,vision_model.encoder.layers.14.self_attn.v_proj.weight,vision_model.encoder.layers.14.self_attn.v_proj.bias,vision_model.encoder.layers.14.self_attn.q_proj.weight,vision_model.encoder.layers.14.self_attn.q_proj.bias,vision_model.encoder.layers.14.self_attn.out_proj.weight,vision_model.encoder.layers.14.self_attn.out_proj.bias,vision_model.encoder.layers.14.layer_norm1.weight,vision_model.encoder.layers.14.layer_norm1.bias,vision_model.encoder.layers.14.mlp.fc1.weight,vision_model.encoder.layers.14.mlp.fc1.bias,vision_model.encoder.layers.14.mlp.fc2.weight,vision_model.encoder.layers.14.mlp.fc2.bias,vision_model.encoder.layers.14.layer_norm2.weight,vision_model.encoder.layers.14.layer_norm2.bias,vision_model.encoder.layers.15.self_attn.k_proj.weight,vision_model.encoder.layers.15.self_attn.k_proj.bias,vision_model.encoder.layers.15.self_attn.v_proj.weight,vision_model.encoder.layers.15.self_attn.v_proj.bias,vision_model.encoder.layers.15.self_attn.q_proj.weight,vision_model.encoder.layers.15.self_attn.q_proj.bias,vision_model.encoder.layers.15.self_attn.out_proj.weight,vision_model.encoder.layers.15.self_attn.out_proj.bias,vision_model.encoder.layers.15.layer_norm1.weight,vision_model.encoder.layers.15.layer_norm1.bias,vision_model.encoder.layers.15.mlp.fc1.weight,vision_model.encoder.layers.15.mlp.fc1.bias,vision_model.encoder.layers.15.mlp.fc2.weight,vision_model.encoder.layers.15.mlp.fc2.bias,vision_model.encoder.layers.15.layer_norm2.weight,vision_model.encoder.layers.15.layer_norm2.bias,vision_model.encoder.layers.16.self_attn.k_proj.weight,vision_model.encoder.layers.16.self_attn.k_proj.bias,vision_model.encoder.layers.16.self_attn.v_proj.weight,vision_model.encoder.layers.16.self_attn.v_proj.bias,vision_model.encoder.layers.16.self_attn.q_proj.weight,vision_model.encoder.layers.16.self_attn.q_proj.bias,vision_model.encoder.layers.16.self_attn.out_proj.weight,vision_model.encoder.layers.16.self_attn.out_proj.bias,vision_model.encoder.layers.16.layer_norm1.weight,vision_model.encoder.layers.16.layer_norm1.bias,vision_model.encoder.layers.16.mlp.fc1.weight,vision_model.encoder.layers.16.mlp.fc1.bias,vision_model.encoder.layers.16.mlp.fc2.weight,vision_model.encoder.layers.16.mlp.fc2.bias,vision_model.encoder.layers.16.layer_norm2.weight,vision_model.encoder.layers.16.layer_norm2.bias,vision_model.encoder.layers.17.self_attn.k_proj.weight,vision_model.encoder.layers.17.self_attn.k_proj.bias,vision_model.encoder.layers.17.self_attn.v_proj.weight,vision_model.encoder.layers.17.self_attn.v_proj.bias,vision_model.encoder.layers.17.self_attn.q_proj.weight,vision_model.encoder.layers.17.self_attn.q_proj.bias,vision_model
.encoder.layers.17.self_attn.out_proj.weight,vision_model.encoder.layers.17.self_attn.out_proj.bias,vision_model.encoder.layers.17.layer_norm1.weight,vision_model.encoder.layers.17.layer_norm1.bias,vision_model.encoder.layers.17.mlp.fc1.weight,vision_model.encoder.layers.17.mlp.fc1.bias,vision_model.encoder.layers.17.mlp.fc2.weight,vision_model.encoder.layers.17.mlp.fc2.bias,vision_model.encoder.layers.17.layer_norm2.weight,vision_model.encoder.layers.17.layer_norm2.bias,vision_model.encoder.layers.18.self_attn.k_proj.weight,vision_model.encoder.layers.18.self_attn.k_proj.bias,vision_model.encoder.layers.18.self_attn.v_proj.weight,vision_model.encoder.layers.18.self_attn.v_proj.bias,vision_model.encoder.layers.18.self_attn.q_proj.weight,vision_model.encoder.layers.18.self_attn.q_proj.bias,vision_model.encoder.layers.18.self_attn.out_proj.weight,vision_model.encoder.layers.18.self_attn.out_proj.bias,vision_model.encoder.layers.18.layer_norm1.weight,vision_model.encoder.layers.18.layer_norm1.bias,vision_model.encoder.layers.18.mlp.fc1.weight,vision_model.encoder.layers.18.mlp.fc1.bias,vision_model.encoder.layers.18.mlp.fc2.weight,vision_model.encoder.layers.18.mlp.fc2.bias,vision_model.encoder.layers.18.layer_norm2.weight,vision_model.encoder.layers.18.layer_norm2.bias,vision_model.encoder.layers.19.self_attn.k_proj.weight,vision_model.encoder.layers.19.self_attn.k_proj.bias,vision_model.encoder.layers.19.self_attn.v_proj.weight,vision_model.encoder.layers.19.self_attn.v_proj.bias,vision_model.encoder.layers.19.self_attn.q_proj.weight,vision_model.encoder.layers.19.self_attn.q_proj.bias,vision_model.encoder.layers.19.self_attn.out_proj.weight,vision_model.encoder.layers.19.self_attn.out_proj.bias,vision_model.encoder.layers.19.layer_norm1.weight,vision_model.encoder.layers.19.layer_norm1.bias,vision_model.encoder.layers.19.mlp.fc1.weight,vision_model.encoder.layers.19.mlp.fc1.bias,vision_model.encoder.layers.19.mlp.fc2.weight,vision_model.encoder.layers.19.mlp.fc2.bias,vision_model.encoder.layers.19.layer_norm2.weight,vision_model.encoder.layers.19.layer_norm2.bias,vision_model.encoder.layers.20.self_attn.k_proj.weight,vision_model.encoder.layers.20.self_attn.k_proj.bias,vision_model.encoder.layers.20.self_attn.v_proj.weight,vision_model.encoder.layers.20.self_attn.v_proj.bias,vision_model.encoder.layers.20.self_attn.q_proj.weight,vision_model.encoder.layers.20.self_attn.q_proj.bias,vision_model.encoder.layers.20.self_attn.out_proj.weight,vision_model.encoder.layers.20.self_attn.out_proj.bias,vision_model.encoder.layers.20.layer_norm1.weight,vision_model.encoder.layers.20.layer_norm1.bias,vision_model.encoder.layers.20.mlp.fc1.weight,vision_model.encoder.layers.20.mlp.fc1.bias,vision_model.encoder.layers.20.mlp.fc2.weight,vision_model.encoder.layers.20.mlp.fc2.bias,vision_model.encoder.layers.20.layer_norm2.weight,vision_model.encoder.layers.20.layer_norm2.bias,vision_model.encoder.layers.21.self_attn.k_proj.weight,vision_model.encoder.layers.21.self_attn.k_proj.bias,vision_model.encoder.layers.21.self_attn.v_proj.weight,vision_model.encoder.layers.21.self_attn.v_proj.bias,vision_model.encoder.layers.21.self_attn.q_proj.weight,vision_model.encoder.layers.21.self_attn.q_proj.bias,vision_model.encoder.layers.21.self_attn.out_proj.weight,vision_model.encoder.layers.21.self_attn.out_proj.bias,vision_model.encoder.layers.21.layer_norm1.weight,vision_model.encoder.layers.21.layer_norm1.bias,vision_model.encoder.layers.21.mlp.fc1.weight,vision_model.encoder.layers.21.mlp.fc1.bias,vision_model.encode
r.layers.21.mlp.fc2.weight,vision_model.encoder.layers.21.mlp.fc2.bias,vision_model.encoder.layers.21.layer_norm2.weight,vision_model.encoder.layers.21.layer_norm2.bias,vision_model.encoder.layers.22.self_attn.k_proj.weight,vision_model.encoder.layers.22.self_attn.k_proj.bias,vision_model.encoder.layers.22.self_attn.v_proj.weight,vision_model.encoder.layers.22.self_attn.v_proj.bias,vision_model.encoder.layers.22.self_attn.q_proj.weight,vision_model.encoder.layers.22.self_attn.q_proj.bias,vision_model.encoder.layers.22.self_attn.out_proj.weight,vision_model.encoder.layers.22.self_attn.out_proj.bias,vision_model.encoder.layers.22.layer_norm1.weight,vision_model.encoder.layers.22.layer_norm1.bias,vision_model.encoder.layers.22.mlp.fc1.weight,vision_model.encoder.layers.22.mlp.fc1.bias,vision_model.encoder.layers.22.mlp.fc2.weight,vision_model.encoder.layers.22.mlp.fc2.bias,vision_model.encoder.layers.22.layer_norm2.weight,vision_model.encoder.layers.22.layer_norm2.bias,vision_model.encoder.layers.23.self_attn.k_proj.weight,vision_model.encoder.layers.23.self_attn.k_proj.bias,vision_model.encoder.layers.23.self_attn.v_proj.weight,vision_model.encoder.layers.23.self_attn.v_proj.bias,vision_model.encoder.layers.23.self_attn.q_proj.weight,vision_model.encoder.layers.23.self_attn.q_proj.bias,vision_model.encoder.layers.23.self_attn.out_proj.weight,vision_model.encoder.layers.23.self_attn.out_proj.bias,vision_model.encoder.layers.23.layer_norm1.weight,vision_model.encoder.layers.23.layer_norm1.bias,vision_model.encoder.layers.23.mlp.fc1.weight,vision_model.encoder.layers.23.mlp.fc1.bias,vision_model.encoder.layers.23.mlp.fc2.weight,vision_model.encoder.layers.23.mlp.fc2.bias,vision_model.encoder.layers.23.layer_norm2.weight,vision_model.encoder.layers.23.layer_norm2.bias,vision_model.post_layernorm.weight,vision_model.post_layernorm.bias].
All weights are initialized.
[2022-07-20 17:07:48,396 INFO] Using SimplePredict to predict...
5it [00:00,  5.19it/s]

One-step execution

It is worth noting that all of the training/evaluation/prediction code above has been integrated into EasyNLP/examples/appzoo_tutorials/text_vision/main.py. In addition, several ready-to-run scripts are provided. You can either run main.py with command-line arguments or execute the bash scripts directly to complete all of the training/evaluation/prediction steps above in one go.

One-step execution with main.py

By running main.py with the arguments below, you can train, evaluate, and predict with the model directly.

The training command is shown below. Among the arguments, tables specifies the paths of the training and validation tsv files, input_schema describes the tsv data format, and first_sequence and second_sequence indicate which fields of input_schema are used as the first and second columns. The model is saved under checkpoint_dir; learning_rate, epoch_num, random_seed, save_checkpoint_steps, sequence_length, train_batch_size, and so on are training hyperparameters. In this example, the pretrained model is clip_chinese_roberta_large_with_vit_large. A short sketch after the command shows how to inspect the tsv format that input_schema describes.

! python main.py \
    --mode train \
    --worker_gpu=1 \
    --tables=MUGE_MR_train_base64_part.tsv,MUGE_MR_valid_base64_part.tsv \
    --input_schema=text:str:1,image:str:1 \
    --first_sequence=text \
    --second_sequence=image \
    --checkpoint_dir=./clip_model/ \
    --learning_rate=1e-4  \
    --epoch_num=1  \
    --random_seed=42 \
    --save_checkpoint_steps=200 \
    --sequence_length=32 \
    --train_batch_size=32 \
    --app_name=clip \
    --user_defined_parameters='pretrain_model_name_or_path=clip_chinese_roberta_large_with_vit_large fix_vision=True mode=finetune'
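For reference, here is a minimal sketch (not part of EasyNLP) for inspecting one row of the training tsv. It assumes each line is "text<TAB>base64-encoded image", matching --input_schema=text:str:1,image:str:1 above, and that Pillow is available in the DSW image.

# A minimal sketch (not part of EasyNLP) for inspecting one row of the training tsv.
# Assumption: each line is "text<TAB>base64-encoded image", matching
# --input_schema=text:str:1,image:str:1 above; Pillow is assumed to be installed.
import base64
import io

from PIL import Image

with open("MUGE_MR_train_base64_part.tsv", encoding="utf-8") as f:
    text, image_b64 = f.readline().rstrip("\n").split("\t")

print("text query:", text)

# Decode the base64 column back into an image to verify the format.
image = Image.open(io.BytesIO(base64.b64decode(image_b64)))
print("image size:", image.size)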

The evaluation command is shown below; the arguments have the same meaning as in training.

! python main.py \
    --mode evaluate \
    --worker_gpu=1 \
    --tables=MUGE_MR_valid_base64_part.tsv \
    --input_schema=text:str:1,image:str:1 \
    --first_sequence=text \
    --second_sequence=image \
    --checkpoint_dir=./clip_model/ \
    --sequence_length=32 \
    --micro_batch_size=32 \
    --app_name=clip \
    --user_defined_parameters=''

The prediction command, i.e., the feature-extraction command, is shown below. The arguments are again consistent with the above. Taking text feature extraction as an example, the input is MUGE_MR_test_base64_part_text.tsv and the extracted vectors can be inspected in text_feat.tsv. A retrieval sketch after the command shows one way these features might be used.

! python main.py \
    --mode predict \
    --worker_gpu=1 \
    --tables=MUGE_MR_test_base64_part_text.tsv \
    --outputs=text_feat.tsv \
    --input_schema=text:str:1 \
    --output_schema=text_feat \
    --append_cols=text \
    --first_sequence=text \
    --second_sequence=image \
    --checkpoint_path=./clip_model/ \
    --micro_batch_size=32 \
    --sequence_length=32 \
    --app_name=clip \
    --user_defined_parameters=''
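After extracting image features in the same way (e.g., from MUGE_MR_test_base64_part_image.tsv into a hypothetical image_feat.tsv), the two feature files can be combined for text-to-image retrieval. The following is a minimal sketch under the assumption that each output row stores the feature vector as comma-separated floats in the first column, with the appended columns after it; adjust the parsing if your output layout differs.

# A minimal retrieval sketch (not part of EasyNLP) over the extracted features.
# Assumptions: text_feat.tsv and a similarly produced image_feat.tsv (hypothetical
# file name) store one example per line, with the feature vector serialized as
# comma-separated floats in the first column.
import numpy as np

def load_features(path):
    feats, extras = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            feats.append(np.array([float(x) for x in cols[0].split(",")], dtype=np.float32))
            extras.append(cols[1:])  # appended columns, e.g. the original text
    return np.stack(feats), extras

text_feats, texts = load_features("text_feat.tsv")
image_feats, image_meta = load_features("image_feat.tsv")

# L2-normalize, then score every text query against every image with cosine similarity.
text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)
image_feats /= np.linalg.norm(image_feats, axis=1, keepdims=True)
scores = text_feats @ image_feats.T

top5 = np.argsort(-scores, axis=1)[:, :5]  # indices of the top-5 images per text query
print(top5[0])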

One-step execution with bash scripts

Several ready-to-run bash scripts are packaged under the EasyNLP/examples/appzoo_tutorials/text_vision/ folder, so you can also complete model training/evaluation/prediction in one step from the command line. Take run_train_eval_predict_user_defined_local.sh as an example: the script takes two arguments. The first is the index of the GPU to run on, usually 0; the second selects training, evaluation, or prediction.

Model training:

! bash run_train_eval_predict_user_defined_local.sh 0 train

Model evaluation:

! bash run_train_eval_predict_user_defined_local.sh 0 evaluate

Model prediction:

! bash run_train_eval_predict_user_defined_local.sh 0 predict