Direct Use
Please open "How to Play with Hugging Face in DSW" and click "Open in DSW" in the upper-right corner.
Introduction to Hugging Face
Hugging Face (HF for short; see its official website) began as a large open-source community focused on NLP. Its open-source pretrained-model library Transformers has been downloaded more than a million times and has over 64,000 stars on GitHub. Offering a large number of state-of-the-art pretrained models is HF's signature: it now hosts tens of thousands of models covering NLP, CV, Audio, Multimodal, and other domains, which is a great convenience for model developers, researchers, and algorithm engineers.
HF's main features include:
- A large collection of pretrained models
- Models that support inference and fine-tuning out of the box
- A concise Python SDK
- A full-featured Model Hub built on git and git-lfs
- Support for TensorFlow 2.0+, PyTorch 1.1.0+, and Flax
With HF, anyone can quickly obtain the industry's best-known pretrained models for their own research or production. The rest of this article shows how to access HF through the Python SDK.
1. Environment Setup
HF's functionality is mainly delivered through the following Python packages:
- transformers
- datasets
- tokenizers
- huggingface_hub
All of them can be installed with pip and require Python 3.6+. When using transformers, mind the dependency on TensorFlow or PyTorch: each model card on HF states which deep learning frameworks the model supports. This article assumes PyTorch is already installed; you can also pick an image with PyTorch preinstalled from the DSW image list, or install PyTorch explicitly in your environment.
!pip install transformers datasets tokenizers huggingface_hub sentencepiece
Verify that the installation succeeded:
from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))
Result: [{'label': 'POSITIVE', 'score': 0.9998704195022583}]
2. Inference with HF pipelines
HF organizes models by task; for each task category, HF provides a standard way to invoke it as well as a default model. Inference is wrapped in a pipeline object, because an inference task typically involves three steps: tokenize the input and convert it to IDs, call the model's prediction function, and convert the resulting IDs back into text from the vocabulary.
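To make these three steps concrete, here is a minimal sketch of roughly what a pipeline does internally. It uses the Helsinki-NLP/opus-mt-zh-en translation model (introduced later in Section 3), because a translation task maps one-to-one onto the three steps; the real pipeline implementation additionally handles batching, devices, and task-specific post-processing.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-zh-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Step 1: tokenize the input text and convert it to token IDs
inputs = tokenizer("你好,今天天气很好", return_tensors="pt")
# Step 2: call the model's prediction (generation) function
output_ids = model.generate(**inputs)
# Step 3: convert the output IDs back into text from the vocabulary
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))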
2.1 English sentiment analysis
sentiment-analysis is a sentiment analysis task: given a piece of text, decide whether it expresses a positive or a negative opinion. HF uses its default model to perform this task.
classifier = pipeline("sentiment-analysis")
results = classifier(["PAI is a wonderful tool for AI development", "It's a rainy day."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
Result:
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.9964
2.2 Chinese question answering (extractive QA)
The pipeline constructor can also take the name of a model from the HF hub to perform a specific task. Filtering on HF with language=zh and task=question-answering, the top-ranked model is "uer/roberta-base-chinese-extractive-qa", which we will use for the Q&A task: given a passage of text (called the context) and a question, return the answer. The answer here is simply a span extracted from the context, so the model only needs to return start and end indices that mark the answer, hence the name extractive question answering. The model is based on chinese_roberta_L-12_H-768 and was further fine-tuned on three Chinese corpora: the 2nd national "Military Intelligence Machine Reading" challenge, Baidu's Chinese question answering dataset WebQA, and the public dataset of the 2nd "iFLYTEK Cup" Chinese Machine Reading Comprehension evaluation (CMRC 2018).
from huggingface_hub import list_models, ModelFilter

# List all question-answering models that support Chinese
models = list_models(filter=ModelFilter(task="question-answering", language="zh"))
models[0]
ModelInfo: {
    modelId: uer/roberta-base-chinese-extractive-qa
    sha: d5e37a8228fa9d396ff4b093c21e8f0082ff11e1
    lastModified: 2022-02-20T07:50:56.000Z
    tags: ['pytorch', 'tf', 'jax', 'bert', 'question-answering', 'zh', 'transformers', 'autotrain_compatible', 'infinity_compatible']
    pipeline_tag: question-answering
    siblings: [ModelFile(rfilename='.gitattributes'), ModelFile(rfilename='README.md'), ModelFile(rfilename='config.json'), ModelFile(rfilename='flax_model.msgpack'), ModelFile(rfilename='pytorch_model.bin'), ModelFile(rfilename='special_tokens_map.json'), ModelFile(rfilename='tf_model.h5'), ModelFile(rfilename='tokenizer_config.json'), ModelFile(rfilename='vocab.txt')]
    config: None
    id: uer/roberta-base-chinese-extractive-qa
    private: False
    author: uer
    downloads: 4779
    library_name: transformers
    likes: 11
}
The result matches what is shown on the HF website:
Fig.1 - hugging face
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Use AutoModelFor<TASK> to select the model explicitly
model = AutoModelForQuestionAnswering.from_pretrained('uer/roberta-base-chinese-extractive-qa')
# NLP models generally need a tokenizer for word segmentation; the model provider ships a matching one
tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-chinese-extractive-qa')
QA = pipeline('question-answering', model=model, tokenizer=tokenizer)
QA_input = {
    'question': "著名诗歌《假如生活欺骗了你》的作者是",
    'context': "普希金从那里学习人民的语言,吸取了许多有益的养料,这一切对普希金后来的创作产生了很大的影响。"
               "这两年里,普希金创作了不少优秀的作品,如《囚徒》、《致大海》、《致凯恩》和《假如生活欺骗了你》等几十首抒情诗,"
               "叙事诗《努林伯爵》,历史剧《鲍里斯·戈都诺夫》,以及《叶甫盖尼·奥涅金》前六章。"
}
QA(QA_input)
{'score': 0.9766426086425781, 'start': 0, 'end': 3, 'answer': '普希金'}
QA_input = {
    'question': "中国的首都是",
    'context': "北京是一个古老的城市,从1949年起成为新中国的首都。在抗日战争时期,重庆曾经成为陪都。"
}
QA(QA_input)
{'score': 0.0009129824466072023, 'start': 0, 'end': 2, 'answer': '北京'}
3. Fine-tuning a pretrained model
Fine-tuning (or transfer learning) is a popular approach, especially in NLP: a base model is trained on a large corpus, and is then fine-tuned further on top of that base with your own business data. HF offers three ways to run the fine-tuning process (official link):
- Fine-tune a pretrained model with 🤗 Transformers Trainer.
- Fine-tune a pretrained model in TensorFlow with Keras.
- Fine-tune a pretrained model in native PyTorch.
Fine-tuning continues training on top of a pretrained model's weights. There are two situations:
- The pretrained model can already solve the problem at hand, but its weights need further training to adapt to new training samples.
  A typical example is a general-purpose pretrained English translation model trained on a large general corpus: it may not perform well in a specialized domain, but if we have a corpus for that domain we can train it further and improve its quality there.
- The pretrained model's architecture cannot solve the current problem directly, so we reuse its main network body and add new neural network layers for the new task. In this case, fine-tuning trains the weights of the original model's main body together with the weights of the newly added layers.
For example, if we use a plain BERT language model for question answering, HF warns that some of the pretrained weights are not used and that some weights are not initialized. This is because the pretrained model's architecture is Embedding -> Transformer Encoder -> Classification, whereas question answering needs Embedding -> Transformer Encoder -> QA. The classification layer only needs to output 2 logits for binary classification, while the QA head must produce the start and end positions of the answer (concretely, a start score and an end score for every token).
3.1 Inspect the model structure and understand fine-tuning
Let's try to use a BERT model for question answering. The HF warning tells us that the classification layer of the original pretrained network is discarded and a new, uninitialized QA layer is added. This means the model must be fine-tuned before it can be used.
We can also inspect the model's detailed network structure:
- The bottom layer is the embedding layer, with a vocabulary of 30,522 tokens; each token maps to a 768-dimensional embedding vector.
- Next comes a Transformer made up of 6 TransformerBlocks; each TransformerBlock consists of MultiHeadSelfAttention, LayerNorm, and FFN, and outputs a 768-dimensional vector.
- A new layer, qa_outputs, is added on top; it maps a 768-dimensional vector to 2 numbers, which serve as the start and end scores for the answer span.
AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertForQuestionAnswering: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (1)-(5): TransformerBlock (identical in structure to block (0); the remaining five blocks are omitted here for brevity)
      )
    )
  )
  (qa_outputs): Linear(in_features=768, out_features=2, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
3.2 Fine-tuning step 1: load a pretrained model
We pick the top-ranked Chinese-to-English translation model for fine-tuning: Helsinki-NLP/opus-mt-zh-en (link). We will fine-tune it further with our own corpus, i.e., keep the pretrained architecture and only continue training its weights.
Let's load the model and see how well the pretrained model performs:
import logging
logging.disable(logging.WARN)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-zh-en"
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
translator = pipeline("translation", model=model, tokenizer=tokenizer)
sequences = [
    "你好,今天天气很好",
    "深度学习是一种新的方法",
    "数学的重要性不言而喻",
    "不明觉厉",  # Internet slang, short for "虽不明,但觉厉": "I have no idea what you are saying, but it sounds impressive."
]
results = translator(sequences)
print(results)
for source, target in zip(sequences, results):
    print(source, "===>", target["translation_text"])
[{'translation_text': "Hello. It's a nice day."}, {'translation_text': 'Deep learning is a new approach.'}, {'translation_text': 'The importance of mathematics speaks for itself.'}, {'translation_text': "I don't know what I'm talking about."}]
你好,今天天气很好 ===> Hello. It's a nice day.
深度学习是一种新的方法 ===> Deep learning is a new approach.
数学的重要性不言而喻 ===> The importance of mathematics speaks for itself.
不明觉厉 ===> I don't know what I'm talking about.
The current translation quality is quite good, and the model even adds the punctuation for us 😄! The rendering of "不言而喻" ("speaks for itself") is especially idiomatic. However, the internet slang "不明觉厉" is not handled well; the fine-tuning task below teaches the model the English sentence we want for "不明觉厉".
3.3 Fine-tuning step 2: prepare the training data
If the dataset we need already exists on the HF Hub, we can load it directly with the datasets library; we can also push our own data to the HF Hub first and then load it. If we prefer not to push to HF, the training data can be loaded from local files instead (see the sketch after the next code block). For this demo, we build a dataset from in-memory data (reference link); it contains a single sample, the English translation of "不明觉厉".
from datasets import Dataset

source_sentences = ["不明觉厉"]
target_sentences = ["It's not clear what you're talking about, but it looks like it's pretty good"]

inputs = tokenizer(source_sentences, max_length=50, padding=True, truncation=True, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(target_sentences, return_tensors='pt', padding=True)
inputs['decoder_input_ids'] = labels['input_ids']
inputs['decoder_attention_mask'] = labels['attention_mask']
inputs['labels'] = labels['input_ids']
dataset = Dataset.from_dict(inputs)
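As an alternative to building the dataset in memory, a parallel corpus that already lives on the HF Hub or on local disk could be loaded with the datasets library. The sketch below is only an illustration: the Hub repository id and the local file name are hypothetical placeholders, and the loaded columns would still need to be tokenized as above before training.

from datasets import load_dataset

# Hypothetical Hub repository id -- replace with a real dataset
# hub_dataset = load_dataset("my-username/my-zh-en-corpus")

# Hypothetical local JSON-lines file with one {"zh": ..., "en": ...} object per line
local_dataset = load_dataset("json", data_files={"train": "train.jsonl"})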
3.4 Fine-tuning step 3: prepare the training arguments
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

training_args = Seq2SeqTrainingArguments(
    output_dir="./mymodels",
    evaluation_strategy="no",
    overwrite_output_dir=True,
    num_train_epochs=6,
    save_steps=1000,
    save_total_limit=2,
    predict_with_generate=False,
    prediction_loss_only=True)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
3.5 Fine-tuning step 4: start training
HF provides the Trainer class to drive training; the training loop can also be written directly in TensorFlow or native PyTorch.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    tokenizer=tokenizer,
    # compute_metrics=compute_metrics,
    train_dataset=dataset)
trainer.train()
TrainOutput(global_step=3, training_loss=0.3808326721191406, metrics={'train_runtime': 0.6848, 'train_samples_per_second': 4.381, 'train_steps_per_second': 4.381, 'total_flos': 3972464640.0, 'train_loss': 0.3808326721191406, 'epoch': 3.0})
3.6 Check the translation with the fine-tuned model
new_pipeline = pipeline('translation', model=model, tokenizer=tokenizer)
new_pipeline(["听到这个消息之后,所有人都震惊了", "不明觉厉"])
[{'translation_text': 'Everyone was shocked when I heard the news.'}, {'translation_text': "It's not clear what you're talking about, but it looks like it's pretty good"}]
3.7 Save the model
model.save_pretrained("./models")
!ls -lh ./models
total 296M -rw-rw-rw- 1 root root 1.4K Jun 17 11:35 config.json -rw-rw-rw- 1 root root 296M Jun 17 11:35 pytorch_model.bin
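To reuse the fine-tuned model later, the saved directory can be loaded back with from_pretrained. A minimal sketch follows; note that model.save_pretrained above did not write the tokenizer files, so the tokenizer is saved explicitly here to make the directory self-contained.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# Save the tokenizer next to the model weights
tokenizer.save_pretrained("./models")

# Reload the fine-tuned model and tokenizer from the local directory
reloaded_model = AutoModelForSeq2SeqLM.from_pretrained("./models")
reloaded_tokenizer = AutoTokenizer.from_pretrained("./models")
reloaded_pipeline = pipeline("translation", model=reloaded_model, tokenizer=reloaded_tokenizer)
print(reloaded_pipeline(["不明觉厉"]))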
4. Summary
Hugging Face provides a large number of pretrained models for algorithm engineers. Starting from a pretrained model, we can run inference or prediction directly, or fine-tune it further to fit our own business data. The HF Python SDK offers a very friendly interface; the most commonly used pieces are pipeline, model, tokenizer, and trainer.