[LLM] Your Highness, stop using jieba for word segmentation! See what high tech ChatGLM next door is using! - Alibaba Cloud Developer Community

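I. Introduction

ChatGLM is an excellent open-source large language model from China, and plenty of people are studying it. To use it for your own tasks you still need to understand how it works, and there are a lot of details. ChatGLM has been through several releases, so these notes start from the first version of the code. Each later version modifies the previous one without dramatic changes, so when a new version appears you only need to look at what changed.

There is a great deal to cover in a large model, so let's start with something that comes early in the pipeline: the Tokenizer.

II. Running the program

First download the ChatGLM project. A reliable connection to GitHub and Hugging Face (through a proxy if necessary) makes the download more stable.

ChatGLM-6B: https://github.com/THUDM/ChatGLM-6B

Model files: https://huggingface.co/THUDM/chatglm-6b/tree/main

Once the download finishes, put the model files in THUDM/chatglm-6b under the project directory. If the code below prints a result, the program is working: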

from transformers import AutoTokenizer, AutoConfig
 
 
if __name__ == "__main__":
    model_name = "THUDM/chatglm-6b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    text = "我爱学习"
    tokens = tokenizer.encode(text)
    print("tokens:", tokens)
    ''' Printed result:
    tokens: [5, 76202, 63992, 130001, 130004]
    '''

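Now look at the model files. Three of them relate to the Tokenizer:

ice_text.model: the parameter file that stores the tokenization model;

tokenization_chatglm.py: implements the tokenization logic;

tokenizer_config.json: the Tokenizer's configuration file.

III. Vocabulary

1. Generating the vocabulary

The code below dumps the complete vocabulary into vocab.txt, which also lets us check how large it is: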

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('THUDM/chatglm-6b/ice_text.model')
save_vocab = []
for id in range(sp.vocab_size()):
    save_vocab.append(str(id)+"\t"+sp.id_to_piece(id))
    print(sp.id_to_piece(id))
with open("vocab.txt", 'w+', encoding='utf-8') as f:
    f.write('\n'.join(save_vocab))

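The vocab.txt file can also be downloaded directly: https://download.csdn.net/download/xian0710830114/88791662

Analyzing vocab.txt shows that the vocabulary has 130344 entries, and that Chinese and English entries are kept at a ratio of roughly 1:1.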
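One way to check a ratio like this yourself is to bucket each entry of vocab.txt by script. The sketch below is only an illustration: the helper names (classify_piece, count_vocab) and the bucketing rules are assumptions of mine, not something from the ChatGLM code, and the sample lines are made up in the same "id, tab, piece" format that the dump above writes.

```python
import re

WORD_START = "\u2581"                      # the SentencePiece word-start marker
CJK = re.compile(r"[\u4e00-\u9fff]")       # common CJK ideographs
LATIN = re.compile(r"[A-Za-z]")
BRACKETED = re.compile(r"<.*>|\[.*\]")     # special tokens such as <n> or [gMASK]


def classify_piece(piece):
    """Return 'chinese', 'english' or 'other' for one vocabulary piece."""
    text = piece.replace(WORD_START, "")
    if BRACKETED.fullmatch(text):
        return "other"
    if CJK.search(text):
        return "chinese"
    if LATIN.search(text):
        return "english"
    return "other"


def count_vocab(lines):
    """Count categories over lines shaped like 'id<TAB>piece'."""
    counts = {"chinese": 0, "english": 0, "other": 0}
    for line in lines:
        if "\t" not in line:
            continue
        _, piece = line.rstrip("\n").split("\t", 1)
        counts[classify_piece(piece)] += 1
    return counts


# Hypothetical sample lines, only to show the input format.
sample = ["5\t▁", "398\t▁world", "65319\t苹果", "71232\t▁我是", "130001\t[gMASK]"]
print(count_vocab(sample))
```

For the real file, something like count_vocab(open("vocab.txt", encoding="utf-8")) should work, keeping in mind that these rules are rough and will put mixed or unusual pieces in whichever bucket matches first.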

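2. Special tokens

The special tokens the model uses:

Special token	token_id	Description
<n>	4	Line break
▁	5	Connector; marks the beginning of a word
[gMASK]	130001	Mask used for generating the continuation
<sop>	130004	Start of the output
<eop>	130005	End of the output
<|tab|>	130008	Tab
<|blank_{length}|>	130009-130087	A run of n consecutive spaces (n >= 2) becomes one special token, up to 80 spaces, i.e. <|blank_80|>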

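(1) The connector

Both ChatGLM and LLaMA tokenize with the SentencePiece library. Each segment (segments are separated by spaces) returned by SentencePiece's _EncodeAsPiecesBatch method starts with a special underscore, ▁, which we will call the connector; SentencePiece uses it to mark the start of a word. Note that it is not the ordinary underscore, which looks like this: _. Marking where each word begins helps tell consecutive words apart.

This has two benefits:

a. Word-boundary marking: the text SentencePiece handles often has no explicit spaces or other obvious word-boundary markers (especially in some Asian languages). Prefixing words with the connector helps the model recognize word boundaries.

b. Reversibility: using the connector keeps SentencePiece's encoding and decoding reversible, meaning the original text, including spaces and word boundaries, can be rebuilt exactly from the encoded subword sequence.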
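The reversibility point can be seen with a toy round trip. This is a minimal sketch and not SentencePiece itself: to_pieces and from_pieces are hypothetical helpers, the "segmentation" is just one character per piece, and it assumes the text contains no ▁ of its own.

```python
WORD_START = "\u2581"  # the connector, U+2581 (not the ASCII underscore)


def to_pieces(text):
    """Toy encoding: mark every word start with the connector, then split into characters."""
    marked = WORD_START + text.replace(" ", WORD_START)
    return list(marked)


def from_pieces(pieces):
    """Rebuild the text: join the pieces, turn connectors back into spaces, drop the first one."""
    joined = "".join(pieces)
    return joined.replace(WORD_START, " ")[1:]


text = "Hello World"
pieces = to_pieces(text)
print(pieces)
print(from_pieces(pieces) == text)
```

Because the spaces are stored inside the pieces, the pieces can be regrouped in any way (characters, subwords, whole words) and joining them still gives back the same string.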

Here is an interesting example:

from transformers import AutoTokenizer, AutoConfig
if __name__ == "__main__":
    model_name = "THUDM/chatglm-6b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    vocab = tokenizer.get_vocab()
    vocab_exchange = dict([val, key] for key, val in vocab.items())
    text1 = "苹果我是昨天买的"
    tokens1 = tokenizer.encode(text1, add_special_tokens=False)
    print("tokens1:", tokens1)
    participles1 = [vocab_exchange[token] for token in tokens1]
    print("participles1:", participles1)
    text2 = "我是昨天买的苹果"
    tokens2 = tokenizer.encode(text2, add_special_tokens=False)
    print("tokens2:", tokens2)
    participles2 = [vocab_exchange[token] for token in tokens2]
    print("participles2:", participles2)
 
'''
tokens1: [5, 65319, 65806, 67363, 68543]
participles1: ['▁', '苹果', '我是', '昨天', '买的']
tokens2: [71232, 67363, 68543, 65319]
participles2: ['▁我是', '昨天', '买的', '苹果']
'''

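The first example matches what was said above: a ▁ is automatically added at the start of each segment. In the second example, however, the ▁ has been merged into the first piece. After the ▁ is added at the start of the segment, the vocabulary contains a matching entry '▁我是'; since longer matches take priority, the result is the single piece '▁我是' rather than the two pieces '▁' and '我是'.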
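To make the "longer match wins" intuition concrete, here is a small greedy longest-match segmenter over a toy vocabulary. It is only a sketch of the intuition: SentencePiece's real segmentation depends on the trained model rather than on a plain dictionary lookup, and both longest_match_segment and toy_vocab are made up for this illustration.

```python
def longest_match_segment(text, vocab):
    """Greedy segmentation: at each position take the longest substring found in vocab."""
    pieces, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                pieces.append(text[i:j])
                i = j
                break
    return pieces


# Toy vocabulary: it has '▁我是' but no '▁苹果'.
toy_vocab = {"▁", "▁我是", "我是", "昨天", "买的", "苹果"}

print(longest_match_segment("▁苹果我是昨天买的", toy_vocab))
print(longest_match_segment("▁我是昨天买的苹果", toy_vocab))
```

With this toy vocabulary the first sentence keeps a standalone '▁' because there is no entry for '▁苹果', while the second one starts with the merged '▁我是'.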

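Now for the notion of a "segment". Segments are separated by single spaces, and the example below makes this clear: each single space starts a new segment. Note that only a single space acts as a separator; a run of several spaces is treated as ordinary whitespace and merged into a <|blank_{length}|> token, as in the third example below: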

from transformers import AutoTokenizer, AutoConfig
if __name__ == "__main__":
    model_name = "THUDM/chatglm-6b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    vocab = tokenizer.get_vocab()
    vocab_exchange = dict([val, key] for key, val in vocab.items())
    # 1
    text1 = "Hello World"
    tokens1 = tokenizer.encode(text1, add_special_tokens=False)
    print("tokens1:", tokens1)
    participles1 = [vocab_exchange[token] for token in tokens1]
    print("participles1:", participles1)
    # 2
    text2 = "我是 昨天买的苹果"
    tokens2 = tokenizer.encode(text2, add_special_tokens=False)
    print("tokens2:", tokens2)
    participles2 = [vocab_exchange[token] for token in tokens2]
    print("participles2:", participles2)
    # 3
    text3 = "我是  昨天买的苹果"
    tokens3 = tokenizer.encode(text3, add_special_tokens=False)
    print("tokens3:", tokens3)
    participles3 = [vocab_exchange[token] for token in tokens3]
    print("participles3:", participles3)
 
'''
tokens1: [14833, 398]
participles1: ['▁hello', '▁world']
tokens2: [71232, 70831, 68543, 65319]
participles2: ['▁我是', '▁昨天', '买的', '苹果']
tokens3: [71232, 130009, 67363, 68543, 65319]
participles3: ['▁我是', '<|blank_2|>', '昨天', '买的', '苹果']
'''

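(2) [gMASK]

[gMASK] is the mask used for generating the continuation: it marks the point from which generation starts. During training, the content after [gMASK] is masked out first and then predicted. ChatGLM's attention pattern is a prefix decoder, so [gMASK] can be thought of as the separator between input and output. More on this when we get to the model structure.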
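For readers who have not met the term, a prefix-decoder mask lets every position see the whole input part, while the output part is only visible left to right. The sketch below builds such a mask in plain Python. It is a generic illustration rather than ChatGLM's own code: prefix_decoder_mask is a hypothetical helper, and treating everything before <sop> as the prefix is an assumption made for the example.

```python
def prefix_decoder_mask(prefix_len, total_len):
    """mask[i][j] == 1 means position i may attend to position j."""
    return [
        [1 if (j < prefix_len or j <= i) else 0 for j in range(total_len)]
        for i in range(total_len)
    ]


SOP_ID = 130004  # <sop>, from the special-token table

# prompt tokens + [gMASK] + <sop> + target tokens + <eop>
input_ids = [5, 108293, 130001, 130004, 5, 65870, 63829, 75581, 64102, 103559, 130005]
prefix_len = input_ids.index(SOP_ID)  # assumption: the prefix is everything before <sop>

for row in prefix_decoder_mask(prefix_len, len(input_ids)):
    print(row)
```

The first prefix_len columns are all ones (the input is fully visible), and the rest of the matrix is lower triangular (each output position sees only itself and what came before).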

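(3) <sop> and <eop>

In ChatGLM these two tokens serve as <bos> (Beginning Of Sentence) and <eos> (Ending Of Sentence) respectively, and are added to the start and end of the output.

Here is an example. The data is one row from a training set; because it is training data, it has an explicit output as Ground Truth. This is what preprocessing looks like before training: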

from transformers import AutoTokenizer, AutoConfig
 
 
def preprocess(tokenizer, config, example, max_seq_length):
    prompt = example["context"]
    target = example["target"]
    prompt_ids = tokenizer.encode(prompt, max_length=max_seq_length, truncation=True)
    target_ids = tokenizer.encode(
        target,
        max_length=max_seq_length,
        truncation=True,
        add_special_tokens=False)
    input_ids = prompt_ids + target_ids + [config.eos_token_id]
    return {"input_ids": input_ids, "seq_len": len(prompt_ids)}
 
 
if __name__ == "__main__":
    model_name = "THUDM/chatglm-6b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
    max_seq_length = 200
    example = {
        "context": "你是谁",
        "target": "人家是城堡中的小公主"
    }
    token = preprocess(tokenizer, config, example, max_seq_length)
    print("token:", token)
 
'''
token: {'input_ids': [5, 108293, 130001, 130004, 5, 65870, 63829, 75581, 64102, 103559, 130005], 'seq_len': 4}
'''

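The code above converts a question-answer pair into tokens. As the printed result shows, the sequence is laid out as: prompt tokens, [gMASK], <sop>, target tokens, <eop>.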
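To read a result like this more easily, the IDs can be split at seq_len and the known special IDs replaced by their names. This is a small sketch: describe is a hypothetical helper, the ID-to-name mapping is copied from the special-token table, and ordinary IDs are left as numbers because looking them up needs the real vocabulary.

```python
# Special IDs copied from the special-token table.
SPECIAL = {
    4: "<n>",
    5: "▁",
    130001: "[gMASK]",
    130004: "<sop>",
    130005: "<eop>",
    130008: "<|tab|>",
}


def describe(input_ids, seq_len):
    """Split a preprocessed example at seq_len and name the known special IDs."""
    def name(token_id):
        return SPECIAL.get(token_id, str(token_id))

    prompt = [name(t) for t in input_ids[:seq_len]]
    target = [name(t) for t in input_ids[seq_len:]]
    return prompt, target


example = {
    "input_ids": [5, 108293, 130001, 130004, 5, 65870, 63829, 75581, 64102, 103559, 130005],
    "seq_len": 4,
}
prompt, target = describe(example["input_ids"], example["seq_len"])
print("prompt:", prompt)
print("target:", target)
```

Note that seq_len counts the prompt tokens together with [gMASK] and <sop>, which is why it is 4 here even though the prompt itself produced only two tokens.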

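IV. The encoding process

The Tokenizer is built on the sentencepiece package, but quite a lot happens before sentencepiece is called. The example below encodes one row of training data; let's look at what happens along the way: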

from transformers import AutoTokenizer, AutoConfig
 
 
def preprocess(tokenizer, config, example, max_seq_length):
    prompt = example["context"]
    target = example["target"]
    prompt_ids = tokenizer.encode(prompt, max_length=max_seq_length, truncation=True)
    target_ids = tokenizer.encode(
        target,
        max_length=max_seq_length,
        truncation=True,
        add_special_tokens=False)
    input_ids = prompt_ids + target_ids + [config.eos_token_id]
    return {"input_ids": input_ids, "seq_len": len(prompt_ids)}
 
 
if __name__ == "__main__":
    model_name = "THUDM/chatglm-6b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
    max_seq_length = 200
    example = {
        "context": "你要干什么",
        "target": "小公主   我们来玩吧\nHAHA\tHAHA"
    }
    token = preprocess(tokenizer, config, example, max_seq_length)
    print("token:", token)
 
'''
token: {'input_ids': [85117, 72675, 130001, 130004, 5, 103559, 130010, 63869, 111415, 63956, 4, 26650, 130008, 26650, 130005], 'seq_len': 4}
'''

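Unless stated otherwise, the code discussed below lives in tokenization_chatglm.py, and the entry point is ChatGLMTokenizer._tokenize().

1. Removing spaces and lowercasing

This step is configurable; the options are in tokenizer_config.json: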

...
  "remove_space": false,
  "do_lower_case": true,
...

Removing spaces would interfere with the <|blank_{length}|> tokens described below, so only lowercasing is enabled here. The code:

    def preprocess_text(self, inputs):
        if self.remove_space:
            outputs = " ".join(inputs.strip().split())
        else:
            outputs = inputs
 
        if self.do_lower_case:
            outputs = outputs.lower()
 
        return outputs
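To see the effect of the two switches without loading the model, the same logic can be tried as a standalone function. This is a sketch that mirrors the method above, with the two config values turned into plain arguments; the defaults follow the tokenizer_config.json excerpt.

```python
def preprocess_text(inputs, remove_space=False, do_lower_case=True):
    """Standalone copy of the logic above, for experimenting."""
    if remove_space:
        # Collapse every run of whitespace into one space and trim both ends.
        outputs = " ".join(inputs.strip().split())
    else:
        outputs = inputs
    if do_lower_case:
        outputs = outputs.lower()
    return outputs


text = "  Hello   World  "
print(repr(preprocess_text(text)))                     # spaces kept, lowercased
print(repr(preprocess_text(text, remove_space=True)))  # runs of spaces collapsed
```

With remove_space turned on, the three spaces between the words collapse into one, so no <|blank_3|> token could ever be produced afterwards.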

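2. Converting newlines, tabs and spaces

\n is replaced with <n>, \t with <|tab|>, and each run of two or more spaces with <|blank_{length}|>, where {length} is the number of spaces, up to 80. Note that although 80 is a parameter, it can only be set to 80 or less, because the vocabulary has no token for more than 80 spaces.

The code: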

    @staticmethod
    def _encode_whitespaces(text: str, max_len: int = 80):
        # Replace tabs
        text = text.replace("\t", SPTokenizer.get_tab_token())
        # Replace runs of spaces, longest first
        for i in range(max_len, 1, -1):
            text = text.replace(" " * i, SPTokenizer.get_blank_token(i))
        return text
 
    def _preprocess(self, text: str, linebreak=True, whitespaces=True):
        if linebreak:
            # Replace newlines
            text = text.replace("\n", "<n>")
        if whitespaces:
            text = self._encode_whitespaces(text, max_len=self.max_blank_length)
        return text
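The replacement logic is easy to try on its own. The sketch below is a standalone version under two assumptions: SPTokenizer.get_tab_token() returns "<|tab|>" and get_blank_token(i) returns "<|blank_{i}|>", which matches the token names in the special-token table. The function names here are my own.

```python
def encode_whitespaces(text, max_len=80):
    """Replace tabs and runs of 2..max_len spaces with their special tokens."""
    text = text.replace("\t", "<|tab|>")
    # Longest runs first, so 3 spaces become one <|blank_3|> and not <|blank_2|> plus a space.
    for i in range(max_len, 1, -1):
        text = text.replace(" " * i, f"<|blank_{i}|>")
    return text


def preprocess_whitespace(text, max_len=80):
    """Newlines first, then tabs and spaces, in the same order as _preprocess above."""
    return encode_whitespaces(text.replace("\n", "<n>"), max_len)


print(preprocess_whitespace("小公主   我们来玩吧\nHAHA\tHAHA"))
print(preprocess_whitespace("a" + " " * 81 + "b"))
```

A single space is left untouched because the loop stops at 2, and a run longer than max_len is split: 81 spaces become <|blank_80|> followed by one ordinary space.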

3. Dummy prefix

A dummy prefix (actually <n>) can be added at the start of the text; by default it is not added.

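4. Generating token IDs

After the processing above, sentencepiece's EncodeAsIds() method is called to produce the tokens; this is when the special underscore is attached. sentencepiece is well worth studying, and ice_text.model was trained with it too. Judging from the vocabulary, the algorithm used is BPE (Byte Pair Encoding).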
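For readers new to BPE, the core training loop is short enough to sketch: count adjacent symbol pairs, merge the most frequent pair into one symbol, and repeat. The code below is a toy version of that idea and not sentencepiece's implementation; the function names and the three-word corpus are invented for illustration.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word with each occurrence of `pair` fused into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(words, num_merges):
    """Return the learned merges in order, plus the corpus after merging."""
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Toy corpus: each word is a tuple of symbols (starting with the connector) and a frequency.
corpus = {
    ("▁", "l", "o", "w"): 5,
    ("▁", "s", "l", "o", "w"): 3,
    ("▁", "b", "l", "o", "t"): 2,
}
merges, final_words = learn_bpe(corpus, 3)
print(merges)
print(final_words)
```

In this corpus "l" and "o" are merged first because that pair occurs most often, and after three merges the frequent word has become the single symbol ▁low, which is the same kind of entry as '▁我是' in the real vocabulary.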

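5. Appending special tokens

130001 ([gMASK]) and 130004 (<sop>) are appended to the encoded tokens. Note that when preparing training data, the output is not followed by these two tokens but by 130005 (<eop>), and that step is something we have to do ourselves. The code: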

    def build_inputs_with_special_tokens(
            self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. A BERT sequence has the following format:
        - single sequence: `[CLS] X [SEP]`
        - pair of sequences: `[CLS] A [SEP] B [SEP]`
        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        gmask_id = self.sp_tokenizer[self.gmask_token]
        eos_id = self.sp_tokenizer[self.eos_token]
        token_ids_0 = token_ids_0 + [gmask_id, self.sp_tokenizer[self.bos_token]]
        if token_ids_1 is not None:
            token_ids_0 = token_ids_0 + token_ids_1 + [eos_id]
        return token_ids_0
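The same concatenation can be reproduced with plain lists, which makes it easy to check against the printed input_ids from the earlier training example. This is a sketch with hard-coded IDs taken from the special-token table; build_inputs is a hypothetical stand-in, and using 130005 (<eop>) as the end token is an assumption based on the config.eos_token_id value seen in that output.

```python
GMASK_ID, SOP_ID, EOP_ID = 130001, 130004, 130005  # from the special-token table


def build_inputs(prompt_ids, target_ids=None):
    """prompt + [gMASK] + <sop>, and optionally target + <eop>."""
    ids = list(prompt_ids) + [GMASK_ID, SOP_ID]
    if target_ids is not None:
        ids = ids + list(target_ids) + [EOP_ID]
    return ids


prompt_ids = [5, 108293]                                  # "你是谁" without special tokens
target_ids = [5, 65870, 63829, 75581, 64102, 103559]      # "人家是城堡中的小公主"

print(build_inputs(prompt_ids))
print(build_inputs(prompt_ids, target_ids))
```

The second call gives the same list as the input_ids printed by the training example, where the <eop> was appended by hand in preprocess.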

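The concatenation itself is triggered inside the transformers package. tokenization_utils_base.py defines the base method PreTrainedTokenizerBase.build_inputs_with_special_tokens(), which adds no special tokens and is meant to be overridden; ChatGLMTokenizer overrides it with the version above, which is how the special tokens end up at the end of the tokens. The base implementation: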

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens.
        This implementation does not add special tokens and this method should be overridden in a subclass.
        Args:
            token_ids_0 (`List[int]`): The first tokenized sequence.
            token_ids_1 (`List[int]`, *optional*): The second tokenized sequence.
        Returns:
            `List[int]`: The model input with special tokens.
        """
        if token_ids_1 is None:
            return token_ids_0
        return token_ids_0 + token_ids_1

[Figure: diagram of the complete encoding process, with parts of the flow slightly rearranged to make it easier to follow.]

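V. The decoding process

Finally, decode. The process is simple enough to sum up in one sentence: token_ids are converted back to strings according to the vocabulary, and the connectors are removed along the way: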

from transformers import AutoTokenizer, AutoConfig
if __name__ == "__main__":
    model_name = "THUDM/chatglm-6b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    vocab = tokenizer.get_vocab()
    vocab_exchange = dict([val, key] for key, val in vocab.items())
    tokens = [5, 19316, 932]
    participles = [vocab_exchange[token] for token in tokens]
    print("participles:", participles)
    decode_tokens = tokenizer.decode(tokens)
    print("decode_tokens:", decode_tokens)
 
'''
participles: ['▁', '▁Hello', '▁World']
decode_tokens: Hello World
'''
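Decoding also has to undo the whitespace tokens from the encoding step. The sketch below shows that direction on a list of pieces. It rests on several assumptions: the piece strings are illustrative and not looked up in the real vocabulary, join_pieces and decode_whitespaces are hypothetical helpers, and leading spaces are stripped only to mirror the printed result above.

```python
import re

WORD_START = "\u2581"


def decode_whitespaces(text):
    """Turn <n>, <|tab|> and <|blank_N|> back into real whitespace."""
    text = text.replace("<n>", "\n").replace("<|tab|>", "\t")
    return re.sub(r"<\|blank_(\d+)\|>", lambda m: " " * int(m.group(1)), text)


def join_pieces(pieces):
    """Join pieces, map connectors to spaces, drop leading spaces, restore whitespace tokens."""
    text = "".join(pieces).replace(WORD_START, " ")
    return decode_whitespaces(text.lstrip(" "))


# Illustrative pieces for the target "小公主   我们来玩吧\nHAHA\tHAHA" after lowercasing.
pieces = ["▁", "小公主", "<|blank_3|>", "我们来玩吧", "<n>", "haha", "<|tab|>", "haha"]
print(repr(join_pieces(pieces)))
```

The spaces, newline and tab come back exactly, but the capital letters do not: with do_lower_case enabled, lowercasing is the one step of the pipeline that decoding cannot reverse.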

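One question remains: how was the vocabulary (ice_text.model) produced? ChatGLM and LLaMA both use the BPE implementation in the sentencepiece package. sentencepiece implements four algorithms: BPE (Byte Pair Encoding), Unigram, Word and Char. What these four are and why BPE was chosen in the end will get a post of its own, since space is limited (read: I got lazy).

That's all for ChatGLM's Tokenizer. Follow along so you don't miss the next part (#^.^#)...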
