
Introduction: LLMs in Production (MEAP), Part 1


2.2.2 Bayesian techniques


P(hypothesis | evidence) = (P(evidence | hypothesis) * P(hypothesis)) / P(evidence)

P(A|B) * P(B) = P(B|A) * P(A)
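As a quick numeric check of the two forms above, the following sketch plugs made-up probabilities into Bayes' theorem. The numbers and the function name `posterior` are illustrative assumptions, not values from the text.

```python
import math

def posterior(p_evidence_given_h, p_h, p_evidence):
    """Bayes' theorem: P(h | e) = P(e | h) * P(h) / P(e)."""
    return p_evidence_given_h * p_h / p_evidence

# Hypothetical numbers: 20% of utterances are positive (the hypothesis),
# a given word appears in 30% of positive utterances and in 8% of all utterances.
p_h = 0.20
p_e_given_h = 0.30
p_e = 0.08

p_h_given_e = posterior(p_e_given_h, p_h, p_e)
print(round(p_h_given_e, 2))

# The symmetric form P(A|B) * P(B) = P(B|A) * P(A) holds for the same numbers.
print(math.isclose(p_h_given_e * p_e, p_e_given_h * p_h))
```

With these assumed inputs, seeing the word raises the probability of the positive hypothesis from the 20% prior to a much higher posterior.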


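Unfortunately, although the theorem describes the data in a mathematically sound way, it does not account for stochasticity or for words with multiple meanings. One word that reliably confuses a Bayesian model into producing wrong results is "it". Any demonstrative pronoun is assigned LogPrior and LogLikelihood values in the same way as every other word and ends up with a static value, which runs against how these words are actually used. For example, if you want to run sentiment analysis on an utterance, you are better off assigning every pronoun a null value than letting it pass through Bayesian training. Note also that Bayesian techniques do not produce generative language models the way the other techniques here do. Because Bayes' theorem validates a hypothesis, these models are suited to classification, and they can provide a powerful augmentation to a generative language model.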
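One way to act on the advice about pronouns is to drop them before counting, so that they never receive a log-likelihood at all. The sketch below assumes a small hand-written pronoun list and a hypothetical helper named `strip_pronouns`; neither comes from the text.

```python
# A small, hand-picked pronoun list (an assumption; extend it as needed).
PRONOUNS = {"it", "this", "that", "these", "those", "he", "she", "they"}

def strip_pronouns(tokens):
    """Return the tokens with pronouns removed, preserving order."""
    return [tok for tok in tokens if tok.lower() not in PRONOUNS]

tokens = "It was great but that was awful".split()
kept = strip_pronouns(tokens)
print(kept)  # ['was', 'great', 'but', 'was', 'awful']
```

Filtering before training is only one option; leaving the words in and forcing their log-likelihood to zero would have the same effect on the final score.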

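In Listing 2.2, we show how to create a naive Bayes classification language model. We chose to write the code by hand instead of using a package such as sklearn; the code is a little longer, but it should make the inner workings easier to follow. We use the simplest possible version of a naive Bayes model, with nothing sophisticated added, and all of it can be improved if you decide to upgrade it for a problem you want to solve. We strongly recommend that you do.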

Listing 2.2 Naive Bayes classification language model implementation
from utils import process_utt, lookup
from nltk.corpus.reader import PlaintextCorpusReader
import numpy as np
my_corpus = PlaintextCorpusReader("./", r".*\.txt")
sents = my_corpus.sents(fileids="hamlet.txt")

def count_utts(result, utts, ys):
    """
    Input:
        result: a dictionary that is used to map each pair to its frequency
        utts: a list of utts
        ys: a list of the sentiment of each utt (either 0 or 1)
    Output:
        result: a dictionary mapping each pair to its frequency
    """
    for y, utt in zip(ys, utts):
        for word in process_utt(utt):
            # define the key, which is the word and label tuple
            pair = (word, y)
            # if the key exists in the dictionary, increment the count
            if pair in result:
                result[pair] += 1
            # if the key is new, add it to the dict and set the count to 1
            else:
                result[pair] = 1
    return result
result = {}
utts = [" ".join(sent) for sent in sents]
ys = [sent.count("be") > 0 for sent in sents]
count_utts(result, utts, ys)
freqs = count_utts({}, utts, ys)
lookup(freqs, "be", True)
for k, v in freqs.items():
    if "be" in k:
        # show every (word, label) pair that involves "be" and its count
        print(f"{k}: {v}")

def train_naive_bayes(freqs, train_x, train_y):
    """
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of utts
        train_y: a list of labels corresponding to the utts (0,1)
    Output:
        logprior: the log prior.
        loglikelihood: the log likelihood of your Naive Bayes equation.
    """
    loglikelihood = {}
    logprior = 0
    # calculate V, the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)
    # calculate N_pos and N_neg
    N_pos = N_neg = 0
    for pair in freqs.keys():
        # if the label is positive (greater than zero)
        if pair[1] > 0:
            # Increment the number of positive words (word, label)
            N_pos += lookup(freqs, pair[0], True)
        # else, the label is negative
        else:
            # increment the number of negative words (word,label)
            N_neg += lookup(freqs, pair[0], False)
    # Calculate D, the number of documents
    D = len(train_y)
    # Calculate the number of positive documents
    D_pos = sum(train_y)
    # Calculate the number of negative documents
    D_neg = D - D_pos
    # Calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)
    # For each word in the vocabulary...
    for word in vocab:
        # get the positive and negative frequency of the word
        freq_pos = lookup(freqs, word, 1)
        freq_neg = lookup(freqs, word, 0)
        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)
    return logprior, loglikelihood

def naive_bayes_predict(utt, logprior, loglikelihood):
    """
    Input:
        utt: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of all the loglikelihoods + logprior
    """
    # process the utt to get a list of words
    word_l = process_utt(utt)
    # initialize probability to zero
    p = 0
    # add the logprior
    p += logprior
    for word in word_l:
        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the probability
            p += loglikelihood[word]
    return p

def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Input:
        test_x: A list of utts
        test_y: the corresponding labels for the list of utts
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of utts classified correctly)/(total # of utts)
    """
    accuracy = 0  # return this properly
    y_hats = []
    for utt in test_x:
        # if the prediction is > 0
        if naive_bayes_predict(utt, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1
        # otherwise the predicted class is 0
        else:
            y_hat_i = 0
        # append the predicted class to the list y_hats
        y_hats.append(y_hat_i)
    # error = avg of the abs vals of the diffs between y_hats and test_y
    error = sum(
        [abs(y_hat - test) for y_hat, test in zip(y_hats, test_y)]
    ) / len(y_hats)
    # Accuracy is 1 minus the error
    accuracy = 1 - error
    return accuracy
if __name__ == "__main__":
    logprior, loglikelihood = train_naive_bayes(freqs, utts, ys)
    my_utt = "To be or not to be, that is the question."
    p = naive_bayes_predict(my_utt, logprior, loglikelihood)
    print("The expected output is", p)
    print(
        "Naive Bayes accuracy = %0.4f"
        % (test_naive_bayes(utts, ys, logprior, loglikelihood))
    )
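
The listing imports `process_utt` and `lookup` from a local `utils` module that is not shown. A minimal sketch of what such helpers might look like follows; the tokenization rule and the default count of 0 are assumptions, not the authors' implementation.

```python
import re

def process_utt(utt):
    """Lowercase an utterance and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", utt.lower())

def lookup(freqs, word, label):
    """Return how often (word, label) was seen, or 0 if it never was."""
    return freqs.get((word, label), 0)

# Build a tiny frequency table the same way count_utts does.
freqs = {}
for utt, label in [("To be or not to be", True), ("Who's there?", False)]:
    for word in process_utt(utt):
        freqs[(word, label)] = freqs.get((word, label), 0) + 1

print(lookup(freqs, "be", True))   # 2
print(lookup(freqs, "be", False))  # 0
```

Returning 0 for unseen pairs matters because the training code adds 1 to every count for smoothing and would otherwise fail on missing keys.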


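A major problem with Bayesian models is that all sequences are treated as essentially unconnected, just as in BoW models, which puts us at the opposite end of sequence modeling from N-grams. Like a pendulum, language modeling swings back toward sequence modeling and language generation with Markov chains.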
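
To see why word order is invisible to a naive Bayes scorer, note that the prediction is just the log prior plus a sum of per-word log-likelihoods. The sketch below uses made-up log-likelihood values and a hypothetical `score` helper; any permutation of the same words receives the same score.

```python
import math

def score(tokens, logprior, loglikelihood):
    """Sum the log prior and the log-likelihood of every known token."""
    return logprior + sum(loglikelihood.get(tok, 0.0) for tok in tokens)

# Made-up values for illustration only.
loglikelihood = {"happy": 1.2, "not": -0.7, "am": 0.1}

in_order = score("i am not happy".split(), 0.0, loglikelihood)
shuffled = score("happy not am i".split(), 0.0, loglikelihood)
print(math.isclose(in_order, shuffled))  # True
```

Because addition is commutative, "i am not happy" and its scrambled version are indistinguishable to the model, which is the limitation that sequence models address.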

2.2.3 Markov chains

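Markov chains are often discussed under the name hidden Markov models (HMMs), although strictly speaking an HMM adds unobserved states on top of a plain Markov chain. In essence, they add state to the N-gram models mentioned earlier, storing probabilities in hidden states. They are often used to help parse text data for larger models, performing tasks such as part-of-speech tagging (marking each word with its part of speech) and named entity recognition (NER, marking identifying words with their referent and usually their type, for example LA - Los Angeles - City). Unlike the previous Bayesian models, Markov models rely completely on stochasticity (predictable randomness), whereas the Bayesian models pretend it does not exist. The underlying idea is just as mathematically sound: the probability of whatever happens next depends entirely on the state of now. So instead of modeling words solely on their historical occurrence and drawing a probability from that, we model their future and past collocations based on what is currently occurring. The probability of "happy" occurring therefore drops to almost zero if "happy" was just output, but rises significantly if "am" has just occurred. Markov chains are so intuitive that they were incorporated into later iterations of Bayesian statistics, and they are still used in production systems.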
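
The "happy" example can be made concrete by estimating transition probabilities from adjacent word pairs. This is a minimal sketch on a toy corpus; the corpus and the helper name `transition_probs` are assumptions for illustration.

```python
from collections import Counter, defaultdict

def transition_probs(tokens):
    """Estimate P(next | current) from adjacent token pairs."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return {
        current: {w: c / sum(followers.values()) for w, c in followers.items()}
        for current, followers in counts.items()
    }

tokens = "i am happy . i am happy . i am tired . happy days".split()
probs = transition_probs(tokens)

# "happy" is likely after "am" and never follows itself in this corpus.
print(probs["am"].get("happy", 0.0))
print(probs["happy"].get("happy", 0.0))
```

The only thing the model conditions on is the current word, which is exactly the "state of now" described above.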

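In Listing 2.3, we train a Markov chain generative language model. This is the first time we use a specific tokenizer, which in this example tokenizes on the whitespace between words. It is also only the second time we refer to a set of utterances meant to be viewed together as a document. When you try this model, pay close attention and make some comparisons of your own to see how well the HMM generates compared with even a large N-gram model.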

Listing 2.3 Generative hidden Markov language model implementation
import re
import random
from nltk.tokenize import word_tokenize
from collections import defaultdict, deque
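
class MarkovChain:
    def __init__(self):
        self.lookup_dict = defaultdict(list)
        self._seeded = False
        self.__seed_me()

    def __seed_me(self, rand_seed=None):
        # Seed the random number generator once per instance
        if self._seeded is not True:
            try:
                if rand_seed is not None:
                    random.seed(rand_seed)
                else:
                    random.seed()
                self._seeded = True
            except NotImplementedError:
                self._seeded = False

    def add_document(self, str):
        # Record, for every word, the words that followed it in the text
        preprocessed_list = self._preprocess(str)
        pairs = self.__generate_tuple_keys(preprocessed_list)
        for pair in pairs:
            self.lookup_dict[pair[0]].append(pair[1])

    def _preprocess(self, str):
        cleaned = re.sub(r"\W+", " ", str).lower()
        tokenized = word_tokenize(cleaned)
        return tokenized

    def __generate_tuple_keys(self, data):
        if len(data) < 1:
            return
        for i in range(len(data) - 1):
            yield [data[i], data[i + 1]]

    def generate_text(self, max_length=50):
        context = deque()
        output = []
        if len(self.lookup_dict) > 0:
            # Start the chain from the first word seen in training
            chain_head = [list(self.lookup_dict)[0]]
            context.extend(chain_head)
            while len(output) < (max_length - 1):
                next_choices = self.lookup_dict[context[-1]]
                if len(next_choices) > 0:
                    next_word = random.choice(next_choices)
                    context.append(next_word)
                    output.append(context.popleft())
                else:
                    # Dead end: the current word was never followed by anything
                    break
            output.extend(list(context))
        return " ".join(output)


if __name__ == "__main__":
    with open("hamlet.txt", "r", encoding="utf-8") as f:
        text = f.read()
    HMM = MarkovChain()
    HMM.add_document(text)
    print(HMM.generate_text(max_length=25))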

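This code shows a basic implementation of a Markov model for generation. We encourage you to experiment with it: feed it lyrics from your favorite musicians or books by your favorite authors, and see whether what comes out actually sounds like them. HMMs are very fast and are often used in predictive text and predictive search applications. Markov models represent the first comprehensive attempt to model language from a descriptive linguistic perspective rather than a prescriptive one. That is interesting because Markov did not originally intend his work for language modeling; he only wanted to win an argument about continuous independent states. Later, Markov used Markov chains to model the distribution of vowels in a Pushkin novel, so he was at least aware of the possible applications.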
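
The predictive-text use boils down to ranking the words that most often followed the current one. Here is a minimal sketch, assuming a toy corpus and hypothetical helpers named `build_followers` and `suggest`:

```python
from collections import Counter, defaultdict

def build_followers(tokens):
    """Count, for every word, which words came right after it."""
    followers = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current][nxt] += 1
    return followers

def suggest(followers, word, k=3):
    """Return up to k of the most frequent followers of `word`."""
    if word not in followers:
        return []
    return [w for w, _ in followers[word].most_common(k)]

tokens = "to be or not to be that is the question".split()
followers = build_followers(tokens)
print(suggest(followers, "to"))      # ['be']
print(suggest(followers, "hamlet"))  # []
```

Lookups are plain dictionary accesses, which is why this family of models is fast enough for keystroke-level suggestions.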


2.2.4 Continuous language modeling

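A continuous bag-of-words (CBoW), much like its namesake the bag-of-words, is a frequency-based approach to analyzing language, meaning that it models words based on how often they occur. It does not pick the next word in an utterance by probability or frequency. For that reason, the example given here shows how to use a CBoW to create word embeddings that other models can ingest or compare. We will use a neural network for this, to give you a good methodology.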

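This is the first language modeling technique we will see that essentially slides a context window over a given utterance (the context window is an N-gram model) and tries to guess the word in the middle from the surrounding words in the window. For example, suppose your window has a length of 5 and your sentence is "Learning about linguistics makes me happy." You would give the CBoW ['learning', 'about', 'makes', 'me'] and try to get the model to guess "linguistics", based on how many times the model has previously seen that word in similar positions. This should show you why generation is difficult for models trained this way: if you give the model ['makes', 'me', '</s>'] as input, where '</s>' marks the end of the sentence, it has only 3 pieces of information to work with instead of 4, and it will also be biased toward guessing only words it has seen at the end of sentences, rather than getting ready to start a new clause. It is not all bad, though. One feature that makes continuous models stand out for embeddings is that they do not have to look only at the words before the target word; they can also use the words after the target to gain some semblance of context.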
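
The sliding window can be sketched in a few lines. The helper below is a hypothetical illustration, not part of the listings; it assumes a window length of 5, that is, two context words on each side of the target.

```python
def context_windows(tokens, half_window=2):
    """Yield (context, target) pairs for every position with a full window."""
    for i in range(half_window, len(tokens) - half_window):
        context = tokens[i - half_window:i] + tokens[i + 1:i + half_window + 1]
        yield context, tokens[i]

sentence = "learning about linguistics makes me happy".split()
pairs = list(context_windows(sentence))
for context, target in pairs:
    print(context, "->", target)
# ['learning', 'about', 'makes', 'me'] -> linguistics
# ['about', 'linguistics', 'me', 'happy'] -> makes
```

Positions near the start and end of the sentence are skipped here because they lack a full window, which mirrors the reduced-information problem described above.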

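In Listing 2.4, we create our first continuous model. To keep things as simple as possible, we use a bag-of-words for the language processing and a one-layer neural network with two parameters for the embedding estimation, although both could be swapped for any other model. For example, you could substitute N-grams for the bag-of-words and naive Bayes for the neural network, and get a continuous naive N-gram model. The point is that the actual models used in this technique are somewhat arbitrary; what matters is the continuous technique. To illustrate this further, we use no packages other than numpy to do the math for the neural network, even though this is the first neural network to appear in this section.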

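Pay particular attention to the following steps: initializing the model weights, the ReLU activation function, the final softmax layer, forward and backward propagation, and how they all fit together in the gradient_descent function. These are pieces of the puzzle that you will see again and again, regardless of programming language or framework. Whether you use TensorFlow, PyTorch, or Hugging Face, if you start creating your own models instead of using someone else's, you will need to initialize the model, choose an activation function, choose a final layer, and define the forward and backward passes.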

Listing 2.4 Generative continuous bag-of-words language model implementation
import nltk
import numpy as np
from utils import get_batches, compute_pca, get_dict
import re
from matplotlib import pyplot
# Create our corpus for training
with open("hamlet.txt", "r", encoding="utf-8") as f:
    data = f.read()
# Slightly clean the data by removing punctuation, tokenizing by word, and converting to lowercase alpha characters
data = re.sub(r"[,!?;-]", ".", data)
data = nltk.word_tokenize(data)
data = [ch.lower() for ch in data if ch.isalpha() or ch == "."]
print("Number of tokens:", len(data), "\n", data[500:515])
# Get our Bag of Words, along with a distribution
fdist = nltk.FreqDist(word for word in data)
print("Size of vocabulary:", len(fdist))
print("Most Frequent Tokens:", fdist.most_common(20))
# Create 2 dictionaries to speed up time-to-convert and keep track of vocabulary
word2Ind, Ind2word = get_dict(data)
V = len(word2Ind)
print("Size of vocabulary:", V)
print("Index of the word 'king':", word2Ind["king"])
print("Word which has index 2743:", Ind2word[2743])
# Here we create our Neural network with 1 layer and 2 parameters
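def initialize_model(N, V, random_seed=1):
    """
    Inputs:
        N: dimension of hidden vector
        V: dimension of vocabulary
        random_seed: seed for consistent results in tests
    Outputs:
        W1, W2, b1, b2: initialized weights and biases
    """
    # seed numpy so the random initialization is reproducible
    np.random.seed(random_seed)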
    W1 = np.random.rand(N, V)
    W2 = np.random.rand(V, N)
    b1 = np.random.rand(N, 1)
    b2 = np.random.rand(V, 1)
    return W1, W2, b1, b2
# Create our final classification layer, which makes all possibilities add up to 1
def softmax(z):
    """
    Inputs:
        z: output scores from the hidden layer
    Outputs:
        yhat: prediction (estimate of y)
    """
    yhat = np.exp(z) / np.sum(np.exp(z), axis=0)
    return yhat
# Define the behavior for moving forward through our model, along with an activation function
def forward_prop(x, W1, W2, b1, b2):
    """
    Inputs:
        x: average one-hot vector for the context
        W1,W2,b1,b2: weights and biases to be learned
    Outputs:
        z: output score vector
        h: hidden vector after the ReLU activation
    """
    h = W1 @ x + b1
    h = np.maximum(0, h)
    z = W2 @ h + b2
    return z, h
# Define how we determine the distance between ground truth and model predictions
def compute_cost(y, yhat, batch_size):
    logprobs = np.multiply(np.log(yhat), y) + np.multiply(
        np.log(1 - yhat), 1 - y
    )
    cost = -1 / batch_size * np.sum(logprobs)
    cost = np.squeeze(cost)
    return cost
# Define how we move backward through the model and collect gradients
def back_prop(x, yhat, y, h, W1, W2, b1, b2, batch_size):
    """
    Inputs:
        x:  average one hot vector for the context
        yhat: prediction (estimate of y)
        y:  target vector
        h:  hidden vector (see eq. 1)
        W1, W2, b1, b2:  weights and biases
        batch_size: batch size
    Outputs:
        grad_W1, grad_W2, grad_b1, grad_b2:  gradients of weights and biases
    """
    l1 = np.dot(W2.T, yhat - y)
    l1 = np.maximum(0, l1)
    grad_W1 = np.dot(l1, x.T) / batch_size
    grad_W2 = np.dot(yhat - y, h.T) / batch_size
    grad_b1 = np.sum(l1, axis=1, keepdims=True) / batch_size
    grad_b2 = np.sum(yhat - y, axis=1, keepdims=True) / batch_size
    return grad_W1, grad_W2, grad_b1, grad_b2
# Put it all together and train
def gradient_descent(data, word2Ind, N, V, num_iters, alpha=0.03):
    """
    This is the gradient_descent function
    Inputs:
        data:      text
        word2Ind:  words to Indices
        N:         dimension of hidden vector
        V:         dimension of vocabulary
        num_iters: number of iterations
        alpha:     learning rate
    Outputs:
        W1, W2, b1, b2:  updated matrices and biases
    """
    W1, W2, b1, b2 = initialize_model(N, V, random_seed=8855)
    batch_size = 128
    iters = 0
    C = 2
    for x, y in get_batches(data, word2Ind, V, C, batch_size):
        z, h = forward_prop(x, W1, W2, b1, b2)
        yhat = softmax(z)
        cost = compute_cost(y, yhat, batch_size)
        if (iters + 1) % 10 == 0:
            print(f"iters: {iters+1} cost: {cost:.6f}")
        grad_W1, grad_W2, grad_b1, grad_b2 = back_prop(
            x, yhat, y, h, W1, W2, b1, b2, batch_size
        )
        W1 = W1 - alpha * grad_W1
        W2 = W2 - alpha * grad_W2
        b1 = b1 - alpha * grad_b1
        b2 = b2 - alpha * grad_b2
        iters += 1
        if iters == num_iters:
            break
        # decay the learning rate every 100 iterations
        if iters % 100 == 0:
            alpha *= 0.66
    return W1, W2, b1, b2
# Train the model
C = 2
N = 50
word2Ind, Ind2word = get_dict(data)
V = len(word2Ind)
num_iters = 150
print("Call gradient_descent")
W1, W2, b1, b2 = gradient_descent(data, word2Ind, N, V, num_iters)
Call gradient descent
Iters: 10 loss: 0.525015
Iters: 20 loss: 0.092373
Iters: 30 loss: 0.050474
Iters: 40 loss: 0.034724
Iters: 50 loss: 0.026468
Iters: 60 loss: 0.021385
Iters: 70 loss: 0.017941
Iters: 80 loss: 0.015453
Iters: 90 loss: 0.012099
Iters: 100 loss: 0.012099
Iters: 110 loss: 0.011253
Iters: 120 loss: 0.010551
Iters: 130 loss: 0.009932
Iters: 140 loss: 0.009382
Iters: 150 loss: 0.008889
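
The listing's forward_prop expects x to be the average of the one-hot vectors of the context words, which the get_batches helper from the unshown utils module is responsible for producing. The sketch below illustrates that encoding with plain lists; the function name `average_one_hot` and the toy vocabulary are assumptions for illustration, not the authors' helper.

```python
def average_one_hot(context, word2ind):
    """Average the one-hot vectors of the context words into one vector."""
    vec = [0.0] * len(word2ind)
    for word in context:
        vec[word2ind[word]] += 1.0 / len(context)
    return vec

# Toy vocabulary built from a single sentence (an assumption).
vocab = sorted({"learning", "about", "linguistics", "makes", "me", "happy"})
word2ind = {w: i for i, w in enumerate(vocab)}

x = average_one_hot(["learning", "about", "makes", "me"], word2ind)
print(x)  # [0.25, 0.0, 0.25, 0.0, 0.25, 0.25]
```

The target word "linguistics" gets a weight of zero in x, so the network has to infer it from its neighbors alone.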

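The CBoW example is our first code example to showcase a full and effective training loop in machine learning. Within it, we asked you to pay special attention to the steps of the training loop, especially the ReLU activation function. Because we expect you to be at least familiar with various ML paradigms, including different activation functions, we will not explain ReLU here; instead, we will explain why you should use it and why you should not. ReLU addresses the vanishing gradient problem but does not solve the exploding gradient problem, and it sharply destroys all negative comparisons within the model. Better situational variants include ELU, which lets negative values through and saturates them toward -alpha, and GEGLU/SwiGLU, which work well in increasingly complex scenarios such as language. However, people often use ReLU not because it is the best choice in a given situation, but because it is easy to understand, easy to code, and intuitive, even more so than the activations it was created to replace, such as sigmoid or tanh.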
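
To make the contrast between ReLU and ELU concrete, the following sketch implements both for scalar inputs using their standard definitions. The function names and sample inputs are illustrative choices.

```python
import math

def relu(x):
    """ReLU: pass positive values through, clamp everything else to 0."""
    return max(0.0, x)

def elu(x, alpha=1.0):
    """ELU: identity for positive x, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for x in (-5.0, -1.0, 0.0, 2.0):
    print(x, relu(x), round(elu(x), 4))
```

ReLU maps every negative input to the same 0, so differences between negative values are lost, while ELU keeps them distinct and levels off near -alpha for very negative inputs.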

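Much of this ends up abstracted away by packages and the like, but knowing what is going on under the hood will be very helpful for you as someone putting LLMs into production. You should be able to predict with some certainty how different models will behave in various situations. The next section dives into one of those abstractions, in this case the one created by the continuous modeling technique.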


