MMD PyTorch detector

Initialization

For both the preprocessing step and the MMD implementation, we can run the same detector using the PyTorch backend.
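As a brief recap, the detector compares reference and test embeddings via the squared maximum mean discrepancy for a kernel $k$ (alibi-detect defaults to a Gaussian RBF kernel):

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x,x'\sim P}[k(x, x')] + \mathbb{E}_{y,y'\sim Q}[k(y, y')] - 2\,\mathbb{E}_{x\sim P,\,y\sim Q}[k(x, y)]$$

The p-values reported below come from a permutation test: reference and test samples are repeatedly shuffled to estimate the distribution of this statistic under the no-drift null hypothesis.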
```python
import torch
import torch.nn as nn

# set random seed and device
seed = 0
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
```
Output:

```
cuda
```
```python
from functools import partial

from alibi_detect.cd import MMDDrift
from alibi_detect.cd.pytorch import preprocess_drift
from alibi_detect.models.pytorch import TransformerEmbedding

embedding_pt = TransformerEmbedding(model_name, emb_type, layers)

model = nn.Sequential(
    embedding_pt,
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, enc_dim)
).to(device).eval()

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=model, tokenizer=tokenizer,
                        max_len=max_len, batch_size=32, device=device)

# initialise drift detector
cd = MMDDrift(X_ref, backend='pytorch', p_val=.05,
              preprocess_fn=preprocess_fn, n_permutations=100,
              input_shape=(max_len,))
```
Detect drift

H0 dataset:
```python
preds_h0 = cd.predict(X_h0)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds_h0['data']['is_drift']]))
print('p-value: {}'.format(preds_h0['data']['p_val']))
```
Output:

```
Drift? No!
p-value: 0.49000000953674316
```
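Besides is_drift and p_val, the returned dictionary also exposes the test statistic itself. A quick way to inspect it; the distance and distance_threshold keys follow recent alibi-detect releases and may differ across versions:

```python
# return the MMD^2 statistic alongside the p-value; key names assume
# a recent alibi-detect release and may vary by version
preds_h0 = cd.predict(X_h0, return_p_val=True, return_distance=True)
print('MMD^2: {}'.format(preds_h0['data']['distance']))
print('MMD^2 threshold: {}'.format(preds_h0['data']['distance_threshold']))
```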
Imbalanced data:
```python
for k, v in X_imb.items():
    preds = cd.predict(v)
    print('% negative sentiment {}'.format(k * 100))
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print('p-value: {}'.format(preds['data']['p_val']))
    print('')
```
Output:

```
% negative sentiment 10.0
Drift? Yes!
p-value: 0.0

% negative sentiment 90.0
Drift? Yes!
p-value: 0.0
```
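For reference, X_imb is constructed in an earlier part of the article. A hypothetical sketch of how such imbalanced samples could be drawn from labelled reviews (imbalanced_sample, reviews, y_labels and n_sample are illustrative names, not from the original):

```python
import numpy as np

def imbalanced_sample(reviews: list, y_labels: np.ndarray,
                      frac_neg: float, n: int) -> np.ndarray:
    # draw n reviews with a frac_neg share of negative (label 0) reviews
    idx_neg = np.where(y_labels == 0)[0]
    idx_pos = np.where(y_labels == 1)[0]
    n_neg = int(frac_neg * n)
    idx = np.concatenate([
        np.random.choice(idx_neg, n_neg, replace=False),
        np.random.choice(idx_pos, n - n_neg, replace=False)
    ])
    return np.array([reviews[i] for i in idx])

# 10% and 90% negative sentiment, matching the fractions tested above
X_imb = {frac: imbalanced_sample(reviews, y_labels, frac, n_sample)
         for frac in [.1, .9]}
```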
Perturbed data:
```python
for w, probas in X_word.items():
    for p, v in probas.items():
        preds = cd.predict(v)
        print('Word: {} -- % perturbed: {}'.format(w, p))
        print('Drift? {}'.format(labels[preds['data']['is_drift']]))
        print('p-value: {}'.format(preds['data']['p_val']))
        print('')
```
Output:

```
Word: fantastic -- % perturbed: 1.0
Drift? Yes!
p-value: 0.0

Word: fantastic -- % perturbed: 5.0
Drift? Yes!
p-value: 0.0

Word: good -- % perturbed: 1.0
Drift? No!
p-value: 0.10000000149011612

Word: good -- % perturbed: 5.0
Drift? Yes!
p-value: 0.0

Word: bad -- % perturbed: 1.0
Drift? Yes!
p-value: 0.0

Word: bad -- % perturbed: 5.0
Drift? Yes!
p-value: 0.0

Word: horrible -- % perturbed: 1.0
Drift? No!
p-value: 0.05999999865889549

Word: horrible -- % perturbed: 5.0
Drift? Yes!
p-value: 0.0
```
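X_word is likewise built earlier in the article. A hypothetical sketch of such a word-level perturbation, assuming X_h0 holds raw review strings (perturb_reviews and the whitespace tokenization are illustrative assumptions):

```python
import random

def perturb_reviews(reviews, word: str, frac: float) -> list:
    # replace roughly a frac share of tokens in each review with `word`
    perturbed = []
    for text in reviews:
        tokens = text.split()
        n_replace = max(1, int(frac * len(tokens)))
        for i in random.sample(range(len(tokens)), n_replace):
            tokens[i] = word
        perturbed.append(' '.join(tokens))
    return perturbed

words = ['fantastic', 'good', 'bad', 'horrible']
X_word = {w: {p: perturb_reviews(X_h0, w, p / 100) for p in [1., 5.]}
          for w in words}
```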
Training embeddings from scratch

So far we have used pre-trained embeddings from a BERT model. However, we can also use embeddings from a model trained from scratch.

First, we define and train a simple classification model in TensorFlow, consisting of an embedding layer followed by an LSTM layer.

Load the data and train the model
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb, reuters
from tensorflow.keras.layers import Dense, Embedding, Input, LSTM
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical

INDEX_FROM = 3
NUM_WORDS = 10000


def print_sentence(tokenized_sentence: str, id2w: dict):
    print(' '.join(id2w[_] for _ in tokenized_sentence))
    print('')
    print(tokenized_sentence)


def mapping_word_id(data):
    w2id = data.get_word_index()
    w2id = {k: (v + INDEX_FROM) for k, v in w2id.items()}
    w2id["<PAD>"] = 0
    w2id["<START>"] = 1
    w2id["<UNK>"] = 2
    w2id["<UNUSED>"] = 3
    id2w = {v: k for k, v in w2id.items()}
    return w2id, id2w


def get_dataset(dataset: str = 'imdb', max_len: int = 100):
    if dataset == 'imdb':
        data = imdb
    elif dataset == 'reuters':
        data = reuters
    else:
        raise NotImplementedError

    w2id, id2w = mapping_word_id(data)

    (X_train, y_train), (X_test, y_test) = data.load_data(
        num_words=NUM_WORDS, index_from=INDEX_FROM)
    X_train = sequence.pad_sequences(X_train, maxlen=max_len)
    X_test = sequence.pad_sequences(X_test, maxlen=max_len)
    y_train, y_test = to_categorical(y_train), to_categorical(y_test)

    return (X_train, y_train), (X_test, y_test), (w2id, id2w)


def imdb_model(X: np.ndarray, num_words: int = 100, emb_dim: int = 128,
               lstm_dim: int = 128, output_dim: int = 2) -> tf.keras.Model:
    X = np.array(X)
    inputs = Input(shape=(X.shape[1:]), dtype=tf.float32)
    x = Embedding(num_words, emb_dim)(inputs)
    x = LSTM(lstm_dim, dropout=.5)(x)
    outputs = Dense(output_dim, activation=tf.nn.softmax)(x)
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    model.compile(
        loss='categorical_crossentropy',
        optimizer='adam',
        metrics=['accuracy']
    )
    return model
```
Load and tokenize the data:
```python
(X_train, y_train), (X_test, y_test), (word2token, token2word) = \
    get_dataset(dataset='imdb', max_len=max_len)
```
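The heading above also covers training. A minimal sketch of fitting the classifier defined by imdb_model; the embedding size, LSTM size, batch size and epoch count below are assumptions:

```python
# train the sentiment classifier on the padded IMDB sequences;
# all hyperparameters here are assumptions, not from the article
model = imdb_model(X=X_train, num_words=NUM_WORDS, emb_dim=256,
                   lstm_dim=128, output_dim=2)
model.fit(X_train, y_train, batch_size=32, epochs=2,
          shuffle=True, validation_data=(X_test, y_test))
```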