I've been learning about large language models recently. For the tokenize_function or preprocess_function, the code demos that GPT and other LLMs give me usually come in one of two versions:
def preprocess_function(examples):
    # Version 1: tokenize the input and target texts separately;
    # the target token ids become the labels.
    inputs = [ex["input"] for ex in examples]
    targets = [ex["output"] for ex in examples]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=512, truncation=True, padding="max_length")["input_ids"]
    model_inputs["labels"] = labels
    return model_inputs
def preprocess_function(examples):
    # Version 2: tokenize the whole text once and copy its input_ids into labels.
    inputs = examples["text"]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
    labels = model_inputs["input_ids"].copy()
    model_inputs["labels"] = labels
    return model_inputs
The former passes the target text's token ids to labels, while the latter passes a copy of the entire input text's token ids to labels. The training logic behind these two data layouts looks completely different. Which one should I actually use? Or does the choice depend on other settings elsewhere in the code?
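In case it matters, here is my rough guess at the surrounding setup each version belongs to. This is only a sketch of my own assumptions (the model names t5-small and gpt2 are placeholders I picked, not anything from the demos), and this pairing is exactly the part I'm unsure about:

from transformers import (
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
)

# Guess for version 1: separate input/output texts look like an encoder-decoder
# (seq2seq) setup, e.g. T5, where "labels" hold the target sequence.
seq2seq_tokenizer = AutoTokenizer.from_pretrained("t5-small")
seq2seq_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Guess for version 2: copying input_ids into labels looks like a decoder-only
# (causal LM) setup, e.g. GPT-2, where the model shifts the labels internally
# and learns next-token prediction over the whole text.
causal_tokenizer = AutoTokenizer.from_pretrained("gpt2")
causal_tokenizer.pad_token = causal_tokenizer.eos_token  # gpt2 has no pad token by default
causal_model = AutoModelForCausalLM.from_pretrained("gpt2")

Is that pairing the right way to think about it, or am I completely off track?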