CCF BDCI Script Character Emotion Recognition: An Open-Source Multi-Target Learning Solution - Alibaba Cloud Developer Community


2022-05-23



1. Competition name

Script character emotion recognition

Competition link: https://www.datafountain.cn/competitions/518

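2. Competition background

Scripts are central to the film and television industry. A good script is not only the foundation of strong word of mouth and high viewership; it can also bring higher commercial returns. Script analysis is the first link in the content production chain, and recognizing the emotions of script characters is an important task within it: for every line of dialogue and every action description in a script, each character involved is analyzed along multiple dimensions to identify their emotions. Compared with sentiment analysis of ordinary news or review text, this task has its own business characteristics and challenges.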

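3. Competition task

The competition provides a portion of movie scripts as the training set, which has been annotated by hand. Participating teams must analyze and recognize, along multiple dimensions, the emotions of each character involved in every line of dialogue and action description in a script scene. The main difficulties are: 1) the writing style of scripts differs considerably from ordinary news corpora and is more colloquial; 2) a character's emotion depends not only on the current text but may also depend heavily on the semantics of the preceding text.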

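4 Data overview

The competition data comes mainly from a portion of movie scripts, together with the emotion annotations produced by the iQIYI annotation team. It is provided to participating teams for model training and result validation.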

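Data description

Training data: the training data is in txt format, tab-separated, with a header in the first row. The fields are as follows:

Field name	Type	Description	Notes
id	String	Data ID	-
content	String	Text content	Script dialogue or action description
character	String	Character name	Character mentioned in the text
emotion	String	Emotion recognition result (in order)	Love, joy, surprise, anger, fear, and sorrow emotion values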

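Notes:

1) The competition defines six emotion classes (in order): love, joy, surprise, anger, fear, and sorrow.

2) Emotion recognition result: the emotion values of the six classes above, in that fixed order. Each value is one of 0, 1, 2, 3 (0 = none, 1 = weak, 2 = medium, 3 = strong), and the values are separated by ASCII (half-width) commas.

3) The competition does not require recognizing character names in the script.

File encoding: UTF-8 without BOM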
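The label format described in the notes can be illustrated with a small parser. This is only a sketch: `parse_emotions`, `EMOTION_NAMES`, and `VALID_VALUES` are hypothetical names introduced here, and the English keys reuse the column names from the baseline code later in this article (where `fright` stands for the "surprise" class).

```python
# Column names follow the article's code; 'fright' stands for the "surprise" class.
EMOTION_NAMES = ('love', 'joy', 'fright', 'anger', 'fear', 'sorrow')
VALID_VALUES = {0, 1, 2, 3}  # 0 = none, 1 = weak, 2 = medium, 3 = strong


def parse_emotions(field):
    """Parse an `emotion` field such as '0,1,0,0,0,2' into a name -> value dict.

    Returns None for an empty field (a row without annotation).
    parse_emotions is a hypothetical helper, not part of the original baseline.
    """
    field = field.strip()
    if not field:
        return None
    values = [int(v) for v in field.split(',')]
    if len(values) != len(EMOTION_NAMES):
        raise ValueError(f'expected {len(EMOTION_NAMES)} values, got {len(values)}')
    if any(v not in VALID_VALUES for v in values):
        raise ValueError(f'emotion values must be in 0..3: {values}')
    return dict(zip(EMOTION_NAMES, values))


print(parse_emotions('0,1,0,0,0,2'))
# {'love': 0, 'joy': 1, 'fright': 0, 'anger': 0, 'fear': 0, 'sorrow': 2}
```

Validating the value range and the number of fields up front makes malformed rows fail loudly instead of silently producing misaligned targets.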

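5 Evaluation metric

Submissions are scored with the commonly used root mean squared error (RMSE), computed over the six emotion values predicted for each "text content + character name" pair.

score = 1/(1 + RMSE)

RMSE is computed from the predicted emotion values y_{i,j} and the annotated emotion values x_{i,j}, where n is the total number of test samples.

The final ranking is by score.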
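For local validation, the score can be approximated along the following lines. This is a sketch, not the official scorer: the exact RMSE normalization is not spelled out in this article, so the code assumes that squared errors are summed over the six emotions and averaged over the n samples. `rmse_score` is a hypothetical helper name.

```python
import math


def rmse_score(predicted, annotated):
    """Return 1 / (1 + RMSE) for two equally long lists of 6-value rows.

    Assumption (not confirmed by the article): squared errors are summed
    over the six emotions of a sample and averaged over the n samples.
    """
    if not predicted or len(predicted) != len(annotated):
        raise ValueError('inputs must be non-empty and of equal length')
    squared_error = 0.0
    for y_row, x_row in zip(predicted, annotated):
        squared_error += sum((y - x) ** 2 for y, x in zip(y_row, x_row))
    rmse = math.sqrt(squared_error / len(predicted))
    return 1.0 / (1.0 + rmse)


# One sample, one emotion off by 1: RMSE = 1, so the score is 0.5.
print(rmse_score([[1, 0, 0, 0, 0, 0]], [[0, 0, 0, 0, 0, 0]]))
```

A perfect prediction gives RMSE = 0 and therefore the maximum score of 1. If the official scorer normalizes differently, only the division inside `math.sqrt` needs to change.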

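6 Multi-target learning based on a pretrained model

There is a lot of room to experiment with this task. When I first saw the competition, I immediately thought of building a multi-output model. I am sharing this baseline here in the hope that others can build on the idea and push the score higher.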

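6.1 Loading the data

First, read the data:

import pandas as pd
from tqdm import tqdm

# Parse the training file by hand so that malformed rows can be reported and skipped.
with open('data/train_dataset_v2.tsv', 'r', encoding='utf-8') as handler:
    lines = handler.read().split('\n')[1:-1]  # drop the header row and the trailing empty line
    data = list()
    for line in tqdm(lines):
        sp = line.split('\t')
        if len(sp) != 4:
            print("ERROR:", sp)
            continue
        data.append(sp)
train = pd.DataFrame(data)
train.columns = ['id', 'content', 'character', 'emotions']
test = pd.read_csv('data/test_dataset.tsv', sep='\t')
submit = pd.read_csv('data/submit_example.tsv', sep='\t')
# Keep only the rows that actually carry an emotion annotation.
train = train[train['emotions'] != '']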

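Extract the emotion targets:

# Turn a string such as "0,1,0,0,0,2" into a list of ints, then expand it into six columns.
train['emotions'] = train['emotions'].apply(lambda x: [int(_i) for _i in x.split(',')])
train[['love', 'joy', 'fright', 'anger', 'fear', 'sorrow']] = train['emotions'].values.tolist()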

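6.2 Building the dataset

Each sample carries six labels:

import torch
from torch.utils.data import Dataset

# The six regression targets, matching the columns created in 6.1.
target_cols = ['love', 'joy', 'fright', 'anger', 'fear', 'sorrow']

class RoleDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        """
        item is the sample index; return the item-th sample.
        """
        text = str(self.texts[item])
        label = self.labels[item]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=True,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            truncation=True,       # cut longer texts so every sample has max_len tokens
            return_attention_mask=True,
            return_tensors='pt',
        )
        sample = {
            'texts': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten()
        }
        # One float target per emotion, so each head can be trained as a regression.
        for label_col in target_cols:
            sample[label_col] = torch.tensor(label[label_col], dtype=torch.float)
        return sample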

6.3 Building the model

from torch import nn
from transformers import BertModel

# PRE_TRAINED_MODEL_NAME is defined elsewhere in the original code and is not shown in this article.
class EmotionClassifier(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        # Shared BERT encoder followed by one linear head per emotion.
        # n_classes is the output size of each head; it is presumably 1 here,
        # since every head is trained as a regression with MSELoss (the value is not shown).
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.out_love = nn.Linear(self.bert.config.hidden_size, n_classes)
        self.out_joy = nn.Linear(self.bert.config.hidden_size, n_classes)
        self.out_fright = nn.Linear(self.bert.config.hidden_size, n_classes)
        self.out_anger = nn.Linear(self.bert.config.hidden_size, n_classes)
        self.out_fear = nn.Linear(self.bert.config.hidden_size, n_classes)
        self.out_sorrow = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        # With return_dict=False, BertModel returns (sequence_output, pooled_output).
        _, pooled_output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=False
        )
        love = self.out_love(pooled_output)
        joy = self.out_joy(pooled_output)
        fright = self.out_fright(pooled_output)
        anger = self.out_anger(pooled_output)
        fear = self.out_fear(pooled_output)
        sorrow = self.out_sorrow(pooled_output)
        return {
            'love': love, 'joy': joy, 'fright': fright,
            'anger': anger, 'fear': fear, 'sorrow': sorrow,
        }

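6.4 Training the model

For the regression loss, nn.MSELoss() is used directly.

from torch import nn
from transformers import AdamW, get_linear_schedule_with_warmup

# model, train_data_loader and device are created elsewhere in the original code (not shown here).
# Note: transformers.AdamW is deprecated in newer transformers releases in favor of
# torch.optim.AdamW, which has no correct_bias argument.
EPOCHS = 1  # number of training epochs
optimizer = AdamW(model.parameters(), lr=3e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)
loss_fn = nn.MSELoss().to(device)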

The total loss of the model is the sum of the losses of the six targets:

import numpy as np
from torch import nn
from tqdm import tqdm

def train_epoch(
    model,
    data_loader,
    criterion,
    optimizer,
    device,
    scheduler,
    n_examples
):
    # n_examples is unused here; it is kept so that existing calls keep working.
    model = model.train()
    losses = []
    for sample in tqdm(data_loader):
        input_ids = sample["input_ids"].to(device)
        attention_mask = sample["attention_mask"].to(device)
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        # Sum the six per-emotion losses. squeeze(-1) turns a (batch, 1) head output
        # into (batch,), so it matches the label shape instead of being broadcast by MSELoss.
        loss = sum(
            criterion(pred.squeeze(-1), sample[name].to(device))
            for name, pred in outputs.items()
        )
        losses.append(loss.item())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    return np.mean(losses)
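The summed loss can be checked on a toy batch without loading BERT. The sketch below assumes that every head outputs a tensor of shape (batch, 1) and that every label tensor has shape (batch,); `multi_target_loss` and `TARGETS` are hypothetical names used only for this illustration.

```python
import torch
from torch import nn

TARGETS = ['love', 'joy', 'fright', 'anger', 'fear', 'sorrow']


def multi_target_loss(outputs, targets, criterion):
    """Sum the per-emotion regression losses.

    outputs: dict name -> tensor of shape (batch, 1), one entry per head
    targets: dict name -> tensor of shape (batch,), the annotated values
    """
    total = torch.zeros(())
    for name in TARGETS:
        pred = outputs[name].squeeze(-1)  # (batch, 1) -> (batch,)
        total = total + criterion(pred, targets[name])
    return total


# Toy batch of two samples: every head predicts 1.0 and every label is 0.0.
outputs = {name: torch.ones(2, 1) for name in TARGETS}
targets = {name: torch.zeros(2) for name in TARGETS}
loss = multi_target_loss(outputs, targets, nn.MSELoss())
print(loss.item())  # each head contributes an MSE of 1.0, so the sum is 6.0
```

Because the six terms are added with equal weight, every emotion influences the shared encoder equally; weighting the terms differently would be a separate design choice that the baseline does not make.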

Online leaderboard score: 0.67+
