Core problems addressed in this article
Entity and relation extraction is implemented with information extraction techniques. Optical character recognition (OCR) broadens the range of corporate announcement PDFs whose text can be recovered. Open-source datasets and open-source deep learning solutions are used to carry out pre-trained language model training, entity recognition training, and relation extraction training.
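For announcement pages that are scanned images rather than extractable text, a minimal PaddleOCR sketch might look like the following; the image path is hypothetical, and the nesting of the result list can differ between PaddleOCR versions.

from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="ch")  # Chinese recognition model with angle classification
result = ocr.ocr("announcement_page.png", cls=True)
# Recent PaddleOCR versions nest the results per page.
for box, (text, score) in result[0]:
    print(text, score)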
Relation visualization, graph layout, and PageRank-based node importance ranking are implemented with networkx, as sketched below.
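A minimal sketch of this visualization step; the triples below are hypothetical placeholders for the relation extraction model's output.

import networkx as nx
from pyvis.network import Network

# Hypothetical (subject, predicate, object) triples standing in for extracted relations.
triples = [
    ("公司A", "供应", "公司B"),
    ("公司B", "投资", "公司C"),
    ("公司A", "投资", "公司C"),
]

G = nx.DiGraph()
for s, p, o in triples:
    G.add_edge(s, o, label=p)

# PageRank as the node-importance score.
for node, score in sorted(nx.pagerank(G).items(), key=lambda x: -x[1]):
    print(node, round(score, 3))

# Interactive HTML visualization with pyvis.
net = Network(directed=True)
net.from_nx(G)
net.save_graph("relation_graph.html")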
Technologies covered in this article
Deep learning frameworks
- paddleocr
- paddlenlp
- bert4keras
Visualization frameworks
- networkx
- pyvis
Distributed acceleration frameworks
- ray
- pyspark
Model serving interfaces (a minimal FastAPI sketch follows this list)
- TensorFlow Serving
- ray[serve]
- FastAPI
- onnxruntime
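Of these, FastAPI is the lightest to sketch. The endpoint below is a hypothetical illustration: extract_spoes stands in for a call to the trained relation extraction model, and the route and module names are assumptions.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Query(BaseModel):
    text: str


def extract_spoes(text):
    # Placeholder: call the trained relation extraction model here.
    return []


@app.post("/extract")
def extract(query: Query):
    return {"text": query.text, "spo_list": extract_spoes(query.text)}

# Run with: uvicorn server:app --port 8000  (assuming this file is saved as server.py)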
Datasets used in this article
- DuIE, an entertainment-domain relation extraction dataset built by Baidu. A sample record:
{"postag": [{"word": "查尔斯", "pos": "nr"}, {"word": "·", "pos": "w"}, {"word": "阿兰基斯", "pos": "nr"}, {"word": "(", "pos": "w"}, {"word": "Charles Aránguiz", "pos": "nz"}, {"word": ")", "pos": "w"}, {"word": ",", "pos": "w"}, {"word": "1989年4月17日", "pos": "t"}, {"word": "出生", "pos": "v"}, {"word": "于", "pos": "p"}, {"word": "智利圣地亚哥", "pos": "ns"}, {"word": ",", "pos": "w"}, {"word": "智利", "pos": "ns"}, {"word": "职业", "pos": "n"}, {"word": "足球", "pos": "n"}, {"word": "运动员", "pos": "n"}, {"word": ",", "pos": "w"}, {"word": "司职", "pos": "v"}, {"word": "中场", "pos": "n"}, {"word": ",", "pos": "w"}, {"word": "效力", "pos": "v"}, {"word": "于", "pos": "p"}, {"word": "德国", "pos": "ns"}, {"word": "足球", "pos": "n"}, {"word": "甲级", "pos": "a"}, {"word": "联赛", "pos": "n"}, {"word": "勒沃库森足球俱乐部", "pos": "nt"}], "text": "查尔斯·阿兰基斯(Charles Aránguiz),1989年4月17日出生于智利圣地亚哥,智利职业足球运动员,司职中场,效力于德国足球甲级联赛勒沃库森足球俱乐部", "spo_list": [{"predicate": "出生地", "object_type": "地点", "subject_type": "人物", "object": "圣地亚哥", "subject": "查尔斯·阿兰基斯"}, {"predicate": "出生日期", "object_type": "Date", "subject_type": "人物", "object": "1989年4月17日", "subject": "查尔斯·阿兰基斯"}]}
- Haitong DZH (海通大智慧) economic causality extraction dataset. A sample record:
{"id": 865, "text": "自2012年二季度开始,整个家禽养殖业就进入下行亏损通道,随后2012年末爆发的“速成鸡事件”与2013年的“H7N9”等不可抗力因素导致了家禽业进入深度亏损状态,在2013年上半年同期,“圣农发展”的净利润为亏损2.33亿元", "relations": [{"id": 2472, "from_id": 3652, "to_id": 3654, "type": "Influence"}, {"id": 2473, "from_id": 3653, "to_id": 3654, "type": "Influence"}], "entities": [{"id": 3652, "start_offset": 40, "end_offset": 47, "label": "event"}, {"id": 3653, "start_offset": 54, "end_offset": 60, "label": "event"}, {"id": 3654, "start_offset": 70, "end_offset": 81, "label": "event"}]}
- Corporate annual report dataset: the raw source from which information is extracted.
The annual reports are originally PDF files and are converted to txt text through the annual-report PDF processing pipeline described below.
Prerequisite installation: pip install ray pdfminer3k
# encoding: utf-8
# Note: this script uses pdfminer3k; uninstall pdfminer.six first if both are installed
# (pip uninstall pdfminer.six), since the two packages share the pdfminer namespace.
import os

import ray
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfparser import PDFDocument, PDFParser

ray.init(num_cpus=os.cpu_count() * 2)


@ray.remote
def parse(path, out_path):
    # Skip PDFs that have already been converted, so the job can be resumed.
    if os.path.exists(out_path + ".txt"):
        return "ok"
    with open(path, "rb") as fp:
        # Build a PDF parser from the file object and bind it to a document object.
        parser = PDFParser(fp)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        # Initialize with an empty password (the reports are not encrypted).
        doc.initialize()
        # Skip documents that do not allow text extraction.
        if not doc.is_extractable:
            return "ok"
        # Resource manager, layout parameters, aggregator and interpreter for text extraction.
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        try:
            # Process the document page by page.
            for page in doc.get_pages():
                try:
                    interpreter.process_page(page)
                except Exception:
                    continue
                # layout is an LTPage holding LTTextBox, LTFigure, LTImage, ... objects;
                # horizontal text boxes carry the text we want.
                layout = device.get_result()
                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):
                        with open(out_path + ".txt", "a", encoding="utf-8") as f:
                            f.write(x.get_text() + "\n")
            return "ok"
        except Exception:
            return "ok"


if __name__ == "__main__":
    base_path = "../../上市公司年报"
    out_path = "../../公司年报txt"
    names = os.listdir(base_path)
    # One Ray task per annual-report PDF.
    futures = [parse.remote(os.path.join(base_path, name), os.path.join(out_path, name)) for name in names]
    print(ray.get(futures))
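Each annual report is parsed in its own Ray task, so conversion scales with the number of CPU cores, and the early return when the output txt already exists makes the job resumable after an interruption.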
Annual report data processing pipeline
The fixed-length text segmentation step in this pipeline is implemented with the following code.
Strategy: split the annual-report text on Chinese full stops (。), drop sentences shorter than 10 characters, pack the remaining sentences into segments of at most 128 characters, and remove header/footer content.
import os

base_path = "./2021年报_text_dir"
base_path_list = os.listdir(base_path)
word_list = []
for base_path_one in base_path_list:
    try:
        # Each company has its own sub-directory containing the converted txt file.
        base_path_two = os.listdir(os.path.join(base_path, base_path_one))
        base_path_data = open(
            os.path.join(base_path, base_path_one, base_path_two[0]), "r", encoding="utf-8"
        ).read().replace("\n", " ")
    except Exception:
        continue
    words = ""
    for i in base_path_data.split("。"):
        # Drop header/footer lines such as "XXXX年年度报告".
        if "年年度报告" in i:
            continue
        # Drop sentences shorter than 10 characters.
        if len(i) < 10:
            continue
        # Pack sentences into segments of fewer than 128 characters.
        if len(words + i) < 128:
            words += i.replace(" ", "") + "。"
        if len(words + i) > 128:
            if len(words):
                word_list.append(words)
            words = ""
Here "年年度报告" is the marker found so far that identifies header/footer text (page headers such as "2021年年度报告").
Relation extraction dataset loading code
- DuIE (Baidu's entertainment relation extraction dataset): data loading code for the bert4keras-based GPLinker relation extraction framework.
import json


def normalize(text):
    """Simple text normalization: collapse whitespace."""
    return " ".join(text.split())


def load_data(filename):
    """Load the data.
    Single-record format: {'text': text, 'spoes': [(s, p, o)]}
    """
    D = []
    with open(filename, encoding="utf-8") as f:
        for l in f:
            l = json.loads(l)
            D.append({
                "text": normalize(l["text"]),
                "spoes": [
                    (normalize(spo["subject"]), spo["predicate"], normalize(spo["object"]))
                    for spo in l["spo_list"]
                ],
            })
    return D


# Load the datasets.
train_data = load_data("../小说人物关系抽取/train_data.json")
valid_data = load_data("../小说人物关系抽取/dev_data.json")

# Build the predicate <-> id mappings from the schema file.
predicate2id, id2predicate = {}, {}
with open("../小说人物关系抽取/all_50_schemas", encoding="utf-8") as f:
    for l in f:
        l = json.loads(l)
        if l["predicate"] not in predicate2id:
            id2predicate[len(predicate2id)] = l["predicate"]
            predicate2id[l["predicate"]] = len(predicate2id)
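Applied to the DuIE sample record shown earlier, this loader produces entries of the form (text truncated here):

{'text': '查尔斯·阿兰基斯(Charles Aránguiz),1989年4月17日出生于智利圣地亚哥,...',
 'spoes': [('查尔斯·阿兰基斯', '出生地', '圣地亚哥'),
           ('查尔斯·阿兰基斯', '出生日期', '1989年4月17日')]}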
Data loading code for the Haitong DZH (海通大智慧) causality extraction dataset
import json


def load_data(filename):
    """Load the data.
    Single-record format: {'text': text, 'spo_list': [(s, p, o)]}
    """
    D = []
    id2predicate = {}
    predicate2id = {}
    with open(filename, encoding="utf-8") as f:
        for l in f:
            l = json.loads(l)
            # Map each entity id to its surface text via the character offsets.
            entities_mapping = {}
            for i in l["entities"]:
                entities_mapping[i["id"]] = l["text"][i["start_offset"]:i["end_offset"]]
            D.append({
                "text": l["text"],
                "spo_list": [
                    (entities_mapping[spo["from_id"]], spo["type"], entities_mapping[spo["to_id"]])
                    for spo in l["relations"]
                ],
            })
            # Build the relation-type <-> id mappings.
            for spo in l["relations"]:
                if spo["type"] not in predicate2id:
                    id2predicate[len(predicate2id)] = spo["type"]
                    predicate2id[spo["type"]] = len(predicate2id)
    return D, id2predicate, predicate2id


# Load the dataset and split it 80/20 into train and validation sets.
all_data, id2predicate, predicate2id = load_data("untitled.txt")
train_data = all_data[: int(len(all_data) * 0.8)]
valid_data = all_data[int(len(all_data) * 0.8):]
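Each relation record is thus turned into a (cause-event text, "Influence", effect-event text) triple by resolving from_id and to_id through entities_mapping, putting the causality data into the same (s, p, o) form used for DuIE.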