Analysis workflow
- Run the resume text through the resume NER model to extract entities, calling it the way the official sample code shows.
- Store the extraction results in Hive and run the analysis there.
- Connect FineBI for visualization.
Analysis results
I picked three entity types: major, education, and job title. (Hmm, I really wanted to pick school, but this model doesn't distinguish schools from companies.)
Out of 1,508 records in total, 20 had a recognized major, 108 had an education level, and 695 had a job title. (Hmm, why would someone not list their major?)
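For a sense of scale, the coverage rates implied by these counts work out as below (a quick sketch computed from the numbers above; the labels are just the model's entity-type codes with a gloss):

```python
# Coverage of each entity type over the 1,508 records, from the counts above.
total = 1508
counts = {"PRO (major)": 20, "EDU (education)": 108, "TITLE (job title)": 695}
for label, n in counts.items():
    print(f"{label}: {n}/{total} = {n / total:.1%}")
# TITLE covers roughly 46% of records; PRO barely 1.3%.
```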
ODS (Hive) => DWS (Hive) => APP (MySQL)
Without further ado, the charts:
Education clusters at junior college and above, with bachelor's degrees the most common; presumably because the data is resumes of working professionals. Today's campus-recruitment resumes would be stacks and stacks of master's degrees.
The job titles all look quite senior, probably because the source is publicly posted resumes; a nobody like me would never post mine publicly.
Majors cluster in economics and management, which matches the pile of managers and directors among the titles. Is there still any room for us programming majors?
Finally, some overall impressions:
- Recognition accuracy is quite high: industry, education, and title are recognized well, with almost nothing misidentified, though inference is a bit slow (my little PC was trembling).
- It is purely an extraction model and cannot normalize synonyms: it extracts 大学本科, 本科, and 本科学历 (all meaning "bachelor's degree") as distinct values, which falls a bit short for BI.
- The entity types are a bit limited, and ORG is too coarse to separate schools from companies. Apparently the original training data was labeled this way?
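On the synonym point: a small normalization pass between extraction and Hive would merge those variants before they reach BI. A minimal sketch, assuming a hand-built mapping (the dictionary below is hypothetical and would need to grow with the data):

```python
# Hypothetical synonym dictionary mapping raw education spans to a canonical form.
EDU_SYNONYMS = {
    "大学本科": "本科",
    "本科学历": "本科",
    "硕士研究生": "硕士",
}

def normalize_education(raw: str) -> str:
    # Fall back to the raw span when no canonical form is known.
    return EDU_SYNONYMS.get(raw, raw)

print(normalize_education("大学本科"))  # -> 本科
print(normalize_education("博士"))      # unchanged: 博士
```

Applied before the ODS write, this would collapse the three bachelor's-degree variants into a single BI category.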
Appendix
- Model invocation
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
import json

# Build the resume NER pipeline from the ModelScope hub.
ner_pipeline = pipeline(Tasks.named_entity_recognition,
                        'damo/nlp_raner_named-entity-recognition_chinese-base-resume')

# Run NER on each resume line and write one JSON result per line.
result_file = open("./result.txt", "w", encoding="utf-8")
with open("./test.txt", "r", encoding="utf-8") as f:
    for line in f.readlines():
        result = ner_pipeline(line)
        result_file.write(json.dumps(result) + "\n")
result_file.close()
```
- NER results
- Generate the ODS layer and load it into Hive
```python
import json

ods_f = open("ods.csv", "w", encoding="utf-8")
with open("./result.txt", "r", encoding="utf-8") as f:
    for line in f.readlines():
        # The results were written with json.dumps, so parse with
        # json.loads rather than the original eval (safer, same shape).
        output = json.loads(line).get("output")
        # Collect all entities of this record first; the original reset
        # dict_one inside the loop, which kept only the last entity.
        dict_one = {}
        for entity in output:
            dict_one[entity.get("type")] = entity.get("span")
        # "-1" is the placeholder for a missing entity type.
        name = dict_one.get("NAME", "-1")
        occupation = dict_one.get("PRO", "-1")
        education = dict_one.get("EDU", "-1")
        title = dict_one.get("TITLE", "-1")
        ods_f.write(name + "\t" + occupation + "\t" + education + "\t" + title + "\n")
ods_f.close()
```
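To make the flattening step concrete, here is one synthetic result line in the same `{"output": [{"type": ..., "span": ...}]}` shape run through the same per-record logic (the spans are invented for illustration):

```python
import json

# A made-up NER result line; real lines come from result.txt.
line = '{"output": [{"type": "EDU", "span": "本科"}, {"type": "TITLE", "span": "工程师"}]}'

# Collapse the entity list into a type -> span lookup, then emit the TSV row.
entities = {e["type"]: e["span"] for e in json.loads(line)["output"]}
row = "\t".join(entities.get(k, "-1") for k in ("NAME", "PRO", "EDU", "TITLE"))
print(row)  # tab-separated: -1, -1, 本科, 工程师
```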
```sql
-- Create database and table (Hive; the charset/collation clauses in the
-- original line are MySQL syntax and do not apply here).
CREATE DATABASE jianli;
USE jianli;
-- Column order matches the TSV written above: name, occupation, education, title.
CREATE TABLE jianli_ods (
    name VARCHAR(30),
    occupation VARCHAR(30),
    education VARCHAR(30),
    title VARCHAR(30)
)
PARTITIONED BY (create_day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/root/ods.csv'
INTO TABLE jianli_ods PARTITION (create_day='2022-08-16');
```
- Generate the aggregate table (run in Hive) and export it to MySQL

```sql
-- Create table
USE jianli;
CREATE TABLE jianli_app (
    group_type VARCHAR(30),
    occupation_name VARCHAR(30),
    occupation_count INT,
    education_name VARCHAR(30),
    education_count INT,
    title_name VARCHAR(30),
    title_count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- group_type '1': counts per occupation (major)
INSERT INTO jianli.jianli_app (group_type, occupation_name, occupation_count,
    education_name, education_count, title_name, title_count)
SELECT '1', occupation, count(name), '-1', 0, '-1', 0
FROM jianli.jianli_ods
GROUP BY occupation;

-- group_type '2': counts per education level
INSERT INTO jianli.jianli_app (group_type, occupation_name, occupation_count,
    education_name, education_count, title_name, title_count)
SELECT '2', '-1', 0, education, count(name), '-1', 0
FROM jianli.jianli_ods
GROUP BY education;

-- group_type '3': counts per job title
INSERT INTO jianli.jianli_app (group_type, occupation_name, occupation_count,
    education_name, education_count, title_name, title_count)
SELECT '3', '-1', 0, '-1', 0, title, count(name)
FROM jianli.jianli_ods
GROUP BY title;
```

Then export the aggregate table to MySQL with Sqoop:

```shell
sqoop export \
  --connect jdbc:mysql://xx.xx.xx.xx:3306/jianli \
  --username root --password xxxx \
  --table jianli_app \
  --hcatalog-database jianli \
  --hcatalog-table jianli_app \
  -m 1
```
- Connect FineBI to the MySQL APP table for the dashboards.