The simplest approach

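If you want to solve phone-number recognition with an open-source NLP package, libraries such as jieba or thulac are worth a look. They provide Chinese word segmentation and related features such as keyword extraction, which makes text processing convenient.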

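As for installing dependencies on MaxCompute, the simplest approach is to find an open-source NLP package on GitHub and install it into MaxCompute with pip. You can then implement phone-number recognition with a Python script or with the pyudf feature in MaxCompute Studio.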

I started with JioNLP, though I can't say how it compares with other projects. I simply searched around, saw that it had a lot of stars, and went with it.

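JioNLP is made up of many small modules, two of which deal with phone numbers. The first is phone_location. In the version of the code reproduced at the end of this post it always returns a dict; when the input cannot be recognized, the province and city fields are None and the type is 'unknown'.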

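Note that cell phone numbers and landline numbers are treated differently: a landline result has no operator field, so keep that in mind in practice. The second module is extract_phone_number, which returns a list (a list of dicts when detail=True) containing every valid number it finds in the text. If a single text can hold several numbers, you need to decide whether the table structure has to change to accommodate them.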

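The advantage of the simple approach is that everything is well encapsulated: anyone who knows a little Python can put together a decent UDF with a for loop. The catch is that to install an open-source package you first have to explain what the package does and where it comes from, then write a document and go through the whole approval process, and if a package conflict turns up at the end, the effort comes to nothing. It can be easier to lift the relevant code directly; copy and paste is second nature, after all. I won't walk through the process of writing it from the source here.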

jio.extract_phone_number

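extract_phone_number is a method of the Extractor class, a rule-based extractor that defines extraction methods for many different needs. Phone numbers are extracted with plain regular expressions: one pattern for cell phone numbers, one for landline numbers, and a shared base function.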

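First, extract_base. It takes three arguments: a compiled regular expression object, the text to extract from, and a with_offset flag that controls whether offsets are returned. The flag defaults to False, in which case only the extracted text is returned.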

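When text comes in, extract_base calls the pattern's finditer method, which returns an iterator over all matching substrings. A list comprehension, one in each branch of the if/else on with_offset, then processes every match and calls group(1) to get the value of the first capture group. The offset comes from span().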
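To make the flow concrete, here is a minimal sketch of the same idea. The toy pattern four_digits and the sample text are made up for illustration, and unlike the version listed at the end of this post, this sketch does not subtract 1 from the offsets because no padding is added.

```python
import re


def extract_base(pattern, text, with_offset=False):
    """Return group(1) of every match, optionally with its span."""
    if with_offset:
        return [{"text": m.group(1), "offset": m.span()}
                for m in pattern.finditer(text)]
    return [m.group(1) for m in pattern.finditer(text)]


# Hypothetical toy pattern: runs of exactly four digits, captured in group 1.
four_digits = re.compile(r"(?<!\d)(\d{4})(?!\d)")

sample = "pin 1234, code 5678"
plain = extract_base(four_digits, sample)
detailed = extract_base(four_digits, sample, with_offset=True)
print(plain)
print(detailed)
```

With with_offset left at its default the result is a flat list of strings; with it enabled each item becomes a dict carrying the matched text and its position.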

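With the base function understood, extract_phone_number is easy to implement. It first compiles two regular expression objects, one for cell phone numbers and one for landline numbers. It then adds a # to both ends of the text being checked. I was not sure about this step at first, but it is tied to the lookaround assertions in the patterns. Each pattern starts with the lookbehind (?<=[^\d]) and ends with the lookahead (?=[^\d]), and both require an actual non-digit character next to the number, so without the padding a number at the very start or end of the text could never match. This is also why extract_base subtracts 1 from the offsets.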
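The effect of the # padding can be checked directly. The sketch below uses CELL_PHONE_PATTERN exactly as it appears in the code at the end of this post; the sample strings are made up.

```python
import re

# Pattern copied from the full listing in this post.
CELL_PHONE_PATTERN = r'(?<=[^\d])(((\+86)?([- ])?)?((1[3-9][0-9]))([- ])?\d{4}([- ])?\d{4})(?=[^\d])'
pattern = re.compile(CELL_PHONE_PATTERN)


def find_cells(text, pad):
    """Run the pattern with or without the '#' padding."""
    if pad:
        text = "".join(["#", text, "#"])
    return [m.group(1) for m in pattern.finditer(text)]


bare = "13934720013"                      # number fills the whole string
mixed = "tel:13934720013;13800138000"     # second number ends the string

print(find_cells(bare, pad=False))
print(find_cells(bare, pad=True))
print(find_cells(mixed, pad=False))
print(find_cells(mixed, pad=True))
```

Without padding, the bare number is missed entirely and the trailing number in the second string is dropped, because the lookbehind and lookahead each need a real neighbouring character.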

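The two patterns used here are CELL_PHONE_PATTERN and LANDLINE_PHONE_PATTERN; both appear in full in the code at the end of this post.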

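If you have never worked with regular expressions, a generative AI tool can help you read them. A regular expression is a pattern-matching rule used to find specific patterns in text. A capture group is a sub-expression wrapped in parentheses whose match can be pulled out as part of the result. Lookahead and lookbehind assertions are zero-width checks: they require that something does (or, in their negative forms, does not) appear after or before the current position, without including it in the match.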

In practice, remember that a single line of text may contain several contact numbers, which is why the final result is a list.

jio.phone_location

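Compared with extract_phone_number, phone_location does noticeably more work. I'll explain it by following the data as it flows through, rather than by the order in which the code is laid out.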

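In the __init__ method of the PhoneLocation class, an attribute named cell_phone_location_trie is created and set to None. The first time the object is called, it checks whether that attribute is None and, if so, starts loading the dictionaries. The main dictionary is a text file named phone_location.txt, which stores the leading digits of domestic phone numbers together with the four digits that follow them.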

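In phone_location.txt, the prefix and the following four-digit segments are separated by a tab, and the individual segments are separated by commas. A - marks a range, which the code expands into a list. To keep a value such as 0011 from turning into 11 inside the loop, each number is rendered with a format string as four digits, left-padded with zeros. Finally, the prefix is joined to every segment and the resulting list is returned.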
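The expansion step can be sketched on its own. The function name expand_segments and the sample line are hypothetical; the logic mirrors return_all_num in the listing at the end of this post.

```python
def expand_segments(line):
    """Expand '<TAB>prefix<TAB>seg,seg-seg' into seven-digit prefixes."""
    front, info = line.strip().split("\t")
    segments = []
    for part in info.split(","):
        if "-" in part:
            start, end = part.split("-")
            # '{:0>4d}' left-pads with zeros so 11 is written back as '0011'.
            segments.extend("{:0>4d}".format(i)
                            for i in range(int(start), int(end) + 1))
        else:
            segments.append(part)
    return [front + seg for seg in segments]


# Hypothetical dictionary line: prefix 134, a range and a single segment.
sample_line = "\t134\t0009-0011,0054\n"
expanded = expand_segments(sample_line)
print(expanded)
```

Going through int() for the range is what makes the zero padding necessary: the bounds lose their leading zeros as integers and have to get them back when they are turned into strings again.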

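The loader also uses startswith to decide how each line is handled. Lines that begin with a tab hold number segments. A line that begins with, say, 山东 is not expanded into numbers; it is a header line holding the province and city, the area code, and the postal code. From the parsed data the loader builds three dictionaries. Entries in phone_location_dict look roughly like {'1340054': '山东 济南'}, where the seven-digit key covers every number of the form 1340054****. zip_code_location_dict holds entries like {'250000': '山东 济南'}, and area_code_location_dict holds entries like {'0531': '山东 济南'}.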
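The split between header lines and segment lines can be sketched as follows. The two sample lines are invented to match the format described above, and range expansion is left out to keep the example short, so this is an illustration rather than the loader itself.

```python
def load_phone_location(lines):
    """Build the three lookup dicts from header and segment lines."""
    phone_location, zip_code_location, area_code_location = {}, {}, {}
    cur_location = ""
    for line in lines:
        if line.startswith("\t"):
            # Segment line: belongs to the most recent header line.
            front, info = line.strip().split("\t")
            for seg in info.split(","):  # '-' ranges omitted in this sketch
                phone_location[front + seg] = cur_location
        else:
            # Header line: location, area code, postal code.
            cur_location, area_code, zip_code = line.strip().split("\t")
            zip_code_location[zip_code] = cur_location
            area_code_location[area_code] = cur_location
    return phone_location, zip_code_location, area_code_location


# Hypothetical file content.
sample_lines = [
    "山东 济南\t0531\t250000\n",
    "\t134\t0054,0055\n",
]
phones, zips, areas = load_phone_location(sample_lines)
print(phones, zips, areas)
```

The loader depends on line order: every segment line is attributed to whichever header line came last.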

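Once the dictionaries are loaded, the entries in cell_phone_location are written one by one into cell_phone_location_trie. cell_phone_location_trie is first set to a TrieTree object, and the tree's add_node method is then called with each number prefix and its location loc. When adding an entry, the method first strips surrounding whitespace and then checks that what remains is not empty or a lone whitespace character.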

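Next comes the dict created in the tree's __init__, dict_trie. The method records the length of the input and lowercases any letters in it, then walks through the input one character at a time. For example, after the first insert of "123" the dict looks like {1:{2:{3:{}}}} (leaving out the 'type' entries), and after inserting "124" it becomes {1:{2:{3:{},4:{}}}}. The last tree = tree[char] in the else branch confuses some people. It rebinds the name tree to the newly created nested dict, so tree no longer refers to the parent level and the next character is inserted one level deeper. Running the code and printing tree at each step makes this clear.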
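A stripped-down version of the insert shows how the nested dict grows. This sketch uses a free function on a plain dict instead of the TrieTree class, and the type labels "A" and "B" are placeholders.

```python
def add_node(trie, word, typing):
    """Insert word into a nested-dict trie and tag its last node."""
    word = word.strip().lower()
    if not word:
        return
    tree = trie
    for char in word:
        if char not in tree:
            tree[char] = {}
        # Rebind 'tree' to the child level; the parent dict is unchanged.
        tree = tree[char]
    tree["type"] = typing


trie = {}
add_node(trie, "123", "A")
after_first = repr(trie)
add_node(trie, "124", "B")
print(after_first)
print(trie)
```

The two words share the path 1 → 2 and only branch at the last character, which is what makes prefix lookups cheap.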

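The location loc is stored under the 'type' key of the final node in cell_phone_location_trie. After that, as in extract_phone_number, regular expression objects are compiled, three this time: CELL_PHONE_CHECK_PATTERN, LANDLINE_PHONE_CHECK_PATTERN, and LANDLINE_PHONE_AREA_CODE_PATTERN, all listed in the code at the end of this post.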

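After the phone-number dictionary, an operator dictionary is loaded as well. This one is simpler: it only maps the leading digits of a cell phone number to its telecom operator (the code passes the first four digits to the lookup). It is loaded the same way, by adding entries to a trie.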

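At lookup time, the cell phone pattern is first used to check whether the input contains a matching substring. If it does, the first seven characters are taken and matched character by character against the trie, using the same logic as above, to get the province and city for the number. Because province and city are separated by a space in the txt file, a space is used as the split delimiter.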
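The lookup half can be sketched the same way. The helper names build and search and the single dictionary entry are assumptions for illustration; search follows the logic of TrieTree.search in the listing at the end, including its (1, None) result for a miss.

```python
def build(entries):
    """Build a nested-dict trie from {word: typing}."""
    trie = {}
    for word, typing in entries.items():
        tree = trie
        for char in word:
            tree = tree.setdefault(char, {})
        tree["type"] = typing
    return trie


def search(trie, word):
    """Return (matched_length, typing) for the longest known prefix."""
    tree, res, step = trie, None, 0
    for char in word:
        if char not in tree:
            break
        tree = tree[char]
        step += 1
        if "type" in tree:
            res = (step, tree["type"])
    return res if res else (1, None)


location_trie = build({"1340054": "山东 济南"})  # hypothetical entry
number = "13400541234"
_, location = search(location_trie, number[:7])
province, city = location.split(" ")
miss = search(location_trie, "1999999")
print(province, city, miss)
```

For a number that is missing from the dictionary, location is None and location.split raises AttributeError, which is the error the batch script at the end of this post catches.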

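Landline numbers follow much the same logic, with one extra rule for the area code. If no known area code is found, province and city are returned as None. If your requirements for landline numbers are looser, you can adjust this rule, for example by dropping the area-code check. The region of a landline is looked up directly in area_code_location_dict, with no trie needed. The source also has separate standalone methods for landline and cell phone matching; if you don't use them, remove them to cut down on redundant code.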
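The area-code rule is small enough to try in isolation. The pattern below is copied from the listing at the end of this post; the function name landline_location and the one-entry dictionary are made up for the example.

```python
import re

# Pattern copied from the full listing in this post.
LANDLINE_PHONE_AREA_CODE_PATTERN = r'(0\d{2,3})[\)） —-]'
area_code_pattern = re.compile(LANDLINE_PHONE_AREA_CODE_PATTERN)

# Hypothetical stand-in for area_code_location_dict.
area_code_location = {"0531": "山东 济南"}


def landline_location(text):
    """Return (province, city) for a landline, or (None, None)."""
    m = area_code_pattern.search(text)
    if m is None:
        return None, None
    # Unknown codes fall back to ' ', which splits into two empty strings.
    province, city = area_code_location.get(m.group(1), " ").split(" ")
    if province == "":
        return None, None
    return province, city


print(landline_location("0531-88888888"))
print(landline_location("010-12345678"))
print(landline_location("88888888"))
```

A plain dict is enough here because the area code is extracted whole by the regex, so no prefix matching is needed.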

Practical use

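In practice I used two approaches: a UDF and a dimension table. The UDF only determines, from the returned type, whether a contact value is usable. The dimension table was written with later needs in mind. The former does not parse out the number itself; the latter extracts the correct number.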

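Because of data issues, some values will not be recognized accurately. The txt dictionary shipped with the source is three years old, so there is roughly a one in a hundred thousand chance, or even less, that a number is missing from it. One idea I had for these rare numbers was to search Baidu directly with requests and parse the HTML to see whether the value is a phone number. Development time was limited, though, and given MaxCompute's whitelist restrictions, I did not go that route.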

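Inaccurate recognition also covers data that might be valid. Take "dsfjasd13934720013fasdf": the number embedded in it is well formed, but does the value really count as a phone number? That one is a real headache.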
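One possible policy, which is not what the code in this post does, is to accept a value only when the whole trimmed string is a number. The sketch below reuses CELL_PHONE_CHECK_PATTERN from the listing with re.fullmatch; the function name is_bare_cell_phone is hypothetical.

```python
import re

# Pattern copied from the full listing in this post.
CELL_PHONE_CHECK_PATTERN = r'((1[3-9][0-9]))([- ])?\d{4}([- ])?\d{4}'
strict_pattern = re.compile(CELL_PHONE_CHECK_PATTERN)


def is_bare_cell_phone(text):
    """True only if the entire trimmed value is one cell phone number."""
    return strict_pattern.fullmatch(text.strip()) is not None


print(is_bare_cell_phone("13934720013"))
print(is_bare_cell_phone("139-3472-0013"))
print(is_bare_cell_phone("dsfjasd13934720013fasdf"))
```

Whether rejecting embedded numbers is right depends on the data: a stricter rule avoids false positives in free text but also discards values like "tel:13934720013".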

The actual code

from odps import ODPS
from odps.df import DataFrame
import hashlib
from datetime import datetime, timedelta
import re
import json
import os
import sys
import time

##@resource_reference{"phone_location.txt"}
##@resource_reference{"telecom_operator.txt"}
sys.path.append(os.path.dirname('phone_location.txt'))
sys.path.append(os.path.dirname('telecom_operator.txt'))

# Target table
insert_table = 'wwwx_cdm_dev.dim_wwwx_ppd_phone_detail_di'
no_clean_phone = 'old_phone'
# Source table name: column name
table_name_col_name = {
    "wwwx_cdm.dwd_wwwx_ppd_order_master_df": "user_phone",
    "wwwx_cdm.dim_wwwx_ppd_cw_user_df": "phone",
}

# Cell phone numbers
CELL_PHONE_PATTERN = r'(?<=[^\d])(((\+86)?([- ])?)?((1[3-9][0-9]))([- ])?\d{4}([- ])?\d{4})(?=[^\d])'
LANDLINE_PHONE_PATTERN = r'(?<=[^\d])(([\(（])?0\d{2,3}[\)） —-]{1,2}\d{7,8}|\d{3,4}[ -]\d{3,4}[ -]\d{4})(?=[^\d])'
# Used to extract and check a number's home location, i.e. the first three and middle four digits
CELL_PHONE_CHECK_PATTERN = r'((1[3-9][0-9]))([- ])?\d{4}([- ])?\d{4}'
LANDLINE_PHONE_CHECK_PATTERN = r'(([\(（])?0\d{2,3}[\)） —-]{1,2}\d{7,8}|\d{3,4}[ -]\d{3,4}[ -]\d{4})'
# Use the separator to find the leading area code
LANDLINE_PHONE_AREA_CODE_PATTERN = r'(0\d{2,3})[\)） —-]'
# Number segments for every region
GRAND_DIR_PATH = 'phone_location.txt'
# Mapping from number prefixes to operators
TELE_DIR_PATH = 'telecom_operator.txt'


def extract_base(pattern, text, with_offset=False):
    """ Base function of the regex extractor.

    Args:
        pattern(re.compile): compiled regular expression object
        text(str): input text
        with_offset(bool): whether to include the offset (position of the extracted field in the text)

    Returns:
        list: the extracted results
    """
    if with_offset:
        # The text is padded with one '#' in front, hence the -1.
        results = [{'text': item.group(1),
                    'offset': (item.span()[0] - 1, item.span()[1] - 1)}
                   for item in pattern.finditer(text)]
    else:
        results = [item.group(1) for item in pattern.finditer(text)]
    return results


def extract_phone_number(text, detail=False):
    """ Extract phone numbers from text.

    Args:
        text(str): input text
        detail(bool): whether to include the offset (position of the number in the text)

    Returns:
        list: list of phone numbers
    """
    cell_phone_pattern = re.compile(CELL_PHONE_PATTERN)
    landline_phone_pattern = re.compile(LANDLINE_PHONE_PATTERN)
    # Pad both ends so the lookbehind/lookahead can match at the text boundaries.
    text = ''.join(['#', text, '#'])
    cell_results = extract_base(
        cell_phone_pattern, text, with_offset=detail)
    landline_results = extract_base(
        landline_phone_pattern, text, with_offset=detail)
    if not detail:
        return cell_results + landline_results
    else:
        detail_results = list()
        for item in cell_results:
            item.update({'type': 'cell_phone'})
            detail_results.append(item)
        for item in landline_results:
            item.update({'type': 'landline_phone'})
            detail_results.append(item)
        return detail_results


def read_file_by_line(file_path, line_num=None,
                      skip_empty_line=True, strip=True,
                      auto_loads_json=True):
    """ Read the first N lines of a file and return them as a list.

    The file is organized by line and must be utf-8 encoded natural-language text.
    Lines in json format can be loaded automatically.

    Args:
        file_path(str): path of the file
        line_num(int): number of lines to read; reads all lines if not given
        skip_empty_line(boolean): whether to skip empty lines
        strip: whether to call strip() on each line
        auto_loads_json(bool): whether to load each line as json, default True

    Returns:
        list: the contents of line_num lines

    Examples:
        >>> file_path = '/path/to/stopwords.txt'
        >>> print(jio.read_file_by_line(file_path, line_num=3))
        # ['在', '然后', '还有']
    """
    content_list = list()
    count = 0
    with open(file_path, 'r', encoding='utf-8') as f:
        line = f.readline()
        while True:
            if line == '':  # completely empty line: end of file
                break
            if line_num is not None:
                if count >= line_num:
                    break
            if line.strip() == '':
                if skip_empty_line:
                    count += 1
                    line = f.readline()
                else:
                    try:
                        if auto_loads_json:
                            cur_obj = json.loads(line.strip())
                            content_list.append(cur_obj)
                        else:
                            if strip:
                                content_list.append(line.strip())
                            else:
                                content_list.append(line)
                    except Exception:
                        if strip:
                            content_list.append(line.strip())
                        else:
                            content_list.append(line)
                    count += 1
                    line = f.readline()
                    continue
            else:
                try:
                    if auto_loads_json:
                        cur_obj = json.loads(line.strip())
                        content_list.append(cur_obj)
                    else:
                        if strip:
                            content_list.append(line.strip())
                        else:
                            content_list.append(line)
                except Exception:
                    if strip:
                        content_list.append(line.strip())
                    else:
                        content_list.append(line)
                count += 1
                line = f.readline()
                continue
    return content_list


def phone_location_loader():
    """ Load the dictionaries for phone number location and operator lookup. """
    content = read_file_by_line(os.path.join(
        GRAND_DIR_PATH),
        strip=False, auto_loads_json=False)

    def return_all_num(line):
        """ Return every prefix plus middle-four-digit string on this line. """
        front, info = line.strip().split('\t')
        num_string_list = info.split(',')
        result_list = []
        for num_string in num_string_list:
            if '-' in num_string:
                start_num, end_num = num_string.split('-')
                for i in range(int(start_num), int(end_num) + 1):
                    result_list.append('{:0>4d}'.format(i))
            else:
                result_list.append(num_string)
        result_list = [front + res for res in result_list]
        return result_list

    phone_location_dict = {}
    cur_location = ''
    zip_code_location_dict = {}
    area_code_location_dict = {}
    for line in content:
        if line.startswith('\t'):
            res = return_all_num(line)
            for i in res:
                phone_location_dict.update({i: cur_location})
        else:
            cur_location, area_code, zip_code = line.strip().split('\t')
            zip_code_location_dict.update({zip_code: cur_location})
            area_code_location_dict.update({area_code: cur_location})
    return phone_location_dict, zip_code_location_dict, area_code_location_dict


def telecom_operator_loader():
    """ Load the dictionary that maps cell phone prefixes to telecom operators. """
    telecom_operator = read_file_by_line(os.path.join(TELE_DIR_PATH))
    telecom_operator_dict = dict()
    for line in telecom_operator:
        num, operator = line.strip().split(' ')
        telecom_operator_dict.update({num: operator})
    return telecom_operator_dict


def get_one_phone(oldphone, phone):
    "Unique key for the dimension table."
    md5_obj = hashlib.md5()
    md5_obj.update((oldphone + phone).encode('utf-8'))
    return md5_obj.hexdigest()


def get_data_sql(table_name, col_name, ds):
    "Build the SQL that collects the values not yet cleaned."
    drop_sql = "drop table if exists wwwx_cdm_dev.phone_tmp_di;"
    sql_text = "create table wwwx_cdm_dev.phone_tmp_di as select distinct {} as tmp_phone from {} where ds={} and {} not in (select distinct {} from {} where ds<={} and {} is not null);".format(col_name, table_name, ds, col_name, no_clean_phone, insert_table, ds, no_clean_phone)
    return drop_sql, sql_text, "wwwx_cdm_dev.phone_tmp_di"


def get_table_phone_n_data(table_name, ds, table_name_col_name):
    "Yield the values that still need to be processed."
    col_name = table_name_col_name[table_name]
    drop_sql, create_sql, new_table_name = get_data_sql(table_name, col_name, ds)
    try:
        # Rows filtered into the temporary table
        # (`odps` is the entry object provided by the DataWorks PyODPS node)
        no_clean_master = DataFrame(odps.get_table(new_table_name))
        clean_phone = no_clean_master[no_clean_master.tmp_phone.notnull()][['tmp_phone']].distinct()
        result = clean_phone.execute()
        for phont_item in result:
            for item_val in list(phont_item):
                yield item_val
    except Exception:
        print("Data conversion error")


def regex_phone(old_phone, phone_location_x):
    """
    Yield the parsed details of a phone value:
    unique key, raw value, cleaned number, province, city, cell/landline type,
    operator, whether it is a recognizable number, whether it was cleaned,
    whether it was written because of an error.
    If the value cannot be parsed, only the raw value is filled in and the
    recognizable flag is False.
    """
    def get_phone_data(old_phone, text_phone, phone_location_x):
        "Look up the details of a number that was extracted first."
        text_phone = text_phone.replace(" ", "")
        phone_location = phone_location_x(text_phone)
        if phone_location['type'] == 'cell_phone':
            return [get_one_phone(old_phone, text_phone), old_phone, text_phone, phone_location['province'], phone_location['city'], phone_location['type'], phone_location['operator'], True, True, False]
        elif phone_location['type'] == 'unknown':
            return [old_phone, old_phone, None, None, None, None, None, False, False, False]
        else:
            return [get_one_phone(old_phone, text_phone), old_phone, text_phone, phone_location['province'], phone_location['city'], phone_location['type'], None, True, True, False]

    def get_phone_one_data(old_phone, phone_location_x):
        "Some values cannot be extracted first but can still be looked up directly."
        phone_location = phone_location_x(old_phone.replace(" ", ""))
        if phone_location['type'] == 'cell_phone':
            return [old_phone, old_phone, old_phone, phone_location['province'], phone_location['city'], phone_location['type'], phone_location['operator'], True, False, False]
        elif phone_location['type'] == 'unknown':
            return [old_phone, old_phone, None, None, None, None, None, False, False, False]
        else:
            return [old_phone, old_phone, old_phone, phone_location['province'], phone_location['city'], phone_location['type'], None, True, False, False]

    phone_message = extract_phone_number(old_phone, detail=True)
    if len(phone_message) == 0:
        get_main_phone_message = get_phone_one_data(
            old_phone, phone_location_x)
        if len(get_main_phone_message) == 0:
            yield [old_phone, old_phone, None, None, None, None, None, False, False, False]
        else:
            yield get_main_phone_message
    else:
        for item in phone_message:
            yield get_phone_data(old_phone, item['text'], phone_location_x)


class TrieTree(object):
    """
    Basic Trie methods, used for:
    - forward maximum matching in dictionary-based NER
    - forward maximum matching in traditional/simplified word conversion
    """
    def __init__(self):
        self.dict_trie = dict()
        self.depth = 0

    def add_node(self, word, typing):
        """ Add a node to the Trie.

        Args:
            word(str): a word from the dictionary
            typing(str): the type of the word

        Returns: None
        """
        word = word.strip()
        if word not in ['', '\t', ' ', '\r']:
            tree = self.dict_trie
            depth = len(word)
            word = word.lower()  # convert all letters to lowercase
            for char in word:
                if char in tree:
                    tree = tree[char]
                else:
                    tree[char] = dict()
                    tree = tree[char]  # descend into the new child level
            if depth > self.depth:
                self.depth = depth
            if 'type' in tree and tree['type'] != typing:
                print(
                    '`{}` belongs to both `{}` and `{}`.'.format(
                        word, tree['type'], typing))
            else:
                tree['type'] = typing

    def build_trie_tree(self, dict_list, typing):
        """ Build the trie. """
        for word in dict_list:
            self.add_node(word, typing)

    def search(self, word):
        """ Search word for the longest prefix that matches a dictionary entry.

        Returns (step, type) for the longest match,
        or (1, None) if nothing in the dictionary matches.
        """
        tree = self.dict_trie
        res = None
        step = 0  # step counts the index position
        for char in word:
            if char in tree:
                tree = tree[char]
                step += 1
                if 'type' in tree:
                    res = (step, tree['type'])
            else:
                break
        if res:
            return res
        return 1, None


class PhoneLocation(object):
    """ Return the home location, area code, operator, etc. of a given phone number.

    Meant to be used together with jio.extract_phone_number.

    Args:
        text(str): phone number text. Works best on the results returned by jio.extract_phone_number.
            Pass only phone number text such as "86-17309729105", "13499013052", "021 60128421";
            a number like "81203432" has no corresponding location.
            Text such as "343981217799212723" will be misrecognized; extract the phone number
            from it first, then look up location, area code, and operator.

    Returns:
        dict: the type, location, and operator of the number

    Examples:
        # [{'text': '13288568202', 'offset': (5, 16), 'type': 'cell_phone'},
           {'text': '(021)32830431', 'offset': (18, 31), 'type': 'landline_phone'}]
        # {'number': '(021)32830431', 'province': '上海', 'city': '上海', 'type': 'landline_phone'}
        # {'number': '13288568202', 'province': '广东', 'city': '揭阳',
           'type': 'cell_phone', 'operator': '中国联通'}
    """
    def __init__(self):
        self.cell_phone_location_trie = None

    def _prepare(self):
        """ Load the dictionaries. """
        cell_phone_location, zip_code_location, area_code_location = phone_location_loader()
        self.zip_code_location = zip_code_location
        self.area_code_location = area_code_location
        self.cell_phone_location_trie = TrieTree()
        for num, loc in cell_phone_location.items():
            self.cell_phone_location_trie.add_node(num, loc)
        self.cell_phone_pattern = re.compile(CELL_PHONE_CHECK_PATTERN)
        self.landline_phone_pattern = re.compile(LANDLINE_PHONE_CHECK_PATTERN)
        self.landline_area_code_pattern = re.compile(
            LANDLINE_PHONE_AREA_CODE_PATTERN)
        # Operator dictionary
        telecom_operator = telecom_operator_loader()
        self.telecom_operator_trie = TrieTree()
        for num, loc in telecom_operator.items():
            self.telecom_operator_trie.add_node(num, loc)

    def __call__(self, text):
        """ Take a piece of phone number text and return the result. """
        if self.cell_phone_location_trie is None:
            self._prepare()
        res = self.cell_phone_pattern.search(text)
        if res is not None:  # matched a cell phone number
            cell_phone_number = res.group()
            first_seven = cell_phone_number[:7]
            _, location = self.cell_phone_location_trie.search(first_seven)
            # location is None for numbers missing from the dictionary,
            # which raises AttributeError here (handled by the caller).
            province, city = location.split(' ')
            # print(province, city)
            _, operator = self.telecom_operator_trie.search(
                cell_phone_number[:4])
            return {'number': text, 'province': province, 'city': city,
                    'type': 'cell_phone', 'operator': operator}
        res = self.landline_phone_pattern.search(text)
        if res is not None:  # matched a landline number
            # Extract the landline area code
            res = self.landline_area_code_pattern.search(text)
            if res is not None:
                area_code = res.group(1)
                province, city = self.area_code_location.get(
                    area_code, ' ').split(' ')
                if province == '':
                    province, city = None, None
                return {'number': text, 'province': province,
                        'city': city, 'type': 'landline_phone'}
            else:
                return {'number': text, 'province': None,
                        'city': None, 'type': 'landline_phone'}
        return {'number': text, 'province': None,
                'city': None, 'type': 'unknown'}


if __name__ == "__main__":
    # `args` and `odps` are globals provided by the DataWorks PyODPS node.
    print("Started: {}".format(datetime.now()))
    ds = args['ds']
    master_table = odps.get_table(insert_table)
    # Drop the partition if it already exists
    master_table.delete_partition('ds={}'.format(ds), if_exists=True)
    for item_t in table_name_col_name:
        num = 0
        drop_sql, create_sql, new_table_name = get_data_sql(item_t, table_name_col_name[item_t], ds)
        instance1 = odps.run_sql(drop_sql)
        while str(instance1.status) in ['Status.RUNNING', 'Status.WAITING']:
            time.sleep(1)  # poll once per second instead of spinning
        instance2 = odps.run_sql(create_sql)
        while str(instance2.status) in ['Status.RUNNING', 'Status.WAITING']:
            time.sleep(1)
        datan = get_table_phone_n_data(item_t, ds, table_name_col_name)
        phone_location_x = PhoneLocation()
        result_phone = []
        for data in datan:
            try:
                for item_phone in regex_phone(data, phone_location_x):
                    result_phone.append(item_phone)
            except AttributeError:
                # Number not in the dictionary: write the row flagged as an error.
                result_phone.append(
                    [data, data, None, None, None, None, None, False, False, True])
            if len(result_phone) > 134000:
                odps.write_table(insert_table, result_phone,
                                 partition='ds={}'.format(ds), create_partition=True)
                result_phone.clear()
                num = num + 134000
                print("Large data volume: {} rows processed so far, current time {}".format(num, datetime.now()))
        odps.write_table(insert_table, result_phone,
                         partition='ds={}'.format(ds), create_partition=True)
        print("Table {} finished: {}".format(item_t, datetime.now()))
    print("Finished: {}".format(datetime.now()))

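When using PyODPS in DataWorks, resources are referenced with the ##@resource_reference{...} comment; after that they can be accessed through os paths as usual. Because the job involves a NOT IN across several tables, DataFrame was slower than run_sql, so I used SQL statements directly. run_sql executes asynchronously, so the instance status has to be checked to know when it has finished. The main logic is in regex_phone. If extract_phone_number finds nothing, phone_location is called on the raw value, because it relies on both the dictionary and the regular expressions and is therefore more accurate than extract_phone_number alone.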

# coding: utf-8
from odps.udf import annotate
from odps.distcache import get_cache_file
import re
import json

# Cell phone numbers
CELL_PHONE_PATTERN = r'(?<=[^\d])(((\+86)?([- ])?)?((1[3-9][0-9]))([- ])?\d{4}([- ])?\d{4})(?=[^\d])'
LANDLINE_PHONE_PATTERN = r'(?<=[^\d])(([\(（])?0\d{2,3}[\)） —-]{1,2}\d{7,8}|\d{3,4}[ -]\d{3,4}[ -]\d{4})(?=[^\d])'
# Used to extract and check a number's home location, i.e. the first three and middle four digits
CELL_PHONE_CHECK_PATTERN = r'((1[3-9][0-9]))([- ])?\d{4}([- ])?\d{4}'
LANDLINE_PHONE_CHECK_PATTERN = r'(([\(（])?0\d{2,3}[\)） —-]{1,2}\d{7,8}|\d{3,4}[ -]\d{3,4}[ -]\d{4})'
# Use the separator to find the leading area code
LANDLINE_PHONE_AREA_CODE_PATTERN = r'(0\d{2,3})[\)） —-]'


def extract_base(pattern, text, with_offset=False):
    """ Base function of the regex extractor.

    Args:
        pattern(re.compile): compiled regular expression object
        text(str): input text
        with_offset(bool): whether to include the offset (position of the extracted field in the text)

    Returns:
        list: the extracted results
    """
    if with_offset:
        results = [{'text': item.group(1),
                    'offset': (item.span()[0] - 1, item.span()[1] - 1)}
                   for item in pattern.finditer(text)]
    else:
        results = [item.group(1) for item in pattern.finditer(text)]
    return results


def extract_phone_number(text, detail=False):
    """ Extract phone numbers from text.

    Args:
        text(str): input text
        detail(bool): whether to include the offset (position of the number in the text)

    Returns:
        list: list of phone numbers
    """
    cell_phone_pattern = re.compile(CELL_PHONE_PATTERN)
    landline_phone_pattern = re.compile(LANDLINE_PHONE_PATTERN)
    text = ''.join(['#', text, '#'])
    cell_results = extract_base(
        cell_phone_pattern, text, with_offset=detail)
    landline_results = extract_base(
        landline_phone_pattern, text, with_offset=detail)
    if not detail:
        return cell_results + landline_results
    else:
        detail_results = list()
        for item in cell_results:
            item.update({'type': 'cell_phone'})
            detail_results.append(item)
        for item in landline_results:
            item.update({'type': 'landline_phone'})
            detail_results.append(item)
        return detail_results


def read_file_by_line(file_path, line_num=None,
                      skip_empty_line=True, strip=True,
                      auto_loads_json=True):
    """ Read the first N lines of a file and return them as a list.

    The file is organized by line and must be utf-8 encoded natural-language text.
    Lines in json format can be loaded automatically.

    Args:
        file_path: an open file object (here the handle returned by get_cache_file)
        line_num(int): number of lines to read; reads all lines if not given
        skip_empty_line(boolean): whether to skip empty lines
        strip: whether to call strip() on each line
        auto_loads_json(bool): whether to load each line as json, default True

    Returns:
        list: the contents of line_num lines
        # ['在', '然后', '还有']
    """
    content_list = list()
    count = 0
    with file_path as f:
        line = f.readline()
        while True:
            if line == '':  # completely empty line: end of file
                break
            if line_num is not None:
                if count >= line_num:
                    break
            if line.strip() == '':
                if skip_empty_line:
                    count += 1
                    line = f.readline()
                else:
                    try:
                        if auto_loads_json:
                            cur_obj = json.loads(line.strip())
                            content_list.append(cur_obj)
                        else:
                            if strip:
                                content_list.append(line.strip())
                            else:
                                content_list.append(line)
                    except Exception:
                        if strip:
                            content_list.append(line.strip())
                        else:
                            content_list.append(line)
                    count += 1
                    line = f.readline()
                    continue
            else:
                try:
                    if auto_loads_json:
                        cur_obj = json.loads(line.strip())
                        content_list.append(cur_obj)
                    else:
                        if strip:
                            content_list.append(line.strip())
                        else:
                            content_list.append(line)
                except Exception:
                    if strip:
                        content_list.append(line.strip())
                    else:
                        content_list.append(line)
                count += 1
                line = f.readline()
                continue
    return content_list


def phone_location_loader(GRAND_DIR_PATH):
    """ Load the dictionaries for phone number location and operator lookup. """
    content = read_file_by_line(
        GRAND_DIR_PATH,
        strip=False, auto_loads_json=False)

    def return_all_num(line):
        """ Return every prefix plus middle-four-digit string on this line. """
        front, info = line.strip().split('\t')
        num_string_list = info.split(',')
        result_list = []
        for num_string in num_string_list:
            if '-' in num_string:
                start_num, end_num = num_string.split('-')
                for i in range(int(start_num), int(end_num) + 1):
                    result_list.append('{:0>4d}'.format(i))
            else:
                result_list.append(num_string)
        result_list = [front + res for res in result_list]
        return result_list

    phone_location_dict = {}
    cur_location = ''
    zip_code_location_dict = {}
    area_code_location_dict = {}
    for line in content:
        if line.startswith('\t'):
            res = return_all_num(line)
            for i in res:
                phone_location_dict.update({i: cur_location})
        else:
            cur_location, area_code, zip_code = line.strip().split('\t')
            zip_code_location_dict.update({zip_code: cur_location})
            area_code_location_dict.update({area_code: cur_location})
    return phone_location_dict, zip_code_location_dict, area_code_location_dict


def telecom_operator_loader(TELE_DIR_PATH):
    """ Load the dictionary that maps cell phone prefixes to telecom operators. """
    telecom_operator = read_file_by_line(TELE_DIR_PATH)
    telecom_operator_dict = dict()
    for line in telecom_operator:
        num, operator = line.strip().split(' ')
        telecom_operator_dict.update({num: operator})
    return telecom_operator_dict


class TrieTree(object):
    """
    Basic Trie methods, used for:
    - forward maximum matching in dictionary-based NER
    - forward maximum matching in traditional/simplified word conversion
    """
    def __init__(self):
        self.dict_trie = dict()
        self.depth = 0

    def add_node(self, word, typing):
        """ Add a node to the Trie.

        Args:
            word(str): a word from the dictionary
            typing(str): the type of the word

        Returns: None
        """
        word = word.strip()
        if word not in ['', '\t', ' ', '\r']:
            tree = self.dict_trie
            depth = len(word)
            word = word.lower()  # convert all letters to lowercase
            for char in word:
                if char in tree:
                    tree = tree[char]
                else:
                    tree[char] = dict()
                    tree = tree[char]
            if depth > self.depth:
                self.depth = depth
            if 'type' in tree and tree['type'] != typing:
                print(
                    '`{}` belongs to both `{}` and `{}`.'.format(
                        word, tree['type'], typing))
            else:
                tree['type'] = typing

    def build_trie_tree(self, dict_list, typing):
        """ Build the trie. """
        for word in dict_list:
            self.add_node(word, typing)

    def search(self, word):
        """ Search word for the longest prefix that matches a dictionary entry.

        Returns (step, type) for the longest match,
        or (1, None) if nothing in the dictionary matches.
        """
        tree = self.dict_trie
        res = None
        step = 0  # step counts the index position
        for char in word:
            if char in tree:
                tree = tree[char]
                step += 1
                if 'type' in tree:
                    res = (step, tree['type'])
            else:
                break
        if res:
            return res
        return 1, None


class PhoneLocation(object):
    """ Return the home location, area code, operator, etc. of a given phone number.

    Meant to be used together with jio.extract_phone_number.

    Args:
        text(str): phone number text. Works best on the results returned by jio.extract_phone_number.
            Pass only phone number text such as "86-17309729105", "13499013052", "021 60128421";
            a number like "81203432" has no corresponding location.
            Text such as "343981217799212723" will be misrecognized; extract the phone number
            from it first, then look up location, area code, and operator.

    Returns:
        dict: the type, location, and operator of the number

    Examples:
        # [{'text': '13288568202', 'offset': (5, 16), 'type': 'cell_phone'},
           {'text': '(021)32830431', 'offset': (18, 31), 'type': 'landline_phone'}]
        # {'number': '(021)32830431', 'province': '上海', 'city': '上海', 'type': 'landline_phone'}
        # {'number': '13288568202', 'province': '广东', 'city': '揭阳',
           'type': 'cell_phone', 'operator': '中国联通'}
    """
    def __init__(self, GRAND_DIR_PATH, TELE_DIR_PATH):
        self.cell_phone_location_trie = None
        self.GRAND_DIR_PATH = GRAND_DIR_PATH
        self.TELE_DIR_PATH = TELE_DIR_PATH

    def _prepare(self):
        """ Load the dictionaries. """
        cell_phone_location, zip_code_location, area_code_location = phone_location_loader(self.GRAND_DIR_PATH)
        self.zip_code_location = zip_code_location
        self.area_code_location = area_code_location
        self.cell_phone_location_trie = TrieTree()
        for num, loc in cell_phone_location.items():
            self.cell_phone_location_trie.add_node(num, loc)
        self.cell_phone_pattern = re.compile(CELL_PHONE_CHECK_PATTERN)
        self.landline_phone_pattern = re.compile(LANDLINE_PHONE_CHECK_PATTERN)
        self.landline_area_code_pattern = re.compile(
            LANDLINE_PHONE_AREA_CODE_PATTERN)
        # Operator dictionary
        telecom_operator = telecom_operator_loader(self.TELE_DIR_PATH)
        self.telecom_operator_trie = TrieTree()
        for num, loc in telecom_operator.items():
            self.telecom_operator_trie.add_node(num, loc)

    def __call__(self, text):
        """ Take a piece of phone number text and return the result. """
        if self.cell_phone_location_trie is None:
            self._prepare()
        res = self.cell_phone_pattern.search(text)
        if res is not None:  # matched a cell phone number
            cell_phone_number = res.group()
            first_seven = cell_phone_number[:7]
            _, location = self.cell_phone_location_trie.search(first_seven)
            province, city = location.split(' ')
            # print(province, city)
            _, operator = self.telecom_operator_trie.search(
                cell_phone_number[:4])
            return {'number': text, 'province': province, 'city': city,
                    'type': 'cell_phone', 'operator': operator}
        res = self.landline_phone_pattern.search(text)
        if res is not None:  # matched a landline number
            # Extract the landline area code
            res = self.landline_area_code_pattern.search(text)
            if res is not None:
                area_code = res.group(1)
                province, city = self.area_code_location.get(
                    area_code, ' ').split(' ')
                if province == '':
                    province, city = None, None
                return {'number': text, 'province': province,
                        'city': city, 'type': 'landline_phone'}
            else:
                return {'number': text, 'province': None,
                        'city': None, 'type': 'landline_phone'}
        return {'number': text, 'province': None,
                'city': None, 'type': 'unknown'}


@annotate("string->string")
class PhoneClean(object):
    def __init__(self):
        # Referenced resources
        self.GRAND_DIR_PATH = get_cache_file('phone_location.txt')
        self.TELE_DIR_PATH = get_cache_file('telecom_operator.txt')
        self.phone_location_x = PhoneLocation(self.GRAND_DIR_PATH, self.TELE_DIR_PATH)

    def evaluate(self, arg0):
        if arg0 is None or len(str(arg0)) == 0:
            return None
        else:
            try:
                regex_phone = self.phone_location_x
                new_text = str(arg0).replace(" ", '')
                # Return only the type: 'cell_phone', 'landline_phone' or 'unknown'
                return regex_phone(new_text)['type']
            except Exception:
                return None