A keyword indexer for Word .docx documents

2023-02-16

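Preface

I had long wanted a tool that walks a directory tree and searches the Word .docx documents in it for a keyword. I had several goals in mind:

1. At the start of a project, look up features by keyword, because I simply cannot remember which document covers what;

2. At wrap-up, check whether any TODO/TBD items were left unresolved in the documents;

3. Check whether key information was left unrevised, in particular whether deliverables going to external parties still contain details from an old project;

4. As a further extension, support wildcard (fuzzy) search and global replace, similar to grep and sed on Linux.

Then I got busy and shelved the idea. Now that I have a three-day holiday, I decided to give it a go.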

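Demo

This is what I planned the software to look like:

But I really could not be bothered to lay out the widgets, so it ended up looking like this:

Anyway, looks do not matter; an interesting soul does!

To use it, first select a directory:

Then enter the keyword to search for and click "Start check":

Starting from the selected directory, the program recursively checks the contents of all .docx documents beneath it:

It then generates a .report file in the selected directory:

Open it to see the results:

That is as far as I have got. It seems usable, and I will keep optimizing it when I have time. Finding files with the recursive function is really too slow; it would be nice to look at how Everything does it.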

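Mid-show ad break

For small programs like this, I am very fond of Geany, a lightweight editor/IDE. It is frighteningly easy to pick up, running and debugging feel comfortable, the interface is especially friendly to beginners, and it is free and open source with a one-click install:

Now, back to the main text.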

Preparation

The libraries used this time are:

    import tkinter
    import os
    import docx  # python-docx, only used in the first attempt
    import re
    from zipfile import ZipFile
    from bs4 import BeautifulSoup
    from tkinter import *

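I originally used the docx library, but it turned out that some files could not be opened, so I switched to the zipfile library. After all (quick trivia time), every .docx document is essentially a zip archive. For example, I have a .docx document here:

Opening it with an archive tool reveals its internal structure:

Digging further in, you can see the source of the content:

So when parsing .docx documents in Python, the zipfile library is enough.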
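The same inspection can be done from Python. The following is a minimal sketch, not the author's code: the helper names `list_docx_parts` and `read_main_xml` are made up for illustration, and it assumes the main body lives in the part `word/document.xml`, which is the part the rest of this article reads.

```python
from zipfile import ZipFile


def list_docx_parts(docx_file):
    """Return the names of the parts stored inside a .docx archive."""
    with ZipFile(docx_file) as archive:
        return archive.namelist()


def read_main_xml(docx_file):
    """Return the raw XML of the main document body as a string."""
    with ZipFile(docx_file) as archive:
        return archive.read("word/document.xml").decode("utf-8")
```

Both helpers accept either a path or an open binary file object, because `ZipFile` accepts both.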

Key processing

Walk the root directory and collect all .docx files under it. I found this one online and changed a couple of lines:

    def get_process_files(root_dir):
        """Recursively collect the full paths of all .docx files under root_dir."""
        cur_dir = os.path.abspath(root_dir)
        process_list = []
        dir_extra_list = []

        for name in os.listdir(cur_dir):
            # os.path.join picks the right separator on every platform
            fullfile = os.path.join(cur_dir, name)
            if os.path.isfile(fullfile) and fullfile.endswith(".docx"):
                process_list.append(fullfile)
            elif os.path.isdir(fullfile):
                # Descend into the subdirectory and remember what it finds
                dir_extra_list.extend(get_process_files(fullfile))

        # Files from subdirectories go after the files of this directory
        process_list.extend(dir_extra_list)
        return process_list
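Since the recursive search was described earlier as slow, here is one possible alternative based on the standard library's `os.walk`. This is only a sketch: the name `find_docx_files` is hypothetical, and whether it is noticeably faster on a given directory tree is an assumption that would need measuring. `os.walk` is built on `os.scandir`, which on many platforms avoids the extra `isfile`/`isdir` call per entry. The sketch also skips the `~$` lock files that Word creates next to a document while it is open, since those are not real documents.

```python
import os


def find_docx_files(root_dir):
    """Collect .docx paths under root_dir using os.walk instead of manual recursion."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            # Skip Word's "~$name.docx" lock files; they are not valid archives
            if name.lower().endswith(".docx") and not name.startswith("~$"):
                matches.append(os.path.join(dirpath, name))
    # Sort so the result does not depend on directory listing order
    return sorted(matches)
```

It would be called the same way as `get_process_files`, with the selected directory as its only argument.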

Check whether a .docx file contains the keyword:

    ignore_list = []  # files that could not be opened

    def search_keyword(filename, key_word):
        """Return True if key_word (a regular expression) matches any text run."""
        try:
            with ZipFile(filename) as document:
                xml = document.read("word/document.xml")
            # features="lxml" requires the lxml package to be installed
            wordObj = BeautifulSoup(xml.decode("utf-8"), features="lxml")
            texts = wordObj.find_all("w:t")
        except Exception:
            # Not a valid zip, or no main document part: record it and move on
            ignore_list.append(filename + " could not be opened")
            return False

        for text in texts:
            try:
                if re.search(key_word, text.text):
                    return True
            except re.error:
                # key_word is not a valid regular expression
                return False
        return False
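One limitation of matching each `w:t` element on its own is that Word may split a single word across several runs, so a keyword such as TODO can end up stored as "TO" plus "DO" and never match. The sketch below is one way around that, not the author's method: it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree`, joins all runs of a paragraph before searching, and treats the keyword as literal text unless `use_regex` is set. The names `paragraph_texts` and `contains_keyword` are hypothetical, and the namespace URI is the usual one for WordprocessingML; files saved in the "strict" Open XML variant use a different URI and would need it adjusted.

```python
import re
import xml.etree.ElementTree as ET
from zipfile import ZipFile

# Assumed namespace of w:p and w:t in ordinary (transitional) .docx files
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_texts(docx_file):
    """Yield the text of each paragraph, joining all of its w:t runs."""
    with ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for para in root.iter(W_NS + "p"):
        yield "".join(t.text or "" for t in para.iter(W_NS + "t"))


def contains_keyword(docx_file, keyword, use_regex=False):
    """Return True if any paragraph contains keyword (literal unless use_regex)."""
    # re.escape makes characters such as "(" or "+" match themselves
    pattern = re.compile(keyword if use_regex else re.escape(keyword))
    return any(pattern.search(text) for text in paragraph_texts(docx_file))
```

Like `search_keyword`, this only looks at the main document part, so text in headers, footers, and footnotes is not covered.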

All remaining operations live in tk_main:

    def tk_main():
        root = Tk()
        root.geometry("600x600")
        root.title("A so-so docx keyword finder")

This corresponds to the application's outer window.

        # Row 1: "Select directory" button plus a one-line box showing the path
        f1 = Frame(root, height=100, width=400)
        f1.pack()
        button1 = Button(f1, text='Select directory', command=get_path)
        button1.pack(side=LEFT)
        text1 = Text(f1, height=1, undo=True, autoseparators=False)
        text1.pack(side=RIGHT)

This corresponds to this part:

        # Row 2: label plus a one-line box for the keyword
        f2 = Frame(root, height=100, width=400)
        f2.pack()
        label1 = Label(f2, text="Search for")
        label1.pack(side=LEFT)
        text2 = Text(f2, height=1, undo=True, autoseparators=False)
        text2.pack(side=RIGHT)

This corresponds to this part:

        # Row 3: the two action buttons
        f3 = Frame(root, height=100, width=400)
        f3.pack()
        button2 = Button(f3, text='Start check', command=start_check)
        button2.pack(side=LEFT)
        button3 = Button(f3, text='Quit', command=root.quit)
        button3.pack(side=RIGHT)

        # Row 4: a large box for progress messages
        f4 = Frame(root, height=100, width=400)
        f4.pack()
        text3 = Text(f4, height=100, undo=True, autoseparators=False)
        text3.pack(side=RIGHT)

These correspond to this part and this part:

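The function bound to the "Select directory" button (button1) is:

        def get_path():
            from tkinter import filedialog
            tk_file_path = filedialog.askdirectory()  # the directory the user picked
            text1.delete("1.0", END)  # clear any previous selection first
            text1.insert(INSERT, tk_file_path)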

The function bound to the "Start check" button (button2) is shown below. It takes the path from the text box next to "Select directory" and the content of the text box next to "Search for", and passes them to search_key_file:

        def start_check():
            fullpath = text1.get("1.0", "end").strip()
            keyword = text2.get("1.0", "end").strip()
            full_docx = get_process_files(fullpath)
            key_docx = search_key_file(full_docx, keyword)

        def search_key_file(file_list, keyword):
            global rpt_path
            key_file = []
            ignore_list.clear()  # forget failures from a previous run
            fullpath = text1.get("1.0", "end").strip()
            rpt_path = os.path.join(fullpath, "keyword_check_" + keyword + ".report")

            for path in file_list:
                # Insert at the top so the newest message is always visible
                text3.insert("1.0", "Checking file: " + path + "\n")
                if search_keyword(path, keyword):
                    key_file.append(path)

            # Write all three sections in one pass, as UTF-8
            with open(rpt_path, "w", encoding="utf-8") as report:
                report.write("Keyword found in the following files:\n")
                for line in key_file:
                    report.write(line + "\n")
                report.write("\nThe following files could not be opened; please check them manually:\n")
                for line in ignore_list:
                    report.write(line + "\n")
                report.write("\nThe following files were checked:\n")
                for line in file_list:
                    report.write(line + "\n")

            text3.insert("1.0", "Search finished; open the REPORT file to see the results: " + rpt_path + "\n")
            return key_file
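The report-writing part can also be separated from the GUI so that it is easy to check on its own. The following is a rough sketch under that assumption; `format_report` and `write_report` are hypothetical names, and the three section headings simply mirror the ones written by search_key_file.

```python
def format_report(found, failed, checked):
    """Build the report text from three lists of file paths."""
    sections = [
        ("Keyword found in the following files:", found),
        ("The following files could not be opened; please check them manually:", failed),
        ("The following files were checked:", checked),
    ]
    blocks = []
    for heading, paths in sections:
        # One heading line followed by one path per line
        blocks.append("\n".join([heading, *paths]))
    # Blank line between sections, newline at the end of the file
    return "\n\n".join(blocks) + "\n"


def write_report(path, found, failed, checked):
    """Write the report as UTF-8 so non-ASCII file names survive."""
    with open(path, "w", encoding="utf-8") as report:
        report.write(format_report(found, failed, checked))
```

With this split, search_key_file would only need to collect the three lists and call `write_report(rpt_path, key_file, ignore_list, file_list)`.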