单位的刷脸考勤机后台系统做得比较差,只能导出每个部门的出勤统计表pdf,格式如下:
近期领导要看所有部门的考勤数据,于是动手快速写了个合并pdf并输出csv文件的脚本。
安装模块
pypdf2,pdfplumber,前者用于合并,后者用于读表格。
C:\>pip install pypdf2
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting pypdf2
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/8e/5e/c86a5643653825d3c913719e788e41386bee415c2b87b4f955432f2de6b2/pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: pypdf2
Successfully installed pypdf2-3.0.1
C:\>pip install pdfplumber
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting pdfplumber
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/f8/d3/f58c2d5d86a585e438c6708f568eca79e7c4e6ee3d5210cf8b31d38cb021/pdfplumber-0.10.3-py3-none-any.whl (48 kB)
Requirement already satisfied: pdfminer.six==20221105 in d:\program files\python\lib\site-packages (from pdfplumber) (20221105)
Requirement already satisfied: Pillow>=9.1 in d:\program files\python\lib\site-packages (from pdfplumber) (10.2.0)
Requirement already satisfied: pypdfium2>=4.18.0 in d:\program files\python\lib\site-packages (from pdfplumber) (4.25.0)
Requirement already satisfied: charset-normalizer>=2.0.0 in d:\program files\python\lib\site-packages (from pdfminer.six==20221105->pdfplumber) (3.3.2)
Requirement already satisfied: cryptography>=36.0.0 in d:\program files\python\lib\site-packages (from pdfminer.six==20221105->pdfplumber) (41.0.7)
Requirement already satisfied: cffi>=1.12 in d:\program files\python\lib\site-packages (from cryptography>=36.0.0->pdfminer.six==20221105->pdfplumber) (1.16.0)
Requirement already satisfied: pycparser in d:\program files\python\lib\site-packages (from cffi>=1.12->cryptography>=36.0.0->pdfminer.six==20221105->pdfplumber) (2.21)
Installing collected packages: pdfplumber
Successfully installed pdfplumber-0.10.3
读取、合并文件
PyPDF2
PyPDF2 用于对PDF文件的分离、合并、裁剪、转换、加密、解密等操作。
读取和合并pdf文件正好以前写过,主要代码如下:
with codecs.open(file_path, 'rb', encoding='utf-16') as file: pdf_reader = PyPDF2.PdfReader(file) text = '' for page_num in range(len(pdf_reader.pages)): tt = pdf_reader.pages[page_num].extract_text() print(tt) text += tt ...... pdfMerge = PyPDF2.PdfMerger() try: for pdf in pdfLists: pdfMerge.append(pdf, import_outline=False) pdfMerge.write(pdfFileN) pdfMerge.close print("PDF files merged successfully!") ......
表格读取
pdfplumber
pdfplumber 用于读取PDF文件文本和表格提取,功能比较均衡。
优点:
- 每页单独对象,支持文本、表格数据的抽取(亮点)
- 文本抽取:保留了文本的格式,比如换行位置有空格,可以通过这个特点将一段的文本整合
- 表格数据抽取:不会被换行数据所干扰
缺点:
- 进行文本抽取时,如果一页有文本和表格,那么抽取的文本数据也会包括表格数据
- 对于有合并单元格的表格,无法还原表格结构
- 表格数据不能100%保证和原数据一致,可能缺少几个字,可能识别出错等
- 对于无边框的表格,处理效果很差
- 流程图和时序图会对处理产生严重影响
读取代码如下:
pdf = pdfplumber.open(pdfFileN) for page in pdf.pages: tables = page.extract_tables(table_settings = {}) for table in tables: print(table)
遍历得到的是一个个二维列表,可以根据需要自己清洗数据。
程序界面
easygui
就用这个库,弄2个对话框简单了事:
选择一个文件夹,把其下的所有pdf文件合并,然后转换输出csv文件:
输出的文件格式如下:
更多easygui内容请见:
以上几样库拼凑在一起,就可以完成合并和转换pdf表格,完整代码如下:
import sys,os import datetime as dt import PyPDF2,pdfplumber import easygui as eg def get_pdf_text(file_path): with codecs.open(file_path, 'rb', encoding='utf-16') as file: pdf_reader = PyPDF2.PdfReader(file) text = '' for page_num in range(len(pdf_reader.pages)): tt = pdf_reader.pages[page_num].extract_text() print(tt) text += tt return text def strDateTime(diff=0): now = dt.datetime.now() future_time = now + dt.timedelta(days=diff) return f'{future_time.year:04}{future_time.month:02}{future_time.day:02}_{future_time.hour:02}{future_time.minute:02}{future_time.second:02}' txtStart = "PDFmerged_" try: Dir = eg.diropenbox(msg=None, title=None, default='./') pdfLists = [f for f in os.listdir(Dir) if f.endswith('.pdf') and not f.startswith(txtStart)] pdfFileN = Dir + '\\' + txtStart + strDateTime() + ".pdf" except: print('取消退出!') sys.exit(0) if len(pdfLists)==0: eg.msgbox("此文件夹没有Pdf文件!", title="结束", ok_button="Fail") sys.exit(0) else: pdfMerge = PyPDF2.PdfMerger() try: for pdf in pdfLists: pdfMerge.append(pdf, import_outline=False) pdfMerge.write(pdfFileN) pdfMerge.close print("PDF files merged successfully!") except: eg.msgbox("合并pdf失败!", title="结束", ok_button="Fail") sys.exit(0) pdf = pdfplumber.open(pdfFileN) dct = dict() for page in pdf.pages: tables = page.extract_tables(table_settings = {}) for table in tables: for lst in table: tmp = lst[1:] tmp = [tmp[0]]+tmp[3:8]+[tmp[-1]] try: tmp[0] = tmp[0].replace('\n','') tmp[0] = tmp[0].split('/') tmp[0] = tmp[0][-1] except: pass if lst[0]=='时间': dct[lst[0]] = tmp[0] else: dct[','.join([lst[0],tmp[0] if tmp[0] else ''])] = ','.join(tmp[1:]) if all(tmp[1:]) else '' pdf.close() try:os.remove(pdfFileN) except:pass try: fn = "考勤表(" + dct['时间'] + ")"+strDateTime()+".csv" except: fn = "考勤表"+strDateTime()+".csv" try: with open(fn, 'w') as f: for k,v in dct.items(): print(','.join([k,v]), file=f) eg.msgbox(f"考勤表保存成功!\n\n\n\t文件名:{fn}", title="结束", ok_button="Good!") print(f"CSV file written successfully! by HannYang {strDateTime()}") except: eg.msgbox("保存csv文件失败!", title="结束", ok_button="Fail")
后话
如果要直接输出Excel表格,则需要另安装和导入xlwt模块。实现的代码大致如下:
myxl = xlwt.Workbook() style = xlwt.easyxf('align: wrap yes; align: horiz center; font: bold yes;borders:top thin; borders:bottom thin; borders:left thin; borders:right thin;') sheet = myxl.add_sheet('考勤表') wcol = [20,40,50,75,40,75] for i in range(6): sheet.col(i).width = wcol[i]*80 sheet.write_merge(0,0,0,8,'出勤统计报表',style) style = xlwt.easyxf('borders:top thin; borders:bottom thin; borders:left thin; borders:right thin;') sheet.write_merge(1,1,0,1,'单位(盖章):',style) sheet.write_merge(2,2,0,1,'*经办人:',style) sheet.write(1,3,'填表日期:',style) sheet.write_merge(1,1,4,8,strToday(),style) sheet.write(2,3,'*联系电话:',style) sheet.write(2,2,adminName,style) sheet.write_merge(2,2,4,8,adminMobil,style) for i,t in enumerate(head.strip().split(',')): sheet.write(3,i,t,style) with open('考勤表.csv', 'r') as f: for i,row in enumerate(csv.reader(f)): if i==0:continue for j,col in enumerate(row): sheet.write(3+i,j,col,style) excelfile = 'Output_'+strDateTime()+'('+defaultValue+').xls' myxl.save(excelfile)
另外不赶时间的话,还可以用PySimpleGUI库写个带漂亮gui界面的程序,具体操作方法请参见: