Python 快速合并PDF表格转换输出CSV文件-阿里云开发者社区

Python 快速合并PDF表格转换输出CSV文件

2024-01-30 152

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： Python 快速合并PDF表格转换输出CSV文件

单位的刷脸考勤机后台系统做得比较差，只能导出每个部门的出勤统计表pdf，格式如下：

近期领导要看所有部门的考勤数据，于是动手快速写了个合并pdf并输出csv文件的脚本。

安装模块

pypdf2，pdfplumber，前者用于合并，后者用于读表格。

C:\>pip install pypdf2

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple

Collecting pypdf2

Using cached https://pypi.tuna.tsinghua.edu.cn/packages/8e/5e/c86a5643653825d3c913719e788e41386bee415c2b87b4f955432f2de6b2/pypdf2-3.0.1-py3-none-any.whl (232 kB)

Installing collected packages: pypdf2

Successfully installed pypdf2-3.0.1

C:\>pip install pdfplumber

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple

Collecting pdfplumber

Using cached https://pypi.tuna.tsinghua.edu.cn/packages/f8/d3/f58c2d5d86a585e438c6708f568eca79e7c4e6ee3d5210cf8b31d38cb021/pdfplumber-0.10.3-py3-none-any.whl (48 kB)

Requirement already satisfied: pdfminer.six==20221105 in d:\program files\python\lib\site-packages (from pdfplumber) (20221105)

Requirement already satisfied: Pillow>=9.1 in d:\program files\python\lib\site-packages (from pdfplumber) (10.2.0)

Requirement already satisfied: pypdfium2>=4.18.0 in d:\program files\python\lib\site-packages (from pdfplumber) (4.25.0)

Requirement already satisfied: charset-normalizer>=2.0.0 in d:\program files\python\lib\site-packages (from pdfminer.six==20221105->pdfplumber) (3.3.2)

Requirement already satisfied: cryptography>=36.0.0 in d:\program files\python\lib\site-packages (from pdfminer.six==20221105->pdfplumber) (41.0.7)

Requirement already satisfied: cffi>=1.12 in d:\program files\python\lib\site-packages (from cryptography>=36.0.0->pdfminer.six==20221105->pdfplumber) (1.16.0)

Requirement already satisfied: pycparser in d:\program files\python\lib\site-packages (from cffi>=1.12->cryptography>=36.0.0->pdfminer.six==20221105->pdfplumber) (2.21)

Installing collected packages: pdfplumber

Successfully installed pdfplumber-0.10.3

读取、合并文件

PyPDF2

PyPDF2 用于对PDF文件的分离、合并、裁剪、转换、加密、解密等操作。

读取和合并pdf文件正好以前写过，主要代码如下：

with codecs.open(file_path, 'rb', encoding='utf-16') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        text = ''
        for page_num in range(len(pdf_reader.pages)):
            tt = pdf_reader.pages[page_num].extract_text()
            print(tt)
            text += tt
......
    pdfMerge = PyPDF2.PdfMerger()
    try:
        for pdf in pdfLists:
            pdfMerge.append(pdf, import_outline=False)
        pdfMerge.write(pdfFileN)
        pdfMerge.close
        print("PDF files merged successfully!")
......

表格读取

pdfplumber

pdfplumber 用于读取PDF文件文本和表格提取，功能比较均衡。

优点：

每页单独对象，支持文本、表格数据的抽取（亮点）
文本抽取：保留了文本的格式，比如换行位置有空格，可以通过这个特点将一段的文本整合
表格数据抽取：不会被换行数据所干扰

缺点:

进行文本抽取时，如果一页有文本和表格，那么抽取的文本数据也会包括表格数据
对于有合并单元格的表格，无法还原表格结构
表格数据不能100%保证和原数据一致，可能缺少几个字，可能识别出错等
对于无边框的表格，处理效果很差
流程图和时序图会对处理产生严重影响

读取代码如下：

pdf =  pdfplumber.open(pdfFileN)
for page in pdf.pages:
    tables = page.extract_tables(table_settings = {})
    for table in tables:
        print(table)

遍历得到的是一个个二维列表，可以根据需要自己清洗数据。

程序界面

easygui

就用这个库，弄2个对话框简单了事：

选择一个文件夹，把其下的所有pdf文件合并，然后转换输出csv文件：

输出的文件格式如下：

更多easygui内容请见：

以上几样库拼凑在一起，就可以完成合并和转换pdf表格，完整代码如下：

import sys,os
import datetime as dt
import PyPDF2,pdfplumber
import easygui as eg
def get_pdf_text(file_path):
    with codecs.open(file_path, 'rb', encoding='utf-16') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        text = ''
        for page_num in range(len(pdf_reader.pages)):
            tt = pdf_reader.pages[page_num].extract_text()
            print(tt)
            text += tt
    return text
def strDateTime(diff=0):
    now = dt.datetime.now()
    future_time = now + dt.timedelta(days=diff)    
    return f'{future_time.year:04}{future_time.month:02}{future_time.day:02}_{future_time.hour:02}{future_time.minute:02}{future_time.second:02}'
txtStart = "PDFmerged_"
try:
    Dir = eg.diropenbox(msg=None, title=None, default='./')
    pdfLists = [f for f in os.listdir(Dir) if f.endswith('.pdf') and not f.startswith(txtStart)]
    pdfFileN = Dir + '\\' + txtStart + strDateTime() + ".pdf"
except:
    print('取消退出！')
    sys.exit(0)
if len(pdfLists)==0:
    eg.msgbox("此文件夹没有Pdf文件！", title="结束", ok_button="Fail")
    sys.exit(0)
else:
    pdfMerge = PyPDF2.PdfMerger()
    try:
        for pdf in pdfLists:
            pdfMerge.append(pdf, import_outline=False)
        pdfMerge.write(pdfFileN)
        pdfMerge.close
        print("PDF files merged successfully!")
    except:
        eg.msgbox("合并pdf失败！", title="结束", ok_button="Fail")
        sys.exit(0)
pdf =  pdfplumber.open(pdfFileN)
dct = dict()
for page in pdf.pages:
    tables = page.extract_tables(table_settings = {})
    for table in tables:
        for lst in table:
            tmp = lst[1:]
            tmp = [tmp[0]]+tmp[3:8]+[tmp[-1]]
            try:
                tmp[0] = tmp[0].replace('\n','')
                tmp[0] = tmp[0].split('/')
                tmp[0] = tmp[0][-1]
            except:
                pass
            if lst[0]=='时间':
                dct[lst[0]] = tmp[0]
            else:
                dct[','.join([lst[0],tmp[0] if tmp[0] else ''])] = ','.join(tmp[1:]) if all(tmp[1:]) else ''
pdf.close()
try:os.remove(pdfFileN)
except:pass
try:
    fn = "考勤表(" + dct['时间'] + ")"+strDateTime()+".csv"
except:
    fn = "考勤表"+strDateTime()+".csv"
try:
    with open(fn, 'w') as f:
        for k,v in dct.items():
            print(','.join([k,v]), file=f)
    eg.msgbox(f"考勤表保存成功！\n\n\n\t文件名：{fn}", title="结束", ok_button="Good！")
    print(f"CSV file written successfully! by HannYang {strDateTime()}")
except:
    eg.msgbox("保存csv文件失败！", title="结束", ok_button="Fail")

后话

如果要直接输出Excel表格，则需要另安装和导入xlwt模块。实现的代码大致如下：

myxl = xlwt.Workbook()
    style = xlwt.easyxf('align: wrap yes; align: horiz center; font: bold yes;borders:top thin; borders:bottom thin; borders:left thin; borders:right thin;') 
    sheet = myxl.add_sheet('考勤表')
    wcol = [20,40,50,75,40,75]
    for i in range(6):
        sheet.col(i).width = wcol[i]*80
    sheet.write_merge(0,0,0,8,'出勤统计报表',style)
    style = xlwt.easyxf('borders:top thin; borders:bottom thin; borders:left thin; borders:right thin;') 
    sheet.write_merge(1,1,0,1,'单位（盖章）：',style)
    sheet.write_merge(2,2,0,1,'*经办人：',style)
    sheet.write(1,3,'填表日期：',style)
    sheet.write_merge(1,1,4,8,strToday(),style)
    sheet.write(2,3,'*联系电话：',style)
    sheet.write(2,2,adminName,style)
    sheet.write_merge(2,2,4,8,adminMobil,style)
    for i,t in enumerate(head.strip().split(',')):
            sheet.write(3,i,t,style)
    with open('考勤表.csv', 'r') as f:
        for i,row in enumerate(csv.reader(f)):
            if i==0:continue
            for j,col in enumerate(row):
                    sheet.write(3+i,j,col,style)
    excelfile = 'Output_'+strDateTime()+'('+defaultValue+').xls'
    myxl.save(excelfile)

另外不赶时间的话，还可以用PySimpleGUI库写个带漂亮gui界面的程序，具体操作方法请参见：

Python 快速合并PDF表格转换输出CSV文件

安装模块

读取、合并文件

PyPDF2

表格读取

pdfplumber

程序界面

easygui

后话

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

Python 快速合并PDF表格转换输出CSV文件

安装模块

读取、合并文件

PyPDF2

表格读取

pdfplumber

程序界面

easygui

后话

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像