Python programming: extracting table data from PDFs with tabula, pdfplumber, and camelot

2021-11-23

This article tests the Python libraries currently available for recognizing tables in PDF files:

1. tabula

2. pdfplumber

3. camelot

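Preparing the data

Excel file: names.xlsx, containing two tables

Table 1: every cell is enclosed by borders

Table 2: the outermost border is missing

Save the Excel file as a PDF: names.pdf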

1. tabula

Java project: https://github.com/tabulapdf

A tool with a built-in graphical interface for extracting table data from PDFs:

https://tabula.technology/

Python wrapper: https://github.com/chezou/tabula-py

Installation:

pip install tabula-py

Dependency:

Java 7 or 8

Code example:

import tabula

# Convert the tables found in the PDF straight into a CSV file
tabula.convert_into(
    input_path="source/names.pdf",
    output_path="source/names.csv",
    output_format="csv"
)

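In the resulting names.csv, only Table 1 was extracted, and the output is malformed: each data row has an extra comma in the middle, with age and gender merged into a single cell.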

"姓名",年龄,性别
"李雷",,20 男
"韩梅梅",,23 女
"赵小三",,25 女

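The stray comma comes from two values being merged into one cell while another cell is left empty. Below is a minimal sketch of how such rows could be normalized with the standard library, assuming the malformed output always looks like the sample above (one empty cell plus one cell holding two space-separated values). The helper name `normalize_row` is hypothetical and is not part of tabula-py.

```python
import csv
import io

# Sample copied from the names.csv output shown above.
RAW_CSV = '''"姓名",年龄,性别
"李雷",,20 男
"韩梅梅",,23 女
"赵小三",,25 女
'''


def normalize_row(row):
    """Split space-joined cells apart and drop empty cells (hypothetical helper)."""
    cells = []
    for cell in row:
        # "20 男" becomes ["20", "男"]; an empty cell contributes nothing.
        cells.extend(cell.split())
    return cells


rows = [normalize_row(row) for row in csv.reader(io.StringIO(RAW_CSV))]
for row in rows:
    print(row)
```

This only works because none of the sample values contain spaces of their own; a table whose cells legitimately contain spaces would need a different repair rule.
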
2. pdfplumber

GitHub: https://github.com/jsvine/pdfplumber

Installation:

pip install pdfplumber

Code example:

import pdfplumber
import pandas as pd  # only needed for the commented-out DataFrame line below

with pdfplumber.open("source/names.pdf") as pdf:
    # Get the first page
    first_page = pdf.pages[0]

    # Extract the text
    text = first_page.extract_text()
    print(text)

    # Extract the tables
    tables = first_page.extract_tables()
    for table in tables:
        print(table)
        # df = pd.DataFrame(table[1:], columns=table[0])
        for row in table:
            for cell in row:
                print(cell, end="\t|")
            print()
""" 
表格1:
姓名 年龄 性别
李雷 20 男
韩梅梅 23 女
赵小三 25 女
Table2：
Name Age Gender
Tom 30 Male
Jack 33 Male
Jone 31 Female
[['姓名', '年龄', '性别'], ['李雷', '20', '男'], ['韩梅梅', '23', '女'], ['赵小三', '25', '女']]
姓名  |年龄 |性别 |
李雷  |20 |男  |
韩梅梅 |23 |女  |
赵小三 |25 |女  |
[['30'], ['33']]
30  |
33  |
"""

Text extraction was complete, but only Table 1 was parsed in full; for Table 2, only the bordered part was extracted.
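The commented-out DataFrame line in the example treats the first row of each table as its header. Here is a minimal standard-library sketch of the same idea, assuming the nested-list shape printed in the output above. `table_to_records` is a hypothetical helper, not a pdfplumber function.

```python
# Table 1 as returned by extract_tables() in the output above.
table = [
    ["姓名", "年龄", "性别"],
    ["李雷", "20", "男"],
    ["韩梅梅", "23", "女"],
    ["赵小三", "25", "女"],
]


def table_to_records(table):
    """Use the first row as the header and map every other row onto it."""
    header, *body = table
    return [dict(zip(header, row)) for row in body]


records = table_to_records(table)
for record in records:
    print(record)
```

This is only meaningful when the table was extracted completely. For the partial Table 2 result `[['30'], ['33']]`, the first row is data rather than a header, so the mapping would be wrong.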

3. camelot

GitHub: https://github.com/socialcopsdev/camelot

Installation:

pip install camelot-py[cv]

Example:

import camelot
tables = camelot.read_pdf("source/names.pdf")
tables.export("source/names.csv")

This generates two files:

source/names-page-1-table-1.csv

"姓名","年龄","性别"
"李雷","20 男",""
"韩梅梅","23 女",""
"赵小三","25 女",""

source/names-page-1-table-2.csv

"Name","Age","Gender"
"Tom","","30 Male"
"Jack","","33 Male"
"Jone","","31 Female"

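Table 2's content was extracted this time, but the cells of both tables are misaligned.

After testing, none of these three libraries handled table recognition particularly well.

Summary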

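Library	Notes
tabula	Extracts fully bordered tables; output is malformed
pdfplumber	Extracts fully bordered tables; output is fairly clean
camelot	Extracts both fully bordered and partially bordered tables; output is malformed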
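
To make "malformed" concrete, here is a rough sketch of one possible check. It assumes a table counts as well-formed when every row has the same number of cells as the header row and none of them is empty. The function name `is_well_formed` and this criterion are illustrative assumptions, and the rows are copied from the Table 1 outputs shown in each section.

```python
def is_well_formed(rows):
    """Return True if every row matches the header width and has no empty cells."""
    if not rows:
        return False
    width = len(rows[0])
    return all(
        len(row) == width and all(cell != "" for cell in row)
        for row in rows
    )


# Header plus first data row of Table 1, as produced by each library above.
results = {
    "tabula": [["姓名", "年龄", "性别"], ["李雷", "", "20 男"]],
    "pdfplumber": [["姓名", "年龄", "性别"], ["李雷", "20", "男"]],
    "camelot": [["姓名", "年龄", "性别"], ["李雷", "20 男", ""]],
}

report = {name: is_well_formed(rows) for name, rows in results.items()}
print(report)
```

Under this criterion only the pdfplumber result passes, which matches the table above. The check cannot detect a merged cell such as "20 男" unless an empty cell accompanies it.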
