Free PDF-to-Word Conversion? How Did I Not Know About This? - Alibaba Cloud Developer Community

2023-08-09

Reading time: about three minutes.

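First, let's take a look at a PDF file.

We pick the abstract of one of the papers.

After converting it with our Python code:

Pretty neat, isn't it?

Most online PDF-to-Word converters charge a fee, usually per page. With a few lines of Python we can extract the text of a PDF into a Word document completely free of charge, so let's see how it works.

First, here are the modules we need to install: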

attrs==17.4.0
lxml==4.1.1
pdfminer3k==1.3.1
pluggy==0.6.0
ply==3.11
py==1.5.2
pytest==3.4.1
python-docx==0.8.6
six==1.11.0

Install each module with pip, the Python package manager.

Alternatively, put the list above into a requirements.txt file and run

pip install -r requirements.txt

to install everything at once.
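
Before moving on, it can help to confirm that the two libraries the script imports are actually visible to the interpreter. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, not part of the original article.

```python
from importlib.util import find_spec


def missing_modules(import_names):
    """Return the names from import_names that cannot be found for import."""
    return [name for name in import_names if find_spec(name) is None]


# Import names differ from the pip names: pdfminer3k is imported as
# "pdfminer" and python-docx is imported as "docx".
print(missing_modules(["pdfminer", "docx"]))
```

An empty list means both packages were found; any name that is printed still needs to be installed.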

Now let's start coding!

First, import the modules we need:

import os
from io import StringIO
from io import open
from concurrent.futures import ProcessPoolExecutor
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from docx import Document

Then define the folder the PDF files are read from and the folder the Word files are written to.

pdf_folder = r'/Users/wuyuqing/Desktop/Code/pdf2word/pdf'
word_folder = r'/Users/wuyuqing/Desktop/Code/pdf2word/word'
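
Each PDF in the input folder ends up as a .docx file with the same base name in the output folder. As a rough sketch of that mapping (the helper name `word_path_for` and the example paths are assumptions for illustration; `PurePosixPath` is used only so the result looks the same on every operating system):

```python
from pathlib import PurePosixPath


def word_path_for(pdf_path, word_folder):
    """Return the .docx path in word_folder that matches pdf_path's file name."""
    stem = PurePosixPath(pdf_path).stem  # file name without its last extension
    return str(PurePosixPath(word_folder) / (stem + ".docx"))


print(word_path_for("/data/pdf/thesis.v2.pdf", "/data/word"))
# prints /data/word/thesis.v2.docx
```

Only the last extension is replaced, so dots inside the file name are preserved.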

Next, we define the functions we will use:

def read_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        resource_manager = PDFResourceManager()
        return_str = StringIO()
        lap_params = LAParams()
        device = TextConverter(
            resource_manager,
            return_str,
            laparams=lap_params)
        process_pdf(resource_manager, device, file)
        device.close()
        content = return_str.getvalue()
        return_str.close()
        return content

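We open the file in binary mode and read its content. The real work is done by the process_pdf function; here is how the library implements it (this code already ships with the library and is shown for reference only, so you do not need to write it yourself):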

def process_pdf(rsrcmgr, device, fp, pagenos=None, maxpages=0, password='',
                caching=True, check_extractable=True):
    # Create a PDF parser object associated with the file object.
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure.
    doc = PDFDocument(caching=caching)
    # Connect the parser and document objects.
    parser.set_document(doc)
    doc.set_parser(parser)
    # Supply the document password for initialization.
    # (If no password is set, give an empty string.)
    doc.initialize(password)
    # Check if the document allows text extraction. If not, abort.
    if check_extractable and not doc.is_extractable:
        raise PDFTextExtractionNotAllowed(
            'Text extraction is not allowed: %r' % fp)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    for (pageno,page) in enumerate(doc.get_pages()):
        if pagenos and (pageno not in pagenos): continue
        interpreter.process_page(page)
        if maxpages and maxpages <= pageno+1: break
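
The way `pagenos` and `maxpages` interact in that final loop is easy to misread, so here is a small standard-library sketch that mimics only the loop. `select_pages` is a hypothetical name, and plain strings stand in for real page objects.

```python
def select_pages(pages, pagenos=None, maxpages=0):
    """Mimic the page loop above; return the pages that would be processed."""
    processed = []
    for pageno, page in enumerate(pages):
        if pagenos and (pageno not in pagenos):
            continue  # a skipped page also skips the maxpages check below
        processed.append(page)
        if maxpages and maxpages <= pageno + 1:
            break
    return processed


pages = ["p0", "p1", "p2", "p3", "p4"]
print(select_pages(pages))                              # ['p0', 'p1', 'p2', 'p3', 'p4']
print(select_pages(pages, pagenos={1, 3}))              # ['p1', 'p3']
print(select_pages(pages, maxpages=2))                  # ['p0', 'p1']
print(select_pages(pages, pagenos={0, 3}, maxpages=2))  # ['p0', 'p3']
```

Page numbers are zero-based, and `maxpages` is compared against the page position rather than the count of pages processed so far, which is why the last call still reaches `p3`.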

Next, we save the extracted text as a .docx document:

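def save_text_to_word(content, file_path):
    doc = Document()
    # One Word paragraph per line of extracted text.
    for line in content.split('\n'):
        paragraph = doc.add_paragraph()
        # remove_control_characters is not defined in this article; it is
        # expected to strip characters that a .docx file cannot store.
        paragraph.add_run(remove_control_characters(line))
    doc.save(file_path)

# Wrap the two functions together
def pdf_to_word(pdf_file_path, word_file_path):
    content = read_from_pdf(pdf_file_path)
    save_text_to_word(content, word_file_path)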
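
The code above calls `remove_control_characters`, but the article never shows its body. Below is one possible version, written under the assumption that its job is to drop the ASCII control characters that XML 1.0 forbids (a .docx file is XML inside), while keeping tab, line feed and carriage return. Treat it as a sketch, not as the author's actual helper.

```python
# Hypothetical helper: the article uses this name but does not show its body.
# Map every forbidden control character to None so str.translate deletes it.
_XML_FORBIDDEN = dict.fromkeys(
    code for code in range(32) if code not in (0x09, 0x0A, 0x0D)
)


def remove_control_characters(text):
    """Drop ASCII control characters that XML 1.0 does not allow."""
    return text.translate(_XML_FORBIDDEN)


sample = "Abstract\x0c page two\x00\ttabbed"
print(repr(remove_control_characters(sample)))
# prints 'Abstract page two\ttabbed'
```

The form feed and the NUL byte are removed, while the tab survives.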

With that, the core functionality is complete.

Now let's call these functions to read each PDF and generate a .docx file:

def main():
    os.makedirs(word_folder, exist_ok=True)
    tasks = []
    # Leaving the with-block waits for every submitted task to finish,
    # so no polling loop is needed afterwards.
    with ProcessPoolExecutor(max_workers=5) as executor:
        for file in os.listdir(pdf_folder):
            file_name, extension_name = os.path.splitext(file)
            if extension_name.lower() != '.pdf':
                continue
            pdf_file = os.path.join(pdf_folder, file)
            word_file = os.path.join(word_folder, file_name + '.docx')
            print('Processing:', file)
            tasks.append(executor.submit(pdf_to_word, pdf_file, word_file))
    for task in tasks:
        task.result()  # re-raises any exception from the worker process
    print('Done')


# The guard is required where worker processes are spawned (e.g. Windows).
if __name__ == '__main__':
    main()
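
To find out which file failed, rather than only that everything has finished, each future can be mapped back to its file name and collected with `as_completed`. This is a sketch with stand-ins: `fake_convert` is a hypothetical replacement for `pdf_to_word`, and `ThreadPoolExecutor` is swapped in for `ProcessPoolExecutor` only so the example runs anywhere without a main guard. Both executors expose the same `submit` and `Future` interface.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_convert(name):
    """Hypothetical stand-in for pdf_to_word; fails on one input."""
    if name == "broken.pdf":
        raise ValueError("cannot parse " + name)
    return name.replace(".pdf", ".docx")


def run_all(names, worker, max_workers=5):
    """Run worker on every name; return (results, errors) keyed by name."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(worker, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # keep going; record which file failed
                errors[name] = exc
    return results, errors


results, errors = run_all(["a.pdf", "broken.pdf", "b.pdf"], fake_convert)
print(sorted(results), sorted(errors))
# prints ['a.pdf', 'b.pdf'] ['broken.pdf']
```

One bad PDF then no longer hides behind a generic completion message, and the remaining files are still converted.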

That is all it takes to generate the .docx files. Simple, isn't it?

Why not give it a try yourself?
