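Python基础教程 (Beginning Python, 3rd Edition, Chinese edition), Chapter 20, Project 1: Automatically Adding Tags (Plain Text to HTML) (Notes) - Alibaba Cloud Developer Community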


2024-06-13


Chapter 20, Project 1: Automatically Adding Tags (Plain Text to HTML)

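1. Problem description

Add HTML tags to a plain-text file so that it becomes an HTML document.

The task is to classify the elements of the text and then mark each one up.

Goals:

The input should not need to contain any hand-written codes or tags.

The program should handle different kinds of text blocks.

The program should be extensible, that is, able to support other markup languages.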

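2. Useful tools

Required: reading and writing files, and producing output.

Possibly useful: iterating over input lines, string handling, generators, and the re module.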

3. Preparation

A plain-text file for testing, test_input.txt:

Welcome to World Wide Spam, Inc.

These are the corporate web pages of *World Wide Spam*, Inc. We hope
you find your stay enjoyable, and that you will sample many of our
products.

A short history of the company

World Wide Spam was started in the summer of 2000. The business
concept was to ride the dot-com wave and to make money both through
bulk email and by selling canned meat online.

After receiving several complaints from customers who weren't
satisfied by their bulk email, World Wide Spam altered their profile,
and focused 100% on canned goods. Today, they rank as the world's
13,892nd online supplier of SPAM.

Destinations

From this page you may visit several of our interesting web pages:

- What is SPAM? (http://wwspam.fu/whatisspam)

- How do they make it? (http://wwspam.fu/howtomakeit)

- Why should I eat it? (http://wwspam.fu/whyeatit)

How to get in touch with us

You can get in touch with us in *many* ways: By phone (555-1234), by
email (wwspam@wwspam.fu) or by visiting our customer feedback page
(http://wwspam.fu/feedback).

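4. First implementation

The first step is to split the text into paragraphs, that is, to find the text blocks.

In the test file, paragraphs are separated by one or more blank lines.

A block can therefore be obtained by collecting all the lines that come before a blank line. Create util.py to produce the text blocks: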

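# Line generator: yields every line of the file, then one extra blank line,
# so that the last block is always terminated.
def lines(file):
    for line in file:
        yield line
    yield '\n'

# Block generator: joins each run of non-blank lines into one string
# and strips the whitespace from both ends.
def blocks(file):
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []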
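As a quick check of how these two generators behave, the sketch below feeds blocks an in-memory file built with io.StringIO. The sample text is made up, and blocks_without_sentinel is a hypothetical helper that exists only to show what would happen if the trailing blank line were not appended.

```python
import io

def lines(file):
    """Yield every line of file, then one extra blank line as a sentinel."""
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    """Yield each run of non-blank lines as one stripped string."""
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []

def blocks_without_sentinel(file):
    """Hypothetical variant that skips lines(); for comparison only."""
    block = []
    for line in file:
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []

# Made-up sample: two blocks, repeated blank lines, no blank line at the end.
sample = "A title\n\n\nFirst line\nsecond line"

with_sentinel = list(blocks(io.StringIO(sample)))
without_sentinel = list(blocks_without_sentinel(io.StringIO(sample)))

print(with_sentinel)     # ['A title', 'First line\nsecond line']
print(without_sentinel)  # ['A title']
```

Repeated blank lines produce no empty blocks because of the elif block test, and the final block is only emitted when a blank line follows it, which is what the extra '\n' from lines guarantees.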

Next, add markup to the text blocks.

Create the markup program simple_markup.py:

import sys, re
from util import blocks

print('<html><head><title>...</title></head><body>')

# The first block is treated as the title; every later block is a paragraph.
title = True
for block in blocks(sys.stdin):
    # Turn *emphasized* text into <em>emphasized</em>.
    block = re.sub(r'\*(.+?)\*', r'<em>\1</em>', block)
    if title:
        print('<h1>')
        print(block)
        print('</h1>')
        title = False
    else:
        print('<p>')
        print(block)
        print('</p>')

print('</body></html>')
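The re.sub call relies on the non-greedy group (.+?) so that each pair of asterisks is matched separately. The sketch below isolates that substitution; the function names emphasize and emphasize_greedy and the sample sentence are introduced here for illustration, and the greedy variant is shown only for contrast.

```python
import re

# Same pattern as in simple_markup.py: the text between two asterisks.
EMPHASIS = re.compile(r'\*(.+?)\*')

def emphasize(text):
    """Replace *text* with <em>text</em>, one asterisk pair at a time."""
    return EMPHASIS.sub(r'<em>\1</em>', text)

def emphasize_greedy(text):
    """Hypothetical greedy variant, shown only for contrast."""
    return re.sub(r'\*(.+)\*', r'<em>\1</em>', text)

line = 'Contact us in *many* ways, at *any* time.'

print(emphasize(line))
# Contact us in <em>many</em> ways, at <em>any</em> time.
print(emphasize_greedy(line))
# Contact us in <em>many* ways, at *any</em> time.
```

With the greedy group, a single match stretches from the first asterisk to the last one on the line, so the inner asterisks are left in the output.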

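Because the program reads from standard input and writes to standard output, run it with both redirected, for example:

python simple_markup.py < test_input.txt > test_output.html

Opening the resulting test_output.html in a browser shows an article with a heading and paragraphs.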
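The same conversion can also be tried without shell redirection. The sketch below wraps the loop in a function that takes explicit file objects; the name markup, its parameters, and the sample input are assumptions made for this example and are not part of the original program. The util.py generators are repeated so the block runs on its own.

```python
import io
import re

def lines(file):
    """Yield every line of file, then one extra blank line as a sentinel."""
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    """Yield each run of non-blank lines as one stripped string."""
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []

def markup(infile, outfile):
    """Write a simple HTML rendering of infile to outfile (hypothetical helper)."""
    print('<html><head><title>...</title></head><body>', file=outfile)
    title = True
    for block in blocks(infile):
        block = re.sub(r'\*(.+?)\*', r'<em>\1</em>', block)
        # The first block becomes the heading; all others become paragraphs.
        tag = 'h1' if title else 'p'
        title = False
        print(f'<{tag}>', file=outfile)
        print(block, file=outfile)
        print(f'</{tag}>', file=outfile)
    print('</body></html>', file=outfile)

# Made-up input; with real files, pass the objects returned by open() instead.
source = io.StringIO("Welcome\n\nThis is *important* text.\n")
result = io.StringIO()
markup(source, result)
print(result.getvalue())
```

Passing file objects rather than reading sys.stdin directly keeps the logic the same while making it possible to check the output from within Python.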
