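Project name: Instant Markup
Roughly, the idea is to take a plain-text file and mark it up into whatever format you want.
First comes the text to be processed. I could not find an electronic copy, so I typed one up by hand.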
Welcome to World Wide Spam, Inc

These are the corporate web pages of *World Wide Spam*, Inc. We hope you find your stay enjoyable, and that you will sample many of our products

A short history of the company

World Wide Spam was started in the summer of 2000. The business concept was to ride the dot-com wave and to make money both through bulk email and by selling canned meat online

After receiving several complaints from customers who weren't satisfied by their bulk email, World Wide Spam altered their profile, and focused 100% on canned goods. Today they rank as the world's 13.892nd online supplier of SPAM

Destinations

From this page you may visit several of our interesting web pages:

-What is SPAM?(WWW.baidu.com)

-How do they make it?(WWW.baidu.com)

-Why should I eat it?(WWWW.baidu.com)

How to get in touch with us

You can get in touch with us in *many* ways: by phone(123456789), by email(dream_dog@163.com) or by visiting our customer feedback page(wwww.baidu.com)
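The first step is to split the file into paragraphs (blocks).
A simple way to find the blocks is to collect every line you encounter until you reach a blank line, and then return the lines collected so far. Those returned lines form one block. After that, start collecting again. Blank lines themselves are not collected, and empty blocks must not be returned. The last line of the file also has to be blank; otherwise the program cannot tell when the final block ends.
Writing a block generator for the file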
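def lines(file):
    # Yield every line of the file, then one extra blank line
    # so that the final block is always terminated.
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    # Collect non-blank lines; a blank line ends the current block.
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            # Skip runs of blank lines: only non-empty blocks are yielded.
            yield ' '.join(block).strip()
            block = []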
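In this code, the lines generator simply appends one blank line to the end of the file, and the blocks generator implements the approach described above.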
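As a quick check of the two generators, here is a minimal sketch that feeds them an in-memory file instead of a real one. The sample string and the use of io.StringIO are assumptions made for illustration, and the generator definitions are repeated so that the snippet runs on its own.

```python
import io

def lines(file):
    # Yield every line, then one extra blank line to flush the last block.
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    # Collect non-blank lines; emit them as one block at each blank line.
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ' '.join(block).strip()
            block = []

# Hypothetical sample input: two paragraphs separated by two blank lines,
# with no blank line at the very end.
sample = io.StringIO('Title line\n\n\nfirst line\nsecond line')
result = list(blocks(sample))
print(result)
```

Consecutive blank lines do not produce empty blocks, and the final paragraph is still returned even though the input does not end with a blank line, because lines supplies one.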
Adding some markup
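import sys, re
from util import *

print('<html><head><title>...</title><body>')

title = True
for block in blocks(sys.stdin):
    # Turn *emphasized* text into <em>emphasized</em>.
    block = re.sub(r'\*(.+?)\*', r'<em>\1</em>', block)
    if title:
        # The first block is treated as the page heading.
        print('<h1>')
        print(block)
        print('</h1>')
        title = False
    else:
        print('<p>')
        print(block)
        print('</p>')

print('</body></html>')

My first attempt raised an error saying that sys is not a package. The cause was a typo in the import line: I had typed a period instead of a comma (import sys. re), which Python reads as import sys.re, a submodule of sys that does not exist. Writing import sys, re fixes it.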
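To try the markup step without piping a file through standard input, the same logic can be wrapped in a function. This is only a sketch: the function name markup, the in-memory sample, and returning a string instead of printing are assumptions made for illustration, and blocks is repeated here (folded together with the lines helper) so that the snippet runs on its own.

```python
import io
import re

def blocks(file):
    # Same idea as the blocks generator in util: group non-blank lines,
    # with one appended blank line to flush the final block.
    block = []
    for line in list(file) + ['\n']:
        if line.strip():
            block.append(line)
        elif block:
            yield ' '.join(block).strip()
            block = []

def markup(file):
    # Hypothetical helper: the first block becomes <h1>, the rest become <p>.
    out = ['<html><head><title>...</title><body>']
    title = True
    for block in blocks(file):
        # Non-greedy match, so each *...* pair is wrapped separately.
        block = re.sub(r'\*(.+?)\*', r'<em>\1</em>', block)
        tag = 'h1' if title else 'p'
        out.extend(['<%s>' % tag, block, '</%s>' % tag])
        title = False
    out.append('</body></html>')
    return '\n'.join(out)

html = markup(io.StringIO('Welcome\n\nThis is *very* and *truly* simple.\n'))
print(html)
```

The non-greedy (.+?) matters here: with a greedy (.+), the two emphasized words in the sample would be merged into a single <em> element spanning everything between the first and last asterisk.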