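Beautiful Soup is a widely used companion tool that helps Python programs make sense of the messy, "mostly HTML" content found on Web sites. It is an HTML/XML parser written in Python that copes well with malformed markup and builds a parse tree from it.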
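To make the idea of a parse tree built from malformed markup concrete, here is a minimal sketch. It assumes the newer bs4 package (Beautiful Soup 4) together with Python's built-in "html.parser" backend, not the Beautiful Soup 3 module named elsewhere in this post. The sample markup, the name in it, and the helper director_name are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 (the bs4 package) is installed

# Hypothetical, deliberately sloppy markup: <p> and <b> are never closed.
SLOPPY_HTML = '<div class="director"><p>Directed by <b>Jane Doe</div>'


def director_name(html):
    """Return the bold text inside the first div.director, or "UNKNOWN"."""
    # Build a parse tree; the parser closes the dangling tags for us.
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"class": "director"})
    if div is None or div.b is None:
        return "UNKNOWN"
    # div.b is the first <b> anywhere below the div in the tree.
    return div.b.get_text().strip()


if __name__ == "__main__":
    print(director_name(SLOPPY_HTML))
```

Even though the tags are left open, the tree can still be navigated by tag name and attribute, and a missing element falls back to "UNKNOWN" instead of raising an error.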
Using Beautiful Soup to produce tidy data from disorderly content
    from glob import glob
    from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 module name

    def process():
        # Header row of the CSV output
        print("!MOVIE,DIRECTOR,KEY_GRIP,THE_MOOSE")
        for fname in glob('result_*'):
            # Put that sloppy HTML into the soup
            soup = BeautifulSoup(open(fname))

            # Try to find the fields we want, but default to unknown values
            try:
                # [1] selects the second matching element on the page
                movie = soup.findAll('span', {'class':'movie_title'})[1].contents[0]
            except IndexError:
                movie = "UNKNOWN"

            try:
                director = soup.findAll('div', {'class':'director'})[1].contents[0]
            except IndexError:
                director = "UNKNOWN"

            try:
                # Maybe multiple grips listed, key one should be in there
                grips = soup.findAll('p', {'id':'grip'})[0].contents[0]
                grips = " ".join(grips.split())   # Normalize extra spaces
            except IndexError:
                grips = "UNKNOWN"

            try:
                # Hide some stuff in the HTML <meta> tags
                moose = soup.findAll('meta', {'name':'shibboleth'})[0]['content']
            except (IndexError, KeyError):
                moose = "UNKNOWN"

            print('"%s","%s","%s","%s"' % (movie, director, grips, moose))

    if __name__ == '__main__':
        process()
For details, see: http://www.crummy.com/software/BeautifulSoup/documentation.zh.html
PyQuery is a similar library; see its site at http://packages.python.org/pyquery/