Preface
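SAX and DOM are both techniques for processing XML files, but they work differently. SAX is event-driven: the parser reads the XML file sequentially and fires an event for each construct it meets (a start tag, an end tag, a run of text), and the application processes the file by responding to those events. DOM loads the whole XML file into memory as a tree, and the application processes the file by traversing that tree. Each has trade-offs. SAX keeps memory use low but only moves forward through the document, while DOM allows random access and modification at the cost of holding the entire document in memory. Which one to use depends on the requirements.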
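To make the contrast concrete, here is a minimal sketch that extracts the same values both ways. The sample document and the names `NameCollector`, `names_with_sax` and `names_with_dom` are made up for illustration; only `xml.sax` and `xml.dom.minidom` come from the standard library.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, used only for this comparison.
XML_TEXT = (
    "<storehouse>"
    "<goods><name>carp</name></goods>"
    "<goods><name>kiwi</name></goods>"
    "</storehouse>"
)


class NameCollector(xml.sax.ContentHandler):
    """SAX style: react to events while the parser streams through the input."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect the pieces.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._buffer))
            self._in_name = False


def names_with_sax(text):
    handler = NameCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.names


def names_with_dom(text):
    # DOM style: build the whole tree in memory first, then walk it.
    document = minidom.parseString(text)
    return [node.firstChild.data for node in document.getElementsByTagName("name")]


sax_names = names_with_sax(XML_TEXT)
dom_names = names_with_dom(XML_TEXT)
print(sax_names, dom_names)
```

The SAX version needs more code because the handler has to track its own state, but it never holds more than one text value at a time.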
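The SAX module
SAX is a way of parsing XML documents based on an event-driven model: the parser walks through the elements and attributes of the document one by one and fires the corresponding events. Compared with the DOM model, SAX is more lightweight, because it does not build a tree in memory, which makes it suitable for large XML documents.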
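One way to see why SAX suits large documents: the input can be handed to the parser piece by piece. The sketch below assumes the parser returned by `xml.sax.make_parser()` supports the incremental `feed()`/`close()` interface, which is the case for the default expat-based reader. `GoodsCounter`, `count_goods` and the chunk data are hypothetical.

```python
import xml.sax


class GoodsCounter(xml.sax.ContentHandler):
    """Counts <goods> elements without storing any of their content."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "goods":
            self.count += 1


def count_goods(chunks):
    parser = xml.sax.make_parser()
    handler = GoodsCounter()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)  # hand over one piece of the document
    parser.close()          # signal the end of the input
    return handler.count


# The chunks deliberately split tags in the middle, as a file read in
# fixed-size blocks would.
chunks = [b"<storehouse><goo", b"ds/><goods></go", b"ods></storehouse>"]
total = count_goods(chunks)
print(total)
```

In a real program the chunks would come from reading a file in blocks, so memory use stays bounded regardless of the file size.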
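Reading XML files with SAX
xml.sax is a package in the Python standard library for parsing XML documents. It provides an event-based API: while the document is being parsed, the parser fires events, and the application handles those events to process the document.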
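As a rough illustration of what "firing events" means, the following sketch records every callback the parser makes for a tiny document. `EventLogger` and the sample input are hypothetical. Whitespace-only text is skipped to keep the log short.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each event in the order the parser reports it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        if content.strip():
            self.events.append(("chars", content))


logger = EventLogger()
xml.sax.parseString(b'<goods category="fish"><name>carp</name></goods>', logger)
for event in logger.events:
    print(event)
```

The log starts with `startDocument`, ends with `endDocument`, and in between lists the start tags, text and end tags in document order.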
Common functions
make_parser
Creates and returns a SAX parser as an XMLReader object.
def make_parser(parser_list=()):
    """Creates and returns a SAX parser.

    Creates the first parser it is able to instantiate of the ones
    given in the iterable created by chaining parser_list and
    default_parser_list.  The iterables must contain the names of Python
    modules containing both a SAX parser and a create_parser function."""
In other words, make_parser tries the modules named in parser_list first, then those in default_parser_list, and returns the first parser it manages to instantiate. Every module named in these lists must contain both a SAX parser and a create_parser function.
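A short usage sketch: create a parser, inspect one of its features, attach a handler and parse from a byte stream. `ElementCounter` and the sample bytes are hypothetical. The feature value shown is what the default expat-based reader reports, and other parsers may differ.

```python
import io
import xml.sax
import xml.sax.handler


class ElementCounter(xml.sax.ContentHandler):
    """Counts every start tag."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


parser = xml.sax.make_parser()

# Namespace processing is off by default in the expat-based reader.
namespaces_on = parser.getFeature(xml.sax.handler.feature_namespaces)

counter = ElementCounter()
parser.setContentHandler(counter)
parser.parse(io.BytesIO(b"<storehouse><goods/><goods/></storehouse>"))

element_count = counter.count
print(bool(namespaces_on), element_count)
```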
parse
Creates a SAX parser and uses it to parse an XML document.
def parse(source, handler, errorHandler=ErrorHandler()):
    parser = make_parser()
    parser.setContentHandler(handler)
    parser.setErrorHandler(errorHandler)
    parser.parse(source)
parseString
Similar to parse, but parses the XML from the string passed in the string parameter.
def parseString(string, handler, errorHandler=ErrorHandler()):
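The two functions can be compared side by side. In this sketch, `TagCollector` and `DOCUMENT` are made-up names. `parse` is given a file-like object here, though it also accepts a file name.

```python
import io
import xml.sax


class TagCollector(xml.sax.ContentHandler):
    """Remembers the names of all start tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


DOCUMENT = b"<storehouse><goods><name>carp</name></goods></storehouse>"

# parse() reads from a file name or a file-like object.
from_stream = TagCollector()
xml.sax.parse(io.BytesIO(DOCUMENT), from_stream)

# parseString() reads from data already held in memory.
from_string = TagCollector()
xml.sax.parseString(DOCUMENT, from_string)

print(from_stream.tags)
print(from_string.tags)
```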
SAXException
Encapsulates an error or warning raised while working with XML.
class SAXException(Exception):
    """Encapsulate an XML error or warning. This class can contain
    basic error or warning information from either the XML parser or
    the application: you can subclass it to provide additional
    functionality, or to add localization. Note that although you will
    receive a SAXException as the argument to the handlers in the
    ErrorHandler interface, you are not actually required to raise
    the exception; instead, you can simply read the information in
    it."""
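In practice, malformed input usually surfaces as `SAXParseException`, a subclass of `SAXException` that also carries position information. The sketch below relies on the default `ErrorHandler`, which raises on fatal errors. `try_parse` is a hypothetical helper.

```python
import xml.sax


def try_parse(data):
    """Return None if data is well-formed, otherwise a short error report."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return "line {}, column {}: {}".format(
            exc.getLineNumber(), exc.getColumnNumber(), exc.getMessage()
        )
    return None


ok = try_parse(b"<a><b/></a>")          # well-formed
error_report = try_parse(b"<a><b></a>")  # <b> is never closed
print(ok)
print(error_report)
```

Catching `xml.sax.SAXException` instead would also cover errors raised by the application's own handlers.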
SAX parser
Its main job is to send events to the event handler.
SAX event handler
The event handler is implemented by subclassing the ContentHandler class.
# ===== CONTENTHANDLER =====

class ContentHandler:
    """Interface for receiving logical document content events.

    This is the main callback interface in SAX, and the one most
    important to applications. The order of events in this interface
    mirrors the order of the information in the document."""
The point to remember: events arrive in the same order as the corresponding information appears in the document.
class ContentHandler:
    """Interface for receiving logical document content events.

    This is the main callback interface in SAX, and the one most
    important to applications. The order of events in this interface
    mirrors the order of the information in the document."""

    def __init__(self):
        self._locator = None

    def setDocumentLocator(self, locator):
        """Called by the parser to give the application a locator for
        locating the origin of document events.

        SAX parsers are strongly encouraged (though not absolutely
        required) to supply a locator: if it does so, it must supply
        the locator to the application by invoking this method before
        invoking any of the other methods in the DocumentHandler
        interface.

        The locator allows the application to determine the end
        position of any document-related event, even if the parser is
        not reporting an error. Typically, the application will use
        this information for reporting its own errors (such as
        character content that does not match an application's
        business rules). The information returned by the locator is
        probably not sufficient for use with a search engine.

        Note that the locator will return correct information only
        during the invocation of the events in this interface. The
        application should not attempt to use it at any other time."""
        self._locator = locator

    def startDocument(self):
        """Receive notification of the beginning of a document.

        The SAX parser will invoke this method only once, before any
        other methods in this interface or in DTDHandler (except for
        setDocumentLocator)."""

    def endDocument(self):
        """Receive notification of the end of a document.

        The SAX parser will invoke this method only once, and it will
        be the last method invoked during the parse. The parser shall
        not invoke this method until it has either abandoned parsing
        (because of an unrecoverable error) or reached the end of
        input."""

    def startPrefixMapping(self, prefix, uri):
        """Begin the scope of a prefix-URI Namespace mapping.

        The information from this event is not necessary for normal
        Namespace processing: the SAX XML reader will automatically
        replace prefixes for element and attribute names when the
        http://xml.org/sax/features/namespaces feature is true (the
        default).

        There are cases, however, when applications need to use
        prefixes in character data or in attribute values, where they
        cannot safely be expanded automatically; the
        start/endPrefixMapping event supplies the information to the
        application to expand prefixes in those contexts itself, if
        necessary.

        Note that start/endPrefixMapping events are not guaranteed to
        be properly nested relative to each-other: all
        startPrefixMapping events will occur before the corresponding
        startElement event, and all endPrefixMapping events will occur
        after the corresponding endElement event, but their order is
        not guaranteed."""

    def endPrefixMapping(self, prefix):
        """End the scope of a prefix-URI mapping.

        See startPrefixMapping for details. This event will always
        occur after the corresponding endElement event, but the order
        of endPrefixMapping events is not otherwise guaranteed."""

    def startElement(self, name, attrs):
        """Signals the start of an element in non-namespace mode.

        The name parameter contains the raw XML 1.0 name of the
        element type as a string and the attrs parameter holds an
        instance of the Attributes class containing the attributes of
        the element."""

    def endElement(self, name):
        """Signals the end of an element in non-namespace mode.

        The name parameter contains the name of the element type, just
        as with the startElement event."""

    def startElementNS(self, name, qname, attrs):
        """Signals the start of an element in namespace mode.

        The name parameter contains the name of the element type as a
        (uri, localname) tuple, the qname parameter the raw XML 1.0
        name used in the source document, and the attrs parameter
        holds an instance of the Attributes class containing the
        attributes of the element.

        The uri part of the name tuple is None for elements which have
        no namespace."""

    def endElementNS(self, name, qname):
        """Signals the end of an element in namespace mode.

        The name parameter contains the name of the element type, just
        as with the startElementNS event."""

    def characters(self, content):
        """Receive notification of character data.

        The Parser will call this method to report each chunk of
        character data. SAX parsers may return all contiguous
        character data in a single chunk, or they may split it into
        several chunks; however, all of the characters in any single
        event must come from the same external entity so that the
        Locator provides useful information."""

    def ignorableWhitespace(self, whitespace):
        """Receive notification of ignorable whitespace in element content.

        Validating Parsers must use this method to report each chunk
        of ignorable whitespace (see the W3C XML 1.0 recommendation,
        section 2.10): non-validating parsers may also use this method
        if they are capable of parsing and using content models.

        SAX parsers may return all contiguous whitespace in a single
        chunk, or they may split it into several chunks; however, all
        of the characters in any single event must come from the same
        external entity, so that the Locator provides useful
        information."""

    def processingInstruction(self, target, data):
        """Receive notification of a processing instruction.

        The Parser will invoke this method once for each processing
        instruction found: note that processing instructions may occur
        before or after the main document element.

        A SAX parser should never report an XML declaration (XML 1.0,
        section 2.8) or a text declaration (XML 1.0, section 4.3.1)
        using this method."""

    def skippedEntity(self, name):
        """Receive notification of a skipped entity.

        The Parser will invoke this method once for each entity
        skipped. Non-validating processors may skip entities if they
        have not seen the declarations (because, for example, the
        entity was declared in an external DTD subset). All processors
        may skip external entities, depending on the values of the
        http://xml.org/sax/features/external-general-entities and the
        http://xml.org/sax/features/external-parameter-entities
        properties."""

# ===== DTDHandler =====
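Two of the callbacks described in the docstrings can be tried out in a few lines. The sketch below is only an illustration. `PositionHandler`, `NamespaceHandler`, the sample documents and the `urn:example:store` namespace URI are all invented, and the line numbers come from the default expat-based reader's locator.

```python
import io
import xml.sax
import xml.sax.handler


class PositionHandler(xml.sax.ContentHandler):
    """Uses the locator stored by setDocumentLocator to record line numbers."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def startElement(self, name, attrs):
        # self._locator is filled in by the base class's setDocumentLocator.
        self.positions.append((name, self._locator.getLineNumber()))


class NamespaceHandler(xml.sax.ContentHandler):
    """Receives startElementNS events once namespace mode is switched on."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElementNS(self, name, qname, attrs):
        # name is a (uri, localname) tuple; uri is None without a namespace.
        self.names.append(name)


# Part 1: locator information (non-namespace mode).
DOCUMENT = b"<storehouse>\n  <goods>\n    <name>carp</name>\n  </goods>\n</storehouse>"
position_handler = PositionHandler()
xml.sax.parseString(DOCUMENT, position_handler)
positions = position_handler.positions

# Part 2: namespace mode has to be enabled explicitly on the parser.
parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, True)
ns_handler = NamespaceHandler()
parser.setContentHandler(ns_handler)
parser.parse(io.BytesIO(b'<s:goods xmlns:s="urn:example:store"><name/></s:goods>'))
ns_names = ns_handler.names

print(positions)
print(ns_names)
```

With namespace mode on, the parser calls `startElementNS`/`endElementNS` instead of `startElement`/`endElement`, so a handler has to implement the pair that matches the parser's configuration.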
Complete example: parsing an XML file with SAX
SAX_parse_XML.py
# coding=utf-8
import xml.sax
import xml.sax.handler

get_record = []  # collects the values read from the XML document


class GetStorehouse(xml.sax.ContentHandler):  # event handler
    def __init__(self):
        super().__init__()
        self.CurrentData = ""  # name of the element currently being parsed
        self.category = ""     # category attribute of the current <goods>
        self.title = ""        # second-level product category
        self.name = ""
        self.amount = ""
        self.price = ""

    def startElement(self, label, attributes):
        # Called for every opening tag; label is the tag name passed in
        # by the parser.
        self.CurrentData = label
        if label == "goods":
            # The parser ignores return values, so keep the attribute on
            # the handler and start a fresh record.
            self.category = attributes["category"]
            self.title = self.name = self.amount = self.price = ""

    def endElement(self, label):
        # Called for every closing tag.
        if label == "title":
            get_record.append(self.title)
        elif label == "name":
            get_record.append(self.name)
        elif label == "amount":
            get_record.append(self.amount)
        elif label == "price":
            get_record.append(self.price)
        # Reset, so that whitespace between elements is not stored as a value.
        self.CurrentData = ""

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.CurrentData == "title":
            self.title += content
        elif self.CurrentData == "name":
            self.name += content
        elif self.CurrentData == "amount":
            self.amount += content
        elif self.CurrentData == "price":
            self.price += content


parser = xml.sax.make_parser()  # create an XMLReader object
# Turn off namespace processing.
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
Handler = GetStorehouse()
parser.setContentHandler(Handler)
parser.parse("storehouse.xml")
print(get_record)

Output:
['淡水鱼', '鲫鱼', '18', '8', '温带水果', '猕猴桃', '10', '10']
<storehouse>
    <goods category="fish">
        <title>淡水鱼</title>
        <name>鲫鱼</name>
        <amount>18</amount>
        <price>8</price>
    </goods>
    <goods category="fruit">
        <title>温带水果</title>
        <name>猕猴桃</name>
        <amount>10</amount>
        <price>10</price>
    </goods>
</storehouse>
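A flat list loses the boundaries between one `goods` record and the next, and it drops the `category` attribute. As one possible variation (not the approach used above), the sketch below collects one dictionary per `goods` element. `GoodsHandler` and `STOREHOUSE_XML` are hypothetical names, and the document is embedded as a string so the example runs without a file.

```python
import xml.sax

# Same data as storehouse.xml, embedded so the sketch is self-contained.
STOREHOUSE_XML = """<storehouse>
    <goods category="fish">
        <title>淡水鱼</title>
        <name>鲫鱼</name>
        <amount>18</amount>
        <price>8</price>
    </goods>
    <goods category="fruit">
        <title>温带水果</title>
        <name>猕猴桃</name>
        <amount>10</amount>
        <price>10</price>
    </goods>
</storehouse>"""


class GoodsHandler(xml.sax.ContentHandler):
    """Builds one dict per <goods> element."""

    FIELDS = {"title", "name", "amount", "price"}

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # dict for the <goods> being read
        self._field = None    # name of the field being read
        self._text = []       # text chunks of that field

    def startElement(self, name, attrs):
        if name == "goods":
            self._current = {"category": attrs.get("category")}
        elif name in self.FIELDS and self._current is not None:
            self._field = name
            self._text = []

    def characters(self, content):
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self._field:
            self._current[name] = "".join(self._text).strip()
            self._field = None
        elif name == "goods":
            self.records.append(self._current)
            self._current = None


handler = GoodsHandler()
xml.sax.parseString(STOREHOUSE_XML.encode("utf-8"), handler)
records = handler.records
for record in records:
    print(record)
```

Keeping the state on the handler instead of in a module-level list also makes it possible to parse several documents with separate handler instances.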