[Python] Processing XML Files with SAX and DOM

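Preface

SAX and DOM are both techniques for processing XML files, but they work differently. SAX is event-driven: the parser reads the XML file sequentially and fires an event for each construct it meets (the start of an element, the end of an element, character data), and the application processes the file by handling those events. DOM instead loads the entire XML file into memory as a tree, and the application works by traversing that tree. Each has trade-offs: SAX uses little memory but offers no random access to the document, while DOM allows free navigation at the cost of holding the whole document in memory. Which one to use depends on the requirements.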
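As a rough illustration of this contrast, the sketch below reads the same small document both ways. The sample XML string and the names `SAMPLE`, `NameCollector`, `sax_names`, and `dom_names` are made up for this example; only `xml.sax` and `xml.dom.minidom` come from the standard library.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document used only for this illustration.
SAMPLE = ("<storehouse><goods><name>carp</name></goods>"
          "<goods><name>kiwi</name></goods></storehouse>")


class NameCollector(xml.sax.ContentHandler):
    """SAX style: react to events while the parser streams through the input."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._text = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._text = []

    def characters(self, content):
        if self._in_name:
            self._text.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._text))
            self._in_name = False


handler = NameCollector()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
sax_names = handler.names

# DOM style: build the whole tree in memory first, then query it.
dom = minidom.parseString(SAMPLE)
dom_names = [node.firstChild.data for node in dom.getElementsByTagName("name")]

print(sax_names)  # ['carp', 'kiwi']
print(dom_names)  # ['carp', 'kiwi']
```

The SAX version never holds more than the current element's text, whereas the DOM version keeps the full tree available for later queries.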

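The SAX module

SAX parses an XML document with an event-driven model: it walks through the elements and attributes of the document one at a time and fires the corresponding event for each. Compared with the DOM model, SAX is more lightweight and is well suited to large XML documents.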

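Reading XML files with SAX

xml.sax is a package in the Python standard library for parsing XML documents. It provides an event-based API: as the document is parsed, events are fired and delivered to handler objects supplied by the application, which is how the document gets processed.

Common functions

make_parser creates and returns a SAX parser as an XMLReader object.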

def make_parser(parser_list=()):
    """Creates and returns a SAX parser.
    Creates the first parser it is able to instantiate of the ones
    given in the iterable created by chaining parser_list and
    default_parser_list.  The iterables must contain the names of Python
    modules containing both a SAX parser and a create_parser function."""

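In other words, make_parser chains parser_list and default_parser_list and returns the first parser it manages to instantiate. Every entry must be the name of a Python module that contains both a SAX parser and a create_parser function.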
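A minimal sketch of using make_parser directly might look like the following. The handler class `TagCounter` and the inline document are invented for the example, and the input is fed from an in-memory byte stream instead of a file.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class TagCounter(ContentHandler):
    """Hypothetical handler that counts opening tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


parser = xml.sax.make_parser()                # returns an XMLReader
parser.setFeature(feature_namespaces, False)  # plain (non-namespace) events
counter = TagCounter()
parser.setContentHandler(counter)
parser.parse(io.BytesIO(b"<a><b/><b/></a>"))  # file name or file object

print(counter.count)  # 3
```

Creating the parser yourself is mainly useful when you want to set features or swap handlers before parsing; otherwise the convenience functions below do the same wiring.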


parse creates a SAX parser and uses it to parse an XML document.

def parse(source, handler, errorHandler=ErrorHandler()):
    parser = make_parser()
    parser.setContentHandler(handler)
    parser.setErrorHandler(errorHandler)
    parser.parse(source)
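As a hedged usage sketch, parse can be pointed at a file on disk. The temporary file, its contents, and the `ElementNames` handler are assumptions made for this example only.

```python
import os
import tempfile
import xml.sax


class ElementNames(xml.sax.ContentHandler):
    """Hypothetical handler that records element names in document order."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)


# Write a small document to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("<storehouse><goods category='fish'/></storehouse>")
    path = tmp.name

try:
    handler = ElementNames()
    xml.sax.parse(path, handler)  # creates the parser and wires the handler
finally:
    os.remove(path)

print(handler.seen)  # ['storehouse', 'goods']
```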

parseString is similar to parse, but it parses XML from the string passed in the string argument.

def parseString(string, handler, errorHandler=ErrorHandler()):
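A small sketch of parseString, assuming an inline document and a made-up handler named `CategoryCollector` that reads the category attribute of each goods element:

```python
import xml.sax


class CategoryCollector(xml.sax.ContentHandler):
    """Hypothetical handler that collects the category of every <goods>."""

    def __init__(self):
        super().__init__()
        self.categories = []

    def startElement(self, name, attrs):
        if name == "goods":
            # attrs behaves like a read-only mapping of attribute names to values
            self.categories.append(attrs.get("category"))


doc = b'<storehouse><goods category="fish"/><goods category="fruit"/></storehouse>'
collector = CategoryCollector()
xml.sax.parseString(doc, collector)

print(collector.categories)  # ['fish', 'fruit']
```

This is convenient for XML that is already in memory, for example a response body, since no file needs to be written first.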

SAXException encapsulates an error or warning raised during XML processing.

class SAXException(Exception):
    """Encapsulate an XML error or warning. This class can contain
    basic error or warning information from either the XML parser or
    the application: you can subclass it to provide additional
    functionality, or to add localization. Note that although you will
    receive a SAXException as the argument to the handlers in the
    ErrorHandler interface, you are not actually required to raise
    the exception; instead, you can simply read the information in
    it."""
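In practice, malformed input usually surfaces as SAXParseException, a subclass of SAXException that also carries position information. The sketch below assumes a helper named `try_parse` (invented here) and a deliberately broken document.

```python
import xml.sax


def try_parse(data):
    """Return None on success, or (line, message) if the input is malformed."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
        return None
    except xml.sax.SAXParseException as exc:
        return exc.getLineNumber(), exc.getMessage()


ok = try_parse(b"<a><b/></a>")
bad = try_parse(b"<a>\n<b></a>")  # <b> is never closed

print(ok)   # None
print(bad)  # a (line number, message) tuple describing the error
```

Catching the narrower SAXParseException is a design choice for this sketch; catching SAXException would also cover the other SAX error types.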

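SAX parser

Its main job is to send events to the event handler.

SAX event handler

The event handler is implemented by subclassing the ContentHandler class.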

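The order of events in this interface mirrors the order of the information in the document: the startElement event of a parent element arrives before the events of its children, and its endElement event arrives after them.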
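To see this ordering in practice, here is a small sketch that logs every callback it receives. The class name `EventLogger` and the sample document are assumptions for the example.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records callbacks in the order they arrive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append("start:" + name)

    def endElement(self, name):
        self.events.append("end:" + name)

    def characters(self, content):
        if content.strip():  # skip whitespace-only chunks
            self.events.append("text:" + content.strip())


logger = EventLogger()
xml.sax.parseString(b"<goods><name>carp</name><price>8</price></goods>", logger)

for event in logger.events:
    print(event)
```

The log starts with startDocument, ends with endDocument, and every start/end pair of a child element sits inside the pair of its parent.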

class ContentHandler:
    """Interface for receiving logical document content events.
    This is the main callback interface in SAX, and the one most
    important to applications. The order of events in this interface
    mirrors the order of the information in the document."""
    def __init__(self):
        self._locator = None  # locator
    def setDocumentLocator(self, locator):
        """Called by the parser to give the application a locator for
        locating the origin of document events.
        SAX parsers are strongly encouraged (though not absolutely
        required) to supply a locator: if it does so, it must supply
        the locator to the application by invoking this method before
        invoking any of the other methods in the DocumentHandler
        interface.
        The locator allows the application to determine the end
        position of any document-related event, even if the parser is
        not reporting an error. Typically, the application will use
        this information for reporting its own errors (such as
        character content that does not match an application's
        business rules). The information returned by the locator is
        probably not sufficient for use with a search engine.
        Note that the locator will return correct information only
        during the invocation of the events in this interface. The
        application should not attempt to use it at any other time."""
        self._locator = locator
    def startDocument(self):
        """Receive notification of the beginning of a document.
        The SAX parser will invoke this method only once, before any
        other methods in this interface or in DTDHandler (except for
        setDocumentLocator)."""
    def endDocument(self):
        """Receive notification of the end of a document.
        The SAX parser will invoke this method only once, and it will
        be the last method invoked during the parse. The parser shall
        not invoke this method until it has either abandoned parsing
        (because of an unrecoverable error) or reached the end of
        input."""
    def startPrefixMapping(self, prefix, uri):
        """Begin the scope of a prefix-URI Namespace mapping.
        The information from this event is not necessary for normal
        Namespace processing: the SAX XML reader will automatically
        replace prefixes for element and attribute names when the
        http://xml.org/sax/features/namespaces feature is true (the
        default).
        There are cases, however, when applications need to use
        prefixes in character data or in attribute values, where they
        cannot safely be expanded automatically; the
        start/endPrefixMapping event supplies the information to the
        application to expand prefixes in those contexts itself, if
        necessary.
        Note that start/endPrefixMapping events are not guaranteed to
        be properly nested relative to each-other: all
        startPrefixMapping events will occur before the corresponding
        startElement event, and all endPrefixMapping events will occur
        after the corresponding endElement event, but their order is
        not guaranteed."""
    def endPrefixMapping(self, prefix):
        """End the scope of a prefix-URI mapping.
        See startPrefixMapping for details. This event will always
        occur after the corresponding endElement event, but the order
        of endPrefixMapping events is not otherwise guaranteed."""
    def startElement(self, name, attrs):
        """Signals the start of an element in non-namespace mode.
        The name parameter contains the raw XML 1.0 name of the
        element type as a string and the attrs parameter holds an
        instance of the Attributes class containing the attributes of
        the element."""
    def endElement(self, name):
        """Signals the end of an element in non-namespace mode.
        The name parameter contains the name of the element type, just
        as with the startElement event."""
    def startElementNS(self, name, qname, attrs):
        """Signals the start of an element in namespace mode.
        The name parameter contains the name of the element type as a
        (uri, localname) tuple, the qname parameter the raw XML 1.0
        name used in the source document, and the attrs parameter
        holds an instance of the Attributes class containing the
        attributes of the element.
        The uri part of the name tuple is None for elements which have
        no namespace."""
    def endElementNS(self, name, qname):
        """Signals the end of an element in namespace mode.
        The name parameter contains the name of the element type, just
        as with the startElementNS event."""
    def characters(self, content):
        """Receive notification of character data.
        The Parser will call this method to report each chunk of
        character data. SAX parsers may return all contiguous
        character data in a single chunk, or they may split it into
        several chunks; however, all of the characters in any single
        event must come from the same external entity so that the
        Locator provides useful information."""
    def ignorableWhitespace(self, whitespace):
        """Receive notification of ignorable whitespace in element content.
        Validating Parsers must use this method to report each chunk
        of ignorable whitespace (see the W3C XML 1.0 recommendation,
        section 2.10): non-validating parsers may also use this method
        if they are capable of parsing and using content models.
        SAX parsers may return all contiguous whitespace in a single
        chunk, or they may split it into several chunks; however, all
        of the characters in any single event must come from the same
        external entity, so that the Locator provides useful
        information."""
    def processingInstruction(self, target, data):
        """Receive notification of a processing instruction.
        The Parser will invoke this method once for each processing
        instruction found: note that processing instructions may occur
        before or after the main document element.
        A SAX parser should never report an XML declaration (XML 1.0,
        section 2.8) or a text declaration (XML 1.0, section 4.3.1)
        using this method."""
    def skippedEntity(self, name):
        """Receive notification of a skipped entity.
        The Parser will invoke this method once for each entity
        skipped. Non-validating processors may skip entities if they
        have not seen the declarations (because, for example, the
        entity was declared in an external DTD subset). All processors
        may skip external entities, depending on the values of the
        http://xml.org/sax/features/external-general-entities and the
        http://xml.org/sax/features/external-parameter-entities
        properties."""
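Two points from the docstrings above are easy to miss: the parser may hand over a locator through setDocumentLocator, and characters may deliver one piece of text in several chunks. The sketch below touches both. It assumes the default parser supplies a locator (the standard-library expat-based reader does), and the name `LocatedText` and the sample document are invented for the example.

```python
import xml.sax


class LocatedText(xml.sax.ContentHandler):
    """Hypothetical handler: joins text chunks and remembers where elements start."""

    def __init__(self):
        super().__init__()
        self.found = []          # (line number, element name, text)
        self._chunks = []
        self._start_line = None

    def startElement(self, name, attrs):
        self._chunks = []
        # self._locator was stored by the inherited setDocumentLocator()
        self._start_line = self._locator.getLineNumber()

    def characters(self, content):
        # Text may arrive in pieces, so collect and join it at endElement.
        self._chunks.append(content)

    def endElement(self, name):
        text = "".join(self._chunks).strip()
        if text:
            self.found.append((self._start_line, name, text))
        self._chunks = []


doc = b"<goods>\n  <name>carp</name>\n  <price>8</price>\n</goods>"
handler = LocatedText()
xml.sax.parseString(doc, handler)

for line, name, text in handler.found:
    print(line, name, text)
```

Joining chunks at endElement avoids the common mistake of treating each characters call as the complete text of an element.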

Complete example: parsing an XML file with SAX

SAX_parse_XML.py

import xml.sax

get_record = []  # collects the text extracted from the XML document
FIELDS = ("title", "name", "amount", "price")  # elements whose text we want

class GetStorehouse(xml.sax.ContentHandler):  # event handler
    def __init__(self):
        super().__init__()
        self.CurrentData = ""  # name of the element currently being read
        self.category = ""     # category attribute of the current <goods>
        self.text = ""         # character data collected for the current element
    def startElement(self, label, attributes):  # called at each opening tag
        self.CurrentData = label  # label is the tag name passed in by the parser
        self.text = ""
        if label == "goods":
            self.category = attributes["category"]
    def endElement(self, label):  # called at each closing tag
        if label in FIELDS:
            get_record.append(self.text.strip())
        self.CurrentData = ""  # reset so whitespace between elements is not recorded
    def characters(self, content):
        # character data may arrive in several chunks, so accumulate it
        if self.CurrentData in FIELDS:
            self.text += content
#=======
parser = xml.sax.make_parser()  # create an XMLReader parser object
parser.setFeature(xml.sax.handler.feature_namespaces, 0)  # turn off namespace processing
Handler = GetStorehouse()
parser.setContentHandler(Handler)
parser.parse("storehouse.xml")
print(get_record)

Output:

['淡水鱼', '鲫鱼', '18', '8', '温带水果', '猕猴桃', '10', '10']

storehouse.xml

<storehouse>
    <goods category="fish">
        <title>淡水鱼</title>
        <name>鲫鱼</name>
        <amount>18</amount>
        <price>8</price>
    </goods>
    <goods category="fruit">
        <title>温带水果</title>
        <name>猕猴桃</name>
        <amount>10</amount>
        <price>10</price>
    </goods>
</storehouse>
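The flat list above loses track of which values belong to which goods element, and the category attribute is not part of the result. One possible variation, sketched here under the assumption that one dictionary per goods element is wanted, groups the fields and keeps the category. The names `GoodsHandler`, `FIELDS`, and `SAMPLE` are made up for this example, and the document is embedded as a string so the block runs on its own.

```python
import xml.sax

FIELDS = ("title", "name", "amount", "price")

SAMPLE = """<storehouse>
    <goods category="fish">
        <title>淡水鱼</title>
        <name>鲫鱼</name>
        <amount>18</amount>
        <price>8</price>
    </goods>
    <goods category="fruit">
        <title>温带水果</title>
        <name>猕猴桃</name>
        <amount>10</amount>
        <price>10</price>
    </goods>
</storehouse>"""


class GoodsHandler(xml.sax.ContentHandler):
    """Hypothetical handler that builds one dict per <goods> element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None   # dict for the <goods> being read
        self._field = None     # name of the field element being read
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "goods":
            self._current = {"category": attrs.get("category")}
        elif name in FIELDS and self._current is not None:
            self._field = name
            self._chunks = []

    def characters(self, content):
        if self._field is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == self._field:
            self._current[name] = "".join(self._chunks).strip()
            self._field = None
        elif name == "goods" and self._current is not None:
            self.records.append(self._current)
            self._current = None


handler = GoodsHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)

for record in handler.records:
    print(record)
```

Keeping the state on the handler instance instead of in a module-level list also makes it possible to parse several documents independently.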