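I want to modify a large number of XML files that are stored in ZIP files. The source XMLs are UTF-8 encoded (at least according to the guess of the file tool on Linux) and have a correct XML declaration.
The target ZIP and the XMLs it contains should also have a correct XML declaration. However, the most obvious approach (at least to me), using ElementTree.tostring, fails.
Here is a self-contained example that should work out of the box. Please focus on the lower part, in particular # APPROACH 1, APPROACH 2 and APPROACH 3: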
import os
import tempfile
import zipfile
from xml.etree.ElementTree import Element, ElementTree, parse, tostring
src_1 = os.path.join(tempfile.gettempdir(), "one.xml")
src_2 = os.path.join(tempfile.gettempdir(), "two.xml")
src_zip = os.path.join(tempfile.gettempdir(), "src.zip")
trgt_appr1_zip = os.path.join(tempfile.gettempdir(), "trgt_appr1.zip")
trgt_appr2_zip = os.path.join(tempfile.gettempdir(), "trgt_appr2.zip")
trgt_appr3_zip = os.path.join(tempfile.gettempdir(), "trgt_appr3.zip")
# file on hard disk that must be used due to ElementTree insufficiencies
tmp_xml_name = os.path.join(tempfile.gettempdir(), "curr_xml.tmp")
# prepare src.zip
tree1 = ElementTree(Element('hello', {'beer': 'good'}))
tree1.write(os.path.join(tempfile.gettempdir(), "one.xml"), encoding="UTF-8", xml_declaration=True)
tree2 = ElementTree(Element('scnd', {'äkey': 'a value'}))
tree2.write(os.path.join(tempfile.gettempdir(), "two.xml"), encoding="UTF-8", xml_declaration=True)
with zipfile.ZipFile(src_zip, 'a') as src:
    with open(src_1, 'r', encoding="utf-8") as one:
        string_representation = one.read()
    # write to zip
    src.writestr(zinfo_or_arcname="one.xml", data=string_representation.encode("utf-8"))
    with open(src_2, 'r', encoding="utf-8") as two:
        string_representation = two.read()
    # write to zip
    src.writestr(zinfo_or_arcname="two.xml", data=string_representation.encode("utf-8"))
os.remove(src_1)
os.remove(src_2)
# read XMLs from zip
with zipfile.ZipFile(src_zip, 'r') as zfile:
    updated_trees = []
    for xml_name in zfile.namelist():
        curr_file = zfile.open(xml_name, 'r')
        tree = parse(curr_file)
        # modify tree
        updated_tree = tree
        updated_tree.getroot().append(Element('new', {'newkey': 'new value'}))
        updated_trees.append((xml_name, updated_tree))
for xml_name, updated_tree in updated_trees:
    # write to target file
    with zipfile.ZipFile(trgt_appr1_zip, 'a') as trgt1_zip, zipfile.ZipFile(trgt_appr2_zip, 'a') as trgt2_zip, zipfile.ZipFile(trgt_appr3_zip, 'a') as trgt3_zip:
        #
        # APPROACH 1 [DESIRED, BUT DOES NOT WORK]: write tree to zip-file
        # encoding in XML declaration missing
        #
        # create byte representation of elementtree
        byte_representation = tostring(element=updated_tree.getroot(), encoding='UTF-8', method='xml')
        # write XML directly to zip
        trgt1_zip.writestr(zinfo_or_arcname=xml_name, data=byte_representation)
        #
        # APPROACH 2 [WORKS IN THEORY, BUT DOES NOT WORK]: write tree to zip-file
        # encoding in XML declaration is faulty (is 'utf8', should be 'utf-8' or 'UTF-8')
        #
        # create byte representation of elementtree
        byte_representation = tostring(element=updated_tree.getroot(), encoding='utf8', method='xml')
        # write XML directly to zip
        trgt2_zip.writestr(zinfo_or_arcname=xml_name, data=byte_representation)
        #
        # APPROACH 3 [WORKS, BUT LACKS PERFORMANCE]: write to file, then read from file, then write to zip
        #
        # write to file
        updated_tree.write(tmp_xml_name, encoding="UTF-8", method="xml", xml_declaration=True)
        # read from file
        with open(tmp_xml_name, 'r', encoding="utf-8") as tmp:
            string_representation = tmp.read()
        # write to zip
        trgt3_zip.writestr(zinfo_or_arcname=xml_name, data=string_representation.encode("utf-8"))
    os.remove(tmp_xml_name)
APPROACH 3 works, but it consumes far more resources than the other two.
APPROACH 2 is the only way I could get an ElementTree object to be written with an actual XML declaration, which then turns out to be invalid (utf8 instead of UTF-8 / utf-8).
APPROACH 1 would be most desired, but fails during reading later in the pipeline, as the XML declaration is missing.
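The difference between APPROACH 1 and APPROACH 2 can be reproduced in isolation, without any ZIP handling. This is a minimal sketch, assuming CPython's standard xml.etree.ElementTree; the variable names approach_1 and approach_2 are only for illustration:

```python
from xml.etree.ElementTree import Element, tostring

root = Element('hello', {'beer': 'good'})

# APPROACH 1: with encoding='UTF-8' the returned bytes carry no XML declaration.
approach_1 = tostring(root, encoding='UTF-8', method='xml')

# APPROACH 2: the alias 'utf8' produces a declaration, but the alias itself
# is copied into the encoding attribute.
approach_2 = tostring(root, encoding='utf8', method='xml')

print(approach_1)
print(approach_2)
```

Printing both values makes it easy to see which of the two starts with `<?xml`.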
Question: How can I get rid of writing the whole XML to disk first, only to read it afterwards, write it to the zip and delete it after being done with the zip? What am I missing?
Question source: Stack Overflow
You can use an io.BytesIO object. This allows using ElementTree.write while avoiding exporting the tree to disk:
import zipfile
from io import BytesIO
from xml.etree.ElementTree import ElementTree, Element
tree = ElementTree(Element('hello', {'beer': 'good'}))
bio = BytesIO()
tree.write(bio, encoding='UTF-8', xml_declaration=True)
with zipfile.ZipFile('/tmp/test.zip', 'w') as z:
    z.writestr('test.xml', bio.getvalue())
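To check that the declaration really ends up in the archive, the same idea can be wrapped in a small helper and read back. The sketch below is illustrative only: tree_to_bytes is a hypothetical helper name, the archive is built in memory instead of under /tmp, and the last line assumes Python 3.8 or later, where tostring accepts an xml_declaration keyword:

```python
import zipfile
from io import BytesIO
from xml.etree.ElementTree import Element, ElementTree, tostring


def tree_to_bytes(tree):
    """Serialize an ElementTree to UTF-8 bytes, including the XML declaration."""
    buffer = BytesIO()
    tree.write(buffer, encoding='UTF-8', xml_declaration=True)
    return buffer.getvalue()


tree = ElementTree(Element('hello', {'beer': 'good'}))
payload = tree_to_bytes(tree)

# Build the archive in memory, then read the member back and keep its first line.
archive = BytesIO()
with zipfile.ZipFile(archive, 'w') as z:
    z.writestr('test.xml', payload)
with zipfile.ZipFile(archive, 'r') as z:
    first_line = z.read('test.xml').splitlines()[0]

print(first_line)

# Python 3.8+ only (assumption about the interpreter in use):
# tostring() can emit the declaration itself.
alternative = tostring(tree.getroot(), encoding='UTF-8', xml_declaration=True)
```

On interpreters older than 3.8 the last call raises a TypeError, so the BytesIO helper is the more portable of the two.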
If you are using Python 3.6 or higher, there's an even shorter solution: you can get a writable file object from the ZipFile object, which you can pass to ElementTree.write:
import zipfile
from xml.etree.ElementTree import ElementTree, Element
tree = ElementTree(Element('hello', {'beer': 'good'}))
with zipfile.ZipFile('/tmp/test.zip', 'w') as z:
    with z.open('test.xml', 'w') as f:
        tree.write(f, encoding='UTF-8', xml_declaration=True)
This also has the advantage that you don't keep multiple copies of the tree in memory, which can be a relevant issue for large trees.
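Applied to the read, modify, write loop from the question, this could look roughly as follows. It is a sketch under assumptions, not a drop-in replacement: copy_with_new_child is a hypothetical function name, and in-memory BytesIO objects stand in for src.zip and the target ZIP so that the example touches no files:

```python
import zipfile
from io import BytesIO
from xml.etree.ElementTree import Element, ElementTree, parse


def copy_with_new_child(src, trgt):
    """Copy every XML member from src to trgt, appending a <new> element to each root."""
    with zipfile.ZipFile(src, 'r') as zin, zipfile.ZipFile(trgt, 'w') as zout:
        for xml_name in zin.namelist():
            with zin.open(xml_name, 'r') as member:
                tree = parse(member)
            tree.getroot().append(Element('new', {'newkey': 'new value'}))
            # Serialize straight into the target archive (requires Python 3.6+).
            with zout.open(xml_name, 'w') as f:
                tree.write(f, encoding='UTF-8', xml_declaration=True)


# In-memory stand-in for src.zip with a single member.
src_zip = BytesIO()
with zipfile.ZipFile(src_zip, 'w') as z:
    with z.open('one.xml', 'w') as f:
        ElementTree(Element('hello', {'beer': 'good'})).write(
            f, encoding='UTF-8', xml_declaration=True)

trgt_zip = BytesIO()
copy_with_new_child(src_zip, trgt_zip)

with zipfile.ZipFile(trgt_zip, 'r') as z:
    result = z.read('one.xml')
print(result.decode('utf-8'))
```

Passing real file paths instead of the BytesIO objects should work the same way, since zipfile.ZipFile accepts either.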
Answer source: Stack Overflow