Original link: https://www.jianshu.com/p/a784f196b9c9
Source of this series: https://blog.ansheng.me/article/python-full-stack-way
Python’s interfaces for processing XML are grouped in the xml package.
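A delimited file has only two dimensions of data: rows and columns. To exchange data structures between programs, you need a way to encode hierarchies, sequences, sets, and other structures as text.
XML is the most prominent markup format for this kind of conversion. It uses tags to delimit data, as in the sample file menu.xml below: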
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>安生's Blog</title>
  <subtitle>大好时光!</subtitle>
  <link href="/atom.xml" rel="self"/>
  <link href="https://blog.ansheng.me/"/>
  <updated>2016-05-24T15:29:19.000Z</updated>
  <id>https://blog.ansheng.me/</id>
  <author>
    <name>安生</name>
  </author>
</feed>
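Some important features of XML
Tags begin with a < character, for example feed, title, subtitle, and author in the sample above.
Whitespace between tags is generally ignored.
Usually a start tag is followed by some content and then a matching end tag, for example <subtitle>大好时光!</subtitle>.
Tags can nest to any depth.
Optional attributes can appear within the start tag.
Tags can contain values.
If a tag named thing has no content or child tags, it can be written as a single tag with a slash before the closing angle bracket, for example <thing/> instead of a start and end pair such as <thing></thing>.
Data can be stored anywhere: in attributes, values, or child tags.
The simplest way to parse XML in Python is to use ElementTree.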
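The features listed above can be seen by parsing a trimmed copy of the sample feed. This is a minimal sketch: the shortened document, the ASCII placeholder text, and the `atom` prefix are assumptions made for illustration. Note that the feed declares a default namespace, so tag names must be namespace-qualified when searching.

```python
from xml.etree import ElementTree as ET

# A trimmed copy of the sample feed (ASCII placeholder text for brevity).
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <link href="/atom.xml" rel="self"/>
  <author>
    <name>ansheng</name>
  </author>
</feed>"""

root = ET.fromstring(FEED)
ns = {"atom": "http://www.w3.org/2005/Atom"}  # prefix chosen for this sketch

# The default namespace becomes part of every tag name.
print(root.tag)  # {http://www.w3.org/2005/Atom}feed

title = root.find("atom:title", ns)             # value between a tag pair
link = root.find("atom:link", ns)               # empty tag carrying attributes
name = root.find("atom:author/atom:name", ns)   # nested tag

print(title.text)
print(link.attrib)
print(name.text)
```

Searching for a bare `title` without the namespace would return None here, which is a common source of confusion when working with feeds.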
Creating an XML file
Import ElementTree under the alias ET
In [1]: from xml.etree import ElementTree as ET
Create the top-level tag
In [2]: level_1 = ET.Element("family")
Create a second-level tag; "name" is the tag name and attrib holds the tag's attributes
In [3]: level_2 = ET.SubElement(level_1,"name",attrib={"enrolled":"yes"})
Create a third-level tag
In [4]: level_3 = ET.SubElement(level_2,"age",attrib={"checked":"no"})
Build the document
In [5]: tree = ET.ElementTree(level_1)
Write it to a file
In [6]: tree.write('learn.xml',encoding='utf-8',short_empty_elements=False)
Note: short_empty_elements is a keyword-only argument added in Python 3.4. It controls how elements with no content are written: if True (the default), they are emitted as a single self-closing tag (for example <a />); if False, they are emitted as a pair of tags (for example <a></a>).
View the file
In [7]: cat learn.xml
<family><name enrolled="yes"><age checked="no"></age></name></family>
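To compare the two settings of short_empty_elements side by side, the sketch below serializes the same small tree to a string instead of a file. The element names are chosen for illustration only; `ET.tostring` accepts the same keyword argument as `write`.

```python
from xml.etree import ElementTree as ET

root = ET.Element("family")
ET.SubElement(root, "name", attrib={"enrolled": "yes"})

# Same tree, serialized with both settings.
short = ET.tostring(root, encoding="unicode", short_empty_elements=True)
paired = ET.tostring(root, encoding="unicode", short_empty_elements=False)

print(short)   # <family><name enrolled="yes" /></family>
print(paired)  # <family><name enrolled="yes"></name></family>
```

Only elements with no text and no children are affected; `family` has a child, so it is always written as a tag pair.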
Creating an XML file with line breaks
In [9]: from xml.etree import ElementTree as ET
In [10]: from xml.dom import minidom
In [11]: root = ET.Element('level1',{"age":"1"})
In [12]: son = ET.SubElement(root,"level2",{"age":"2"})
In [13]: ET.SubElement(son,"level3",{"age":"3"})
Out[13]: <Element 'level3' at 0x7efc16059bd8>
In [14]: def prettify(root):
    ...:     rough_string = ET.tostring(root,'utf-8')
    ...:     reparsed = minidom.parseString(rough_string)
    ...:     return reparsed.toprettyxml(indent=" ")
    ...:
# indent is the string placed before each tag, once per nesting level; for example, a two-space string indents each level by two spaces
In [15]: new_str = prettify(root)
In [16]: f = open("new_out.xml","w")
In [17]: f.write(new_str)
Out[17]: 99
In [18]: f.close()
# View the generated XML file
In [19]: cat new_out.xml
<?xml version="1.0" ?>
<level1 age="1">
 <level2 age="2">
  <level3 age="3"/>
 </level2>
</level1>
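As an alternative to the round trip through minidom, newer Python versions offer `ET.indent`, which adds the whitespace to the tree in place. The sketch below assumes Python 3.9 or later, where that function was introduced; on older versions the prettify helper above remains the way to go.

```python
from xml.etree import ElementTree as ET

root = ET.Element("level1", {"age": "1"})
son = ET.SubElement(root, "level2", {"age": "2"})
ET.SubElement(son, "level3", {"age": "3"})

# Modifies the tree in place by inserting newlines and indentation
# into the .text and .tail of each element (Python 3.9+).
ET.indent(root, space="  ")

pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Because the whitespace becomes part of the tree itself, a later `tree.write(...)` produces the indented layout without any extra string handling.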
Parsing XML
The contents of first.xml:
<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year age="19">2025</year>
        <gdppc>141100</gdppc>
        <neighbor direction="E" name="Austria" />
        <neighbor direction="W" name="Switzerland" />
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year age="19">2028</year>
        <gdppc>59900</gdppc>
        <neighbor direction="N" name="Malaysia" />
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year age="19">2028</year>
        <gdppc>13600</gdppc>
        <neighbor direction="W" name="Costa Rica" />
        <neighbor direction="E" name="Colombia" />
    </country>
</data>
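Parsing a string into an XML object with ElementTree.XML
In [21]: from xml.etree import ElementTree as ET
# Open the file, read the XML content, and parse the string into an Element object; root refers to the root node of the XML file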
In [22]: root = ET.XML(open('first.xml','r').read())
In [23]: root.tag
Out[23]: 'data'
In [24]: for node in root:
...: print(node.tag,node.attrib)
...:
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
# node still refers to the last element from the loop above (Panama)
In [26]: print(node.find('rank').text)
69
Parsing a file directly into an XML object with ElementTree.parse
In [27]: from xml.etree import ElementTree as ET
# Parse the XML file directly
In [28]: tree = ET.parse("first.xml")
# Get the root node of the XML file
In [29]: root = tree.getroot()
In [30]: root.tag
Out[30]: 'data'
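The two entry points differ in what they return: parsing a string yields the root Element directly, while `parse` yields an ElementTree wrapper whose `getroot()` gives the root. The sketch below shows both on the same content; the inline XML and the temporary file name are assumptions for illustration, and a `with` block is used so the file is closed promptly.

```python
import os
import tempfile
from xml.etree import ElementTree as ET

XML_TEXT = "<data><country name='Panama'><rank>69</rank></country></data>"

# From a string: fromstring() (like ET.XML) returns the root Element.
root_from_string = ET.fromstring(XML_TEXT)

# From a file: parse() returns an ElementTree; getroot() gives the root.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(XML_TEXT)
    tree = ET.parse(path)
    root_from_file = tree.getroot()

print(type(tree).__name__, root_from_file.tag)
print(root_from_string.find("country/rank").text)
```

Keeping the ElementTree object around is useful when the document will be written back with `tree.write(...)`.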
Iterating over specific nodes in the XML
In [31]: from xml.etree import ElementTree as ET
In [32]: tree = ET.parse("first.xml")
In [33]: root = tree.getroot()
In [34]: for node in root.iter('year'):
    ...:     # Print the node's tag and text content
    ...:     print(node.tag,node.text)
    ...:
year 2025
year 2028
year 2028
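It helps to know how `iter` differs from `find` and `findall`: `iter` walks the whole subtree, whereas `find` and `findall` with a plain tag name look only at direct children unless given a path. The following sketch uses a shortened, inline version of the data (an assumption for illustration) to show the difference.

```python
from xml.etree import ElementTree as ET

root = ET.fromstring("""
<data>
  <country name="Liechtenstein">
    <year>2025</year>
    <neighbor name="Austria"/>
  </country>
  <country name="Singapore">
    <year>2028</year>
    <neighbor name="Malaysia"/>
  </country>
</data>""")

# iter() searches the whole subtree, however deeply nested.
years = [node.text for node in root.iter("year")]

# findall() with a plain tag only looks at direct children,
# so asking the root for "year" finds nothing.
direct_years = root.findall("year")

# A path reaches deeper; a predicate filters on an attribute value.
singapore = root.find("country[@name='Singapore']")
neighbors = [n.get("name") for n in root.findall("country/neighbor")]

print(years)
print(direct_years)
print(singapore.findtext("year"))
print(neighbors)
```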
Adding, deleting, and modifying XML
Adding an attribute to a node
In [35]: from xml.etree import ElementTree as ET
In [36]: tree = ET.parse("first.xml")
In [37]: root = tree.getroot()
In [38]: for node in root.iter('year'):
    ...:     # Show the existing attributes
    ...:     print(node.attrib)
    ...:
{'age': '19'}
{'age': '19'}
{'age': '19'}
In [40]: for node in root.iter('year'):
    ...:     # Add an attribute; set() takes the name and the value as two arguments
    ...:     node.set("OS","Linux")
    ...:
In [41]: for node in root.iter('year'):
    ...:     print(node.attrib)
    ...:
{'OS': 'Linux', 'age': '19'}
{'OS': 'Linux', 'age': '19'}
{'OS': 'Linux', 'age': '19'}
# Write the changes back to the file
In [42]: tree.write("first.xml")
Removing an attribute from a node
In [43]: from xml.etree import ElementTree as ET
In [46]: tree = ET.parse("first.xml")
In [47]: root = tree.getroot()
In [48]: for node in root.iter("year"):
    ...:     # Delete the node's OS attribute
    ...:     del node.attrib['OS']
    ...:
In [49]: tree.write("first.xml")
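Two related points are worth sketching here. First, `del node.attrib['OS']` raises KeyError if the attribute is absent, so `attrib.pop` with a default is a more forgiving option. Second, whole elements can be deleted with the parent's `remove` method. The inline data and the rank threshold of 50 below are assumptions for illustration.

```python
from xml.etree import ElementTree as ET

root = ET.fromstring(
    "<data>"
    "<country name='Singapore'><rank>5</rank><year OS='Linux'>2028</year></country>"
    "<country name='Panama'><rank>69</rank><year>2028</year></country>"
    "</data>"
)

# pop() with a default does not raise when the attribute is missing,
# unlike `del node.attrib['OS']`.
for year in root.iter("year"):
    year.attrib.pop("OS", None)

# Remove whole elements: findall() returns a list, so it is safe to
# remove children from the parent while looping over the result.
for country in root.findall("country"):
    if int(country.findtext("rank")) > 50:
        root.remove(country)

remaining = [c.get("name") for c in root]
result = ET.tostring(root, encoding="unicode")
print(remaining)
print(result)
```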
Modifying node content
Increment the number inside each year tag by 1
In [51]: from xml.etree import ElementTree as ET
In [52]: tree = ET.parse("first.xml")
In [53]: root = tree.getroot()
In [54]: for node in root.iter("year"):
    ...:     print(node.text)
    ...:     new_year = int(node.text) + 1
    ...:     node.text = str(new_year)
    ...:
2025
2028
2028
In [55]: tree.write("first.xml")
In [57]: for node in root.iter("year"):
    ...:     print(node.text)
    ...:
2026
2029
2029
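One detail of `tree.write` matters once a document contains non-ASCII text: without an explicit encoding the output is US-ASCII, and other characters are written as numeric character references. The sketch below writes to in-memory buffers rather than first.xml so it can run anywhere; the sample data, including the city element, is an assumption for illustration.

```python
import io
from xml.etree import ElementTree as ET

root = ET.fromstring("<data><year>2025</year><city>上海</city></data>")

# Same update as above: increment each year by 1.
for node in root.iter("year"):
    node.text = str(int(node.text) + 1)

tree = ET.ElementTree(root)

# Default encoding is US-ASCII: non-ASCII text becomes &#...; references.
default_out = io.BytesIO()
tree.write(default_out)

# Explicit UTF-8 plus an XML declaration keeps the text readable.
utf8_out = io.BytesIO()
tree.write(utf8_out, encoding="utf-8", xml_declaration=True)

print(default_out.getvalue())
print(utf8_out.getvalue().decode("utf-8"))
```

Both outputs parse back to the same tree; the difference is only in how readable the file is for a person.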
Methods for working with nodes
Listing a node's methods
In [58]: from xml.etree import ElementTree as ET
In [59]: tree = ET.parse("first.xml")
In [60]: root = tree.getroot()
In [61]: print(dir(root))
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'extend', 'find', 'findall', 'findtext', 'get', 'getchildren', 'getiterator', 'insert', 'items', 'iter', 'iterfind', 'itertext', 'keys', 'makeelement', 'remove', 'set']
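Commonly used attributes and methods:
Name     Description
tag      The node's tag name (attribute)
attrib   The node's attributes as a dictionary (attribute)
find     Return the first matching child element, or None; read its content through .text
iter     Iterate over the node and all of its descendants, optionally filtered by tag
set      Set an attribute
get      Get an attribute value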
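A minimal sketch that exercises each entry in the list above on a single element; the inline XML is an assumption for illustration.

```python
from xml.etree import ElementTree as ET

country = ET.fromstring(
    '<country name="Panama"><rank updated="yes">69</rank></country>'
)

tag = country.tag                     # tag name
attrs = country.attrib                # all attributes as a dict
rank = country.find("rank")           # first matching child element
rank_text = rank.text                 # text inside that element
updated = rank.get("updated")         # one attribute value
missing = rank.get("missing", "n/a")  # default when the attribute is absent
rank.set("updated", "no")             # add or overwrite an attribute
nothing = country.find("gdppc")       # None when nothing matches

print(tag, attrs, rank_text, updated, missing, rank.attrib, nothing)
```

Since `find` returns None when there is no match, checking the result before reading `.text` avoids an AttributeError.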
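Examples
Checking whether a QQ number is online
A web API is available for checking whether a QQ number is online. It returns one of: Y = online; N = offline; E = invalid QQ number; A = commercial user verification failed; V = free user quota exceeded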
In [62]: import requests
In [63]: from xml.etree import ElementTree as ET
In [64]: r = requests.get("http://www.webxml.com.cn//webservices/qqOnlineWebService.asmx/qqCheckOnline?qqCode=1002751472")
In [65]: result = r.text
In [66]: node = ET.XML(result)
In [67]: print(node.text)
Y
In [68]: if node.text == "Y":
    ...:     print("online")
    ...: else:
    ...:     print("notonline")
    ...:
online
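The status codes listed above can be handled with a lookup table instead of a single comparison. The sketch below works offline on a sample payload; the payload shape (a single root element whose text is the code) is an assumption inferred from the session above, and `describe_status` is a hypothetical helper name.

```python
from xml.etree import ElementTree as ET

# Status codes as listed in the text above.
STATUS = {
    "Y": "online",
    "N": "offline",
    "E": "invalid QQ number",
    "A": "commercial user verification failed",
    "V": "free user quota exceeded",
}

def describe_status(xml_text):
    """Return a readable status for a response whose root text is the code."""
    code = (ET.fromstring(xml_text).text or "").strip()
    return STATUS.get(code, "unknown response: %r" % code)

# Assumed response shape for illustration; the real payload may differ.
sample = '<string xmlns="http://WebXml.com.cn/">Y</string>'
print(describe_status(sample))
```

With a live request, the same function could be called as `describe_status(r.text)`, ideally after checking the HTTP status of the response.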
Getting train arrival and departure times
In [1]: import requests
In [2]: from xml.etree import ElementTree as ET
In [3]: r = requests.get("http://www.webxml.com.cn/WebServices/TrainTimeWebService.asmx/getDetailInfoByTrainCode?TrainCode=K234&UserID=")
In [4]: result = r.text
In [5]: root = ET.XML(result)
In [6]: for node in root.iter('TrainDetailInfo'):
   ...:     print(node.find('TrainStation').text,node.find('ArriveTime').text,node.find("StartTime").text)
   ...:
上海(车次:K234K235) None 11:12:00
昆山 11:46:00 11:50:00
苏州 12:14:00 12:18:00
南京 15:05:00 15:15:00
蚌埠 17:57:00 18:03:00
徐州 20:20:00 20:25:00
砀山 21:24:00 21:27:00
商丘 22:10:00 22:16:00
宁陵县 22:43:00 22:48:00
兰考 23:21:00 23:24:00
开封 23:55:00 23:59:00
郑州 00:45:00 01:15:00
安阳 03:29:00 03:41:00
邯郸 04:16:00 04:34:00
邢台 05:05:00 05:09:00
石家庄 06:20:00 None
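The None values in the first and last rows come from elements with no text: the origin has no arrival time and the terminus has no departure time. One way to handle this is `findtext` combined with a placeholder, as in the sketch below. The sample XML is hypothetical and only mirrors the element names used in the loop above; the real response may be structured differently.

```python
from xml.etree import ElementTree as ET

# Hypothetical response fragment; element names follow the loop above.
SAMPLE = """<root>
  <TrainDetailInfo>
    <TrainStation>Shanghai</TrainStation>
    <ArriveTime/>
    <StartTime>11:12:00</StartTime>
  </TrainDetailInfo>
  <TrainDetailInfo>
    <TrainStation>Kunshan</TrainStation>
    <ArriveTime>11:46:00</ArriveTime>
    <StartTime>11:50:00</StartTime>
  </TrainDetailInfo>
</root>"""

def stops(xml_text):
    """Yield (station, arrive, start) tuples with "--" for missing values."""
    root = ET.fromstring(xml_text)
    for node in root.iter("TrainDetailInfo"):
        yield tuple(
            # findtext() gives "" for an empty element and None for a
            # missing one; `or "--"` replaces both with a placeholder.
            node.findtext(tag) or "--"
            for tag in ("TrainStation", "ArriveTime", "StartTime")
        )

for station, arrive, start in stops(SAMPLE):
    print(f"{station:<10}{arrive:<10}{start}")
```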