Python standard library: xml.etree.ElementTree
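Note: these are my study notes on xml.etree.ElementTree, written for my own learning. Links to any referenced material are included in the text.
Preface
I. ElementTree and Element
XML is an inherently hierarchical data format, and the most natural way to represent it is as a tree. The xml.etree.ElementTree module provides two classes for this: ElementTree represents the whole XML document as a tree, and Element represents a single node in that tree. The module is usually imported as import xml.etree.ElementTree as ET (ET for short).
1. Interactions with the whole document (reading from and writing to files) are usually done at the ElementTree level.
2. Interactions with a single XML element and its sub-elements are done at the Element level.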
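The split between the two levels can be sketched as follows. This is only an illustration: the small XML string and the names xml_text, buffer, and result are made up here, and an in-memory buffer stands in for a real file.

```python
import io
import xml.etree.ElementTree as ET

# A tiny made-up document, used only for illustration.
xml_text = '<root><item name="a">first</item><item name="b">second</item></root>'

# Document level: ET.parse accepts a file name or file object
# and returns an ElementTree.
tree = ET.parse(io.StringIO(xml_text))

# Element level: navigate to individual nodes and modify them.
root = tree.getroot()
root[0].set('name', 'changed')

# Document level again: serialize the whole tree.
# encoding='unicode' makes write() emit str, which io.StringIO expects.
buffer = io.StringIO()
tree.write(buffer, encoding='unicode')
result = buffer.getvalue()
print(result)
```

With a real file, the same calls would take a path instead of the in-memory buffers.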
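Element is a flexible container object designed to store hierarchical data structures in memory. It can be described as a cross between a list and a dictionary. Each Element has a number of properties associated with it, as follows:
Attribute | Type | Meaning | Access |
---|---|---|---|
tag | str | The name of the element | Element.tag |
attrib | dict | The element's attributes, as name/value pairs | Element.attrib |
text | str or None | The text between the element's start tag and its first child element (or its end tag, if it has no children) | Element.text |
tail | str or None | The text after the element's end tag and before the next tag | Element.tail |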
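To make the four properties concrete, here is a small sketch using a shortened sentence in the style of the sample corpus used in these notes. The variable names sent and event are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# A shortened sentence element; ET.fromstring returns the root Element.
sent = ET.fromstring(
    '<sentence id="ES0.0">China issues stern '
    '<event id="EE0.0">rebuke</event> over flight .</sentence>'
)
event = sent[0]  # the first (and only) child element

print(sent.tag)          # sentence
print(sent.attrib)       # {'id': 'ES0.0'}
print(repr(sent.text))   # 'China issues stern '
print(repr(event.text))  # 'rebuke'
print(repr(event.tail))  # ' over flight .'
print(sent.tail)         # None
```

Note that the text following the event element belongs to event.tail, not to sent.text, which is why sent.text alone does not give the whole sentence.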
II. Parsing XML
1. We will use the following XML document as sample data:
<Annotation created="16/05/2018" creator="XMLconverter">
<DocumentSet>
<document id="ED0" document_level_value="CT+">
<sentence id="ES0.0">China issues stern <event id="EE0.0" sentence_level_value="CT+">rebuke</event> over flight -EOP- .</sentence>
<sentence id="ES0.3">Surveillance aircraft intercepted by 2 J_10 fighter jets over East China Sea -EOP- .</sentence>
<sentence id="ES0.4">China urged the United States to immediately <event id="EE0.1" sentence_level_value="CT+">stop</event> its `` unsafe , unprofessional and unfriendly dangerous military activity '' after a US Navy surveillance plane flew in airspace over the East China Sea on Sunday .</sentence>
</document>
<document id="ED1" document_level_value="CT+">
<sentence id="ES1.0">Philippine President Duterte <event id="EE1.0" sentence_level_value="CT+">vows</event> for closer relations with China -EOP- .</sentence>
<sentence id="ES1.6">Visiting Chinese Foreign Minister Wang Yi LRB Front L RRB meets with Philippine President Rodrigo Duterte LRB Front R RRB in Manila , the Philippines , on July 25 , 2017 .</sentence>
<sentence id="ES1.8">MANILA _ Philippine President Rodrigo Duterte <event id="EE1.1" sentence_level_value="CT+">pledged</event> on Tuesday that his country is <event id="EE1.2" sentence_level_value="CT+">pledged1</event> to build stronger bilateral relations with China .</sentence>
</document>
</DocumentSet>
</Annotation>
The above is a corpus from our lab.
2. Loading the data
The code is as follows (example):
import xml.etree.ElementTree as ET
tree = ET.parse('F:/code/final/en_fin/data/english.xml')
root = tree.getroot()
As an Element, root has a tag and a dictionary of attributes:
>>> root.tag
'Annotation'
>>> root.attrib
{'created': '16/05/2018', 'creator': 'XMLconverter'}
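When the XML is already in memory as a string, ET.fromstring returns the root Element directly, without an ElementTree wrapper. The sketch below uses a cut-down version of the root element; the attribute name version is hypothetical and is only there to show the default value of get().

```python
import xml.etree.ElementTree as ET

# Cut-down version of the sample root element, for illustration.
root = ET.fromstring(
    '<Annotation created="16/05/2018" creator="XMLconverter">'
    '<DocumentSet/></Annotation>'
)

# get() returns the attribute value, or a default if it is absent,
# whereas root.attrib['version'] would raise KeyError.
created = root.get('created')
missing = root.get('version', 'n/a')  # 'version' is a hypothetical attribute

# Iterating over an Element yields its direct children.
child_tags = [child.tag for child in root]

print(created, missing, child_tags)
```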
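3. Reading the data
Read all the text inside a sentence:
for doc in root[0]:
    for sent in doc:
        print("sent :", sent.text)  # text before the first child element only
        s = ''
        for t in sent.itertext():
            s += t
        print(s)  # text of the whole sentence, including child elements
        break
    break
Output: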
sent : China issues stern
China issues stern rebuke over flight -EOP- .
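The same idea can be written more compactly with ''.join and Element.iter, which walks the tree recursively and so avoids the nested loops. This is a sketch on an abridged copy of the sample data (the second sentence is shortened); the dictionary name sentences is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample corpus, for illustration only.
xml_text = """<DocumentSet>
<document id="ED0">
<sentence id="ES0.0">China issues stern <event id="EE0.0">rebuke</event> over flight -EOP- .</sentence>
<sentence id="ES0.3">Surveillance aircraft intercepted .</sentence>
</document>
</DocumentSet>"""
root = ET.fromstring(xml_text)

# iter('sentence') yields every <sentence> element at any depth;
# itertext() yields the element's text, each descendant's text, and each tail.
sentences = {}
for sent in root.iter('sentence'):
    sentences[sent.get('id')] = ''.join(sent.itertext())

for sent_id, text in sentences.items():
    print(sent_id, '->', text)
```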
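Length: len() of an Element is the number of its direct child elements, so an element with no children has length 0. Here root[0][0] is the first document, which contains three sentences:
len(root[0][0])
Output:
3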
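A short sketch of how len() and indexing behave on elements with and without children. The two-sentence document below is made up for illustration. To check for children, use len(elem) explicitly: truth-testing an Element directly is deprecated.

```python
import xml.etree.ElementTree as ET

# Made-up document with one sentence that has an event and one that does not.
doc = ET.fromstring(
    '<document id="ED0">'
    '<sentence id="ES0.0">A <event id="EE0.0">b</event> c</sentence>'
    '<sentence id="ES0.3">No events here .</sentence>'
    '</document>'
)

n_sentences = len(doc)          # direct children of <document>
n_events_first = len(doc[0])    # children of the first sentence
n_events_second = len(doc[1])   # the second sentence has only text

# findall('sentence') returns the direct children with that tag.
with_events = [s.get('id') for s in doc.findall('sentence') if len(s) > 0]

print(n_sentences, n_events_first, n_events_second, with_events)
```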
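A complete example
import re

# Placeholder for illustration only: the real label2idx mapping is defined
# elsewhere and is not shown in these notes.
label2idx = {'CT+': 0}

for doc in root[0]:
    doc_id = doc.attrib['id']  # named doc_id to avoid shadowing the built-in id
    label = label2idx[doc.attrib['document_level_value']]
    sentence_list = []
    trigger_word_list = []
    flag = False
    for sent in doc:
        if sent.text == '-EOP-.' or sent.text == '.':
            continue
        # Join all text inside the sentence, including text in child elements.
        s = ''.join(sent.itertext())
        # In the sample above the marker appears as '-EOP- .' (with a space),
        # so this replace leaves it unchanged.
        s = s.replace('-EOP-.', '。').lower()
        print("sent.itertext:", s)
        # Lines matching a timestamp-like pattern are skipped; flag records
        # that the previous line was one.
        if re.match(r'\d{4}\D\d{2}\D\d{2}\D\d{2}:\d{2}\D$', s) is not None:
            flag = True
            continue
        elif flag:
            flag = False
            if len(sent) == 0:
                continue
        if len(s) <= 4:
            continue
        if len(sent) > 0:
            tmp = sent.text.lower() if sent.text is not None else ''
            for event in sent:
                print("sent.text:", tmp)
                # text and tail can be None, so fall back to an empty string.
                print("event.text:", (event.text or '').lower())
                print("event.tail:", (event.tail or '').lower())
Result: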
sent.itertext: china issues stern rebuke over flight -eop- .
sent.text: china issues stern
event.text: rebuke
event.tail: over flight -eop- .
sent.itertext: surveillance aircraft intercepted by 2 j_10 fighter jets over east china sea -eop- .
sent.itertext: china urged the united states to immediately stop its `` unsafe , unprofessional and unfriendly dangerous military activity '' after a us navy surveillance plane flew in airspace over the east china sea on sunday .
sent.text: china urged the united states to immediately
event.text: stop
event.tail: its `` unsafe , unprofessional and unfriendly dangerous military activity '' after a us navy surveillance plane flew in airspace over the east china sea on sunday .
sent.itertext: philippine president duterte vows for closer relations with china -eop- .
sent.text: philippine president duterte
event.text: vows
event.tail: for closer relations with china -eop- .
sent.itertext: visiting chinese foreign minister wang yi lrb front l rrb meets with philippine president rodrigo duterte lrb front r rrb in manila , the philippines , on july 25 , 2017 .
sent.itertext: manila _ philippine president rodrigo duterte pledged on tuesday that his country is pledged1 to build stronger bilateral relations with china .
sent.text: manila _ philippine president rodrigo duterte
event.text: pledged
event.tail: on tuesday that his country is
sent.text: manila _ philippine president rodrigo duterte
event.text: pledged1
event.tail: to build stronger bilateral relations with china .
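The output above shows that a sentence is the concatenation of sent.text followed by each event's text and tail. Building on that, here is one possible way to collect the trigger words together with their character offsets in the full sentence. The helper collect_events and its return format are hypothetical, not part of the original code, and the sketch assumes event elements contain only text (no nested elements).

```python
import xml.etree.ElementTree as ET

sent = ET.fromstring(
    '<sentence id="ES1.8">MANILA _ Philippine President Rodrigo Duterte '
    '<event id="EE1.1" sentence_level_value="CT+">pledged</event>'
    ' on Tuesday that his country is '
    '<event id="EE1.2" sentence_level_value="CT+">pledged1</event>'
    ' to build stronger bilateral relations with China .</sentence>'
)


def collect_events(sentence):
    """Return (full_text, triggers) for one sentence element.

    triggers is a list of (event id, trigger word, character offset).
    Hypothetical helper: assumes events have no nested child elements.
    """
    pieces = [sentence.text or '']
    triggers = []
    for event in sentence:
        offset = sum(len(p) for p in pieces)  # where this event's text starts
        word = event.text or ''
        triggers.append((event.get('id'), word, offset))
        pieces.append(word)
        pieces.append(event.tail or '')
    return ''.join(pieces), triggers


text, triggers = collect_events(sent)
print(text)
for event_id, word, offset in triggers:
    print(event_id, word, offset)
```

Keeping the offsets makes it possible to locate each trigger word in the plain sentence later, for example when aligning it with tokenizer output.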
Summary
Code is easiest to understand when it is read alongside concrete examples.