Note: these are my study notes on xml.etree.ElementTree, written for my own learning. Links to any sources I quote are included in the text.


Preface


I. ElementTree and Element

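XML is an inherently hierarchical data format, and the most natural way to represent it is as a tree. The xml.etree.ElementTree module provides two classes for this: ElementTree represents the whole XML document as a tree, and Element represents a single node in that tree. The module is usually imported as: import xml.etree.ElementTree as ET (ET for short).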

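1. Interaction with the whole document (reading from and writing to files) usually happens at the ElementTree level.
2. Interaction with a single XML element and its child elements happens at the Element level.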
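
The split between the two levels can be sketched as follows. This is only an illustration: the XML string and the variable names are made up, and an in-memory buffer stands in for a file on disk.

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in document (hypothetical, not taken from the corpus).
xml_source = '<root><item id="a">first</item><item id="b">second</item></root>'

# Document level: ElementTree parses from a file or file-like object.
tree = ET.parse(io.StringIO(xml_source))
root = tree.getroot()

# Element level: work with one node and its children.
first_item = root[0]
first_item.set('checked', 'yes')

# Document level again: serialize the whole tree.
buffer = io.BytesIO()
tree.write(buffer, encoding='utf-8')
serialized = buffer.getvalue().decode('utf-8')
print(serialized)
```

With a real file, ET.parse(path) and tree.write(path) would take the place of the two buffers.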

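Element is a flexible container object designed to store hierarchical data structures in memory. It can be described as a cross between a list and a dictionary. Each Element has a number of properties associated with it: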

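Attribute | Type | Meaning | Access
tag | str | The element's name | Element.tag
attrib | dict | The element's attributes | Element.attrib
text | str or None | Text between the element's start tag and its first child element (or its end tag, if it has no children) | Element.text
tail | str or None | Text after the element's end tag and before the next element's start tag | Element.tail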
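
To make the four properties concrete, here is a small sketch on a shortened version of one sentence from the sample corpus. The shortened string is my own; the property names are those listed above.

```python
import xml.etree.ElementTree as ET

sent = ET.fromstring(
    '<sentence id="ES0.0">China issues stern '
    '<event id="EE0.0">rebuke</event> over flight .</sentence>'
)
event = sent[0]          # list-like: index into the child elements

print(sent.tag)          # sentence
print(sent.attrib)       # {'id': 'ES0.0'}
print(repr(sent.text))   # 'China issues stern '
print(repr(event.text))  # 'rebuke'
print(repr(event.tail))  # ' over flight .'
print(sent.get('id'))    # dict-like: attribute lookup, prints ES0.0
```

Note that the text following the event belongs to event.tail, not to sent.text.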

II. Parsing XML

1. We will use the following XML document as sample data:

<Annotation created="16/05/2018"  creator="XMLconverter">
	<DocumentSet>
		<document id="ED0" document_level_value="CT+">
			<sentence id="ES0.0">China issues stern <event id="EE0.0" sentence_level_value="CT+">rebuke</event> over flight -EOP- .</sentence>
			<sentence id="ES0.3">Surveillance aircraft intercepted by 2 J_10 fighter jets over East China Sea -EOP- .</sentence>
			<sentence id="ES0.4">China urged the United States to immediately <event id="EE0.1" sentence_level_value="CT+">stop</event> its `` unsafe , unprofessional and unfriendly dangerous military activity '' after a US Navy surveillance plane flew in airspace over the East China Sea on Sunday .</sentence>
		</document>
		<document id="ED1" document_level_value="CT+">
			<sentence id="ES1.0">Philippine President Duterte <event id="EE1.0" sentence_level_value="CT+">vows</event> for closer relations with China -EOP- .</sentence>
			<sentence id="ES1.6">Visiting Chinese Foreign Minister Wang Yi LRB Front L RRB meets with Philippine President Rodrigo Duterte LRB Front R RRB in Manila , the Philippines , on July 25 , 2017 .</sentence>
			<sentence id="ES1.8">MANILA _ Philippine President Rodrigo Duterte <event id="EE1.1" sentence_level_value="CT+">pledged</event> on Tuesday that his country is <event id="EE1.2" sentence_level_value="CT+">pledged1</event> to build stronger bilateral relations with China .</sentence>
		</document>
	</DocumentSet>
</Annotation>

The above is a sample from our lab's corpus.

2. Loading the data

The code is as follows (example):

import xml.etree.ElementTree as ET
tree = ET.parse('F:/code/final/en_fin/data/english.xml')
root = tree.getroot()

As an Element, root has a tag and a dictionary of attributes:

>>> root.tag
'Annotation'
>>> root.attrib
{'created': '16/05/2018', 'creator': 'XMLconverter'}
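
Beyond tag and attrib, the root can be navigated like a list and queried like a dictionary. The sketch below assumes a trimmed copy of the sample document parsed from a string (so that it runs without the file); the 'version' attribute is a made-up name used only to show the default value of get().

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document: sentences left out.
xml_source = '''<Annotation created="16/05/2018" creator="XMLconverter">
  <DocumentSet>
    <document id="ED0" document_level_value="CT+"/>
    <document id="ED1" document_level_value="CT+"/>
  </DocumentSet>
</Annotation>'''
root = ET.fromstring(xml_source)

creator = root.get('creator')              # same as root.attrib['creator']
missing = root.get('version', 'n/a')       # get() returns a default instead of raising KeyError
child_tags = [child.tag for child in root]         # iterating yields direct children
doc_ids = [doc.get('id') for doc in root[0]]       # root[0] is <DocumentSet>

print(creator, missing, child_tags, doc_ids)
```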

3. Reading the data

Read all the text contained in a sentence element:

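for doc in root[0]:      # root[0] is <DocumentSet>; its children are <document> elements
    for sent in doc:     # each child of a <document> is a <sentence>
        print("sent :", sent.text)     # only the text before the first child element
        s = ''.join(sent.itertext())   # all text in the sentence, including nested <event> text
        print(s)
        break  # show only the first sentence
    break  # show only the first document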

Output:

sent : China issues stern 
China issues stern rebuke over flight -EOP- .
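
Instead of nested loops, Element.iter(tag) walks the whole subtree and yields every element with the given tag. The sketch below is one possible way to collect each sentence's full text together with its event words; the shortened XML string and the results list are my own additions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the sample data.
xml_source = '''<DocumentSet>
  <document id="ED0">
    <sentence id="ES0.0">China issues stern <event id="EE0.0">rebuke</event> over flight .</sentence>
    <sentence id="ES0.3">Surveillance aircraft intercepted .</sentence>
  </document>
</DocumentSet>'''
doc_set = ET.fromstring(xml_source)

results = []
for sent in doc_set.iter('sentence'):          # every <sentence> at any depth
    full_text = ''.join(sent.itertext())       # text of the sentence and its children
    triggers = [event.text for event in sent.iter('event')]
    results.append((sent.get('id'), triggers, full_text))

for row in results:
    print(row)
```

A sentence without any event simply gets an empty trigger list.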

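Length: len() of an Element is the number of its direct child elements, so only an element that has children has a nonzero length. Here root[0][0] is the first document, which contains three sentences.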

len(root[0][0])

Output:

3
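
As a quick check of what len() counts, the following sketch compares a sentence that contains an event with one that contains only text. Both strings are shortened examples of my own.

```python
import xml.etree.ElementTree as ET

sent_with_event = ET.fromstring(
    '<sentence>stern <event>rebuke</event> over flight</sentence>'
)
plain_sent = ET.fromstring('<sentence>Surveillance aircraft intercepted</sentence>')

# len() counts direct child elements, not characters of text.
print(len(sent_with_event))  # 1
print(len(plain_sent))       # 0
```

This is why the complete example below uses len(sent) > 0 to decide whether a sentence contains any event.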

A complete example


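import re

# Assumption: label2idx is not defined in these notes; this placeholder
# covers the only label that appears in the sample data.
label2idx = {'CT+': 0}

for doc in root[0]:
    doc_id = doc.attrib['id']
    label = label2idx[doc.attrib['document_level_value']]
    sentence_list = []
    trigger_word_list = []
    flag = False
    for sent in doc:
        # Skip sentences that consist only of a paragraph marker or a period.
        if sent.text == '-EOP-.' or sent.text == '.':
            continue
        s = ''.join(sent.itertext())
        # No effect on the sample data, where the marker is written '-EOP- .' with a space.
        s = s.replace('-EOP-.', '。').lower()
        print("sent.itertext: ", s)
        # Skip lines that look like a date-time stamp, and remember that one was seen.
        if re.match(r'\d{4}\D\d{2}\D\d{2}\D\d{2}:\d{2}\D$', s) is not None:
            flag = True
            continue
        elif flag:
            # Right after a timestamp, also skip a sentence that has no events.
            flag = False
            if len(sent) == 0:
                continue
        if len(s) <= 4:
            continue
        if len(sent) > 0:
            tmp = sent.text.lower() if sent.text is not None else ''
            for event in sent:
                print("sent.text: ", tmp)
                # text and tail can be None, so fall back to an empty string.
                print("event.text: ", (event.text or '').lower())
                print("event.tail: ", (event.tail or '').lower())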

Result:

sent.itertext:  china issues stern rebuke over flight -eop- .
sent.text:  china issues stern 
event.text:  rebuke
event.tail:   over flight -eop- .
sent.itertext:  surveillance aircraft intercepted by 2 j_10 fighter jets over east china sea -eop- .
sent.itertext:  china urged the united states to immediately stop its `` unsafe , unprofessional and unfriendly dangerous military activity '' after a us navy surveillance plane flew in airspace over the east china sea on sunday .
sent.text:  china urged the united states to immediately 
event.text:  stop
event.tail:   its `` unsafe , unprofessional and unfriendly dangerous military activity '' after a us navy surveillance plane flew in airspace over the east china sea on sunday .
sent.itertext:  philippine president duterte vows for closer relations with china -eop- .
sent.text:  philippine president duterte 
event.text:  vows
event.tail:   for closer relations with china -eop- .
sent.itertext:  visiting chinese foreign minister wang yi lrb front l rrb meets with philippine president rodrigo duterte lrb front r rrb in manila , the philippines , on july 25 , 2017 .
sent.itertext:  manila _ philippine president rodrigo duterte pledged on tuesday that his country is pledged1 to build stronger bilateral relations with china .
sent.text:  manila _ philippine president rodrigo duterte 
event.text:  pledged
event.tail:   on tuesday that his country is 
sent.text:  manila _ philippine president rodrigo duterte 
event.text:  pledged1
event.tail:   to build stronger bilateral relations with china .
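
The output shows how a sentence is split across sent.text, each event.text, and each event.tail. As a small sketch (assuming, as in this corpus, that events are direct children of the sentence and are not nested), concatenating those pieces in order gives back the same string as itertext(). The shortened sentence is my own.

```python
import xml.etree.ElementTree as ET

sent = ET.fromstring(
    '<sentence>Duterte <event>pledged</event> on Tuesday that his country is '
    '<event>pledged1</event> to build stronger relations .</sentence>'
)

# text comes first, then each child's text followed by its tail.
pieces = [sent.text or '']
for event in sent:
    pieces.append(event.text or '')
    pieces.append(event.tail or '')
rebuilt = ''.join(pieces)

print(rebuilt)
print(rebuilt == ''.join(sent.itertext()))
```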

Summary

Code is easiest to understand when read alongside a worked example.
