NER FAQ: Converting Between BIO, BIOES, and BMES Tagging Schemes
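This was the first task I was given during my internship.
The training set of the People's Daily (人民日报) dataset is annotated in BIO format,
and we needed to convert it to BIOES and BMES.
First, BIO to BMES: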
path = r'./input/data_train.txt'
res_path = r'./output/BMES.txt'

# Read the BIO file: one "char label" pair per line, blank line between sentences.
sentences = []
sentence = []
label_set = set()  # distinct labels seen, useful as a sanity check
with open(path, encoding='utf-8') as f:
    for line in f:
        line = line.rstrip('\r\n')
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                sentences.append(sentence)
                sentence = []
            continue
        splits = line.split(' ')
        sentence.append([splits[0], splits[-1]])
        label_set.add(splits[-1])
if sentence:
    # The last sentence may not be followed by a blank line.
    sentences.append(sentence)

# Rewrite each label by looking at the label that follows it.
with open(res_path, 'w', encoding='utf-8') as f1:
    for sen in sentences:
        for index, (char, label) in enumerate(sen):
            # At the end of a sentence, behave as if the next label were 'O'.
            next_prefix = sen[index + 1][1][0] if index < len(sen) - 1 else 'O'
            if label[0] == 'B' and next_prefix != 'I':
                # B followed by O, by another B, or by nothing: single-character entity.
                label = 'S' + label[1:]
            elif label[0] == 'I':
                # Middle if the entity continues, otherwise end.
                label = ('M' if next_prefix == 'I' else 'E') + label[1:]
            # A B followed by I, and every O, stay unchanged.
            f1.write(f'{char} {label}\n')
        f1.write('\n')
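To check the rule without touching any files, the same conversion can be sketched as a pure function over one sentence's tags. The helper name `bio_to_bmes` and the `B-PER`-style tag layout (prefix, hyphen, entity type) are assumptions made for illustration, not part of the original script. Unlike the script above, this sketch also compares entity types, so an `I-` tag of a different type does not count as a continuation.

```python
def bio_to_bmes(tags):
    """Convert one sentence's BIO tags to BMES (hypothetical helper)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            out.append(tag)
            continue
        prefix, etype = tag[0], tag[1:]  # e.g. 'B' and '-PER'
        # Past the end of the sentence, behave as if the next tag were 'O'.
        nxt = tags[i + 1] if i + 1 < len(tags) else 'O'
        # The entity continues only if the next tag is an I- tag of the same type.
        continues = nxt[0] == 'I' and nxt[1:] == etype
        if prefix == 'B':
            out.append(('B' if continues else 'S') + etype)
        else:  # prefix == 'I'
            out.append(('M' if continues else 'E') + etype)
    return out


example = ['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'B-ORG', 'I-ORG']
print(bio_to_bmes(example))
# ['B-PER', 'M-PER', 'E-PER', 'O', 'S-LOC', 'B-ORG', 'E-ORG']
```

The `B-LOC` directly followed by `B-ORG` is the case worth testing: the first entity is a single character and must become `S-LOC`.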
Next, BMES to BIOES. The two schemes differ only in the middle tag: M in BMES corresponds to I in BIOES.
with open(r'./output/BMES.txt', 'r', encoding='utf-8') as f, \
        open(r'./output/BIOES.txt', 'w', encoding='utf-8') as f1:
    for line in f:
        parts = line.split()
        if not parts:
            # Keep the single blank line that separates sentences.
            f1.write('\n')
            continue
        # parts[0] is the character; every remaining column is a label.
        labels = ['I' + p[1:] if p[0] == 'M' else p for p in parts[1:]]
        f1.write(' '.join([parts[0]] + labels) + '\n')
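Because only one prefix changes, the mapping is easy to test in isolation and to invert. The sketch below uses two hypothetical helpers, `bmes_to_bioes` and `bioes_to_bmes`, and assumes every tag is either `O` or a one-letter prefix followed by the entity type.

```python
def bmes_to_bioes(tags):
    """Rename the BMES 'M' prefix to the BIOES 'I' prefix (hypothetical helper)."""
    return ['I' + t[1:] if t[:1] == 'M' else t for t in tags]


def bioes_to_bmes(tags):
    """Inverse mapping: rename the 'I' prefix back to 'M'."""
    return ['M' + t[1:] if t[:1] == 'I' else t for t in tags]


bmes = ['B-PER', 'M-PER', 'E-PER', 'O', 'S-LOC']
bioes = bmes_to_bioes(bmes)
print(bioes)  # ['B-PER', 'I-PER', 'E-PER', 'O', 'S-LOC']

# Converting back should reproduce the input exactly.
assert bioes_to_bmes(bioes) == bmes
```

Only the first character of each tag is inspected, so an entity type that happens to start with M (for example `B-MISC`) is left alone.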
Models trained with different tagging schemes end up with different recall, so these conversions will come in handy often.