jieba segmentation + stopword removal
Text Preprocessing
Preface
This post shows how to perform simple segmentation of Chinese text with jieba and save the result. For a more detailed walkthrough of segmentation itself, see: How to use jieba segmentation
1. jieba segmentation
1.1 First, create a text file named: origin_dataset.txt
1.2 Implementation
import jieba

# Read the raw text and segment it
with open('./data/origin_dataset.txt', encoding='utf-8') as f1:
    document = f1.read()

document_cut = jieba.cut(document)          # returns a generator of tokens
result = ' '.join(document_cut)             # join tokens with spaces
print(result)

# Write the space-separated tokens back out
# (encoding='utf-8' avoids platform-dependent default encodings)
with open('./data/out_origin_dataset.txt', 'w', encoding='utf-8') as f2:
    f2.write(result)
1.3 Result: the Chinese text is segmented successfully, but stop words (punctuation marks, etc.) remain.
2. Removing stop words
2.1 Load the stopword file: stop_words.txt
2.2 jieba segmentation + stopword removal
# Load stop words into a set for fast membership tests
with open('./data/stop_words.txt', encoding='utf-8') as f:
    stopwords = {line.rstrip() for line in f}

# Accumulate kept tokens across ALL lines (the original code reset the
# list inside the loop, so only the last line's tokens were written);
# also avoid shadowing the built-in name `list`
words = []
with open('./data/origin_dataset.txt', encoding='utf-8') as f3:
    for line in f3:
        segs = jieba.cut(line, cut_all=False)
        for seg in segs:
            if seg not in stopwords:
                words.append(seg)

with open('./data/out_remove_stop_word.txt', 'w', encoding='utf-8') as f4:
    f4.write(' '.join(words))
2.3 Result