jieba Word Segmentation + Stopword Removal


CandyLaº

5794 views · 2022-06-03 19:28:43


Text Preprocessing

Preface
1. jieba word segmentation
2. Removing stopwords

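Preface

This post uses jieba to perform simple word segmentation on Chinese text and save the result to a file. For more detail on segmentation itself, see the post "Using jieba for word segmentation" (jieba分词的使用).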

1. jieba word segmentation

1.1 First, create a text file named origin_dataset.txt

[Figure: contents of origin_dataset.txt]

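1.2 Code

import jieba

# Read the whole source document as UTF-8.
with open('./data/origin_dataset.txt', encoding='utf-8') as f1:
    document = f1.read()

# jieba.cut returns a generator of tokens; join them with single spaces.
document_cut = jieba.cut(document)
result = ' '.join(document_cut)
print(result)

# Write the segmented text. The encoding is given explicitly so the output
# does not depend on the platform default; the with blocks close both files.
with open('./data/out_origin_dataset.txt', 'w', encoding='utf-8') as f2:
    f2.write(result)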
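For larger files, reading the whole document into memory may be undesirable. The following is a minimal sketch of a line-by-line variant of the step above, not the author's method. The function name `segment_file` and its `cut` parameter are hypothetical; `cut` can be any callable that turns a string into tokens, and `jieba.cut` is used when it is omitted.

```python
def segment_file(src_path, dst_path, cut=None):
    """Segment src_path line by line and write space-joined tokens to dst_path.

    `cut` is any callable mapping a string to an iterable of tokens
    (a hypothetical parameter for illustration). Defaults to jieba.cut.
    """
    if cut is None:
        import jieba  # imported lazily so another tokenizer can be supplied
        cut = jieba.cut

    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            # Strip the line break before cutting so it does not become a token.
            tokens = cut(line.rstrip('\r\n'))
            dst.write(' '.join(tokens) + '\n')
```

Calling `segment_file('./data/origin_dataset.txt', './data/out_origin_dataset.txt')` would segment the same file as above with jieba, keeping one output line per input line.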

1.3 Result: the Chinese text has been segmented successfully, but some stopwords (punctuation and the like) remain

[Figure: segmented output]
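To check how much punctuation survives segmentation, one option is to test each token against the Unicode punctuation categories. This is only a sketch: the helper name `is_punctuation` is hypothetical, and the token list is hand-written for illustration rather than real jieba output.

```python
import unicodedata


def is_punctuation(token):
    """Return True if every character of a non-empty token is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith('P') for ch in token
    )


# Hand-written tokens for illustration only (not actual jieba output).
tokens = ['今天', '天气', '很', '好', '，', '我们', '去', '公园', '。']
leftover = [t for t in tokens if is_punctuation(t)]
print(leftover)  # ['，', '。']
```

A check like this only catches punctuation; function words such as 的 still need a stopword list.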

2. Removing stopwords

2.1 Prepare the stopword file: stop_words.txt

[Figure: contents of stop_words.txt]

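2.2 jieba segmentation + stopword removal

import jieba

# Load the stopwords (one per line) into a set for fast membership tests.
with open('./data/stop_words.txt', encoding='utf-8') as f:
    stopwords = {line.rstrip() for line in f}

# Collect tokens across all lines. Creating the list inside the loop
# would keep only the tokens of the last line.
words = []
with open('./data/origin_dataset.txt', encoding='utf-8') as f3:
    for line in f3:
        # cut_all=False selects jieba's precise mode (the default).
        for seg in jieba.cut(line, cut_all=False):
            if seg not in stopwords:
                words.append(seg)

with open('./data/out_remove_stop_word.txt', 'w', encoding='utf-8') as f4:
    f4.write(' '.join(words))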
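The filtering step can also be factored into small reusable helpers. The sketch below makes two assumptions that go beyond the original code: blank lines in the stopword file are skipped, and whitespace-only tokens (spaces and line breaks, which a tokenizer may emit) are dropped as well. The names `load_stopwords` and `remove_stopwords` are hypothetical, and the sample data is hand-written.

```python
def load_stopwords(path):
    """Read one stopword per line into a set, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep tokens that are neither stopwords nor pure whitespace."""
    return [t for t in tokens if t.strip() and t not in stopwords]


# Hand-written tokens and stopwords for illustration only.
tokens = ['我', '的', '书', '，', ' ', '很', '好', '\n']
stopwords = {'的', '，', '很'}
print(remove_stopwords(tokens, stopwords))  # ['我', '书', '好']
```

With a real file, `remove_stopwords(jieba.cut(text), load_stopwords('./data/stop_words.txt'))` would combine the two helpers with jieba's tokenizer.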

2.3 Result

[Figure: output after stopword removal]
