cover

Python中英文词频统计

Python中英文词频统计的简单使用方法

桜キャンドル淵

6425人浏览 · 2022-04-10 09:54:21

桜キャンドル淵 · 2022-04-10 09:54:21 发布

英文词频统计

以下代码为英文的词频统计，其中article为我们要统计词频的文章。

article='''Last sunday, it was a fine day. My friend and I went to Mount Daifu.
In the morning, we rode bikes to the foot of the mountain. After a short rest, we climbed the mountain. on its peak, we shared the beautiful scenery in our eyes. There were lots of light foggy clouds around us. What's more, varieties of birds flying around us were pretty. What a harmony situation! At noon, we had lunch in a restaurant. With a happy emotion, we finished our climbing. Although tired, we still felt happy and relaxed. we hope our country will become more and more beautiful.
'''
#替换标点符号，将文章中所有的标点符号全部去掉
article=article.replace(',','').replace('.','').replace(';','').replace(':','').replace('?','')
#大写字母转换为小写字母
exchange=article.lower()
#生成单词列表,split函数会将文章按照空格切割开来，生成一个列表
list=exchange.split()
print(list)
#生成词频统计，这里我们用字典类型统计词频
dict={}
for my_key in list:
    count=list.count(my_key)
    dict[my_key]=count
print(dict)
#排序
dict_sorted=sorted(dict.items(),key=lambda x:x[1],reverse=True)

print(dict_sorted)
#输出词频最大的前十位
for i in range (10):
    print(dict_sorted[i])

中文词频统计

对中文进行词频统计，我们需要安装jieba的包。其统计方法和原理大致和英文相同

import jieba
article='''气候原因也为此次搜救增加了难度，第四次发布会上，广西消防救援总队总队长郑西表示，连日来藤县地区连续降雨，给现场救援增加了不少难度。一是降雨造成了事故核心区域的积水；截至发稿时，MU5735的搜寻援救行动已经持续了超过72小时。3月24日，广西消防救援总队作战训练处副处长黄尚武在接受央视采访时表示，目前救援队伍的任务以搜救生命为主，使用了热成像仪、生命探测仪进行表层搜索，但是由于天气原因，搜寻仪器的信号相对较弱，因此采用人工搜索和空中无人机搜索配合'''
article=article.replace('，','').replace('。','').replace('；','').replace('：','').replace('？','')
#调用jieba包中的lcut方法将整个文本切割成不同的词组
list=jieba.lcut(article)
print(list)
dict={}
#统计词频，将每个单词出现的次数统计到dict字典中
for my_key in list:
    count=list.count(my_key)
    dict[my_key]=count
print(dict)
#排序，此处的x[1]表示取出来的是键值对后面的值，reverse=TRUE表示降序排序
dict_sorted=sorted(dict.items(),key=lambda x:x[1],reverse=True)

print(dict_sorted)
#输出词频最大的前十位
for i in range (10):
    print(dict_sorted[i])

华为开发者空间

华为开发者空间，是为全球开发者打造的专属开发空间，汇聚了华为优质开发资源及工具，致力于让每一位开发者拥有一台云主机，基于华为根生态开发、创新。

更多推荐

cover

解读TaurusDB字段压缩：减少存储成本，避免语句大量修改

华为开发者空间

cover

GaussDB 使用gs_loader工具导入数据

华为开发者空间

如何基于Sharding-JDBC实现GaussDB在客户端应用的读写分离

使用sharing-jdbc中间件实现GaussDB读写分离操作，在服务器资源吃紧与高并发场景下可以考虑采用读写分离架构减轻负载。

华为开发者空间

所有评论(0)

查看更多评论

桜キャンドル淵

@weixin_62684026

已为社区贡献4条内容