Text Word-Frequency Counting in Python (Prof. Song Tian)

华泽的花 · Published 2022-04-02 11:33:18

Example 10: Text Word-Frequency Counting

Source texts

English text: Hamlet

https://python123.io/resources/pye/hamlet.txt

Chinese text: Romance of the Three Kingdoms (《三国演义》)

https://python123.io/resources/pye/threekingdoms.txt

Code (Hamlet):

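# CalHamlet1.py
def getText():
    # Read the whole file into one string
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()  # normalize case so "The" and "the" count as one word
    # Replace punctuation with spaces so that split() separates words cleanly
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_{|}.~’‘':
        txt = txt.replace(ch, " ")
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
# Note: the text file must be in the same folder as the code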

Line-by-line analysis:

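# CalHamlet1.py
def getText():
    txt = open("hamlet.txt", "r").read()
Open the file to be processed and read its entire contents into a string.
    txt = txt.lower()
Convert every uppercase letter in the text to lowercase.
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_{|}.~’‘':
        txt = txt.replace(ch, " ")
    return txt
Replace every special character in the text with a space, then return the processed text.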
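
Replacing each special character with a space, rather than with an empty string, matters when punctuation is the only thing separating two words. The sketch below illustrates the difference; `strip_punctuation`, `PUNCTUATION`, and the sample sentence are hypothetical names and data chosen for illustration, and the character set is an ASCII-only subset of the one used in getText().

```python
# Illustrative ASCII-only subset of the characters handled by getText()
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_{|}~'

def strip_punctuation(text, replacement):
    """Replace every punctuation character in text with `replacement`."""
    for ch in PUNCTUATION:
        text = text.replace(ch, replacement)
    return text

sample = "to be,or not to be--that is the question."

# With a space, words joined only by punctuation are still separated.
print(strip_punctuation(sample, " ").split())
# With an empty string, "be,or" collapses into the single token "beor".
print(strip_punctuation(sample, "").split())
```

With the empty-string replacement, the counts for "be", "or", and "that" would all come out too low for this sample.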
hamletTxt = getText()
Call getText() and assign the returned text to hamletTxt.
words = hamletTxt.split()
Split hamletTxt on whitespace into a list of words and assign the list to words.
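counts = {}
Create an empty dictionary that will map each word to its count.
for word in words:
    counts[word] = counts.get(word, 0) + 1
counts.get(word, 0) returns the current count if word is already a key, and the default 0 if it is not; adding 1 then accumulates the number of occurrences.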
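
A minimal sketch of this counting idiom on a tiny hand-written word list; the function name `count_words` and the sample list are assumptions made for illustration, not part of the original script.

```python
def count_words(words):
    """Count occurrences of each word using dict.get with a default of 0."""
    counts = {}
    for word in words:
        # First sighting: get() returns 0, so the stored value becomes 1.
        # Later sightings: get() returns the running count, which grows by 1.
        counts[word] = counts.get(word, 0) + 1
    return counts

print(count_words(["to", "be", "or", "not", "to", "be"]))
```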
items = list(counts.items())
Convert the dictionary into a list; each key-value pair becomes a tuple in the list.
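items.sort(key=lambda x: x[1], reverse=True)
key is a function that returns the value to compare for each element. Here it is a lambda (an
anonymous function) whose parameter x receives each element of the list in turn, that is, each
(word, count) tuple; x is only a parameter name, so any other name works equally well. x[1] is the
second element of the tuple, so the list is sorted by count. reverse=True sorts in descending order
(largest first); reverse=False, the default, sorts in ascending order.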
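
The sorting step can be tried in isolation on a small dictionary. This is only a sketch; `sort_by_count` and the sample counts are hypothetical. Because list.sort() is stable, words with equal counts keep their original relative order.

```python
def sort_by_count(counts):
    """Return (word, count) tuples sorted by count, largest first."""
    items = list(counts.items())
    # pair is one (word, count) tuple; pair[1] selects the count as the sort key
    items.sort(key=lambda pair: pair[1], reverse=True)
    return items

print(sort_by_count({"or": 1, "to": 2, "be": 2}))
```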
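for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Loop over the first ten items and print each word, left-aligned in 10 columns, followed by its count, right-aligned in 5 columns.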
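
As an alternative to the hand-written dictionary and sort, the standard library's collections.Counter can do the counting and the top-N selection. This is a different technique from the one in the script above, shown only as a sketch; `top_words` is a hypothetical helper name.

```python
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent whitespace-separated words in text."""
    # Counter counts the words; most_common(n) returns (word, count) pairs
    # sorted by count, and simply returns fewer pairs if there are fewer words.
    return Counter(text.split()).most_common(n)

for word, count in top_words("to be or not to be", 3):
    print("{0:<10}{1:>5}".format(word, count))
```

Unlike indexing items[i] inside range(10), this does not raise an IndexError when the text has fewer than ten distinct words.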

Code (Romance of the Three Kingdoms):

# CalThreeKingdomsV1.py
import jieba  # third-party Chinese word-segmentation library

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
# Frequent words that are not character names
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    # Skip single-character tokens, then merge different names
    # for the same character into one key
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    # Count every kept word, not only those that reach the else branch
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    del counts[word]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
# Note: this output is not yet the final result; the code can be refined further
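
One possible refinement, sketched under assumptions: keep the name merging in a lookup dictionary instead of an if/elif chain, and skip excluded words during counting so that a word missing from the text cannot raise a KeyError. `ALIASES`, `EXCLUDES`, and `count_names` are hypothetical names, and the token list is a hand-written stand-in for jieba.lcut() output, not real segmentation results.

```python
# Alias -> canonical name, taken from the if/elif chain in the script above
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}

def count_names(words, aliases=ALIASES, excludes=EXCLUDES):
    """Count words, merging aliases and dropping excluded or one-character tokens."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        name = aliases.get(word, word)  # fall back to the word itself
        if name in excludes:
            continue
        counts[name] = counts.get(name, 0) + 1
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

# Hand-written tokens standing in for jieba.lcut(txt)
tokens = ["孔明", "曰", "诸葛亮", "将军", "玄德曰", "刘备", "孔明曰"]
for name, count in count_names(tokens):
    print("{0:<10}{1:>5}".format(name, count))
```

Adding a new alias or excluded word then only means editing the two tables, not the loop.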