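1. Download a full-length Chinese novel.
Novel: 鹿鼎记 (The Deer and the Cauldron); author: 金庸 (Jin Yong)
2. Read the text to be analyzed from a file.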
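Step 2 comes without code, so here is a minimal sketch of one way to do it. It assumes the downloaded novel is either UTF-8 or a GB-family encoding, which is common for Chinese text files but not guaranteed. The helper name `read_text` and the list of encodings are illustrative choices, not something taken from the original post.

```python
def read_text(path, encodings=("utf-8", "gb18030")):
    """Return the file's contents, trying each candidate encoding in turn."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    raise ValueError(f"could not decode {path!r} with any of {encodings}")
```

The result can be assigned to `text` and passed to `jieba.lcut(text)` in the next step.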
3. Install jieba and use it for Chinese word segmentation.
pip install jieba
import jieba
jieba.lcut(text)
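4. Update the dictionary with vocabulary specific to the text being analyzed.
jieba.add_word('天罡北斗阵')  # add words one at a time
jieba.load_userdict(word_dict)  # load a dictionary text file
with open('...', encoding='utf-8') as f:  # set to the path of your dictionary file
    jinyong = f.read().split()
jieba.load_userdict(jinyong)  # load the Jin Yong vocabulary
newtext = jieba.lcut(text)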
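jieba's user dictionary is a UTF-8 text file with one entry per line: the word, then optionally a frequency and a part-of-speech tag, separated by spaces. The sketch below writes a plain word list in that shape so it can be handed to `jieba.load_userdict` by path. The helper name `write_userdict` is hypothetical, and the sketch assumes the words themselves contain no spaces.

```python
def write_userdict(words, path):
    """Write unique, non-empty words to `path`, one per line, as UTF-8.

    Returns the number of words written.
    """
    seen = set()
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            word = word.strip()
            if word and word not in seen:
                seen.add(word)
                f.write(word + "\n")
    return len(seen)
```

After `write_userdict(jinyong, 'jinyong_dict.txt')` (the file name is only an example), the file can be loaded with `jieba.load_userdict('jinyong_dict.txt')`.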
Reference dictionaries can be downloaded from: https://pinyin.sogou.com/dict/
Conversion code (.scel to text): scel_to_text
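import struct
import os

# Offset of the pinyin table
startPy = 0x1540
# Offset of the Chinese phrase table
startChinese = 0x2628
# Global pinyin table: index -> pinyin syllable
GPy_Table = {}

def byte2str(data):
    """Convert raw bytes (two bytes per character) to a string."""
    pos = 0
    result = ''
    while pos < len(data):
        c = chr(struct.unpack('<H', bytes([data[pos], data[pos + 1]]))[0])
        if c != chr(0):
            result += c
        pos += 2
    return result

def getPyTable(data):
    """Fill GPy_Table from the pinyin table section."""
    data = data[4:]
    pos = 0
    while pos < len(data):
        index = struct.unpack('<H', bytes([data[pos], data[pos + 1]]))[0]
        pos += 2
        lenPy = struct.unpack('<H', bytes([data[pos], data[pos + 1]]))[0]
        pos += 2
        py = byte2str(data[pos:pos + lenPy])
        GPy_Table[index] = py
        pos += lenPy

def getWordPy(data):
    """Return the pinyin of one phrase from its pinyin index list."""
    pos = 0
    ret = ''
    while pos < len(data):
        index = struct.unpack('<H', bytes([data[pos], data[pos + 1]]))[0]
        ret += GPy_Table[index]
        pos += 2
    return ret

def getChinese(data):
    """Read the phrase table into a list of (count, pinyin, word) tuples."""
    GTable = []
    pos = 0
    while pos < len(data):
        # Number of phrases sharing this pinyin
        same = struct.unpack('<H', bytes([data[pos], data[pos + 1]]))[0]
        pos += 2
        # Length of the pinyin index list
        py_table_len = struct.unpack('<H', bytes([data[pos], data[pos + 1]]))[0]
        pos += 2
        py = getWordPy(data[pos: pos + py_table_len])
        pos += py_table_len
        for i in range(same):
            # Length of the phrase in bytes
            c_len = struct.unpack('<H', bytes([data[pos], data[pos + 1]]))[0]
            pos += 2
            word = byte2str(data[pos: pos + c_len])
            pos += c_len
            # Length of the extension data
            ext_len = struct.unpack('<H', bytes([data[pos], data[pos + 1]]))[0]
            pos += 2
            # Word frequency
            count = struct.unpack('<H', bytes([data[pos], data[pos + 1]]))[0]
            GTable.append((count, py, word))
            # Skip to the next phrase
            pos += ext_len
    return GTable

def scel2txt(file_name):
    """Parse one .scel file and return its phrase table."""
    with open(file_name, 'rb') as f:
        data = f.read()
    print('Dictionary name:', byte2str(data[0x130:0x338]))
    print('Dictionary type:', byte2str(data[0x338:0x540]))
    print('Description:', byte2str(data[0x540:0xd40]))
    print('Examples:', byte2str(data[0xd40:startPy]))
    getPyTable(data[startPy:startChinese])
    return getChinese(data[startChinese:])

if __name__ == '__main__':
    in_path = '...'   # set to the folder containing the .scel files
    out_path = '...'  # set to the folder for the converted .txt files
    fin = [fname for fname in os.listdir(in_path) if fname[-5:] == '.scel']
    for f in fin:
        try:
            words = scel2txt(os.path.join(in_path, f))
            file_path = os.path.join(out_path, str(f).split('.')[0] + '.txt')
            # Append each phrase (third tuple field) to the output file
            with open(file_path, 'a+', encoding='utf-8') as file:
                for word in words:
                    file.write(word[2] + '\n')
            # Delete the source .scel once it has been converted
            os.remove(os.path.join(in_path, f))
        except Exception as e:
            print(e)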
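The conversion script decodes strings two bytes at a time with `struct.unpack`, which amounts to reading little-endian UTF-16 code units and dropping NUL padding. As a hedged illustration (not part of the original script), the sketch below shows that byte-pair approach next to Python's built-in `utf-16-le` codec. The function names are made up for this example, and the two functions are only expected to agree for well-formed, even-length data without surrogate pairs.

```python
import struct

def byte2str_pairs(data):
    """Decode by unpacking each 2-byte pair as a little-endian unsigned short."""
    chars = []
    for pos in range(0, len(data) - 1, 2):
        (code,) = struct.unpack_from("<H", data, pos)
        if code != 0:  # skip NUL padding
            chars.append(chr(code))
    return "".join(chars)

def byte2str_decode(data):
    """Same idea using the built-in UTF-16-LE codec, then removing NULs."""
    return data.decode("utf-16-le").replace("\x00", "")
```

Writing the format as `<H` rather than `H` makes the byte order explicit, so the result does not depend on the machine running the script.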
5. Generate word-frequency counts.
te = {}
for w in newtext:
    if len(w) > 1:  # skip single-character tokens
        te[w] = te.get(w, 0) + 1
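The same counting can be written with `collections.Counter` from the standard library. This is only a sketch: the function name `count_words` and the `min_len` parameter are illustrative, and skipping single-character tokens is an assumption about what the count should ignore.

```python
from collections import Counter

def count_words(tokens, min_len=2):
    """Count tokens that are at least `min_len` characters long."""
    return Counter(t for t in tokens if len(t) >= min_len)
```

For example, `count_words(newtext)` returns a `Counter`, which behaves like a dict, so code written against `te` keeps working with it.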
6. Sort.
tesort = list(te.items())
tesort.sort(key=lambda x: x[1], reverse=True)
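Sorting by count alone leaves words with equal counts in whatever order they were first seen. If a reproducible order is wanted, one option is to break ties by the word itself. A minimal sketch, with the hypothetical helper name `sort_by_count`:

```python
def sort_by_count(freq):
    """Return (word, count) pairs, highest count first, ties ordered by word."""
    return sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
```

Negating the count sorts it in descending order while the word stays in ascending order, all in a single pass.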
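7. Exclude grammatical words: stop words such as pronouns, articles, and conjunctions.
General pattern (stops holds the stop words):
tokens=[token for token in wordsls if token not in stops]
with open('...', encoding='utf-8') as f:  # set to the path of your stop-word file
    stops = f.read().split()
newtext2 = [text1 for text1 in newtext if text1 not in stops]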
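For a full-length novel the token list is long, and testing membership against a list of stop words rescans that list for every token. Converting the stop words to a set first makes each lookup constant time on average. The sketch below also drops whitespace-only tokens, on the assumption that line breaks and spaces in the segmented output are not wanted in the counts. The name `remove_stops` is illustrative.

```python
def remove_stops(tokens, stops):
    """Drop stop words and whitespace-only tokens, keeping the original order."""
    stop_set = set(stops)  # set lookup instead of scanning a list
    return [t for t in tokens if t.strip() and t not in stop_set]
```

With the variables from this step, that would be `newtext2 = remove_stops(newtext, stops)`.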
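8. Output the 20 most frequent words and save the results to a file.
for i in range(20):
    print(tesort[i])
pd.DataFrame(tesort).to_csv('...', encoding='utf-8')  # set to your output CSV path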
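If pandas is not available, the standard `csv` module can save the top entries as well. The sketch below is one possible variant, not the original approach: the helper name `save_top` and the header names are assumptions. It slices the list rather than indexing up to 20, so a text with fewer than 20 distinct words does not raise an `IndexError`, and it writes with `utf-8-sig` so that spreadsheet programs that rely on a byte-order mark display the Chinese text correctly.

```python
import csv

def save_top(pairs, path, n=20):
    """Write the first `n` (word, count) pairs to a CSV file and return them."""
    top = pairs[:n]
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])  # header row
        writer.writerows(top)
    return top
```

With the sorted list from step 6 this would be called as `save_top(tesort, 'top20.csv')`, where the file name is only an example.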
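9. Generate a word cloud.
txt = open('...', 'r', encoding='utf-8').read()  # set to the path of the novel text
ludingjilist = jieba.lcut(txt)
wl_spl = ' '.join(ludingjilist)
mywc = WordCloud().generate(wl_spl)  # to render Chinese glyphs, pass font_path pointing at a CJK font
plt.imshow(mywc)
plt.axis('off')
plt.show()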
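10. Complete code
import pandas as pd
from wordcloud import WordCloud
import jieba
import matplotlib.pyplot as plt

# Read the novel
with open('...', 'r', encoding='utf-8') as f:  # set to the path of the novel text
    text = f.read()

# Load the custom vocabulary and segment the text
with open('...', encoding='utf-8') as f:  # set to the path of your dictionary file
    jinyong = f.read().split()
jieba.load_userdict(jinyong)
newtext = jieba.lcut(text)

# Remove stop words
with open('...', encoding='utf-8') as f:  # set to the path of your stop-word file
    stops = f.read().split()
newtext2 = [text1 for text1 in newtext if text1 not in stops]

# Count word frequencies on the filtered tokens
te = {}
for w in newtext2:
    if len(w) > 1:  # skip single-character tokens
        te[w] = te.get(w, 0) + 1

# Sort by frequency, highest first
tesort = list(te.items())
tesort.sort(key=lambda x: x[1], reverse=True)

# Print the top 20 and save all counts to a CSV file
for i in range(20):
    print(tesort[i])
pd.DataFrame(tesort).to_csv('...', encoding='utf-8')  # set to your output CSV path

# Generate the word cloud
txt = open('...', 'r', encoding='utf-8').read()  # set to the path of the novel text
ludingjilist = jieba.lcut(txt)
wl_spl = ' '.join(ludingjilist)
mywc = WordCloud().generate(wl_spl)  # to render Chinese glyphs, pass font_path pointing at a CJK font
plt.imshow(mywc)
plt.axis('off')
plt.show()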