NLP: a detailed comparison of the TF-IDF and TextRank algorithms, based on jieba word segmentation


Note: by default jieba is installed in the site-packages directory.

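For adding stop words and new words to jieba, see my earlier blog posts. This post focuses on the two algorithms jieba offers for keyword extraction.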

1. The TF-IDF algorithm

The official docstring reads:

extract_tags(sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False) method of jieba.analyse.tfidf.TFIDF instance
    Extract keywords from sentence using TF-IDF algorithm.
    Parameter:
        - topK: return how many top keywords. `None` for all possible words.
        - withWeight: if True, return a list of (word, weight);
                      if False, return a list of words.
        - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v', 'nr'].
                    if the POS of w is not in this list, it will be filtered.
        - withFlag: only work with allowPOS is not empty.
                    if True, return a list of pair(word, weight) like posseg.cut
                    if False, return a list of words

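jieba.analyse.extract_tags

- sentence: the text to extract keywords from

- topK: how many of the keywords with the highest TF-IDF weight to return; the default is 20

- withWeight: whether to return each keyword's weight along with it; the default is False

- allowPOS: keep only words with the listed parts of speech; the default is empty, meaning no filtering

- withFlag: also return each word's part of speech; this only takes effect when allowPOS is set!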
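To make the weighting concrete, here is a minimal sketch of the TF-IDF score behind these parameters: term frequency within the text multiplied by an IDF value, with words missing from the IDF table falling back to the table's median. It is not jieba's actual code. It assumes the words are already segmented, skips stop-word and POS filtering, and the IDF values and the helper name `tfidf_keywords` are hypothetical.

```python
from collections import Counter


def tfidf_keywords(words, idf, top_k=20, with_weight=False):
    """Rank already-segmented words by term frequency times IDF."""
    if not words:
        return []
    counts = Counter(words)
    total = sum(counts.values())
    # Words missing from the IDF table fall back to the median IDF value.
    median_idf = sorted(idf.values())[len(idf) // 2]
    scores = {w: (c / total) * idf.get(w, median_idf) for w, c in counts.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return ranked if with_weight else [w for w, _ in ranked]


# Hypothetical IDF table and token list, for illustration only.
toy_idf = {"算法": 7.0, "关键词": 8.0, "的": 0.5, "文本": 5.0, "我": 1.0}
toy_words = ["关键词", "的", "算法", "的", "关键词", "提取"]
print(tfidf_keywords(toy_words, toy_idf, top_k=2))
print(tfidf_keywords(toy_words, toy_idf, top_k=2, with_weight=True))
```

The two flags mirror topK and withWeight: the first call returns only the words, the second returns (word, weight) pairs.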

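jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance, where idf_path is the IDF frequency file.

The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus.

Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path to the custom corpus

The stop-word corpus used for keyword extraction can likewise be switched to a custom corpus.

Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path to the custom corpus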
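As a rough sketch of how the two custom corpora could be prepared and plugged in: the helpers below write an IDF file (assumed format: one word and its IDF value per line, separated by a space) and a stop-word file (one word per line), then pass both paths to jieba. The helper names, file names and corpus values are hypothetical, and jieba is imported lazily so the file helpers run on their own.

```python
import os
import tempfile


def write_idf_file(path, idf_table):
    """Write one 'word idf' pair per line, separated by a single space."""
    with open(path, "w", encoding="utf-8") as f:
        for word, value in idf_table.items():
            f.write(f"{word} {value}\n")


def write_stop_words_file(path, stop_words):
    """Write one stop word per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(stop_words) + "\n")


def extract_with_custom_corpora(text, idf_path, stop_words_path, top_k=5):
    """Switch jieba to the custom corpora, then extract weighted keywords."""
    # Imported lazily so the file helpers above work without jieba installed.
    import jieba.analyse

    jieba.analyse.set_idf_path(idf_path)
    jieba.analyse.set_stop_words(stop_words_path)
    return jieba.analyse.extract_tags(text, topK=top_k, withWeight=True)


# Hypothetical corpus values, for illustration only.
workdir = tempfile.mkdtemp()
idf_path = os.path.join(workdir, "my_idf.txt")
stop_path = os.path.join(workdir, "my_stop_words.txt")
write_idf_file(idf_path, {"关键词": 8.0, "算法": 6.5})
write_stop_words_file(stop_path, ["的", "了"])
```

With jieba installed, calling extract_with_custom_corpora(text, idf_path, stop_path) should return (word, weight) pairs computed against the custom files.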

2. Keyword extraction based on the TextRank algorithm

textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False) method of jieba.analyse.textrank.TextRank instance
    Extract keywords from sentence using TextRank algorithm.
    Parameter:
        - topK: return how many top keywords. `None` for all possible words.
        - withWeight: if True, return a list of (word, weight);
                      if False, return a list of words.
        - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
                    if the POS of w is not in this list, it will be filtered.
        - withFlag: if True, return a list of pair(word, weight) like posseg.cut
                    if False, return a list of words

jieba.analyse.TextRank() creates a custom TextRank instance.

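Basic idea:

1. Segment the text from which keywords are to be extracted

2. Build a graph from the co-occurrence relations between words inside a fixed-size window (5 by default, adjustable through the span attribute)

3. Compute the PageRank of every node in the graph; note that the graph is undirected and weighted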
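The three steps above can be sketched in plain Python as follows. This is a simplified illustration, not jieba's implementation: it assumes the text is already segmented and filtered, drops self-pairs, and uses a fixed number of iterations with a damping factor of 0.85 (both assumed values). The function name `textrank_keywords` is hypothetical.

```python
from collections import defaultdict


def textrank_keywords(words, span=5, damping=0.85, iterations=10, top_k=20):
    """Rank words by PageRank on an undirected, weighted co-occurrence graph."""
    # Step 2: count co-occurrences of each word with the next span-1 words.
    pair_counts = defaultdict(int)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + span, len(words))):
            if w != words[j]:
                pair_counts[(w, words[j])] += 1

    # Undirected graph: every edge weight is stored in both directions.
    graph = defaultdict(dict)
    for (a, b), weight in pair_counts.items():
        graph[a][b] = graph[a].get(b, 0) + weight
        graph[b][a] = graph[b].get(a, 0) + weight
    if not graph:
        return []

    # Step 3: iterate the weighted PageRank update.
    scores = {node: 1.0 / len(graph) for node in graph}
    out_sum = {node: sum(edges.values()) for node, edges in graph.items()}
    for _ in range(iterations):
        new_scores = {}
        for node in graph:
            incoming = sum(
                weight / out_sum[nb] * scores[nb]
                for nb, weight in graph[node].items()
            )
            new_scores[node] = (1 - damping) + damping * incoming
        scores = new_scores
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical token list: "图" co-occurs with every other word.
toy_tokens = ["图", "节点", "图", "边", "图", "权重"]
print(textrank_keywords(toy_tokens, span=2, top_k=3))
```

A larger span links words that are farther apart, which makes the graph denser and the scores less dependent on immediate neighbours.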

If you call TextRank through from textrank4zh import TextRank4Keyword rather than through import jieba.analyse, note the following.

tr4w = TextRank4Keyword()
tr4w.analyze(text=text, lower=True, window=…, vertex_source=…, edge_source=…, pagerank_config=…)

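When processing a text, the classes TextRank4Keyword and TextRank4Sentence split it into four formats:

- sentences: a list of sentences.

- words_no_filter: a two-level list obtained by segmenting every sentence in sentences.

- words_no_stop_words: a two-level list obtained by removing the stop words from words_no_filter.

- words_all_filters: a two-level list obtained by keeping only the words in words_no_stop_words that have the specified parts of speech.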
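The relationship between the four formats can be sketched like this. It is only an illustration of the data shapes, not textrank4zh's code: the sentence delimiters, the segmenter `toy_segment`, its POS table and the function `build_views` are all hypothetical stand-ins.

```python
import re


def build_views(text, segment, stop_words, allowed_pos):
    """Return the four views of a text as a dict of lists."""
    # Split on a small, assumed set of Chinese and Western sentence delimiters.
    sentences = [s.strip() for s in re.split(r"[。！？!?.；;\n]", text) if s.strip()]
    # segment(sentence) must return a list of (word, pos) pairs.
    tagged = [segment(s) for s in sentences]
    return {
        "sentences": sentences,
        "words_no_filter": [[w for w, _ in sent] for sent in tagged],
        "words_no_stop_words": [
            [w for w, _ in sent if w not in stop_words] for sent in tagged
        ],
        "words_all_filters": [
            [w for w, pos in sent if w not in stop_words and pos in allowed_pos]
            for sent in tagged
        ],
    }


# Hypothetical whitespace segmenter with a hand-written POS table.
TOY_POS = {"jieba": "n", "extracts": "v", "keywords": "n", "graph": "n", "weighted": "a"}


def toy_segment(sentence):
    return [(w, TOY_POS.get(w, "x")) for w in sentence.split()]


views = build_views(
    "jieba extracts the keywords. the graph is weighted",
    toy_segment,
    stop_words={"the", "is"},
    allowed_pos={"n", "v"},
)
for name, value in views.items():
    print(name, value)
```

Each later view is a filtered version of the previous one, so words_all_filters is always the smallest of the three word lists.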

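I did not find any parameter here for setting the allowed parts of speech, so the docstring of analyze is given for reference: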

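analyze(text, window=2, lower=False, vertex_source='all_filters', edge_source='no_stop_words', pagerank_config={'alpha': 0.85}) method of textrank4zh.TextRank4Keyword.TextRank4Keyword instance
    Analyze the text.

    Keyword arguments:
    text          -- the text content, a string.
    window        -- window size, int, used to construct the edges between words. Defaults to 2.
    lower         -- whether to convert the text to lowercase. Defaults to False.
    vertex_source -- which of words_no_filter, words_no_stop_words, words_all_filters is used to construct the nodes of the PageRank graph. The default is 'all_filters'; the possible values are 'no_filter', 'no_stop_words', 'all_filters'. The keywords also come from vertex_source.
    edge_source   -- which of words_no_filter, words_no_stop_words, words_all_filters is used to construct the edges between the nodes of the PageRank graph. The default is 'no_stop_words'; the possible values are 'no_filter', 'no_stop_words', 'all_filters'. The edges are built in combination with the window parameter.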
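How window, the vertex source and the edge source interact can be shown with a small sketch. It assumes an edge is kept only when both words of a pair taken from the edge source are also vertices. The names `window_pairs` and `build_edges` and the sample words are hypothetical, and the logic is a simplification rather than the library's own code.

```python
def window_pairs(words, window=2):
    """Yield (word, neighbour) pairs for words fewer than `window` positions apart."""
    window = max(window, 2)
    for offset in range(1, window):
        if offset >= len(words):
            break
        yield from zip(words, words[offset:])


def build_edges(edge_sentences, vertices, window=2):
    """Collect undirected edges between vertices that co-occur within the window."""
    edges = set()
    for words in edge_sentences:
        for a, b in window_pairs(words, window):
            if a in vertices and b in vertices and a != b:
                edges.add(frozenset((a, b)))
    return edges


# Hypothetical data: vertices come from a filtered word list,
# edges from a less strictly filtered one.
vertices = {"graph", "node"}
edge_sentences = [["graph", "the", "node"]]
print(build_edges(edge_sentences, vertices, window=2))
print(build_edges(edge_sentences, vertices, window=3))
```

With window=2 only adjacent words are paired, so a filtered-out word sitting between two vertices prevents them from being connected; widening the window to 3 links them.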

I will post a link to the detailed comparison code once it has been tidied up.

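Algorithm:

- Efficient word-graph scanning based on a prefix dictionary, producing a directed acyclic graph (DAG) of every possible way the Chinese characters in a sentence can form words

- Dynamic programming to find the maximum-probability path, that is, the best segmentation according to word frequency

- For out-of-vocabulary words, an HMM based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm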
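The first two steps can be sketched with a toy dictionary. This is a simplified illustration rather than jieba's code: the DAG is built by scanning all substrings instead of using a prefix dictionary, the HMM step for out-of-vocabulary words is left out, and the word frequencies and function names are hypothetical.

```python
import math


def build_dag(sentence, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for start in range(len(sentence)):
        ends = [
            end
            for end in range(start, len(sentence))
            if freq.get(sentence[start:end + 1], 0) > 0
        ]
        dag[start] = ends or [start]  # an unknown character stands alone
    return dag


def best_segmentation(sentence, freq):
    """Pick the maximum-probability path through the DAG by dynamic programming."""
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sentence, freq)
    n = len(sentence)
    # route[i] = (best log-probability of sentence[i:], end index of its first word)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (
                math.log(freq.get(sentence[start:end + 1]) or 1)
                - log_total
                + route[end + 1][0],
                end,
            )
            for end in dag[start]
        )
    # Follow the stored end indices from left to right to read off the words.
    words, i = [], 0
    while i < n:
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


# Hypothetical word frequencies, for illustration only.
toy_freq = {"关": 5, "键": 5, "关键": 50, "关键词": 80, "词": 20, "提": 5, "取": 5, "提取": 60}
print(build_dag("关键词提取", toy_freq))
print(best_segmentation("关键词提取", toy_freq))
```

The table is filled from the end of the sentence backwards, so by the time a start position is evaluated, the best score of every possible remainder is already known.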
