
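Extract Keywords from Text Data Using TextRank

This example shows how to extract keywords from text data using TextRank.

The TextRank keyword extraction algorithm uses a part-of-speech tag-based approach to identify candidate keywords and scores them using word co-occurrences determined by a sliding window. Keywords can contain multiple tokens. The algorithm also merges keywords when they appear consecutively in a document.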
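The scoring step described above can be sketched in base MATLAB. This is a minimal illustration under stated assumptions, not the implementation used by textrankKeywords: the token list, the window size of 2, the damping factor of 0.85, and the fixed iteration count are all chosen for the sketch, and candidate filtering and keyword merging are left out.

```matlab
% Hypothetical candidate tokens, assumed to be already filtered by part of speech.
tokens = ["useful" "MATLAB" "toolboxes" "useful" "tools"];
window = 2;      % assumed window size for this sketch
damping = 0.85;  % assumed damping factor

[vocab, ~, idx] = unique(tokens, "stable");
n = numel(vocab);
A = zeros(n);    % symmetric co-occurrence counts between distinct words

% Count pairs of distinct words that fall inside the same sliding window.
for i = 1:numel(tokens)
    for j = i+1:min(i + window - 1, numel(tokens))
        if idx(i) ~= idx(j)
            A(idx(i), idx(j)) = A(idx(i), idx(j)) + 1;
            A(idx(j), idx(i)) = A(idx(j), idx(i)) + 1;
        end
    end
end

% PageRank-style iteration on the co-occurrence graph.
colSum = sum(A, 1);
colSum(colSum == 0) = 1;   % avoid division by zero for isolated words
P = A ./ colSum;           % column-normalized edge weights
score = ones(n, 1);
for iter = 1:50
    score = (1 - damping) + damping * (P * score);
end

result = table(vocab', score, 'VariableNames', {'Word', 'Score'})
```

In this toy input, "useful" co-occurs with three different words, so it ends up with the highest score.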

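Extract Keywords

Create an array of tokenized documents containing the text data.

textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful MATLAB toolboxes."
    "MATLAB and Simulink have many features. MATLAB and Simulink makes it easy to develop models."
    "You can easily import data in MATLAB. In particular, you can easily import text data."];
documents = tokenizedDocument(textData);

Extract the keywords using the textrankKeywords function.

tbl = textrankKeywords(documents)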
tbl = 6×3 table
    Keyword    DocumentNumber    Score

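If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".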
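To make the padded layout concrete, the following sketch builds a small string array by hand in the shape described above. The keyword values are hypothetical and are not output of textrankKeywords.

```matlab
% Hypothetical keywords: one row per keyword, one column per word,
% padded on the right with the empty string "".
keyword = [
    "text"   "data"   ""
    "data"   ""       ""
    "easily" "import" "data"];

% Number of words in each keyword (non-empty entries per row).
numWords = sum(keyword ~= "", 2)

% Words of the first keyword, with the padding removed.
firstWords = keyword(1, keyword(1,:) ~= "")
```

Counting the non-empty entries per row is one simple way to tell single-word keywords from multi-word ones.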

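For readability, transform the multi-word keywords into a single string using the join and strip functions.

if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
head(tbl)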
ans = 6×3 table
    Keyword    DocumentNumber    Score

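Specify Maximum Number of Keywords Per Document

By default, the textrankKeywords function returns all identified keywords. To reduce the number of keywords, use the 'MaxNumKeywords' option.

Extract the top two keywords for each document by setting the 'MaxNumKeywords' option to 2.

tbl = textrankKeywords(documents,'MaxNumKeywords',2)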
tbl = 5×3 table
    Keyword    DocumentNumber    Score

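Specify Part-of-Speech Tags

Notice that in the keywords extracted above, the function does not consider the word "import" as a keyword. This is because the TextRank keyword extraction algorithm, by default, uses tokens with the part-of-speech tags "noun", "proper-noun", and "adjective" as candidate keywords. Because the word "import" is a verb, the algorithm does not consider it as a candidate keyword. Similarly, the algorithm does not consider the adverb "easily" as a candidate keyword.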
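The candidate selection described above amounts to a filter on part-of-speech tags. The sketch below mimics that filter in base MATLAB with hand-assigned tags; the tokens and tags are hypothetical and are not produced by the toolbox tagger.

```matlab
% Hypothetical tokens with hand-assigned part-of-speech tags.
tokens = ["you" "can" "easily" "import" "text" "data"];
pos    = ["pronoun" "auxiliary-verb" "adverb" "verb" "noun" "noun"];

% Default candidate tags named in the text above.
defaultTags = ["noun" "proper-noun" "adjective"];

% Only tokens whose tag is in the list remain candidates.
candidates = tokens(ismember(pos, defaultTags))

% Allowing adverbs and verbs as well widens the candidate set.
widerCandidates = tokens(ismember(pos, [defaultTags "adverb" "verb"]))
```

With the default tags only "text" and "data" survive, which is consistent with "import" and "easily" being absent from the keywords extracted so far.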

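To specify which part-of-speech tags to use for identifying candidate keywords, use the 'PartOfSpeech' option.

Extract keywords from the same text as before and also specify the part-of-speech tags "adverb" and "verb".

newTags = ["adverb" "verb"];
tags = ["noun" "proper-noun" "adjective" newTags];
tbl = textrankKeywords(documents,'PartOfSpeech',tags)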
tbl = 7×3 table
    Keyword    DocumentNumber    Score
The rows for document 3 include:
    "import"    "text"    "data"    3    4.7921
    "import"    "data"    ""        3    3.4195

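Notice here that the function treats the token "import" as a candidate keyword and merges it into the multi-word keywords "import data" and "import text data".

Specify Window Size

Notice that in the keywords extracted above, the function does not extract the adverb "easily" as a keyword. This is because of the proximity of this word in the text to other candidate keywords.

The TextRank keyword extraction algorithm scores candidate keywords using the number of pairwise co-occurrences within a sliding window. To increase the window size, use the 'Window' option. Increasing the window size lets the function find more co-occurrences between keywords, which increases the keyword importance scores. This can result in finding more relevant keywords, at the cost of potentially over-scoring less relevant keywords.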
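The effect of the window size on the number of co-occurrences can be checked with a short count. This sketch assumes that a window of size w pairs each token with the w-1 tokens that follow it; the token list is hypothetical, and the exact windowing used by textrankKeywords may differ.

```matlab
tokens = ["easily" "import" "text" "data"];  % hypothetical candidate tokens
windows = [2 3];
counts = zeros(size(windows));

for k = 1:numel(windows)
    w = windows(k);
    for i = 1:numel(tokens)
        % Token i pairs with the tokens that follow it inside the window.
        counts(k) = counts(k) + (min(i + w - 1, numel(tokens)) - i);
    end
    fprintf("window = %d: %d co-occurring pairs\n", w, counts(k));
end
```

Under this assumption, widening the window from 2 to 3 raises the pair count from 3 to 5, because "easily" now also co-occurs with "text" and "import" with "data".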

Extract keywords from the same text as before and also specify a window size of 3.

tbl = textrankKeywords(documents, ...
    'PartOfSpeech',tags, ...
    'Window',3)
tbl = 8×3 table
    Keyword    DocumentNumber    Score
The rows for document 3 include:
    "easily"    "import"    "text"    "data"    3    5.2989
    "easily"    "import"    "data"    ""        3    4.0842

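Notice here that the function treats the token "easily" as a keyword and merges it into the multi-word keywords "easily import text data" and "easily import data".

To learn more about the TextRank keyword extraction algorithm, see TextRank Keyword Extraction.

Alternatives

You can experiment with different keyword extraction algorithms to see what works best with your data. Because the TextRank keyword extraction algorithm uses a part-of-speech tag-based approach to extract candidate keywords, the extracted keywords can be short. Alternatively, you can try extracting keywords using the RAKE algorithm, which extracts sequences of tokens appearing between delimiters as candidate keywords. To extract keywords using RAKE, use the rakeKeywords function. To learn more, see Extract Keywords from Text Data Using RAKE.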
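For contrast with the tag-based approach, the delimiter-based candidate extraction mentioned above can be sketched in base MATLAB. The delimiter list here is hypothetical and chosen only for this example; the sketch shows the splitting idea and is not the rakeKeywords implementation, which also scores the candidates.

```matlab
tokens = ["MATLAB" "and" "Simulink" "have" "many" "features" "."];
delimiters = ["and" "have" "."];   % hypothetical delimiter list for this sketch

isDelim = ismember(tokens, delimiters);
candidates = strings(0, 1);
current = strings(1, 0);           % tokens collected since the last delimiter

for i = 1:numel(tokens)
    if isDelim(i)
        % A delimiter closes the current run of tokens, if there is one.
        if ~isempty(current)
            candidates(end+1, 1) = join(current);
        end
        current = strings(1, 0);
    else
        current(end+1) = tokens(i);
    end
end
if ~isempty(current)
    candidates(end+1, 1) = join(current);
end

candidates
```

Here the run "many features" becomes a single candidate because no delimiter separates the two tokens, which illustrates why delimiter-based candidates tend to be longer.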

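References

Mihalcea, Rada, and Paul Tarau. "TextRank: Bringing Order into Text." In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404-411. 2004.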
