Extract Keywords from Text Data Using TextRank

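This example shows how to extract keywords from text data using TextRank.

The TextRank keyword extraction algorithm identifies candidate keywords using a part-of-speech tag-based approach and scores them using word co-occurrences determined by a sliding window. Keywords can contain multiple tokens. In addition, the algorithm merges keywords when they appear consecutively in a document.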
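The scoring step described above can be sketched in plain MATLAB. This is a minimal illustration, not the implementation used by textrankKeywords: the token list, the isCandidate mask (standing in for a part-of-speech filter), the damping factor of 0.85, and the fixed 50 iterations are all assumptions chosen for the example.

```matlab
% Simplified TextRank-style scoring (illustrative sketch only).
tokens = ["scientists" "use" "many" "useful" "matlab" "toolboxes"];
isCandidate = [true false true true true true];  % assumed part-of-speech filter result
window = 2;       % tokens at most window-1 positions apart co-occur
damping = 0.85;   % assumed damping factor, as in PageRank

% Map every token to an index into the vocabulary.
[vocab,~,id] = unique(tokens,'stable');
numWords = numel(vocab);

% Build an undirected co-occurrence graph over the candidate words.
A = zeros(numWords);
for i = 1:numel(tokens)
    for j = i+1:min(i+window-1,numel(tokens))
        if isCandidate(i) && isCandidate(j) && id(i) ~= id(j)
            A(id(i),id(j)) = 1;
            A(id(j),id(i)) = 1;
        end
    end
end

% Iterate the PageRank-style update on the graph.
deg = sum(A,2);
deg(deg == 0) = 1;            % avoid division by zero for isolated words
scores = ones(numWords,1);
for iter = 1:50
    scores = (1 - damping) + damping*(A*(scores./deg));
end

% Report the candidate words, highest score first.
keep = ismember(vocab,tokens(isCandidate));
result = table(reshape(vocab(keep),[],1),scores(keep(:)), ...
    'VariableNames',{'Word','Score'});
result = sortrows(result,'Score','descend')
```

Words that co-occur with more candidates inside the window end up with higher scores, while a candidate with no neighbors keeps the baseline score. The sketch omits the merging of consecutive keywords into multi-word keywords.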

Extract Keywords

Create an array of tokenized documents containing the text data.

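textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful MATLAB toolboxes."
    "MATLAB and Simulink have many features. MATLAB and Simulink make it easy to develop models."
    "You can easily import data in MATLAB. In particular, you can easily import text data."];
documents = tokenizedDocument(textData);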

Extract the keywords using the textrankKeywords function.

tbl = textrankKeywords(documents)
tbl=6×3 table
    Keyword    DocumentNumber    Score

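If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

For readability, transform the multi-word keywords into a single string using the join and strip functions.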

if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
head(tbl)
ans=6×3 table
    Keyword    DocumentNumber    Score

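Specify Maximum Number of Keywords Per Document

By default, the textrankKeywords function returns all identified keywords. To reduce the number of keywords, use the 'MaxNumKeywords' option.

Extract the top two keywords per document by setting the 'MaxNumKeywords' option to 2.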

tbl = textrankKeywords(documents,'MaxNumKeywords',2)
tbl=5×3 table
    Keyword    DocumentNumber    Score

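Specify Part-of-Speech Tags

Notice that in the keywords extracted above, the function does not treat the word "import" as a keyword. This is because the TextRank keyword extraction algorithm, by default, uses only tokens with the part-of-speech tags "noun", "proper-noun", and "adjective" as candidate keywords. Because the word "import" is a verb, the algorithm does not consider it a candidate keyword. Similarly, the algorithm does not consider the adverb "easily" a candidate keyword.

To specify which part-of-speech tags to use for identifying candidate keywords, use the 'PartOfSpeech' option.

Extract keywords from the same text as before and also specify the part-of-speech tags "adverb" and "verb".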

newTags = ["adverb" "verb"];
tags = ["noun" "proper-noun" "adjective" newTags];
tbl = textrankKeywords(documents,'PartOfSpeech',tags)
tbl=7×3 table
                      Keyword                       DocumentNumber    Score
    "use"         "many"    "useful"    "MATLAB"          1           5.8839
    "useful"      ...
    ...
    ...                                                   2           4.5058
    "Simulink"    ""        ""          ""                2           1.5161
    "import"      "text"    "data"      ""                3           4.7921
    "import"      "data"    ""          ""                3           3.4195

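Notice that the function treats the token "import" as a candidate keyword and merges it into the multi-word keywords "import data" and "import text data".

Specify Window Size

Notice that in the keywords extracted above, the function does not extract the adverb "easily" as a keyword. This is because of the proximity of this word in the text to other candidate keywords.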

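The TextRank keyword extraction algorithm scores candidate keywords using the number of pairwise co-occurrences within a sliding window. To increase the window size, use the 'Window' option. Increasing the window size enables the function to find more co-occurrences between keywords, which increases the keyword importance scores. This can result in finding more relevant keywords, at the cost of potentially over-scoring less relevant keywords.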
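To make the effect of the window size concrete, the following plain MATLAB sketch counts co-occurring candidate pairs in one sentence for window sizes 2 and 3. It assumes that a window of size w pairs tokens that are at most w-1 positions apart, and the isCandidate mask is a hand-written stand-in for the part-of-speech filter; neither is taken from the textrankKeywords implementation.

```matlab
% Count co-occurring candidate pairs for two window sizes (illustrative sketch).
tokens = ["you" "can" "easily" "import" "text" "data"];
isCandidate = [false false true true true true];  % assumed filter: adverb, verb, nouns

windowSizes = [2 3];
pairCounts = zeros(size(windowSizes));
for k = 1:numel(windowSizes)
    window = windowSizes(k);
    fprintf("window = %d\n",window);
    for i = 1:numel(tokens)
        for j = i+1:min(i+window-1,numel(tokens))
            if isCandidate(i) && isCandidate(j)
                pairCounts(k) = pairCounts(k) + 1;
                fprintf("  %s - %s\n",tokens(i),tokens(j));
            end
        end
    end
end
pairCounts
```

With a window of 2, only adjacent candidates are paired. A window of 3 also pairs "easily" with "text" and "import" with "data", so each of these words gains graph neighbors and therefore score.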

Extract keywords from the same text as before and also specify a window size of 3.

tbl = textrankKeywords(documents, ...
    'PartOfSpeech',tags, ...
    'Window',3)
tbl=8×3 table
                       Keyword                        DocumentNumber    Score
    ...
    ...                                                     2           1.0794
    "easily"    "import"    "text"    "data"                3           5.2989
    "easily"    "import"    "data"    ""                    3           4.0842

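Notice that the function treats the token "easily" as a keyword and merges it into the multi-word keywords "easily import text data" and "easily import data".

To learn more about the TextRank keyword extraction algorithm, see TextRank Keyword Extraction.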

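Alternatives

You can experiment with different keyword extraction algorithms to see which works best with your data. Because the TextRank keyword extraction algorithm uses a part-of-speech tag-based approach to extract candidate keywords, the extracted keywords can be short. Alternatively, you can try extracting keywords using the RAKE algorithm, which extracts sequences of tokens appearing between delimiters as candidate keywords. To extract keywords using RAKE, use the rakeKeywords function. To learn more, see Extract Keywords from Text Data Using RAKE.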
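As a rough starting point for such a comparison, the sketch below runs rakeKeywords on the same text data used in this example. It assumes Text Analytics Toolbox is installed and that the returned table has a Keyword variable laid out like the one from textrankKeywords; the variable name rakeTbl is chosen only for this illustration.

```matlab
% Extract keywords with RAKE from the same text data (requires Text Analytics Toolbox).
textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful MATLAB toolboxes."
    "MATLAB and Simulink have many features. MATLAB and Simulink make it easy to develop models."
    "You can easily import data in MATLAB. In particular, you can easily import text data."];
documents = tokenizedDocument(textData);

rakeTbl = rakeKeywords(documents);

% Join multi-word keywords into single strings for readability,
% assuming the Keyword variable is a padded string array as above.
if size(rakeTbl.Keyword,2) > 1
    rakeTbl.Keyword = strip(join(rakeTbl.Keyword));
end
head(rakeTbl)
```

Comparing this table with the TextRank results side by side shows whether longer, delimiter-bounded phrases or shorter, tag-filtered keywords suit the data better.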

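References

[1] Mihalcea, Rada, and Paul Tarau. "TextRank: Bringing Order into Texts." In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404-411. 2004.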
