rakeKeywords

Extract keywords using RAKE

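    Description

    tbl = rakeKeywords(documents) extracts keywords and their respective scores using the rapid automatic keyword extraction (RAKE) algorithm. The function supports English, Japanese, German, and Korean text. To learn how to use rakeKeywords for other languages, see Language Considerations.

    tbl = rakeKeywords(documents,Name,Value) specifies additional options using one or more name-value pair arguments.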

    Tip

    The rakeKeywords function, by default, extracts keywords using stop words and punctuation characters. When using the default values for the 'Delimiters' and 'MergingDelimiters' options, do not remove stop words or punctuation characters from the input text.

    Examples

    Create an array of tokenized documents containing the text data.

    textData = [
        "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
        "Analyze text and images. You can import text and images."
        "Analyze text and images. Analyze text, images, and videos in MATLAB."];
    documents = tokenizedDocument(textData);

    Extract the keywords using the rakeKeywords function.

    tbl = rakeKeywords(documents)

    tbl=12×3 table
                        Keyword                     DocumentNumber    Score
        ________________________________________    ______________    _____
        "MATLAB"        "provides"    "tools"             1             8
        "MATLAB"        ""            ""                  1             2
        "scientists"    "and"         "engineers"         1             2
        "engineers"     ""            ""                  1             1
        "scientists"    ""            ""                  1             1
        "Analyze"       "text"        ""                  2             4
        "import"        "text"        ""                  2             4
        "images"        ""            ""                  2             1
        "Analyze"       "text"        ""                  3             4
        "MATLAB"        ""            ""                  3             1
        "images"        ""            ""                  3             1
        "videos"        ""            ""                  3             1

    If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

    For readability, transform the multi-word keywords into a single string using the join and strip functions.

    if size(tbl.Keyword,2) > 1
        tbl.Keyword = strip(join(tbl.Keyword));
    end
    tbl

    tbl=12×3 table
                 Keyword              DocumentNumber    Score
        __________________________    ______________    _____
        "MATLAB provides tools"             1             8
        "MATLAB"                            1             2
        "scientists and engineers"          1             2
        "engineers"                         1             1
        "scientists"                        1             1
        "Analyze text"                      2             4
        "import text"                       2             4
        "images"                            2             1
        "Analyze text"                      3             4
        "MATLAB"                            3             1
        "images"                            3             1
        "videos"                            3             1

    Create an array of tokenized documents containing the text data.

    textData = [
        "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
        "Analyze text and images. You can import text and images."
        "Analyze text and images. Analyze text, images, and videos in MATLAB."];
    documents = tokenizedDocument(textData);

    Extract the top two keywords using the rakeKeywords function and setting the 'MaxNumKeywords' option to 2.

    tbl = rakeKeywords(documents,'MaxNumKeywords',2)

    tbl=6×3 table
                      Keyword                  DocumentNumber    Score
        ___________________________________    ______________    _____
        "MATLAB"     "provides"    "tools"           1             8
        "MATLAB"     ""            ""                1             2
        "Analyze"    "text"        ""                2             4
        "import"     "text"        ""                2             4
        "Analyze"    "text"        ""                3             4
        "MATLAB"     ""            ""                3             1

    If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

    For readability, transform the multi-word keywords into a single string using the join and strip functions.

    if size(tbl.Keyword,2) > 1
        tbl.Keyword = strip(join(tbl.Keyword));
    end
    tbl

    tbl=6×3 table
                Keyword             DocumentNumber    Score
        _______________________    ______________    _____
        "MATLAB provides tools"          1             8
        "MATLAB"                         1             2
        "Analyze text"                   2             4
        "import text"                    2             4
        "Analyze text"                   3             4
        "MATLAB"                         3             1

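    Input Arguments

    Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.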
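    As a minimal sketch of the non-tokenized input form described above, a single document can be passed as a row vector of words. The word list here is made up for illustration, and no particular output is claimed.

```matlab
% A single document given as a 1-by-N string array, one word (or
% punctuation token) per element. The contents are illustrative only.
words = ["MATLAB" "provides" "tools" "for" "scientists" "and" "engineers" "."];

% Extract keywords directly from the row vector of words.
tbl = rakeKeywords(words);
```

    For more than one document, build a tokenizedDocument array first and pass that instead.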

    Name-Value Arguments

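    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

    Example: rakeKeywords(documents,'MaxNumKeywords',20) returns at most 20 keywords per document.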
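    The two name-value syntaxes mentioned above can be compared side by side. This is a small sketch with an illustrative input string; the second call assumes release R2021a or later.

```matlab
% Illustrative single-document input.
documents = tokenizedDocument("Analyze text and images. You can import text and images.");

% Comma-separated form with the name in quotes (works before R2021a too).
tblA = rakeKeywords(documents,'MaxNumKeywords',20);

% Name=Value form (assumes R2021a or later).
tblB = rakeKeywords(documents,MaxNumKeywords=20);
```

    Both calls request the same option, so they are expected to return the same table.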

    Maximum number of keywords to return per document, specified as the comma-separated pair consisting of 'MaxNumKeywords' and a positive integer or Inf.

    If MaxNumKeywords is Inf, then the function returns all identified keywords.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    Tokens for splitting documents into keywords, specified as the comma-separated pair consisting of 'Delimiters' and a string array, a character vector, or a cell array of character vectors. If Delimiters is a character vector, then it must represent a single delimiter.

    The default list of delimiters is a list of punctuation characters.

    If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

    To specify delimiters for merging, use the 'MergingDelimiters' option.

    Delimiter matching is case insensitive.

    Data Types: char | string | cell

    Delimiters also used for merging keywords, specified as the comma-separated pair consisting of 'MergingDelimiters' and a string array, a character vector, or a cell array of character vectors. If MergingDelimiters is a character vector, then it must represent a single delimiter.

    The default list of merging delimiters is the list of stop words given by the stopWords function.

    If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

    To specify delimiters that should not be used for merging, use the 'Delimiters' option.

    Delimiter matching is case insensitive.

    Data Types: char | string | cell
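    The following sketch shows how the two delimiter options might be set together. The delimiter lists are assumptions chosen for illustration, not the function defaults, and the resulting keywords depend on the lists you supply.

```matlab
% Illustrative single-document input.
documents = tokenizedDocument("Analyze text and images. You can import text and images.");

% Split only at these tokens (never merged across) ...
splitOnly = ["." ","];

% ... and split at these tokens, but allow merging across them.
% Both lists are hypothetical examples, not the defaults.
splitAndMerge = ["and" "you" "can"];

tbl = rakeKeywords(documents, ...
    'Delimiters',splitOnly, ...
    'MergingDelimiters',splitAndMerge);
```

    The same pattern applies when supplying delimiter lists for text in a language that the function does not support by default.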

    Output Arguments

    Extracted keywords and scores, returned as a table with the following variables:

    • Keyword – Extracted keyword, specified as a 1-by-maxNgramLength string array, where maxNgramLength is the number of words in the longest keyword.

    • DocumentNumber – Document number containing the corresponding keyword.

    • Score – Score of the keyword.

    If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

    If a keyword contains multiple words, then the ith element of the corresponding string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

    For more information, see Rapid Automatic Keyword Extraction.

    More About

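    Language Considerations

    The rakeKeywords function supports English, Japanese, German, and Korean text only.

    The rakeKeywords function extracts keywords using a delimiter-based approach to identify candidate keywords. By default, the function uses as delimiters punctuation characters and the stop words given by the stopWords function, with the language given by the language details of the input documents.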

    For other languages, specify an appropriate set of delimiters using the'Delimiters'and'MergingDelimiters'options.

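    Tips

    • You can experiment with different keyword extraction algorithms to see what works best with your data. Because the RAKE algorithm uses a delimiter-based approach to extract candidate keywords, the extracted keywords can be very long. Alternatively, you can try extracting keywords using the TextRank algorithm, which starts with individual tokens as candidate keywords and then merges them when appropriate. To extract keywords using TextRank, use the textrankKeywords function. To learn more, see Extract Keywords from Text Data Using TextRank.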
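    A short sketch of the comparison suggested in this tip, using the text from the examples on this page and the default options of both functions. No particular output is claimed; which result suits your data is a judgment call.

```matlab
% Text data taken from the examples on this page.
textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."];
documents = tokenizedDocument(textData);

% Delimiter-based candidates (RAKE).
tblRake = rakeKeywords(documents);

% Token-based candidates that are merged when appropriate (TextRank).
tblTextRank = textrankKeywords(documents);
```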

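    Algorithms

    Rapid Automatic Keyword Extraction

    For each document, the rakeKeywords function extracts keywords independently using the following steps based on [1]:

    1. Determine candidate keywords:

      • Extract sequences of tokens between the delimiters specified by the 'Delimiters' and 'MergingDelimiters' options. The function treats each sequence as a single candidate keyword.

    2. Calculate scores for the candidate keywords:

      • Create an undirected, unweighted graph with nodes corresponding to the individual tokens in the candidate keywords.

      • Add edges between nodes where tokens co-occur in a candidate keyword, including self co-occurrences, weighted by the number of candidate keywords containing that co-occurrence.

      • Assign a score to each token using the formula deg(token) / freq(token), where deg(token) is the number of edges of the specified token and freq(token) is the number of times the specified token occurs in the document.

      • For each candidate keyword, assign a score given by the sum of scores of the contained tokens.

    3. Extract the top keywords from the candidates:

      • If there are multiple instances of the same pair of candidate keywords separated by the same single merging delimiter, then merge the candidate keywords and the delimiter into a single keyword and sum the corresponding scores.

      • Return the top k keywords, where k is given by the 'MaxNumKeywords' option.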
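    Steps 1 and 2 can be sketched in plain MATLAB for the first example document. This is a simplified illustration, not the toolbox implementation: the delimiter lists are assumptions chosen for this text (the actual defaults come from punctuation characters and stopWords), deg is computed as the total length of the candidates that contain the token, and the merging in step 3 is omitted.

```matlab
% Tokens of the first example document, written out by hand.
tokens = ["MATLAB" "provides" "tools" "for" "scientists" "and" "engineers" "." ...
          "MATLAB" "is" "used" "by" "scientists" "and" "engineers" "."];

% Assumed delimiter lists for this sketch only (not the toolbox defaults).
delimiters        = ".";
mergingDelimiters = ["for" "and" "is" "used" "by"];

% Step 1: split the token sequence into candidates at any delimiter.
% Matching is case insensitive, as described for the delimiter options.
isDelim = ismember(lower(tokens), lower([delimiters mergingDelimiters]));
candidates = {};
current = strings(1,0);
for k = 1:numel(tokens)
    if isDelim(k)
        if ~isempty(current)
            candidates{end+1} = current; %#ok<SAGROW>
            current = strings(1,0);
        end
    else
        current(end+1) = tokens(k); %#ok<SAGROW>
    end
end
if ~isempty(current)
    candidates{end+1} = current;
end

% Step 2: score each token as deg(token)/freq(token).
vocab = unique([candidates{:}]);
deg   = zeros(size(vocab));
freq  = zeros(size(vocab));
for c = 1:numel(candidates)
    cand = candidates{c};
    for t = cand
        idx = (vocab == t);
        freq(idx) = freq(idx) + 1;           % occurrences of the token
        deg(idx)  = deg(idx) + numel(cand);  % co-occurrences, including self
    end
end
tokenScore = deg ./ freq;

% Candidate score = sum of the scores of the tokens it contains.
candidateScore = cellfun( ...
    @(cand) sum(arrayfun(@(t) tokenScore(vocab == t), cand)), candidates);
```

    Under these assumptions, "MATLAB provides tools" scores 8 and the standalone "MATLAB" scores 2, consistent with the example table. Merging "scientists and engineers" in step 3 would sum the two single-token scores of 1.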

    Language Details

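    tokenizedDocument objects contain details about the tokens, including language details. The language details of the input documents determine the behavior of rakeKeywords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the 'Language' name-value pair argument of tokenizedDocument. To view the token details, use the tokenDetails function.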
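    A minimal sketch of setting and inspecting the language details described above. The German sentence is an illustrative input, and no particular keywords are claimed.

```matlab
% Illustrative German text; the language is set manually rather than detected.
str = "MATLAB bietet Werkzeuge für Wissenschaftler und Ingenieure.";
documents = tokenizedDocument(str,'Language','de');

% Inspect the token details, which include a Language variable.
tdetails = tokenDetails(documents);

% rakeKeywords picks its default delimiters based on these language details.
tbl = rakeKeywords(documents);
```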

    References

    [1] Rose, Stuart, Dave Engel, Nick Cramer, and Wendy Cowley. "Automatic keyword extraction from individual documents." Text mining: applications and theory 1 (2010): 1-20.

    Version History

    Introduced in R2020b