
addPartOfSpeechDetails

Add part-of-speech tags to documents

Description

Use addPartOfSpeechDetails to add part-of-speech tags to documents.

The function supports English, Japanese, German, and Korean text.


updatedDocuments = addPartOfSpeechDetails(documents) detects parts of speech in documents and updates the token details. By default, the function retokenizes the text for part-of-speech tagging. For example, the function splits the word "you're" into the tokens "you" and "'re". To get the part-of-speech details from updatedDocuments, use tokenDetails.

updatedDocuments = addPartOfSpeechDetails(documents,Name,Value) specifies additional options using one or more name-value pair arguments.

Tip

Use addPartOfSpeechDetails before using the lower, upper, erase, normalizeWords, removeWords, and removeStopWords functions, because addPartOfSpeechDetails uses information that is removed by these functions.
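As a sketch of the ordering this tip describes (the input text here is illustrative), add the part-of-speech details before any normalization steps:

```matlab
% Tokenize some sample text (illustrative input).
documents = tokenizedDocument("The Quick Brown Fox Jumps Over the Lazy Dog.");

% Add part-of-speech details BEFORE lowercasing or removing stop words,
% because tagging relies on case and context that those functions discard.
documents = addPartOfSpeechDetails(documents);

% Now it is safe to normalize and prune the text.
documents = lower(documents);
documents = removeStopWords(documents);
```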

Examples


Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

View the token details of the first few tokens.

tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
      Token        DocumentNumber    LineNumber     Type      Language
    ___________    ______________    __________    _______    ________

    "fairest"            1               1         letters       en
    "creatures"          1               1         letters       en
    "desire"             1               1         letters       en
    "increase"           1               1         letters       en
    "thereby"            1               1         letters       en
    "beautys"            1               1         letters       en
    "rose"               1               1         letters       en
    "might"              1               1         letters       en

Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
      Token        DocumentNumber    SentenceNumber    LineNumber     Type      Language     PartOfSpeech
    ___________    ______________    ______________    __________    _______    ________    ______________

    "fairest"            1                 1               1         letters       en       adjective
    "creatures"          1                 1               1         letters       en       noun
    "desire"             1                 1               1         letters       en       noun
    "increase"           1                 1               1         letters       en       noun
    "thereby"            1                 1               1         letters       en       adverb
    "beautys"            1                 1               1         letters       en       noun
    "rose"               1                 1               1         letters       en       noun
    "might"              1                 1               1         letters       en       auxiliary-verb

Tokenize Japanese text using tokenizedDocument.

str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"
    "駅までは遠くて、歩けない。"
    "遠くの駅まで歩けない。"
    "すもももももももものうち。"];
documents = tokenizedDocument(str);

For Japanese text, you can get the part-of-speech details using tokenDetails. For English text, you must first use addPartOfSpeechDetails.

tdetails = tokenDetails(documents);
head(tdetails)
ans=8×8 table
     Token      DocumentNumber    LineNumber       Type        Language    PartOfSpeech     Lemma        Entity
    ________    ______________    __________    ___________    ________    ____________    ________    __________

    "恋"              1               1         letters           ja       noun            "恋"        non-entity
    "に"              1               1         letters           ja       adposition      "に"        non-entity
    "悩み"            1               1         letters           ja       verb            "悩む"      non-entity
    "、"              1               1         punctuation       ja       punctuation     "、"        non-entity
    "苦しむ"          1               1         letters           ja       verb            "苦しむ"    non-entity
    "。"              1               1         punctuation       ja       punctuation     "。"        non-entity
    "恋"              2               1         letters           ja       noun            "恋"        non-entity
    "の"              2               1         letters           ja       adposition      "の"        non-entity

Tokenize German text using tokenizedDocument.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)
documents =
  2x1 tokenizedDocument:

    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .

To get the part-of-speech details for German text, first use addPartOfSpeechDetails.

documents = addPartOfSpeechDetails(documents);

To view the part-of-speech details, use the tokenDetails function.

tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
     Token      DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech
    ________    ______________    ______________    __________    ___________    ________    ____________

    "Guten"          1                  1               1         letters           de       adjective
    "Morgen"         1                  1               1         letters           de       noun
    "."              1                  1               1         punctuation       de       punctuation
    "Wie"            1                  2               1         letters           de       adverb
    "geht"           1                  2               1         letters           de       verb
    "es"             1                  2               1         letters           de       pronoun
    "dir"            1                  2               1         letters           de       pronoun
    "?"              1                  2               1         punctuation       de       punctuation

Input Arguments


Input documents, specified as a tokenizedDocument array.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'DiscardKnownValues',true specifies to discard previously computed details and recompute them.

Method to retokenize documents, specified as one of the following:

  • 'part-of-speech' – Transform the tokens for part-of-speech tagging. The function performs these tasks:

    • Split compound words. For example, split the compound word "wanna" into the tokens "want" and "to". This includes compound words containing apostrophes. For example, the function splits the word "don't" into the tokens "do" and "n't".

    • Merge periods that do not end sentences with the preceding tokens. For example, merge the tokens "Mr" and "." into the token "Mr.".

    • For German text, merge abbreviations that span multiple tokens. For example, merge the tokens "z", ".", "B", and "." into the single token "z. B.".

    • Merge runs of periods into ellipses. For example, merge three instances of "." into the single token "...".

  • 'none' – Do not retokenize the documents.
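For instance, if the documents are already tokenized exactly as you want them, you might disable retokenization (a minimal sketch; the input text is illustrative):

```matlab
documents = tokenizedDocument("Mr. Smith doesn't agree.");

% Keep the existing tokens instead of retokenizing for tagging.
documents = addPartOfSpeechDetails(documents,"TokenizeMethod","none");
tdetails = tokenDetails(documents);
```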

List of abbreviations for sentence detection, specified as a string array, character vector, cell array of character vectors, or a table.

If the input documents do not contain sentence details, then the function first runs the addSentenceDetails function using the abbreviations given by 'Abbreviations'. To specify more sentence detection options (for example, sentence starters), use the addSentenceDetails function before using addPartOfSpeechDetails.

If Abbreviations is a string array, character vector, or cell array of character vectors, then the function treats these as regular abbreviations. If the next word is a capitalized sentence starter, then the function breaks at the trailing period. The function ignores any differences in the letter case of the abbreviations. Specify the sentence starters using the Starters name-value pair.

To specify a different behavior for abbreviations, specify Abbreviations as a table. The table must have variables named Abbreviation and Usage, where Abbreviation contains the abbreviations, and Usage contains the type of each abbreviation. The following describes the possible values of Usage, and the behavior of the function when passed abbreviations of these types.

  • regular – If the next word is a capitalized sentence starter, then break at the trailing period. Otherwise, do not break at the trailing period. Example abbreviation: "appt."

    • "Book an appt. We'll meet then." is detected as the sentences "Book an appt." and "We'll meet then."

    • "Book an appt. today." is detected as the single sentence "Book an appt. today."

  • inner – Do not break at the trailing period. Example abbreviation: "Dr."

    • "Dr. Smith." is detected as the single sentence "Dr. Smith."

  • reference – If the next token is not a number, then break at the trailing period. If the next token is a number, then do not break at the trailing period. Example abbreviation: "fig."

    • "See fig. 3." is detected as the single sentence "See fig. 3."

    • "Try a fig. They are nice." is detected as the sentences "Try a fig." and "They are nice."

  • unit – If the previous word is a number and the following word is a capitalized sentence starter, then break at the trailing period. If the previous word is a number and the following word is not capitalized, then do not break at the trailing period. If the previous word is not a number, then break at the trailing period. Example abbreviation: "in."

    • "The height is 30 in. The width is 10 in." is detected as the sentences "The height is 30 in." and "The width is 10 in."

    • "The item is 10 in. wide." is detected as the single sentence "The item is 10 in. wide."

    • "Come in. Sit down." is detected as the sentences "Come in." and "Sit down."

The default value is the output of the abbreviations function. For Japanese and Korean text, abbreviations do not usually impact sentence detection.

Tip

By default, the function treats single-letter abbreviations, such as "V.", or tokens with mixed single letters and periods, such as "U.S.A.", as regular abbreviations. You do not need to include these abbreviations in Abbreviations.

Data Types: char | string | table | cell
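For example, to mark "fig." as a reference abbreviation so that "See fig. 3." is not split into two sentences, you might pass a one-row table. This is a sketch, not the definitive usage; the variable names Abbreviation and Usage are required, and the example text is illustrative:

```matlab
% Table of abbreviations with their usage types.
abbrevs = table("fig.","reference",'VariableNames',["Abbreviation" "Usage"]);

documents = tokenizedDocument("See fig. 3. It shows the results.");
documents = addPartOfSpeechDetails(documents,"Abbreviations",abbrevs);
```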

Option to discard previously computed details and recompute them, specified as true or false.

Data Types: logical
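For example, to force the function to recompute the details from scratch after an earlier call (a minimal sketch; the input text is illustrative):

```matlab
documents = tokenizedDocument("Book an appt. We'll meet then.");
documents = addPartOfSpeechDetails(documents);

% Discard the earlier results and recompute the details.
documents = addPartOfSpeechDetails(documents,"DiscardKnownValues",true);
```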

Output Arguments


Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.

More About


Part-of-Speech Tags

The addPartOfSpeechDetails function adds part-of-speech tags to the table returned by the tokenDetails function. The function tags each token with a categorical tag with one of the following class names:

  • "adjective" – Adjective

  • "adposition" – Adposition

  • "adverb" – Adverb

  • "auxiliary-verb" – Auxiliary verb

  • "coord-conjunction" – Coordinating conjunction

  • "determiner" – Determiner

  • "interjection" – Interjection

  • "noun" – Noun

  • "numeral" – Numeral

  • "particle" – Particle

  • "pronoun" – Pronoun

  • "proper-noun" – Proper noun

  • "punctuation" – Punctuation

  • "subord-conjunction" – Subordinating conjunction

  • "symbol" – Symbol

  • "verb" – Verb

  • "other" – Other
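As an illustration of working with these tags (the input text and variable names here are illustrative), you can filter the token details for a given class:

```matlab
documents = tokenizedDocument("The quick brown fox jumps over the lazy dog.");
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);

% Select only the tokens tagged as nouns.
nouns = tdetails.Token(tdetails.PartOfSpeech == "noun");
```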

Algorithms

If the input documents do not contain sentence details, then the function first runs addSentenceDetails.

Version History

Introduced in R2018b