
addPartOfSpeechDetails

Add part-of-speech tags to documents

Description

Use addPartOfSpeechDetails to add part-of-speech tags to documents.

The function supports English, Japanese, German, and Korean text.


updatedDocuments = addPartOfSpeechDetails(documents) detects parts of speech in documents and updates the token details. By default, the function retokenizes the text for part-of-speech tagging. For example, the function splits the word "you're" into the tokens "you" and "'re". To get the part-of-speech details from updatedDocuments, use tokenDetails.

updatedDocuments = addPartOfSpeechDetails(documents,Name,Value) specifies additional options using one or more name-value pair arguments.

Tip

Use addPartOfSpeechDetails before using the lower, upper, erasePunctuation, normalizeWords, removeWords, and removeStopWords functions, as addPartOfSpeechDetails uses information that is removed by these functions.
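
For example, the following minimal sketch (the sample text and the particular preprocessing steps are assumptions, not from this page) adds the part-of-speech details before removing stop words and punctuation, so the remaining tokens keep their tags:

documents = tokenizedDocument("The quick brown fox jumps over the lazy dog.");
documents = addPartOfSpeechDetails(documents);   % tag before removing information
documents = removeStopWords(documents);          % tags of the remaining tokens are kept
documents = erasePunctuation(documents);
tdetails = tokenDetails(documents);              % remaining tokens keep their PartOfSpeech tags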

Examples


Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename ="sonnetsPreprocessed.txt"; str = extractFileText(filename); textData = split(str,newline); documents = tokenizedDocument(textData);

View the token details of the first few tokens.

tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
       Token       DocumentNumber    LineNumber      Type      Language
    ___________    ______________    __________    _______     ________

    "fairest"            1               1         letters        en
    "creatures"          1               1         letters        en
    "desire"             1               1         letters        en
    "increase"           1               1         letters        en
    "thereby"            1               1         letters        en
    "beautys"            1               1         letters        en
    "rose"               1               1         letters        en
    "might"              1               1         letters        en

Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
       Token       DocumentNumber    SentenceNumber    LineNumber      Type      Language     PartOfSpeech
    ___________    ______________    ______________    __________    _______     ________    ______________

    "fairest"            1                 1               1         letters        en       adjective
    "creatures"          1                 1               1         letters        en       noun
    "desire"             1                 1               1         letters        en       noun
    "increase"           1                 1               1         letters        en       noun
    "thereby"            1                 1               1         letters        en       adverb
    "beautys"            1                 1               1         letters        en       noun
    "rose"               1                 1               1         letters        en       noun
    "might"              1                 1               1         letters        en       auxiliary-verb
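
Continuing this example, you can filter the token details by tag. The following minimal sketch (the filtering step is an illustration, not part of the original example) lists the tokens tagged as nouns in the first document:

isNoun = tdetails.PartOfSpeech == "noun" & tdetails.DocumentNumber == 1;
nouns = tdetails.Token(isNoun)   % tokens tagged as nouns in the first document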

Tokenize Japanese text using tokenizedDocument.

str = ["恋に悩み、苦しむ。""恋の悩みで 苦しむ。""空に星が輝き、瞬いている。""空の星が輝きを増している。""駅までは遠くて、歩けない。""遠くの駅まで歩けない。""すもももももももものうち。"]; documents = tokenizedDocument(str);

For Japanese text, you can get the part-of-speech details using tokenDetails. For English text, you must first use addPartOfSpeechDetails.

tdetails = tokenDetails(documents);
head(tdetails)
ans=8×8 table
     Token      DocumentNumber    LineNumber       Type        Language    PartOfSpeech     Lemma        Entity
    ________    ______________    __________    ___________    ________    ____________    ________    __________

    "恋"              1               1         letters           ja       noun            "恋"        non-entity
    "に"              1               1         letters           ja       adposition      "に"        non-entity
    "悩み"            1               1         letters           ja       verb            "悩む"      non-entity
    "、"              1               1         punctuation       ja       punctuation     "、"        non-entity
    "苦しむ"          1               1         letters           ja       verb            "苦しむ"    non-entity
    "。"              1               1         punctuation       ja       punctuation     "。"        non-entity
    "恋"              2               1         letters           ja       noun            "恋"        non-entity
    "の"              2               1         letters           ja       adposition      "の"        non-entity
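
Continuing this example, the following minimal sketch (the filtering step is an illustration, not part of the original example) lists the lemmas of the tokens tagged as verbs:

isVerb = tdetails.PartOfSpeech == "verb";
verbLemmas = tdetails.Lemma(isVerb)   % lemma forms of the tokens tagged as verbs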

Tokenize German text using tokenizedDocument.

str = ["Guten Morgen. Wie geht es dir?""Heute wird ein guter Tag."]; documents = tokenizedDocument(str)
documents =
  2x1 tokenizedDocument:

    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .

To get the part-of-speech details for German text, first use addPartOfSpeechDetails.

documents = addPartOfSpeechDetails(documents);

To view the part-of-speech details, use the tokenDetails function.

tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
     Token      DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech
    ________    ______________    ______________    __________    ___________    ________    ____________

    "Guten"           1                 1               1         letters           de       adjective
    "Morgen"          1                 1               1         letters           de       noun
    "."               1                 1               1         punctuation       de       punctuation
    "Wie"             1                 2               1         letters           de       adverb
    "geht"            1                 2               1         letters           de       verb
    "es"              1                 2               1         letters           de       pronoun
    "dir"             1                 2               1         letters           de       pronoun
    "?"               1                 2               1         punctuation       de       punctuation
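
Because addPartOfSpeechDetails also adds sentence numbers, you can summarize them. The following minimal sketch (the use of groupsummary is an assumption, not part of the original example) counts the sentences in each document from the SentenceNumber variable:

numSentences = groupsummary(tdetails,"DocumentNumber","max","SentenceNumber")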

Input Arguments


Input documents, specified as a tokenizedDocument array.

Name-Value Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'DiscardKnownValues',true specifies to discard previously computed details and recompute them.

Method to retokenize documents, specified as one of the following:

  • 'part-of-speech' – Transform the tokens for part-of-speech tagging (see the sketch after this list). The function performs these tasks:

    • Split compound words. For example, split the compound word "wanna" into the tokens "want" and "to". This includes compound words containing apostrophes. For example, the function splits the word "don't" into the tokens "do" and "n't".

    • Merge periods that do not end sentences with preceding tokens. For example, merge the tokens "Mr" and "." into the token "Mr.".

    • For German text, merge abbreviations that span multiple tokens. For example, merge the tokens "z", ".", "B", and "." into the single token "z. B.".

    • Merge runs of periods into ellipses. For example, merge three instances of "." into the single token "...".

  • 'none' – Do not retokenize the documents.
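
The following minimal sketch (the example text is an assumption) compares the tokens before and after retokenization for part-of-speech tagging:

documents = tokenizedDocument("Mr. Smith doesn't worry.");
tdetailsBefore = tokenDetails(documents);
tokensBefore = tdetailsBefore.Token          % tokens from the default tokenizer
documents = addPartOfSpeechDetails(documents);
tdetailsAfter = tokenDetails(documents);
tokensAfter = tdetailsAfter.Token            % tokens after part-of-speech retokenization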

List of abbreviations for sentence detection, specified as a string array, character vector, cell array of character vectors, or a table.

If the input documents do not contain sentence details, then the function first runs the addSentenceDetails function and specifies the abbreviation list given by 'Abbreviations'. To specify more options for sentence detection (for example, sentence starters), use the addSentenceDetails function before using addPartOfSpeechDetails.

If Abbreviations is a string array, character vector, or cell array of character vectors, then the function treats these as regular abbreviations. If the next word is a capitalized sentence starter, then the function breaks at the trailing period. The function ignores any differences in the letter case of the abbreviations. Specify the sentence starters using the Starters name-value pair.

To specify different behaviors when splitting sentences at abbreviations, specify Abbreviations as a table. The table must have variables named Abbreviation and Usage, where Abbreviation contains the abbreviations, and Usage contains the type of each abbreviation. The following list describes the possible values of Usage and the behavior of the function when passed abbreviations of these types. A code sketch that passes a custom abbreviation table follows this argument description.

  • regular – If the next word is a capitalized sentence starter, then break at the trailing period. Otherwise, do not break at the trailing period. For example, with the abbreviation "appt.":

    • "Book an appt. We'll meet then." is split into the sentences "Book an appt." and "We'll meet then."

    • "Book an appt. today." is detected as the single sentence "Book an appt. today."

  • inner – Do not break after the trailing period. For example, with the abbreviation "Dr.", "Dr. Smith." is detected as the single sentence "Dr. Smith."

  • reference – If the next token is not a number, then break at the trailing period. If the next token is a number, then do not break at the trailing period. For example, with the abbreviation "fig.":

    • "See fig. 3." is detected as the single sentence "See fig. 3."

    • "Try a fig. They are nice." is split into the sentences "Try a fig." and "They are nice."

  • unit – If the previous word is a number and the next word is a capitalized sentence starter, then break at the trailing period. If the previous word is a number and the next word is not capitalized, then do not break at the trailing period. If the previous word is not a number, then break at the trailing period. For example, with the abbreviation "in.":

    • "The height is 30 in. The width is 10 in." is split into the sentences "The height is 30 in." and "The width is 10 in."

    • "The item is 10 in. wide." is detected as the single sentence "The item is 10 in. wide."

    • "Come in. Sit down." is split into the sentences "Come in." and "Sit down."
The default value is the output of the abbreviations function. For Japanese and Korean text, abbreviations do not usually impact sentence detection.

Tip

By default, the function treats single letter abbreviations, such as "V.", or tokens with mixed single letters and periods, such as "U.S.A.", as regular abbreviations. You do not need to include these abbreviations in Abbreviations.

Data Types: char | string | table | cell
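
The following minimal sketch (the abbreviation choices, the categorical Usage type, and the example text are assumptions) passes a custom abbreviation table when adding the part-of-speech details:

% Table with the required Abbreviation and Usage variables.
Abbreviation = ["fig.";"appt."];
Usage = categorical(["reference";"regular"]);
abbr = table(Abbreviation,Usage);

documents = tokenizedDocument("See fig. 3. Book an appt. We'll meet then.");
documents = addPartOfSpeechDetails(documents,'Abbreviations',abbr);
tdetails = tokenDetails(documents);          % SentenceNumber reflects the custom list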

Option to discard previously computed details and recompute them, specified as true or false.

Data Types: logical
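
The following minimal sketch (the example text is an assumption) recomputes details that were already added:

documents = tokenizedDocument("Time flies like an arrow.");
documents = addPartOfSpeechDetails(documents);
% Discard the previously computed details and recompute them.
documents = addPartOfSpeechDetails(documents,'DiscardKnownValues',true);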

Output Arguments


Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.

More About


Part-of-Speech Tags

The addPartOfSpeechDetails function adds part-of-speech tags to the table returned by the tokenDetails function. The function tags each token with a categorical tag, using one of the following class names (a sketch of tallying these tags follows the list):

  • "adjective"– Adjective

  • "adposition"– Adposition

  • "adverb"– Adverb

  • "auxiliary-verb"– Auxiliary verb

  • "coord-conjunction"– Coordinating conjunction

  • "determiner"– Determiner

  • "interjection"– Interjection

  • "noun"– Noun

  • "numeral"– Numeral

  • "particle"– Particle

  • "pronoun"– Pronoun

  • "proper-noun"– Proper noun

  • "punctuation"– Punctuation

  • "subord-conjunction"– Subordinating conjucntion

  • "symbol"– Symbol

  • "verb"– Verb

  • "other"– Other

Algorithms

If the input documents do not contain sentence details, then the function first runs addSentenceDetails.

Introduced in R2018b