addSentenceDetails

Add sentence numbers to documents

Syntax

updatedDocuments = addSentenceDetails(documents)

updatedDocuments = addSentenceDetails(documents,Name,Value)

Description

Use addSentenceDetails to add sentence information to documents.

The function supports English, Japanese, German, and Korean text.

updatedDocuments = addSentenceDetails(documents) detects the sentence boundaries in documents and updates the token details. To get the sentence details from updatedDocuments, use tokenDetails.

updatedDocuments = addSentenceDetails(documents,Name,Value) specifies additional options using one or more name-value pair arguments.

Tip

Use addSentenceDetails before using the lower, upper, erasePunctuation, normalizeWords, removeWords, and removeStopWords functions, because addSentenceDetails uses information that these functions remove.
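
As a rough illustration of the ordering this tip describes, the sketch below adds sentence details first and only then lowercases the text and erases punctuation. The sample text is made up for illustration, and the code assumes that Text Analytics Toolbox is installed.

```matlab
% Hypothetical sample text (not from the reference page).
str = "This is an example document. It has two sentences.";
documents = tokenizedDocument(str);

% Add sentence details while punctuation and letter case are still intact.
documents = addSentenceDetails(documents);

% Then apply preprocessing steps that remove that information.
documents = lower(documents);
documents = erasePunctuation(documents);

% The token details table should still carry the sentence numbers.
tdetails = tokenDetails(documents);
```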

Examples

Add Sentence Details to Documents

Create a tokenized document array.

    str = [ ...
        "This is an example document. It has two sentences."
        "This document has one sentence."
        "Here is another example document. It also has two sentences."];
    documents = tokenizedDocument(str);

Add sentence details to the documents using addSentenceDetails. This function adds sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.

    documents = addSentenceDetails(documents);
    tdetails = tokenDetails(documents);
    head(tdetails)

    ans=8×6 table
          Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language
        __________    ______________    ______________    __________    ___________    ________
        "This"              1                 1               1         letters           en
        "is"                1                 1               1         letters           en
        "an"                1                 1               1         letters           en
        "example"           1                 1               1         letters           en
        "document"          1                 1               1         letters           en
        "."                 1                 1               1         punctuation       en
        "It"                1                 2               1         letters           en
        "has"               1                 2               1         letters           en

View the token details of the second sentence of the third document.

    idx = tdetails.DocumentNumber == 3 & ...
        tdetails.SentenceNumber == 2;
    tdetails(idx,:)

    ans=6×6 table
           Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language
        ___________    ______________    ______________    __________    ___________    ________
        "It"                 3                 2               1         letters           en
        "also"               3                 2               1         letters           en
        "has"                3                 2               1         letters           en
        "two"                3                 2               1         letters           en
        "sentences"          3                 2               1         letters           en
        "."                  3                 2               1         punctuation       en

Input Arguments

`documents`—Input documents
`tokenizedDocument` array

Input documents, specified as a tokenizedDocument array.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'Abbreviations',["cm" "mm" "in"] specifies to detect sentence boundaries where these abbreviations are followed by a period and a capitalized sentence starter.
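
The two name-value forms described above can be sketched as follows. The abbreviation list is the one from the example; the sample text is hypothetical, and Text Analytics Toolbox is assumed to be installed.

```matlab
% Hypothetical sample text.
documents = tokenizedDocument("The height is 30 in. The width is 10 in.");

% R2021a and later: Name=Value syntax.
docsA = addSentenceDetails(documents,Abbreviations=["cm" "mm" "in"]);

% Any release: comma-separated name and value, with the name in quotes.
docsB = addSentenceDetails(documents,'Abbreviations',["cm" "mm" "in"]);

% Both calls pass the same option, so the token details should match.
tdetailsA = tokenDetails(docsA);
tdetailsB = tokenDetails(docsB);
```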

`Abbreviations`—List of abbreviations
string array | character vector | cell array of character vectors | table

List of abbreviations, specified as a string array, character vector, cell array of character vectors, or a table.

If Abbreviations is a string array, character vector, or cell array of character vectors, then the function treats these as regular abbreviations. If the next word is a capitalized sentence starter, then the function breaks at the trailing period. The function ignores any differences in the letter case of the abbreviations. Specify the sentence starters using the Starters name-value pair.

To specify different behaviors when splitting sentences at abbreviations, specify Abbreviations as a table. The table must have variables named Abbreviation and Usage, where Abbreviation contains the abbreviations, and Usage contains the type of each abbreviation. The following table describes the possible values of Usage, and the behavior of the function when passed abbreviations of these types.

Usage	Behavior	Example Abbreviation	Example Text	Detected Sentences
`regular`	If the next word is a capitalized sentence starter, then break at the trailing period. Otherwise, do not break at the trailing period.	"appt."	`"Book an appt. We'll meet then."`	`"Book an appt."` `"We'll meet then."`
`regular`		"appt."	`"Book an appt. today."`	`"Book an appt. today."`
`inner`	Do not break after trailing period.	"Dr."	`"Dr. Smith."`	`"Dr. Smith."`
`reference`	If the next token is not a number, then break at a trailing period. If the next token is a number, then do not break at the trailing period.	"fig."	`"See fig. 3."`	`"See fig. 3."`
`reference`		"fig."	`"Try a fig. They are nice."`	`"Try a fig."` `"They are nice."`
`unit`	If the previous word is a number and the following word is a capitalized sentence starter, then break at a trailing period.	"in."	`"The height is 30 in. The width is 10 in."`	`"The height is 30 in."` `"The width is 10 in."`
	If the previous word is a number and the following word is not capitalized, then do not break at a trailing period.		`"The item is 10 in. wide."`	`"The item is 10 in. wide."`
	If the previous word is not a number, then break at a trailing period.		`"Come in. Sit down."`	`"Come in."` `"Sit down."`
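
The table form of Abbreviations might be built as in the sketch below. It reuses abbreviations and texts from the table above, writes the entries without the trailing period (an assumption, following the ["cm" "mm" "in"] example), and stores Usage as a categorical array (also an assumption about the accepted type). No expected output is shown because the sketch has not been checked against a specific release.

```matlab
% Abbreviations taken from the table above, written without the trailing
% period (assumption) and paired with the usage type of each entry.
abbreviation = ["appt"; "Dr"; "fig"; "in"];
usage = categorical(["regular"; "inner"; "reference"; "unit"]);
tbl = table(abbreviation,usage,'VariableNames',{'Abbreviation','Usage'});

% Example texts taken from the table above.
str = [ ...
    "Book an appt. We'll meet then."
    "See fig. 3."
    "The item is 10 in. wide."];
documents = tokenizedDocument(str);

% Detect sentences using the custom abbreviation table.
documents = addSentenceDetails(documents,'Abbreviations',tbl);

% Inspect which sentence each token was assigned to.
tdetails = tokenDetails(documents);
tdetails(:,["Token" "DocumentNumber" "SentenceNumber"])
```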

The default value is the output of the abbreviations function. For Japanese and Korean text, abbreviations do not usually impact sentence detection.

Tip

By default, the function treats single-letter abbreviations, such as "V.", or tokens with mixed single letters and periods, such as "U.S.A.", as regular abbreviations. You do not need to include these abbreviations in Abbreviations.

Example: ["cm" "mm" "in"]

Data Types: char | string | table | cell

`Starters`—Words that start a sentence
string array | character vector | cell array of character vectors

Words that start a sentence, specified as a string array, character vector, or a cell array of character vectors. If a sentence starter appears capitalized after a regular abbreviation, then the function detects a sentence boundary at the trailing period. The function ignores any differences in the letter case of the sentence starters.

The default value is the output of the stopWords function.

Data Types: char | string | cell
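
One way to extend the default sentence starters, rather than replace them, is sketched below. The added word "Please", the abbreviation "appt" (written without the trailing period), and the sample text are hypothetical choices made for illustration.

```matlab
% Start from the default starters (the output of stopWords, per the text above).
words = stopWords;
% Append a hypothetical extra starter; (:) makes the result a column
% regardless of the orientation that stopWords returns.
starters = [words(:); "Please"];

% Hypothetical sample text with a regular abbreviation before the new starter.
documents = tokenizedDocument("Book an appt. Please arrive early.");
documents = addSentenceDetails(documents, ...
    'Abbreviations',"appt", ...
    'Starters',starters);
tdetails = tokenDetails(documents);
```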

`DiscardKnownValues`—Option to discard previously computed details
`false` (default) | `true`

Option to discard previously computed details and recompute them, specified as true or false.

Data Types: logical
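
A possible use of this option is recomputing sentence details after changing the abbreviation list, as in the sketch below. The sample text and the custom abbreviation table are hypothetical, and the categorical Usage variable is an assumption about the accepted type.

```matlab
% Hypothetical sample text.
documents = tokenizedDocument("See fig. 3. Try a fig. They are nice.");

% First pass with the default options.
documents = addSentenceDetails(documents);

% Hypothetical custom abbreviation table with a single reference abbreviation.
tbl = table("fig",categorical("reference"), ...
    'VariableNames',{'Abbreviation','Usage'});

% Second pass: discard the sentence details computed above and recompute
% them using the custom table.
documents = addSentenceDetails(documents, ...
    'Abbreviations',tbl, ...
    'DiscardKnownValues',true);
tdetails = tokenDetails(documents);
```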

Output Arguments

`updatedDocuments`— Updated documents
`tokenizedDocument` array

Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.
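
As a small sketch of working with the returned details, the code below counts the detected sentences in each document by taking the highest sentence number per document. The variable name numSentences is introduced here for illustration; the sample text is taken from the example earlier on this page.

```matlab
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."];
documents = tokenizedDocument(str);
updatedDocuments = addSentenceDetails(documents);
tdetails = tokenDetails(updatedDocuments);

% Highest sentence number in each document, that is, the number of
% sentences detected in that document.
numSentences = splitapply(@max,tdetails.SentenceNumber,tdetails.DocumentNumber);
```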

More About

Language Considerations

The addSentenceDetails function detects sentence boundaries based on punctuation characters and line number information. For English and German text, the function also uses a list of abbreviations passed to the function.

For other languages, you might need to specify your own list of abbreviations for sentence detection. To do this, use the 'Abbreviations' option of addSentenceDetails.

Algorithms

If emoticons or emoji characters appear after a terminating punctuation character, then the function splits the sentence after the emoticons and emoji.

Version History

Introduced in R2018a
