
removeWords

Remove selected words from documents or bag-of-words model

Description


newDocuments = removeWords(documents,words) removes the specified words from documents. By default, the function is case sensitive.


newBag = removeWords(bag,words) removes the specified words from the bag-of-words model bag. By default, the function is case sensitive.

newDocuments = removeWords(___,'IgnoreCase',true) removes words ignoring case, using any of the previous syntaxes.
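A minimal sketch of the 'IgnoreCase' option (the input documents and word list here are illustrative, not from the examples below):

```matlab
% Tokens differ in case from the words being removed.
documents = tokenizedDocument([
    "An Example of a SHORT Sentence"
    "a second short sentence"]);

% With 'IgnoreCase',true, both "SHORT" and "short" match "short".
newDocuments = removeWords(documents,["short" "second"],'IgnoreCase',true)
```

Without 'IgnoreCase',true, the token "SHORT" would not match "short" and would remain in the first document.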


newDocuments = removeWords(documents,idx) removes words by specifying the numeric or logical indices idx of the words in documents.Vocabulary. This syntax is the same as newDocuments = removeWords(documents,documents.Vocabulary(idx)).


newBag = removeWords(bag,idx) removes words by specifying the numeric or logical indices idx of the words in bag.Vocabulary. This syntax is the same as newBag = removeWords(bag,bag.Vocabulary(idx)).

Examples


Remove words from an array of documents by inputting a string array of words to removeWords.

Create an array of tokenized documents.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);

Remove the words "short" and "second".

words = ["short" "second"];
newDocuments = removeWords(documents,words)
newDocuments = 
  2x1 tokenizedDocument:

    5 tokens: an example of a sentence
    2 tokens: a sentence

To remove the default list of stop words using the language details of the documents, use removeStopWords.

To remove a custom list of stop words, use the removeWords function. You can use the stop word list returned by the stopWords function as a starting point.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

View the first few documents.

documents(1:5)
ans = 
  5x1 tokenizedDocument:

    70 tokens: fairest creatures desire increase thereby beautys rose might never die riper time decease tender heir might bear memory thou contracted thine own bright eyes feedst thy lights flame selfsubstantial fuel making famine abundance lies thy self thy foe thy sweet self cruel thou art worlds fresh ornament herald gaudy spring thine own bud buriest thy content tender churl makst waste niggarding pity world else glutton eat worlds due grave thee
    71 tokens: forty winters shall besiege thy brow dig deep trenches thy beautys field thy youths proud livery gazed tatterd weed small worth held asked thy beauty lies treasure thy lusty days say thine own deep sunken eyes alleating shame thriftless praise praise deservd thy beautys thou couldst answer fair child mine shall sum count make old excuse proving beauty succession thine new made thou art old thy blood warm thou feelst cold
    65 tokens: look thy glass tell face thou viewest time face form another whose fresh repair thou renewest thou dost beguile world unbless mother fair whose uneard womb disdains tillage thy husbandry fond tomb selflove stop posterity thou art thy mothers glass thee calls back lovely april prime thou windows thine age shalt despite wrinkles thy golden time thou live rememberd die single thine image dies thee
    71 tokens: unthrifty loveliness why dost thou spend upon thy self thy beautys legacy natures bequest gives nothing doth lend frank lends free beauteous niggard why dost thou abuse bounteous largess thee give profitless usurer why dost thou great sum sums yet canst live traffic thy self alone thou thy self thy sweet self dost deceive nature calls thee gone acceptable audit canst thou leave thy unused beauty tombed thee lives th executor
    61 tokens: hours gentle work frame lovely gaze every eye doth dwell play tyrants same unfair fairly doth excel neverresting time leads summer hideous winter confounds sap checked frost lusty leaves quite gone beauty oersnowed bareness every summers distillation left liquid prisoner pent walls glass beautys effect beauty bereft nor nor remembrance flowers distilld though winter meet leese show substance still lives sweet

Create a list of stop words starting with the output of the stopWords function.

customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];

Remove the custom stop words from the documents and view the first few documents.

documents = removeWords(documents,customStopWords);
documents(1:5)
ans = 
  5x1 tokenizedDocument:

    62 tokens: fairest creatures desire increase thereby beautys rose might never die riper time decease tender heir might bear memory contracted thine own bright eyes feedst lights flame selfsubstantial fuel making famine abundance lies self foe sweet self cruel art worlds fresh ornament herald gaudy spring thine own bud buriest content tender churl makst waste niggarding pity world else glutton eat worlds due grave
    61 tokens: forty winters shall besiege brow dig deep trenches beautys field youths proud livery gazed tatterd weed small worth held asked beauty lies treasure lusty days say thine own deep sunken eyes alleating shame thriftless praise praise deservd beautys couldst answer fair child mine shall sum count make old excuse proving beauty succession thine new made art old blood warm feelst cold
    52 tokens: look glass tell face viewest time face form another whose fresh repair renewest beguile world unbless mother fair whose uneard womb disdains tillage husbandry fond tomb selflove stop posterity art mothers glass calls back lovely april prime windows thine age shalt despite wrinkles golden time live rememberd die single thine image dies
    52 tokens: unthrifty loveliness why spend upon self beautys legacy natures bequest gives nothing lend frank lends free beauteous niggard why abuse bounteous largess give profitless usurer why great sum sums yet canst live traffic self alone self sweet self deceive nature calls gone acceptable audit canst leave unused beauty tombed lives th executor
    59 tokens: hours gentle work frame lovely gaze every eye dwell play tyrants same unfair fairly excel neverresting time leads summer hideous winter confounds sap checked frost lusty leaves quite gone beauty oersnowed bareness every summers distillation left liquid prisoner pent walls glass beautys effect beauty bereft nor nor remembrance flowers distilld though winter meet leese show substance still lives sweet

Remove words from documents by inputting a vector of numeric indices to removeWords.

Create an array of tokenized documents.

documents = tokenizedDocument([
    "I love MATLAB"
    "I love MathWorks"])
documents = 
  2x1 tokenizedDocument:

    3 tokens: I love MATLAB
    3 tokens: I love MathWorks

View the vocabulary of documents.

documents.Vocabulary
ans = 1x4 string
    "I"    "love"    "MATLAB"    "MathWorks"

Remove the first and third words of the vocabulary from the documents by specifying the numeric indices [1 3].

idx = [1 3];
newDocuments = removeWords(documents,idx)
newDocuments = 
  2x1 tokenizedDocument:

    1 tokens: love
    2 tokens: love MathWorks

Alternatively, you can specify logical indices.

idx = logical([1 0 1 0]);
newDocuments = removeWords(documents,idx)
newDocuments = 
  2x1 tokenizedDocument:

    1 tokens: love
    2 tokens: love MathWorks

Remove the stop words from a bag-of-words model by inputting a list of stop words to removeWords. Stop words are words such as "a", "the", and "in", which are commonly removed from text before analysis.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);
newBag = removeWords(bag,stopWords)
newBag = 
  bagOfWords with properties:

          Counts: [2x4 double]
      Vocabulary: ["example"    "short"    "sentence"    "second"]
        NumWords: 4
    NumDocuments: 2

Remove words from a bag-of-words model by inputting a vector of numeric indices to removeWords.

Create an array of tokenized documents.

documents = tokenizedDocument([
    "I love MATLAB"
    "I love MathWorks"]);
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [2x4 double]
      Vocabulary: ["I"    "love"    "MATLAB"    "MathWorks"]
        NumWords: 4
    NumDocuments: 2

View the vocabulary of bag.

bag.Vocabulary
ans = 1x4 string
    "I"    "love"    "MATLAB"    "MathWorks"

Remove the first and third words of the vocabulary from the bag-of-words model by specifying the numeric indices [1 3].

idx = [1 3];
newBag = removeWords(bag,idx)
newBag = 
  bagOfWords with properties:

          Counts: [2x2 double]
      Vocabulary: ["love"    "MathWorks"]
        NumWords: 2
    NumDocuments: 2

Alternatively, you can specify logical indices.

idx = logical([1 0 1 0]);
newBag = removeWords(bag,idx)
newBag = 
  bagOfWords with properties:

          Counts: [2x2 double]
      Vocabulary: ["love"    "MathWorks"]
        NumWords: 2
    NumDocuments: 2

Input Arguments


Input documents, specified as atokenizedDocumentarray.

Input bag-of-words model, specified as abagOfWordsobject.

Words to remove, specified as a string vector, character vector, or cell array of character vectors. If you specify words as a character vector, then the function treats it as a single word.

Data Types: string | char | cell
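A short sketch of the character-vector behavior described above (the input sentence is illustrative). Because a character vector counts as a single word, removeWords removes the word 'example' rather than the individual characters:

```matlab
documents = tokenizedDocument("an example of a short sentence");

% 'example' is a character vector, so it is treated as one word.
newDocuments = removeWords(documents,'example')
```

To remove several words at once, pass a string vector such as ["example" "short"] instead.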

Indices of words to remove, specified as a vector of numeric indices or a vector of logical indices. The indices in idx correspond to the locations of the words in the Vocabulary property of the input documents or bag-of-words model.

Example: [1 5 10]

Output Arguments


Output documents, returned as atokenizedDocumentarray.

Output bag-of-words model, returned as abagOfWordsobject.

Version History

Introduced in R2017b