removeStopWords

Remove stop words from documents

Description

Words like "a", "and", "to", and "the" (known as stop words) can add noise to data. Use this function to remove stop words before analysis.

The function supports English, Japanese, German, and Korean text. To learn how to use removeStopWords for other languages, see Language Considerations.

newDocuments = removeStopWords(documents) removes the stop words from the tokenizedDocument array documents. By default, the function uses the stop word list given by the stopWords function according to the language details of documents, and is case insensitive.

To remove a custom list of words, use the removeWords function.

newDocuments = removeStopWords(documents,'IgnoreCase',false) removes only those stop words whose case matches the stop word list given by the stopWords function.
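The difference between the two syntaxes can be sketched as follows. The sample sentence is illustrative only, and the comments assume that the built-in English list stores its entries in lowercase, so treat the expected behavior as an assumption to check rather than a documented result.

```matlab
% Sketch: compare case-insensitive (default) and case-sensitive removal.
% The sample sentence is illustrative only.
documents = tokenizedDocument("The cat and THE dog sat on a mat",'Language','en');

% Default behavior: stop words are matched regardless of case.
insensitive = removeStopWords(documents);

% Case-sensitive matching: only tokens whose case matches the stop word
% list are removed, so "The" and "THE" may remain if the list is lowercase.
sensitive = removeStopWords(documents,'IgnoreCase',false);

% Compare the number of tokens left in each result.
doclength(insensitive)
doclength(sensitive)
```

Case-sensitive matching can be useful when capitalized tokens carry meaning, for example acronyms that happen to coincide with a stop word.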

Tip

Use removeStopWords before using the normalizeWords function, because removeStopWords uses information that normalizeWords removes.
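A minimal sketch of the ordering that the tip recommends is shown here. The input sentence is made up for illustration, and normalizeWords is called with its default options.

```matlab
% Sketch: remove stop words first, then normalize the remaining words.
% The input sentence is illustrative only.
documents = tokenizedDocument("The children were running to the nearest shops", ...
    'Language','en');

% Step 1: remove stop words while the original token details are intact.
documents = removeStopWords(documents);

% Step 2: normalize (for example, stem) the remaining words.
documents = normalizeWords(documents)
```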

Examples

Remove the stop words from an array of documents using removeStopWords. The tokenizedDocument function detects that the documents are in English, so removeStopWords removes English stop words.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
newDocuments = removeStopWords(documents)

newDocuments = 
  2x1 tokenizedDocument:

    3 tokens: example short sentence
    3 tokens: second short sentence

Tokenize Japanese text using tokenizedDocument. The function automatically detects Japanese text.

str = [
    "ここは静かなので、とても穏やかです"
    "企業内の顧客データを利用し、今年の売り上げを調べることが出来た。"
    "私は先生です。私は英語を教えています。"];
documents = tokenizedDocument(str);

Remove stop words using removeStopWords. The function uses the language details from documents to determine which language stop words to remove.

documents = removeStopWords(documents)

documents = 
  3x1 tokenizedDocument:

     4 tokens: 静か 、 とても 穏やか
    10 tokens: 企業 顧客 データ 利用 、 今年 売り上げ 調べる 出来 。
     5 tokens: 先生 。 英語 教え 。

Tokenize German text using tokenizedDocument. The function automatically detects German text.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)

documents = 
  2x1 tokenizedDocument:

    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .

Remove stop words using the removeStopWords function. The function uses the language details from documents to determine which language stop words to remove.

documents = removeStopWords(documents)

documents = 
  2x1 tokenizedDocument:

    5 tokens: Guten Morgen . geht ?
    5 tokens: Heute wird guter Tag .

Input Arguments

Input documents, specified as a tokenizedDocument array.

Output Arguments

Output documents, returned as a tokenizedDocument array.

More About

Language Considerations

The stopWords and removeStopWords functions support English, Japanese, German, and Korean stop words only.

To remove stop words from other languages, use removeWords and specify your own stop words to remove.
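As a rough sketch of this approach, the following code removes a small, hand-written list of French stop words. The variable name frenchStopWords and its contents are hypothetical and far from complete. In practice you would supply a fuller list suited to your data.

```matlab
% Sketch: remove stop words for a language without built-in support.
% The sentence and the stop word list below are illustrative assumptions.
str = "le chat est sur la table";
documents = tokenizedDocument(str);

% Hypothetical, deliberately short list of French stop words.
frenchStopWords = ["le" "la" "est" "sur"];

% removeWords deletes every token that appears in the supplied list.
newDocuments = removeWords(documents,frenchStopWords)
```

Because removeWords matches against the list you supply, the result does not depend on the language details stored in the documents.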

Algorithms

Language Details

tokenizedDocument objects contain details about the tokens, including language details. The language details of the input documents determine the behavior of removeStopWords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the 'Language' name-value pair argument of tokenizedDocument. To view the token details, use the tokenDetails function.
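The following sketch sets the language manually and then inspects the token details. It reuses the German sentence from the examples, and the variable name tdetails is arbitrary.

```matlab
% Sketch: set the language explicitly instead of relying on detection.
str = "Guten Morgen. Wie geht es dir?";
documents = tokenizedDocument(str,'Language','de');

% Inspect the token details, which include the language of each token.
tdetails = tokenDetails(documents);
head(tdetails)

% removeStopWords now uses the German stop word list.
newDocuments = removeStopWords(documents)
```

Setting 'Language' explicitly can help with very short or mixed-language text, where automatic detection may be less reliable.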

Version History

Introduced in R2018b