Remove words with low counts from bag-of-words model



newBag = removeInfrequentWords(bag,count) removes the words that appear at most count times in total from the bag-of-words model bag. By default, the function is case sensitive.


newBag = removeInfrequentWords(bag,count,'IgnoreCase',true) removes the words that appear at most count times in total, ignoring case. If words differ only by case, then the corresponding counts are merged.


Remove the words that appear two times or fewer from a bag-of-words model.

Create a bag-of-words model from an array of tokenized documents.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"
    "another example"
    "a short example"]);
bag = bagOfWords(documents)

bag =
  bagOfWords with properties:

          Counts: [4x8 double]
      Vocabulary: ["an" "example" "of" "a" "short" ... ]
        NumWords: 8
    NumDocuments: 4

Remove the words that appear two times or fewer from the bag-of-words model.

count = 2;
newBag = removeInfrequentWords(bag,count)

newBag =
  bagOfWords with properties:

          Counts: [4x3 double]
      Vocabulary: ["example" "a" "short"]
        NumWords: 3
    NumDocuments: 4
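
The case-insensitive syntax can be illustrated in the same way. The following is a minimal sketch, not part of the original example: the documents and the variable names newBagSensitive and newBagIgnoreCase are made up for illustration. The words "A"/"a" and "Short"/"short" differ only by case, so their counts should be merged when 'IgnoreCase' is true, which can keep words that the case-sensitive call removes.

```matlab
% Hypothetical documents in which some words differ only by case.
documents = tokenizedDocument([
    "A short sentence"
    "a Short example"
    "another short example"]);
bag = bagOfWords(documents);

% Remove words that appear once or not at all in total.
count = 1;

% Default behavior: "A" and "a" (and "Short" and "short") are
% counted as separate words.
newBagSensitive = removeInfrequentWords(bag,count);

% Ignoring case: counts of words that differ only by case are
% merged before they are compared with the threshold.
newBagIgnoreCase = removeInfrequentWords(bag,count,'IgnoreCase',true);

% Compare the vocabularies that remain in each model.
disp(newBagSensitive.Vocabulary)
disp(newBagIgnoreCase.Vocabulary)
```

Comparing the two vocabularies shows which words survive only because their case variants were counted together.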

Input Arguments

Input bag-of-words model, specified as a bagOfWords object.

Count threshold to remove words, specified as a positive integer. The function removes the words that appear count times in total or fewer.
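
To make the threshold concrete, the sketch below reproduces the case-sensitive behavior by hand using the Counts and Vocabulary properties together with removeWords. It is an illustration of what "count times in total or fewer" means, not a description of how removeInfrequentWords is implemented; the variable names totalCounts, idx, infrequentWords, and manualBag are chosen here for illustration.

```matlab
% Same documents as in the example for this function.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"
    "another example"
    "a short example"]);
bag = bagOfWords(documents);
count = 2;

% Total number of occurrences of each vocabulary word, summed over
% all documents (rows of Counts are documents, columns are words).
totalCounts = full(sum(bag.Counts,1));

% Words at or below the threshold are the ones to remove.
idx = totalCounts <= count;
infrequentWords = bag.Vocabulary(idx);

% Remove those words manually and compare with the function.
manualBag = removeWords(bag,infrequentWords);
newBag = removeInfrequentWords(bag,count);
tf = isequal(manualBag.Vocabulary,newBag.Vocabulary);
disp(tf)
```

Note that the comparison is inclusive: a word that appears exactly count times in total is removed.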


Introduced in R2017b