removeInfrequentNgrams

Remove infrequently seen n-grams from bag-of-n-grams model

Description

newBag = removeInfrequentNgrams(bag,count) removes the n-grams that appear at most count times in total from the bag-of-n-grams model bag. The function, by default, is case sensitive.

newBag = removeInfrequentNgrams(bag,count,'NgramLengths',lengths) only removes n-grams with lengths specified by lengths. The function, by default, is case sensitive.

newBag = removeInfrequentNgrams(___,'IgnoreCase',true) removes the n-grams that appear at most count times, ignoring case. If n-grams differ only by case, then the corresponding counts are merged.
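The effect of 'IgnoreCase' can be sketched on a tiny corpus. The four two-word documents below are made-up illustrative data, not part of the shipped examples, and the variable names bagSensitive and bagInsensitive are hypothetical. The sketch assumes that tokenizedDocument keeps the original letter case of each token.

```matlab
% Illustrative data (assumption): each document contains exactly one bigram.
documents = tokenizedDocument([
    "Fair Youth"
    "fair youth"
    "sweet love"
    "sweet love"]);
bag = bagOfNgrams(documents,'NgramLengths',2);

% Case sensitive (default): "Fair Youth" and "fair youth" are counted
% separately, once each, so both fall at or below the threshold of 1.
bagSensitive = removeInfrequentNgrams(bag,1);

% Ignoring case: the counts of the two variants are merged before the
% threshold is applied, so the merged bigram is counted twice.
bagInsensitive = removeInfrequentNgrams(bag,1,'IgnoreCase',true);

% Compare how many n-grams survive under each setting.
disp(bagSensitive.NumNgrams)
disp(bagInsensitive.NumNgrams)
```

Comparing the two NumNgrams values shows whether the case variants were treated as one n-gram or as two.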

Examples

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model. Specify to count bigrams (pairs of words) and trigrams (triples of words).

bag = bagOfNgrams(documents,'NgramLengths',[2 3])
bag =
  bagOfNgrams with properties:
          Counts: [154x18022 double]
      Vocabulary: ["fairest" "creatures" "desire" ... ]
          Ngrams: [18022x3 string]
    NgramLengths: [2 3]
       NumNgrams: 18022
    NumDocuments: 154

Remove n-grams of any length that appear two or fewer times in total.

bag = removeInfrequentNgrams(bag,2)
bag =
  bagOfNgrams with properties:
          Counts: [154x103 double]
      Vocabulary: ["thine" "thy" "self" "sweet" "thou" ... ]
          Ngrams: [103x3 string]
    NgramLengths: [2 3]
       NumNgrams: 103
    NumDocuments: 154

Remove bigrams that appear four or fewer times in total.

bag = removeInfrequentNgrams(bag,4,'NgramLengths',2)
bag =
  bagOfNgrams with properties:
          Counts: [154x41 double]
      Vocabulary: ["thine" "thy" "sweet" "thou" "dost" ... ]
          Ngrams: [41x3 string]
    NgramLengths: [2 3]
       NumNgrams: 41
    NumDocuments: 154

Input Arguments

Input bag-of-n-grams model, specified as a bagOfNgrams object.

Count threshold, specified as a positive integer. The function removes the n-grams that appear count times or fewer in total.
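Because the threshold is inclusive, an n-gram that appears exactly count times is also removed. A minimal sketch with made-up data (six two-word documents, so each document contributes exactly one bigram) illustrates this:

```matlab
% Illustrative data (assumption): "sweet love" appears 3 times,
% "fair youth" 2 times, and "dear friend" once.
documents = tokenizedDocument([
    "sweet love"
    "sweet love"
    "sweet love"
    "fair youth"
    "fair youth"
    "dear friend"]);
bag = bagOfNgrams(documents,'NgramLengths',2);

% Remove bigrams that appear 2 times or fewer in total. "fair youth"
% appears exactly 2 times, so it is removed along with "dear friend".
newBag = removeInfrequentNgrams(bag,2);

% Inspect what remains.
disp(newBag.NumNgrams)
disp(newBag.Ngrams)
```

To keep n-grams that appear exactly twice in this sketch, the threshold would need to be 1 instead of 2.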

N-gram lengths, specified as a positive integer or a vector of positive integers.

If you specify lengths, then the function removes infrequent n-grams of the specified lengths only. If you do not specify lengths, then the function removes infrequent n-grams regardless of length.

Example: [1 2 3]
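The following sketch, again on made-up data, shows how restricting the lengths leaves infrequent n-grams of other lengths untouched. The model counts both unigrams and bigrams, but only bigrams are pruned.

```matlab
% Illustrative data (assumption): three two-word documents.
documents = tokenizedDocument([
    "sweet love"
    "sweet love"
    "fair youth"]);

% Count unigrams and bigrams: 4 distinct unigrams and 2 distinct bigrams.
bag = bagOfNgrams(documents,'NgramLengths',[1 2]);

% Prune only bigrams that appear once or less. The unigrams "fair" and
% "youth" also appear only once, but they are not of length 2, so they stay.
newBag = removeInfrequentNgrams(bag,1,'NgramLengths',2);

% Compare the number of n-grams before and after pruning.
disp(bag.NumNgrams)
disp(newBag.NumNgrams)
```

Passing a vector such as [1 2] for the lengths would prune the infrequent unigrams as well.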

Output Arguments

Output bag-of-n-grams model, returned as a bagOfNgrams object.

Version History

Introduced in R2018a