removeEmptyDocuments

Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model

Description

newDocuments = removeEmptyDocuments(documents) removes documents which have no words from documents.

newBag = removeEmptyDocuments(bag) removes documents which have no words or n-grams from the bag-of-words or bag-of-n-grams model bag.

[___,idx] = removeEmptyDocuments(___) also returns the indices of the removed documents.

Examples

Remove documents containing no words from an array of tokenized documents.

Create an array of tokenized documents which includes empty documents.

documents = tokenizedDocument([
    "an example of a short sentence"
    ""
    "a second short sentence"
    ""])

documents =
  4x1 tokenizedDocument:

    6 tokens: an example of a short sentence
    0 tokens:
    4 tokens: a second short sentence
    0 tokens:

Remove the empty documents.

newDocuments = removeEmptyDocuments(documents)

newDocuments =
  2x1 tokenizedDocument:

    6 tokens: an example of a short sentence
    4 tokens: a second short sentence
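
In practice, empty documents often appear only after an earlier preprocessing step has stripped out every token. The following is a minimal sketch of that situation, not part of the original example. It assumes the Text Analytics Toolbox function removeWords is available, and the word list passed to it is purely illustrative.

```matlab
% Illustrative input: the second document consists only of words
% that are removed in the next step.
documents = tokenizedDocument([
    "an example of a short sentence"
    "of a"
    "a second short sentence"]);

% Remove a hand-picked (hypothetical) list of words.
documents = removeWords(documents,["an" "of" "a"]);

% The second document now has no tokens, so it is dropped.
% idx holds the position of the dropped document in the input array.
[newDocuments,idx] = removeEmptyDocuments(documents);
```

Calling removeEmptyDocuments last in a preprocessing pipeline keeps downstream functions from receiving documents with zero tokens.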

Remove documents containing no words from a bag-of-words model.

Create a bag-of-words model from an array of tokenized documents.

documents = tokenizedDocument([
    "An example of a short sentence."
    ""
    "A second short sentence."
    ""]);
bag = bagOfWords(documents)

bag =
  bagOfWords with properties:

          Counts: [4x9 double]
      Vocabulary: ["An" "example" "of" "a" "short" ... ]
        NumWords: 9
    NumDocuments: 4

Remove the empty documents from the bag-of-words model.

newBag = removeEmptyDocuments(bag)

newBag =
  bagOfWords with properties:

          Counts: [2x9 double]
      Vocabulary: ["An" "example" "of" "a" "short" ... ]
        NumWords: 9
    NumDocuments: 2
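
The same call pattern should apply to a bag-of-n-grams model. The sketch below is an illustration rather than an example from the original page. It assumes the bagOfNgrams function from Text Analytics Toolbox, which counts bigrams by default, and it also requests the indices of the removed documents.

```matlab
% Same four documents as above; two of them are empty.
documents = tokenizedDocument([
    "An example of a short sentence."
    ""
    "A second short sentence."
    ""]);

% Build a bag-of-n-grams model (bigrams by default).
bag = bagOfNgrams(documents);

% Remove documents that contribute no n-grams and record which ones
% were removed.
[newBag,idx] = removeEmptyDocuments(bag);

% Number of documents remaining in the model.
numRemaining = newBag.NumDocuments;
```

The returned model has the same type as the input, so newBag here is a bagOfNgrams object.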

Remove documents containing no words from an array of tokenized documents, and use the indices of the removed documents to remove the corresponding labels as well.

Create an array of tokenized documents which includes empty documents.

documents = tokenizedDocument([
    "an example of a short sentence"
    ""
    "a second short sentence"
    ""])

documents =
  4x1 tokenizedDocument:

    6 tokens: an example of a short sentence
    0 tokens:
    4 tokens: a second short sentence
    0 tokens:

Create a vector of labels.

labels = ["T";"F";"F";"T"]

labels = 4x1 string
    "T"
    "F"
    "F"
    "T"

Remove the empty documents and get the indices of the removed documents.

[newDocuments,idx] = removeEmptyDocuments(documents)

newDocuments =
  2x1 tokenizedDocument:

    6 tokens: an example of a short sentence
    4 tokens: a second short sentence

idx = 2×1

     2
     4

Remove the corresponding labels from labels.

labels(idx) = []

labels = 2x1 string
    "T"
    "F"
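
To see what the returned indices correspond to, the result can be cross-checked against the token count of each document. The sketch below is an alternative formulation under stated assumptions, not the method used by removeEmptyDocuments itself. It assumes the Text Analytics Toolbox function doclength, and the variable names isEmpty and newLabels are illustrative.

```matlab
documents = tokenizedDocument([
    "an example of a short sentence"
    ""
    "a second short sentence"
    ""]);
labels = ["T";"F";"F";"T"];

% Logical mask marking documents with zero tokens.
isEmpty = doclength(documents) == 0;

% Keep only the non-empty documents and their labels.
newDocuments = documents(~isEmpty);
newLabels = labels(~isEmpty);

% Positions of the empty documents, comparable to the idx output
% of removeEmptyDocuments.
emptyPositions = find(isEmpty(:));
```

A logical mask leaves the original labels array untouched, which can be convenient when the same filter has to be applied to several parallel arrays.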

Input Arguments

Input documents, specified as a tokenizedDocument array.

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.

Output Arguments

Output documents, returned as a tokenizedDocument array.

Output model, returned as a bagOfWords object or a bagOfNgrams object. The type of newBag is the same as the type of bag.

Indices of removed documents, returned as a vector of positive integers.

Version History

Introduced in R2017b