encode

Encode documents as matrix of word or n-gram counts

Syntax

counts = encode(bag,documents)

counts = encode(bag,words)

counts = encode(___,Name,Value)

Description

Use encode to encode an array of tokenized documents as a matrix of word or n-gram counts according to a bag-of-words or bag-of-n-grams model. To encode documents as vectors or word indices, use a wordEncoding object.

counts = encode(bag,documents) returns a matrix of frequency counts for documents based on the bag-of-words or bag-of-n-grams model bag.

counts = encode(bag,words) returns a matrix of frequency counts for a list of words.

counts = encode(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

Encode Documents as Word Count Matrix

Encode an array of documents as a matrix of word counts.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents)

bag =
  bagOfWords with properties:

          Counts: [2x7 double]
      Vocabulary: ["an" "example" "of" "a" "short" ... ]
        NumWords: 7
    NumDocuments: 2

documents = tokenizedDocument([
    "a new sentence"
    "a second new sentence"])

documents =
  2x1 tokenizedDocument:

    3 tokens: a new sentence
    4 tokens: a second new sentence

View the documents encoded as a matrix of word counts. The word "new" does not appear in bag, so it is not counted.

counts = encode(bag,documents);
full(counts)

ans = 2×7

     0     0     0     1     0     1     0
     0     0     0     1     0     1     1

The columns correspond to the vocabulary of the bag-of-words model.

bag.Vocabulary

ans = 1x7 string
    "an"    "example"    "of"    "a"    "short"    "sentence"    "second"
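Because the columns follow bag.Vocabulary, one way to read off the counts of a single word is to look up its column index first. The following is a minimal sketch that rebuilds the example data so it can run on its own; newDocuments, found, idx, and secondCounts are illustrative names, not part of the function's interface.

```matlab
% Rebuild the model and the new documents from the example.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);

newDocuments = tokenizedDocument([
    "a new sentence"
    "a second new sentence"]);
counts = encode(bag,newDocuments);

% Find the column that corresponds to the word "second".
[found,idx] = ismember("second",bag.Vocabulary);

% Extract that column: one count per new document.
secondCounts = full(counts(:,idx))
```

The word "second" occurs only in the second new document, so the extracted column holds a zero followed by a one.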

Encode Words as Word Count Vector

Encode an array of words as a vector of word counts.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents)

bag =
  bagOfWords with properties:

          Counts: [2x7 double]
      Vocabulary: ["an" "example" "of" "a" "short" ... ]
        NumWords: 7
    NumDocuments: 2

words = ["another" "example" "of" "a" "short" "example" "sentence"];
counts = encode(bag,words)

counts =
   (1,2)        2
   (1,3)        1
   (1,4)        1
   (1,5)        1
   (1,6)        1
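The word "another" is not in the vocabulary of the model, so it contributes nothing to the count vector. As a rough sketch of how to check which input words were skipped, the snippet below compares the word list against bag.Vocabulary; inVocabulary, skippedWords, and numCounted are illustrative names.

```matlab
% Rebuild the model and the word list from the example.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);
words = ["another" "example" "of" "a" "short" "example" "sentence"];
counts = encode(bag,words);

% Flag the words that appear in the model vocabulary.
inVocabulary = ismember(words,bag.Vocabulary);

% Words that encode ignores because the model has never seen them.
skippedWords = words(~inVocabulary)

% Total number of word occurrences that were counted.
numCounted = full(sum(counts,2))
```

Seven words go in and six are counted, which matches the single out-of-vocabulary word.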

Output Document Word Counts in Columns

Encode an array of documents as a matrix of word counts with documents in columns.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents)

bag =
  bagOfWords with properties:

          Counts: [2x7 double]
      Vocabulary: ["an" "example" "of" "a" "short" ... ]
        NumWords: 7
    NumDocuments: 2

documents = tokenizedDocument([
    "a new sentence"
    "a second new sentence"])

documents =
  2x1 tokenizedDocument:

    3 tokens: a new sentence
    4 tokens: a second new sentence

View the documents encoded as a matrix of word counts with documents in columns. The word "new" does not appear in bag, so it is not counted.

counts = encode(bag,documents,'DocumentsIn','columns');
full(counts)

ans = 7×2

     0     0
     0     0
     0     0
     1     1
     0     0
     1     1
     0     1

Input Arguments

`bag` - Input bag-of-words or bag-of-n-grams model
`bagOfWords` object | `bagOfNgrams` object

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.
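The examples on this page use a bagOfWords model. As a hedged sketch of the other case, the snippet below passes a bagOfNgrams model instead, assuming the default n-gram length of 2 (bigrams) and that the model exposes its n-gram count through the NumNgrams property; newDocument is an illustrative name.

```matlab
% Build a bag-of-n-grams model (bigrams by default) from two short documents.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfNgrams(documents);

% Encode a new document against the bigram model.
newDocument = tokenizedDocument("a short sentence");
counts = encode(bag,newDocument);

% One column per n-gram in the model.
size(counts)

% Number of distinct model bigrams found in the new document.
nnz(counts)
```

Under these assumptions, the new document contains the bigrams "a short" and "short sentence", both of which occur in the model.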

`documents` - Input documents
`tokenizedDocument` array | string array of words | cell array of character vectors

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is a string array or a cell array of character vectors, then it must be a row vector representing a single document, where each element is a word.

Tip

To ensure that the documents are encoded correctly, you must preprocess the input documents using the same steps as the documents used to create the input model. For an example showing how to create a function to preprocess text data, see Prepare Text Data for Analysis.
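To illustrate why matching preprocessing matters, here is a minimal sketch in which the model is built from lowercased text. It assumes that vocabulary matching is case sensitive, so a document that skips the lowercasing step matches nothing; trainingText, newDocument, rawCounts, and matchedCounts are illustrative names.

```matlab
% Build the model from lowercased training documents.
trainingText = [
    "An example of a short sentence"
    "A second short sentence"];
bag = bagOfWords(lower(tokenizedDocument(trainingText)));

% A new document that still contains uppercase letters.
newDocument = tokenizedDocument("A Short Sentence");

% Encoding without the lowercasing step: tokens do not match the vocabulary.
rawCounts = encode(bag,newDocument);

% Encoding after applying the same step used to build the model.
matchedCounts = encode(bag,lower(newDocument));

% Compare how many vocabulary words were matched in each case.
[nnz(rawCounts) nnz(matchedCounts)]
```

In practice, wrapping the preprocessing steps in one function and calling it for both the training and the new documents avoids this kind of mismatch.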

`words` - Input words
string vector | character vector | cell array of character vectors

Input words, specified as a string vector, character vector, or cell array of character vectors. If you specify words as a character vector, then the function treats the argument as a single word.

Data Types: string | char | cell

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'DocumentsIn','rows' specifies the orientation of the output documents as rows.

`DocumentsIn` - Orientation of output documents
`'rows'` (default) | `'columns'`

Orientation of output documents in the frequency count matrix, specified as the comma-separated pair consisting of 'DocumentsIn' and one of the following:

'rows' – Return a matrix of frequency counts with rows corresponding to documents.
'columns' – Return a transposed matrix of frequency counts with columns corresponding to documents.

Data Types: char
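As a quick sanity check of the two orientations, the sketch below encodes the same documents both ways and compares the results; newDocuments, rowCounts, columnCounts, and isTransposed are illustrative names.

```matlab
% Build a small model and a set of new documents.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);
newDocuments = tokenizedDocument([
    "a new sentence"
    "a second new sentence"]);

% Default orientation: one row per document.
rowCounts = encode(bag,newDocuments);

% Alternative orientation: one column per document.
columnCounts = encode(bag,newDocuments,'DocumentsIn','columns');

% The column-oriented result should be the transpose of the row-oriented one.
isTransposed = isequal(columnCounts,rowCounts')
```

The column orientation can be convenient when a downstream function expects observations in columns.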

`ForceCellOutput` - Indicator for forcing output to be returned as cell array
`false` (default) | `true`

Indicator for forcing output to be returned as a cell array, specified as the comma-separated pair consisting of 'ForceCellOutput' and true or false.

Data Types: logical
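The following sketch shows one way this option might be used with a scalar model, assuming the cell array then holds a single sparse matrix for that model; newDocuments and countsCell are illustrative names.

```matlab
% Build a small model and a set of new documents.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);
newDocuments = tokenizedDocument([
    "a new sentence"
    "a second new sentence"]);

% Request a cell array even though bag is a single model.
countsCell = encode(bag,newDocuments,'ForceCellOutput',true);

% The counts for the model are stored inside the cell array.
iscell(countsCell)
full(countsCell{1})
```

Forcing cell output can keep downstream code uniform when the same code path handles both a single model and an array of models.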

Output Arguments

`counts` - Word or n-gram counts
sparse matrix | cell array of sparse matrices

Word or n-gram counts, returned as a sparse matrix of nonnegative integers or a cell array of sparse matrices.

If bag is a nonscalar array or 'ForceCellOutput' is true, then the function returns the outputs as a cell array of sparse matrices. Each element in the cell array is a matrix of word or n-gram counts for the corresponding element of bag.

Version History

Introduced in R2017b
