
doc2sequence

Convert documents to sequences for deep learning

Description


sequences = doc2sequence(enc,documents) returns a cell array of the numeric indices of the words in documents given by the word encoding enc. Each element of sequences is a vector of the indices of the words in the corresponding document.


sequences = doc2sequence(emb,documents) returns a cell array of the embedding vectors of the words in documents given by the word embedding emb. Each element of sequences is a matrix of the embedding vectors of the words in the corresponding document.


sequences = doc2sequence(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples


Load the factory reports data and create a tokenizedDocument array.

filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
textData = data.Description;
documents = tokenizedDocument(textData);

Create a word encoding.

enc = wordEncoding(documents);

Convert the documents to sequences of word indices.

sequences = doc2sequence(enc,documents);

View the sizes of the first 10 sequences. Each sequence is a 1-by-S vector, where S is the number of word indices in the sequence. Because the sequences are padded, S is constant.

sequences(1:10)
ans=10×1 cell array
    {1×17 double}
    {1×17 double}
    {1×17 double}
    {1×17 double}
    {1×17 double}
    {1×17 double}
    {1×17 double}
    {1×17 double}
    {1×17 double}
    {1×17 double}

Convert an array of tokenized documents to sequences of word vectors using a pretrained word embedding.

Load a pretrained word embedding using the fastTextWordEmbedding function. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.

emb = fastTextWordEmbedding;

Load the factory reports data and create a tokenizedDocument array.

filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
textData = data.Description;
documents = tokenizedDocument(textData);

Convert the documents to sequences of word vectors using doc2sequence. The doc2sequence function, by default, left-pads the sequences to have the same length. When converting large collections of documents using a high-dimensional word embedding, padding can require large amounts of memory. To prevent the function from padding the data, set the 'PaddingDirection' option to 'none'. Alternatively, you can control the amount of padding using the 'Length' option.

sequences = doc2sequence(emb,documents,'PaddingDirection','none');

View the sizes of the first 10 sequences. Each sequence is a D-by-S matrix, where D is the embedding dimension and S is the number of word vectors in the sequence.

sequences(1:10)
ans=10×1 cell array
    {300×10 single}
    {300×11 single}
    {300×11 single}
    {300×6  single}
    {300×5  single}
    {300×10 single}
    {300×8  single}
    {300×9  single}
    {300×7  single}
    {300×13 single}

Convert a collection of documents to sequences of word vectors using a pretrained word embedding, and pad or truncate the sequences to a specified length.

Load a pretrained word embedding using fastTextWordEmbedding. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.

emb = fastTextWordEmbedding;

Load the factory reports data and create a tokenizedDocument array.

filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
textData = data.Description;
documents = tokenizedDocument(textData);

Convert the documents to sequences of word vectors. Specify to left-pad or truncate the sequences to have length 100.

sequences = doc2sequence(emb,documents,'Length',100);

View the sizes of the first 10 sequences. Each sequence is a D-by-S matrix, where D is the embedding dimension and S is the number of word vectors in the sequence (the sequence length). Because the sequence length is specified, S is constant.

sequences(1:10)
ans=10×1 cell array
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}

Input Arguments


Input word embedding, specified as a wordEmbedding object.

Input word encoding, specified as a wordEncoding object.

Input documents, specified as a tokenizedDocument array.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Length','shortest' truncates the sequences to have the same length as the shortest sequence.

Unknown word behavior, specified as the comma-separated pair consisting of 'UnknownWord' and one of the following:

  • 'discard' – If a word is not in the input map, then discard it.

  • 'nan' – If a word is not in the input map, then return a NaN value.

Tip

If you are creating sequences for training a deep learning network with a word embedding, use 'discard'. Do not use sequences with NaN values, because doing so can propagate errors through the network.
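The difference between the two settings can be illustrated with a minimal sketch. The short sample sentences below are hypothetical and chosen only so that one word falls outside the encoding's vocabulary:

```matlab
% Build a small word encoding from one document, then convert a document
% containing a word ("stopped") that is not in the encoding's vocabulary.
enc = wordEncoding(tokenizedDocument("the machine is loud"));
documents = tokenizedDocument("the machine stopped");

% 'discard' drops the unknown word from the sequence.
seqDiscard = doc2sequence(enc,documents,'UnknownWord','discard');

% 'nan' keeps a NaN placeholder where the unknown word appeared.
seqNaN = doc2sequence(enc,documents,'UnknownWord','nan');
```

With 'discard', the resulting sequence contains only the indices of the known words; with 'nan', it has the same length as the document, with NaN in the unknown word's position.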

Padding direction, specified as the comma-separated pair consisting of 'PaddingDirection' and one of the following:

  • 'left' – Pad sequences on the left.

  • 'right' – Pad sequences on the right.

  • 'none' – Do not pad sequences.

Tip

When converting large collections of data using a high-dimensional word embedding, padding can require large amounts of memory. To prevent the function from adding too much padding, set the 'PaddingDirection' option to 'none' or set 'Length' to a smaller value.
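As a sketch, assuming a word encoding enc and a tokenizedDocument array documents as in the earlier examples, right-padding instead of the default left-padding is a one-option change:

```matlab
% Pad sequences on the right instead of the default left.
% Assumes enc and documents exist as in the earlier examples.
sequences = doc2sequence(enc,documents,'PaddingDirection','right');
```

Left-padding (the default) is often preferred for recurrent networks, because the meaningful tokens then sit immediately before the final time step.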

Padding value, specified as the comma-separated pair consisting of 'PaddingValue' and a numeric scalar. Do not pad sequences with NaN, because doing so can propagate errors through the network.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
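For example, padding with a sentinel value such as -1 (an illustrative choice, not a requirement) makes padded entries easy to distinguish from real word indices. Assumes enc and documents as in the earlier examples:

```matlab
% Pad with -1 so padding entries cannot be confused with word indices,
% which are always positive integers.
sequences = doc2sequence(enc,documents,'PaddingValue',-1);
```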

Sequence length, specified as the comma-separated pair consisting of 'Length' and one of the following:

  • 'longest' – Pad sequences to have the same length as the longest sequence.

  • 'shortest' – Truncate sequences to have the same length as the shortest sequence.

  • Positive integer – Pad or truncate sequences to have the specified length. The function truncates the sequences on the right.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | char | string
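The three forms of 'Length' can be sketched side by side, again assuming enc and documents as in the earlier examples:

```matlab
% 'longest' (the default): pad every sequence to the longest length.
seqLongest = doc2sequence(enc,documents,'Length','longest');

% 'shortest': truncate every sequence to the shortest length,
% discarding data from longer documents.
seqShortest = doc2sequence(enc,documents,'Length','shortest');

% Positive integer: pad or truncate every sequence to exactly 50 elements
% (50 is an arbitrary illustrative value).
seqFixed = doc2sequence(enc,documents,'Length',50);
```

A fixed integer length is a common compromise: it bounds memory use like 'shortest' while losing less data, provided the chosen length covers most documents.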

Output Arguments


Output sequences, returned as a cell array.

For word embedding input, the ith element of sequences is a matrix of the word vectors corresponding to the ith input document.

For word encoding input, the ith element of sequences is a vector of the word encoding indices corresponding to the ith input document.

Tips

  • When converting large collections of data using a high-dimensional word embedding, padding can require large amounts of memory. To prevent the function from adding too much padding, set the 'PaddingDirection' option to 'none' or set 'Length' to a smaller value.

Introduced in R2018b