
Classify Text Data Using Deep Learning

This example shows how to classify text data using a deep learning long short-term memory (LSTM) network.

Text data is naturally sequential. A piece of text is a sequence of words, which might have dependencies between them. To learn and use long-term dependencies to classify sequence data, use an LSTM neural network. An LSTM network is a type of recurrent neural network (RNN) that can learn long-term dependencies between time steps of sequence data.

To input text to an LSTM network, first convert the text data into numeric sequences. You can achieve this using a word encoding that maps documents to sequences of numeric indices. For better results, also include a word embedding layer in the network. Word embeddings map words in a vocabulary to numeric vectors rather than scalar indices. These embeddings capture semantic details of the words, so that words with similar meanings have similar vectors. They also model relationships between words through vector arithmetic. For example, the relationship "Rome is to Italy as Paris is to France" is described by the equation Italy – Rome + Paris = France.
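You can see this vector arithmetic directly with a pretrained embedding. The following is a minimal sketch, assuming the Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding support package is installed; it is separate from the embedding layer trained later in this example.

emb = fastTextWordEmbedding;
italy = word2vec(emb,"Italy");
rome = word2vec(emb,"Rome");
paris = word2vec(emb,"Paris");
% The nearest word to the resulting vector should be "France".
word = vec2word(emb,italy - rome + paris)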

There are four steps in training and using the LSTM network in this example:

  • Import and preprocess the data.

  • Convert the words to numeric sequences using a word encoding.

  • Create and train an LSTM network with a word embedding layer.

  • Classify new text data using the trained LSTM network.

Import Data

Import the factory reports data. This data contains labeled textual descriptions of factory events. To import the text data as strings, specify the text type to be 'string'.

filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
head(data)
ans=8×5 table
                                  Description                                         Category          Urgency          Resolution         Cost 
    _____________________________________________________________________    ____________________    ________    ____________________    _____

    "Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"     "Medium"    "Readjust Machine"         45
    "Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"     "Medium"    "Readjust Machine"         35
    "There are cuts to the power when starting the plant."                   "Electronic Failure"     "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                     "Electronic Failure"     "High"      "Replace Components"      352
    "Mixer tripped the fuses."                                               "Electronic Failure"     "Low"       "Add to Watch List"        55
    "Burst pipe in the constructing agent is spraying coolant."              "Leak"                   "High"      "Replace Components"      371
    "A fuse is blown in the mixer."                                          "Electronic Failure"     "Low"       "Replace Components"      441
    "Things continue to tumble off of the belt."                             "Mechanical Failure"     "Low"       "Readjust Machine"         38

The goal of this example is to classify events by the label in the Category column. To divide the data into classes, convert these labels to categorical.

data.Category = categorical(data.Category);

View the distribution of the classes in the data using a histogram.

figure
histogram(data.Category);
xlabel("Class")
ylabel("Frequency")
title("Class Distribution")

The next step is to partition the data into sets for training and validation. Partition the data into a training partition and a held-out partition for validation and testing. Specify the holdout percentage to be 20%.

cvp = cvpartition(data.Category,'Holdout',0.2);
dataTrain = data(training(cvp),:);
dataValidation = data(test(cvp),:);

Extract the text data and labels from the partitioned tables.

textDataTrain = dataTrain.Description;
textDataValidation = dataValidation.Description;
YTrain = dataTrain.Category;
YValidation = dataValidation.Category;

To check that you have imported the data correctly, visualize the training text data using a word cloud.

figure
wordcloud(textDataTrain);
title("Training Data")

Preprocess Text Data

Create a function that tokenizes and preprocesses the text data. The function preprocessText, listed at the end of the example, performs these steps:

  1. Tokenize the text using tokenizedDocument.

  2. Convert the text to lowercase using lower.

  3. Erase the punctuation using erasePunctuation.

Preprocess the training data and the validation data using the preprocessText function.

documentsTrain = preprocessText(textDataTrain);
documentsValidation = preprocessText(textDataValidation);

View the first few preprocessed training documents.

documentsTrain(1:5)
ans = 
  5×1 tokenizedDocument:

     9 tokens: items are occasionally getting stuck in the scanner spools
    10 tokens: loud rattling and banging sounds are coming from assembler pistons
    10 tokens: there are cuts to the power when starting the plant
     5 tokens: fried capacitors in the assembler
     4 tokens: mixer tripped the fuses

Convert Documents to Sequences

To input the documents into an LSTM network, use a word encoding to convert the documents into sequences of numeric indices.

To create a word encoding, use the wordEncoding function.

enc = wordEncoding(documentsTrain);
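Before converting the documents, you can optionally inspect the encoding. In this sketch, the words "coolant" and "fuse" are illustrative; the lookup assumes they occur in the training vocabulary.

% Number of words in the learned vocabulary.
enc.NumWords

% Look up the numeric indices that doc2sequence assigns to two words.
idx = word2ind(enc,["coolant" "fuse"])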

The next conversion step is to pad and truncate documents so they are all the same length. The trainingOptions function provides options to pad and truncate input sequences automatically. However, these options are not well suited for sequences of word vectors. Instead, pad and truncate the sequences manually. If you left-pad and truncate the sequences of word vectors, then the training might improve.

To pad and truncate the documents, first choose a target length, and then truncate documents that are longer than it and left-pad documents that are shorter than it. For best results, the target length should be short without discarding large amounts of data. To find a suitable target length, view a histogram of the training document lengths.

documentLengths = doclength(documentsTrain);
figure
histogram(documentLengths)
title("Document Lengths")
xlabel("Length")
ylabel("Number of Documents")

Most of the training documents have fewer than 10 tokens. Use this as your target length for truncation and padding.
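To illustrate what left-padding produces, the following sketch pads a toy sequence of word indices to the target length with zeros, the padding value that doc2sequence uses by default; the index values here are made up.

% Toy sequence of word indices (hypothetical values).
seq = [3 7 2];
targetLength = 10;
% Left-pad with zeros so the sequence has the target length.
padded = [zeros(1,targetLength-numel(seq)) seq]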

Convert the documents to sequences of numeric indices using doc2sequence. To truncate or left-pad the sequences to have length 10, set the 'Length' option to 10.

sequenceLength = 10;
XTrain = doc2sequence(enc,documentsTrain,'Length',sequenceLength);
XTrain(1:5)
ans=5×1 cell array
    {1×10 double}
    {1×10 double}
    {1×10 double}
    {1×10 double}
    {1×10 double}

Convert the validation documents to sequences using the same options.

XValidation = doc2sequence(enc,documentsValidation,'Length',sequenceLength);

Create and Train LSTM Network

Define the LSTM network architecture. To input sequence data into the network, include a sequence input layer and set the input size to 1. Next, include a word embedding layer of dimension 50 and the same number of words as the word encoding. Next, include an LSTM layer and set the number of hidden units to 80. To use the LSTM layer for a sequence-to-label classification problem, set the output mode to 'last'. Finally, add a fully connected layer with the same size as the number of classes, a softmax layer, and a classification layer.

inputSize = 1;
embeddingDimension = 50;
numHiddenUnits = 80;

numWords = enc.NumWords;
numClasses = numel(categories(YTrain));

layers = [ ...
    sequenceInputLayer(inputSize)
    wordEmbeddingLayer(embeddingDimension,numWords)
    lstmLayer(numHiddenUnits,'OutputMode','last')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer]
layers = 
  6x1 Layer array with layers:

     1   ''   Sequence Input          Sequence input with 1 dimensions
     2   ''   Word Embedding Layer    Word embedding layer with 50 dimensions and 423 unique words
     3   ''   LSTM                    LSTM with 80 hidden units
     4   ''   Fully Connected         4 fully connected layer
     5   ''   Softmax                 softmax
     6   ''   Classification Output   crossentropyex

Specify Training Options

Specify the training options:

  • Train using the Adam solver.

  • Specify a mini-batch size of 16.

  • Shuffle the data every epoch.

  • Monitor the training progress by setting the 'Plots' option to 'training-progress'.

  • Specify the validation data using the 'ValidationData' option.

  • Suppress verbose output by setting the 'Verbose' option to false.

By default, trainNetwork uses a GPU if one is available. Otherwise, it uses the CPU. To specify the execution environment manually, use the 'ExecutionEnvironment' name-value pair argument of trainingOptions. Training on a CPU can take significantly longer than training on a GPU. Training with a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For information on supported devices, see GPU Support by Release (Parallel Computing Toolbox).
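For example, the following sketch forces training onto the CPU. It shows only the relevant name-value pair and is not used in this example.

% Hypothetical variant: train on the CPU even if a GPU is available.
optionsCPU = trainingOptions('adam', ...
    'ExecutionEnvironment','cpu');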

options = trainingOptions('adam', ...
    'MiniBatchSize',16, ...
    'GradientThreshold',2, ...
    'Shuffle','every-epoch', ...
    'ValidationData',{XValidation,YValidation}, ...
    'Plots','training-progress', ...
    'Verbose',false);

Train the LSTM network using the trainNetwork function.

net = trainNetwork(XTrain,YTrain,layers,options);
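After training, you can get a quick sense of how well the network generalizes by classifying the held-out validation sequences. This check is an addition to the example's original steps.

% Classify the validation data and compute the classification accuracy.
YPred = classify(net,XValidation);
accuracy = mean(YPred == YValidation)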

Predict Using New Data

Classify the event type of three new reports. Create a string array containing the new reports.

reportsNew = [ ...
    "Coolant is pooling underneath sorter."
    "Sorter blows fuses at start up."
    "There are some very loud rattling sounds coming from the assembler."];

Preprocess the text data using the same preprocessing steps as the training documents.

documentsNew = preprocessText(reportsNew);

Convert the text data to sequences using doc2sequence with the same options as when creating the training sequences.

XNew = doc2sequence(enc,documentsNew,'Length',sequenceLength);

Classify the new sequences using the trained LSTM network.

labelsNew = classify(net,XNew)
labelsNew = 3×1 categorical
     Leak 
     Electronic Failure 
     Mechanical Failure 

Preprocessing Function

The function preprocessText performs these steps:

  1. Tokenize the text using tokenizedDocument.

  2. Convert the text to lowercase using lower.

  3. Erase the punctuation using erasePunctuation.

function documents = preprocessText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Convert to lowercase.
documents = lower(documents);

% Erase punctuation.
documents = erasePunctuation(documents);

end
