
Create Simple Text Model for Classification

This example shows how to train a simple text classifier on word frequency counts using a bag-of-words model.

You can create a simple classification model which uses word frequency counts as predictors. This example trains a simple classification model to predict the category of factory reports using text descriptions.

Load and Extract Text Data

Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each report.

filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
head(data)
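ans=8×5 table
                                 Description                                       Category          Urgency          Resolution         Cost 
    _____________________________________________________________________    ____________________    ________    ____________________    _____

    "Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"    "Medium"    "Readjust Machine"         45
    "Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"    "Medium"    "Readjust Machine"         35
    "There are cuts to the power when starting the plant."                   "Electronic Failure"    "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                     "Electronic Failure"    "High"      "Replace Components"      352
    "Mixer tripped the fuses."                                               "Electronic Failure"    "Low"       "Add to Watch List"        55
    "Burst pipe in the constructing agent is spraying coolant."              "Leak"                  "High"      "Replace Components"      371
    "A fuse is blown in the mixer."                                          "Electronic Failure"    "Low"       "Replace Components"      441
    "Things continue to tumble off of the belt."                             "Mechanical Failure"    "Low"       "Readjust Machine"         38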

Convert the labels in the Category column of the table to categorical and view the distribution of the classes in the data using a histogram.

data.Category = categorical(data.Category);
figure
histogram(data.Category)
xlabel("Class")
ylabel("Frequency")
title("Class Distribution")

Partition the data into a training partition and a held-out test set. Specify the holdout percentage to be 10%.

cvp = cvpartition(data.Category,'Holdout',0.1);
dataTrain = data(cvp.training,:);
dataTest = data(cvp.test,:);
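The holdout split can be tried on toy labels first. The following is a minimal sketch, assuming Statistics and Machine Learning Toolbox is installed; the variable names ending in Toy and the call to rng are illustrative additions, not part of the example. Because cvpartition draws the split at random, fixing the random seed is one way to make the partition repeatable between runs.

```matlab
% Fix the random seed so the random partition is repeatable (an assumption;
% the example itself does not set a seed).
rng(0);

% Hypothetical labels: 100 observations in two classes.
labelsToy = categorical(repmat(["A";"B"],50,1));

% Hold out 10% of the observations for testing.
cvpToy = cvpartition(labelsToy,'Holdout',0.1);

% Logical index vectors; every observation is in exactly one of the two sets.
idxTrainToy = training(cvpToy);
idxTestToy = test(cvpToy);
```

The function forms training(cvpToy) and test(cvpToy) are equivalent to the dot forms cvp.training and cvp.test used in the example.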

Extract the text data and labels from the tables.

textDataTrain = dataTrain.Description;
textDataTest = dataTest.Description;
YTrain = dataTrain.Category;
YTest = dataTest.Category;

Prepare Text Data for Analysis

Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function preprocessText, listed at the end of the example, performs the following steps in order:

  1. Tokenize the text using tokenizedDocument.

  2. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  3. Lemmatize the words using normalizeWords.

  4. Erase punctuation using erasePunctuation.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

Use the example preprocessing function preprocessText to prepare the text data.

documents = preprocessText(textDataTrain); documents(1:5)
ans = 
  5×1 tokenizedDocument:

    6 tokens: items occasionally get stuck scanner spool
    7 tokens: loud rattle bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse

Create a bag-of-words model from the tokenized documents.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [432×336 double]
      Vocabulary: [1×336 string]
        NumWords: 336
    NumDocuments: 432
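To make the contents of the Counts matrix concrete, here is a minimal sketch in base MATLAB that builds a document-by-word count matrix by hand. It only illustrates the idea and is not how bagOfWords is implemented; the toy documents and all variable names are hypothetical.

```matlab
% Toy tokenized documents (hypothetical, not taken from factoryReports.csv).
docsToy = {["mixer" "trip" "fuse"], ["fuse" "blow" "mixer" "fuse"]};

% Vocabulary: the unique words across all documents, in sorted order.
allWordsToy = [docsToy{:}];
[vocabToy,~,wordIdxToy] = unique(allWordsToy);

% Document index of every token, in the same order as allWordsToy.
docIdxToy = repelem(1:numel(docsToy), cellfun(@numel,docsToy));

% countsToy(i,j) is the number of times vocabToy(j) appears in document i.
countsToy = accumarray([docIdxToy(:) wordIdxToy(:)], 1, ...
    [numel(docsToy) numel(vocabToy)]);
```

Each row corresponds to one document and each column to one vocabulary word, which matches the 432×336 shape reported for bag.Counts (432 documents, 336 words).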

Remove words from the bag-of-words model that do not appear more than two times in total. Remove any documents containing no words from the bag-of-words model, and remove the corresponding entries in labels.

bag = removeInfrequentWords(bag,2);
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
bag
bag = 
  bagOfWords with properties:

          Counts: [432×155 double]
      Vocabulary: [1×155 string]
        NumWords: 155
    NumDocuments: 432
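The two pruning steps can be pictured as column and row filtering on a count matrix. The sketch below uses base MATLAB on a hypothetical 3×3 matrix; it mirrors the effect of removeInfrequentWords(bag,2) followed by removeEmptyDocuments, under the assumption that "infrequent" means a total count of 2 or fewer.

```matlab
% Hypothetical counts: 3 documents (rows) by 3 words (columns).
countsToy = [3 0 1; 0 0 1; 2 1 0];
labelsToy = ["a";"b";"c"];

% Keep only words whose total count across all documents exceeds 2.
keepWordsToy = sum(countsToy,1) > 2;
countsToy = countsToy(:,keepWordsToy);

% Find documents left with no words, then drop them and their labels together.
emptyDocsToy = sum(countsToy,2) == 0;
countsToy(emptyDocsToy,:) = [];
labelsToy(emptyDocsToy) = [];
```

Removing the labels with the same index keeps predictors and responses aligned. In the example output the document count stays at 432, which indicates that no training document became empty.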

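Train Supervised Classifier

Train a supervised classification model using the word frequency counts from the bag-of-words model and the labels.

Train a multiclass linear classification model using fitcecoc. Specify the Counts property of the bag-of-words model to be the predictors, and the event type labels to be the response. Specify the learners to be linear. These learners support sparse data input.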

XTrain = bag.Counts;
mdl = fitcecoc(XTrain,YTrain,'Learners','linear')
mdl = 
  CompactClassificationECOC
      ResponseName: 'Y'
        ClassNames: [Electronic Failure    Leak    Mechanical Failure    Software Failure]
    ScoreTransform: 'none'
    BinaryLearners: {6×1 cell}
      CodingMatrix: [4×6 double]

  Properties, Methods

For a better fit, you can try specifying different parameters of the linear learners. For more information on linear classification learner templates, see templateLinear.
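As one way to vary the linear learners, the sketch below passes a templateLinear object to fitcecoc. It assumes Statistics and Machine Learning Toolbox; the toy data, the variable names, and the particular settings (logistic learner, ridge penalty, Lambda of 1e-3) are illustrative choices, not recommendations from the example.

```matlab
% Hypothetical sparse predictors and three-class labels.
rng(0);
XToy = sparse(double(rand(60,5) > 0.5));
YToy = categorical(repmat(["A";"B";"C"],20,1));

% Linear learner template with non-default settings (illustrative values).
tToy = templateLinear('Learner','logistic', ...
    'Regularization','ridge','Lambda',1e-3);

% Train the multiclass ECOC model using the customized linear learners.
mdlToy = fitcecoc(XToy,YToy,'Learners',tToy);
```

With the real data, the same template could be used in place of 'linear' when calling fitcecoc on XTrain and YTrain.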

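Test Classifier

Predict the labels of the test data using the trained model and calculate the classification accuracy. The classification accuracy is the proportion of the labels that the model predicts correctly.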

Preprocess the test data using the same preprocessing steps as the training data. Encode the resulting test documents as a matrix of word frequency counts according to the bag-of-words model.

documentsTest = preprocessText(textDataTest);
XTest = encode(bag,documentsTest);
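Encoding against an existing bag-of-words model means counting only the words that are already in its vocabulary. Here is a minimal base MATLAB sketch of that idea for a single document; it illustrates the concept rather than the encode implementation, and the vocabulary and document are hypothetical.

```matlab
% Hypothetical fixed vocabulary learned from training data.
vocabToy = ["blow" "fuse" "mixer" "trip"];

% New tokenized document; "sorter" is not in the vocabulary.
newDocToy = ["sorter" "blow" "fuse" "fuse"];

% Locate each token in the vocabulary; out-of-vocabulary tokens are skipped.
[inVocabToy,idxToy] = ismember(newDocToy,vocabToy);

% Count occurrences per vocabulary word, giving one row of predictors.
rowToy = accumarray(idxToy(inVocabToy).', 1, [numel(vocabToy) 1]).';
```

Because the columns follow the training vocabulary, the encoded test matrix has the same number of columns as the training predictors, which the trained model requires.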

Predict the labels of the test data and compute the fraction of labels predicted correctly.

YPred = predict(mdl,XTest);
acc = sum(YPred == YTest)/numel(YTest)
acc = 0.8542
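The accuracy formula can be checked by hand on a few labels. This is a minimal sketch in base MATLAB with hypothetical true and predicted labels:

```matlab
% Hypothetical true and predicted labels for four reports.
YTestToy = categorical(["Leak";"Leak";"Mechanical Failure";"Electronic Failure"]);
YPredToy = categorical(["Leak";"Mechanical Failure";"Mechanical Failure";"Electronic Failure"]);

% Element-wise comparison gives a logical vector; its sum is the number of
% correct predictions. Three of four match, so the accuracy is 0.75.
accToy = sum(YPredToy == YTestToy)/numel(YTestToy);
```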

Predict Using New Data

Classify the event type of new factory reports. Create a string array containing the new factory reports.

str = [
    "Coolant is pooling underneath sorter."
    "Sorter blows fuses at start up."
    "There are some very loud rattling sounds coming from the assembler."];
documentsNew = preprocessText(str);
XNew = encode(bag,documentsNew);
labelsNew = predict(mdl,XNew)
labelsNew = 3×1 categorical
     Leak 
     Electronic Failure 
     Mechanical Failure 

Example Preprocessing Function

The function preprocessText performs the following steps in order:

  1. Tokenize the text using tokenizedDocument.

  2. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  3. Lemmatize the words using normalizeWords.

  4. Erase punctuation using erasePunctuation.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

function documents = preprocessText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Remove a list of stop words then lemmatize the words. To improve
% lemmatization, first use addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end
