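Analyze Text Data Using Topic Models

This example shows how to use a latent Dirichlet allocation (LDA) topic model to analyze text data.

An LDA model is a topic model that discovers the underlying topics in a collection of documents and infers the word probabilities within each topic.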

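Load and Extract Text Data

Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each event.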

data = readtable("factoryReports.csv",TextType="string");
head(data)

ans = 8×5 table

    Description                                                              Category                Urgency     Resolution              Cost
    _____________________________________________________________________    ____________________    ________    ____________________    _____
    "Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"    "Medium"    "Readjust Machine"         45
    "Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"    "Medium"    "Readjust Machine"         35
    "There are cuts to the power when starting the plant."                   "Electronic Failure"    "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                     "Electronic Failure"    "High"      "Replace Components"      352
    "Mixer tripped the fuses."                                               "Electronic Failure"    "Low"       "Add to Watch List"        55
    "Burst pipe in the constructing agent is spraying coolant."              "Leak"                  "High"      "Replace Components"      371
    "A fuse is blown in the mixer."                                          "Electronic Failure"    "Low"       "Replace Components"      441
    "Things continue to tumble off of the belt."                             "Mechanical Failure"    "Low"       "Readjust Machine"         38

Extract the text data from the Description field.

textData = data.Description;
textData(1:10)

ans = 10×1 string
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."
    "There are cuts to the power when starting the plant."
    "Fried capacitors in the assembler."
    "Mixer tripped the fuses."
    "Burst pipe in the constructing agent is spraying coolant."
    "A fuse is blown in the mixer."
    "Things continue to tumble off of the belt."
    "Falling items from the conveyor belt."
    "The scanner reel is split, it will soon begin to curve."

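Prepare Text Data for Analysis

Create a function that tokenizes and preprocesses the text data so that it can be used for analysis. The function preprocessText, listed in the Preprocessing Function section of this example, performs the following steps in order: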

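Tokenize the text using tokenizedDocument.
Lemmatize the words using normalizeWords.
Erase punctuation using erasePunctuation.
Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.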

Prepare the text data for analysis using the preprocessText function.

documents = preprocessText(textData);
documents(1:5)

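ans = 5×1 tokenizedDocument:

    6 tokens: item occasionally get stuck scanner spool
    7 tokens: loud rattling bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse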

Create a bag-of-words model from the tokenized documents.

bag = bagOfWords(documents)

bag =
  bagOfWords with properties:

          Counts: [480×338 double]
      Vocabulary: [1×338 string]
        NumWords: 338
    NumDocuments: 480

Remove words that appear two times or fewer in total from the bag-of-words model. Then remove any documents that contain no words.

bag = removeInfrequentWords(bag,2);
bag = removeEmptyDocuments(bag)

bag =
  bagOfWords with properties:

          Counts: [480×158 double]
      Vocabulary: [1×158 string]
        NumWords: 158
    NumDocuments: 480
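To see what the count threshold does, the following minimal sketch applies removeInfrequentWords to a toy corpus. The three short documents and the variable names toyDocuments and toyBag are made up for illustration and are not part of factoryReports.csv.

```matlab
% Toy corpus (hypothetical, for illustration only).
toyDocuments = tokenizedDocument([
    "fuse blown fuse"
    "fuse leak leak"
    "leak pipe"]);
toyBag = bagOfWords(toyDocuments);

% Remove words that appear two times or fewer in total.
% "fuse" and "leak" each appear three times; "blown" and "pipe" appear once.
toyBag = removeInfrequentWords(toyBag,2);

% Only the words that appear more than two times remain.
toyBag.Vocabulary
```

A threshold of 2 therefore keeps a word only if it occurs at least three times across the whole corpus, which is why the factory report vocabulary shrinks from 338 to 158 words.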

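Fit LDA Model

Fit an LDA model with 7 topics. For an example showing how to choose the number of topics, see Choose Number of Topics for LDA Model. To suppress verbose output, set the Verbose option to 0. For reproducibility, use the rng function with the "default" option.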

rng("default")
numTopics = 7;
mdl = fitlda(bag,numTopics,Verbose=0);

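If you have a large data set, then the stochastic approximate variational Bayes solver is usually better suited, because it can fit a good model in fewer passes over the data. The default solver for fitlda (collapsed Gibbs sampling) can be more accurate at the cost of a longer run time. To use stochastic approximate variational Bayes, set the Solver option to "savb". For an example showing how to compare LDA solvers, see Compare LDA Solvers.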
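As a rough sketch of switching solvers, the code below fits a second model with Solver="savb". So that it runs on its own, it rebuilds a bag-of-words model straight from the raw descriptions and skips the preprocessing steps; the names bagRaw and mdlSAVB are introduced here for illustration only.

```matlab
% Rebuild a bag-of-words model directly from the raw descriptions so that
% this sketch is self-contained (preprocessing is skipped for brevity).
data = readtable("factoryReports.csv",TextType="string");
bagRaw = bagOfWords(tokenizedDocument(data.Description));

% Fit a 7-topic model with the stochastic approximate variational Bayes solver.
rng("default")
mdlSAVB = fitlda(bagRaw,7,Solver="savb",Verbose=0);

% Confirm the number of topics in the fitted model.
mdlSAVB.NumTopics
```

On a data set as small as this one, the choice of solver is unlikely to matter much for run time; the option is mainly useful for much larger corpora.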

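Visualize Topics Using Word Clouds

You can use word clouds to view the words with the highest probabilities in each topic. Visualize the topics using word clouds.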

figure
t = tiledlayout("flow");
title(t,"LDA Topics")
for i = 1:numTopics
    nexttile
    wordcloud(mdl,i);
    title("Topic " + i)
end

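View Mixtures of Topics in Documents

Create an array of tokenized documents for a set of previously unseen documents, using the same preprocessing function as for the training data.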

str = [
    "Coolant is pooling underneath assembler."
    "Sorter blows fuses at start up."
    "There are some very loud rattling sounds coming from the assembler."];
newDocuments = preprocessText(str);

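Use the transform function to convert the documents into vectors of topic probabilities. Note that for very short documents, the topic mixtures may not be a strong representation of the document content.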

topicMixtures = transform(mdl,newDocuments);
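The output of transform has one row per document and one column per topic, and each row is a probability distribution over the topics. The following minimal sketch checks this on a toy model; the four short documents and the names miniDocuments, miniBag, miniMdl, and miniMixtures are hypothetical and chosen only to keep the example self-contained.

```matlab
% Toy corpus (hypothetical, for illustration only).
miniDocuments = tokenizedDocument([
    "fuse blown mixer"
    "coolant leak pipe"
    "fuse tripped mixer"
    "pipe leak coolant spray"]);
miniBag = bagOfWords(miniDocuments);

% Fit a small 2-topic model.
rng("default")
miniMdl = fitlda(miniBag,2,Verbose=0);

% Transform one unseen document into a topic mixture.
miniMixtures = transform(miniMdl,tokenizedDocument("mixer fuse"));

% One row per document, one column per topic.
size(miniMixtures)

% The topic probabilities of each document sum to 1 (up to rounding).
sum(miniMixtures,2)
```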

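Plot the document topic probabilities of the first document in a bar chart. To label the topics, use the top three words of the corresponding topic.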

for i = 1:numTopics
    top = topkwords(mdl,3,i);
    topWords(i) = join(top.Word,", ");
end

figure
bar(topicMixtures(1,:))
xlabel("Topic")
xticklabels(topWords);
ylabel("Probability")
title("Document Topic Probabilities")

To visualize multiple topic mixtures at once, use a stacked bar chart. Visualize the topic mixtures of the documents.

figure
barh(topicMixtures,"stacked")
xlim([0 1])
title("Topic Mixtures")
xlabel("Topic Probability")
ylabel("Document")
legend(topWords, ...
    Location="southoutside", ...
    NumColumns=2)

Preprocessing Function

The function preprocessText performs the following steps in order:

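Tokenize the text using tokenizedDocument.
Lemmatize the words using normalizeWords.
Erase punctuation using erasePunctuation.
Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.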

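function documents = preprocessText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,Style="lemma");

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end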

See Also