Choose Number of Topics for LDA Model


This example shows how to decide on a suitable number of topics for a latent Dirichlet allocation (LDA) model.

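To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with different numbers of topics. You can evaluate the goodness-of-fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents. A lower perplexity suggests a better fit.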
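As a rough illustration of what perplexity measures, the sketch below computes it from per-document log-probabilities using the common definition exp(-sum(logProb)/totalTokens). It is an assumption here, not something stated above, that the toolbox reports perplexity in exactly this form. The vocabulary size and token counts are made-up toy values.

```matlab
% Toy setup (hypothetical values): a "model" that assigns every word
% in a vocabulary of V words the same probability 1/V.
V = 50;
numTokens = [4; 7; 3];               % tokens in each held-out document

% Log-probability of each document under the uniform model.
logProb = numTokens * log(1/V);

% Perplexity: exponentiated negative average log-probability per token.
ppl = exp(-sum(logProb) / sum(numTokens));
disp(ppl)
```

Under this definition, a model that spreads its probability uniformly over V words has a perplexity of V, so a lower perplexity means the model is effectively choosing among fewer plausible words per token.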

Extract and Preprocess Text Data

Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each event. Extract the text data from the field Description.

    filename = "factoryReports.csv";
    data = readtable(filename,'TextType','string');
    textData = data.Description;

Tokenize and preprocess the text data using the function preprocessText, which is listed at the end of this example.

    documents = preprocessText(textData);
    documents(1:5)

    ans =
      5×1 tokenizedDocument:

        6 tokens: item occasionally get stuck scanner spool
        7 tokens: loud rattle bang sound come assembler piston
        4 tokens: cut power start plant
        3 tokens: fry capacitor assembler
        3 tokens: mixer trip fuse

Set aside 10% of the documents at random for validation.

    numDocuments = numel(documents);
    cvp = cvpartition(numDocuments,'HoldOut',0.1);
    documentsTrain = documents(cvp.training);
    documentsValidation = documents(cvp.test);
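To make the hold-out split more concrete, here is a minimal sketch that builds comparable training and validation masks by hand with randperm. It is only an illustration: the collection size is hypothetical, and the exact number of held-out documents that cvpartition selects may differ from the simple rounding used here.

```matlab
numDocuments = 20;        % hypothetical collection size
holdoutFraction = 0.1;    % fraction reserved for validation

% Shuffle the document indices and take the first 10% as validation.
shuffledIdx = randperm(numDocuments);
numValidation = round(holdoutFraction * numDocuments);

% Logical masks, analogous in spirit to cvp.test and cvp.training.
isValidation = false(numDocuments,1);
isValidation(shuffledIdx(1:numValidation)) = true;
isTraining = ~isValidation;
```

Every document lands in exactly one of the two sets, so the validation perplexity is always computed on documents the model never saw during fitting.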

Create a bag-of-words model from the training documents. Remove the words that do not appear more than two times in total. Remove any documents containing no words.

    bag = bagOfWords(documentsTrain);
    bag = removeInfrequentWords(bag,2);
    bag = removeEmptyDocuments(bag);
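The effect of these two cleanup steps can be sketched on a plain count matrix. The vocabulary and counts below are invented for illustration, and the code mimics the filtering logic by hand rather than calling the toolbox functions.

```matlab
% Hypothetical word counts: rows are documents, columns are words.
vocabulary = ["scanner" "fuse" "mixer" "piston"];
counts = [2 0 1 0
          0 1 0 0
          3 0 1 0
          0 0 1 1];

% Keep only the words that appear more than two times in total.
keepWords = sum(counts,1) > 2;
counts = counts(:,keepWords);
vocabulary = vocabulary(keepWords);

% Drop documents that have no words left after the filtering.
keepDocs = sum(counts,2) > 0;
counts = counts(keepDocs,:);
```

Here the second document contains only an infrequent word, so it becomes empty once that word is dropped and is removed in the second step. This is why the empty-document removal comes after the infrequent-word removal.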

Choose Number of Topics

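The goal is to choose a number of topics that minimizes the perplexity compared to other numbers of topics. This is not the only consideration: models fit with larger numbers of topics may take longer to converge. To see the effects of the tradeoff, calculate both the goodness-of-fit and the fitting time. If the optimal number of topics is high, then you might want to choose a lower value to speed up the fitting process.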

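Fit some LDA models for a range of values for the number of topics. Compare the fitting time and the perplexity of each model on the held-out set of validation documents. The perplexity is the second output of the logp function. To obtain the second output without assigning the first output to anything, use the ~ symbol. The fitting time is the TimeSinceStart value for the last iteration. This value is in the History struct of the FitInfo property of the LDA model.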

For a quicker fit, specify 'Solver' to be 'savb'. To suppress verbose output, set 'Verbose' to 0. This may take a few minutes to run.

    numTopicsRange = [5 10 15 20 40];
    for i = 1:numel(numTopicsRange)
        numTopics = numTopicsRange(i);

        mdl = fitlda(bag,numTopics, ...
            'Solver','savb', ...
            'Verbose',0);

        [~,validationPerplexity(i)] = logp(mdl,documentsValidation);
        timeElapsed(i) = mdl.FitInfo.History.TimeSinceStart(end);
    end

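Show the perplexity and elapsed time for each number of topics in a plot. Plot the perplexity on the left axis and the time elapsed on the right axis.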

    figure
    yyaxis left
    plot(numTopicsRange,validationPerplexity,'+-')
    ylabel("Validation Perplexity")

    yyaxis right
    plot(numTopicsRange,timeElapsed,'o-')
    ylabel("Time Elapsed (s)")

    legend(["Validation Perplexity" "Time Elapsed (s)"],'Location','southeast')
    xlabel("Number of Topics")

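The plot suggests that fitting a model with 10 to 20 topics may be a good choice. The perplexity is low compared with the models with different numbers of topics. With this solver, the elapsed time for this many topics is also reasonable. With different solvers, you may find that increasing the number of topics can lead to a better fit, but that fitting the model takes longer to converge.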
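If you prefer to pick the number of topics programmatically rather than by eye, one possible rule is to take the smallest number of topics whose perplexity is within a tolerance of the minimum. The sketch below assumes that rule. The tolerance and the perplexity values are placeholders for illustration only, not results from this example; in practice, use the validationPerplexity vector computed in the fitting loop.

```matlab
numTopicsRange = [5 10 15 20 40];

% PLACEHOLDER values for illustration only (not measured results).
validationPerplexity = [520 470 455 450 480];

% Hypothetical tolerance: accept anything within 5% of the best perplexity.
tolerance = 0.05;

[minPerplexity,idxBest] = min(validationPerplexity);
isCloseEnough = validationPerplexity <= (1 + tolerance)*minPerplexity;

% numTopicsRange is sorted in ascending order, so the first acceptable
% entry is the smallest acceptable number of topics.
idxChosen = find(isCloseEnough,1,'first');
numTopicsChosen = numTopicsRange(idxChosen);
```

Choosing the smallest acceptable value rather than the strict minimizer reflects the tradeoff described earlier: a slightly higher perplexity can be worth a faster fit.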

Example Preprocessing Function

The function preprocessText performs the following steps in order:

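Convert the text data to lowercase using lower.
Tokenize the text using tokenizedDocument.
Erase punctuation using erasePunctuation.
Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.
Lemmatize the words using normalizeWords.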

    function documents = preprocessText(textData)

    % Convert the text data to lowercase.
    cleanTextData = lower(textData);

    % Tokenize the text.
    documents = tokenizedDocument(cleanTextData);

    % Erase punctuation.
    documents = erasePunctuation(documents);

    % Remove a list of stop words.
    documents = removeStopWords(documents);

    % Remove words with 2 or fewer characters, and words with 15 or greater
    % characters.
    documents = removeShortWords(documents,2);
    documents = removeLongWords(documents,15);

    % Lemmatize the words.
    documents = addPartOfSpeechDetails(documents);
    documents = normalizeWords(documents,'Style','lemma');

    end
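The length thresholds in the function are inclusive at both ends: words of length 2 and words of length 15 are both removed. The following sketch mimics that filter on a plain string array with strlength, as an illustration of the boundary behavior rather than a replacement for the toolbox functions. The word list is made up.

```matlab
% Hypothetical words with lengths 2, 3, 7, 15, and 16.
words = ["an" "cut" "scanner" "electromagnetic" "counterclockwise"];
wordLengths = strlength(words);

% Keep words longer than 2 characters and shorter than 15 characters.
keep = wordLengths > 2 & wordLengths < 15;
filteredWords = words(keep);
disp(filteredWords)
```

Only the words strictly between the two thresholds survive, which matches the "2 or fewer" and "15 or more" wording of the steps above.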
