
Compare LDA Solvers

This example shows how to compare latent Dirichlet allocation (LDA) solvers by comparing the goodness of fit and the time taken to fit the model.

Import Text Data

Import a set of abstracts and category labels from math papers using the arXiv API. Specify the number of records to import using the importSize variable.

importSize = 50000;

Create a URL which queries records with set "math" and metadata prefix "arXiv".

url = "https://export.arxiv.org/oai2?verb=ListRecords" + ...
    "&set=math" + ...
    "&metadataPrefix=arXiv";

Extract the abstract text and the resumption token returned by the query URL using the parseArXivRecords function, which is attached to this example as a supporting file. To access this file, open this example as a live script. Note that the arXiv API is rate limited and requires waiting between multiple requests.

[textData,~,resumptionToken] = parseArXivRecords(url);
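The supporting file itself is not listed here. As a rough illustration only, the sketch below shows one way such a parser could be written, assuming the OAI-PMH XML response from the arXiv API contains <abstract>, <categories>, and <resumptionToken> elements; the element names and the string-based extraction are assumptions, and the actual supporting function may work differently.

function [textData,labels,resumptionToken] = parseArXivRecordsSketch(url)
% Hypothetical stand-in for the parseArXivRecords supporting file.

% Download the response as plain text.
code = string(webread(url,weboptions('ContentType','text')));

% Extract the abstract text and category labels of each record.
textData = extractBetween(code,"<abstract>","</abstract>");
labels = extractBetween(code,"<categories>","</categories>");

% Extract the resumption token, if any. The element carries attributes,
% so match from the end of its opening tag.
tokenElement = extractBetween(code,"<resumptionToken","</resumptionToken>");
if isempty(tokenElement)
    resumptionToken = "";
else
    resumptionToken = extractAfter(tokenElement(1),">");
end

end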

Iteratively import more chunks of records until the required amount is reached, or there are no more records. To continue importing records from where you left off, use the resumption token from the previous result in the query URL. To adhere to the rate limits imposed by the arXiv API, add a delay of 20 seconds before each query using the pause function.

while numel(textData) < importSize
    if resumptionToken == ""
        break
    end
    url = "https://export.arxiv.org/oai2?verb=ListRecords" + ...
        "&resumptionToken=" + resumptionToken;
    pause(20)
    [textDataNew,labelsNew,resumptionToken] = parseArXivRecords(url);
    textData = [textData; textDataNew];
end

Preprocess Text Data

Set aside 10% of the documents at random for validation.

numDocuments = numel(textData);
cvp = cvpartition(numDocuments,'HoldOut',0.1);
textDataTrain = textData(training(cvp));
textDataValidation = textData(test(cvp));

Tokenize and preprocess the text data using the function preprocessText, listed at the end of this example.

documentsTrain = preprocessText(textDataTrain);
documentsValidation = preprocessText(textDataValidation);

Create a bag-of-words model from the training documents. Remove words that appear no more than two times in total. Remove any documents containing no words.

bag = bagOfWords(documentsTrain);
bag = removeInfrequentWords(bag,2);
bag = removeEmptyDocuments(bag);

For the validation data, create a bag-of-words model from the validation documents. You do not need to remove any words from the validation data, because any words that do not appear in the fitted LDA models are automatically ignored.

validationData = bagOfWords(documentsValidation);

Fit and Compare Models

For each of the LDA solvers, fit a model with 40 topics. To distinguish the solvers when plotting the results on the same axes, specify different line properties for each solver.

numTopics = 40;
solvers = ["cgs" "avb" "cvb0" "savb"];
lineSpecs = ["+-" "*-" "x-" "o-"];

Fit an LDA model using each solver. For each solver, specify an initial topic concentration of 1, validate the model once per data pass, and do not fit the topic concentration parameter. Using the data in the FitInfo property of the fitted LDA models, plot the validation perplexity and the time elapsed.

By default, the stochastic solver uses a mini-batch size of 1000 and validates the model every 10 iterations. For this solver, to validate the model once per data pass, set the validation frequency to ceil(numObservations/1000), where numObservations is the number of documents in the training data. For the other solvers, set the validation frequency to 1.

For the iterations in which the stochastic solver does not evaluate the validation perplexity, it reports NaN in the FitInfo property. To plot the validation perplexity, remove the NaNs from the reported values.

numObservations = bag.NumDocuments;

figure
for i = 1:numel(solvers)
    solver = solvers(i);
    lineSpec = lineSpecs(i);

    if solver == "savb"
        numIterationsPerDataPass = ceil(numObservations/1000);
    else
        numIterationsPerDataPass = 1;
    end

    mdl = fitlda(bag,numTopics, ...
        'Solver',solver, ...
        'InitialTopicConcentration',1, ...
        'FitTopicConcentration',false, ...
        'ValidationData',validationData, ...
        'ValidationFrequency',numIterationsPerDataPass, ...
        'Verbose',0);

    history = mdl.FitInfo.History;
    timeElapsed = history.TimeSinceStart;
    validationPerplexity = history.ValidationPerplexity;

    % Remove NaNs.
    idx = isnan(validationPerplexity);
    timeElapsed(idx) = [];
    validationPerplexity(idx) = [];

    plot(timeElapsed,validationPerplexity,lineSpec)
    hold on
end
hold off

xlabel("Time Elapsed (s)")
ylabel("Validation Perplexity")
ylim([0 inf])
legend(solvers)

For the stochastic solver, there is only one data point. This is because this solver passes through the input data once. To specify more data passes, use the 'DataPassLimit' option. For the batch solvers ("cgs", "avb", and "cvb0"), to specify the number of iterations used to fit the models, use the 'IterationLimit' option.
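For instance, a minimal sketch of both options; the limits 5 and 200 here are arbitrary illustrative values, not values used in this example.

% Stochastic solver: make five passes through the training data.
mdlStochastic = fitlda(bag,numTopics, ...
    'Solver','savb', ...
    'DataPassLimit',5, ...
    'Verbose',0);

% Batch solver: cap the fit at 200 iterations.
mdlBatch = fitlda(bag,numTopics, ...
    'Solver','cgs', ...
    'IterationLimit',200, ...
    'Verbose',0);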

A lower validation perplexity suggests a better fit. Usually, the solvers "savb" and "cgs" converge quickly to a good fit. The solver "cvb0" might converge to a better fit, but it can take much longer to converge.

For the FitInfo property, the fitlda function estimates the validation perplexity from the document probabilities at the maximum likelihood estimates of the per-document topic probabilities. This is usually quicker to compute, but can be less accurate than other methods. Alternatively, calculate the validation perplexity using the logp function. This function calculates more accurate values, but can take longer to run. For an example showing how to calculate the perplexity using logp, see Calculate Document Log-Probabilities from Word Count Matrix.
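As a minimal sketch, assuming mdl is the last model fitted in the loop above: the second output of logp is the perplexity of the given documents.

% Compute a more accurate estimate of the validation perplexity.
[logProb,ppl] = logp(mdl,validationData);
ppl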

Preprocessing Function

The function preprocessText performs the following steps:

  1. Tokenize the text using tokenizedDocument.

  2. Lemmatize the words using normalizeWords.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

function documents = preprocessText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or greater
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end
