ldaModel

Latent Dirichlet allocation (LDA) model

Description

A latent Dirichlet allocation (LDA) model is a topic model that discovers underlying topics in a collection of documents and infers word probabilities in topics. If the model was fit using a bag-of-n-grams model, then the software treats the n-grams as individual words.

Creation

Create an LDA model using the fitlda function.
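As a minimal sketch (the two toy documents and the choice of two topics below are illustrative, not taken from this page), creating a model looks like this:

% Toy example: build a bag-of-words model from two short documents and
% fit a two-topic LDA model, which returns an ldaModel object.
documents = tokenizedDocument([
    "dogs chase cats and cats chase mice"
    "stocks bonds and funds form a portfolio"]);
bag = bagOfWords(documents);        % word counts per document
mdl = fitlda(bag,2,'Verbose',0);    % ldaModel with NumTopics = 2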

Properties

Number of topics in the LDA model, specified as a positive integer.

Topic concentration, specified as a positive scalar. The function sets the concentration per topic to TopicConcentration/NumTopics. For more information, see Latent Dirichlet Allocation.

Word concentration, specified as a nonnegative scalar. The software sets the concentration per word to WordConcentration/numWords, where numWords is the vocabulary size of the input documents. For more information, see Latent Dirichlet Allocation.
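As an illustrative sketch (the names alpha and beta below are chosen here, not defined by this page), the effective Dirichlet concentration parameters follow from these properties of a fitted model mdl:

% Per-topic and per-word concentrations implied by the stored properties.
alpha = mdl.TopicConcentration / mdl.NumTopics;    % concentration per topic
numWords = numel(mdl.Vocabulary);                  % vocabulary size of the input documents
beta = mdl.WordConcentration / numWords;           % concentration per word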

Topic probabilities of the input document set, specified as a vector. The corpus topic probabilities of an LDA model are the probabilities of observing each topic in the entire data set used to fit the LDA model. CorpusTopicProbabilities is a 1-by-K vector, where K is the number of topics. The kth entry of CorpusTopicProbabilities corresponds to the probability of observing topic k.

Topic probabilities per input document, specified as a matrix. The document topic probabilities of an LDA model are the probabilities of observing each topic in each document used to fit the LDA model. DocumentTopicProbabilities is a D-by-K matrix, where D is the number of documents used to fit the LDA model and K is the number of topics. The (d,k)th entry of DocumentTopicProbabilities corresponds to the probability of observing topic k in document d.

If any topics have zero probability (CorpusTopicProbabilities contains zeros), then the corresponding columns of DocumentTopicProbabilities and TopicWordProbabilities are zeros.

The order of the rows in DocumentTopicProbabilities corresponds to the order of the documents in the training data.
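The following sketch (assuming mdl is an ldaModel returned by fitlda, as in the examples below) inspects these probabilities; the exact values depend on the training data:

corpusProbs = mdl.CorpusTopicProbabilities;    % 1-by-K vector, sums to 1
docProbs = mdl.DocumentTopicProbabilities;     % D-by-K matrix
size(docProbs,1)                               % D, the number of training documents
sum(docProbs(1,:))                             % each row sums to approximately 1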

Word probabilities per topic, specified as a matrix. The topic word probabilities of an LDA model are the probabilities of observing each word in each topic of the LDA model. TopicWordProbabilities is a V-by-K matrix, where V is the number of words in Vocabulary and K is the number of topics. The (v,k)th entry of TopicWordProbabilities corresponds to the probability of observing word v in topic k.

If any topics have zero probability (CorpusTopicProbabilities contains zeros), then the corresponding columns of DocumentTopicProbabilities and TopicWordProbabilities are zeros.

The order of the rows in TopicWordProbabilities corresponds to the order of the words in Vocabulary.
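As a brief sketch, you can look up the probability of a single vocabulary word within one topic; the word "time" and topic index 1 are arbitrary choices for illustration:

v = find(mdl.Vocabulary == "time",1);    % row index of the word in Vocabulary
k = 1;                                   % topic index
p = mdl.TopicWordProbabilities(v,k);     % probability of observing "time" in topic k
sum(mdl.TopicWordProbabilities(:,k))     % each column sums to approximately 1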

Topic order, specified as one of the following (see the sketch after this list):

  • 'initial-fit-probability' - Sort the topics by the corpus topic probabilities of the initial model fit. These probabilities are the CorpusTopicProbabilities property of the initial ldaModel object returned by fitlda. The resume function does not reorder the topics of the resulting ldaModel object.

  • 'unordered' - Do not order the topics.
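As a hedged check (not stated on this page, but expected for a model returned directly by fitlda with the default order), the corpus topic probabilities should appear in descending order:

mdl.TopicOrder                                      % 'initial-fit-probability' by default
issorted(mdl.CorpusTopicProbabilities,'descend')    % expected to return logical 1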

Information recorded when fitting the LDA model, specified as a struct with the following fields (see the sketch after this list):

  • TerminationCode - Status of the optimization upon exit

    • 0 - Iteration limit reached.

    • 1 - Tolerance on the log-likelihood satisfied.

  • TerminationStatus - Explanation of the returned termination code

  • NumIterations - Number of iterations performed

  • NegativeLogLikelihood - Negative log-likelihood for the data passed to fitlda

  • Perplexity - Perplexity for the data passed to fitlda

  • Solver - Name of the solver used

  • History - Struct holding the optimization history

  • StochasticInfo - Struct holding information for stochastic solvers

Data Types: struct
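A short sketch of reading a few of these fields from a fitted model mdl:

info = mdl.FitInfo;             % struct recorded during fitting
info.TerminationStatus          % explanation of why fitting stopped
info.NumIterations              % number of iterations performed
info.Perplexity                 % perplexity of the data passed to fitlda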

List of the words in the model, specified as a string vector.

Data Types: string
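For example (a minimal sketch on a fitted model mdl):

mdl.Vocabulary(1:5)       % first five words in the model
numel(mdl.Vocabulary)     % vocabulary size, V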

Object Functions

logp          Document log-probabilities and goodness of fit of LDA model
predict       Predict top LDA topics of documents
resume        Resume fitting LDA model
topkwords     Most important words in bag-of-words model or LDA topic
transform     Transform documents into lower-dimensional space
wordcloud     Create word cloud chart from text, bag-of-words model, bag-of-n-grams model, or LDA model
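The sketch below illustrates transform and logp from this table, which the examples later on this page do not show; mdl stands for a fitted ldaModel and the input sentence is an arbitrary illustration:

newDocuments = tokenizedDocument("from fairest creatures we desire increase");
dtp = transform(mdl,newDocuments);    % topic mixture of the new document (1-by-NumTopics)
logProb = logp(mdl,newDocuments);     % log-probability of the document under the model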

Examples

To reproduce the results in this example, set rng to "default".

rng("default")

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ]
        NumWords: 3092
    NumDocuments: 154

Fit an LDA model with four topics.

numTopics = 4;
mdl = fitlda(bag,numTopics)
Initial topic assignments sampled in 0.067833 seconds.
=====================================================================================
| Iteration | Time per  | Relative   | Training   | Topic         | Topic         |
|           | iteration | change in  | perplexity | concentration | concentration |
|           | (seconds) | log(L)     |            |               | iterations    |
=====================================================================================
|         0 |      0.03 |            |  1.215e+03 |         1.000 |             0 |
|         1 |      0.01 | 1.0482e-02 |  1.128e+03 |         1.000 |             0 |
|         2 |      0.01 | 1.7190e-03 |  1.115e+03 |         1.000 |             0 |
|         3 |      0.01 | 4.3796e-04 |  1.118e+03 |         1.000 |             0 |
|         4 |      0.01 | 9.4193e-04 |  1.111e+03 |         1.000 |             0 |
|         5 |      0.02 | 3.7079e-04 |  1.108e+03 |         1.000 |             0 |
|         6 |      0.01 | 9.5777e-05 |  1.107e+03 |         1.000 |             0 |
=====================================================================================
mdl = 
  ldaModel with properties:

                     NumTopics: 4
             WordConcentration: 1
            TopicConcentration: 1
      CorpusTopicProbabilities: [0.2500 0.2500 0.2500 0.2500]
    DocumentTopicProbabilities: [154×4 double]
        TopicWordProbabilities: [3092×4 double]
                    Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ]
                    TopicOrder: 'initial-fit-probability'
                       FitInfo: [1×1 struct]

Visualize the topics using word clouds.

figure
for topicIdx = 1:4
    subplot(2,2,topicIdx)
    wordcloud(mdl,topicIdx);
    title("Topic: " + topicIdx)
end

Figure contains objects of type wordcloud. The wordcloud charts have titles Topic: 1, Topic: 2, Topic: 3, and Topic: 4.

Create a table of the words with the highest probability of an LDA topic.

To reproduce the results, set rng to "default".

rng("default")

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents);

Fit an LDA model with 20 topics. To suppress verbose output, set 'Verbose' to 0.

numTopics = 20;
mdl = fitlda(bag,numTopics,'Verbose',0);

Find the top 20 words of the first topic.

k = 20;
topicIdx = 1;
tbl = topkwords(mdl,k,topicIdx)
tbl=20×2 table
      Word        Score  
    ________    _________
    "eyes"        0.11155
    "beauty"      0.05777
      ⋮

Find the top 20 words of the first topic and use inverse mean scaling on the scores.

tbl = topkwords(mdl,k,topicIdx,'Scaling','inversemean')
tbl=20×2 table
      Word        Score 
    ________    ________
    "eyes"        1.2718
    "beauty"     0.59022
      ⋮

Create a word cloud using the scaled scores as the size data.

figure
wordcloud(tbl.Word,tbl.Score);

Figure contains an object of type wordcloud.

Get the document topic probabilities (also known as topic mixtures) of the documents used to fit an LDA model.

To reproduce the results, set rng to "default".

rng("default")

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents);

Fit an LDA model with 20 topics. To suppress verbose output, set 'Verbose' to 0.

numTopics = 20;
mdl = fitlda(bag,numTopics,'Verbose',0)
mdl = 
  ldaModel with properties:

                     NumTopics: 20
             WordConcentration: 1
            TopicConcentration: 5
      CorpusTopicProbabilities: [0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500]
    DocumentTopicProbabilities: [154×20 double]
        TopicWordProbabilities: [3092×20 double]
                    Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ]
                    TopicOrder: 'initial-fit-probability'
                       FitInfo: [1×1 struct]

View the topic probabilities of the first document in the training data.

topicMixtures = mdl.DocumentTopicProbabilities;
figure
bar(topicMixtures(1,:))
title("Document 1 Topic Probabilities")
xlabel("Topic Index")
ylabel("Probability")

Figure contains an axes object. The axes object with title Document 1 Topic Probabilities, xlabel Topic Index, ylabel Probability contains an object of type bar.

To reproduce the results in this example, set rng to "default".

rng("default")

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ]
        NumWords: 3092
    NumDocuments: 154

Fit an LDA model with 20 topics.

numTopics = 20;
mdl = fitlda(bag,numTopics)
Initial topic assignments sampled in 0.123507 seconds.
=====================================================================================
| Iteration | Time per  | Relative   | Training   | Topic         | Topic         |
|           | iteration | change in  | perplexity | concentration | concentration |
|           | (seconds) | log(L)     |            |               | iterations    |
=====================================================================================
|         0 |      0.04 |            |  1.159e+03 |         5.000 |             0 |
|         1 |      0.05 | 5.4884e-02 |  8.028e+02 |         5.000 |             0 |
|         2 |      0.04 | 4.7400e-03 |  7.778e+02 |         5.000 |             0 |
|         3 |      0.08 | 3.4597e-03 |  7.602e+02 |         5.000 |             0 |
|         4 |      0.06 | 3.4662e-03 |  7.430e+02 |         5.000 |             0 |
|         5 |      0.06 | 2.9259e-03 |  7.288e+02 |         5.000 |             0 |
|         6 |      0.05 | 6.4180e-05 |  7.291e+02 |         5.000 |             0 |
=====================================================================================
mdl = 
  ldaModel with properties:

                     NumTopics: 20
             WordConcentration: 1
            TopicConcentration: 5
      CorpusTopicProbabilities: [0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500]
    DocumentTopicProbabilities: [154×20 double]
        TopicWordProbabilities: [3092×20 double]
                    Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ]
                    TopicOrder: 'initial-fit-probability'
                       FitInfo: [1×1 struct]

Predict the top topics of an array of new documents.

newDocuments = tokenizedDocument([
    "what's in a name? a rose by any other name would smell as sweet."
    "if music be the food of love, play on."]);
topicIdx = predict(mdl,newDocuments)
topicIdx = 2×1

    19
     8

Visualize the predicted topics using word clouds.

figure
subplot(1,2,1)
wordcloud(mdl,topicIdx(1));
title("Topic " + topicIdx(1))
subplot(1,2,2)
wordcloud(mdl,topicIdx(2));
title("Topic " + topicIdx(2))

Figure contains objects of type wordcloud. The wordcloud charts have titles Topic 19 and Topic 8.

More About

Version History

Introduced in R2017b