Text Analytics Toolbox

Analyze and model text data

Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.

Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.

Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.

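Import and Visualize Text Data

Extract text data from sources such as social media, news feeds, equipment logs, reports, and surveys.

Extract Text Data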

Import text data into MATLAB® from single files or large collections of files, including PDF, HTML, and Microsoft® Word® and Excel® files.

Extracting text from a collection of Microsoft Word documents.

Visualize Text

Explore text datasets using word clouds and text scatter plots.

A word cloud that displays words using font size and color.

Language Support

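Text Analytics Toolbox provides language-specific preprocessing capabilities for English, Japanese, German, and Korean. Most functions also work with text in other languages.

Importing, preparing, and analyzing Japanese text.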

Preprocess Text Data

Extract meaningful words from raw text.

Clean Text Data

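Apply high-level filtering functions to remove extraneous content such as URLs, HTML tags, and punctuation, and to correct spelling.

Simplifying raw text (left) to work with the most meaningful words (right).

Filter Stop Words and Normalize Words to Root Form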

Prioritize meaningful text data in your analysis by filtering out common words, words that appear too frequently or infrequently, and very long or very short words. Reduce the vocabulary and focus on the broader sense or sentiment of a document by stemming words to their root form or lemmatizing them to their dictionary form.

Removing stop words like “a” and “of” from documents.
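The filtering steps described above can be sketched in a few lines. The example below is a language-neutral illustration in plain Python, not the toolbox's MATLAB functions; the stop-word list, the length limits, and the names `filter_tokens` and `drop_rare_words` are assumptions chosen for the example. Stemming and lemmatization are omitted because they need language-specific rules or dictionaries.

```python
from collections import Counter

# Illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}


def filter_tokens(tokens, min_len=2, max_len=15):
    """Drop stop words and tokens that are too short or too long."""
    return [
        t for t in tokens
        if t.lower() not in STOP_WORDS and min_len <= len(t) <= max_len
    ]


def drop_rare_words(documents, min_count=2):
    """Remove words that occur fewer than min_count times across all documents."""
    counts = Counter(t for doc in documents for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in documents]


# Two made-up documents, already split into words.
docs = [
    "the pump is leaking a lot of coolant".split(),
    "a leak in the coolant pump".split(),
]
filtered = [filter_tokens(d) for d in docs]
frequent_only = drop_rare_words(filtered)
print(filtered)
print(frequent_only)
```

Note that "leaking" and "leak" are treated as different words here, which is exactly the gap that stemming or lemmatization is meant to close.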

Identify Tokens, Sentences, and Parts-of-Speech

Automatically split raw text into a collection of words using a tokenization algorithm. Add sentence boundaries, part-of-speech details, and other relevant information for context.

Adding part-of-speech and sentence details to tokenized documents.
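To make the idea of tokenization with sentence context concrete, here is a minimal sketch in plain Python. It is not the toolbox's tokenizer: the regular expressions and the function names `split_sentences` and `tokenize` are assumptions, and the rules are deliberately naive (abbreviations such as "Dr." would be split incorrectly). Part-of-speech tagging is left out because it typically requires a trained model.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return word tokens (letters, digits, apostrophes) and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", sentence)


text = "The motor overheated. Operator shut it down! Was the fan blocked?"

# Pair every token with the index of the sentence it came from.
tokens_with_context = [
    (sent_idx, tok)
    for sent_idx, sent in enumerate(split_sentences(text))
    for tok in tokenize(sent)
]

for sent_idx, tok in tokens_with_context:
    print(sent_idx, tok)
```

Keeping the sentence index next to each token is one simple way to preserve context for later steps.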

Convert Text to Numeric Formats

Convert text data into numeric form for use in machine learning and deep learning.

Word and N-Gram Counting

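Calculate word frequency statistics to represent text data numerically.

Identifying and visualizing the most frequently occurring words in a model.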
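As an illustration of word and n-gram counting, the sketch below builds a document-by-word count matrix and a bigram count in plain Python. It is a generic bag-of-words example under stated assumptions, not the toolbox's implementation; `ngrams` and `bag_of_words` are names introduced here, and the two documents are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(documents):
    """Return (vocabulary, count matrix) with one row per document."""
    vocab = sorted({t for doc in documents for t in doc})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocab)
        for t in doc:
            row[index[t]] += 1
        matrix.append(row)
    return vocab, matrix


# Made-up, already tokenized documents.
docs = [["pump", "leak", "pump"], ["fan", "leak"]]

vocab, counts = bag_of_words(docs)
bigram_counts = Counter(bg for doc in docs for bg in ngrams(doc, 2))
top_words = Counter(t for doc in docs for t in doc).most_common(2)

print(vocab)
print(counts)
print(bigram_counts)
print(top_words)
```

Each row of `counts` is a numeric representation of one document that can be passed to a statistical or machine learning model.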

Word Embedding and Encoding

Train word-embedding models such as word2vec continuous bag-of-words (CBOW) and skip-gram models. Import pretrained models including fastText and GloVe.

Visualize clusters in a text scatter plot using word embedding.
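Once words are mapped to vectors, related words can be found by comparing those vectors. The sketch below shows the idea with cosine similarity in plain Python. The three-dimensional vectors are hand-picked, hypothetical values for illustration only (real embeddings are learned from data and usually have hundreds of dimensions), and `cosine` and `nearest` are names introduced for this example.

```python
import math

# Hypothetical toy embedding; values are invented for illustration.
embedding = {
    "pump": [0.9, 0.1, 0.0],
    "valve": [0.8, 0.2, 0.1],
    "happy": [0.0, 0.9, 0.4],
    "glad": [0.1, 0.8, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(word, k=1):
    """Return the k words whose vectors are most similar to the given word."""
    scores = [
        (cosine(embedding[word], vec), other)
        for other, vec in embedding.items()
        if other != word
    ]
    return [other for _, other in sorted(scores, reverse=True)[:k]]


print(nearest("pump"))
print(nearest("happy"))
```

Words that sit close together under this measure are the ones that would form visible clusters in a scatter plot of the embedding.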

Machine Learning with Text Data

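Perform topic modeling, classification, dimensionality reduction, and document summarization using machine learning algorithms.

Topic Modeling

Discover and visualize underlying patterns, trends, and complex relationships in large sets of text data using machine learning algorithms such as latent Dirichlet allocation (LDA) and latent semantic analysis (LSA).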

Identifying topics in storm report data.

Document Summarization and Keyword Extraction

Extract a summary and relevant keywords from one or more documents automatically, and evaluate the similarity and importance of documents.

Extracting a summary from text.

Deep Learning with Text Data

Perform sentiment analysis and classification with deep learning networks such as long short-term memory networks (LSTMs).

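Sentiment Analysis

Identify the attitudes and opinions expressed in text data, classifying statements as positive, neutral, or negative. Build models that can predict sentiment in real time.

Identifying words that predict positive and negative sentiment.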
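The three-way classification into positive, neutral, and negative can be illustrated with a simple lexicon-based scorer in plain Python. This is a stand-in technique for illustration, not a deep learning model and not the toolbox's method; the word scores in `LEXICON`, the threshold, and the function names are all assumptions.

```python
# Hypothetical mini-lexicon; the scores are invented for illustration.
LEXICON = {
    "good": 1.0,
    "great": 2.0,
    "reliable": 1.0,
    "bad": -1.0,
    "broken": -2.0,
    "slow": -1.0,
}


def sentiment_score(tokens):
    """Sum the lexicon scores of all tokens; unknown words count as 0."""
    return sum(LEXICON.get(t.lower(), 0.0) for t in tokens)


def classify(tokens, threshold=0.5):
    """Map a score to 'positive', 'negative', or 'neutral'."""
    score = sentiment_score(tokens)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


print(classify("The new pump is great".split()))
print(classify("Delivery was slow and the seal arrived broken".split()))
print(classify("The unit arrived on Tuesday".split()))
```

A trained model replaces the fixed lexicon with weights learned from labeled examples, which lets it pick up on context that a word list misses.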

Text Classification

Classify text descriptions using word embeddings that can identify categories of text through deep learning.

Training a deep neural network to classify text data.

Text generation using Jane Austen’s Pride and Prejudice and a deep learning LSTM network.

Latest Features

Keyword Extraction

Extract the keywords that best describe a document using the RAKE and TextRank algorithms

See the release notes for details on any of these features and corresponding functions.

Sentiment Analysis with Deep Learning

Analyze the sentiment of live Twitter data to understand how a given term is perceived.