Text Analytics Toolbox

Analyze and model text data

Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.

Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.

Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.

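Import and Visualize Text Data

Extract text data from sources such as social media, news feeds, equipment logs, reports, and surveys.

Extract Text Data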

Import text data into MATLAB® from single files or large collections of files, including PDF, HTML, and Microsoft® Word® and Excel® files.

Extract Text Data from Files

Parse HTML and Extract Text Content

Analyze Text Data Containing Emojis

Extracting text from a collection of Microsoft Word documents.

Visualize Text

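Explore text datasets using word clouds and text scatter plots.

Visualize Text Data Using Word Clouds

Visualize Word Embeddings Using Text Scatter Plots

A word cloud that uses font size and color to show relative word frequency.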

Language Support

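Text Analytics Toolbox provides language-specific preprocessing capabilities for English, Japanese, German, and Korean. Most functions also work with text in other languages.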

Analyze Japanese Text Data

Detect the Language of Text

Analyze German Text Data

Importing, preparing, and analyzing Japanese text.

Preprocess Text Data

Extract meaningful words from raw text.

Clean Text Data

Apply high-level filtering functions to remove extraneous content, such as URLs, HTML tags, and punctuation, and to correct spelling.

Prepare Text Data for Analysis

Erase Punctuation from Text and Documents

Erase HTTP and HTTPS URLs from Text

Correct Spelling in Documents

Simplifying raw text (left) to work with the most meaningful words (right).
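
As a rough, language-agnostic illustration of the cleaning step described above, the following Python sketch erases URLs, HTML tags, and punctuation with regular expressions. It does not use the toolbox's own functions. The function name clean_text and the sample string are hypothetical, and spelling correction is left out because it needs a dictionary.

```python
import re
import string

def clean_text(raw: str) -> str:
    """Strip URLs, HTML tags, and punctuation, then collapse whitespace."""
    # Erase HTTP and HTTPS URLs first, so their punctuation is not left behind.
    text = re.sub(r'https?://[^\s<>"]+', " ", raw)
    # Erase HTML tags such as <p> and </p>.
    text = re.sub(r"<[^>]+>", " ", text)
    # Erase ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

raw = "<p>Pump #3 failed again!</p> See https://example.com/log?id=7 for details."
print(clean_text(raw))
```

The order matters in this sketch: removing punctuation before URLs would break the URL pattern and leave fragments of the address in the text.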

Filter Stop Words and Normalize Words to Root Form

Prioritize meaningful text data in your analysis by filtering out common words, words that appear too frequently or infrequently, and very long or very short words. Reduce the vocabulary and focus on the broader sense or sentiment of a document by stemming words to their root form or lemmatizing them to their dictionary form.

Remove Stop Words from Documents

Stem or Lemmatize Words

Removing stop words like “a” and “of” from documents.
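
The filtering ideas above can be sketched in a few lines of Python. This is a minimal illustration, not the toolbox's implementation: the stop-word list, the function name filter_tokens, and the thresholds are all assumptions chosen for the example.

```python
from collections import Counter

# Tiny illustrative stop-word list; real lists contain hundreds of entries.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is"}

def filter_tokens(docs, min_len=2, max_len=15, min_count=2):
    """Drop stop words, very short or very long words, and rare words."""
    kept = [
        [w for w in doc if w not in STOP_WORDS and min_len <= len(w) <= max_len]
        for doc in docs
    ]
    # Count each remaining word across the whole collection.
    counts = Counter(w for doc in kept for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in kept]

docs = [
    ["the", "pump", "is", "leaking", "oil"],
    ["a", "leak", "of", "oil", "in", "the", "pump"],
]
print(filter_tokens(docs))
```

In this sample, "leaking" and "leak" are each dropped as rare words because the sketch does no stemming or lemmatization. Normalizing both to a common root first would let them count as the same word.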

Identify Tokens, Sentences, and Parts-of-Speech

Automatically split raw text into a collection of words using a tokenization algorithm. Add sentence boundaries, part-of-speech details, and other relevant information for context.

Split Text into Words Using Tokenization

Detect the Sentence Boundaries in Documents

Add Part-of-Speech Tags to Documents

Adding part-of-speech and sentence details to tokenized documents.
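
To make tokenization and sentence detection concrete, here is a simplified Python sketch based on regular expressions. It is only an approximation of what a production tokenizer does, the function names are hypothetical, and part-of-speech tagging is omitted because it requires a trained model.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Return words (keeping internal apostrophes and hyphens) and punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)

text = "The motor overheated. Operators didn't restart it!"
# Number each sentence and list its tokens, similar to a token details table.
for number, sentence in enumerate(split_sentences(text), start=1):
    print(number, tokenize(sentence))
```

A rule this simple mis-splits abbreviations such as "Dr. Smith", which is one reason real tokenizers rely on language-specific rules.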

Convert Text to Numeric Formats

Convert text data to numeric form for use in machine learning and deep learning.

Word and N-Gram Counting

Compute word frequency statistics to represent text data numerically.

Analyze Text Data Using Multiword Phrases

Term Frequency–Inverse Document Frequency (tf-idf) Matrix

Identifying and visualizing the most frequently occurring words in a model.

Word Embeddings and Encoding
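
The counting approaches named in this section can be illustrated with a short Python sketch. It computes word counts, bigram counts, and tf-idf weights using one common variant of the formula, tf × log(N / df). The weighting that the toolbox applies by default may differ, and the function names and sample documents are assumptions made for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    return [
        {w: tf * math.log(n_docs / df[w]) for w, tf in Counter(doc).items()}
        for doc in docs
    ]

docs = [
    ["pump", "leak", "oil", "leak"],
    ["pump", "noise"],
]
print(Counter(docs[0]))             # word counts for the first document
print(Counter(ngrams(docs[0], 2)))  # bigram counts for the first document
print(tfidf(docs))                  # tf-idf weights for every document
```

Because "pump" appears in every document, its tf-idf weight is zero, which shows how the weighting downplays words that do not distinguish one document from another.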

Train word-embedding models such as word2vec continuous bag-of-words (CBOW) and skip-gram models. Import pretrained models including fastText and GloVe.

Visualize Word Embeddings Using Text Scatter Plots

Pretrained fastText Word Embedding

Map Word to Embedding Vector
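
Mapping a word to its embedding vector amounts to a table lookup, and related words can then be found by comparing vectors. The Python sketch below shows the idea with cosine similarity. The three-dimensional vectors are hypothetical values invented for illustration; real embeddings are learned from data and have far more dimensions.

```python
import math

# Hypothetical toy vectors, not taken from any trained model.
EMBEDDING = {
    "pump":  [0.9, 0.1, 0.0],
    "valve": [0.8, 0.2, 0.1],
    "happy": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word):
    """Return the other vocabulary word whose vector is most similar."""
    return max(
        (w for w in EMBEDDING if w != word),
        key=lambda w: cosine(EMBEDDING[word], EMBEDDING[w]),
    )

print(nearest("pump"))
```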