Text Data Preparation

Import text data into MATLAB® and preprocess it for analysis

Text Analytics Toolbox™ includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. Use these tools to extract text from popular file formats, preprocess raw text, extract individual words or multiword phrases (n-grams), convert text into numerical representations, and build statistical models. For an example showing how to get started, see Prepare Text Data for Analysis.
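The workflow described above can be sketched in a few lines. This is a minimal illustration, not a recommended pipeline: the sample strings in textData are made up, and the choice and order of preprocessing steps are assumptions that you would adapt to your own data.

```matlab
% Hypothetical sample data standing in for text read from files or logs.
textData = [
    "The pump failed after the pressure spiked."
    "Pressure sensor readings were normal, but the pump is noisy."];

% Split the raw text into tokens.
documents = tokenizedDocument(textData);

% Add part-of-speech details before removing words, so that
% lemmatization can use them.
documents = addPartOfSpeechDetails(documents);

% Clean up the tokens.
documents = removeStopWords(documents);
documents = normalizeWords(documents, 'Style', 'lemma');
documents = erasePunctuation(documents);
documents = removeShortWords(documents, 2);

% Convert the documents into numerical representations.
bag = bagOfWords(documents);   % word counts per document
M = tfidf(bag);                % tf-idf matrix, one row per document
```

Each row of M corresponds to one document and each column to one word in bag.Vocabulary.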

Text Analytics Toolbox supports the languages English, Japanese, German, and Korean. Most Text Analytics Toolbox functions also work with text from other languages. For more information, see Language Considerations.
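As a small sketch of working with one of the supported languages, the following tokenizes a German sentence with the language stated explicitly instead of relying on automatic detection. The sentence is an invented example.

```matlab
% Hypothetical German sample sentence.
str = "Die Pumpe ist nach dem Druckanstieg ausgefallen.";

% Tokenize and state the language explicitly ('en', 'ja', 'de', or 'ko').
documents = tokenizedDocument(str, 'Language', 'de');

% Inspect the tokens; the Language column records the language used.
details = tokenDetails(documents);
```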

Live Editor Tasks

Preprocess Text Data Preprocess and clean up text data for analysis

Functions

extractFileText Read text from PDF, Microsoft Word, HTML, and plain text files
extractHTMLText Extract text from HTML
readPDFFormData Read data from PDF forms
pdfinfo PDF file information
writeTextDocument Write documents to text file
htmlTree Parsed HTML tree
findElement Find elements in HTML tree
getAttribute Read HTML attribute of root node of HTML tree
ismissing Find HTML trees without values
string Convert parsed HTML tree to string
tokenizedDocument Array of tokenized documents for text analysis
erasePunctuation Erase punctuation from text and documents
eraseTags Erase HTML and XML tags from text
eraseURLs Erase HTTP and HTTPS URLs from text
removeStopWords Remove stop words from documents
removeShortWords Remove short words from documents or bag-of-words model
removeLongWords Remove long words from documents or bag-of-words model
removeWords Remove selected words from documents or bag-of-words model
normalizeWords Stem or lemmatize words
replaceWords Replace words in documents
replaceNgrams Replace n-grams in documents
splitSentences Split text into sentences
splitParagraphs Split text into paragraphs
stopWords List of stop words
decodeHTMLEntities Convert HTML and XML entities into characters
lower Convert documents to lowercase
upper Convert documents to uppercase
context Search documents for word or n-gram occurrences in context
tokenDetails Details of tokens in tokenized document array
addSentenceDetails Add sentence numbers to documents
addPartOfSpeechDetails Add part-of-speech tags to documents
addLemmaDetails Add lemma forms of tokens to documents
addLanguageDetails Add language identifiers to documents
addEntityDetails Add entity tags to documents
addDependencyDetails Add grammatical dependency details to documents
addTypeDetails Add token type details to documents
corpusLanguage Detect language of text
abbreviations Table of common abbreviations
topLevelDomains List of top-level domains
bagOfWords Bag-of-words model
bagOfNgrams Bag-of-n-grams model
addDocument Add documents to bag-of-words or bag-of-n-grams model
removeDocument Remove documents from bag-of-words or bag-of-n-grams model
removeInfrequentWords Remove words with low counts from bag-of-words model
removeInfrequentNgrams Remove infrequently seen n-grams from bag-of-n-grams model
removeNgrams Remove n-grams from bag-of-n-grams model
removeEmptyDocuments Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
topkwords Most important words in bag-of-words model or LDA topic
topkngrams Most frequent n-grams
encode Encode documents as matrix of word or n-gram counts
tfidf Term Frequency–Inverse Document Frequency (tf-idf) matrix
join Combine multiple bag-of-words or bag-of-n-grams models
correctSpelling Correct spelling of words
editDistance Find edit distance between two strings or documents
editDistanceSearcher Edit distance nearest neighbor searcher
knnsearch Find nearest neighbors by edit distance
rangesearch Find nearest neighbors by edit distance range
splitGraphemes Split string into graphemes
docfun Apply function to words in documents
containsWords Check if word is member of documents
containsNgrams Check if n-gram is member of documents
contains Check if pattern is substring in documents
plus Append documents
replace Replace substrings in documents
regexprep Replace text in words of documents using regular expression
doclength Length of documents in document array
doc2cell Convert documents to cell array of string vectors
joinWords Convert documents to string by joining words
string Convert scalar document to string vector
textanalytics.unicode.nfc Unicode composed normalized form (NFC)
textanalytics.unicode.nfd Unicode decomposed normalized form (NFD)
textanalytics.unicode.nfkc Unicode compatibility composed normalized form (NFKC)
textanalytics.unicode.nfkd Unicode compatibility decomposed normalized form (NFKD)
textanalytics.unicode.UTF32 Unicode UTF-32 string representation
characterCategories Unicode character categories
hex Convert UTF-32 representation to hexadecimal values
string Convert UTF-32 representation to string
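
As an illustration of how a few of the functions listed above fit together, the following sketch parses an HTML string, extracts the text of its paragraph elements, tokenizes the result, and computes an edit distance between two words. The HTML snippet and the word pair are invented for the example.

```matlab
% Hypothetical HTML snippet standing in for a downloaded page.
code = "<html><body><h1>Pump report</h1>" + ...
    "<p>The pump <b>failed</b> twice.</p></body></html>";

% Parse the HTML and select the paragraph elements by CSS selector.
tree = htmlTree(code);
paragraphs = findElement(tree, "p");

% Extract the text of the selected elements and tokenize it.
str = extractHTMLText(paragraphs);
documents = tokenizedDocument(str);

% Edit distance with the default costs (insertion, deletion, and
% substitution each cost 1).
d = editDistance("kitten", "sitting");   % d is 3
```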

Topics

Import

Preprocessing

Language Support