Create Simple Preprocessing Function

This example shows how to create a function that cleans and preprocesses text data for analysis.

Text data can be large and can contain a lot of noise, which negatively affects statistical analysis. For example, text data can contain the following:

  • Variations in case, for example "new" and "New"

  • Variations in word forms, for example "walk" and "walking"

  • Words which add noise, for example "stop words" such as "the" and "of"

  • Punctuation and special characters

  • HTML and XML tags
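Two of these noise sources, case variation and markup tags, can be handled on the raw strings before tokenization. The following is a minimal sketch, not part of the preprocessTextData function listed later in this example. The input string is hypothetical, and the sketch assumes that the Text Analytics Toolbox function eraseTags is available.

```matlab
% Hypothetical raw string containing HTML tags and mixed case.
str = "<p>A <b>New</b> tree fell near the new road.</p>";

% Remove HTML and XML tags from the string.
str = eraseTags(str);

% Convert to lowercase so that "New" and "new" count as the same word.
str = lower(str);

% Tokenize the cleaned string for further preprocessing.
documents = tokenizedDocument(str);
```

Whether to lowercase explicitly depends on the later steps. In the output shown later in this example, the lemmatized tokens are already lowercase.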

These word clouds illustrate word frequency analysis applied to some raw text data from weather reports, and a preprocessed version of the same text data.

It can be useful to create a preprocessing function, so you can prepare different collections of text data in the same way. For example, when training a model, you can use a function so that you can preprocess new data using the same steps as the training data.
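As a rough illustration of this idea, the sketch below applies the same function to training data and to new data, then encodes the new data against the vocabulary learned from the training data. It assumes that preprocessTextData from this example is defined at the end of the script or is on the path. The two string arrays are hypothetical, and the bag-of-words encoding is one possible choice of representation, not a step this example prescribes.

```matlab
% Hypothetical training data and new data.
textDataTrain = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
textDataNew = "Several trees are blocking the road near the parking lot.";

% Preprocess the training data and build a bag-of-words model from it.
documentsTrain = preprocessTextData(textDataTrain);
bag = bagOfWords(documentsTrain);

% Preprocess the new data with the same function, then encode it
% using the vocabulary of the training bag-of-words model.
documentsNew = preprocessTextData(textDataNew);
XNew = encode(bag,documentsNew);
```

Because both collections pass through the same steps, words in the new data are normalized to the same forms as the vocabulary in bag.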

The function preprocessTextData, listed at the end of the example, performs the following steps:

  1. Tokenize the text using tokenizedDocument.

  2. Lemmatize the words using normalizeWords.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

To use the function, pass your text data to preprocessTextData.

textData = [
    "A large tree is downed and blocking traffic outside Apple Hill."
    "There is lots of damage to many car windshields in the parking lot."];
documents = preprocessTextData(textData)
documents = 
  2x1 tokenizedDocument:

    8 tokens: large tree down block traffic outside apple hill
    7 tokens: lot damage many car windshield parking lot
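To inspect word frequencies in the preprocessed documents, one option is to build a bag-of-words model and list the most frequent words. The sketch below recreates the tokens from the output above so that it runs on its own. Using bagOfWords and topkwords here is an illustrative choice, not a step of the preprocessing function.

```matlab
% Recreate the preprocessed documents from the output shown above.
documents = tokenizedDocument([
    "large tree down block traffic outside apple hill"
    "lot damage many car windshield parking lot"]);

% Count word occurrences across the documents.
bag = bagOfWords(documents);

% List the three most frequent words and their counts.
tbl = topkwords(bag,3)
```

The word "lot" appears twice in the second document, so it has the highest count in this small collection.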

Preprocessing Function

function documents = preprocessTextData(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words. To improve lemmatization, first use
% addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end

For an example showing a more detailed workflow, see Prepare Text Data for Analysis.

For next steps in text analytics, you can try creating a classification model or analyzing the data using topic models. For examples, see Create Simple Text Model for Classification and Analyze Text Data Using Topic Models.
