
Language Considerations

Text Analytics Toolbox™ supports the languages English, Japanese, German, and Korean. Most Text Analytics Toolbox functions also work with text in other languages. The following sections summarize how to use Text Analytics Toolbox features with other languages.

Tokenization

The tokenizedDocument function has built-in rules for English, Japanese, German, and Korean only. For English and German text, the 'unicode' tokenization method of tokenizedDocument detects tokens using rules based on Unicode® Standard Annex #29 [1] and the ICU tokenizer [2], modified to better detect complex tokens such as hashtags and URLs. For Japanese and Korean text, the 'mecab' tokenization method detects tokens using rules based on the MeCab tokenizer [3].

For other languages, you can still try using tokenizedDocument. If tokenizedDocument does not produce useful results, then try tokenizing the text manually. To create a tokenizedDocument array from manually tokenized text, set the 'TokenizeMethod' option to 'none'.

For more information, see tokenizedDocument.
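For example, the workflow above can be sketched as follows. This is a minimal sketch: the Spanish token strings are illustrative, not from the original page.

```matlab
% Create tokenizedDocument arrays from manually tokenized (pretokenized)
% text by setting 'TokenizeMethod' to 'none'. Each cell contains the
% tokens of one document as a string array.
tokens = {["El" "rápido" "zorro" "marrón"], ...
          ["salta" "sobre" "el" "perro" "perezoso"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');
```

Because 'TokenizeMethod' is 'none', the function performs no further splitting and preserves the tokens exactly as supplied.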

Stop word removal

The stopWords and removeStopWords functions support English, Japanese, German, and Korean stop words only.

To remove stop words from other languages, use removeWords and specify your own stop words to remove.
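For instance, a custom stop-word list can be removed as follows. This is a minimal sketch; the Spanish text and stop-word list are illustrative.

```matlab
% Remove custom stop words from documents in an unsupported language
% by passing your own word list to removeWords.
documents = tokenizedDocument([
    "el gato está en la alfombra"
    "la casa es grande"]);
customStopWords = ["el" "la" "en" "es" "está"];
documents = removeWords(documents,customStopWords);
```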

Sentence detection

The addSentenceDetails function detects sentence boundaries based on punctuation and line-number information. For English and German text, the function also uses a list of abbreviations passed to the function.

For other languages, you might need to specify your own list of abbreviations for sentence detection. To do this, use the 'Abbreviations' option of addSentenceDetails.

For more information, see addSentenceDetails.
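The 'Abbreviations' option can be used like this. This is a minimal sketch; the Spanish sentence and abbreviation list are illustrative.

```matlab
% Supply a custom abbreviation list so that periods after abbreviations
% (here "Sr." and "Sra.") do not end sentences.
str = "Sr. García llegó tarde. La reunión ya había empezado.";
documents = tokenizedDocument(str);
documents = addSentenceDetails(documents,'Abbreviations',["Sr" "Sra"]);
tdetails = tokenDetails(documents);   % SentenceNumber column holds the result
```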

Word clouds

For string input, the wordcloud and wordCloudCounts functions use English, Japanese, German, and Korean tokenization, stop word removal, and word normalization.

For other languages, you might need to manually preprocess your text data and specify unique words and corresponding sizes in wordcloud.

To specify word sizes in wordcloud, input your data as a table or as arrays containing the unique words and corresponding sizes.

For more information, see wordcloud.
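Passing precomputed words and sizes directly can be sketched as follows. The words and counts are illustrative placeholders for the output of your own preprocessing.

```matlab
% After preprocessing text in another language yourself, pass the unique
% words and corresponding sizes (for example, word counts) to wordcloud.
words = ["casa" "gato" "perro" "sol"];
counts = [42 31 27 15];
figure
wordcloud(words,counts);
```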

Word embeddings

File input to the trainWordEmbedding function requires words separated by whitespace.

For files containing non-English text, you might need to input a tokenizedDocument array to trainWordEmbedding.

To create a tokenizedDocument array from pretokenized text, use the tokenizedDocument function and set the 'TokenizeMethod' option to 'none'.

For more information, see trainWordEmbedding.
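Training from a tokenizedDocument array can be sketched as follows. The tokens and parameter values are illustrative; a real embedding needs far more training data.

```matlab
% Train a word embedding from a tokenizedDocument array of pretokenized,
% non-English text instead of a whitespace-separated file.
tokens = {["el" "gato" "duerme"],["el" "perro" "corre"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');
emb = trainWordEmbedding(documents,'Dimension',50,'MinCount',1);
```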

Keyword extraction

The rakeKeywords function supports English, Japanese, German, and Korean text only.

The rakeKeywords function extracts keywords using a delimiter-based approach to identify candidate keywords. By default, the function uses as delimiters the punctuation characters and the stop words given by the stopWords function, with the language given by the language details of the input documents.

For other languages, specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options.

For more information, see rakeKeywords.
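Supplying custom delimiters can be sketched as follows. The Spanish text and delimiter words are illustrative stand-ins for a real stop-word list.

```matlab
% Extract keywords from text in another language by supplying custom
% delimiters (here, a few common Spanish function words).
documents = tokenizedDocument([
    "los modelos de aprendizaje profundo requieren muchos datos"]);
delimiters = ["los" "de" "y" "en"];
tbl = rakeKeywords(documents,'Delimiters',delimiters);
```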

The textrankKeywords function supports English, Japanese, German, and Korean text only.

The textrankKeywords function extracts keywords by identifying candidate keywords based on their part-of-speech tags. The function uses part-of-speech tags given by the addPartOfSpeechDetails function, which supports English, Japanese, German, and Korean text only.

For other languages, try using the rakeKeywords function instead and specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options.

For more information, see textrankKeywords.

Language-Independent Features

Word and N-Gram Counting

The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
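For example, counting words in any language can be sketched as follows; the French text is illustrative.

```matlab
% Build a bag-of-words model directly from a tokenizedDocument array;
% no language-specific processing is required at this stage.
documents = tokenizedDocument([
    "le chat dort"
    "le chien court"]);
bag = bagOfWords(documents);
```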

Modeling and Prediction

The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
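For instance, fitting a topic model to non-English text can be sketched as follows. The Spanish text and the topic count are illustrative; real topic modeling needs a much larger corpus.

```matlab
% Fit an LDA topic model with 2 topics to a bag-of-words model built
% from documents in any language.
documents = tokenizedDocument([
    "el gato come pescado"
    "el perro come carne"
    "los pájaros vuelan alto"]);
bag = bagOfWords(documents);
mdl = fitlda(bag,2,'Verbose',0);
```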

The trainWordEmbedding function supports tokenizedDocument input or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.

References

[1] Unicode Text Segmentation. https://www.unicode.org/reports/tr29/

[3] MeCab: Yet Another Part-of-Speech and Morphological Analyzer. https://taku910.github.io/mecab/

