
Language Considerations

Text Analytics Toolbox™ supports the languages English, Japanese, German, and Korean. Most Text Analytics Toolbox functions also work with text in other languages. The following sections summarize how to use Text Analytics Toolbox features with other languages.

Tokenization

The tokenizedDocument function has built-in rules for English, Japanese, German, and Korean only. For English and German text, the 'unicode' tokenization method of tokenizedDocument detects tokens using rules based on Unicode® Standard Annex #29 [1] and the ICU tokenizer [2], modified to better detect complex tokens such as hashtags and URLs. For Japanese and Korean text, the 'mecab' tokenization method detects tokens using rules based on the MeCab tokenizer [3].

For other languages, you can still try using tokenizedDocument. If tokenizedDocument does not produce useful results, then try tokenizing the text manually. To create a tokenizedDocument array from manually tokenized text, set the 'TokenizeMethod' option to 'none'.

For more information, see tokenizedDocument.
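For example, the workflow above can be sketched as follows. This is a minimal sketch: the Spanish token strings are illustrative, not from the original page.

```matlab
% Create tokenizedDocument arrays from manually tokenized (pretokenized)
% text by setting 'TokenizeMethod' to 'none'. Each cell contains the
% tokens of one document as a string array.
tokens = {["El" "rápido" "zorro" "marrón"], ...
          ["salta" "sobre" "el" "perro" "perezoso"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');
```

Because 'TokenizeMethod' is 'none', the function performs no further splitting and preserves the tokens exactly as supplied.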

Stop word removal

The stopWords and removeStopWords functions support English, Japanese, German, and Korean stop words only.

To remove stop words from other languages, use removeWords and specify your own stop words to remove.
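For instance, a custom stop-word list can be removed as follows. This is a minimal sketch; the Spanish text and stop-word list are illustrative.

```matlab
% Remove custom stop words from documents in an unsupported language
% by passing your own word list to removeWords.
documents = tokenizedDocument([
    "el gato está en la alfombra"
    "la casa es grande"]);
customStopWords = ["el" "la" "en" "es" "está"];
documents = removeWords(documents,customStopWords);
```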

Sentence detection

The addSentenceDetails function detects sentence boundaries based on punctuation and line-number information. For English and German text, the function also uses a list of abbreviations passed to the function.

For other languages, you might need to specify your own list of abbreviations for sentence detection. To do this, use the 'Abbreviations' option of addSentenceDetails.

For more information, see addSentenceDetails.
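The 'Abbreviations' option can be used like this. This is a minimal sketch; the Spanish sentence and abbreviation list are illustrative.

```matlab
% Supply a custom abbreviation list so that periods after abbreviations
% (here "Sr." and "Sra.") do not end sentences.
str = "Sr. García llegó tarde. La reunión ya había empezado.";
documents = tokenizedDocument(str);
documents = addSentenceDetails(documents,'Abbreviations',["Sr" "Sra"]);
tdetails = tokenDetails(documents);   % SentenceNumber column holds the result
```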

Word clouds

For string input, the wordcloud and wordCloudCounts functions use English, Japanese, German, and Korean tokenization, stop word removal, and word normalization.

For other languages, you might need to manually preprocess your text data and specify unique words and corresponding sizes in wordcloud.

To specify word sizes in wordcloud, input your data as a table or as arrays containing the unique words and corresponding sizes.

For more information, see wordcloud.
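Passing precomputed words and sizes directly can be sketched as follows. The words and counts are illustrative placeholders for the output of your own preprocessing.

```matlab
% After preprocessing text in another language yourself, pass the unique
% words and corresponding sizes (for example, word counts) to wordcloud.
words = ["casa" "gato" "perro" "sol"];
counts = [42 31 27 15];
figure
wordcloud(words,counts);
```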

Word embeddings

File input to the trainWordEmbedding function requires words separated by whitespace.

For files containing non-English text, you might need to input a tokenizedDocument array to trainWordEmbedding.

To create a tokenizedDocument array from pretokenized text, use the tokenizedDocument function and set the 'TokenizeMethod' option to 'none'.

For more information, see trainWordEmbedding.
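Training from a tokenizedDocument array can be sketched as follows. The tokens and parameter values are illustrative; a real embedding needs far more training data.

```matlab
% Train a word embedding from a tokenizedDocument array of pretokenized,
% non-English text instead of a whitespace-separated file.
tokens = {["el" "gato" "duerme"],["el" "perro" "corre"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');
emb = trainWordEmbedding(documents,'Dimension',50,'MinCount',1);
```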

Keyword extraction

The rakeKeywords function supports English, Japanese, German, and Korean text only.

The rakeKeywords function extracts keywords using a delimiter-based approach to identify candidate keywords. By default, the function uses as delimiters the punctuation characters and the stop words given by the stopWords function, with the language given by the language details of the input documents.

For other languages, specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options.

For more information, see rakeKeywords.
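Supplying custom delimiters can be sketched as follows. The Spanish text and delimiter words are illustrative stand-ins for a real stop-word list.

```matlab
% Extract keywords from text in another language by supplying custom
% delimiters (here, a few common Spanish function words).
documents = tokenizedDocument([
    "los modelos de aprendizaje profundo requieren muchos datos"]);
delimiters = ["los" "de" "y" "en"];
tbl = rakeKeywords(documents,'Delimiters',delimiters);
```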

The textrankKeywords function supports English, Japanese, German, and Korean text only.

The textrankKeywords function extracts keywords by identifying candidate keywords based on their part-of-speech tags. The function uses part-of-speech tags given by the addPartOfSpeechDetails function, which supports English, Japanese, German, and Korean text only.

For other languages, try using the rakeKeywords function instead and specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options.

For more information, see textrankKeywords.

Language-Independent Features

Word and N-Gram Counting

The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
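For example, counting words in any language can be sketched as follows; the French text is illustrative.

```matlab
% Build a bag-of-words model directly from a tokenizedDocument array;
% no language-specific processing is required at this stage.
documents = tokenizedDocument([
    "le chat dort"
    "le chien court"]);
bag = bagOfWords(documents);
```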

Modeling and Prediction

The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
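For instance, fitting a topic model to non-English text can be sketched as follows. The Spanish text and the topic count are illustrative; real topic modeling needs a much larger corpus.

```matlab
% Fit an LDA topic model with 2 topics to a bag-of-words model built
% from documents in any language.
documents = tokenizedDocument([
    "el gato come pescado"
    "el perro come carne"
    "los pájaros vuelan alto"]);
bag = bagOfWords(documents);
mdl = fitlda(bag,2,'Verbose',0);
```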

The trainWordEmbedding function supports tokenizedDocument input or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.

References

[1] Unicode Text Segmentation. https://www.unicode.org/reports/tr29/

[3] MeCab: Yet Another Part-of-Speech and Morphological Analyzer. https://taku910.github.io/mecab/

