German Language Support

This topic summarizes the Text Analytics Toolbox™ features that support German text. For an example showing how to analyze German text data, see Analyze German Text Data.

Tokenization

The tokenizedDocument function automatically detects German input. Alternatively, set the 'Language' option in tokenizedDocument to 'de'. This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.
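If you prefer not to rely on automatic detection, you can set the language explicitly and then confirm what was recorded for each token. The following is a minimal sketch; the sample sentence and variable names are illustrative and do not come from the examples in this topic.

```matlab
% Illustrative German text (not from the examples in this topic).
str = "Das Wetter ist heute schön.";

% Specify German explicitly instead of relying on automatic detection.
documents = tokenizedDocument(str,'Language','de');

% Inspect the language recorded for each token.
tdetails = tokenDetails(documents);
unique(tdetails.Language)
```

Setting the option explicitly can be useful when the downstream functions listed above should treat every token as German regardless of the input.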

Tokenize German Text

Tokenize German text using tokenizedDocument. The function automatically detects German text.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)

documents =
  2x1 tokenizedDocument:
    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .

Sentence Detection

To detect sentence structure in documents, use the addSentenceDetails function. You can use the abbreviations function to help create custom lists of abbreviations to detect.

Add Sentence Details to German Documents

Tokenize German text using tokenizedDocument.

str = [
    "Guten Morgen, Dr. Schmidt. Geht es Ihnen wieder besser?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str);

Add sentence details to the documents using addSentenceDetails. This function adds the sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails,10)

ans=10×6 table
      Token      DocumentNumber    SentenceNumber    LineNumber       Type        Language
    _________    ______________    ______________    __________    ___________    ________
    "Guten"            1                 1               1         letters           de
    "Morgen"           1                 1               1         letters           de
    ","                1                 1               1         punctuation       de
    "Dr"               1                 1               1         letters           de
    "."                1                 1               1         punctuation       de
    "Schmidt"          1                 1               1         letters           de
    "."                1                 1               1         punctuation       de
    "Geht"             1                 2               1         letters           de
    "es"               1                 2               1         letters           de
    "Ihnen"            1                 2               1         letters           de

Table of German Abbreviations

View a table of German abbreviations. Use this table to help create custom tables of abbreviations for sentence detection when using addSentenceDetails.

tbl = abbreviations('Language','de');
head(tbl)

ans=8×2 table
    Abbreviation     Usage
    ____________    _______
    "A.T"           regular
    "ABl"           regular
    "Abb"           regular
    "Abdr"          regular
    "Abf"           regular
    "Abfl"          regular
    "Abh"           regular
    "Abk"           regular
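As a sketch of how a custom list might be built from this table, the following code appends one extra entry to the built-in German abbreviations and passes the result to addSentenceDetails through its 'Abbreviations' option. The added abbreviation "Stk" and the variable name customAbbreviations are hypothetical and chosen only for illustration.

```matlab
str = "Guten Morgen, Dr. Schmidt. Geht es Ihnen wieder besser?";
documents = tokenizedDocument(str,'Language','de');

% Start from the built-in German abbreviations and append a custom entry.
% "Stk" is a hypothetical addition chosen for illustration.
tbl = abbreviations('Language','de');
customAbbreviations = [tbl.Abbreviation; "Stk"];

% Use the custom list when detecting sentence boundaries.
documents = addSentenceDetails(documents,'Abbreviations',customAbbreviations);
tdetails = tokenDetails(documents);
head(tdetails)
```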

Part of Speech Details

To add German part-of-speech details to documents, use the addPartOfSpeechDetails function.

Get Part of Speech Details of German Text

Tokenize German text using tokenizedDocument.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)

documents =
  2x1 tokenizedDocument:
    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .

To get the part-of-speech details for German text, first use addPartOfSpeechDetails.

documents = addPartOfSpeechDetails(documents);

To view the part-of-speech details, use the tokenDetails function.

tdetails = tokenDetails(documents);
head(tdetails)

ans=8×7 table
     Token      DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech
    ________    ______________    ______________    __________    ___________    ________    ____________
    "Guten"           1                 1               1         letters           de       adjective
    "Morgen"          1                 1               1         letters           de       noun
    "."               1                 1               1         punctuation       de       punctuation
    "Wie"             1                 2               1         letters           de       adverb
    "geht"            1                 2               1         letters           de       verb
    "es"              1                 2               1         letters           de       pronoun
    "dir"             1                 2               1         letters           de       pronoun
    "?"               1                 2               1         punctuation       de       punctuation
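Because the part-of-speech tags are stored as a table variable, you can filter tokens by tag with ordinary logical indexing. The following sketch extracts the tokens tagged as nouns; the variable names idx and nouns are illustrative.

```matlab
str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str,'Language','de');
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);

% Keep only the tokens whose part-of-speech tag is "noun".
idx = tdetails.PartOfSpeech == "noun";
nouns = tdetails.Token(idx)
```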

Named Entity Recognition

To add entity tags to documents, use the addEntityDetails function.

Add Named Entity Tags to German Text

Tokenize German text using tokenizedDocument.

str = [
    "Ernst zog von Frankfurt nach Berlin."
    "Besuchen Sie Volkswagen in Wolfsburg."];
documents = tokenizedDocument(str);

To add entity tags to German text, use the addEntityDetails function. This function detects person names, locations, organizations, and other named entities.

documents = addEntityDetails(documents);

To view the entity details, use the tokenDetails function.

tdetails = tokenDetails(documents);
head(tdetails)

ans=8×8 table
       Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech      Entity
    ___________    ______________    ______________    __________    ___________    ________    ____________    __________
    "Ernst"              1                 1               1         letters           de       proper-noun     person
    "zog"                1                 1               1         letters           de       verb            non-entity
    "von"                1                 1               1         letters           de       adposition      non-entity
    "Frankfurt"          1                 1               1         letters           de       proper-noun     location
    "nach"               1                 1               1         letters           de       adposition      non-entity
    "Berlin"             1                 1               1         letters           de       proper-noun     location
    "."                  1                 1               1         punctuation       de       punctuation     non-entity
    "Besuchen"           2                 1               1         letters           de       verb            non-entity

View the words tagged with entity "person", "location", "organization", or "other". These words are the words not tagged with "non-entity".

idx = tdetails.Entity ~= "non-entity";
tdetails(idx,:)

ans=5×8 table
       Token        DocumentNumber    SentenceNumber    LineNumber     Type      Language    PartOfSpeech       Entity
    ____________    ______________    ______________    __________    _______    ________    ____________    ____________
    "Ernst"               1                 1               1         letters       de       proper-noun     person
    "Frankfurt"           1                 1               1         letters       de       proper-noun     location
    "Berlin"              1                 1               1         letters       de       proper-noun     location
    "Volkswagen"          2                 1               1         letters       de       noun            organization
    "Wolfsburg"           2                 1               1         letters       de       proper-noun     location
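To get a quick overview of how many tokens received each tag, one option is to group the token details by the Entity variable. This is a sketch that assumes the general-purpose groupsummary function is an acceptable way to count rows here; the variable name counts is illustrative.

```matlab
str = [
    "Ernst zog von Frankfurt nach Berlin."
    "Besuchen Sie Volkswagen in Wolfsburg."];
documents = tokenizedDocument(str,'Language','de');
documents = addEntityDetails(documents);
tdetails = tokenDetails(documents);

% Count the number of tokens per entity tag.
counts = groupsummary(tdetails,"Entity")
```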

Stop Words

To remove stop words from documents according to the token language details, use removeStopWords. For a list of German stop words, set the 'Language' option in stopWords to 'de'.

Remove German Stop Words from Documents

Tokenize German text using tokenizedDocument. The function automatically detects German text.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)

documents =
  2x1 tokenizedDocument:
    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .

Remove stop words using the removeStopWords function. The function uses the language details from documents to determine which language stop words to remove.

documents = removeStopWords(documents)

documents =
  2x1 tokenizedDocument:
    5 tokens: Guten Morgen . geht ?
    5 tokens: Heute wird guter Tag .
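If the built-in list does not match your needs, one possible approach is to retrieve the German list with stopWords, extend it, and remove the combined list with removeWords. In this sketch, the extra word "guten" and the variable name customWords are hypothetical and chosen only for illustration.

```matlab
% Retrieve the German stop word list as a starting point.
words = stopWords('Language','de');

% "guten" is a hypothetical extra word to remove, chosen for illustration.
customWords = [words(:); "guten"];

str = "Guten Morgen. Wie geht es dir?";
documents = tokenizedDocument(str,'Language','de');

% Remove the combined list, ignoring letter case.
documents = removeWords(documents,customWords,'IgnoreCase',true)
```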

Stemming

To stem tokens according to the token language details, use normalizeWords.

Stem German Text

Tokenize German text using the tokenizedDocument function. The function automatically detects German text.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str);

Stem the tokens using normalizeWords.

documents = normalizeWords(documents)

documents =
  2x1 tokenizedDocument:
    8 tokens: gut morg . wie geht es dir ?
    6 tokens: heut wird ein gut tag .
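The functions described in the preceding sections can be chained into a single preprocessing sequence. The following is one possible ordering, shown as a sketch rather than a recommended pipeline; the choice and order of steps are assumptions, and the 'Style' option is set explicitly only to make the stemming step visible.

```matlab
str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];

% Tokenize, stating the language explicitly.
documents = tokenizedDocument(str,'Language','de');

% Remove German stop words based on the token language details.
documents = removeStopWords(documents);

% Stem the remaining tokens.
documents = normalizeWords(documents,'Style','stem');

% Remove punctuation tokens.
documents = erasePunctuation(documents)
```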

Language-Independent Features

Word and N-Gram Counting

The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
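As a brief sketch, the following code builds both models from a small German tokenizedDocument array. The sample text is reused from the earlier examples; the variable names bag and bagNgrams are illustrative.

```matlab
str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str,'Language','de');

% Count individual words and list the most frequent ones.
bag = bagOfWords(documents);
topkwords(bag,5)

% Count bigrams (n-grams of length 2).
bagNgrams = bagOfNgrams(documents,'NgramLengths',2)
```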

Modeling and Prediction

The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.

The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.