addPartOfSpeechDetails
Add part-of-speech tags to documents
Syntax
updatedDocuments = addPartOfSpeechDetails(documents)
updatedDocuments = addPartOfSpeechDetails(documents,Name,Value)
Description
Use addPartOfSpeechDetails to add part-of-speech tags to documents. The function supports English, Japanese, German, and Korean text.
updatedDocuments = addPartOfSpeechDetails(documents) detects parts of speech in documents and updates the token details. The function, by default, retokenizes the text for part-of-speech tagging. For example, the function splits the word "you're" into the tokens "you" and "'re". To get the part-of-speech details from updatedDocuments, use tokenDetails.
updatedDocuments = addPartOfSpeechDetails(documents,Name,Value) specifies additional options using one or more name-value pair arguments.
Tip
Use addPartOfSpeechDetails before using the lower, upper, erase, normalizeWords, removeWords, and removeStopWords functions, as addPartOfSpeechDetails uses information that is removed by these functions.
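Following this tip, a minimal sketch of the recommended order of operations (the example sentence is illustrative):

```matlab
documents = tokenizedDocument("The quick brown fox jumps over the lazy dog.");

% Tag first, while the stop words are still present ...
documents = addPartOfSpeechDetails(documents);

% ... then remove stop words. The part-of-speech details are preserved.
documents = removeStopWords(documents);
tdetails = tokenDetails(documents);
```

Tagging before removal matters because the tagger relies on function words and sentence structure that removeStopWords and the other listed functions strip away.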
例子
Add Part-of-Speech Details to Documents
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
View the token details of the first few tokens.
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
      Token       DocumentNumber    LineNumber     Type      Language
    ___________   ______________    __________    _______    ________
    "fairest"           1               1         letters       en
    "creatures"         1               1         letters       en
    "desire"            1               1         letters       en
    "increase"          1               1         letters       en
    "thereby"           1               1         letters       en
    "beautys"           1               1         letters       en
    "rose"              1               1         letters       en
    "might"             1               1         letters       en
Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
      Token       DocumentNumber    SentenceNumber    LineNumber     Type      Language     PartOfSpeech
    ___________   ______________    ______________    __________    _______    ________    ______________
    "fairest"           1                 1               1         letters       en       adjective
    "creatures"         1                 1               1         letters       en       noun
    "desire"            1                 1               1         letters       en       noun
    "increase"          1                 1               1         letters       en       noun
    "thereby"           1                 1               1         letters       en       adverb
    "beautys"           1                 1               1         letters       en       noun
    "rose"              1                 1               1         letters       en       noun
    "might"             1                 1               1         letters       en       auxiliary-verb
Get Part of Speech Details of Japanese Text
Tokenize Japanese text using tokenizedDocument.
str = [
    "恋に悩み、苦しむ。"
    "恋の悩み、苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"
    "駅までは遠くて、歩けない。"
    "遠くの駅まで歩けない。"
    "すもももももももものうち。"];
documents = tokenizedDocument(str);
For Japanese text, you can get the part-of-speech details using tokenDetails. For English text, you must first use addPartOfSpeechDetails.
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×8 table
     Token      DocumentNumber    LineNumber       Type        Language    PartOfSpeech     Lemma        Entity
    ________    ______________    __________    ___________    ________    ____________    ________    __________
    "恋"              1               1          letters          ja       noun            "恋"        non-entity
    "に"              1               1          letters          ja       adposition      "に"        non-entity
    "悩み"            1               1          letters          ja       verb            "悩む"      non-entity
    "、"              1               1          punctuation      ja       punctuation     "、"        non-entity
    "苦しむ"          1               1          letters          ja       verb            "苦しむ"    non-entity
    "。"              1               1          punctuation      ja       punctuation     "。"        non-entity
    "恋"              2               1          letters          ja       noun            "恋"        non-entity
    "の"              2               1          letters          ja       adposition      "の"        non-entity
Get Part of Speech Details of German Text
Tokenize German text using tokenizedDocument.
str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)
documents =
  2x1 tokenizedDocument:

    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .
To get the part-of-speech details for German text, first use addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
To view the part-of-speech details, use the tokenDetails function.
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
     Token      DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech
    ________    ______________    ______________    __________    ___________    ________    ____________
    "Guten"           1                 1               1         letters           de       adjective
    "Morgen"          1                 1               1         letters           de       noun
    "."               1                 1               1         punctuation       de       punctuation
    "Wie"             1                 2               1         letters           de       adverb
    "geht"            1                 2               1         letters           de       verb
    "es"              1                 2               1         letters           de       pronoun
    "dir"             1                 2               1         letters           de       pronoun
    "?"               1                 2               1         punctuation       de       punctuation
Input Arguments
documents — Input documents
tokenizedDocument array
Input documents, specified as a tokenizedDocument array.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'DiscardKnownValues',true specifies to discard previously computed details and recompute them.
RetokenizeMethod
—Method to retokenize documents
'part-of-speech' (default) | 'none'
Method to retokenize documents, specified as one of the following:
'part-of-speech' – Transform the tokens for part-of-speech tagging. The function performs these tasks:
- Split compound words. For example, split the compound word "wanna" into the tokens "want" and "to". This includes compound words containing apostrophes. For example, the function splits the word "don't" into the tokens "do" and "n't".
- Merge periods that do not end sentences with preceding tokens. For example, merge the tokens "Mr" and "." into the token "Mr.".
- For German text, merge abbreviations that span multiple tokens. For example, merge the tokens "z", ".", "B", and "." into the single token "z. B.".
- Merge runs of periods into ellipses. For example, merge three instances of "." into the single token "...".
'none' – Do not retokenize the documents.
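To illustrate the difference between the two options, this sketch tags the same documents with and without retokenization (the sentence is illustrative):

```matlab
documents = tokenizedDocument("Mr Smith says you're fine.");

% Default ('part-of-speech'): retokenizes for tagging, e.g. splits
% "you're" into "you" + "'re" and merges "Mr" + "." into "Mr.".
taggedDefault = addPartOfSpeechDetails(documents);

% 'none': tags the tokens exactly as they appear in documents.
taggedAsIs = addPartOfSpeechDetails(documents,'RetokenizeMethod','none');
```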
Abbreviations
—List of abbreviations
string array | character vector | cell array of character vectors | table
List of abbreviations for sentence detection, specified as a string array, character vector, cell array of character vectors, or a table.
If the input documents do not contain sentence details, then the function first runs the addSentenceDetails function and specifies the abbreviations given by 'Abbreviations'. To specify more sentence detection options (for example, sentence starters), use the addSentenceDetails function before using addPartOfSpeechDetails.
If Abbreviations is a string array, character vector, or cell array of character vectors, then the function treats these as regular abbreviations. If the next word is a capitalized sentence starter, then the function breaks at the trailing period. The function ignores any differences in the letter case of the abbreviations. Specify the sentence starters using the Starters name-value pair.
To specify a different behavior for the abbreviations, specify Abbreviations as a table. The table must have variables named Abbreviation and Usage, where Abbreviation contains the abbreviations, and Usage contains the type of each abbreviation. The following table describes the possible values of Usage, and the behavior of the function when passed abbreviations of these types.
Usage | Behavior | Example Abbreviation | Example Text | Detected Sentences |
---|---|---|---|---|
regular | If the next word is a capitalized sentence starter, then break at the trailing period. Otherwise, do not break at the trailing period. | "appt." | "Book an appt. We'll meet then." | "Book an appt." and "We'll meet then." |
| | | "Book an appt. today." | "Book an appt. today." |
inner | Do not break after the trailing period. | "Dr." | "Dr. Smith." | "Dr. Smith." |
reference | If the next token is not a number, then break at the trailing period. If the next token is a number, then do not break at the trailing period. | "fig." | "See fig. 3." | "See fig. 3." |
| | | "Try a fig. They are nice." | "Try a fig." and "They are nice." |
unit | If the previous word is a number and the following word is a capitalized sentence starter, then break at the trailing period. | "in." | "The height is 30 in. The width is 10 in." | "The height is 30 in." and "The width is 10 in." |
| If the previous word is a number and the following word is not capitalized, then do not break at the trailing period. | | "The item is 10 in. wide." | "The item is 10 in. wide." |
| If the previous word is not a number, then break at a trailing period. | | "Come in. Sit down." | "Come in." and "Sit down." |
The default value is the output of the abbreviations function. For Japanese and Korean text, abbreviations do not usually impact sentence detection.
Tip
By default, the function treats single-letter abbreviations, such as "V.", or tokens with mixed single letters and periods, such as "U.S.A.", as regular abbreviations. You do not need to include these abbreviations in Abbreviations.
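As a sketch of the table form described above (the abbreviation choices and sentence are illustrative, and this assumes the Usage variable accepts the category names as strings):

```matlab
% Treat "fig." as a reference abbreviation and "Dr." as an inner one.
abbreviations = table( ...
    ["fig."; "Dr."], ...
    ["reference"; "inner"], ...
    'VariableNames',{'Abbreviation','Usage'});

documents = tokenizedDocument("Dr. Smith said see fig. 3.");
documents = addPartOfSpeechDetails(documents,'Abbreviations',abbreviations);
```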
Data Types: char | string | table | cell
DiscardKnownValues
—Option to discard previously computed details
false
(default) |true
Option to discard previously computed details and recompute them, specified as true or false.
Data Types: logical
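A minimal sketch of this option, recomputing the tags from scratch on a second call (the sentence is illustrative):

```matlab
documents = tokenizedDocument("Time flies like an arrow.");
documents = addPartOfSpeechDetails(documents);

% Force the function to discard the cached details and tag again.
documents = addPartOfSpeechDetails(documents,'DiscardKnownValues',true);
```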
Output Arguments
updatedDocuments
— Updated documents
tokenizedDocument
array
Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.
More About
Part-of-Speech Tags
The addPartOfSpeechDetails function adds part-of-speech tags to the table returned by the tokenDetails function. The function tags each token with a categorical tag with one of the following class names:
"adjective" – Adjective
"adposition" – Adposition
"adverb" – Adverb
"auxiliary-verb" – Auxiliary verb
"coord-conjunction" – Coordinating conjunction
"determiner" – Determiner
"interjection" – Interjection
"noun" – Noun
"numeral" – Numeral
"particle" – Particle
"pronoun" – Pronoun
"proper-noun" – Proper noun
"punctuation" – Punctuation
"subord-conjunction" – Subordinating conjunction
"symbol" – Symbol
"verb" – Verb
"other" – Other
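Because the PartOfSpeech variable is categorical, you can filter the token details by tag. A minimal sketch (the sentence is illustrative):

```matlab
documents = tokenizedDocument("The quick brown fox jumps over the lazy dog.");
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);

% Keep only the tokens tagged as nouns.
nouns = tdetails.Token(tdetails.PartOfSpeech == "noun");
```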
Algorithms
If the input documents do not contain sentence details, then the function first runs addSentenceDetails.
Version History