
wordEncoding

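Word encoding model to map words to indices and back

Description

A word encoding maps words in a vocabulary to numeric indices.

To encode documents as matrices of word or n-gram counts, use encode.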
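As a minimal sketch of this mapping, the following code builds an encoding from a short word list and converts words to indices and back. The three-word vocabulary is made up for illustration, and the code assumes that Text Analytics Toolbox is installed.

```matlab
% Illustrative vocabulary (an assumption, not taken from the examples on this page).
words = ["rose" "love" "beauty"];

% Build the encoding directly from the word list.
enc = wordEncoding(words);

% Map two words to their indices, then map the indices back to words.
idx = word2ind(enc,["beauty" "rose"]);
roundTrip = ind2word(enc,idx);
```

The round trip through word2ind and ind2word should return the words that went in, which is a quick way to check that an encoding behaves as expected.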

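Creation

Description

enc = wordEncoding(documents) creates a word encoding from the words in the input documents.

enc = wordEncoding(words) creates a word encoding from an array of words.

enc = wordEncoding(documents,Name,Value) specifies additional options using one or more name-value pair arguments. For example, 'Order','frequency' assigns lower indices to more frequent words.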

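Input Arguments

documents: Input documents, specified as a tokenizedDocument array.

words: Input words, specified as a string vector, a character vector, or a cell array of character vectors. If you specify words as a character vector, then the function treats the argument as a single word.

Data Types: string | char | cell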
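The single-word rule for character vectors can be sketched as follows. The words are made up for illustration, and the code assumes that Text Analytics Toolbox is installed.

```matlab
% A character vector is treated as one word, not as a list of characters.
encChar = wordEncoding('rose');

% A cell array of character vectors contributes one word per cell.
encCell = wordEncoding({'fair','rose'});

% Compare the vocabulary sizes of the two encodings.
numChar = encChar.NumWords;
numCell = encCell.NumWords;
```

To pass several words, use a string array or a cell array rather than one long character vector.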

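Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Order','frequency' sorts the indices in descending order of total word frequency in the documents.

Order: Index order, specified as the comma-separated pair consisting of 'Order' and one of the following:

  • 'first-seen' - Assign indices to words in the order in which they first appear in the documents.

  • 'frequency' - Assign indices to words sorted in descending order of total frequency in the documents.

If 'Order' is 'frequency' and multiple words have the same frequency, then the function does not assign indices to those words in any particular order.

MaxNumWords: Maximum number of words to encode, specified as a positive integer or Inf. The function first sorts the indices according to the 'Order' option and then encodes the top MaxNumWords words. If MaxNumWords is Inf, then the function encodes all words in the input documents.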
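The interaction between 'Order' and 'MaxNumWords' can be sketched on two tiny documents. The document text is made up for illustration and chosen so that no two of the retained words share a frequency. The code assumes that Text Analytics Toolbox is installed.

```matlab
% Two tiny documents with made-up text (an assumption for illustration).
% Total word counts: rose 1, love 3, thee 4, beauty 1.
documents = tokenizedDocument([
    "rose love love thee thee thee"
    "thee love beauty"]);

% Indices follow the order in which the words first appear.
encFirst = wordEncoding(documents,'Order','first-seen');

% Indices follow descending total frequency, keeping only the top two words.
encFreq = wordEncoding(documents,'Order','frequency','MaxNumWords',2);

firstSeenVocabulary = encFirst.Vocabulary;
frequencyVocabulary = encFreq.Vocabulary;
```

With these counts, the frequency-ordered encoding keeps only "thee" and "love", in that order, while the first-seen encoding keeps all four words in order of appearance.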

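Properties

NumWords: Number of unique words in the model, specified as a nonnegative integer.

Vocabulary: Unique words in the model, specified as a string vector.

Data Types: string

Object Functions

ind2word            Map encoding index to word
word2ind            Map word to encoding index
isVocabularyWord    Test if word is member of word embedding or encoding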
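As a rough sketch of how these three functions fit together, the following code filters a list of candidate words down to those in the vocabulary before mapping them to indices. The vocabulary and the candidate words are made up for illustration, and the code assumes that Text Analytics Toolbox is installed.

```matlab
% Illustrative vocabulary and candidate words (assumptions for this sketch).
enc = wordEncoding(["rose" "love" "beauty"]);
candidates = ["love" "winter" "rose"];   % "winter" is not in the vocabulary

% Logical mask marking which candidates the encoding knows.
tf = isVocabularyWord(enc,candidates);

% Keep the known words, map them to indices, and map the indices back.
known = candidates(tf);
idx = word2ind(enc,known);
words = ind2word(enc,idx);
```

Checking membership first keeps the lookup restricted to words that are in the vocabulary.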

Examples

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

    filename = "sonnetsPreprocessed.txt";
    str = extractFileText(filename);
    textData = split(str,newline);
    documents = tokenizedDocument(textData);
    documents(1:10)

    ans = 
      10x1 tokenizedDocument:

        70 tokens: fairest creatures desire …
        71 tokens: …
        … tokens: …
        … tokens: …
        … tokens: … distilld though winter meet leese show substance still lives sweet
        68 tokens: let winters ragged hand deface thee thy summer ere thou distilld make sweet vial treasure thou place beautys treasure ere selfkilld forbidden usury happies pay willing loan thats thy self breed another thee ten times happier ten ten times thy self happier thou art ten thine ten times refigurd thee death thou shouldst depart leaving thee living posterity selfwilld thou art fair deaths conquest make worms thine heir
        64 tokens: lo orient gracious light lifts up burning head eye doth homage newappearing sight serving looks sacred majesty climbd steepup heavenly hill resembling strong youth middle age yet mortal looks adore beauty still attending golden pilgrimage highmost pitch weary car like feeble age reeleth day eyes fore duteous converted low tract look another way thou thyself outgoing thy noon unlookd diest unless thou get son
        70 tokens: music hear why hearst thou music sadly sweets sweets war joy delights joy why lovst thou thou receivst gladly else receivst pleasure thine annoy true concord welltuned sounds unions married offend thine ear sweetly chide thee confounds singleness parts thou shouldst bear mark string sweet husband another strikes mutual ordering resembling sire child happy mother pleasing note sing whose speechless song many seeming sings thee thou single wilt prove none
        70 tokens: fear wet widows eye thou consumst thy self single life ah thou issueless shalt hap die world wail thee like makeless wife world thy widow still weep thou form thee hast left behind every private widow well keep childrens eyes husbands shape mind look unthrift world doth spend shifts place still world enjoys beautys waste hath world end kept unused user destroys love toward others bosom sits murdrous shame commits
        69 tokens: shame deny thou bearst love thy self art unprovident grant thou wilt thou art belovd many thou none lovst evident thou art possessd murderous hate gainst thy self thou stickst conspire seeking beauteous roof ruinate repair thy chief desire o change thy thought change mind shall hate fairer lodgd gentle love thy presence gracious kind thyself least kindhearted prove make thee another self love beauty still live thine thee

Create a word encoding.

    enc = wordEncoding(documents)

    enc = 
      wordEncoding with properties:

          NumWords: 3092
        Vocabulary: ["fairest"    "creatures"    "desire"    …]

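To generate a word encoding from a word embedding, pass the word embedding vocabulary to wordEncoding as a list of words.

Load a pretrained word embedding.

    emb = fastTextWordEmbedding;

Extract the vocabulary.

    words = emb.Vocabulary;

Create a word encoding using the vocabulary.

    enc = wordEncoding(words)

    enc = 
      wordEncoding with properties:

          NumWords: 999994

To initialize the corresponding word embedding layer in a deep learning network with the word embedding weights, use the word2vec function to extract the layer weights and set the 'Weights' name-value pair of the wordEmbeddingLayer function. The word embedding layer expects columns of word vectors, so you must transpose the output of the word2vec function.

    dimension = emb.Dimension;
    numWords = numel(words);
    layer = wordEmbeddingLayer(dimension,numWords,...
        'Weights',word2vec(emb,words)')

    layer = 
      WordEmbeddingLayer with properties:

             Name: ''

       Hyperparameters
        Dimension: 300
         NumWords: 999994

       Learnable Parameters
          Weights: [300×999994 single]

      Show all properties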

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

    filename = "sonnetsPreprocessed.txt";
    str = extractFileText(filename);
    textData = split(str,newline);
    documents = tokenizedDocument(textData);

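Create a word encoding. Order the indices by frequency and encode only the top 100 words.

    enc = wordEncoding(documents,...
        'Order','frequency',...
        'MaxNumWords',100)

    enc = 
      wordEncoding with properties:

          NumWords: 100
        Vocabulary: ["thy"    "thou"    "love"    "thee"    "doth"    …]

View the words corresponding to the indices 1, 2, and 3 using the ind2word function.

    idx = [1 2 3];
    words = ind2word(enc,idx)

    words = 1x3 string
        "thy"    "thou"    "love"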

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

    filename = "sonnetsPreprocessed.txt";
    str = extractFileText(filename);
    textData = split(str,newline);
    documents = tokenizedDocument(textData);

Create a word encoding.

    enc = wordEncoding(documents)

    enc = 
      wordEncoding with properties:

          NumWords: 3092
        Vocabulary: ["fairest"    "creatures"    "desire"    …]

View the words corresponding to the indices 1, 3, and 5 using the ind2word function.

    idx = [1 3 5];
    words = ind2word(enc,idx)

    words = 1x3 string
        "fairest"    "desire"    "thereby"

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

    filename = "sonnetsPreprocessed.txt";
    str = extractFileText(filename);
    textData = split(str,newline);
    documents = tokenizedDocument(textData);

Create a word encoding.

    enc = wordEncoding(documents)

    enc = 
      wordEncoding with properties:

          NumWords: 3092
        Vocabulary: ["fairest"    "creatures"    "desire"    …]

Map the words "rose", "love", and "beauty" to encoding indices using the word2ind function.

    words = ["rose" "love" "beauty"];
    idx = word2ind(enc,words)

    idx = 1×3
         7   387    79

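Load the factory reports data and create a tokenizedDocument array.

    filename = "factoryReports.csv";
    data = readtable(filename,'TextType','string');
    textData = data.Description;
    documents = tokenizedDocument(textData);

Create a word encoding.

    enc = wordEncoding(documents);

Convert the documents to sequences of word indices.

    sequences = doc2sequence(enc,documents);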

View the sizes of the first 10 sequences. Each sequence is a 1-by-S vector, where S is the number of word indices in the sequence. Because the sequences are padded, S is constant.

    sequences(1:10)

    ans = 10×1 cell array
        {[0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 10]}
        {[0 0 0 0 0 0 2 16 17 18 19 11 12 13 14 15 10]}
        {[0 0 0 0 0 0 20 2 7 7 21 22 23 24 25 26 10]}
        {[0 0 0 0 0 0 0 0 0 0 0 27 28 6 7 18 10]}
        {[0 0 0 0 0 0 0 0 0 0 0 0 29 30 7 31 10]}
        {[0 0 0 0 0 0 0 32 33 6 7 34 35 36 37 38 10]}
        {[0 0 0 0 0 0 0 0 0 39 40 36 41 6 7 42 10]}
        {[0 0 0 0 0 0 0 0 43 44 22 45 46 47 7 48 10]}
        {[0 0 0 0 0 0 0 0 0 0 49 50 17 7 51 48 10]}
        {[0 0 0 0 52 8 53 36 54 55 56 57 58 59 22 60 10]}
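The padding behavior is easier to see on a very small input. The following sketch uses two made-up documents of different lengths and checks that the resulting sequences have the same length. The document text is an assumption for illustration, and the code assumes that Text Analytics Toolbox is installed.

```matlab
% Two made-up documents of different lengths (2 tokens and 4 tokens).
documents = tokenizedDocument([
    "rose love"
    "love thee beauty rose"]);

% With the 'first-seen' order, the indices are: rose 1, love 2, thee 3, beauty 4.
enc = wordEncoding(documents,'Order','first-seen');

% Convert each document to a sequence of word indices.
sequences = doc2sequence(enc,documents);

% Number of elements in each sequence, including padding.
sequenceLengths = cellfun(@numel,sequences);
```

Here the shorter document is padded with zeros on the left, as in the factory reports output, so both sequences contain four elements.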

Introduced in R2018b