
Speech Emotion Recognition

This example illustrates a simple speech emotion recognition (SER) system using a BiLSTM network. You begin by downloading the data set and then testing the trained network on individual files. The network was trained on a small German-language database [1].

The example then walks you through training the network, including downloading and augmenting the data set. Finally, you perform leave-one-speaker-out (LOSO) 10-fold cross validation to evaluate the network architecture.

The features used in this example were chosen using sequential feature selection, similar to the method described in Sequential Feature Selection for Audio Features (Audio Toolbox).

Download Data Set

Download the Berlin Database of Emotional Speech (Emo-DB) [1]. The database contains 535 utterances spoken by 10 actors intended to convey one of the following emotions: anger, boredom, disgust, anxiety/fear, happiness, sadness, or neutral. The emotions are text independent.

dataFolder = tempdir;
dataset = fullfile(dataFolder,"Emo-DB");
if ~datasetExists(dataset)
    url = "http://emodb.bilderbar.info/download/download.zip";
    disp("Downloading Emo-DB (40.5 MB) ...")
    unzip(url,dataset)
end

Downloading Emo-DB (40.5 MB) ...

Create an audioDatastore (Audio Toolbox) that points to the audio files.

ads = audioDatastore(fullfile(dataset,"wav"));

The file names are codes indicating the speaker ID, text spoken, emotion, and version. The website contains a key for interpreting the codes, as well as additional information about the speakers such as gender and age. Create a table with the variables Speaker and Emotion, and decode the file names into the table.

filepaths = ads.Files;
emotionCodes = cellfun(@(x)x(end-5),filepaths,UniformOutput=false);
emotions = replace(emotionCodes,["W","L","E","A","F","T","N"], ...
    ["Anger","Boredom","Disgust","Anxiety/Fear","Happiness","Sadness","Neutral"]);

speakerCodes = cellfun(@(x)x(end-10:end-9),filepaths,UniformOutput=false);

labelTable = cell2table([speakerCodes,emotions],VariableNames=["Speaker","Emotion"]);
labelTable.Emotion = categorical(labelTable.Emotion);
labelTable.Speaker = categorical(labelTable.Speaker);
summary(labelTable)
Variables:

    Speaker: 535×1 categorical
        Values:
            03     49
            08     58
            09     43
            10     38
            11     55
            12     35
            13     61
            14     69
            15     56
            16     71

    Emotion: 535×1 categorical
        Values:
            Anger           127
            Anxiety/Fear     69
            Boredom          81
            Disgust          46
            Happiness        71
            Neutral          79
            Sadness          62
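
As an illustration of the indexing used above, consider a file named 03a01Wa.wav, which follows the Emo-DB naming convention (this specific name is illustrative). Counting from the end of the name makes the code independent of the folder path:

% Decode one Emo-DB-style file name (illustrative example).
fn = '03a01Wa.wav';
speakerCode = fn(end-10:end-9)   % '03' -> speaker 03
emotionCode = fn(end-5)          % 'W'  -> "Anger"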

labelTable is in the same order as the files in the audioDatastore. Set the Labels property of the audioDatastore to labelTable.

ads.Labels = labelTable;

Perform Speech Emotion Recognition

Download and load the pretrained network, the audioFeatureExtractor (Audio Toolbox) object used to train the network, and the normalization factors for the features. The network was trained using all speakers in the data set except speaker 03.

downloadFolder = matlab.internal.examples.downloadSupportFile("audio","SpeechEmotionRecognition.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
netFolder = fullfile(dataFolder,"SpeechEmotionRecognition");
load(fullfile(netFolder,"network_Audio_SER.mat"));

The sample rate set on the audioFeatureExtractor corresponds to the sample rate of the data set.

fs = afe.SampleRate;

Select a speaker and emotion, then subset the datastore to include only the chosen speaker and emotion. Read from the datastore and listen to the file.

speaker = categorical("03");
emotion = categorical("Disgust");

adsSubset = subset(ads,ads.Labels.Speaker==speaker & ads.Labels.Emotion==emotion);

audio = read(adsSubset);
sound(audio,fs)

Use the audioFeatureExtractor object to extract the features and then transpose them so that time is along rows. Normalize the features and then convert them to 20-element sequences with 10-element overlap, which corresponds to approximately 600 ms windows with 300 ms overlap. Use the supporting function HelperFeatureVector2Sequence to convert the array of feature vectors to sequences.

features = (extract(afe,audio))';

featuresNormalized = (features - normalizers.Mean)./normalizers.StandardDeviation;

numOverlap = 10;
featureSequences = HelperFeatureVector2Sequence(featuresNormalized,20,numOverlap);
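
As a quick sanity check on those durations (this check is an addition to the example), you can derive them from the extractor settings. It assumes the 30 ms, zero-overlap analysis window used elsewhere in this example, so the hop between feature vectors equals the window duration:

% Each feature vector covers one analysis window; with zero overlap the
% hop between feature vectors equals the window duration (~30 ms).
hopDuration = (numel(afe.Window) - afe.OverlapLength)/fs
sequenceDuration = 20*hopDuration        % ~0.6 s per 20-vector sequence
overlapDuration = numOverlap*hopDuration % ~0.3 s overlap between sequences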

Feed the feature sequences into the network for prediction. Compute the average prediction and plot the probability distribution of the chosen emotions as a pie chart. You can try different speakers, emotions, sequence overlaps, and prediction averages to test the network's performance. To get a realistic approximation of the network's performance, use speaker 03, which the network was not trained on.

YPred = double(predict(net,featureSequences));

average = "mode";
switch average
    case "mean"
        probs = mean(YPred,1);
    case "median"
        probs = median(YPred,1);
    case "mode"
        probs = mode(YPred,1);
end

pie(probs./sum(probs),string(net.Layers(end).Classes))
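
To reduce the pie chart to a single predicted label, you can take the class with the highest averaged score (a small addition, not part of the original example):

% The emotion with the highest averaged score is the predicted label.
[~,idx] = max(probs);
predictedEmotion = string(net.Layers(end).Classes(idx))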

The remainder of the example illustrates how to train and validate the network.

Train Network

The 10-fold cross validation accuracy of a first attempt at training was about 60% because of insufficient training data. A model trained on insufficient data overfits some folds and underfits others. To improve the overall fit, increase the size of the data set using audioDataAugmenter (Audio Toolbox). 50 augmentations per file was chosen empirically as a good tradeoff between processing time and accuracy improvement. You can decrease the number of augmentations to speed up the example.

Create an audioDataAugmenter object. Set the probability of applying pitch shift to 0.5 and use the default range. Set the probability of applying time shift to 1 and use a range of [-0.3,0.3] seconds. Set the probability of adding noise to 1 and specify the SNR range as [-20,40] dB.

numAugmentations = 50;
augmenter = audioDataAugmenter(NumAugmentations=numAugmentations, ...
    TimeStretchProbability=0, ...
    VolumeControlProbability=0, ...
    PitchShiftProbability=0.5, ...
    TimeShiftProbability=1, ...
    TimeShiftRange=[-0.3,0.3], ...
    AddNoiseProbability=1, ...
    SNRRange=[-20,40]);
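
If you want to hear what these settings produce before augmenting the whole database, you can preview the augmentations for a single file. This preview is illustrative and not part of the original pipeline:

% Preview: augment one file and listen to the first augmented result.
reset(ads)
x = read(ads);
data = augment(augmenter,x,fs);   % table with Audio and AugmentationInfo
sound(data.Audio{1},fs)
reset(ads)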

Create a new folder in the current folder to hold the augmented data set.

currentDir = pwd;
writeDirectory = fullfile(currentDir,"augmentedData");
mkdir(writeDirectory)

For each file in the audio datastore:

  1. Create 50 augmentations.

  2. Normalize the audio to have a maximum absolute value of 1.

  3. Write the augmented audio data as a WAV file. Append _augK to each file name, where K is the augmentation number. To speed up processing, use parfor and partition the datastore.

This method of augmenting the database is time consuming and space consuming. However, when iterating on choosing a network architecture or feature extraction pipeline, this up-front cost is generally advantageous.

N = numel(ads.Files)*numAugmentations;

reset(ads)

numPartitions = 18;

tic
parfor ii = 1:numPartitions
    adsPart = partition(ads,numPartitions,ii);
    while hasdata(adsPart)
        [x,adsInfo] = read(adsPart);
        data = augment(augmenter,x,fs);
        [~,fn] = fileparts(adsInfo.FileName);
        for i = 1:size(data,1)
            augmentedAudio = data.Audio{i};
            augmentedAudio = augmentedAudio/max(abs(augmentedAudio),[],"all");
            augNum = num2str(i);
            if numel(augNum)==1
                iString = ['0',augNum];
            else
                iString = augNum;
            end
            audiowrite(fullfile(writeDirectory,sprintf('%s_aug%s.wav',fn,iString)),augmentedAudio,fs);
        end
    end
end
disp("Augmentation complete in " + round(toc/60,2) + " minutes.")

Augmentation complete in 3.84 minutes.

Create an audio datastore that points to the augmented data set. Replicate the rows of the label table of the original datastore NumAugmentations times to determine the labels of the augmented datastore.

adsAug = audioDatastore(writeDirectory);
adsAug.Labels = repelem(ads.Labels,augmenter.NumAugmentations,1);

Create an audioFeatureExtractor (Audio Toolbox) object. Set Window to a periodic 30 ms Hamming window, OverlapLength to 0, and SampleRate to the sample rate of the database. Set gtcc, gtccDelta, mfccDelta, and spectralCrest to true to extract them. Set SpectralDescriptorInput to melSpectrum so that spectralCrest is calculated for the mel spectrum.

win = hamming(round(0.03*fs),"periodic");
overlapLength = 0;

afe = audioFeatureExtractor( ...
    Window=win, ...
    OverlapLength=overlapLength, ...
    SampleRate=fs, ...
    gtcc=true, ...
    gtccDelta=true, ...
    mfccDelta=true, ...
    SpectralDescriptorInput="melSpectrum", ...
    spectralCrest=true);
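
If you want to confirm how many features this configuration extracts per analysis window, you can query the extractor. This check is an addition to the example; the same property sizes the network's input layer later on:

% Number of features per analysis window for this configuration.
afe.FeatureVectorLength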

Train for Deployment

To train for deployment, use all available speakers in the data set. Set the training datastore to the augmented datastore.

adsTrain = adsAug;

Convert the training audio datastore to a tall array. If you have Parallel Computing Toolbox™, the extraction is automatically parallelized. If you do not have Parallel Computing Toolbox™, the code continues to run.

tallTrain = tall(adsTrain);

Extract the training features and reorient the features so that time is along rows to be compatible with sequenceInputLayer.

featuresTallTrain = cellfun(@(x)extract(afe,x),tallTrain,UniformOutput=false);
featuresTallTrain = cellfun(@(x)x',featuresTallTrain,UniformOutput=false);
featuresTrain = gather(featuresTallTrain);
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 1 min 7 sec
Evaluation completed in 1 min 7 sec

Use the training set to determine the mean and standard deviation of each feature.

allFeatures = cat(2,featuresTrain{:});
M = mean(allFeatures,2,"omitnan");
S = std(allFeatures,0,2,"omitnan");

featuresTrain = cellfun(@(x)(x-M)./S,featuresTrain,UniformOutput=false);
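
As an optional check (not part of the original example), the aggregated normalized features should now have approximately zero mean and unit standard deviation:

% Verify normalization: these values should be close to 0.
checkFeatures = cat(2,featuresTrain{:});
max(abs(mean(checkFeatures,2,"omitnan")))      % ~0
max(abs(std(checkFeatures,0,2,"omitnan") - 1)) % ~0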

Buffer the feature vectors into sequences so that each sequence consists of 20 feature vectors with overlaps of 10 feature vectors.

featureVectorsPerSequence = 20;
featureVectorOverlap = 10;
[sequencesTrain,sequencePerFileTrain] = HelperFeatureVector2Sequence(featuresTrain,featureVectorsPerSequence,featureVectorOverlap);
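
The number of sequences produced per file follows directly from the hop length of 20 − 10 = 10 feature vectors. Here is a worked example with a hypothetical feature-vector count:

% A file with N feature vectors yields floor((N - 20)/10) + 1 sequences,
% matching the logic in HelperFeatureVector2Sequence. N = 57 is hypothetical.
N = 57;
hopLength = featureVectorsPerSequence - featureVectorOverlap;
numSequences = floor((N - featureVectorsPerSequence)/hopLength) + 1   % = 4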

Replicate the labels of the training and validation sets so that they are in one-to-one correspondence with the sequences. Not all speakers have utterances for all emotions. Create an empty categorical array that contains all the emotion categories and append it to the validation labels so that the categorical array contains all emotions.

labelsTrain = repelem(adsTrain.Labels.Emotion,[sequencePerFileTrain{:}]);

emptyEmotions = ads.Labels.Emotion;
emptyEmotions(:) = [];
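
The empty categorical keeps the full category list even though it has no elements, which is what makes the append trick work. You can confirm this with an optional check:

% emptyEmotions has zero elements but retains all seven emotion categories.
numel(emptyEmotions)        % 0
categories(emptyEmotions)   % all seven emotions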

Define a BiLSTM network using bilstmLayer. Place a dropoutLayer before and after the bilstmLayer to help prevent overfitting.

dropoutProb1 = 0.3;
numUnits = 200;
dropoutProb2 = 0.6;
layers = [ ...
    sequenceInputLayer(afe.FeatureVectorLength)
    dropoutLayer(dropoutProb1)
    bilstmLayer(numUnits,OutputMode="last")
    dropoutLayer(dropoutProb2)
    fullyConnectedLayer(numel(categories(emptyEmotions)))
    softmaxLayer
    classificationLayer];

Define training options using trainingOptions.

miniBatchSize = 512;
initialLearnRate = 0.005;
learnRateDropPeriod = 2;
maxEpochs = 3;
options = trainingOptions("adam", ...
    MiniBatchSize=miniBatchSize, ...
    InitialLearnRate=initialLearnRate, ...
    LearnRateDropPeriod=learnRateDropPeriod, ...
    LearnRateSchedule="piecewise", ...
    MaxEpochs=maxEpochs, ...
    Shuffle="every-epoch", ...
    Verbose=false, ...
    Plots="training-progress");

Train the network using trainNetwork.

net = trainNetwork(sequencesTrain,labelsTrain,layers,options);

To save the network, the configured audioFeatureExtractor, and the normalization factors, set saveSERSystem to true.

saveSERSystem = false;
if saveSERSystem
    normalizers.Mean = M;
    normalizers.StandardDeviation = S;
    save("network_Audio_SER.mat","net","afe","normalizers")
end
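
A minimal sketch of how the saved system could be reloaded for inference on a new audio signal x recorded at the data set's sample rate. This mirrors the prediction steps earlier in this example; the variable x is assumed, and the field names match what is saved above:

% Reload the saved SER system and classify one signal (sketch).
s = load("network_Audio_SER.mat");   % contains net, afe, and normalizers
feats = (extract(s.afe,x))';
feats = (feats - s.normalizers.Mean)./s.normalizers.StandardDeviation;
seqs = HelperFeatureVector2Sequence(feats,20,10);
probs = mode(double(predict(s.net,seqs)),1);
[~,idx] = max(probs);
predictedEmotion = string(s.net.Layers(end).Classes(idx))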

Train for System Validation

To provide an accurate assessment of the model you create in this example, train and validate using leave-one-speaker-out (LOSO) k-fold cross validation. In this method, you train using k − 1 speakers and then validate on the left-out speaker. You repeat this procedure k times. The final validation accuracy is the average of the k folds.
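
The core of each fold's split is a simple speaker mask, as implemented in the supporting function HelperTrainAndValidateNetwork at the end of this example. For instance, the fold that holds out speaker 03 (illustrative):

% Hold out one speaker for validation; train on the remaining nine.
idxValidation = ads.Labels.Speaker == categorical("03");
idxTrain = ~idxValidation;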

Create a variable that contains the speaker IDs. Determine the number of folds: one for each speaker. The database contains utterances from 10 unique speakers. Use summary to display the speaker IDs (left column) and the number of utterances they contribute to the database (right column).

speaker = ads.Labels.Speaker;
numFolds = numel(categories(speaker));
summary(speaker)
     03     49
     08     58
     09     43
     10     38
     11     55
     12     35
     13     61
     14     69
     15     56
     16     71

The supporting function HelperTrainAndValidateNetwork performs the steps outlined above for all 10 folds and returns the true and predicted labels for each fold. Call HelperTrainAndValidateNetwork with the audioDatastore, the augmented audioDatastore, and the audioFeatureExtractor as inputs.

[labelsTrue,labelsPred] = HelperTrainAndValidateNetwork(ads,adsAug,afe);

Print the per-fold accuracy and plot the 10-fold confusion chart.

for ii = 1:numel(labelsTrue)
    foldAcc = mean(labelsTrue{ii}==labelsPred{ii})*100;
    disp("Fold " + ii + ", Accuracy = " + round(foldAcc,2))
end
Fold 1, Accuracy = 65.31
Fold 2, Accuracy = 68.97
Fold 3, Accuracy = 79.07
Fold 4, Accuracy = 71.05
Fold 5, Accuracy = 72.73
Fold 6, Accuracy = 74.29
Fold 7, Accuracy = 67.21
Fold 8, Accuracy = 85.51
Fold 9, Accuracy = 71.43
Fold 10, Accuracy = 67.61
labelsTrueMat = cat(1,labelsTrue{:});
labelsPredMat = cat(1,labelsPred{:});

figure
cm = confusionchart(labelsTrueMat,labelsPredMat, ...
    Title=["Confusion Matrix of 10-Fold Cross-Validation","Average Accuracy = " + round(mean(labelsTrueMat==labelsPredMat)*100,1)], ...
    ColumnSummary="column-normalized",RowSummary="row-normalized");
sortClasses(cm,categories(emptyEmotions))

Supporting Functions

Convert Array of Feature Vectors to Sequences

function [sequences,sequencePerFile] = HelperFeatureVector2Sequence(features,featureVectorsPerSequence,featureVectorOverlap)
% Copyright 2019 MathWorks, Inc.
if featureVectorsPerSequence <= featureVectorOverlap
    error("The number of overlapping feature vectors must be less than the number of feature vectors per sequence.")
end

if ~iscell(features)
    features = {features};
end
hopLength = featureVectorsPerSequence - featureVectorOverlap;
idx1 = 1;
sequences = {};
sequencePerFile = cell(numel(features),1);
for ii = 1:numel(features)
    sequencePerFile{ii} = floor((size(features{ii},2) - featureVectorsPerSequence)/hopLength) + 1;
    idx2 = 1;
    for j = 1:sequencePerFile{ii}
        sequences{idx1,1} = features{ii}(:,idx2:idx2 + featureVectorsPerSequence - 1); %#ok<AGROW>
        idx1 = idx1 + 1;
        idx2 = idx2 + hopLength;
    end
end
end

Train and Validate Network

function [trueLabelsCrossFold,predictedLabelsCrossFold] = HelperTrainAndValidateNetwork(varargin)
% Copyright 2019 The MathWorks, Inc.
if nargin == 3
    ads = varargin{1};
    augads = varargin{2};
    extractor = varargin{3};
elseif nargin == 2
    ads = varargin{1};
    augads = varargin{1};
    extractor = varargin{2};
end
speaker = categories(ads.Labels.Speaker);
numFolds = numel(speaker);
emptyEmotions = ads.Labels.Emotion;
emptyEmotions(:) = [];

% Loop over each fold.
trueLabelsCrossFold = {};
predictedLabelsCrossFold = {};
for i = 1:numFolds

    % 1. Divide the audio datastore into training and validation sets.
    % Convert the data to tall arrays.
    idxTrain = augads.Labels.Speaker~=speaker(i);
    augadsTrain = subset(augads,idxTrain);
    augadsTrain.Labels = augadsTrain.Labels.Emotion;
    tallTrain = tall(augadsTrain);
    idxValidation = ads.Labels.Speaker==speaker(i);
    adsValidation = subset(ads,idxValidation);
    adsValidation.Labels = adsValidation.Labels.Emotion;
    tallValidation = tall(adsValidation);

    % 2. Extract features from the training set. Reorient the features
    % so that time is along rows to be compatible with sequenceInputLayer.
    tallTrain = cellfun(@(x)x/max(abs(x),[],"all"),tallTrain,UniformOutput=false);
    tallFeaturesTrain = cellfun(@(x)extract(extractor,x),tallTrain,UniformOutput=false);
    tallFeaturesTrain = cellfun(@(x)x',tallFeaturesTrain,UniformOutput=false); %#ok<NASGU>
    [~,featuresTrain] = evalc('gather(tallFeaturesTrain)'); % Use evalc to suppress command-line output.
    tallValidation = cellfun(@(x)x/max(abs(x),[],"all"),tallValidation,UniformOutput=false);
    tallFeaturesValidation = cellfun(@(x)extract(extractor,x),tallValidation,UniformOutput=false);
    tallFeaturesValidation = cellfun(@(x)x',tallFeaturesValidation,UniformOutput=false); %#ok<NASGU>
    [~,featuresValidation] = evalc('gather(tallFeaturesValidation)'); % Use evalc to suppress command-line output.

    % 3. Use the training set to determine the mean and standard
    % deviation of each feature. Normalize the training and validation sets.
    allFeatures = cat(2,featuresTrain{:});
    M = mean(allFeatures,2,"omitnan");
    S = std(allFeatures,0,2,"omitnan");
    featuresTrain = cellfun(@(x)(x-M)./S,featuresTrain,UniformOutput=false);
    for ii = 1:numel(featuresTrain)
        idx = find(isnan(featuresTrain{ii}));
        if ~isempty(idx)
            featuresTrain{ii}(idx) = 0;
        end
    end
    featuresValidation = cellfun(@(x)(x-M)./S,featuresValidation,UniformOutput=false);
    for ii = 1:numel(featuresValidation)
        idx = find(isnan(featuresValidation{ii}));
        if ~isempty(idx)
            featuresValidation{ii}(idx) = 0;
        end
    end

    % 4. Buffer the sequences so that each sequence consists of 20
    % feature vectors with overlaps of 10 feature vectors.
    featureVectorsPerSequence = 20;
    featureVectorOverlap = 10;
    [sequencesTrain,sequencePerFileTrain] = HelperFeatureVector2Sequence(featuresTrain,featureVectorsPerSequence,featureVectorOverlap);
    [sequencesValidation,sequencePerFileValidation] = HelperFeatureVector2Sequence(featuresValidation,featureVectorsPerSequence,featureVectorOverlap);

    % 5. Replicate the labels of the training and validation sets so
    % that they are in one-to-one correspondence with the sequences.
    labelsTrain = [emptyEmotions;augadsTrain.Labels];
    labelsTrain = labelsTrain(:);
    labelsTrain = repelem(labelsTrain,[sequencePerFileTrain{:}]);

    % 6. Define a BiLSTM network.
    dropoutProb1 = 0.3;
    numUnits = 200;
    dropoutProb2 = 0.6;
    layers = [ ...
        sequenceInputLayer(size(sequencesTrain{1},1))
        dropoutLayer(dropoutProb1)
        bilstmLayer(numUnits,OutputMode="last")
        dropoutLayer(dropoutProb2)
        fullyConnectedLayer(numel(categories(emptyEmotions)))
        softmaxLayer
        classificationLayer];

    % 7. Define training options.
    miniBatchSize = 512;
    initialLearnRate = 0.005;
    learnRateDropPeriod = 2;
    maxEpochs = 3;
    options = trainingOptions("adam", ...
        MiniBatchSize=miniBatchSize, ...
        InitialLearnRate=initialLearnRate, ...
        LearnRateDropPeriod=learnRateDropPeriod, ...
        LearnRateSchedule="piecewise", ...
        MaxEpochs=maxEpochs, ...
        Shuffle="every-epoch", ...
        Verbose=false);

    % 8. Train the network.
    net = trainNetwork(sequencesTrain,labelsTrain,layers,options);

    % 9. Evaluate the network. Call classify to get the predicted labels
    % for each sequence. Take the mode of the predicted labels of each
    % sequence to get the predicted label of each file.
    predictedLabelsPerSequence = classify(net,sequencesValidation);
    trueLabels = categorical(adsValidation.Labels);
    predictedLabels = trueLabels;
    idx1 = 1;
    for ii = 1:numel(trueLabels)
        predictedLabels(ii,:) = mode(predictedLabelsPerSequence(idx1:idx1 + sequencePerFileValidation{ii} - 1,:),1);
        idx1 = idx1 + sequencePerFileValidation{ii};
    end
    trueLabelsCrossFold{i} = trueLabels; %#ok<AGROW>
    predictedLabelsCrossFold{i} = predictedLabels; %#ok<AGROW>
end
end

References

[1] Burkhardt, F., A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss. "A Database of German Emotional Speech." In Proceedings of Interspeech 2005. Lisbon, Portugal: International Speech Communication Association, 2005.
