
vggish

VGGish neural network

    Syntax

    net = vggish

    Description

    net = vggish returns a pretrained VGGish model.

    This function requires both Audio Toolbox™ and Deep Learning Toolbox™.
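    The typical workflow is to preprocess an audio signal with vggishPreprocess and pass the resulting mel spectrograms to predict, as in the examples below. A minimal sketch (the file name "myAudio.wav" is only a placeholder):

    % Minimal sketch: extract VGGish feature embeddings from an audio file.
    [audioIn,fs] = audioread("myAudio.wav");       % placeholder file name
    spectrograms = vggishPreprocess(audioIn,fs);   % 96-by-64-by-1-by-numFrames array
    net = vggish;
    embeddings = predict(net,spectrograms);        % numFrames-by-128 matrix of embeddings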

    Examples


    Download and unzip the Audio Toolbox™ model for VGGish.

    Type vggish at the Command Window. If the Audio Toolbox model for VGGish is not installed, then the function provides a link to the location of the network weights. To download the model, click the link. Unzip the file to a location on the MATLAB path.

    Alternatively, execute these commands to download and unzip the VGGish model to your temporary directory.

    downloadFolder = fullfile(tempdir,'VGGishDownload');
    loc = websave(downloadFolder,'https://ssd.mathworks.com/supportfiles/audio/vggish.zip');
    VGGishLocation = tempdir;
    unzip(loc,VGGishLocation)
    addpath(fullfile(VGGishLocation,'vggish'))

    Check that the installation is successful by typing vggish at the Command Window. If the network is installed, then the function returns a SeriesNetwork (Deep Learning Toolbox) object.

    vggish
    ans = 
      SeriesNetwork with properties:

             Layers: [24×1 nnet.cnn.layer.Layer]
         InputNames: {'InputBatch'}
        OutputNames: {'regressionoutput'}

    Load a pretrained VGGish convolutional neural network and examine the layers and classes.

    Use vggish to load the pretrained VGGish network. The output net is a SeriesNetwork (Deep Learning Toolbox) object.

    net = vggish
    net = 
      SeriesNetwork with properties:

             Layers: [24×1 nnet.cnn.layer.Layer]
         InputNames: {'InputBatch'}
        OutputNames: {'regressionoutput'}

    View the network architecture using theLayersproperty. The network has 24 layers. There are nine layers with learnable weights, of which six are convolutional layers and three are fully connected layers.

    net.Layers
    ans = 
      24×1 Layer array with layers:

         1   'InputBatch'         Image Input         96×64×1 images
         2   'conv1'              Convolution         64 3×3×1 convolutions with stride [1 1] and padding 'same'
         3   'relu'               ReLU                ReLU
         4   'pool1'              Max Pooling         2×2 max pooling with stride [2 2] and padding 'same'
         5   'conv2'              Convolution         128 3×3×64 convolutions with stride [1 1] and padding 'same'
         6   'relu2'              ReLU                ReLU
         7   'pool2'              Max Pooling         2×2 max pooling with stride [2 2] and padding 'same'
         8   'conv3_1'            Convolution         256 3×3×128 convolutions with stride [1 1] and padding 'same'
         9   'relu3_1'            ReLU                ReLU
        10   'conv3_2'            Convolution         256 3×3×256 convolutions with stride [1 1] and padding 'same'
        11   'relu3_2'            ReLU                ReLU
        12   'pool3'              Max Pooling         2×2 max pooling with stride [2 2] and padding 'same'
        13   'conv4_1'            Convolution         512 3×3×256 convolutions with stride [1 1] and padding 'same'
        14   'relu4_1'            ReLU                ReLU
        15   'conv4_2'            Convolution         512 3×3×512 convolutions with stride [1 1] and padding 'same'
        16   'relu4_2'            ReLU                ReLU
        17   'pool4'              Max Pooling         2×2 max pooling with stride [2 2] and padding 'same'
        18   'fc1_1'              Fully Connected     4096 fully connected layer
        19   'relu5_1'            ReLU                ReLU
        20   'fc1_2'              Fully Connected     4096 fully connected layer
        21   'relu5_2'            ReLU                ReLU
        22   'fc2'                Fully Connected     128 fully connected layer
        23   'EmbeddingBatch'     ReLU                ReLU
        24   'regressionoutput'   Regression Output   mean-squared-error
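    As a quick sanity check of the count above, one possible way (a sketch, not part of the original example) to count the layers with learnable weights programmatically:

    % Count the convolutional and fully connected layers (the layers with learnable weights).
    isLearnable = arrayfun(@(l) isa(l,'nnet.cnn.layer.Convolution2DLayer') || ...
        isa(l,'nnet.cnn.layer.FullyConnectedLayer'),net.Layers);
    nnz(isLearnable)   % expected to return 9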

    Use analyzeNetwork (Deep Learning Toolbox) to visually explore the network.

    analyzeNetwork(net)

    Read in an audio signal to extract feature embeddings from it.

    [audioIn,fs] = audioread("Ambiance-16-44p1-mono-12secs.wav");

    Plot and listen to the audio signal.

    t = (0:numel(audioIn)-1)/fs;
    plot(t,audioIn)
    xlabel("Time (s)")
    ylabel("Amplitude")
    axis tight

    % To play the sound, call soundsc(audioIn,fs)

    VGGish requires you to preprocess the audio signal to match the input format used to train the network. The preprocessing steps include resampling the audio signal and computing an array of mel spectrograms. To learn more about mel spectrograms, see melSpectrogram. Use vggishPreprocess to preprocess the signal and extract the mel spectrograms to be passed to VGGish. Visualize one of these spectrograms chosen at random.

    spectrograms = vggishPreprocess(audioIn,fs);
    arbitrarySpect = spectrograms(:,:,1,randi(size(spectrograms,4)));
    surf(arbitrarySpect,EdgeColor="none")
    view(90,-90)
    xlabel("Mel Band")
    ylabel("Frame")
    title("Mel Spectrogram for VGGish")
    axis tight

    Create a VGGish neural network. Using the vggish function requires installing the pretrained VGGish network. If the network is not installed, the function provides a link to download the pretrained model.

    net = vggish;

    Call predict with the network on the preprocessed mel spectrogram images to extract feature embeddings. The feature embeddings are returned as a numFrames-by-128 matrix, where numFrames is the number of individual spectrograms and 128 is the number of elements in each feature vector.

    features = predict(net,spectrograms);
    [numFrames,numFeatures] = size(features)
    numFrames = 24
    numFeatures = 128
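    If you need a single embedding per recording rather than one per frame, a common approach (a sketch, not part of the original example) is to average the embeddings across frames:

    % Collapse the per-frame embeddings into one 128-element clip-level embedding.
    clipEmbedding = mean(features,1);   % 1-by-128 vector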

    Visualize the VGGish feature embeddings.

    surf(features,EdgeColor="none")
    view([90 -90])
    xlabel("Feature")
    ylabel("Frame")
    title("VGGish Feature Embeddings")
    axis tight

    In this example, you transfer the learning in the VGGish regression model to an audio classification task.

    Download and unzip the environmental sound classification data set. This data set consists of recordings labeled as one of 10 different audio sound classes (ESC-10).

    url = 'http://ssd.mathworks.com/supportfiles/audio/ESC-10.zip';
    downloadFolder = fullfile(tempdir,'ESC-10');
    datasetLocation = tempdir;

    if ~exist(fullfile(tempdir,'ESC-10'),'dir')
        loc = websave(downloadFolder,url);
        unzip(loc,fullfile(tempdir,'ESC-10'))
    end

    Create an audioDatastore object to manage the data and split it into train and validation sets. Call countEachLabel to display the distribution of sound classes and the number of unique labels.

    ads = audioDatastore(downloadFolder,'IncludeSubfolders',true,'LabelSource','foldernames');
    labelTable = countEachLabel(ads)
    labelTable = 10×2 table

             Label          Count
        ______________      _____

        chainsaw              40 
        clock_tick            40 
        crackling_fire        40 
        crying_baby           40 
        dog                   40 
        helicopter            40 
        rain                  40 
        rooster               38 
        sea_waves             40 
        sneezing              40 

    Determine the total number of classes.

    numClasses = size(labelTable,1);

    Call splitEachLabel to split the data set into training and validation sets. Inspect the distribution of labels in the training and validation sets.

    [adsTrain, adsValidation] = splitEachLabel(ads,0.8);
    countEachLabel(adsTrain)

    ans = 10×2 table

             Label          Count
        ______________      _____

        chainsaw              32 
        clock_tick            32 
        crackling_fire        32 
        crying_baby           32 
        dog                   32 
        helicopter            32 
        rain                  32 
        rooster               30 
        sea_waves             32 
        sneezing              32 
    countEachLabel(adsValidation)
    ans = 10×2 table

             Label          Count
        ______________      _____

        chainsaw              8 
        clock_tick            8 
        crackling_fire        8 
        crying_baby           8 
        dog                   8 
        helicopter            8 
        rain                  8 
        rooster               8 
        sea_waves             8 
        sneezing              8 

    The VGGish network expects audio to be preprocessed into log mel spectrograms. The supporting function vggishPreprocess takes an audioDatastore object and the overlap percentage between log mel spectrograms as input, and returns matrices of predictors and responses suitable as input to the VGGish network.

    overlapPercentage = 75;
    [trainFeatures,trainLabels] = vggishPreprocess(adsTrain,overlapPercentage);
    [validationFeatures,validationLabels,segmentsPerFile] = vggishPreprocess(adsValidation,overlapPercentage);

    Load the VGGish model and convert it to a layerGraph (Deep Learning Toolbox) object.

    net = vggish;
    lgraph = layerGraph(net.Layers);

    Use removeLayers (Deep Learning Toolbox) to remove the final regression output layer from the graph. After you remove the regression layer, the new final layer of the graph is a ReLU layer named 'EmbeddingBatch'.

    lgraph = removeLayers(lgraph,'regressionoutput');
    lgraph.Layers(end)

    ans = 
      ReLULayer with properties:

        Name: 'EmbeddingBatch'

    Use addLayers (Deep Learning Toolbox) to add a fullyConnectedLayer (Deep Learning Toolbox), a softmaxLayer (Deep Learning Toolbox), and a classificationLayer (Deep Learning Toolbox) to the graph.

    lgraph = addLayers(lgraph,fullyConnectedLayer(numClasses,'Name','FCFinal'));
    lgraph = addLayers(lgraph,softmaxLayer('Name','softmax'));
    lgraph = addLayers(lgraph,classificationLayer('Name','classOut'));

    Use connectLayers (Deep Learning Toolbox) to append the fully connected, softmax, and classification layers to the layer graph.

    lgraph = connectLayers(lgraph,'EmbeddingBatch','FCFinal');
    lgraph = connectLayers(lgraph,'FCFinal','softmax');
    lgraph = connectLayers(lgraph,'softmax','classOut');
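    Before training, you can optionally confirm that the new classification head is attached as expected, for example by inspecting the final layers of the graph or by calling analyzeNetwork on it (a quick sanity check, not part of the original example).

    % The graph should now end with FCFinal -> softmax -> classOut.
    lgraph.Layers(end-2:end)
    % analyzeNetwork(lgraph)   % optional visual inspection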

    To define training options, use trainingOptions (Deep Learning Toolbox).

    miniBatchSize = 128;
    options = trainingOptions('adam', ...
        'MaxEpochs',5, ...
        'MiniBatchSize',miniBatchSize, ...
        'Shuffle','every-epoch', ...
        'ValidationData',{validationFeatures,validationLabels}, ...
        'ValidationFrequency',50, ...
        'LearnRateSchedule','piecewise', ...
        'LearnRateDropFactor',0.5, ...
        'LearnRateDropPeriod',2);

    To train the network, use trainNetwork (Deep Learning Toolbox).

    [trainedNet, netInfo] = trainNetwork(trainFeatures,trainLabels,lgraph,options);
    Training on single GPU.
    |======================================================================================================================|
    |  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Validation  |  Mini-batch  |  Validation  |  Base Learning  |
    |         |             |   (hh:mm:ss)   |   Accuracy   |   Accuracy   |     Loss     |     Loss     |      Rate       |
    |======================================================================================================================|
    |       1 |           1 |       00:00:00 |       10.94% |       26.03% |       2.2253 |       2.0317 |          0.0010 |
    |       2 |          50 |       00:00:05 |       93.75% |       83.75% |       0.1884 |       0.7001 |          0.0010 |
    |       3 |         100 |       00:00:10 |       96.88% |       80.07% |       0.1150 |       0.7838 |          0.0005 |
    |       4 |         150 |       00:00:15 |       92.97% |       81.99% |       0.1656 |       0.7612 |          0.0005 |
    |       5 |         200 |       00:00:20 |       92.19% |       79.04% |       0.1738 |       0.8192 |          0.0003 |
    |       5 |         210 |       00:00:21 |       95.31% |       80.15% |       0.1389 |       0.8581 |          0.0003 |
    |======================================================================================================================|

    Each audio file was split into several segments to feed into the VGGish network. Combine the predictions for each file in the validation set using a majority-rule decision.

    validationPredictions = classify(trainedNet,validationFeatures);

    idx = 1;
    validationPredictionsPerFile = categorical;
    for ii = 1:numel(adsValidation.Files)
        validationPredictionsPerFile(ii,1) = mode(validationPredictions(idx:idx+segmentsPerFile(ii)-1));
        idx = idx + segmentsPerFile(ii);
    end

    Use confusionchart (Deep Learning Toolbox) to evaluate the performance of the network on the validation set.

    figure('Units','normalized','Position',[0.2 0.2 0.5 0.5]);
    cm = confusionchart(adsValidation.Labels,validationPredictionsPerFile);
    cm.Title = sprintf('Confusion Matrix for Validation Data \nAccuracy = %0.2f %%',mean(validationPredictionsPerFile==adsValidation.Labels)*100);
    cm.ColumnSummary = 'column-normalized';
    cm.RowSummary = 'row-normalized';

    Supporting Functions

    function [predictor,response,segmentsPerFile] = vggishPreprocess(ads,overlap)
    % This function is for example purposes and may be changed or removed in a
    % future release.

    % Create filter bank
    FFTLength = 512;
    numBands = 64;
    fs0 = 16e3;
    filterBank = designAuditoryFilterBank(fs0, ...
        'FrequencyScale','mel', ...
        'FFTLength',FFTLength, ...
        'FrequencyRange',[125 7500], ...
        'NumBands',numBands, ...
        'Normalization','none', ...
        'FilterBankDesignDomain','warped');

    % Define STFT parameters
    windowLength = 0.025 * fs0;
    hopLength = 0.01 * fs0;
    win = hann(windowLength,'periodic');

    % Define spectrogram segmentation parameters
    segmentDuration = 0.96; % seconds
    segmentRate = 100; % Hz
    segmentLength = segmentDuration*segmentRate; % Number of spectrums per auditory spectrogram
    segmentHopDuration = (100-overlap) * segmentDuration / 100; % Duration (s) advanced between auditory spectrograms
    segmentHopLength = round(segmentHopDuration * segmentRate); % Number of spectrums advanced between auditory spectrograms

    % Preallocate cell arrays for the predictors and responses
    numFiles = numel(ads.Files);
    predictor = cell(numFiles,1);
    response = predictor;
    segmentsPerFile = zeros(numFiles,1);

    % Extract predictors and responses for each file
    for ii = 1:numFiles
        [audioIn,info] = read(ads);

        x = single(resample(audioIn,fs0,info.SampleRate));
        Y = stft(x, ...
            'Window',win, ...
            'OverlapLength',windowLength-hopLength, ...
            'FFTLength',FFTLength, ...
            'FrequencyRange','onesided');
        Y = abs(Y);

        logMelSpectrogram = log(filterBank*Y + single(0.01))';

        % Segment log-mel spectrogram
        numHops = floor((size(Y,2)-segmentLength)/segmentHopLength) + 1;
        segmentedLogMelSpectrogram = zeros(segmentLength,numBands,1,numHops);
        for hop = 1:numHops
            segmentedLogMelSpectrogram(:,:,1,hop) = logMelSpectrogram(1+segmentHopLength*(hop-1):segmentLength+segmentHopLength*(hop-1),:);
        end

        predictor{ii} = segmentedLogMelSpectrogram;
        response{ii} = repelem(info.Label,numHops);
        segmentsPerFile(ii) = numHops;
    end

    % Concatenate predictors and responses into arrays
    predictor = cat(4,predictor{:});
    response = cat(2,response{:});
    end

    Output Arguments


    Pretrained VGGish neural network, returned as a SeriesNetwork (Deep Learning Toolbox) object.

    References

    [1] Gemmeke, Jort F., Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events.” In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776–80. New Orleans, LA: IEEE. https://doi.org/10.1109/ICASSP.2017.7952261.

    [2] Hershey, Shawn, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, et al. 2017. “CNN Architectures for Large-Scale Audio Classification.” In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 131–35. New Orleans, LA: IEEE. https://doi.org/10.1109/ICASSP.2017.7952132.


    Version History

    Introduced in R2020b