Audio Processing Using Deep Learning

Extend deep learning workflows with audio and speech processing applications

Apply deep learning to audio and speech processing applications by using Deep Learning Toolbox™ together with Audio Toolbox™. For signal processing applications, see Signal Processing Using Deep Learning. For applications in wireless communications, see Wireless Communications Using Deep Learning.

Apps

Audio Labeler

Define and visualize ground-truth labels

Functions

Data Management and Augmentation

`audioDatastore`	Datastore for collection of audio files
`audioDataAugmenter`	Augment audio data

Feature Extraction

`audioFeatureExtractor`	Streamline audio feature extraction
`ivectorSystem`	Create i-vector system
`openl3Features`	Extract OpenL3 features
`pitchnn`	Estimate pitch with deep learning neural network
`vggishFeatures`	Extract VGGish features

Pretrained Networks

`classifySound`	Classify sounds in audio signal
`crepe`	CREPE neural network
`crepePreprocess`	Preprocess audio for CREPE deep learning network
`crepePostprocess`	Postprocess output of CREPE deep learning network
`openl3`	OpenL3 neural network
`openl3Features`	Extract OpenL3 features
`openl3Preprocess`	Preprocess audio for OpenL3 feature extraction
`pitchnn`	Estimate pitch with deep learning neural network
`vggish`	VGGish neural network
`vggishFeatures`	Extract VGGish features
`vggishPreprocess`	Preprocess audio for VGGish feature extraction
`yamnet`	YAMNet neural network
`yamnetGraph`	Graph of YAMNet AudioSet ontology
`yamnetPreprocess`	Preprocess audio for YAMNet classification

Topics

Introduction to Deep Learning for Audio Applications (Audio Toolbox)

Learn common tools and workflows to apply deep learning to audio applications.

Classify Sound Using Deep Learning (Audio Toolbox)

Train, validate, and test a simple long short-term memory (LSTM) network to classify sounds.

Transfer Learning with Pretrained Audio Networks

Use transfer learning to retrain YAMNet, a pretrained convolutional neural network (CNN), to classify a new set of audio signals.

Speaker Identification Using Custom SincNet Layer and Deep Learning

Perform speaker identification using a custom deep learning layer that implements a mel-scale filter bank.

Dereverberate Speech Using Deep Learning Networks

Train a deep learning model that removes reverberation from speech.

Speech Command Recognition in Simulink

Detect the presence of speech commands in audio using a Simulink® model.

Spoken Digit Recognition with Wavelet Scattering and Deep Learning

This example shows how to classify spoken digits using both machine learning and deep learning techniques.

Cocktail Party Source Separation Using Deep Learning Networks

This example shows how to isolate a speech signal using a deep learning network.

Sequential Feature Selection for Audio Features

This example shows a typical workflow for feature selection applied to the task of spoken digit recognition.

Learn Pre-Emphasis Filter Using Deep Learning

Use a convolutional deep network to learn a pre-emphasis filter for speech recognition.

Featured Examples

Speaker Recognition Using x-vectors

Develop an x-vector system to perform speaker recognition.

Speaker Diarization Using x-vectors

Speaker diarization is the process of partitioning an audio signal into segments according to speaker identity. It answers the question "who spoke when" without prior knowledge of the speakers and, depending on the application, without prior knowledge of the number of speakers.
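
The "who spoke when" output can be pictured as a list of time segments. The following is a minimal Python sketch, not the toolbox workflow: it assumes that some earlier step (for example, clustering of x-vectors) has already assigned a speaker label to each analysis frame, and it only merges runs of identical labels into segments. The function name `frames_to_segments`, the labels, and the 0.5-second hop are hypothetical.

```python
def frames_to_segments(frame_labels, hop_seconds):
    """Merge runs of identical per-frame speaker labels into
    (start_time, end_time, speaker) segments."""
    segments = []
    if not frame_labels:
        return segments
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((start * hop_seconds, i * hop_seconds, frame_labels[start]))
            start = i
    return segments


# Hypothetical per-frame labels, one label every 0.5 seconds.
labels = ["A", "A", "A", "B", "B", "A"]
for start_time, end_time, speaker in frames_to_segments(labels, hop_seconds=0.5):
    print(f"{start_time:4.1f}s to {end_time:4.1f}s: speaker {speaker}")
```

Because the labels are arbitrary cluster identifiers, the sketch does not need to know the speakers, or how many there are, in advance.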

Train Spoken Digit Recognition Network Using Out-of-Memory Audio Data

Train a spoken digit recognition network on out-of-memory audio data using a transformed datastore. In this example, you apply a random pitch shift to the audio data used to train a convolutional neural network (CNN). For each training iteration, the audio data is augmented using the `audioDataAugmenter` object, and then features are extracted using the `audioFeatureExtractor` object. The workflow in this example applies to any random data augmentation used in a training loop. The workflow also applies when the underlying audio data set or training features do not fit in memory.
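
The idea of redrawing a random augmentation on every read can be sketched outside the toolbox. The Python example below is a conceptual illustration only and does not use `audioDataAugmenter` or a datastore: the names `sine`, `resample_shift`, and `augmented_reader` are hypothetical, the data is synthetic, and the pitch change is done by plain resampling, which also changes the signal duration (a real pitch shifter preserves it).

```python
import math
import random


def sine(freq_hz, duration_s, fs):
    """Synthetic test tone standing in for an audio file."""
    n = int(duration_s * fs)
    return [math.sin(2 * math.pi * freq_hz * k / fs) for k in range(n)]


def resample_shift(signal, semitones):
    """Shift pitch by resampling with linear interpolation.
    Simplification: this also changes the duration of the signal."""
    ratio = 2.0 ** (semitones / 12.0)  # frequency ratio for the given shift
    out_len = int(len(signal) / ratio)
    out = []
    for k in range(out_len):
        pos = k * ratio
        i = int(pos)
        frac = pos - i
        nxt = signal[min(i + 1, len(signal) - 1)]
        out.append((1 - frac) * signal[i] + frac * nxt)
    return out


def augmented_reader(signals, semitone_range, rng):
    """Yield (augmented_signal, semitones), drawing a new random shift per read."""
    low, high = semitone_range
    for signal in signals:
        semitones = rng.uniform(low, high)
        yield resample_shift(signal, semitones), semitones


fs = 8000
dataset = [sine(440.0, 0.05, fs), sine(220.0, 0.05, fs)]
rng = random.Random(0)
for epoch in range(2):
    # Each pass over the data sees differently augmented signals.
    for augmented, semitones in augmented_reader(dataset, (-2.0, 2.0), rng):
        print(f"epoch {epoch}: shift {semitones:+.2f} semitones, {len(augmented)} samples")
```

Because the augmented signals are produced lazily and never stored, memory use stays bounded by one signal at a time, which is the property the transformed-datastore workflow relies on.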

Train Spoken Digit Recognition Network Using Out-of-Memory Features

Train a spoken digit recognition network on out-of-memory auditory spectrograms using a transformed datastore. In this example, you extract auditory spectrograms from audio using `audioDatastore` and `audioFeatureExtractor`, and you write them to disk. You then use a `signalDatastore` to access the features during training. The workflow is useful when the training features do not fit in memory. Because you extract the features only once, the workflow also saves time when you iterate on the deep learning model design.
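
The extract-once, read-many pattern can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not the toolbox workflow: frame energy stands in for the auditory spectrogram, JSON files stand in for the feature files on disk, and the names `frame_energies`, `write_features`, and `feature_reader` are hypothetical.

```python
import json
import os
import tempfile


def frame_energies(signal, frame_len):
    """Toy feature: mean squared value of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def write_features(signals, folder, frame_len):
    """Extract features once and store one JSON file per signal."""
    paths = []
    for index, signal in enumerate(signals):
        path = os.path.join(folder, f"features_{index:04d}.json")
        with open(path, "w") as f:
            json.dump(frame_energies(signal, frame_len), f)
        paths.append(path)
    return paths


def feature_reader(paths):
    """Lazily load one feature file at a time, as a training loop would."""
    for path in paths:
        with open(path) as f:
            yield json.load(f)


# Synthetic signals standing in for audio files.
signals = [[0.0, 1.0, 0.0, -1.0] * 4, [0.5] * 16]
with tempfile.TemporaryDirectory() as folder:
    paths = write_features(signals, folder, frame_len=4)  # done once
    for epoch in range(2):  # reused on every pass
        for features in feature_reader(paths):
            print(f"epoch {epoch}: {features}")
```

Only the reading step runs inside the training loop, so changing the model design does not repeat the feature extraction.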

Keyword Spotting in Noise Code Generation with Intel MKL-DNN

Demonstrates code generation for keyword spotting using a Bidirectional Long Short-Term Memory (BiLSTM) network and mel frequency cepstral coefficient (MFCC) feature extraction. MATLAB® Coder™ with Deep Learning Support enables the generation of a standalone executable (.exe) file. Communication between the MATLAB® (.mlx) file and the generated executable file occurs over asynchronous User Datagram Protocol (UDP). The incoming speech signal is displayed using a timescope. A mask is shown as a blue rectangle surrounding spotted instances of the keyword, YES. For more details on MFCC feature extraction and deep learning network training, visit Keyword Spotting in Noise Using MFCC and LSTM Networks.

Keyword Spotting in Noise Code Generation on Raspberry Pi

Demonstrates code generation for keyword spotting using a Bidirectional Long Short-Term Memory (BiLSTM) network and mel frequency cepstral coefficient (MFCC) feature extraction on Raspberry Pi™. MATLAB® Coder™ with Deep Learning Support enables the generation of a standalone executable (.elf) file on Raspberry Pi. Communication between the MATLAB® (.mlx) file and the generated executable file occurs over asynchronous User Datagram Protocol (UDP). The incoming speech signal is displayed using a timescope. A mask is shown as a blue rectangle surrounding spotted instances of the keyword, YES. For more details on MFCC feature extraction and deep learning network training, visit Keyword Spotting in Noise Using MFCC and LSTM Networks.

Speech Command Recognition Using Deep Learning

Train a deep learning model that detects the presence of speech commands in audio. The example uses the Speech Commands Dataset [1] to train a convolutional neural network to recognize a given set of commands.

Speech Command Recognition Code Generation with Intel MKL-DNN

Deploy feature extraction and a convolutional neural network (CNN) for speech command recognition on Intel® processors. To generate the feature extraction and network code, you use MATLAB Coder and the Intel Math Kernel Library for Deep Neural Networks (MKL-DNN). In this example, the generated code is a MATLAB executable (MEX) function, which is called by a MATLAB script that displays the predicted speech command along with the time domain signal and auditory spectrogram. For details about audio preprocessing and network training, see Speech Command Recognition Using Deep Learning.

Speech Command Recognition Code Generation on Raspberry Pi

Deploy feature extraction and a convolutional neural network (CNN) for speech command recognition to Raspberry Pi™. To generate the feature extraction and network code, you use MATLAB Coder, MATLAB Support Package for Raspberry Pi Hardware, and the ARM® Compute Library. In this example, the generated code is an executable on your Raspberry Pi, which is called by a MATLAB script that displays the predicted speech command along with the signal and auditory spectrogram. Interaction between the MATLAB script and the executable on your Raspberry Pi is handled using the user datagram protocol (UDP). For details about audio preprocessing and network training, see Speech Command Recognition Using Deep Learning.

Keyword Spotting in Noise Using MFCC and LSTM Networks

Identify a keyword in noisy speech using a deep learning network. In particular, the example uses a Bidirectional Long Short-Term Memory (BiLSTM) network and mel frequency cepstral coefficients (MFCC).
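
MFCCs are computed from a filter bank whose bands are spaced on the mel scale, which is roughly linear at low frequencies and logarithmic at high frequencies. The Python sketch below shows only that spacing step. It assumes the widely used conversion mel = 2595 log10(1 + f/700); the example itself may use a different variant, and the function names and band settings here are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mel (one common formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_bands, f_min_hz, f_max_hz):
    """Edge frequencies in Hz for num_bands triangular filters
    that are equally spaced on the mel scale."""
    low, high = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (high - low) / (num_bands + 1)
    return [mel_to_hz(low + k * step) for k in range(num_bands + 2)]


edges = mel_band_edges(num_bands=5, f_min_hz=50.0, f_max_hz=4000.0)
print([round(edge, 1) for edge in edges])
```

The printed edges grow further apart as frequency increases, so the filter bank resolves low frequencies more finely than high ones.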

Denoise Speech Using Deep Learning Networks

Denoise speech signals using deep learning networks. The example compares two types of networks applied to the same task: fully connected, and convolutional.

Train Generative Adversarial Network (GAN) for Sound Synthesis

Train and use a generative adversarial network (GAN) to generate sounds.

Voice Activity Detection in Noise Using Deep Learning

Detect regions of speech in a low signal-to-noise environment using deep learning. The example uses the Speech Commands Dataset to train a Bidirectional Long Short-Term Memory (BiLSTM) network to detect voice activity.

Speech Emotion Recognition

Illustrates a simple speech emotion recognition (SER) system using a BiLSTM network. You begin by downloading the data set and then testing the trained network on individual files. The network was trained on a small German-language database [1].

Acoustic Scene Recognition Using Late Fusion

Create a multi-model late fusion system for acoustic scene recognition. The example trains a convolutional neural network (CNN) using mel spectrograms and an ensemble classifier using wavelet scattering. The example uses the TUT dataset for training and evaluation [1].
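
In late fusion, each model makes its own prediction and the predictions are combined afterward. The Python sketch below shows one common way to do this, a weighted average of class posteriors. It is an assumption for illustration, not necessarily the combination rule used in the example, and the function name `late_fusion`, the class labels, and the scores are hypothetical.

```python
def late_fusion(posteriors_a, posteriors_b, weight_a=0.5):
    """Weighted average of two models' class posteriors.
    Returns (fused_posteriors, predicted_class)."""
    if posteriors_a.keys() != posteriors_b.keys():
        raise ValueError("both models must score the same classes")
    fused = {
        label: weight_a * posteriors_a[label] + (1.0 - weight_a) * posteriors_b[label]
        for label in posteriors_a
    }
    return fused, max(fused, key=fused.get)


# Hypothetical posteriors from two independently trained models.
cnn_scores = {"street": 0.6, "park": 0.3, "office": 0.1}
ensemble_scores = {"street": 0.2, "park": 0.7, "office": 0.1}

fused, predicted = late_fusion(cnn_scores, ensemble_scores)
print(predicted)
```

Here the first model alone would choose "street", but the fused decision is "park" because the second model is more confident. The `weight_a` parameter lets one model count more than the other.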
