Feature Extraction

Mel spectrogram, MFCC, pitch, spectral descriptors

Extract features from audio signals for use as input to machine learning or deep learning systems. Use individual functions, such as `melSpectrogram`, `mfcc`, `pitch`, and `spectralCentroid`, or use the `audioFeatureExtractor` object to create a feature extraction pipeline that minimizes redundant calculations. In live scripts, use the Extract Audio Features task to graphically select the features to extract.
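The two approaches described above can be sketched as follows. This is a minimal illustration, not a recommended configuration: the synthetic 220 Hz tone, the 30 ms Hamming window, and the 20 ms overlap are assumptions chosen only so the snippet is self-contained. It assumes Audio Toolbox is installed.

```matlab
% Synthetic input (assumption for illustration): 1 s, 220 Hz tone at 16 kHz.
fs = 16e3;
t = (0:fs-1)'/fs;
audioIn = sin(2*pi*220*t);

% Approach 1: call individual functions. Each call frames the signal
% and computes its own spectrum independently.
S        = melSpectrogram(audioIn,fs);
coeffs   = mfcc(audioIn,fs);
f0       = pitch(audioIn,fs);
centroid = spectralCentroid(audioIn,fs);

% Approach 2: configure one audioFeatureExtractor so the enabled
% features share a common window, overlap, and intermediate results.
win = hamming(round(0.03*fs),"periodic");   % 30 ms analysis window
aFE = audioFeatureExtractor( ...
    "SampleRate",fs, ...
    "Window",win, ...
    "OverlapLength",round(0.02*fs), ...     % 20 ms overlap
    "mfcc",true, ...
    "pitch",true, ...
    "spectralCentroid",true);

features = extract(aFE,audioIn);   % one row per analysis frame
idx = info(aFE);                   % maps each feature name to its columns

% Pull out a single feature from the combined matrix.
mfccFromPipeline = features(:,idx.mfcc);
```

The second approach returns a single matrix whose columns are identified by `info`, which is convenient when the features feed directly into a training pipeline.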

Objects

`audioFeatureExtractor`	Streamline audio feature extraction
`ivectorSystem`	Create i-vector system

Live Editor Tasks

Extract Audio Features	Streamline audio feature extraction in the Live Editor

Functions

Auditory Spectrograms

`audioDelta`	Compute delta features
`designAuditoryFilterBank`	Design auditory filter bank
`melSpectrogram`	Mel spectrogram

Auditory Cepstral Coefficients

`audioDelta`	Compute delta features
`cepstralCoefficients`	Extract cepstral coefficients
`gtcc`	Extract gammatone cepstral coefficients, log-energy, delta, and delta-delta
`mfcc`	Extract MFCC, log energy, delta, and delta-delta of audio signal

Feature Embeddings

`openl3Embeddings`	Extract OpenL3 feature embeddings
`vggishEmbeddings`	Extract VGGish feature embeddings

Periodicity and Harmonicity

`audioDelta`	Compute delta features
`harmonicRatio`	Harmonic ratio
`pitch`	Estimate fundamental frequency of audio signal
`pitchnn`	Estimate pitch with deep learning neural network

Spectral Descriptors

`audioDelta`	Compute delta features
`spectralCentroid`	Spectral centroid for audio signals and auditory spectrograms
`spectralCrest`	Spectral crest for audio signals and auditory spectrograms
`spectralDecrease`	Spectral decrease for audio signals and auditory spectrograms
`spectralEntropy`	Spectral entropy for audio signals and auditory spectrograms
`spectralFlatness`	Spectral flatness for audio signals and auditory spectrograms
`spectralFlux`	Spectral flux for audio signals and auditory spectrograms
`spectralKurtosis`	Spectral kurtosis for audio signals and auditory spectrograms
`spectralRolloffPoint`	Spectral rolloff point for audio signals and auditory spectrograms
`spectralSkewness`	Spectral skewness for audio signals and auditory spectrograms
`spectralSlope`	Spectral slope for audio signals and auditory spectrograms
`spectralSpread`	Spectral spread for audio signals and auditory spectrograms

Domain Conversion

`erb2hz`	Convert from equivalent rectangular bandwidth (ERB) scale to hertz
`bark2hz`	Convert from Bark scale to hertz
`mel2hz`	Convert from mel scale to hertz
`hz2erb`	Convert from hertz to equivalent rectangular bandwidth (ERB) scale
`hz2bark`	Convert from hertz to Bark scale
`hz2mel`	Convert from hertz to mel scale
`phon2sone`	Convert from phon to sone
`sone2phon`	Convert from sone to phon
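As a brief illustration of the conversion functions listed above, the following sketch maps a few frequencies to the mel, Bark, and ERB scales and back. The frequency values and the 40-phon loudness level are arbitrary choices made for this example. It assumes Audio Toolbox is installed.

```matlab
% Example frequencies in hertz (arbitrary values for illustration).
hz = [100 440 1000 4000];

% Convert to perceptual frequency scales.
mel  = hz2mel(hz);
bark = hz2bark(hz);
erb  = hz2erb(hz);

% Convert back to hertz and measure the round-trip error.
hzFromMel  = mel2hz(mel);
hzFromBark = bark2hz(bark);
hzFromErb  = erb2hz(erb);
roundTripError = max(abs([hzFromMel; hzFromBark; hzFromErb] - hz),[],2);

% Loudness conversion: phon to sone and back.
sone = phon2sone(40);
phon = sone2phon(sone);
```

Each forward and inverse pair is intended to invert the other, so `roundTripError` should be negligible relative to the input frequencies.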

Blocks

Audio Delta	Compute delta features
Auditory Spectrogram	Extract mel, Bark, or ERB spectrogram from audio
Cepstral Coefficients	Extract cepstral coefficients from spectrogram
Design Auditory Filter Bank	Design frequency-domain auditory filter bank
Design Mel Filter Bank	Design frequency-domain mel filter bank
Mel Spectrogram	Extract mel spectrogram from audio
MFCC	Extract mel-frequency cepstral coefficients from audio

Topics

Feature Selection for Audio Classification
Perform audio feature selection to select a feature set for either speaker recognition or word recognition tasks.
Spectral Descriptors
Overview and applications of spectral descriptors.
Learn Pre-Emphasis Filter Using Deep Learning
Use a convolutional deep network to learn a pre-emphasis filter for speech recognition.

Featured Examples

Speaker Recognition Using x-vectors

Develop an x-vector system to perform speaker recognition.

Speaker Diarization Using x-vectors

Speaker diarization is the process of partitioning an audio signal into segments according to speaker identity. It answers the question "who spoke when" without prior knowledge of the speakers and, depending on the application, without prior knowledge of the number of speakers.

Train Spoken Digit Recognition Network Using Out-of-Memory Features

Train a spoken digit recognition network on out-of-memory auditory spectrograms using a transformed datastore. In this example, you extract auditory spectrograms from audio using `audioDatastore` and `audioFeatureExtractor`, and you write them to disk. You then use a `signalDatastore` to access the features during training. The workflow is useful when the training features do not fit in memory. Because you extract the features only once, the workflow also saves time when you iterate on the deep learning model design.

Train Spoken Digit Recognition Network Using Out-of-Memory Audio Data

Train a spoken digit recognition network on out-of-memory audio data using a transformed datastore. In this example, you apply a random pitch shift to audio data used to train a convolutional neural network (CNN). For each training iteration, the audio data is augmented using the `audioDataAugmenter` object, and then features are extracted using the `audioFeatureExtractor` object. The workflow in this example applies to any random data augmentation used in a training loop. The workflow also applies when the underlying audio data set or training features do not fit in memory.

Train Speech Command Recognition Model Using Deep Learning

Train a deep learning model that detects the presence of speech commands in audio.

Voice Activity Detection in Noise Using Deep Learning

Detect regions of speech in a low signal-to-noise environment using deep learning. The example uses the Speech Commands Dataset to train a Bidirectional Long Short-Term Memory (BiLSTM) network to detect voice activity.

Spoken Digit Recognition with Wavelet Scattering and Deep Learning

Classify spoken digits using both machine and deep learning techniques. In the example, you perform classification using wavelet time scattering with a support vector machine (SVM) and with a long short-term memory (LSTM) network. You also apply Bayesian optimization to determine suitable hyperparameters to improve the accuracy of the LSTM network. In addition, the example illustrates an approach using a deep convolutional neural network (CNN) and mel-frequency spectrograms.

Sequential Feature Selection for Audio Features

A typical workflow for feature selection applied to the task of spoken digit recognition.

Acoustic Scene Recognition Using Late Fusion

Create a multi-model late fusion system for acoustic scene recognition. The example trains a convolutional neural network (CNN) using mel spectrograms and an ensemble classifier using wavelet scattering. The example uses the TUT dataset for training and evaluation [1].

Speaker Verification Using i-Vectors

Speaker verification, or authentication, is the task of confirming that the identity of a speaker is who they purport to be. Speaker verification has been an active research area for many years. An early performance breakthrough was to use a Gaussian mixture model and universal background model (GMM-UBM) [1] on acoustic features (usually MFCCs). For an example, see Speaker Verification Using Gaussian Mixture Model. One of the main difficulties of GMM-UBM systems involves intersession variability. Joint factor analysis (JFA) was proposed to compensate for this variability by separately modeling inter-speaker variability and channel or session variability [2] [3]. However, [4] discovered that channel factors in the JFA also contained information about the speakers, and proposed combining the channel and speaker spaces into a total variability space. Intersession variability was then compensated for by using backend procedures, such as linear discriminant analysis (LDA) and within-class covariance normalization (WCCN), followed by scoring, such as the cosine similarity score. [5] proposed replacing the cosine similarity scoring with a probabilistic LDA (PLDA) model. [11] and [12] proposed a method to Gaussianize the i-vectors and therefore make Gaussian assumptions in the PLDA, referred to as G-PLDA or simplified PLDA. While i-vectors were originally proposed for speaker verification, they have been applied to many problems, such as language recognition, speaker diarization, emotion recognition, age estimation, and anti-spoofing [10]. Recently, deep learning techniques have been proposed to replace i-vectors with d-vectors or x-vectors [8] [6].

Speaker Verification Using Gaussian Mixture Model

Speaker verification, or authentication, is the task of verifying that a given speech segment belongs to a given speaker. In speaker verification systems, there is an unknown set of all other speakers, so the likelihood that an utterance belongs to the verification target is compared to the likelihood that it does not. This contrasts with speaker identification tasks, where the likelihood of each speaker is calculated, and those likelihoods are compared. Both speaker verification and speaker identification can be text dependent or text independent. In this example, you create a text-dependent speaker verification system using a Gaussian mixture model/universal background model (GMM-UBM).

Pitch Tracking Using Multiple Pitch Estimations and HMM

Perform pitch tracking using multiple pitch estimations, octave and median smoothing, and a hidden Markov model (HMM).

LPC Analysis and Synthesis of Speech

Use the Levinson-Durbin and Time-Varying Lattice Filter blocks for low-bandwidth transmission of speech using linear predictive coding.

Speaker Identification Using Custom SincNet Layer and Deep Learning

Perform speaker identification using a custom deep learning layer that implements a mel-scale filter bank.

Train 3-D Speech Enhancement Network Using Deep Learning

Train a filter and sum network (FaSNet) to perform speech enhancement using ambisonic data.

Audio-Based Anomaly Detection for Machine Health Monitoring

Design an autoencoder neural network to perform anomaly detection for machine sounds using unsupervised learning.
