
classifySound

Classify sounds in audio signal

Description


sounds = classifySound(audioIn,fs) returns the sound classes detected over time in the audio input, audioIn, with sample rate fs.


sounds = classifySound(audioIn,fs,Name,Value) specifies options using one or more Name,Value pair arguments.

Example: sounds = classifySound(audioIn,fs,'SpecificityLevel','low') classifies sounds using low specificity.


[sounds,timestamps] = classifySound(___) also returns time stamps associated with each detected sound.


[sounds,timestamps,resultsTable] = classifySound(___) also returns a table containing result details.


classifySound(___) with no output arguments creates a word cloud of the identified sounds in the audio signal.

This function requires both Audio Toolbox™ and Deep Learning Toolbox™.

Examples


Download and unzip the Audio Toolbox™ support for YAMNet.

If the Audio Toolbox support for YAMNet is not installed, then the first call to the function provides a link to the download location. To download the model, click the link. Unzip the file to a location on the MATLAB path.

Alternatively, execute the following commands to download and unzip the YAMNet model to your temporary directory.

downloadFolder = fullfile(tempdir,'YAMNetDownload');
loc = websave(downloadFolder,'https://ssd.mathworks.com/supportfiles/audio/yamnet.zip');
YAMNetLocation = tempdir;
unzip(loc,YAMNetLocation)
addpath(fullfile(YAMNetLocation,'yamnet'))

Generate 1 second of pink noise assuming a 16 kHz sample rate.

fs = 16e3;
x = pinknoise(fs);

Call classifySound with the pink noise signal and the sample rate.

identifiedSound = classifySound(x,fs)
identifiedSound = "Pink noise"

Read in an audio signal. Call classifySound to return the detected sounds and corresponding time stamps.

[audioIn,fs] = audioread('multipleSounds-16-16-mono-18secs.wav');
[sounds,timeStamps] = classifySound(audioIn,fs);

Plot the audio signal and label the detected sound regions.

t = (0:numel(audioIn)-1)/fs;
plot(t,audioIn)
xlabel('Time (s)')
axis([t(1),t(end),-1,1])
textHeight = 1.1;
for idx = 1:numel(sounds)
    patch([timeStamps(idx,1),timeStamps(idx,1),timeStamps(idx,2),timeStamps(idx,2)], ...
        [-1,1,1,-1], ...
        [0.3010 0.7450 0.9330], ...
        'FaceAlpha',0.2);
    text(timeStamps(idx,1),textHeight+0.05*(-1)^idx,sounds(idx))
end

Select a region and listen only to the selected region.

sampleStamps = floor(timeStamps*fs)+1;
soundEvent = 3;
isolatedSoundEvent = audioIn(sampleStamps(soundEvent,1):sampleStamps(soundEvent,2));
sound(isolatedSoundEvent,fs);
display('Detected Sound = ' + sounds(soundEvent))
"Detected Sound = Snoring"

Read in an audio signal containing multiple different sound events.

[audioIn,fs] = audioread('multipleSounds-16-16-mono-18secs.wav');

Call classifySound with the audio signal and sample rate.

[sounds,~,soundTable] = classifySound(audioIn,fs);

The sounds string array contains the most likely sound event in each region.

sounds
sounds = 1×5 string
    "Stream"    "Machine gun"    "Snoring"    "Bark"    "Meow"

The soundTable contains detailed information regarding the sounds detected in each region, including score means and maximums over the analyzed signal.

soundTable
soundTable = 5×2 table
        TimeStamps          Results
    ________________      ___________

         0      3.92      {4×3 table}
    4.0425    6.0025      {3×3 table}
      6.86    9.1875      {2×3 table}
    10.658    12.373      {4×3 table}
    12.985     16.66      {4×3 table}

View the last detected region.

soundTable.Results{end}
ans = 4×3 table
             Sounds             AverageScores    MaxScores
    ________________________    _____________    _________

    "Animal"                       0.79514        0.99941
    "Domestic animals, pets"       0.80243        0.99831
    "Cat"                          0.8048         0.99046
    "Meow"                         0.6342         0.90177

Call classifySound again. This time, set IncludedSounds to Animal so that the function retains only regions in which the Animal sound class is detected.

[sounds,timeStamps,soundTable] = classifySound(audioIn,fs, ...
    'IncludedSounds','Animal');

The sounds array returns only the sounds specified as included sounds. The sounds array now contains two instances of Animal that correspond to the regions previously declared as Bark and Meow.

sounds
sounds = 1×2 string
    "Animal"    "Animal"

The sound table only includes regions where the specified sound classes were detected.

soundTable
soundTable = 2×2 table
        TimeStamps          Results
    ________________      ___________

    10.658    12.373      {4×3 table}
    12.985     16.66      {4×3 table}

View the last detected region in soundTable. The results table still includes statistics for all detected sounds in the region.

soundTable.Results{end}
ans = 4×3 table
             Sounds             AverageScores    MaxScores
    ________________________    _____________    _________

    "Animal"                       0.79514        0.99941
    "Domestic animals, pets"       0.80243        0.99831
    "Cat"                          0.8048         0.99046
    "Meow"                         0.6342         0.90177

To explore which sound classes are supported by classifySound, use yamnetGraph.

Read in an audio signal and call classifySound to inspect the most likely sounds arranged in chronological order of detection.

[audioIn,fs] = audioread("multipleSounds-16-16-mono-18secs.wav");
sounds = classifySound(audioIn,fs)
sounds = 1×5 string
    "Stream"    "Machine gun"    "Snoring"    "Bark"    "Meow"

Call classifySound again and set ExcludedSounds to Meow to exclude the sound Meow from the results. The segment previously classified as Meow is now classified as Cat, which is its immediate predecessor in the AudioSet ontology.

sounds = classifySound(audioIn,fs,"ExcludedSounds","Meow")
sounds = 1×5 string
    "Stream"    "Machine gun"    "Snoring"    "Bark"    "Cat"

Call classifySound again, and set ExcludedSounds to Cat. When you exclude a sound, all of its successors are also excluded. This means that excluding the sound Cat also excludes the sound Meow. The segment originally classified as Meow is now classified as Domestic animals, pets, which is the immediate predecessor to Cat in the AudioSet ontology.

sounds = classifySound(audioIn,fs,"ExcludedSounds","Cat")
sounds = 1×5 string
    "Stream"    "Machine gun"    "Snoring"    "Bark"    "Domestic animals, pets"

Call classifySound again and set ExcludedSounds to Domestic animals, pets. The sound class Domestic animals, pets is a predecessor to both Bark and Meow, so by excluding it, the sounds previously identified as Bark and Meow are now both identified as the predecessor of Domestic animals, pets, which is Animal.

sounds = classifySound(audioIn,fs,"ExcludedSounds","Domestic animals, pets")
sounds = 1×5 string
    "Stream"    "Machine gun"    "Snoring"    "Animal"    "Animal"

Call classifySound again and set ExcludedSounds to Animal. The sound class Animal has no predecessors.

sounds = classifySound(audioIn,fs,"ExcludedSounds","Animal")
sounds = 1×3 string
    "Stream"    "Machine gun"    "Snoring"

If you want to avoid detecting Meow and its predecessors, but continue detecting successors under the same predecessors, use the IncludedSounds option. Call yamnetGraph to get a list of all supported classes. Remove Meow and its predecessors from the array of all classes, and then call classifySound again.

[~,classes] = yamnetGraph;
classesToInclude = setxor(classes,["Meow","Cat","Domestic animals, pets","Animal"]);
sounds = classifySound(audioIn,fs,"IncludedSounds",classesToInclude)
sounds = 1×4 string
    "Stream"    "Machine gun"    "Snoring"    "Bark"

Read in an audio signal and listen to it.

[audioIn,fs] = audioread('multipleSounds-16-16-mono-18secs.wav');
sound(audioIn,fs)

Call classifySound with no output arguments to generate a word cloud of the detected sounds.

classifySound(audioIn,fs);

Modify the default parameters of classifySound to explore the effect on the word cloud.

threshold = 0.1;
minimumSoundSeparation = 0.92;
minimumSoundDuration = 1.02;
classifySound(audioIn,fs, ...
    'Threshold',threshold, ...
    'MinimumSoundSeparation',minimumSoundSeparation, ...
    'MinimumSoundDuration',minimumSoundDuration);

Input Arguments


Audio input, specified as a one-channel signal (column vector).

Data Types: single | double

Sample rate in Hz, specified as a positive scalar.

Data Types: single | double

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'Threshold',0.1

Confidence threshold for reporting sounds, specified as the comma-separated pair consisting of 'Threshold' and a scalar in the range (0,1).

Data Types: single | double

Minimum separation between consecutive regions of the same detected sound in seconds, specified as the comma-separated pair consisting of 'MinimumSoundSeparation' and a positive scalar. Regions closer than the minimum sound separation are merged.

Data Types: single | double

Minimum duration of detected sound regions in seconds, specified as the comma-separated pair consisting of 'MinimumSoundDuration' and a positive scalar. Regions shorter than the minimum sound duration are discarded.

Data Types: single | double

Sounds to include in the results, specified as the comma-separated pair consisting of 'IncludedSounds' and a character vector, cell array of character vectors, string scalar, or string array. Use yamnetGraph to inspect and analyze the sounds supported by classifySound. By default, all supported sounds are included.

This option cannot be used with the 'ExcludedSounds' option.

Data Types: char | string | cell

Sounds to exclude from the results, specified as the comma-separated pair consisting of 'ExcludedSounds' and a character vector, cell array of character vectors, string scalar, or string array. When you specify an excluded sound, any successors of the excluded sound are also excluded. Use yamnetGraph to inspect valid sound classes and their predecessors and successors according to the AudioSet ontology. By default, no sounds are excluded.

This option cannot be used with the 'IncludedSounds' option.

Data Types: char | string | cell

Specificity of reported sounds, specified as the comma-separated pair consisting of 'SpecificityLevel' and 'high', 'low', or 'none'. Set SpecificityLevel to 'high' to make the function emphasize specific sound classes instead of general categories. Set SpecificityLevel to 'low' to make the function return the most general sound categories instead of specific sound classes. Set SpecificityLevel to 'none' to make the function return the most likely sound, regardless of its specificity.

Data Types: char | string

Output Arguments


Sounds detected over time in audio input, returned as a string array containing the detected sounds in chronological order.

Time stamps associated with the detected sounds in seconds, returned as an N-by-2 matrix. N is the number of detected sounds. Each row of timestamps contains the start and end times of the detected sound region.

Detailed results of sound classification, returned as a table. The number of rows in the table is equal to the number of detected sound regions. The columns are as follows.

  • TimeStamps –– Time stamps corresponding to each analyzed region.

  • Results –– Table with three variables:

    • Sounds –– Sounds detected in each region.

    • AverageScores –– Mean network scores corresponding to each detected sound class in the region.

    • MaxScores –– Maximum network scores corresponding to each detected sound class in the region.

Algorithms


The classifySound function uses YAMNet to classify audio segments into sound classes described by the AudioSet ontology. The classifySound function preprocesses the audio so that it is in the format required by YAMNet, and postprocesses YAMNet's predictions with common tasks that make the results more interpretable.

Preprocess

  1. Resample audioIn to 16 kHz and cast it to single precision.

  2. Buffer the signal into L overlapping segments. Each segment is 0.98 seconds, and consecutive segments overlap by 0.8575 seconds.

  3. Pass each segment through a one-sided short-time Fourier transform using a 25 ms periodic Hann window with a 10 ms hop and a 512-point DFT. The audio is now represented by a 257-by-96-by-L array, where 257 is the number of bins in the one-sided spectra and 96 is the number of spectra in the spectrograms.

  4. Convert the complex spectral values to magnitude and discard phase information.

  5. Pass the one-sided magnitude spectrum through a 64-band mel-spaced filter bank and then sum the magnitudes in each band. The audio is now represented by a 96-by-64-by-1-by-L array, where 96 is the number of spectra in the mel spectrogram, 64 is the number of mel bands, and the spectrograms are now spaced along the fourth dimension for compatibility with the YAMNet model.

  6. Convert the mel spectrograms to a log scale. A sketch of the full chain follows this list.
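The following is a minimal sketch of this preprocessing chain, not the internal implementation of classifySound. It assumes audioIn is a column vector with sample rate fs, and it uses designAuditoryFilterBank to construct the mel filter bank; the exact filter bank design and the log offset are assumptions of this sketch.

% Sketch of the YAMNet preprocessing chain (illustrative only).
audioIn = single(resample(audioIn,16e3,fs));  % 1. resample and cast
fs = 16e3;
segmentSamples = round(0.98*fs);              % 2. 0.98 s segments ...
overlapSamples = round(0.8575*fs);            %    ... overlapped by 0.8575 s
segments = buffer(audioIn,segmentSamples,overlapSamples,'nodelay');
L = size(segments,2);
win = hann(round(0.025*fs),'periodic');       % 3. 25 ms periodic Hann window
hop = round(0.01*fs);                         %    10 ms hop
fb = designAuditoryFilterBank(fs,'FFTLength',512,'NumBands',64);
S = zeros(96,64,1,L,'single');
for idx = 1:L
    spec = stft(segments(:,idx),fs,'Window',win, ...
        'OverlapLength',numel(win)-hop, ...
        'FFTLength',512,'FrequencyRange','onesided');
    magSpec = abs(spec);                      % 4. discard phase
    melSpec = fb*magSpec;                     % 5. sum into 64 mel bands
    S(:,:,1,idx) = log(melSpec + 1e-3).';     % 6. log scale (offset assumed)
end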

Prediction

Pass the 96-by-64-by-1-by-L array of mel spectrograms through YAMNet to return an L-by-521 matrix. The output from YAMNet corresponds to confidence scores for each of the 521 sound classes over time.
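As a hedged illustration, assuming S is the 96-by-64-by-1-by-L log-mel array from the preprocessing sketch above, the prediction step amounts to the following; it requires the downloaded YAMNet model.

% Load the pretrained YAMNet model and compute per-segment scores.
net = yamnet;                % pretrained network from Audio Toolbox
scores = predict(net,S);     % L-by-521 matrix of class confidences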

Postprocess

Sound Event Region Detection
  1. Pass each of the 521 confidence signals through a moving mean filter with a window length of 7.

  2. Pass each of the signals through a moving median filter with a window length of 3.

  3. Convert the confidence signals to binary masks using the specified Threshold.

  4. Discard any sound shorter than MinimumSoundDuration.

  5. Merge regions that are closer than MinimumSoundSeparation. A sketch of these detection steps follows this list.
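A minimal sketch of these steps for a single class follows. It assumes conf is an L-by-1 confidence signal for one class, and that threshold, minDurFrames, and minSepFrames hold the Threshold value and the MinimumSoundDuration and MinimumSoundSeparation values converted to frame counts; the frame conversion is an assumption of this sketch.

% Sketch of region detection on one confidence signal (illustrative).
conf = movmean(conf,7);                      % 1. moving mean, window 7
conf = movmedian(conf,3);                    % 2. moving median, window 3
mask = conf >= threshold;                    % 3. binary mask
d = diff([0; mask(:); 0]);                   % locate region boundaries
starts = find(d == 1);
ends = find(d == -1) - 1;
keep = (ends - starts + 1) >= minDurFrames;  % 4. drop short regions
starts = starts(keep);
ends = ends(keep);
if numel(starts) > 1                         % 5. merge close regions
    gapOK = (starts(2:end) - ends(1:end-1) - 1) >= minSepFrames;
    starts = starts([true; gapOK]);
    ends = ends([gapOK; true]);
end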

Consolidate Overlapping Sound Regions

Sound regions that overlap by 50% or more are consolidated into single regions. The region start time is the smallest start time of all sounds in the group. The region end time is the largest end time of all sounds in the group. The function returns the time stamps, the sound classes, and the mean and maximum confidence of the sound classes within the region in the resultsTable output.
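As a hedged illustration, consolidating one pair of candidate regions a and b (each a [start end] row vector in seconds) might look like the following; measuring the 50% overlap against the shorter of the two regions is an assumption of this sketch.

% Illustrative consolidation of two overlapping regions.
overlap = min(a(2),b(2)) - max(a(1),b(1));    % length of the shared span
shorter = min(a(2)-a(1),b(2)-b(1));           % duration of the shorter region
if overlap >= 0.5*shorter
    merged = [min(a(1),b(1)),max(a(2),b(2))]; % union of both spans
end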

Select Specificity of Sound Group

You can set the specificity level of your sound classification using the SpecificityLevel option. For example, assume there are four sound classes in a sound group with the following corresponding mean scores over the sound region:

  • Water –– 0.82817

  • Stream –– 0.81266

  • Trickle, dribble –– 0.23102

  • Pour –– 0.20732

The sound classes Water, Stream, Trickle, dribble, and Pour are situated in the AudioSet ontology as indicated by the graph:

Diagram of AudioSet ontology for Water, Stream, Pour, and Trickle, dribble. Stream is a successor of Water which is a successor of Natural sounds. Trickle, dribble is a successor of Pour which is a successor of Liquid which is a successor of Sounds of things.

The function returns the sound class for the sound group in the sounds output argument depending on the SpecificityLevel:

  • "high"(default) –– In this mode,Streamis preferred to, andTrickle, dribbleis preferred toPour.Streamhas a higher mean score over the region, so the function returnsStreamin thesoundsoutput for the region.

  • "low"–– In this mode, the most general ontological category for the sound class with the highest mean confidence over the region is returned. ForTrickle, dribbleandPour, the most general category isSounds of things. ForStreamand, the most general category isNatural sounds. Becausehas the highest mean confidence over the sound region, the function returnsNatural sounds.

  • "none"–– In this mode, the function returns the sound class with the highest mean confidence score, which in this example is.

References

[1] Gemmeke, Jort F., et al. “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–80. DOI.org (Crossref), doi:10.1109/ICASSP.2017.7952261.

[2] Hershey, Shawn, et al. “CNN Architectures for Large-Scale Audio Classification.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 131–35. DOI.org (Crossref), doi:10.1109/ICASSP.2017.7952132.


Version History

Introduced in R2020b