Metrics for audio similarity
Distances
- class audiotools.metrics.distance.L1Loss(attribute: str = 'audio_data', weight: float = 1.0, **kwargs)[source]
Bases:
L1Loss
L1 loss between AudioSignals. Defaults to comparing audio_data, but any attribute of an AudioSignal can be used.
- Parameters
attribute (str, optional) – Attribute of signal to compare, defaults to audio_data.
weight (float, optional) – Weight of this loss, defaults to 1.0.
- forward(x: AudioSignal, y: AudioSignal)[source]
- Parameters
x (AudioSignal) – Estimate AudioSignal
y (AudioSignal) – Reference AudioSignal
- Returns
L1 loss between AudioSignal attributes.
- Return type
torch.Tensor
- reduction: str
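A minimal usage sketch, with random tensors standing in for real audio:
```python
import torch
from audiotools import AudioSignal
from audiotools.metrics.distance import L1Loss

# Synthetic one-second mono signals, shaped (batch, channels, samples).
estimate = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)
reference = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)

loss_fn = L1Loss(attribute="audio_data", weight=1.0)
loss = loss_fn(estimate, reference)  # scalar torch.Tensor
print(loss.item())
```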
- class audiotools.metrics.distance.SISDRLoss(scaling: int = True, reduction: str = 'mean', zero_mean: int = True, clip_min: Optional[int] = None, weight: float = 1.0)[source]
Bases:
Module
Computes the Scale-Invariant Source-to-Distortion Ratio between a batch of estimated and reference audio signals or aligned features.
- Parameters
scaling (bool, optional) – Whether to compute the scale-invariant SDR (True) or a plain signal-to-noise ratio (False), by default True
reduction (str, optional) – How to reduce across the batch: ‘mean’, ‘sum’, or ‘none’, by default ‘mean’
zero_mean (bool, optional) – Whether to zero-mean the references and estimates before computing the loss, by default True
clip_min (int, optional) – The minimum possible loss value. Helps the network avoid spending capacity on making already-good examples better, by default None
weight (float, optional) – Weight of this loss, defaults to 1.0.
- forward(x: AudioSignal, y: AudioSignal)[source]
Computes SI-SDR loss between an estimate and a reference signal.
- Parameters
x (AudioSignal) – Estimate AudioSignal
y (AudioSignal) – Reference AudioSignal
- Returns
SI-SDR loss.
- Return type
torch.Tensor
- training: bool
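A usage sketch on synthetic signals; here a noisy copy of the reference stands in for a model output:
```python
import torch
from audiotools import AudioSignal
from audiotools.metrics.distance import SISDRLoss

# Reference plus a noisy estimate of it, shaped (batch, channels, samples).
reference = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)
estimate = reference.clone()
estimate.audio_data = estimate.audio_data + 0.1 * torch.randn_like(estimate.audio_data)

loss_fn = SISDRLoss()  # scale-invariant, mean-reduced by default
loss = loss_fn(estimate, reference)
print(loss.item())
```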
Quality metrics
- audiotools.metrics.quality.pesq(estimates: AudioSignal, references: AudioSignal, mode: str = 'wb', target_sr: float = 16000)[source]
Computes the PESQ (Perceptual Evaluation of Speech Quality) score between degraded estimates and clean references. Both signals are resampled to target_sr before scoring.
- Parameters
estimates (AudioSignal) – Degraded AudioSignal
references (AudioSignal) – Reference AudioSignal
mode (str, optional) – ‘wb’ (wide-band) or ‘nb’ (narrow-band), by default “wb”
target_sr (float, optional) – Target sample rate, by default 16000
- Returns
PESQ score: P.862.2 Prediction (MOS-LQO)
- Return type
Tensor[float]
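A usage sketch; the file names are hypothetical placeholders, and since PESQ models speech quality, the underlying implementation may reject non-speech input:
```python
from audiotools import AudioSignal
from audiotools.metrics.quality import pesq

# Hypothetical paths to a degraded recording and its clean reference.
estimates = AudioSignal("degraded_speech.wav")
references = AudioSignal("clean_speech.wav")

# Wide-band PESQ; both signals are resampled to 16 kHz internally.
score = pesq(estimates, references, mode="wb")
print(score)
```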
- audiotools.metrics.quality.stoi(estimates: AudioSignal, references: AudioSignal, extended: int = False)[source]
Short-time objective intelligibility. Computes the STOI (see [1], [2]) of a denoised signal compared to a clean signal. The output is expected to have a monotonic relation with subjective speech intelligibility, where a higher score denotes better intelligibility. Uses pystoi under the hood.
- Parameters
estimates (AudioSignal) – Denoised speech
references (AudioSignal) – Clean original speech
extended (bool, optional) – Whether to use the extended STOI described in [3], by default False
- Returns
Short time objective intelligibility measure between clean and denoised speech
- Return type
Tensor[float]
References
C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, ‘A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech’, ICASSP 2010, Dallas, Texas.
C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, ‘An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech’, IEEE Transactions on Audio, Speech, and Language Processing, 2011.
Jesper Jensen and Cees H. Taal, ‘An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers’, IEEE Transactions on Audio, Speech, and Language Processing, 2016.
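A usage sketch for stoi with hypothetical file names:
```python
from audiotools import AudioSignal
from audiotools.metrics.quality import stoi

# Hypothetical paths to denoised and clean speech recordings.
estimates = AudioSignal("denoised_speech.wav")
references = AudioSignal("clean_speech.wav")

score = stoi(estimates, references, extended=False)
print(score)
```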
- audiotools.metrics.quality.visqol(estimates: AudioSignal, references: AudioSignal, mode: str = 'audio')[source]
Computes the ViSQOL (Virtual Speech Quality Objective Listener) score between degraded estimates and clean references.
- Parameters
estimates (AudioSignal) – Degraded AudioSignal
references (AudioSignal) – Reference AudioSignal
mode (str, optional) – ‘audio’ or ‘speech’, by default ‘audio’
- Returns
ViSQOL score (MOS-LQO)
- Return type
Tensor[float]
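A usage sketch, assuming the optional ViSQOL dependency is installed; the file names are hypothetical:
```python
from audiotools import AudioSignal
from audiotools.metrics.quality import visqol

# Hypothetical paths to a coded signal and its original.
estimates = AudioSignal("coded.wav")
references = AudioSignal("original.wav")

score = visqol(estimates, references, mode="audio")  # MOS-LQO
print(score)
```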
Spectral distance metrics
- class audiotools.metrics.spectral.MelSpectrogramLoss(n_mels: List[int] = [150, 80], window_lengths: List[int] = [2048, 512], loss_fn: Callable = L1Loss(), clamp_eps: float = 1e-05, mag_weight: float = 1.0, log_weight: float = 1.0, pow: float = 2.0, weight: float = 1.0, match_stride: bool = False, mel_fmin: List[float] = [0.0, 0.0], mel_fmax: List[float] = [None, None], window_type: Optional[str] = None)[source]
Bases:
Module
Compute distance between mel spectrograms. Can be used in a multi-scale way.
- Parameters
n_mels (List[int]) – Number of mels per STFT, by default [150, 80]
window_lengths (List[int], optional) – Length of each window of each STFT, by default [2048, 512]
loss_fn (Callable, optional) – Loss function used to compare the spectrograms, by default nn.L1Loss()
clamp_eps (float, optional) – Lower clamp applied to the magnitude before taking the log, by default 1e-5
mag_weight (float, optional) – Weight of raw magnitude portion of loss, by default 1.0
log_weight (float, optional) – Weight of log magnitude portion of loss, by default 1.0
pow (float, optional) – Power to raise magnitude to before taking log, by default 2.0
weight (float, optional) – Weight of this loss, by default 1.0
match_stride (bool, optional) – Whether to match the stride of convolutional layers, by default False
mel_fmin (List[float], optional) – Lowest frequency of each mel filterbank, by default [0.0, 0.0]
mel_fmax (List[float], optional) – Highest frequency of each mel filterbank; None uses the Nyquist frequency, by default [None, None]
window_type (str, optional) – Type of window to use, by default None
- forward(x: AudioSignal, y: AudioSignal)[source]
Computes mel loss between an estimate and a reference signal.
- Parameters
x (AudioSignal) – Estimate signal
y (AudioSignal) – Reference signal
- Returns
Mel loss.
- Return type
torch.Tensor
- training: bool
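A sketch of the default two-scale configuration, on random tensors standing in for real audio:
```python
import torch
from audiotools import AudioSignal
from audiotools.metrics.spectral import MelSpectrogramLoss

x = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)  # estimate
y = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)  # reference

# Defaults compare two scales: 150 mels / 2048-sample windows
# and 80 mels / 512-sample windows.
loss_fn = MelSpectrogramLoss()
loss = loss_fn(x, y)
print(loss.item())
```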
- class audiotools.metrics.spectral.MultiScaleSTFTLoss(window_lengths: List[int] = [2048, 512], loss_fn: Callable = L1Loss(), clamp_eps: float = 1e-05, mag_weight: float = 1.0, log_weight: float = 1.0, pow: float = 2.0, weight: float = 1.0, match_stride: bool = False, window_type: Optional[str] = None)[source]
Bases:
Module
Computes the multi-scale STFT loss from [1].
- Parameters
window_lengths (List[int], optional) – Length of each window of each STFT, by default [2048, 512]
loss_fn (Callable, optional) – Loss function used to compare the spectrograms, by default nn.L1Loss()
clamp_eps (float, optional) – Lower clamp applied to the magnitude before taking the log, by default 1e-5
mag_weight (float, optional) – Weight of raw magnitude portion of loss, by default 1.0
log_weight (float, optional) – Weight of log magnitude portion of loss, by default 1.0
pow (float, optional) – Power to raise magnitude to before taking log, by default 2.0
weight (float, optional) – Weight of this loss, by default 1.0
match_stride (bool, optional) – Whether to match the stride of convolutional layers, by default False
window_type (str, optional) – Type of window to use, by default None
References
Engel, Jesse, Lamtharn Hantrakul, Chenjie Gu, and Adam Roberts. “DDSP: Differentiable Digital Signal Processing.” International Conference on Learning Representations, 2020.
- forward(x: AudioSignal, y: AudioSignal)[source]
Computes multi-scale STFT between an estimate and a reference signal.
- Parameters
x (AudioSignal) – Estimate signal
y (AudioSignal) – Reference signal
- Returns
Multi-scale STFT loss.
- Return type
torch.Tensor
- training: bool
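A sketch using custom window lengths (three scales instead of the default two), on random tensors standing in for real audio:
```python
import torch
from audiotools import AudioSignal
from audiotools.metrics.spectral import MultiScaleSTFTLoss

x = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)  # estimate
y = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)  # reference

# One STFT per window length; the per-scale losses are accumulated.
loss_fn = MultiScaleSTFTLoss(window_lengths=[2048, 1024, 512])
loss = loss_fn(x, y)
print(loss.item())
```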
- class audiotools.metrics.spectral.PhaseLoss(window_length: int = 2048, hop_length: int = 512, weight: float = 1.0)[source]
Bases:
Module
Difference between phase spectrograms.
- Parameters
window_length (int, optional) – Length of STFT window, by default 2048
hop_length (int, optional) – Hop length of STFT window, by default 512
weight (float, optional) – Weight of loss, by default 1.0
- forward(x: AudioSignal, y: AudioSignal)[source]
Computes phase loss between an estimate and a reference signal.
- Parameters
x (AudioSignal) – Estimate signal
y (AudioSignal) – Reference signal
- Returns
Phase loss.
- Return type
torch.Tensor
- training: bool
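A sketch on random tensors standing in for real audio:
```python
import torch
from audiotools import AudioSignal
from audiotools.metrics.spectral import PhaseLoss

x = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)  # estimate
y = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)  # reference

loss_fn = PhaseLoss(window_length=2048, hop_length=512)
loss = loss_fn(x, y)
print(loss.item())
```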