Metrics for audio similarity
Distances
- class audiotools.metrics.distance.L1Loss(attribute: str = 'audio_data', weight: float = 1.0, **kwargs)[source]
Bases:
L1Loss
L1 loss between AudioSignals. Defaults to comparing audio_data, but any attribute of an AudioSignal can be used.
- Parameters
attribute (str, optional) – Attribute of signal to compare, defaults to audio_data.
weight (float, optional) – Weight of this loss, defaults to 1.0.
- forward(x: AudioSignal, y: AudioSignal)[source]
- Parameters
x (AudioSignal) – Estimate AudioSignal
y (AudioSignal) – Reference AudioSignal
- Returns
L1 loss between AudioSignal attributes.
- Return type
torch.Tensor
- reduction: str
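A minimal usage sketch, with random tensors standing in for real audio:
```python
import torch
from audiotools import AudioSignal
from audiotools.metrics.distance import L1Loss

# Synthetic one-second mono signals, shaped (batch, channels, samples).
estimate = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)
reference = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)

loss_fn = L1Loss(attribute="audio_data", weight=1.0)
loss = loss_fn(estimate, reference)  # scalar torch.Tensor
print(loss.item())
```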
- class audiotools.metrics.distance.SISDRLoss(scaling: int = True, reduction: str = 'mean', zero_mean: int = True, clip_min: Optional[int] = None, weight: float = 1.0)[source]
Bases:
Module
Computes the Scale-Invariant Source-to-Distortion Ratio between a batch of estimated and reference audio signals or aligned features.
- Parameters
scaling (bool, optional) – Whether to compute the scale-invariant SDR (True) or a plain signal-to-noise ratio (False), by default True
reduction (str, optional) – How to reduce across the batch: ‘mean’, ‘sum’, or ‘none’, by default ‘mean’
zero_mean (bool, optional) – Whether to zero-mean the references and estimates before computing the loss, by default True
clip_min (int, optional) – The minimum possible loss value. Helps the network avoid spending capacity on making already-good examples better, by default None
weight (float, optional) – Weight of this loss, defaults to 1.0.
- forward(x: AudioSignal, y: AudioSignal)[source]
Computes SI-SDR loss between an estimate and a reference signal.
- Parameters
x (AudioSignal) – Estimate AudioSignal
y (AudioSignal) – Reference AudioSignal
- Returns
SI-SDR loss.
- Return type
torch.Tensor
- training: bool
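A usage sketch on synthetic signals; here a noisy copy of the reference stands in for a model output:
```python
import torch
from audiotools import AudioSignal
from audiotools.metrics.distance import SISDRLoss

# Reference plus a noisy estimate of it, shaped (batch, channels, samples).
reference = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)
estimate = reference.clone()
estimate.audio_data = estimate.audio_data + 0.1 * torch.randn_like(estimate.audio_data)

loss_fn = SISDRLoss()  # scale-invariant, mean-reduced by default
loss = loss_fn(estimate, reference)
print(loss.item())
```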
Quality metrics
- audiotools.metrics.quality.pesq(estimates: AudioSignal, references: AudioSignal, mode: str = 'wb', target_sr: float = 16000)[source]
Computes the PESQ (Perceptual Evaluation of Speech Quality) score between degraded estimates and clean references. Both signals are resampled to target_sr before scoring.
- Parameters
estimates (AudioSignal) – Degraded AudioSignal
references (AudioSignal) – Reference AudioSignal
mode (str, optional) – ‘wb’ (wide-band) or ‘nb’ (narrow-band), by default “wb”
target_sr (float, optional) – Target sample rate, by default 16000
- Returns
PESQ score: P.862.2 Prediction (MOS-LQO)
- Return type
Tensor[float]
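A usage sketch; the file names are hypothetical placeholders, and since PESQ models speech quality, the underlying implementation may reject non-speech input:
```python
from audiotools import AudioSignal
from audiotools.metrics.quality import pesq

# Hypothetical paths to a degraded recording and its clean reference.
estimates = AudioSignal("degraded_speech.wav")
references = AudioSignal("clean_speech.wav")

# Wide-band PESQ; both signals are resampled to 16 kHz internally.
score = pesq(estimates, references, mode="wb")
print(score)
```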
- audiotools.metrics.quality.stoi(estimates: AudioSignal, references: AudioSignal, extended: int = False)[source]
Short-time objective intelligibility. Computes the STOI (see [1], [2]) of a denoised signal compared to a clean signal. The output is expected to have a monotonic relation with subjective speech intelligibility, where a higher score denotes better intelligibility. Uses pystoi under the hood.
- Parameters
estimates (AudioSignal) – Denoised speech
references (AudioSignal) – Clean original speech
extended (bool, optional) – Whether to use the extended STOI described in [3], by default False
- Returns
Short time objective intelligibility measure between clean and denoised speech
- Return type
Tensor[float]
References
C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, ‘A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech’, ICASSP 2010, Dallas, Texas.
C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, ‘An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech’, IEEE Transactions on Audio, Speech, and Language Processing, 2011.
Jesper Jensen and Cees H. Taal, ‘An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers’, IEEE Transactions on Audio, Speech, and Language Processing, 2016.
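A usage sketch for stoi with hypothetical file names:
```python
from audiotools import AudioSignal
from audiotools.metrics.quality import stoi

# Hypothetical paths to denoised and clean speech recordings.
estimates = AudioSignal("denoised_speech.wav")
references = AudioSignal("clean_speech.wav")

score = stoi(estimates, references, extended=False)
print(score)
```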
- audiotools.metrics.quality.visqol(estimates: AudioSignal, references: AudioSignal, mode: str = 'audio')[source]
Computes the ViSQOL (Virtual Speech Quality Objective Listener) score between degraded estimates and clean references.
- Parameters
estimates (AudioSignal) – Degraded AudioSignal
references (AudioSignal) – Reference AudioSignal
mode (str, optional) – ‘audio’ or ‘speech’, by default ‘audio’
- Returns
ViSQOL score (MOS-LQO)
- Return type
Tensor[float]
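A usage sketch, assuming the optional ViSQOL dependency is installed; the file names are hypothetical:
```python
from audiotools import AudioSignal
from audiotools.metrics.quality import visqol

# Hypothetical paths to a coded signal and its original.
estimates = AudioSignal("coded.wav")
references = AudioSignal("original.wav")

score = visqol(estimates, references, mode="audio")  # MOS-LQO
print(score)
```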
Spectral distance metrics
- class audiotools.metrics.spectral.MelSpectrogramLoss(n_mels: List[int] = [150, 80], window_lengths: List[int] = [2048, 512], loss_fn: Callable = L1Loss(), clamp_eps: float = 1e-05, mag_weight: float = 1.0, log_weight: float = 1.0, pow: float = 2.0, weight: float = 1.0, match_stride: bool = False, mel_fmin: List[float] = [0.0, 0.0], mel_fmax: List[float] = [None, None], window_type: Optional[str] = None)[source]
Bases:
Module
Compute distance between mel spectrograms. Can be used in a multi-scale way.
- Parameters
n_mels (List[int]) – Number of mels per STFT, by default [150, 80]
window_lengths (List[int], optional) – Length of each window of each STFT, by default [2048, 512]
loss_fn (Callable, optional) – Loss function used to compare the spectrograms, by default nn.L1Loss()
clamp_eps (float, optional) – Lower clamp applied to the magnitude before taking the log, by default 1e-5
mag_weight (float, optional) – Weight of raw magnitude portion of loss, by default 1.0
log_weight (float, optional) – Weight of log magnitude portion of loss, by default 1.0
pow (float, optional) – Power to raise magnitude to before taking log, by default 2.0
weight (float, optional) – Weight of this loss, by default 1.0
match_stride (bool, optional) – Whether to match the stride of convolutional layers, by default False
mel_fmin (List[float], optional) – Lowest frequency of each mel filterbank, by default [0.0, 0.0]
mel_fmax (List[float], optional) – Highest frequency of each mel filterbank; None uses the Nyquist frequency, by default [None, None]
window_type (str, optional) – Type of window to use, by default None
- forward(x: AudioSignal, y: AudioSignal)[source]
Computes mel loss between an estimate and a reference signal.
- Parameters
x (AudioSignal) – Estimate signal
y (AudioSignal) – Reference signal
- Returns
Mel loss.
- Return type
torch.Tensor
- training: bool
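A sketch of the default two-scale configuration, on random tensors standing in for real audio:
```python
import torch
from audiotools import AudioSignal
from audiotools.metrics.spectral import MelSpectrogramLoss

x = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)  # estimate
y = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)  # reference

# Defaults compare two scales: 150 mels / 2048-sample windows
# and 80 mels / 512-sample windows.
loss_fn = MelSpectrogramLoss()
loss = loss_fn(x, y)
print(loss.item())
```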
- class audiotools.metrics.spectral.MultiScaleSTFTLoss(window_lengths: List[int] = [2048, 512], loss_fn: Callable = L1Loss(), clamp_eps: float = 1e-05, mag_weight: float = 1.0, log_weight: float = 1.0, pow: float = 2.0, weight: float = 1.0, match_stride: bool = False, window_type: Optional[str] = None)[source]
Bases:
Module
Computes the multi-scale STFT loss from [1].
- Parameters
window_lengths (List[int], optional) – Length of each window of each STFT, by default [2048, 512]
loss_fn (Callable, optional) – Loss function used to compare the spectrograms, by default nn.L1Loss()
clamp_eps (float, optional) – Lower clamp applied to the magnitude before taking the log, by default 1e-5
mag_weight (float, optional) – Weight of raw magnitude portion of loss, by default 1.0
log_weight (float, optional) – Weight of log magnitude portion of loss, by default 1.0
pow (float, optional) – Power to raise magnitude to before taking log, by default 2.0
weight (float, optional) – Weight of this loss, by default 1.0
match_stride (bool, optional) – Whether to match the stride of convolutional layers, by default False
window_type (str, optional) – Type of window to use, by default None
References
Engel, Jesse, Lamtharn Hantrakul, Chenjie Gu, and Adam Roberts. “DDSP: Differentiable Digital Signal Processing.” International Conference on Learning Representations, 2020.
- forward(x: AudioSignal, y: AudioSignal)[source]
Computes multi-scale STFT between an estimate and a reference signal.
- Parameters
x (AudioSignal) – Estimate signal
y (AudioSignal) – Reference signal
- Returns
Multi-scale STFT loss.
- Return type
torch.Tensor
- training: bool
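A sketch using custom window lengths (three scales instead of the default two), on random tensors standing in for real audio:
```python
import torch
from audiotools import AudioSignal
from audiotools.metrics.spectral import MultiScaleSTFTLoss

x = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)  # estimate
y = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)  # reference

# One STFT per window length; the per-scale losses are accumulated.
loss_fn = MultiScaleSTFTLoss(window_lengths=[2048, 1024, 512])
loss = loss_fn(x, y)
print(loss.item())
```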
- class audiotools.metrics.spectral.PhaseLoss(window_length: int = 2048, hop_length: int = 512, weight: float = 1.0)[source]
Bases:
Module
Difference between phase spectrograms.
- Parameters
window_length (int, optional) – Length of STFT window, by default 2048
hop_length (int, optional) – Hop length of STFT window, by default 512
weight (float, optional) – Weight of loss, by default 1.0
- forward(x: AudioSignal, y: AudioSignal)[source]
Computes phase loss between an estimate and a reference signal.
- Parameters
x (AudioSignal) – Estimate signal
y (AudioSignal) – Reference signal
- Returns
Phase loss.
- Return type
torch.Tensor
- training: bool
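A sketch on random tensors standing in for real audio:
```python
import torch
from audiotools import AudioSignal
from audiotools.metrics.spectral import PhaseLoss

x = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)  # estimate
y = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)  # reference

loss_fn = PhaseLoss(window_length=2048, hop_length=512)
loss = loss_fn(x, y)
print(loss.item())
```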