Metrics for audio similarity

Distances

class audiotools.metrics.distance.L1Loss(attribute: str = 'audio_data', weight: float = 1.0, **kwargs)[source]

Bases: torch.nn.L1Loss

L1 Loss between AudioSignals. Defaults to comparing audio_data, but any attribute of an AudioSignal can be used.

Parameters
  • attribute (str, optional) – Attribute of signal to compare, defaults to audio_data.

  • weight (float, optional) – Weight of this loss, defaults to 1.0.

forward(x: AudioSignal, y: AudioSignal)[source]
Parameters
  • x (AudioSignal) – Estimate AudioSignal

  • y (AudioSignal) – Reference AudioSignal

Returns

L1 loss between AudioSignal attributes.

Return type

torch.Tensor

reduction: str
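
A minimal usage sketch (the random tensors below are placeholder data, not real audio):

>>> import torch
>>> from audiotools import AudioSignal
>>> from audiotools.metrics.distance import L1Loss
>>> x = AudioSignal(torch.randn(1, 1, 16000), sample_rate=16000)
>>> y = AudioSignal(torch.randn(1, 1, 16000), sample_rate=16000)
>>> loss_fn = L1Loss()  # compares .audio_data by default
>>> loss = loss_fn(x, y)  # scalar torch.Tensor
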
class audiotools.metrics.distance.SISDRLoss(scaling: bool = True, reduction: str = 'mean', zero_mean: bool = True, clip_min: Optional[float] = None, weight: float = 1.0)[source]

Bases: Module

Computes the Scale-Invariant Source-to-Distortion Ratio between a batch of estimated and reference audio signals or aligned features.

Parameters
  • scaling (bool, optional) – Whether to compute the scale-invariant SDR (True) or the plain signal-to-noise ratio (False), by default True

  • reduction (str, optional) – How to reduce across the batch (either ‘mean’, ‘sum’, or ‘none’), by default ‘mean’

  • zero_mean (bool, optional) – Whether to zero-mean the references and estimates before computing the loss, by default True

  • clip_min (float, optional) – The minimum possible loss value; clamping keeps the network from focusing on making already-good examples better, by default None

  • weight (float, optional) – Weight of this loss, defaults to 1.0.

forward(x: AudioSignal, y: AudioSignal)[source]

Computes SI-SDR loss between an estimate and a reference signal.

Parameters
  • x (AudioSignal) – Estimate AudioSignal

  • y (AudioSignal) – Reference AudioSignal

Returns

SI-SDR loss.

Return type

torch.Tensor

training: bool
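
A minimal usage sketch (the estimate here is the reference plus a little white noise, purely for illustration):

>>> import torch
>>> from audiotools import AudioSignal
>>> from audiotools.metrics.distance import SISDRLoss
>>> references = AudioSignal(torch.randn(1, 1, 16000), sample_rate=16000)
>>> estimates = references.clone()
>>> estimates.audio_data = estimates.audio_data + 0.01 * torch.randn_like(estimates.audio_data)
>>> loss_fn = SISDRLoss(reduction="mean")
>>> loss = loss_fn(estimates, references)  # scalar torch.Tensor loss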

Quality metrics

audiotools.metrics.quality.pesq(estimates: AudioSignal, references: AudioSignal, mode: str = 'wb', target_sr: float = 16000)[source]

Computes the PESQ (Perceptual Evaluation of Speech Quality, ITU-T P.862) score of degraded estimates against clean references. Both signals are resampled to target_sr before scoring.

Parameters
  • estimates (AudioSignal) – Degraded AudioSignal

  • references (AudioSignal) – Reference AudioSignal

  • mode (str, optional) – ‘wb’ (wide-band) or ‘nb’ (narrow-band), by default “wb”

  • target_sr (float, optional) – Target sample rate, by default 16000

Returns

PESQ score: P.862.2 Prediction (MOS-LQO)

Return type

Tensor[float]
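
A minimal usage sketch (assumes the optional pesq dependency is installed; random noise stands in for real speech, which PESQ expects):

>>> import torch
>>> from audiotools import AudioSignal
>>> from audiotools.metrics.quality import pesq
>>> references = AudioSignal(torch.randn(1, 1, 16000), sample_rate=16000)
>>> estimates = references.clone()
>>> score = pesq(estimates, references, mode="wb", target_sr=16000)  # MOS-LQO, higher is better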

audiotools.metrics.quality.stoi(estimates: AudioSignal, references: AudioSignal, extended: int = False)[source]

Short-term objective intelligibility. Computes the STOI (see [1], [2]) of a denoised signal relative to a clean signal. The output is expected to have a monotonic relation with subjective speech intelligibility, where a higher score denotes better intelligibility. Uses pystoi under the hood.

Parameters
  • estimates (AudioSignal) – Denoised speech

  • references (AudioSignal) – Clean original speech

  • extended (bool, optional) – Whether to use the extended STOI described in [3], by default False

Returns

Short time objective intelligibility measure between clean and denoised speech

Return type

Tensor[float]

References

  1. C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, ‘A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech’, ICASSP 2010, Dallas, Texas.

  2. C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, ‘An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech’, IEEE Transactions on Audio, Speech, and Language Processing, 2011.

  3. J. Jensen and C. H. Taal, ‘An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers’, IEEE Transactions on Audio, Speech, and Language Processing, 2016.
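
A minimal usage sketch (assumes pystoi is installed; random noise stands in for real speech):

>>> import torch
>>> from audiotools import AudioSignal
>>> from audiotools.metrics.quality import stoi
>>> clean = AudioSignal(torch.randn(1, 1, 16000), sample_rate=16000)
>>> denoised = clean.clone()
>>> score = stoi(denoised, clean, extended=False)  # higher denotes better intelligibility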

audiotools.metrics.quality.visqol(estimates: AudioSignal, references: AudioSignal, mode: str = 'audio')[source]

Computes the ViSQOL (Virtual Speech Quality Objective Listener) score of degraded estimates against clean references.

Parameters
  • estimates (AudioSignal) – Degraded AudioSignal

  • references (AudioSignal) – Reference AudioSignal

  • mode (str, optional) – ‘audio’ or ‘speech’, by default ‘audio’

Returns

ViSQOL score (MOS-LQO)

Return type

Tensor[float]
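
A minimal usage sketch (assumes the ViSQOL library is available; the file paths are hypothetical):

>>> from audiotools import AudioSignal
>>> from audiotools.metrics.quality import visqol
>>> references = AudioSignal("clean.wav")
>>> estimates = AudioSignal("degraded.wav")
>>> score = visqol(estimates, references, mode="audio")  # MOS-LQO, higher is better
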

Spectral distance metrics

class audiotools.metrics.spectral.MelSpectrogramLoss(n_mels: List[int] = [150, 80], window_lengths: List[int] = [2048, 512], loss_fn: Callable = L1Loss(), clamp_eps: float = 1e-05, mag_weight: float = 1.0, log_weight: float = 1.0, pow: float = 2.0, weight: float = 1.0, match_stride: bool = False, mel_fmin: List[float] = [0.0, 0.0], mel_fmax: List[float] = [None, None], window_type: Optional[str] = None)[source]

Bases: Module

Compute distance between mel spectrograms. Can be used in a multi-scale way.

Parameters
  • n_mels (List[int]) – Number of mels per STFT, by default [150, 80]

  • window_lengths (List[int], optional) – Length of each window of each STFT, by default [2048, 512]

  • loss_fn (Callable, optional) – Loss function used to compare the spectrograms at each scale, by default nn.L1Loss()

  • clamp_eps (float, optional) – Lower bound used to clamp the magnitude before taking the log, by default 1e-5

  • mag_weight (float, optional) – Weight of raw magnitude portion of loss, by default 1.0

  • log_weight (float, optional) – Weight of log magnitude portion of loss, by default 1.0

  • pow (float, optional) – Power to raise magnitude to before taking log, by default 2.0

  • weight (float, optional) – Weight of this loss, by default 1.0

  • match_stride (bool, optional) – Whether to match the stride of convolutional layers, by default False

  • mel_fmin (List[float], optional) – Lowest frequency of the mel filterbank for each STFT, by default [0.0, 0.0]

  • mel_fmax (List[float], optional) – Highest frequency of the mel filterbank for each STFT; None uses the Nyquist frequency, by default [None, None]

  • window_type (str, optional) – Type of window to use for each STFT, by default None

forward(x: AudioSignal, y: AudioSignal)[source]

Computes mel loss between an estimate and a reference signal.

Parameters
  • x (AudioSignal) – Estimate signal

  • y (AudioSignal) – Reference signal

Returns

Mel loss.

Return type

torch.Tensor

training: bool
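
A minimal usage sketch of the default two-scale configuration (random tensors stand in for real audio):

>>> import torch
>>> from audiotools import AudioSignal
>>> from audiotools.metrics.spectral import MelSpectrogramLoss
>>> x = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)
>>> y = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)
>>> # 150 mels over 2048-sample windows plus 80 mels over 512-sample windows
>>> loss_fn = MelSpectrogramLoss(n_mels=[150, 80], window_lengths=[2048, 512])
>>> loss = loss_fn(x, y)  # scalar torch.Tensor
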
class audiotools.metrics.spectral.MultiScaleSTFTLoss(window_lengths: List[int] = [2048, 512], loss_fn: Callable = L1Loss(), clamp_eps: float = 1e-05, mag_weight: float = 1.0, log_weight: float = 1.0, pow: float = 2.0, weight: float = 1.0, match_stride: bool = False, window_type: Optional[str] = None)[source]

Bases: Module

Computes the multi-scale STFT loss from [1].

Parameters
  • window_lengths (List[int], optional) – Length of each window of each STFT, by default [2048, 512]

  • loss_fn (Callable, optional) – Loss function used to compare the spectrograms at each scale, by default nn.L1Loss()

  • clamp_eps (float, optional) – Lower bound used to clamp the magnitude before taking the log, by default 1e-5

  • mag_weight (float, optional) – Weight of raw magnitude portion of loss, by default 1.0

  • log_weight (float, optional) – Weight of log magnitude portion of loss, by default 1.0

  • pow (float, optional) – Power to raise magnitude to before taking log, by default 2.0

  • weight (float, optional) – Weight of this loss, by default 1.0

  • match_stride (bool, optional) – Whether to match the stride of convolutional layers, by default False

  • window_type (str, optional) – Type of window to use for each STFT, by default None

References

  1. Engel, Jesse, Lamtharn Hantrakul, Chenjie Gu, and Adam Roberts. “DDSP: Differentiable Digital Signal Processing.” International Conference on Learning Representations. 2020.

forward(x: AudioSignal, y: AudioSignal)[source]

Computes the multi-scale STFT loss between an estimate and a reference signal.

Parameters
  • x (AudioSignal) – Estimate signal

  • y (AudioSignal) – Reference signal

Returns

Multi-scale STFT loss.

Return type

torch.Tensor

training: bool
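
A minimal usage sketch (random tensors stand in for real audio):

>>> import torch
>>> from audiotools import AudioSignal
>>> from audiotools.metrics.spectral import MultiScaleSTFTLoss
>>> x = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)
>>> y = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)
>>> loss_fn = MultiScaleSTFTLoss(window_lengths=[2048, 512])
>>> loss = loss_fn(x, y)  # weighted sum of log-magnitude and raw-magnitude terms
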
class audiotools.metrics.spectral.PhaseLoss(window_length: int = 2048, hop_length: int = 512, weight: float = 1.0)[source]

Bases: Module

Computes the difference between the phase spectrograms of an estimate and a reference signal.

Parameters
  • window_length (int, optional) – Length of STFT window, by default 2048

  • hop_length (int, optional) – Hop length of STFT window, by default 512

  • weight (float, optional) – Weight of loss, by default 1.0

forward(x: AudioSignal, y: AudioSignal)[source]

Computes phase loss between an estimate and a reference signal.

Parameters
  • x (AudioSignal) – Estimate signal

  • y (AudioSignal) – Reference signal

Returns

Phase loss.

Return type

torch.Tensor

training: bool
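
A minimal usage sketch (random tensors stand in for real audio):

>>> import torch
>>> from audiotools import AudioSignal
>>> from audiotools.metrics.spectral import PhaseLoss
>>> x = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)
>>> y = AudioSignal(torch.randn(1, 1, 44100), sample_rate=44100)
>>> loss_fn = PhaseLoss(window_length=2048, hop_length=512)
>>> loss = loss_fn(x, y)  # scalar torch.Tensor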