AudioSignal

Base functionality

class audiotools.core.audio_signal.AudioSignal(audio_path_or_array: Union[Tensor, str, Path, ndarray], sample_rate: Optional[int] = None, stft_params: Optional[STFTParams] = None, offset: float = 0, duration: Optional[float] = None, device: Optional[str] = None)[source]

Bases: EffectMixin, LoudnessMixin, PlayMixin, ImpulseResponseMixin, DSPMixin, DisplayMixin, FFMPEGMixin, WhisperMixin

This is the core object of this library. Audio is always loaded into an AudioSignal, which then enables all the features of this library, including audio augmentations, I/O, playback, and more.

The structure of this object is that the base functionality is defined in core/audio_signal.py, while extensions to that functionality are defined in the other core/*.py files. For example, all the display-based functionality (e.g. plot spectrograms, waveforms, write to tensorboard) are in core/display.py.

Parameters
  • audio_path_or_array (Union[torch.Tensor, str, Path, np.ndarray]) – Object to create AudioSignal from. Can be a tensor, numpy array, or a path to a file. The audio is always reshaped to (batch_size, num_channels, num_samples).

  • sample_rate (int, optional) – Sample rate of the audio. If different from underlying file, resampling is performed. If passing in an array or tensor, this must be defined, by default None

  • stft_params (STFTParams, optional) – Parameters of STFT to use, by default None

  • offset (float, optional) – Offset in seconds to read from file, by default 0

  • duration (float, optional) – Duration in seconds to read from file, by default None

  • device (str, optional) – Device to load audio onto, by default None

Examples

Loading an AudioSignal from an array, at a sample rate of 44100.

>>> signal = AudioSignal(torch.randn(5*44100), 44100)

Note, the signal is reshaped to have a batch size, and one audio channel:

>>> print(signal.shape)
(1, 1, 220500)

You can treat AudioSignals like tensors, and many of the same functions you might use on tensors are defined for AudioSignals as well:

>>> signal.to("cuda")
>>> signal.cuda()
>>> signal.clone()
>>> signal.detach()

Indexing AudioSignals returns an AudioSignal:

>>> signal[..., 3*44100:4*44100]

The above signal is 1 second long, and is also an AudioSignal.

property audio_data

Returns the audio data tensor in the object.

Audio data is always of the shape (batch_size, num_channels, num_samples). If value has less than 3 dims (e.g. is (num_channels, num_samples)), then it will be reshaped to (1, num_channels, num_samples) - a batch size of 1.

Parameters

data (Union[torch.Tensor, np.ndarray]) – Audio data to set.

Returns

Audio samples.

Return type

torch.Tensor

classmethod batch(audio_signals: list, pad_signals: bool = False, truncate_signals: bool = False, resample: bool = False, dim: int = 0)[source]

Creates a batched AudioSignal from a list of AudioSignals.

Parameters
  • audio_signals (list[AudioSignal]) – List of AudioSignal objects

  • pad_signals (bool, optional) – Whether to pad signals to length of the maximum length AudioSignal in the list, by default False

  • truncate_signals (bool, optional) – Whether to truncate signals to length of shortest length AudioSignal in the list, by default False

  • resample (bool, optional) – Whether to resample AudioSignal to the sample rate of the first AudioSignal in the list, by default False

  • dim (int, optional) – Dimension along which to batch the signals, by default 0

Returns

Batched AudioSignal.

Return type

AudioSignal

Raises
  • RuntimeError – If not all AudioSignals are the same sample rate, and resample=False, an error is raised.

  • RuntimeError – If not all AudioSignals are the same length, and both pad_signals=False and truncate_signals=False, an error is raised.

Examples

Batching a bunch of random signals:

>>> signal_list = [AudioSignal(torch.randn(44100), 44100) for _ in range(10)]
>>> signal = AudioSignal.batch(signal_list)
>>> print(signal.shape)
(10, 1, 44100)

property batch_size

Batch size of audio signal.

Returns

Batch size of signal.

Return type

int

clone()[source]

Clones all tensors contained in the AudioSignal, and returns a copy of the signal with everything cloned. Useful when using AudioSignal within autograd computation graphs.

Relevant attributes are the stft data, the audio data, and the loudness of the file.

Returns

Clone of AudioSignal.

Return type

AudioSignal

compute_stft_padding(window_length: int, hop_length: int, match_stride: bool)[source]

Compute how the STFT should be padded, based on match_stride.

Parameters
  • window_length (int) – Window length of STFT.

  • hop_length (int) – Hop length of STFT.

  • match_stride (bool) – Whether or not to match stride, making the STFT have the same alignment as convolutional layers.

Returns

Amount to pad on either side of audio.

Return type

tuple

copy()[source]

Shallow copy of signal.

Returns

Shallow copy of the audio signal.

Return type

AudioSignal

cpu()[source]

Moves AudioSignal to cpu.

Return type

AudioSignal

cuda()[source]

Moves AudioSignal to cuda.

Return type

AudioSignal

deepcopy()[source]

Copies the signal and all of its attributes.

Returns

Deep copy of the audio signal.

Return type

AudioSignal

detach()[source]

Detaches tensors contained in AudioSignal.

Relevant attributes are the stft data, the audio data, and the loudness of the file.

Returns

Same signal, but with all tensors detached.

Return type

AudioSignal

property device

Get device that AudioSignal is on.

Returns

Device that AudioSignal is on.

Return type

torch.device

property duration

Length of audio signal in seconds.

Returns

Length of signal in seconds.

Return type

float

classmethod excerpt(audio_path: Union[str, Path], offset: Optional[float] = None, duration: Optional[float] = None, state: Optional[Union[RandomState, int]] = None, **kwargs)[source]

Randomly draw an excerpt of duration seconds from an audio file specified at audio_path, between offset seconds and end of file. state can be used to seed the random draw.

Parameters
  • audio_path (Union[str, Path]) – Path to audio file to grab excerpt from.

  • offset (float, optional) – Lower bound for the start time of the excerpt, in seconds into the file, by default None

  • duration (float, optional) – Duration of excerpt, in seconds, by default None

  • state (Union[np.random.RandomState, int], optional) – RandomState or seed of random state, by default None

Returns

AudioSignal containing excerpt.

Return type

AudioSignal

Examples

>>> signal = AudioSignal.excerpt("path/to/audio", duration=5)

float()[source]

Calls .float() on self.audio_data.

Return type

AudioSignal

static get_dct(n_mfcc: int, n_mels: int, norm: str = 'ortho', device: str = None)[source]

Create a discrete cosine transform (DCT) transformation matrix with shape (n_mels, n_mfcc). It can be normalized depending on norm. For more information about the DCT: http://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II

Parameters
  • n_mfcc (int) – Number of mfccs

  • n_mels (int) – Number of mels

  • norm (str) – Use “ortho” to get an orthogonal matrix, or None for no normalization, by default “ortho”

  • device (str, optional) – Device to load the transformation matrix on, by default None

Returns

The dct transformation matrix.

Return type

torch.Tensor [shape=(n_mels, n_mfcc)]
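
Examples

A minimal usage sketch, building a DCT matrix for 40 MFCCs from 80 mels (values chosen arbitrarily); the shape follows the documented (n_mels, n_mfcc) convention:

>>> dct_mat = AudioSignal.get_dct(n_mfcc=40, n_mels=80)
>>> print(dct_mat.shape)
torch.Size([80, 40])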

static get_mel_filters(sr: int, n_fft: int, n_mels: int, fmin: float = 0.0, fmax: float = None)[source]

Create a Filterbank matrix to combine FFT bins into Mel-frequency bins.

Parameters
  • sr (int) – Sample rate of audio

  • n_fft (int) – Number of FFT bins

  • n_mels (int) – Number of mels

  • fmin (float, optional) – Lowest frequency, in Hz, by default 0.0

  • fmax (float, optional) – Highest frequency, by default None

Returns

Mel transform matrix

Return type

np.ndarray [shape=(n_mels, 1 + n_fft/2)]
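
Examples

A minimal sketch of building a mel filterbank for a 2048-point FFT at 44.1 kHz; the shape follows the documented (n_mels, 1 + n_fft/2) convention:

>>> filters = AudioSignal.get_mel_filters(sr=44100, n_fft=2048, n_mels=80)
>>> print(filters.shape)
(80, 1025)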

static get_window(window_type: str, window_length: int, device: str)[source]

Wrapper around scipy.signal.get_window so one can also get the popular sqrt-hann window. This function caches for efficiency using functools.lru_cache.

Parameters
  • window_type (str) – Type of window to get

  • window_length (int) – Length of the window

  • device (str) – Device to put window onto.

Returns

Window returned by scipy.signal.get_window, as a tensor.

Return type

torch.Tensor
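
Examples

A sketch of fetching the sqrt-hann window mentioned above as a tensor on the CPU:

>>> window = AudioSignal.get_window("sqrt_hann", 2048, "cpu")
>>> print(window.shape)
torch.Size([2048])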

hash()[source]

Writes the audio data to a temporary file, and then hashes it using hashlib. Useful for creating a file name based on the audio content.

Returns

Hash of audio data.

Return type

str

Examples

Creating a signal, and writing it to a unique file name:

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> hash = signal.hash()
>>> signal.write(f"{hash}.wav")

istft(window_length: Optional[int] = None, hop_length: Optional[int] = None, window_type: Optional[str] = None, match_stride: Optional[bool] = None, length: Optional[int] = None)[source]

Computes inverse STFT and sets it to audio_data.

Parameters
  • window_length (int, optional) – Window length of STFT, by default 0.032 * self.sample_rate.

  • hop_length (int, optional) – Hop length of STFT, by default window_length // 4.

  • window_type (str, optional) – Type of window to use, by default sqrt_hann.

  • match_stride (bool, optional) – Whether to match the stride of convolutional layers, by default False

  • length (int, optional) – Original length of signal, by default None

Returns

AudioSignal with istft applied.

Return type

AudioSignal

Raises

RuntimeError – Raises an error if stft was not called prior to istft on the signal, or if stft_data is not set.
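
Examples

A round-trip sketch; the STFT must be computed before istft can be called:

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.stft()
>>> signal.istft()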

property length

Length of audio signal.

Returns

Length of signal in samples.

Return type

int

load_from_array(audio_array: Union[Tensor, ndarray], sample_rate: int, device: str = 'cpu')[source]

Loads data from array, reshaping it to be exactly 3 dimensions. Used internally when AudioSignal is called with a tensor or an array.

Parameters
  • audio_array (Union[torch.Tensor, np.ndarray]) – Array/tensor of audio of samples.

  • sample_rate (int) – Sample rate of audio

  • device (str, optional) – Device to move audio onto, by default “cpu”

Returns

AudioSignal loaded from array

Return type

AudioSignal

load_from_file(audio_path: Union[str, Path], offset: float, duration: float, device: str = 'cpu')[source]

Loads data from file. Used internally when AudioSignal is instantiated with a path to a file.

Parameters
  • audio_path (Union[str, Path]) – Path to file

  • offset (float) – Offset in seconds

  • duration (float) – Duration in seconds

  • device (str, optional) – Device to put AudioSignal on, by default “cpu”

Returns

AudioSignal loaded from file

Return type

AudioSignal

log_magnitude(ref_value: float = 1.0, amin: float = 1e-05, top_db: float = 80.0)[source]

Computes the log-magnitude of the spectrogram.

Parameters
  • ref_value (float, optional) – The magnitude is scaled relative to ref: 20 * log10(S / ref). Zeros in the output correspond to positions where S == ref, by default 1.0

  • amin (float, optional) – Minimum threshold for S and ref, by default 1e-5

  • top_db (float, optional) – Threshold the output at top_db below the peak: max(20 * log10(S/ref)) - top_db, by default 80.0

Returns

Log-magnitude spectrogram

Return type

torch.Tensor
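
Examples

A sketch of computing a log-magnitude spectrogram, computing the STFT first:

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.stft()
>>> log_mag = signal.log_magnitude()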

property magnitude

Computes and returns the absolute value of the STFT, which is the magnitude. This value can also be set to some tensor. When set, self.stft_data is manipulated so that its magnitude matches what this is set to, and modulated by the phase.

Returns

Magnitude of STFT.

Return type

torch.Tensor

Examples

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> magnitude = signal.magnitude # Computes stft if not computed
>>> magnitude[magnitude < magnitude.mean()] = 0
>>> signal.magnitude = magnitude
>>> signal.istft()

markdown()[source]

Produces a markdown representation of AudioSignal, in a markdown table.

Returns

Markdown representation of AudioSignal.

Return type

str

Examples

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> print(signal.markdown())
| Key | Value
|---|---
| duration | 1.000 seconds |
| batch_size | 1 |
| path | path unknown |
| sample_rate | 44100 |
| num_channels | 1 |
| audio_data.shape | torch.Size([1, 1, 44100]) |
| stft_params | STFTParams(window_length=2048, hop_length=512, window_type='sqrt_hann', match_stride=False) |
| device | cpu |

mel_spectrogram(n_mels: int = 80, mel_fmin: float = 0.0, mel_fmax: Optional[float] = None, **kwargs)[source]

Computes a Mel spectrogram.

Parameters
  • n_mels (int, optional) – Number of mels, by default 80

  • mel_fmin (float, optional) – Lowest frequency, in Hz, by default 0.0

  • mel_fmax (float, optional) – Highest frequency, by default None

  • kwargs (dict, optional) – Keyword arguments to self.stft().

Returns

Mel spectrogram.

Return type

torch.Tensor [shape=(batch, channels, mels, time)]
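
Examples

A sketch of computing an 80-band mel spectrogram from one second of random audio:

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> mels = signal.mel_spectrogram(n_mels=80)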

mfcc(n_mfcc: int = 40, n_mels: int = 80, log_offset: float = 1e-06, **kwargs)[source]

Computes mel-frequency cepstral coefficients (MFCCs).

Parameters
  • n_mfcc (int, optional) – Number of MFCCs, by default 40

  • n_mels (int, optional) – Number of mels, by default 80

  • log_offset (float, optional) – Small value to prevent numerical issues when trying to compute log(0), by default 1e-6

  • kwargs (dict, optional) – Keyword arguments to self.mel_spectrogram(), note that some of them will be used for self.stft()

Returns

MFCCs.

Return type

torch.Tensor [shape=(batch, channels, mfccs, time)]
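
Examples

A sketch of computing 40 MFCCs from one second of random audio:

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> mfccs = signal.mfcc(n_mfcc=40)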

property num_channels

Number of audio channels.

Returns

Number of audio channels.

Return type

int

numpy()[source]

Detaches self.audio_data, moves to cpu, and converts to numpy.

Returns

Audio data as a numpy array.

Return type

np.ndarray

property path_to_input_file

Path to input file, if it exists. Alias of path_to_file, kept for backwards compatibility.

property phase

Computes and returns the phase of the STFT. This value can also be set to some tensor. When set, self.stft_data is manipulated so that its phase matches what this is set to, keeping the original magnitude.

Returns

Phase of STFT.

Return type

torch.Tensor

Examples

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> phase = signal.phase # Computes stft if not computed
>>> phase[phase < phase.mean()] = 0
>>> signal.phase = phase
>>> signal.istft()

resample(sample_rate: int)[source]

Resamples the audio, using sinc interpolation. This works on both cpu and gpu, and is much faster on gpu.

Parameters

sample_rate (int) – Sample rate to resample to.

Returns

Resampled AudioSignal

Return type

AudioSignal
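
Examples

A sketch of resampling a 44.1 kHz signal down to 16 kHz:

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.resample(16000)
>>> print(signal.sample_rate)
16000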

classmethod salient_excerpt(audio_path: Union[str, Path], loudness_cutoff: Optional[float] = None, num_tries: int = 8, state: Optional[Union[RandomState, int]] = None, **kwargs)[source]

Similar to AudioSignal.excerpt, except it extracts excerpts only if they are above a specified loudness threshold, which is computed via a fast LUFS routine.

Parameters
  • audio_path (Union[str, Path]) – Path to audio file to grab excerpt from.

  • loudness_cutoff (float, optional) – Loudness threshold in dB. Typical values are -40, -60, etc, by default None

  • num_tries (int, optional) – Number of tries to grab an excerpt above the threshold before giving up, by default 8.

  • state (Union[np.random.RandomState, int], optional) – RandomState or seed of random state, by default None

  • kwargs (dict) – Keyword arguments to AudioSignal.excerpt

Returns

AudioSignal containing excerpt.

Return type

AudioSignal

Warning

If num_tries is set to None, salient_excerpt may try forever, which can result in an infinite loop if audio_path does not have any excerpts that are loud enough.

Examples

>>> signal = AudioSignal.salient_excerpt(
        "path/to/audio",
        loudness_cutoff=-40,
        duration=5
    )

property samples

Returns the audio data tensor in the object.

Audio data is always of the shape (batch_size, num_channels, num_samples). If value has less than 3 dims (e.g. is (num_channels, num_samples)), then it will be reshaped to (1, num_channels, num_samples) - a batch size of 1.

Parameters

data (Union[torch.Tensor, np.ndarray]) – Audio data to set.

Returns

Audio samples.

Return type

torch.Tensor

property shape

Shape of audio data.

Returns

Shape of audio data.

Return type

tuple

property signal_duration

Length of audio signal in seconds.

Returns

Length of signal in seconds.

Return type

float

property signal_length

Length of audio signal.

Returns

Length of signal in samples.

Return type

int

stft(window_length: Optional[int] = None, hop_length: Optional[int] = None, window_type: Optional[str] = None, match_stride: Optional[bool] = None, padding_type: Optional[str] = None)[source]

Computes the short-time Fourier transform of the audio data, with specified STFT parameters.

Parameters
  • window_length (int, optional) – Window length of STFT, by default 0.032 * self.sample_rate.

  • hop_length (int, optional) – Hop length of STFT, by default window_length // 4.

  • window_type (str, optional) – Type of window to use, by default sqrt_hann.

  • match_stride (bool, optional) – Whether to match the stride of convolutional layers, by default False

  • padding_type (str, optional) – Type of padding to use, by default ‘reflect’

Returns

STFT of audio data.

Return type

torch.Tensor

Examples

Compute the STFT of an AudioSignal:

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.stft()

Vary the window and hop length:

>>> stft_params = [STFTParams(128, 32), STFTParams(512, 128)]
>>> for stft_param in stft_params:
>>>     signal.stft_params = stft_param
>>>     signal.stft()

property stft_data

Returns the STFT data inside the signal. Shape is (batch, channels, frequencies, time).

Returns

Complex spectrogram data.

Return type

torch.Tensor

property stft_params

Returns the STFTParams object, which can be re-used for other AudioSignals.

This property can be set as well. If values are not defined in STFTParams, they are inferred automatically from the signal properties. The default is to use 32ms windows, with 8ms hop length, and the square root of the hann window.

Returns

STFT parameters for the AudioSignal.

Return type

STFTParams

Examples

>>> stft_params = STFTParams(128, 32)
>>> signal1 = AudioSignal(torch.randn(44100), 44100, stft_params=stft_params)
>>> signal2 = AudioSignal(torch.randn(44100), 44100, stft_params=signal1.stft_params)
>>> signal1.stft_params = STFTParams() # Defaults

to(device: str)[source]

Moves all tensors contained in signal to the specified device.

Parameters

device (str) – Device to move AudioSignal onto. Typical values are “cuda”, “cpu”, or “cuda:n” to specify the nth gpu.

Returns

AudioSignal with all tensors moved to specified device.

Return type

AudioSignal

to_mono()[source]

Converts audio data to mono audio, by taking the mean along the channels dimension.

Returns

AudioSignal with mean of channels.

Return type

AudioSignal
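
Examples

A sketch of downmixing a two-channel signal to mono:

>>> signal = AudioSignal(torch.randn(2, 44100), 44100)
>>> signal.to_mono()
>>> print(signal.num_channels)
1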

trim(before: int, after: int)[source]

Trims the audio_data tensor before and after.

Parameters
  • before (int) – How many samples to trim from beginning.

  • after (int) – How many samples to trim from end.

Returns

AudioSignal with trimming applied.

Return type

AudioSignal

truncate_samples(length_in_samples: int)[source]

Truncate signal to specified length.

Parameters

length_in_samples (int) – Truncate to this many samples.

Returns

AudioSignal with truncation applied.

Return type

AudioSignal

classmethod wave(frequency: float, duration: float, sample_rate: int, num_channels: int = 1, shape: str = 'sine', **kwargs)[source]

Generate a waveform of a given frequency and shape.

Parameters
  • frequency (float) – Frequency of the waveform

  • duration (float) – Duration of the waveform

  • sample_rate (int) – Sample rate of the waveform

  • num_channels (int, optional) – Number of channels, by default 1

  • shape (str, optional) – Shape of the waveform, one of “sawtooth”, “square”, “sine”, “triangle”, by default “sine”

  • kwargs (dict) – Keyword arguments to AudioSignal
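
Examples

A sketch of generating one second of a 440 Hz sine tone:

>>> signal = AudioSignal.wave(440.0, 1.0, 44100, shape="sine")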

write(audio_path: Union[str, Path])[source]

Writes audio to a file. Only writes the audio that is in the very first item of the batch. To write other items in the batch, index the signal along the batch dimension before writing. After writing, the signal’s path_to_file attribute is updated to the new path.

Parameters

audio_path (Union[str, Path]) – Path to write audio to.

Returns

Returns original AudioSignal, so you can use this in a fluent interface.

Return type

AudioSignal

Examples

Creating and writing a signal to disk:

>>> signal = AudioSignal(torch.randn(10, 1, 44100), 44100)
>>> signal.write("/tmp/out.wav")

Writing a different element of the batch:

>>> signal[5].write("/tmp/out.wav")

Using this in a fluent interface:

>>> signal.write("/tmp/original.wav").low_pass(4000).write("/tmp/lowpass.wav")

zero_pad(before: int, after: int)[source]

Zero pads the audio_data tensor before and after.

Parameters
  • before (int) – How many zeros to prepend to audio.

  • after (int) – How many zeros to append to audio.

Returns

AudioSignal with padding applied.

Return type

AudioSignal

zero_pad_to(length: int, mode: str = 'after')[source]

Pad with zeros to a specified length, either before or after the audio data.

Parameters
  • length (int) – Length to pad to

  • mode (str, optional) – Whether to prepend or append zeros to signal, by default “after”

Returns

AudioSignal with padding applied.

Return type

AudioSignal

classmethod zeros(duration: float, sample_rate: int, num_channels: int = 1, batch_size: int = 1, **kwargs)[source]

Helper function to create an AudioSignal of all zeros.

Parameters
  • duration (float) – Duration of AudioSignal

  • sample_rate (int) – Sample rate of AudioSignal

  • num_channels (int, optional) – Number of channels, by default 1

  • batch_size (int, optional) – Batch size, by default 1

Returns

AudioSignal containing all zeros.

Return type

AudioSignal

Examples

Generate 5 seconds of all zeros at a sample rate of 44100.

>>> signal = AudioSignal.zeros(5.0, 44100)

class audiotools.core.audio_signal.STFTParams(window_length, hop_length, window_type, match_stride, padding_type)

Bases: tuple

STFTParams object is a container that holds STFT parameters - window_length, hop_length, window_type, match_stride, and padding_type. Not all parameters need to be specified. Ones that are not specified will be inferred from the AudioSignal parameters.

Parameters
  • window_length (int, optional) – Window length of STFT, by default 0.032 * self.sample_rate.

  • hop_length (int, optional) – Hop length of STFT, by default window_length // 4.

  • window_type (str, optional) – Type of window to use, by default sqrt_hann.

  • match_stride (bool, optional) – Whether to match the stride of convolutional layers, by default False

  • padding_type (str, optional) – Type of padding to use, by default ‘reflect’

hop_length

Alias for field number 1

match_stride

Alias for field number 3

padding_type

Alias for field number 4

window_length

Alias for field number 0

window_type

Alias for field number 2

Displaying and visualizing

class audiotools.core.display.DisplayMixin[source]

Bases: object

save_image(image_path: str, plot_fn: Union[Callable, str] = 'specshow', **kwargs)[source]

Save AudioSignal spectrogram (or whatever plot_fn is set to) to a specified file.

Parameters
  • image_path (str) – Where to save the file to.

  • plot_fn (Union[Callable, str], optional) – How to create the image. Set to None to avoid plotting, by default “specshow”

  • kwargs (dict, optional) – Keyword arguments to audiotools.core.display.DisplayMixin.specshow() or whatever plot_fn is set to.

specshow(preemphasis: bool = False, x_axis: str = 'time', y_axis: str = 'linear', n_mels: int = 128, **kwargs)[source]

Displays a spectrogram, using librosa.display.specshow.

Parameters
  • preemphasis (bool, optional) – Whether or not to apply preemphasis, which makes high frequency detail easier to see, by default False

  • x_axis (str, optional) – How to label the x axis, by default “time”

  • y_axis (str, optional) – How to label the y axis, by default “linear”

  • n_mels (int, optional) – If displaying a mel spectrogram with y_axis = "mel", this controls the number of mels, by default 128.

  • kwargs (dict, optional) – Keyword arguments to audiotools.core.util.format_figure().

waveplot(x_axis: str = 'time', **kwargs)[source]

Displays a waveform plot, using librosa.display.waveshow.

Parameters
  • x_axis (str, optional) – How to label the x axis, by default “time”

  • kwargs (dict, optional) – Keyword arguments to the underlying plotting and formatting functions.

wavespec(x_axis: str = 'time', **kwargs)[source]

Displays a waveform plot above its corresponding spectrogram, using librosa.display.waveshow and specshow.

Parameters
  • x_axis (str, optional) – How to label the x axis, by default “time”

  • kwargs (dict, optional) – Keyword arguments to audiotools.core.display.DisplayMixin.specshow().

write_audio_to_tb(tag: str, writer, step: Optional[int] = None, plot_fn: Union[Callable, str] = 'specshow', **kwargs)[source]

Writes a signal and its spectrogram to Tensorboard. Will show up under the Audio and Images tabs in Tensorboard.

Parameters
  • tag (str) – Tag to write signal to (e.g. clean/sample_0.wav). The image will be written to the corresponding .png file (e.g. clean/sample_0.png).

  • writer (SummaryWriter) – A SummaryWriter object from PyTorch library.

  • step (int, optional) – The step to write the signal to, by default None

  • plot_fn (Union[Callable, str], optional) – How to create the image. Set to None to avoid plotting, by default “specshow”

  • kwargs (dict, optional) – Keyword arguments to audiotools.core.display.DisplayMixin.specshow() or whatever plot_fn is set to.

audiotools.core.display.format_figure(func)[source]

Decorator for formatting figures produced by the decorated plotting function. See audiotools.core.util.format_figure() for more.

Parameters

func (Callable) – Plotting function that is decorated by this function.

Digital signal processing

class audiotools.core.dsp.DSPMixin[source]

Bases: object

collect_windows(window_duration: float, hop_duration: float, preprocess: bool = True)[source]

Reshapes the signal into windows of a specified duration, with a specified hop length between windows. Windows are placed along the batch dimension. Use with audiotools.core.dsp.DSPMixin.overlap_and_add() to reconstruct the original signal.

Parameters
  • window_duration (float) – Duration of every window in seconds.

  • hop_duration (float) – Hop between windows in seconds.

  • preprocess (bool, optional) – Whether to preprocess the signal, so that the first sample is in the middle of the first window, by default True

Returns

AudioSignal unfolded with shape (nb * nch * num_windows, 1, window_length)

Return type

AudioSignal
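
Examples

A sketch of windowing a two-second signal and reconstructing it with overlap_and_add, as described above:

>>> signal = AudioSignal(torch.randn(2 * 44100), 44100)
>>> windowed = signal.collect_windows(0.5, 0.25)
>>> reconstructed = windowed.overlap_and_add(0.25)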

corrupt_phase(scale: Union[Tensor, ndarray, float])[source]

Corrupts the phase randomly by some scaled value.

Parameters

scale (Union[torch.Tensor, np.ndarray, float]) – Standard deviation of noise to add to the phase.

Returns

Signal with stft_data manipulated. Apply .istft() to get the masked audio data.

Return type

AudioSignal

high_pass(cutoffs: Union[Tensor, ndarray, float], zeros: int = 51)[source]

High-passes the signal in-place. Each item in the batch can have a different high-pass cutoff, if the input to this signal is an array or tensor. If a float, all items are given the same high-pass filter.

Parameters
  • cutoffs (Union[torch.Tensor, np.ndarray, float]) – Cutoff in Hz of high-pass filter.

  • zeros (int, optional) – Number of taps to use in high-pass filter, by default 51

Returns

High-passed AudioSignal.

Return type

AudioSignal

low_pass(cutoffs: Union[Tensor, ndarray, float], zeros: int = 51)[source]

Low-passes the signal in-place. Each item in the batch can have a different low-pass cutoff, if the input to this signal is an array or tensor. If a float, all items are given the same low-pass filter.

Parameters
  • cutoffs (Union[torch.Tensor, np.ndarray, float]) – Cutoff in Hz of low-pass filter.

  • zeros (int, optional) – Number of taps to use in low-pass filter, by default 51

Returns

Low-passed AudioSignal.

Return type

AudioSignal
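
Examples

A sketch of low-passing one second of noise at 4 kHz:

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.low_pass(4000)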

mask_frequencies(fmin_hz: Union[Tensor, ndarray, float], fmax_hz: Union[Tensor, ndarray, float], val: float = 0.0)[source]

Masks frequencies between fmin_hz and fmax_hz, and fills them with the value specified by val. Useful for implementing SpecAug. The min and max can be different for every item in the batch.

Parameters
  • fmin_hz (Union[torch.Tensor, np.ndarray, float]) – Lower end of band to mask out.

  • fmax_hz (Union[torch.Tensor, np.ndarray, float]) – Upper end of band to mask out.

  • val (float, optional) – Value to fill in, by default 0.0

Returns

Signal with stft_data manipulated. Apply .istft() to get the masked audio data.

Return type

AudioSignal
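
Examples

A SpecAug-style sketch, masking the band between 1 kHz and 2 kHz and converting back to audio:

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.stft()
>>> signal.mask_frequencies(fmin_hz=1000, fmax_hz=2000)
>>> signal.istft()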

mask_low_magnitudes(db_cutoff: Union[Tensor, ndarray, float], val: float = 0.0)[source]

Mask away magnitudes below a specified threshold, which can be different for every item in the batch.

Parameters
  • db_cutoff (Union[torch.Tensor, np.ndarray, float]) – Decibel value for which things below it will be masked away.

  • val (float, optional) – Value to fill in for masked portions, by default 0.0

Returns

Signal with stft_data manipulated. Apply .istft() to get the masked audio data.

Return type

AudioSignal

mask_timesteps(tmin_s: Union[Tensor, ndarray, float], tmax_s: Union[Tensor, ndarray, float], val: float = 0.0)[source]

Masks timesteps between tmin_s and tmax_s, and fills them with the value specified by val. Useful for implementing SpecAug. The min and max can be different for every item in the batch.

Parameters
  • tmin_s (Union[torch.Tensor, np.ndarray, float]) – Lower end of timesteps to mask out.

  • tmax_s (Union[torch.Tensor, np.ndarray, float]) – Upper end of timesteps to mask out.

  • val (float, optional) – Value to fill in, by default 0.0

Returns

Signal with stft_data manipulated. Apply .istft() to get the masked audio data.

Return type

AudioSignal

overlap_and_add(hop_duration: float)[source]

Takes a signal that has been split into windows (e.g. by audiotools.core.dsp.DSPMixin.collect_windows()) and overlap-adds the windows back into a signal of the same length as the original audio signal.

Parameters

hop_duration (float) – How much to shift for each window (overlap is window_duration - hop_duration) in seconds.

Returns

overlap-and-added signal.

Return type

AudioSignal

preemphasis(coef: float = 0.85)[source]

Applies pre-emphasis to audio signal.

Parameters

coef (float, optional) – How much pre-emphasis to apply, lower values do less. 0 does nothing. by default 0.85

Returns

Pre-emphasized signal.

Return type

AudioSignal

shift_phase(shift: Union[Tensor, ndarray, float])[source]

Shifts the phase by a constant value.

Parameters

shift (Union[torch.Tensor, np.ndarray, float]) – What to shift the phase by.

Returns

Signal with stft_data manipulated. Apply .istft() to get the masked audio data.

Return type

AudioSignal

windows(window_duration: float, hop_duration: float, preprocess: bool = True)[source]

Generator which yields windows of specified duration from signal with a specified hop length.

Parameters
  • window_duration (float) – Duration of every window in seconds.

  • hop_duration (float) – Hop between windows in seconds.

  • preprocess (bool, optional) – Whether to preprocess the signal, so that the first sample is in the middle of the first window, by default True

Yields

AudioSignal – Each window is returned as an AudioSignal.

Audio effects

class audiotools.core.effects.EffectMixin[source]

Bases: object

CODEC_PRESETS = {'8-bit': {'bits_per_sample': 8, 'encoding': 'ULAW', 'format': 'wav'}, 'Amr-nb': {'format': 'amr-nb'}, 'GSM-FR': {'format': 'gsm'}, 'MP3': {'compression': -9, 'format': 'mp3'}, 'Ogg': {'compression': -1, 'format': 'ogg'}, 'Vorbis': {'compression': -1, 'format': 'vorbis'}}

Presets for applying codecs via torchaudio.

GAIN_FACTOR = 0.11512925464970229

Gain factor for converting between amplitude and decibels.

apply_codec(preset: Optional[str] = None, format: str = 'wav', encoding: Optional[str] = None, bits_per_sample: Optional[int] = None, compression: Optional[int] = None)[source]

Applies an audio codec to the signal.

Parameters
  • preset (str, optional) – One of the keys in self.CODEC_PRESETS, by default None

  • format (str, optional) – Format for audio codec, by default “wav”

  • encoding (str, optional) – Encoding to use, by default None

  • bits_per_sample (int, optional) – How many bits per sample, by default None

  • compression (int, optional) – Compression amount of codec, by default None

Returns

AudioSignal with codec applied.

Return type

AudioSignal

Raises

ValueError – If preset is not in self.CODEC_PRESETS, an error is thrown.
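
Examples

A sketch of applying the “MP3” preset from CODEC_PRESETS (assumes a torchaudio backend with MP3 support is installed):

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.apply_codec("MP3")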

apply_ir(ir, drr: Optional[Union[Tensor, ndarray, float]] = None, ir_eq: Optional[Union[Tensor, ndarray]] = None, use_original_phase: bool = False)[source]

Applies an impulse response to the signal. If ir_eq is specified, the impulse response is equalized before it is applied, using the given curve.

Parameters
  • ir (AudioSignal) – Impulse response to convolve with.

  • drr (Union[torch.Tensor, np.ndarray, float], optional) – Direct-to-reverberant ratio that impulse response will be altered to, if specified, by default None

  • ir_eq (Union[torch.Tensor, np.ndarray], optional) – Equalization that will be applied to impulse response if specified, by default None

  • use_original_phase (bool, optional) – Whether to use the original phase, instead of the convolved phase, by default False

Returns

Signal with impulse response applied to it

Return type

AudioSignal

clip_distortion(clip_percentile: Union[Tensor, ndarray, float])[source]

Clips the signal at a given percentile. The higher it is, the lower the threshold for clipping.

Parameters

clip_percentile (Union[torch.Tensor, np.ndarray, float]) – Values are between 0.0 to 1.0. Typical values are 0.1 or below.

Returns

Audio signal with clipped audio data.

Return type

AudioSignal

convolve(other, start_at_max: bool = True)[source]

Convolves self with other. This function uses FFTs to do the convolution.

Parameters
  • other (AudioSignal) – Signal to convolve with.

  • start_at_max (bool, optional) – Whether to start at the max value of the other signal, to avoid inducing delays, by default True

Returns

Convolved signal, in-place.

Return type

AudioSignal

ensure_max_of_audio(max: float = 1.0)[source]

Ensures that abs(audio_data) <= max.

Parameters

max (float, optional) – Max absolute value of signal, by default 1.0

Returns

Signal with values scaled between -max and max.

Return type

AudioSignal

equalizer(db: Union[Tensor, ndarray])[source]

Applies a mel-spaced equalizer to the audio signal.

Parameters

db (Union[torch.Tensor, np.ndarray]) – EQ curve to apply.

Returns

AudioSignal with equalization applied.

Return type

AudioSignal

mel_filterbank(n_bands: int)[source]

Breaks signal into mel bands.

Parameters

n_bands (int) – Number of mel bands to use.

Returns

Mel-filtered bands, with last axis being the band index.

Return type

torch.Tensor

mix(other, snr: Union[Tensor, ndarray, float] = 10, other_eq: Optional[Union[Tensor, ndarray]] = None)[source]

Mixes noise with signal at specified signal-to-noise ratio. Optionally, the other signal can be equalized in-place.

Parameters
  • other (AudioSignal) – AudioSignal object to mix with.

  • snr (Union[torch.Tensor, np.ndarray, float], optional) – Signal to noise ratio, by default 10

  • other_eq (Union[torch.Tensor, np.ndarray], optional) – EQ curve to apply to other signal, if any, by default None

Returns

In-place modification of AudioSignal.

Return type

AudioSignal
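
Examples

A sketch of mixing a signal with noise at 10 dB SNR (both signals share a sample rate):

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> noise = AudioSignal(torch.randn(44100), 44100)
>>> signal.mix(noise, snr=10)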

mulaw_quantization(quantization_channels: Union[Tensor, ndarray, int])[source]

Applies mu-law quantization to the input waveform.

Parameters

quantization_channels (Union[torch.Tensor, np.ndarray, int]) – Number of mu-law spaced quantization channels to quantize to.

Returns

Quantized AudioSignal.

Return type

AudioSignal

normalize(db: Union[Tensor, ndarray, float] = -24.0)[source]

Normalizes the signal’s volume to the specified db, in LUFS. This is GPU-compatible, making for very fast loudness normalization.

Parameters

db (Union[torch.Tensor, np.ndarray, float], optional) – Loudness to normalize to, by default -24.0

Returns

Normalized audio signal.

Return type

AudioSignal
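
Examples

A sketch of normalizing a signal to -24 dB LUFS:

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.normalize(-24)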

pitch_shift(n_semitones: int, quick: bool = True)[source]

Pitch shift the signal. All items in the batch get the same pitch shift.

Parameters
  • n_semitones (int) – How many semitones to shift the signal by.

  • quick (bool, optional) – Using quick pitch shifting, by default True

Returns

Pitch shifted audio signal.

Return type

AudioSignal

quantization(quantization_channels: Union[Tensor, ndarray, int])[source]

Applies quantization to the input waveform.

Parameters

quantization_channels (Union[torch.Tensor, np.ndarray, int]) – Number of evenly spaced quantization channels to quantize to.

Returns

Quantized AudioSignal.

Return type

AudioSignal

time_stretch(factor: float, quick: bool = True)[source]

Time stretch the audio signal.

Parameters
  • factor (float) – Factor by which to stretch the AudioSignal. Typically between 0.8 and 1.2.

  • quick (bool, optional) – Whether to use quick time stretching, by default True

Returns

Time-stretched AudioSignal.

Return type

AudioSignal

volume_change(db: Union[Tensor, ndarray, float])[source]

Change volume of signal by some amount, in dB.

Parameters

db (Union[torch.Tensor, np.ndarray, float]) – Amount to change volume by.

Returns

Signal at new volume.

Return type

AudioSignal

class audiotools.core.effects.ImpulseResponseMixin[source]

Bases: object

These functions are generally only used with AudioSignals that are derived from impulse responses, not other sources like music or speech. These methods are used to replicate the data augmentation described in [1].

  1. Bryan, Nicholas J. “Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation.” ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.

alter_drr(drr: Union[Tensor, ndarray, float])[source]

Alters the direct-to-reverberant ratio of the impulse response.

Parameters

drr (Union[torch.Tensor, np.ndarray, float]) – Direct-to-reverberant ratio that the impulse response will be altered to.

Returns

Altered impulse response.

Return type

AudioSignal

decompose_ir()[source]

Decomposes an impulse response into early and late field responses.

measure_drr()[source]

Measures the direct-to-reverberant ratio of the impulse response.

Returns

Direct-to-reverberant ratio

Return type

float

static solve_alpha(early_response, late_field, wd, target_drr)[source]

Used to solve for the alpha value, which is used to alter the drr.

FFMPEG routines

class audiotools.core.ffmpeg.FFMPEGMixin[source]

Bases: object

ffmpeg_loudness(quiet: bool = True)[source]

Computes loudness of audio file using FFMPEG.

Parameters

quiet (bool, optional) – Whether to show FFMPEG output during computation, by default True

Returns

Loudness of every item in the batch, computed via FFMPEG.

Return type

torch.Tensor

ffmpeg_resample(sample_rate: int, quiet: bool = True)[source]

Resamples AudioSignal using FFMPEG. More memory-efficient than using julius.resample for long audio files.

Parameters
  • sample_rate (int) – Sample rate to resample to.

  • quiet (bool, optional) – Whether to show FFMPEG output during computation, by default True

Returns

Resampled AudioSignal.

Return type

AudioSignal

classmethod load_from_file_with_ffmpeg(audio_path: str, quiet: bool = True, **kwargs)[source]

Loads AudioSignal object after decoding it to a wav file using FFMPEG. Useful for loading audio that isn’t covered by librosa’s loading mechanism. Also useful for loading mp3 files, without any offset.

Parameters
  • audio_path (str) – Path to load AudioSignal from.

  • quiet (bool, optional) – Whether to show FFMPEG output during computation, by default True

Returns

AudioSignal loaded from file with FFMPEG.

Return type

AudioSignal

audiotools.core.ffmpeg.ffprobe_offset(path)[source]

audiotools.core.ffmpeg.r128stats(filepath: str, quiet: bool)[source]

Takes a path to an audio file, returns a dict with the loudness stats computed by the ffmpeg ebur128 filter.

Parameters
  • filepath (str) – Path to compute loudness stats on.

  • quiet (bool) – Whether to show FFMPEG output during computation.

Returns

Dictionary containing loudness stats.

Return type

dict

Perceptual loudness

class audiotools.core.loudness.LoudnessMixin[source]

Bases: object

MIN_LOUDNESS = -70

Minimum loudness possible.

loudness(filter_class: str = 'K-weighting', block_size: float = 0.4, **kwargs)[source]

Calculates loudness using an implementation of ITU-R BS.1770-4, measuring the integrated gated loudness of a signal. Allows control over the gating block size and the frequency weighting filters.

API is derived from PyLoudnorm, but this implementation is ported to PyTorch and is tensorized across batches. When on GPU, an FIR approximation of the IIR filters is used to compute loudness for speed.

Uses the weighting filters and block size defined by the meter; the integrated loudness is measured based upon the gating algorithm defined in the ITU-R BS.1770-4 specification.

Parameters
  • filter_class (str, optional) – Class of weighting filter used: one of ‘K-weighting’, ‘Fenton/Lee 1’, ‘Fenton/Lee 2’, or ‘Dash et al.’, by default “K-weighting”

  • block_size (float, optional) – Gating block size in seconds, by default 0.400

  • kwargs (dict, optional) – Keyword arguments to audiotools.core.loudness.Meter().

Returns

Loudness of audio data.

Return type

torch.Tensor
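
Examples

A sketch of measuring the integrated loudness of one second of random audio:

>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> lufs = signal.loudness()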

class audiotools.core.loudness.Meter(rate: int, filter_class: str = 'K-weighting', block_size: float = 0.4, zeros: int = 512, use_fir: bool = False)[source]

Bases: Module

Tensorized version of pyloudnorm.Meter. Works with batched audio tensors.

Parameters
  • rate (int) – Sample rate of audio.

  • filter_class (str, optional) – Class of weighting filter used: one of ‘K-weighting’, ‘Fenton/Lee 1’, ‘Fenton/Lee 2’, or ‘Dash et al.’, by default “K-weighting”

  • block_size (float, optional) – Gating block size in seconds, by default 0.400

  • zeros (int, optional) – Number of zeros to use in FIR approximation of IIR filters, by default 512

  • use_fir (bool, optional) – Whether to use the FIR approximation or the exact IIR formulation. When computing on GPU, use_fir=True is used, as it is much faster, by default False

apply_filter(data: Tensor)[source]

Applies the filter on either CPU or GPU, depending on whether the audio is on the GPU or the CPU, or if self.use_fir is True.

Parameters

data (torch.Tensor) – Audio data of shape (nb, nch, nt).

Returns

Filtered audio data.

Return type

torch.Tensor

apply_filter_cpu(data: Tensor)[source]

Performs IIR formulation of loudness computation.

Parameters

data (torch.Tensor) – Audio data of shape (nb, nch, nt).

Returns

Filtered audio data.

Return type

torch.Tensor

apply_filter_gpu(data: Tensor)[source]

Performs FIR approximation of loudness computation.

Parameters

data (torch.Tensor) – Audio data of shape (nb, nch, nt).

Returns

Filtered audio data.

Return type

torch.Tensor

property filter_class

forward(data: Tensor)[source]

Computes integrated loudness of data.

Parameters

data (torch.Tensor) – Audio data of shape (nb, nch, nt).

Returns

Integrated loudness of the audio data.

Return type

torch.Tensor

integrated_loudness(data: Tensor)[source]

Computes integrated loudness of data.

Parameters

data (torch.Tensor) – Audio data of shape (nb, nch, nt).

Returns

Integrated loudness of the audio data.

Return type

torch.Tensor

training: bool

Listening to AudioSignals

These are utilities that allow one to embed an AudioSignal as a playable object in a Jupyter notebook, or to play audio from the terminal, etc.

class audiotools.core.playback.PlayMixin[source]

Bases: object

embed(ext: Optional[str] = None, display: bool = True, return_html: bool = False)[source]

Embeds audio as a playable audio embed in a notebook, or HTML document, etc.

Parameters
  • ext (str, optional) – Extension to use when saving the audio, by default “.wav”

  • display (bool, optional) – This controls whether or not to display the audio when called. This is used when the embed is the last line in a Jupyter cell, to prevent the audio from being embedded twice, by default True

  • return_html (bool, optional) – Whether to return the data wrapped in an HTML audio element, by default False

Returns

Either the element for display, or the HTML string of it.

Return type

str

play()[source]

Plays an audio signal if ffplay from the ffmpeg suite of tools is installed. Otherwise, will fail. The audio signal is written to a temporary file and then played with ffplay.

widget(title: Optional[str] = None, ext: str = '.wav', add_headers: bool = True, player_width: str = '100%', margin: str = '10px', plot_fn: str = 'specshow', return_html: bool = False, **kwargs)[source]

Creates a playable widget with spectrogram. Inspired (heavily) by https://sjvasquez.github.io/blog/melnet/.

Parameters
  • title (str, optional) – Title of plot, placed in upper right of top-most axis.

  • ext (str, optional) – Extension to use when saving the audio for embedding, by default “.wav”

  • add_headers (bool, optional) – Whether or not to add headers (use True for the first embed, False for later embeds), by default True

  • player_width (str, optional) – Width of the player, as a string in a CSS rule, by default “100%”

  • margin (str, optional) – Margin on all sides of player, by default “10px”

  • plot_fn (function, optional) – Plotting function to use (by default self.specshow).

  • return_html (bool, optional) – Whether to return the data wrapped in an HTML audio element, by default False

  • kwargs (dict, optional) – Keyword arguments to plot_fn (by default self.specshow).

Returns

HTML object.

Return type

HTML

Utilities

class audiotools.core.util.Info(sample_rate: float, num_frames: int)[source]

Bases: object

Shim for torchaudio.info API changes.

property duration: float
num_frames: int
sample_rate: float

audiotools.core.util.chdir(newdir: Union[Path, str])[source]

Context manager for switching directories to run a function. Useful for when you want to use relative paths to different runs.

Parameters

newdir (Union[Path, str]) – Directory to switch to.

audiotools.core.util.choose_from_list_of_lists(state: RandomState, list_of_lists: list, p: Optional[float] = None)[source]

Choose a single item from a list of lists.

Parameters
  • state (np.random.RandomState) – Random state to use when choosing an item.

  • list_of_lists (list) – A list of lists from which items will be drawn.

  • p (float, optional) – Probabilities of each list, by default None

Returns

An item from the list of lists.

Return type

Any

audiotools.core.util.collate(list_of_dicts: list, n_splits: Optional[int] = None)[source]

Collates a list of dictionaries (e.g. as returned by a dataloader) into a dictionary with batched values. This routine uses the default torch collate function for everything except AudioSignal objects, which are handled by the audiotools.core.audio_signal.AudioSignal.batch() function.

This function takes n_splits to enable splitting a batch into multiple sub-batches for the purposes of gradient accumulation, etc.

Parameters
  • list_of_dicts (list) – List of dictionaries to be collated.

  • n_splits (int) – Number of splits to make when creating the batches (split into sub-batches). Useful for things like gradient accumulation.

Returns

Dictionary containing batched data.

Return type

dict
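
Examples

A sketch of collating two items, each holding an AudioSignal under a hypothetical "signal" key:

>>> items = [{"signal": AudioSignal(torch.randn(44100), 44100)} for _ in range(2)]
>>> batch = collate(items)
>>> print(batch["signal"].batch_size)
2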

audiotools.core.util.ensure_tensor(x: Union[ndarray, Tensor, float, int], ndim: Optional[int] = None, batch_size: Optional[int] = None)[source]

Ensures that the input x is a tensor of specified dimensions and batch size.

Parameters
  • x (Union[np.ndarray, torch.Tensor, float, int]) – Data that will become a tensor on its way out.

  • ndim (int, optional) – How many dimensions should be in the output, by default None

  • batch_size (int, optional) – The batch size of the output, by default None

Returns

Modified version of x as a tensor.

Return type

torch.Tensor
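
Examples

A sketch of turning a scalar into a batched tensor; the exact broadcast shape depends on ndim and batch_size:

>>> x = ensure_tensor(0.5, ndim=3, batch_size=4)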

audiotools.core.util.find_audio(folder: str, ext: List[str] = ['.wav', '.flac', '.mp3', '.mp4'])[source]

Finds all audio files in a directory recursively. Returns a list.

Parameters
  • folder (str) – Folder to look for audio files in, recursively.

  • ext (List[str], optional) – Extensions to look for (including the leading .), by default ['.wav', '.flac', '.mp3', '.mp4'].

audiotools.core.util.format_figure(fig_size: Optional[tuple] = None, title: Optional[str] = None, fig=None, format_axes: bool = True, format: bool = True, font_color: str = 'white')[source]

Prettifies the spectrogram and waveform plots. A title can be inset into the top right corner, and the axes can be inset into the figure, allowing the data to take up the entire image. Used by the plotting functions in this library.

Parameters
  • fig_size (tuple, optional) – Size of figure, by default (9, 3)

  • title (str, optional) – Title to inset in top right, by default None

  • fig (matplotlib.figure.Figure, optional) – Figure object, if None plt.gcf() will be used, by default None

  • format_axes (bool, optional) – Format the axes to be inside the figure, by default True

  • format (bool, optional) – This formatting can be skipped entirely by passing format=False to any of the plotting functions that use this formatter, by default True

  • font_color (str, optional) – Color of font of axes, by default “white”

audiotools.core.util.generate_chord_dataset(max_voices: int = 8, sample_rate: int = 44100, num_items: int = 5, duration: float = 1.0, min_note: str = 'C2', max_note: str = 'C6', output_dir: Path = 'chords')[source]

Generates a toy multitrack dataset of chords, synthesized from sine waves.

Parameters
  • max_voices (int, optional) – Maximum number of voices in a chord, by default 8

  • sample_rate (int, optional) – Sample rate of audio, by default 44100

  • num_items (int, optional) – Number of items to generate, by default 5

  • duration (float, optional) – Duration of each item, by default 1.0

  • min_note (str, optional) – Minimum note in the dataset, by default “C2”

  • max_note (str, optional) – Maximum note in the dataset, by default “C6”

  • output_dir (Path, optional) – Directory to save the dataset, by default “chords”

audiotools.core.util.hz_to_bin(hz: Tensor, n_fft: int, sample_rate: int)[source]

Closest frequency bin given a frequency, number of bins, and a sampling rate.

Parameters
  • hz (torch.Tensor) – Tensor of frequencies in Hz.

  • n_fft (int) – Number of FFT bins.

  • sample_rate (int) – Sample rate of audio.

Returns

Closest bins to the data.

Return type

torch.Tensor

audiotools.core.util.info(audio_path: str)[source]

Shim for torchaudio.info to make 0.7.2 API match 0.8.0.

Parameters

audio_path (str) – Path to audio file.

audiotools.core.util.prepare_batch(batch: Union[dict, list, Tensor], device: str = 'cpu')[source]

Moves items in a batch (typically generated by a DataLoader as a list or a dict) to the specified device. This works even if dictionaries are nested.

Parameters
  • batch (Union[dict, list, torch.Tensor]) – Batch, typically generated by a dataloader, that will be moved to the device.

  • device (str, optional) – Device to move batch to, by default “cpu”

Returns

Batch with all values moved to the specified device.

Return type

Union[dict, list, torch.Tensor]

audiotools.core.util.random_state(seed: Union[int, RandomState])[source]

Turn seed into a np.random.RandomState instance.

Parameters

seed (Union[int, np.random.RandomState] or None) – If seed is None, return the RandomState singleton used by np.random. If seed is an int, return a new RandomState instance seeded with seed. If seed is already a RandomState instance, return it. Otherwise raise ValueError.

Returns

Random state object.

Return type

np.random.RandomState

Raises

ValueError – If seed is not valid, an error is thrown.

audiotools.core.util.read_sources(sources: List[str], remove_empty: bool = True, relative_path: str = '', ext: List[str] = ['.wav', '.flac', '.mp3', '.mp4'])[source]

Reads audio sources that can either be folders full of audio files, or CSV files that contain paths to audio files. CSV files that adhere to the expected format can be generated by audiotools.data.preprocess.create_csv().

Parameters
  • sources (List[str]) – List of audio sources to be converted into a list of lists of audio files.

  • remove_empty (bool, optional) – Whether or not to remove rows with an empty “path” from each CSV file, by default True.

Returns

List of lists of rows of CSV files.

Return type

list

audiotools.core.util.sample_from_dist(dist_tuple: tuple, state: Optional[RandomState] = None)[source]

Samples from a distribution defined by a tuple. The first item in the tuple is the distribution type, and the rest of the items are arguments to that distribution. The distribution function is gotten from the np.random.RandomState object.

Parameters
  • dist_tuple (tuple) – Distribution tuple

  • state (np.random.RandomState, optional) – Random state, or seed to use, by default None

Returns

Draw from the distribution.

Return type

Union[float, int, str]

Examples

Sample from a uniform distribution:

>>> dist_tuple = ("uniform", 0, 1)
>>> sample_from_dist(dist_tuple)

Sample from a constant distribution:

>>> dist_tuple = ("const", 0)
>>> sample_from_dist(dist_tuple)

Sample from a normal distribution:

>>> dist_tuple = ("normal", 0, 0.5)
>>> sample_from_dist(dist_tuple)

audiotools.core.util.seed(random_seed, set_cudnn=False)[source]

Seeds all random states with the same random seed for reproducibility. Seeds numpy, random and torch random generators. For full reproducibility, two further options must be set according to the torch documentation: https://pytorch.org/docs/stable/notes/randomness.html To do this, set_cudnn must be True. It defaults to False, since setting it to True results in a performance hit.

Parameters
  • random_seed (int) – Integer corresponding to the random seed to use.

  • set_cudnn (bool) – Whether or not to set cudnn into deterministic mode and turn off benchmark mode. Defaults to False.