AudioSignal
Base functionality
- class audiotools.core.audio_signal.AudioSignal(audio_path_or_array: Union[Tensor, str, Path, ndarray], sample_rate: Optional[int] = None, stft_params: Optional[STFTParams] = None, offset: float = 0, duration: Optional[float] = None, device: Optional[str] = None)[source]
Bases: EffectMixin, LoudnessMixin, PlayMixin, ImpulseResponseMixin, DSPMixin, DisplayMixin, FFMPEGMixin, WhisperMixin
This is the core object of this library. Audio is always loaded into an AudioSignal, which then enables all the features of this library, including audio augmentations, I/O, playback, and more.
The structure of this object is that the base functionality is defined in core/audio_signal.py, while extensions to that functionality are defined in the other core/*.py files. For example, all the display-based functionality (e.g. plotting spectrograms and waveforms, writing to TensorBoard) is in core/display.py.
- Parameters
audio_path_or_array (Union[torch.Tensor, str, Path, np.ndarray]) – Object to create AudioSignal from. Can be a tensor, numpy array, or a path to a file. The audio data is always reshaped to (batch_size, num_channels, num_samples).
sample_rate (int, optional) – Sample rate of the audio. If different from underlying file, resampling is performed. If passing in an array or tensor, this must be defined, by default None
stft_params (STFTParams, optional) – Parameters of STFT to use, by default None
offset (float, optional) – Offset in seconds to read from file, by default 0
duration (float, optional) – Duration in seconds to read from file, by default None
device (str, optional) – Device to load audio onto, by default None
Examples
Loading an AudioSignal from an array, at a sample rate of 44100.
>>> signal = AudioSignal(torch.randn(5*44100), 44100)
Note, the signal is reshaped to have a batch size, and one audio channel:
>>> print(signal.shape)
(1, 1, 220500)
You can treat AudioSignals like tensors, and many of the same functions you might use on tensors are defined for AudioSignals as well:
>>> signal.to("cuda")
>>> signal.cuda()
>>> signal.clone()
>>> signal.detach()
Indexing AudioSignals returns an AudioSignal:
>>> signal[..., 3*44100:4*44100]
The above signal is 1 second long, and is also an AudioSignal.
- property audio_data
Returns the audio data tensor in the object.
Audio data is always of the shape (batch_size, num_channels, num_samples). If the value has fewer than 3 dims (e.g. is (num_channels, num_samples)), then it will be reshaped to (1, num_channels, num_samples) - a batch size of 1.
- Parameters
data (Union[torch.Tensor, np.ndarray]) – Audio data to set.
- Returns
Audio samples.
- Return type
torch.Tensor
- classmethod batch(audio_signals: list, pad_signals: bool = False, truncate_signals: bool = False, resample: bool = False, dim: int = 0)[source]
Creates a batched AudioSignal from a list of AudioSignals.
- Parameters
audio_signals (list[AudioSignal]) – List of AudioSignal objects
pad_signals (bool, optional) – Whether to pad signals to length of the maximum length AudioSignal in the list, by default False
truncate_signals (bool, optional) – Whether to truncate signals to length of shortest length AudioSignal in the list, by default False
resample (bool, optional) – Whether to resample AudioSignal to the sample rate of the first AudioSignal in the list, by default False
dim (int, optional) – Dimension along which to batch the signals, by default 0.
- Returns
Batched AudioSignal.
- Return type
- Raises
RuntimeError – If not all AudioSignals have the same sample rate and resample=False, an error is raised.
RuntimeError – If not all AudioSignals are the same length and both pad_signals=False and truncate_signals=False, an error is raised.
Examples
Batching a bunch of random signals:
>>> signal_list = [AudioSignal(torch.randn(44100), 44100) for _ in range(10)]
>>> signal = AudioSignal.batch(signal_list)
>>> print(signal.shape)
(10, 1, 44100)
- property batch_size
Batch size of audio signal.
- Returns
Batch size of signal.
- Return type
int
- clone()[source]
Clones all tensors contained in the AudioSignal, and returns a copy of the signal with everything cloned. Useful when using AudioSignal within autograd computation graphs.
Relevant attributes are the stft data, the audio data, and the loudness of the file.
- Returns
Clone of AudioSignal.
- Return type
- compute_stft_padding(window_length: int, hop_length: int, match_stride: bool)[source]
Compute how the STFT should be padded, based on match_stride.
- Parameters
window_length (int) – Window length of STFT.
hop_length (int) – Hop length of STFT.
match_stride (bool) – Whether or not to match stride, making the STFT have the same alignment as convolutional layers.
- Returns
Amount to pad on either side of audio.
- Return type
tuple
- deepcopy()[source]
Copies the signal and all of its attributes.
- Returns
Deep copy of the audio signal.
- Return type
- detach()[source]
Detaches tensors contained in AudioSignal.
Relevant attributes are the stft data, the audio data, and the loudness of the file.
- Returns
Same signal, but with all tensors detached.
- Return type
- property device
Get device that AudioSignal is on.
- Returns
Device that AudioSignal is on.
- Return type
torch.device
- property duration
Length of audio signal in seconds.
- Returns
Length of signal in seconds.
- Return type
float
- classmethod excerpt(audio_path: Union[str, Path], offset: Optional[float] = None, duration: Optional[float] = None, state: Optional[Union[RandomState, int]] = None, **kwargs)[source]
Randomly draw an excerpt of duration seconds from an audio file specified at audio_path, between offset seconds and the end of the file. state can be used to seed the random draw.
- Parameters
audio_path (Union[str, Path]) – Path to audio file to grab excerpt from.
offset (float, optional) – Lower bound for the start time of the excerpt, in seconds, by default None.
duration (float, optional) – Duration of excerpt, in seconds, by default None
state (Union[np.random.RandomState, int], optional) – RandomState or seed of random state, by default None
- Returns
AudioSignal containing excerpt.
- Return type
Examples
>>> signal = AudioSignal.excerpt("path/to/audio", duration=5)
- static get_dct(n_mfcc: int, n_mels: int, norm: str = 'ortho', device: str = None)[source]
Create a discrete cosine transform (DCT) transformation matrix with shape (n_mels, n_mfcc); it can be normalized depending on norm. For more information about the DCT, see: http://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II
- Parameters
n_mfcc (int) – Number of mfccs
n_mels (int) – Number of mels
norm (str) – Use “ortho” to get an orthogonal matrix, or None, by default “ortho”
device (str, optional) – Device to load the transformation matrix on, by default None
- Returns
The dct transformation matrix.
- Return type
torch.Tensor [shape=(n_mels, n_mfcc)]
- static get_mel_filters(sr: int, n_fft: int, n_mels: int, fmin: float = 0.0, fmax: float = None)[source]
Create a Filterbank matrix to combine FFT bins into Mel-frequency bins.
- Parameters
sr (int) – Sample rate of audio
n_fft (int) – Number of FFT bins
n_mels (int) – Number of mels
fmin (float, optional) – Lowest frequency, in Hz, by default 0.0
fmax (float, optional) – Highest frequency, by default None
- Returns
Mel transform matrix
- Return type
np.ndarray [shape=(n_mels, 1 + n_fft/2)]
- static get_window(window_type: str, window_length: int, device: str)[source]
Wrapper around scipy.signal.get_window so one can also get the popular sqrt-hann window. This function caches for efficiency using functools.lru_cache.
- Parameters
window_type (str) – Type of window to get
window_length (int) – Length of the window
device (str) – Device to put window onto.
- Returns
Window returned by scipy.signal.get_window, as a tensor.
- Return type
torch.Tensor
- hash()[source]
Writes the audio data to a temporary file, and then hashes it using hashlib. Useful for creating a file name based on the audio content.
- Returns
Hash of audio data.
- Return type
str
Examples
Creating a signal, and writing it to a unique file name:
>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> hash = signal.hash()
>>> signal.write(f"{hash}.wav")
- istft(window_length: Optional[int] = None, hop_length: Optional[int] = None, window_type: Optional[str] = None, match_stride: Optional[bool] = None, length: Optional[int] = None)[source]
Computes inverse STFT and sets it to audio_data.
- Parameters
window_length (int, optional) – Window length of STFT, by default 0.032 * self.sample_rate.
hop_length (int, optional) – Hop length of STFT, by default window_length // 4.
window_type (str, optional) – Type of window to use, by default sqrt_hann.
match_stride (bool, optional) – Whether to match the stride of convolutional layers, by default False
length (int, optional) – Original length of signal, by default None
- Returns
AudioSignal with istft applied.
- Return type
- Raises
RuntimeError – Raises an error if stft was not called prior to istft on the signal, or if stft_data is not set.
- property length
Length of audio signal.
- Returns
Length of signal in samples.
- Return type
int
- load_from_array(audio_array: Union[Tensor, ndarray], sample_rate: int, device: str = 'cpu')[source]
Loads data from array, reshaping it to be exactly 3 dimensions. Used internally when AudioSignal is called with a tensor or an array.
- Parameters
audio_array (Union[torch.Tensor, np.ndarray]) – Array/tensor of audio samples.
sample_rate (int) – Sample rate of audio
device (str, optional) – Device to move audio onto, by default “cpu”
- Returns
AudioSignal loaded from array
- Return type
- load_from_file(audio_path: Union[str, Path], offset: float, duration: float, device: str = 'cpu')[source]
Loads data from file. Used internally when AudioSignal is instantiated with a path to a file.
- Parameters
audio_path (Union[str, Path]) – Path to file
offset (float) – Offset in seconds
duration (float) – Duration in seconds
device (str, optional) – Device to put AudioSignal on, by default “cpu”
- Returns
AudioSignal loaded from file
- Return type
- log_magnitude(ref_value: float = 1.0, amin: float = 1e-05, top_db: float = 80.0)[source]
Computes the log-magnitude of the spectrogram.
- Parameters
ref_value (float, optional) – The magnitude is scaled relative to ref: 20 * log10(S / ref). Zeros in the output correspond to positions where S == ref, by default 1.0
amin (float, optional) – Minimum threshold for S and ref, by default 1e-5
top_db (float, optional) – Threshold the output at top_db below the peak: max(10 * log10(S/ref)) - top_db, by default 80.0
- Returns
Log-magnitude spectrogram
- Return type
torch.Tensor
- property magnitude
Computes and returns the absolute value of the STFT, which is the magnitude. This value can also be set to some tensor. When set, self.stft_data is manipulated so that its magnitude matches what this is set to, and modulated by the phase.
- Returns
Magnitude of STFT.
- Return type
torch.Tensor
Examples
>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> magnitude = signal.magnitude  # Computes stft if not computed
>>> magnitude[magnitude < magnitude.mean()] = 0
>>> signal.magnitude = magnitude
>>> signal.istft()
- markdown()[source]
Produces a markdown representation of AudioSignal, in a markdown table.
- Returns
Markdown representation of AudioSignal.
- Return type
str
Examples
>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> print(signal.markdown())
| Key | Value
|---|---
| duration | 1.000 seconds |
| batch_size | 1 |
| path | path unknown |
| sample_rate | 44100 |
| num_channels | 1 |
| audio_data.shape | torch.Size([1, 1, 44100]) |
| stft_params | STFTParams(window_length=2048, hop_length=512, window_type='sqrt_hann', match_stride=False) |
| device | cpu |
- mel_spectrogram(n_mels: int = 80, mel_fmin: float = 0.0, mel_fmax: Optional[float] = None, **kwargs)[source]
Computes a Mel spectrogram.
- Parameters
n_mels (int, optional) – Number of mels, by default 80
mel_fmin (float, optional) – Lowest frequency, in Hz, by default 0.0
mel_fmax (float, optional) – Highest frequency, by default None
kwargs (dict, optional) – Keyword arguments to self.stft().
- Returns
Mel spectrogram.
- Return type
torch.Tensor [shape=(batch, channels, mels, time)]
- mfcc(n_mfcc: int = 40, n_mels: int = 80, log_offset: float = 1e-06, **kwargs)[source]
Computes mel-frequency cepstral coefficients (MFCCs).
- Parameters
n_mfcc (int, optional) – Number of MFCCs to compute, by default 40
n_mels (int, optional) – Number of mels, by default 80
log_offset (float, optional) – Small value to prevent numerical issues when trying to compute log(0), by default 1e-6
kwargs (dict, optional) – Keyword arguments to self.mel_spectrogram(); note that some of them will be passed on to self.stft()
- Returns
MFCCs.
- Return type
torch.Tensor [shape=(batch, channels, mfccs, time)]
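Examples
A minimal illustrative sketch computing a Mel spectrogram and MFCCs from one second of noise:
>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> mels = signal.mel_spectrogram(n_mels=80)
>>> mfccs = signal.mfcc(n_mfcc=40, n_mels=80)
>>> print(mels.shape, mfccs.shape)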
- property num_channels
Number of audio channels.
- Returns
Number of audio channels.
- Return type
int
- numpy()[source]
Detaches self.audio_data, moves it to the CPU, and converts it to numpy.
- Returns
Audio data as a numpy array.
- Return type
np.ndarray
- property path_to_input_file
Path to input file, if it exists. Alias to path_to_file for backwards compatibility.
- property phase
Computes and returns the phase of the STFT. This value can also be set to some tensor. When set, self.stft_data is manipulated so that its phase matches what this is set to, combined with the original magnitude.
- Returns
Phase of STFT.
- Return type
torch.Tensor
Examples
>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> phase = signal.phase  # Computes stft if not computed
>>> phase[phase < phase.mean()] = 0
>>> signal.phase = phase
>>> signal.istft()
- resample(sample_rate: int)[source]
Resamples the audio, using sinc interpolation. This works on both cpu and gpu, and is much faster on gpu.
- Parameters
sample_rate (int) – Sample rate to resample to.
- Returns
Resampled AudioSignal
- Return type
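Examples
A minimal sketch resampling one second of noise from 44100 Hz to 16000 Hz:
>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.resample(16000)
>>> signal.sample_rate
16000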
- classmethod salient_excerpt(audio_path: Union[str, Path], loudness_cutoff: Optional[float] = None, num_tries: int = 8, state: Optional[Union[RandomState, int]] = None, **kwargs)[source]
Similar to AudioSignal.excerpt, except it extracts excerpts only if they are above a specified loudness threshold, which is computed via a fast LUFS routine.
- Parameters
audio_path (Union[str, Path]) – Path to audio file to grab excerpt from.
loudness_cutoff (float, optional) – Loudness threshold in dB. Typical values are -40, -60, etc., by default None
num_tries (int, optional) – Number of tries to grab an excerpt above the threshold before giving up, by default 8.
state (Union[np.random.RandomState, int], optional) – RandomState or seed of random state, by default None
kwargs (dict) – Keyword arguments to AudioSignal.excerpt
- Returns
AudioSignal containing excerpt.
- Return type
Warning
If num_tries is set to None, salient_excerpt may try forever, which can result in an infinite loop if audio_path does not have any loud enough excerpts.
Examples
>>> signal = AudioSignal.salient_excerpt(
>>>     "path/to/audio",
>>>     loudness_cutoff=-40,
>>>     duration=5
>>> )
- property samples
Returns the audio data tensor in the object.
Audio data is always of the shape (batch_size, num_channels, num_samples). If the value has fewer than 3 dims (e.g. is (num_channels, num_samples)), then it will be reshaped to (1, num_channels, num_samples) - a batch size of 1.
- Parameters
data (Union[torch.Tensor, np.ndarray]) – Audio data to set.
- Returns
Audio samples.
- Return type
torch.Tensor
- property shape
Shape of audio data.
- Returns
Shape of audio data.
- Return type
tuple
- property signal_duration
Length of audio signal in seconds.
- Returns
Length of signal in seconds.
- Return type
float
- property signal_length
Length of audio signal.
- Returns
Length of signal in samples.
- Return type
int
- stft(window_length: Optional[int] = None, hop_length: Optional[int] = None, window_type: Optional[str] = None, match_stride: Optional[bool] = None, padding_type: Optional[str] = None)[source]
Computes the short-time Fourier transform of the audio data, with specified STFT parameters.
- Parameters
window_length (int, optional) – Window length of STFT, by default 0.032 * self.sample_rate.
hop_length (int, optional) – Hop length of STFT, by default window_length // 4.
window_type (str, optional) – Type of window to use, by default sqrt_hann.
match_stride (bool, optional) – Whether to match the stride of convolutional layers, by default False
padding_type (str, optional) – Type of padding to use, by default ‘reflect’
- Returns
STFT of audio data.
- Return type
torch.Tensor
Examples
Compute the STFT of an AudioSignal:
>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.stft()
Vary the window and hop length:
>>> stft_params = [STFTParams(128, 32), STFTParams(512, 128)]
>>> for stft_param in stft_params:
>>>     signal.stft_params = stft_param
>>>     signal.stft()
- property stft_data
Returns the STFT data inside the signal. Shape is (batch, channels, frequencies, time).
- Returns
Complex spectrogram data.
- Return type
torch.Tensor
- property stft_params
Returns the STFTParams object, which can be re-used for other AudioSignals.
This property can be set as well. If values are not defined in STFTParams, they are inferred automatically from the signal properties. The default is to use 32ms windows, with 8ms hop length, and the square root of the hann window.
- Returns
STFT parameters for the AudioSignal.
- Return type
Examples
>>> stft_params = STFTParams(128, 32)
>>> signal1 = AudioSignal(torch.randn(44100), 44100, stft_params=stft_params)
>>> signal2 = AudioSignal(torch.randn(44100), 44100, stft_params=signal1.stft_params)
>>> signal1.stft_params = STFTParams()  # Defaults
- to(device: str)[source]
Moves all tensors contained in signal to the specified device.
- Parameters
device (str) – Device to move AudioSignal onto. Typical values are “cuda”, “cpu”, or “cuda:n” to specify the nth gpu.
- Returns
AudioSignal with all tensors moved to specified device.
- Return type
- to_mono()[source]
Converts audio data to mono audio, by taking the mean along the channels dimension.
- Returns
AudioSignal with mean of channels.
- Return type
- trim(before: int, after: int)[source]
Trims the audio_data tensor before and after.
- Parameters
before (int) – How many samples to trim from beginning.
after (int) – How many samples to trim from end.
- Returns
AudioSignal with trimming applied.
- Return type
- truncate_samples(length_in_samples: int)[source]
Truncate signal to specified length.
- Parameters
length_in_samples (int) – Truncate to this many samples.
- Returns
AudioSignal with truncation applied.
- Return type
- classmethod wave(frequency: float, duration: float, sample_rate: int, num_channels: int = 1, shape: str = 'sine', **kwargs)[source]
Generate a waveform of a given frequency and shape.
- Parameters
frequency (float) – Frequency of the waveform
duration (float) – Duration of the waveform
sample_rate (int) – Sample rate of the waveform
num_channels (int, optional) – Number of channels, by default 1
shape (str, optional) – Shape of the waveform, by default “sine”. One of “sawtooth”, “square”, “sine”, “triangle”
kwargs (dict) – Keyword arguments to AudioSignal
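Examples
A minimal sketch generating one second of a 440 Hz sine wave:
>>> signal = AudioSignal.wave(440.0, 1.0, 44100, shape="sine")
>>> print(signal.shape)
(1, 1, 44100)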
- write(audio_path: Union[str, Path])[source]
Writes audio to a file. Only writes the audio that is in the very first item of the batch. To write other items in the batch, index the signal along the batch dimension before writing. After writing, the signal’s path_to_file attribute is updated to the new path.
- Parameters
audio_path (Union[str, Path]) – Path to write audio to.
- Returns
Returns original AudioSignal, so you can use this in a fluent interface.
- Return type
Examples
Creating and writing a signal to disk:
>>> signal = AudioSignal(torch.randn(10, 1, 44100), 44100)
>>> signal.write("/tmp/out.wav")
Writing a different element of the batch:
>>> signal[5].write("/tmp/out.wav")
Using this in a fluent interface:
>>> signal.write("/tmp/original.wav").low_pass(4000).write("/tmp/lowpass.wav")
- zero_pad(before: int, after: int)[source]
Zero pads the audio_data tensor before and after.
- Parameters
before (int) – How many zeros to prepend to audio.
after (int) – How many zeros to append to audio.
- Returns
AudioSignal with padding applied.
- Return type
- zero_pad_to(length: int, mode: str = 'after')[source]
Pad with zeros to a specified length, either before or after the audio data.
- Parameters
length (int) – Length to pad to
mode (str, optional) – Whether to prepend or append zeros to signal, by default “after”
- Returns
AudioSignal with padding applied.
- Return type
- classmethod zeros(duration: float, sample_rate: int, num_channels: int = 1, batch_size: int = 1, **kwargs)[source]
Helper function to create an AudioSignal of all zeros.
- Parameters
duration (float) – Duration of AudioSignal
sample_rate (int) – Sample rate of AudioSignal
num_channels (int, optional) – Number of channels, by default 1
batch_size (int, optional) – Batch size, by default 1
- Returns
AudioSignal containing all zeros.
- Return type
Examples
Generate 5 seconds of all zeros at a sample rate of 44100.
>>> signal = AudioSignal.zeros(5.0, 44100)
- class audiotools.core.audio_signal.STFTParams(window_length, hop_length, window_type, match_stride, padding_type)
Bases:
tuple
STFTParams object is a container that holds STFT parameters: window_length, hop_length, window_type, match_stride, and padding_type. Not all parameters need to be specified. Ones that are not specified will be inferred from the AudioSignal parameters.
- Parameters
window_length (int, optional) – Window length of STFT, by default 0.032 * self.sample_rate.
hop_length (int, optional) – Hop length of STFT, by default window_length // 4.
window_type (str, optional) – Type of window to use, by default sqrt_hann.
match_stride (bool, optional) – Whether to match the stride of convolutional layers, by default False
padding_type (str, optional) – Type of padding to use, by default ‘reflect’
- hop_length
Alias for field number 1
- match_stride
Alias for field number 3
- padding_type
Alias for field number 4
- window_length
Alias for field number 0
- window_type
Alias for field number 2
Displaying and visualizing
- class audiotools.core.display.DisplayMixin[source]
Bases:
object
- save_image(image_path: str, plot_fn: Union[Callable, str] = 'specshow', **kwargs)[source]
Save AudioSignal spectrogram (or whatever plot_fn is set to) to a specified file.
- Parameters
image_path (str) – Where to save the file to.
plot_fn (Union[Callable, str], optional) – How to create the image. Set to None to avoid plotting, by default “specshow”
kwargs (dict, optional) – Keyword arguments to audiotools.core.display.DisplayMixin.specshow() or whatever plot_fn is set to.
- specshow(preemphasis: bool = False, x_axis: str = 'time', y_axis: str = 'linear', n_mels: int = 128, **kwargs)[source]
Displays a spectrogram, using librosa.display.specshow.
- Parameters
preemphasis (bool, optional) – Whether or not to apply preemphasis, which makes high frequency detail easier to see, by default False
x_axis (str, optional) – How to label the x axis, by default “time”
y_axis (str, optional) – How to label the y axis, by default “linear”
n_mels (int, optional) – If displaying a mel spectrogram with y_axis = "mel", this controls the number of mels, by default 128.
kwargs (dict, optional) – Keyword arguments to audiotools.core.util.format_figure().
- waveplot(x_axis: str = 'time', **kwargs)[source]
Displays a waveform plot, using librosa.display.waveshow.
- Parameters
x_axis (str, optional) – How to label the x axis, by default “time”
kwargs (dict, optional) – Keyword arguments to audiotools.core.util.format_figure().
- wavespec(x_axis: str = 'time', **kwargs)[source]
Displays a waveform plot stacked above a spectrogram, using librosa.display.waveshow and specshow.
- Parameters
x_axis (str, optional) – How to label the x axis, by default “time”
kwargs (dict, optional) – Keyword arguments to audiotools.core.display.DisplayMixin.specshow().
- write_audio_to_tb(tag: str, writer, step: Optional[int] = None, plot_fn: Union[Callable, str] = 'specshow', **kwargs)[source]
Writes a signal and its spectrogram to Tensorboard. Will show up under the Audio and Images tab in Tensorboard.
- Parameters
tag (str) – Tag to write signal to (e.g. clean/sample_0.wav). The image will be written to the corresponding .png file (e.g. clean/sample_0.png).
writer (SummaryWriter) – A SummaryWriter object from the PyTorch library.
step (int, optional) – The step to write the signal to, by default None
plot_fn (Union[Callable, str], optional) – How to create the image. Set to None to avoid plotting, by default “specshow”
kwargs (dict, optional) – Keyword arguments to audiotools.core.display.DisplayMixin.specshow() or whatever plot_fn is set to.
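Examples
A minimal sketch writing a signal and its spectrogram to TensorBoard (the "runs/demo" log directory is just an illustration):
>>> from torch.utils.tensorboard import SummaryWriter
>>> writer = SummaryWriter("runs/demo")
>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.write_audio_to_tb("clean/sample_0.wav", writer, step=0)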
- audiotools.core.display.format_figure(func)[source]
Decorator for formatting figures produced by the code below. See audiotools.core.util.format_figure() for more.
- Parameters
func (Callable) – Plotting function that is decorated by this function.
Digital signal processing
- class audiotools.core.dsp.DSPMixin[source]
Bases:
object
- collect_windows(window_duration: float, hop_duration: float, preprocess: bool = True)[source]
Reshapes the signal into windows of the specified duration, separated by the specified hop length. Windows are placed along the batch dimension. Use with audiotools.core.dsp.DSPMixin.overlap_and_add() to reconstruct the original signal.
- Parameters
window_duration (float) – Duration of every window in seconds.
hop_duration (float) – Hop between windows in seconds.
preprocess (bool, optional) – Whether to preprocess the signal, so that the first sample is in the middle of the first window, by default True
- Returns
AudioSignal unfolded with shape
(nb * nch * num_windows, 1, window_length)
- Return type
- corrupt_phase(scale: Union[Tensor, ndarray, float])[source]
Corrupts the phase randomly by some scaled value.
- Parameters
scale (Union[torch.Tensor, np.ndarray, float]) – Standard deviation of noise to add to the phase.
- Returns
Signal with stft_data manipulated. Apply .istft() to get the masked audio data.
- Return type
- high_pass(cutoffs: Union[Tensor, ndarray, float], zeros: int = 51)[source]
High-passes the signal in-place. Each item in the batch can have a different high-pass cutoff, if the input to this signal is an array or tensor. If a float, all items are given the same high-pass filter.
- Parameters
cutoffs (Union[torch.Tensor, np.ndarray, float]) – Cutoff in Hz of high-pass filter.
zeros (int, optional) – Number of taps to use in high-pass filter, by default 51
- Returns
High-passed AudioSignal.
- Return type
- low_pass(cutoffs: Union[Tensor, ndarray, float], zeros: int = 51)[source]
Low-passes the signal in-place. Each item in the batch can have a different low-pass cutoff, if the input to this signal is an array or tensor. If a float, all items are given the same low-pass filter.
- Parameters
cutoffs (Union[torch.Tensor, np.ndarray, float]) – Cutoff in Hz of low-pass filter.
zeros (int, optional) – Number of taps to use in low-pass filter, by default 51
- Returns
Low-passed AudioSignal.
- Return type
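Examples
A minimal sketch filtering a batch of two signals, with per-item low-pass cutoffs and a shared high-pass cutoff:
>>> signal = AudioSignal(torch.randn(2, 1, 44100), 44100)
>>> signal.low_pass(torch.tensor([4000.0, 8000.0]))
>>> signal.high_pass(100)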
- mask_frequencies(fmin_hz: Union[Tensor, ndarray, float], fmax_hz: Union[Tensor, ndarray, float], val: float = 0.0)[source]
Masks frequencies between fmin_hz and fmax_hz, and fills them with the value specified by val. Useful for implementing SpecAug. The min and max can be different for every item in the batch.
- Parameters
fmin_hz (Union[torch.Tensor, np.ndarray, float]) – Lower end of band to mask out.
fmax_hz (Union[torch.Tensor, np.ndarray, float]) – Upper end of band to mask out.
val (float, optional) – Value to fill in, by default 0.0
- Returns
Signal with stft_data manipulated. Apply .istft() to get the masked audio data.
- Return type
- mask_low_magnitudes(db_cutoff: Union[Tensor, ndarray, float], val: float = 0.0)[source]
Mask away magnitudes below a specified threshold, which can be different for every item in the batch.
- Parameters
db_cutoff (Union[torch.Tensor, np.ndarray, float]) – Decibel value for which things below it will be masked away.
val (float, optional) – Value to fill in for masked portions, by default 0.0
- Returns
Signal with stft_data manipulated. Apply .istft() to get the masked audio data.
- Return type
- mask_timesteps(tmin_s: Union[Tensor, ndarray, float], tmax_s: Union[Tensor, ndarray, float], val: float = 0.0)[source]
Masks timesteps between tmin_s and tmax_s, and fills them with the value specified by val. Useful for implementing SpecAug. The min and max can be different for every item in the batch.
- Parameters
tmin_s (Union[torch.Tensor, np.ndarray, float]) – Lower end of timesteps to mask out.
tmax_s (Union[torch.Tensor, np.ndarray, float]) – Upper end of timesteps to mask out.
val (float, optional) – Value to fill in, by default 0.0
- Returns
Signal with stft_data manipulated. Apply .istft() to get the masked audio data.
- Return type
- overlap_and_add(hop_duration: float)[source]
Function which takes a list of windows and overlap-adds them into a signal the same length as audio_signal.
- Parameters
hop_duration (float) – How much to shift for each window (overlap is window_duration - hop_duration) in seconds.
- Returns
overlap-and-added signal.
- Return type
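Examples
A minimal sketch splitting a signal into 1-second windows with a 0.5-second hop, then reconstructing it with overlap-add:
>>> signal = AudioSignal(torch.randn(4 * 44100), 44100)
>>> windowed = signal.clone().collect_windows(1.0, 0.5)
>>> reconstructed = windowed.overlap_and_add(0.5)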
- preemphasis(coef: float = 0.85)[source]
Applies pre-emphasis to audio signal.
- Parameters
coef (float, optional) – How much pre-emphasis to apply; lower values do less and 0 does nothing, by default 0.85
- Returns
Pre-emphasized signal.
- Return type
- shift_phase(shift: Union[Tensor, ndarray, float])[source]
Shifts the phase by a constant value.
- Parameters
shift (Union[torch.Tensor, np.ndarray, float]) – What to shift the phase by.
- Returns
Signal with stft_data manipulated. Apply .istft() to get the masked audio data.
- Return type
- windows(window_duration: float, hop_duration: float, preprocess: bool = True)[source]
Generator which yields windows of specified duration from signal with a specified hop length.
- Parameters
window_duration (float) – Duration of every window in seconds.
hop_duration (float) – Hop between windows in seconds.
preprocess (bool, optional) – Whether to preprocess the signal, so that the first sample is in the middle of the first window, by default True
- Yields
AudioSignal – Each window is returned as an AudioSignal.
Audio effects
- class audiotools.core.effects.EffectMixin[source]
Bases:
object
- CODEC_PRESETS = {'8-bit': {'bits_per_sample': 8, 'encoding': 'ULAW', 'format': 'wav'}, 'Amr-nb': {'format': 'amr-nb'}, 'GSM-FR': {'format': 'gsm'}, 'MP3': {'compression': -9, 'format': 'mp3'}, 'Ogg': {'compression': -1, 'format': 'ogg'}, 'Vorbis': {'compression': -1, 'format': 'vorbis'}}
Presets for applying codecs via torchaudio.
- GAIN_FACTOR = 0.11512925464970229
Gain factor for converting between amplitude and decibels.
- apply_codec(preset: Optional[str] = None, format: str = 'wav', encoding: Optional[str] = None, bits_per_sample: Optional[int] = None, compression: Optional[int] = None)[source]
Applies an audio codec to the signal.
- Parameters
preset (str, optional) – One of the keys in self.CODEC_PRESETS, by default None
format (str, optional) – Format for audio codec, by default “wav”
encoding (str, optional) – Encoding to use, by default None
bits_per_sample (int, optional) – How many bits per sample, by default None
compression (int, optional) – Compression amount of codec, by default None
- Returns
AudioSignal with codec applied.
- Return type
- Raises
ValueError – If preset is not in self.CODEC_PRESETS, an error is thrown.
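Examples
A minimal sketch applying the MP3 preset (whether a given codec is available depends on the installed torchaudio backend):
>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.apply_codec(preset="MP3")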
- apply_ir(ir, drr: Optional[Union[Tensor, ndarray, float]] = None, ir_eq: Optional[Union[Tensor, ndarray]] = None, use_original_phase: bool = False)[source]
Applies an impulse response to the signal. If ir_eq is specified, the impulse response is equalized before it is applied, using the given curve.
- Parameters
ir (AudioSignal) – Impulse response to convolve with.
drr (Union[torch.Tensor, np.ndarray, float], optional) – Direct-to-reverberant ratio that impulse response will be altered to, if specified, by default None
ir_eq (Union[torch.Tensor, np.ndarray], optional) – Equalization that will be applied to impulse response if specified, by default None
use_original_phase (bool, optional) – Whether to use the original phase, instead of the convolved phase, by default False
- Returns
Signal with impulse response applied to it
- Return type
- clip_distortion(clip_percentile: Union[Tensor, ndarray, float])[source]
Clips the signal at a given percentile. The higher it is, the lower the threshold for clipping.
- Parameters
clip_percentile (Union[torch.Tensor, np.ndarray, float]) – Values are between 0.0 to 1.0. Typical values are 0.1 or below.
- Returns
Audio signal with clipped audio data.
- Return type
- convolve(other, start_at_max: bool = True)[source]
Convolves self with other. This function uses FFTs to do the convolution.
- Parameters
other (AudioSignal) – Signal to convolve with.
start_at_max (bool, optional) – Whether to start at the max value of other signal, to avoid inducing delays, by default True
- Returns
Convolved signal, in-place.
- Return type
- ensure_max_of_audio(max: float = 1.0)[source]
Ensures that abs(audio_data) <= max.
- Parameters
max (float, optional) – Max absolute value of signal, by default 1.0
- Returns
Signal with values scaled between -max and max.
- Return type
- equalizer(db: Union[Tensor, ndarray])[source]
Applies a mel-spaced equalizer to the audio signal.
- Parameters
db (Union[torch.Tensor, np.ndarray]) – EQ curve to apply.
- Returns
AudioSignal with equalization applied.
- Return type
- mel_filterbank(n_bands: int)[source]
Breaks signal into mel bands.
- Parameters
n_bands (int) – Number of mel bands to use.
- Returns
Mel-filtered bands, with last axis being the band index.
- Return type
torch.Tensor
- mix(other, snr: Union[Tensor, ndarray, float] = 10, other_eq: Optional[Union[Tensor, ndarray]] = None)[source]
Mixes noise with signal at specified signal-to-noise ratio. Optionally, the other signal can be equalized in-place.
- Parameters
other (AudioSignal) – AudioSignal object to mix with.
snr (Union[torch.Tensor, np.ndarray, float], optional) – Signal to noise ratio, by default 10
other_eq (Union[torch.Tensor, np.ndarray], optional) – EQ curve to apply to other signal, if any, by default None
- Returns
In-place modification of AudioSignal.
- Return type
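Examples
A minimal sketch mixing noise into a signal at 10 dB SNR, then normalizing the mixture to -24 LUFS:
>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> noise = AudioSignal(torch.randn(44100), 44100)
>>> signal.mix(noise, snr=10)
>>> signal.normalize(-24)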
- mulaw_quantization(quantization_channels: Union[Tensor, ndarray, int])[source]
Applies mu-law quantization to the input waveform.
- Parameters
quantization_channels (Union[torch.Tensor, np.ndarray, int]) – Number of mu-law spaced quantization channels to quantize to.
- Returns
Quantized AudioSignal.
- Return type
- normalize(db: Union[Tensor, ndarray, float] = -24.0)[source]
Normalizes the signal’s volume to the specified db, in LUFS. This is GPU-compatible, making for very fast loudness normalization.
- Parameters
db (Union[torch.Tensor, np.ndarray, float], optional) – Loudness to normalize to, by default -24.0
- Returns
Normalized audio signal.
- Return type
- pitch_shift(n_semitones: int, quick: bool = True)[source]
Pitch shift the signal. All items in the batch get the same pitch shift.
- Parameters
n_semitones (int) – How many semitones to shift the signal by.
quick (bool, optional) – Using quick pitch shifting, by default True
- Returns
Pitch shifted audio signal.
- Return type
- quantization(quantization_channels: Union[Tensor, ndarray, int])[source]
Applies quantization to the input waveform.
- Parameters
quantization_channels (Union[torch.Tensor, np.ndarray, int]) – Number of evenly spaced quantization channels to quantize to.
- Returns
Quantized AudioSignal.
- Return type
- time_stretch(factor: float, quick: bool = True)[source]
Time stretch the audio signal.
- Parameters
factor (float) – Factor by which to stretch the AudioSignal. Typically between 0.8 and 1.2.
quick (bool, optional) – Whether to use quick time stretching, by default True
- Returns
Time-stretched AudioSignal.
- Return type
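Examples
A minimal sketch of both effects (applied via SoX, so SoX support in torchaudio is assumed):
>>> signal = AudioSignal(torch.randn(44100), 44100)
>>> signal.pitch_shift(2)       # shift by two semitones
>>> signal.time_stretch(0.9)    # stretch by a factor of 0.9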
- class audiotools.core.effects.ImpulseResponseMixin[source]
Bases:
object
These functions are generally only used with AudioSignals that are derived from impulse responses, not other sources like music or speech. These methods are used to replicate the data augmentation described in [1].
[1] Bryan, Nicholas J. “Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation.” ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
- alter_drr(drr: Union[Tensor, ndarray, float])[source]
Alters the direct-to-reverberant ratio of the impulse response.
- Parameters
drr (Union[torch.Tensor, np.ndarray, float]) – Direct-to-reverberant ratio that the impulse response will be altered to.
- Returns
Altered impulse response.
- Return type
FFMPEG routines
- class audiotools.core.ffmpeg.FFMPEGMixin[source]
Bases:
object
- ffmpeg_loudness(quiet: bool = True)[source]
Computes loudness of audio file using FFMPEG.
- Parameters
quiet (bool, optional) – Whether to show FFMPEG output during computation, by default True
- Returns
Loudness of every item in the batch, computed via FFMPEG.
- Return type
torch.Tensor
- ffmpeg_resample(sample_rate: int, quiet: bool = True)[source]
Resamples AudioSignal using FFMPEG. More memory-efficient than using julius.resample for long audio files.
- Parameters
sample_rate (int) – Sample rate to resample to.
quiet (bool, optional) – Whether to show FFMPEG output during computation, by default True
- Returns
Resampled AudioSignal.
- Return type
- classmethod load_from_file_with_ffmpeg(audio_path: str, quiet: bool = True, **kwargs)[source]
Loads AudioSignal object after decoding it to a wav file using FFMPEG. Useful for loading audio that isn’t covered by librosa’s loading mechanism. Also useful for loading mp3 files, without any offset.
- Parameters
audio_path (str) – Path to load AudioSignal from.
quiet (bool, optional) – Whether to show FFMPEG output during computation, by default True
- Returns
AudioSignal loaded from file with FFMPEG.
- Return type
- audiotools.core.ffmpeg.r128stats(filepath: str, quiet: bool)[source]
Takes a path to an audio file, returns a dict with the loudness stats computed by the ffmpeg ebur128 filter.
- Parameters
filepath (str) – Path to compute loudness stats on.
quiet (bool) – Whether to show FFMPEG output during computation.
- Returns
Dictionary containing loudness stats.
- Return type
dict
Perceptual loudness
- class audiotools.core.loudness.LoudnessMixin[source]
Bases:
object
- MIN_LOUDNESS = -70
Minimum loudness possible.
- loudness(filter_class: str = 'K-weighting', block_size: float = 0.4, **kwargs)[source]
Calculates loudness using an implementation of ITU-R BS.1770-4. Allows control over gating block size and frequency weighting filters. Measures the integrated gated loudness of a signal.
API is derived from PyLoudnorm, but this implementation is ported to PyTorch and is tensorized across batches. When on GPU, an FIR approximation of the IIR filters is used to compute loudness, for speed.
Uses the weighting filters and block size defined by the meter; the integrated loudness is measured based upon the gating algorithm defined in the ITU-R BS.1770-4 specification.
- Parameters
filter_class (str, optional) – Class of weighting filter used; one of ‘K-weighting’, ‘Fenton/Lee 1’, ‘Fenton/Lee 2’, or ‘Dash et al.’, by default “K-weighting”
block_size (float, optional) – Gating block size in seconds, by default 0.400
kwargs (dict, optional) – Keyword arguments to audiotools.core.loudness.Meter().
- Returns
Loudness of audio data.
- Return type
torch.Tensor
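Examples
A minimal sketch measuring integrated loudness, one LUFS value per item in the batch:
>>> signal = AudioSignal(torch.randn(2, 1, 44100), 44100)
>>> loudness = signal.loudness()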
- class audiotools.core.loudness.Meter(rate: int, filter_class: str = 'K-weighting', block_size: float = 0.4, zeros: int = 512, use_fir: bool = False)[source]
Bases:
Module
Tensorized version of pyloudnorm.Meter. Works with batched audio tensors.
- Parameters
rate (int) – Sample rate of audio.
filter_class (str, optional) – Class of weighting filter used; one of ‘K-weighting’, ‘Fenton/Lee 1’, ‘Fenton/Lee 2’, or ‘Dash et al.’, by default “K-weighting”
block_size (float, optional) – Gating block size in seconds, by default 0.400
zeros (int, optional) – Number of zeros to use in FIR approximation of IIR filters, by default 512
use_fir (bool, optional) – Whether to use the FIR approximation or the exact IIR formulation. If computing on GPU, use_fir=True will be used, as it is much faster, by default False
- apply_filter(data: Tensor)[source]
Applies the filter on either CPU or GPU, depending on whether the audio is on GPU or CPU, or if self.use_fir is True.
- Parameters
data (torch.Tensor) – Audio data of shape (nb, nch, nt).
- Returns
Filtered audio data.
- Return type
torch.Tensor
- apply_filter_cpu(data: Tensor)[source]
Performs IIR formulation of loudness computation.
- Parameters
data (torch.Tensor) – Audio data of shape (nb, nch, nt).
- Returns
Filtered audio data.
- Return type
torch.Tensor
- apply_filter_gpu(data: Tensor)[source]
Performs FIR approximation of loudness computation.
- Parameters
data (torch.Tensor) – Audio data of shape (nb, nch, nt).
- Returns
Filtered audio data.
- Return type
torch.Tensor
- property filter_class
- forward(data: Tensor)[source]
Computes integrated loudness of data.
- Parameters
data (torch.Tensor) – Audio data of shape (nb, nch, nt).
- Returns
Integrated loudness of the input data, in LUFS.
- Return type
torch.Tensor
- integrated_loudness(data: Tensor)[source]
Computes integrated loudness of data.
- Parameters
data (torch.Tensor) – Audio data of shape (nb, nch, nt).
- Returns
Integrated loudness of the input data, in LUFS.
- Return type
torch.Tensor
- training: bool
Listening to AudioSignals
These are utilities that allow one to embed an AudioSignal as a playable object in a Jupyter notebook, or to play audio from the terminal, etc.
- class audiotools.core.playback.PlayMixin[source]
Bases:
object
- embed(ext: Optional[str] = None, display: bool = True, return_html: bool = False)[source]
Embeds audio as a playable audio embed in a notebook, or HTML document, etc.
- Parameters
ext (str, optional) – Extension to use when saving the audio, by default “.wav”
display (bool, optional) – This controls whether or not to display the audio when called. This is used when the embed is the last line in a Jupyter cell, to prevent the audio from being embedded twice, by default True
return_html (bool, optional) – Whether to return the data wrapped in an HTML audio element, by default False
- Returns
Either the element for display, or the HTML string of it.
- Return type
str
- play()[source]
Plays an audio signal if ffplay from the ffmpeg suite of tools is installed. Otherwise, will fail. The audio signal is written to a temporary file and then played with ffplay.
- widget(title: Optional[str] = None, ext: str = '.wav', add_headers: bool = True, player_width: str = '100%', margin: str = '10px', plot_fn: str = 'specshow', return_html: bool = False, **kwargs)[source]
Creates a playable widget with spectrogram. Inspired (heavily) by https://sjvasquez.github.io/blog/melnet/.
- Parameters
title (str, optional) – Title of plot, placed in upper right of top-most axis.
ext (str, optional) – Extension for embedding, by default “.wav”
add_headers (bool, optional) – Whether or not to add headers (use for first embed, False for later embeds), by default True
player_width (str, optional) – Width of the player, as a string in a CSS rule, by default “100%”
margin (str, optional) – Margin on all sides of player, by default “10px”
plot_fn (function, optional) – Plotting function to use (by default self.specshow).
return_html (bool, optional) – Whether to return the data wrapped in an HTML audio element, by default False
kwargs (dict, optional) – Keyword arguments to plot_fn (by default self.specshow).
- Returns
HTML object.
- Return type
HTML
Utilities
- class audiotools.core.util.Info(sample_rate: float, num_frames: int)[source]
Bases:
object
Shim for torchaudio.info API changes.
- property duration: float
- num_frames: int
- sample_rate: float
- audiotools.core.util.chdir(newdir: Union[Path, str])[source]
Context manager for switching directories to run a function. Useful for when you want to use relative paths to different runs.
- Parameters
newdir (Union[Path, str]) – Directory to switch to.
- audiotools.core.util.choose_from_list_of_lists(state: RandomState, list_of_lists: list, p: Optional[float] = None)[source]
Choose a single item from a list of lists.
- Parameters
state (np.random.RandomState) – Random state to use when choosing an item.
list_of_lists (list) – A list of lists from which items will be drawn.
p (float, optional) – Probabilities of each list, by default None
- Returns
An item from the list of lists.
- Return type
Any
- audiotools.core.util.collate(list_of_dicts: list, n_splits: Optional[int] = None)[source]
Collates a list of dictionaries (e.g. as returned by a dataloader) into a dictionary with batched values. This routine uses the default torch collate function for everything except AudioSignal objects, which are handled by the
audiotools.core.audio_signal.AudioSignal.batch() function.
This function takes n_splits to enable splitting a batch into multiple sub-batches for the purposes of gradient accumulation, etc.
- Parameters
list_of_dicts (list) – List of dictionaries to be collated.
n_splits (int) – Number of splits to make when creating the batches (split into sub-batches). Useful for things like gradient accumulation.
- Returns
Dictionary containing batched data.
- Return type
dict
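Examples
A minimal sketch collating dataset items that mix AudioSignals with other values:
>>> from audiotools.core.util import collate
>>> items = [{"signal": AudioSignal(torch.randn(44100), 44100), "label": i} for i in range(4)]
>>> batch = collate(items)
>>> batch["signal"].batch_size
4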
- audiotools.core.util.ensure_tensor(x: Union[ndarray, Tensor, float, int], ndim: Optional[int] = None, batch_size: Optional[int] = None)[source]
Ensures that the input x is a tensor of specified dimensions and batch size.
- Parameters
x (Union[np.ndarray, torch.Tensor, float, int]) – Data that will become a tensor on its way out.
ndim (int, optional) – How many dimensions should be in the output, by default None
batch_size (int, optional) – The batch size of the output, by default None
- Returns
Modified version of x as a tensor.
- Return type
torch.Tensor
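Examples
A minimal sketch broadcasting a scalar to a 2-dimensional tensor with a batch size of 4:
>>> from audiotools.core.util import ensure_tensor
>>> x = ensure_tensor(0.5, ndim=2, batch_size=4)
>>> x.shape
torch.Size([4, 1])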
- audiotools.core.util.find_audio(folder: str, ext: List[str] = ['.wav', '.flac', '.mp3', '.mp4'])[source]
Finds all audio files in a directory recursively. Returns a list.
- Parameters
folder (str) – Folder to look for audio files in, recursively.
ext (List[str], optional) – Extensions to look for (including the dot), by default ['.wav', '.flac', '.mp3', '.mp4'].
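Examples
A minimal sketch collecting every audio file under a (hypothetical) "data/" folder:
>>> from audiotools.core.util import find_audio
>>> audio_files = find_audio("data/")
>>> len(audio_files)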
- audiotools.core.util.format_figure(fig_size: Optional[tuple] = None, title: Optional[str] = None, fig=None, format_axes: bool = True, format: bool = True, font_color: str = 'white')[source]
Prettifies the spectrogram and waveform plots. A title can be inset into the top right corner, and the axes can be inset into the figure, allowing the data to take up the entire image. Used by the plotting functions in audiotools.core.display (e.g. specshow() and waveplot()).
- Parameters
fig_size (tuple, optional) – Size of figure, by default (9, 3)
title (str, optional) – Title to inset in top right, by default None
fig (matplotlib.figure.Figure, optional) – Figure object; if None, plt.gcf() will be used, by default None
format_axes (bool, optional) – Format the axes to be inside the figure, by default True
format (bool, optional) – This formatting can be skipped entirely by passing format=False to any of the plotting functions that use this formatter, by default True
font_color (str, optional) – Color of font of axes, by default “white”
- audiotools.core.util.generate_chord_dataset(max_voices: int = 8, sample_rate: int = 44100, num_items: int = 5, duration: float = 1.0, min_note: str = 'C2', max_note: str = 'C6', output_dir: Path = 'chords')[source]
Generates a toy multitrack dataset of chords, synthesized from sine waves.
- Parameters
max_voices (int, optional) – Maximum number of voices in a chord, by default 8
sample_rate (int, optional) – Sample rate of audio, by default 44100
num_items (int, optional) – Number of items to generate, by default 5
duration (float, optional) – Duration of each item, by default 1.0
min_note (str, optional) – Minimum note in the dataset, by default “C2”
max_note (str, optional) – Maximum note in the dataset, by default “C6”
output_dir (Path, optional) – Directory to save the dataset, by default “chords”
- audiotools.core.util.hz_to_bin(hz: Tensor, n_fft: int, sample_rate: int)[source]
Closest frequency bin given a frequency, number of bins, and a sampling rate.
- Parameters
hz (torch.Tensor) – Tensor of frequencies in Hz.
n_fft (int) – Number of FFT bins.
sample_rate (int) – Sample rate of audio.
- Returns
Closest bins to the data.
- Return type
torch.Tensor
- audiotools.core.util.info(audio_path: str)[source]
Shim for torchaudio.info to make 0.7.2 API match 0.8.0.
- Parameters
audio_path (str) – Path to audio file.
- audiotools.core.util.prepare_batch(batch: Union[dict, list, Tensor], device: str = 'cpu')[source]
Moves items in a batch (typically generated by a DataLoader as a list or a dict) to the specified device. This works even if dictionaries are nested.
- Parameters
batch (Union[dict, list, torch.Tensor]) – Batch, typically generated by a dataloader, that will be moved to the device.
device (str, optional) – Device to move batch to, by default “cpu”
- Returns
Batch with all values moved to the specified device.
- Return type
Union[dict, list, torch.Tensor]
- audiotools.core.util.random_state(seed: Union[int, RandomState])[source]
Turn seed into a np.random.RandomState instance.
- Parameters
seed (Union[int, np.random.RandomState] or None) – If seed is None, return the RandomState singleton used by np.random. If seed is an int, return a new RandomState instance seeded with seed. If seed is already a RandomState instance, return it. Otherwise raise ValueError.
- Returns
Random state object.
- Return type
np.random.RandomState
- Raises
ValueError – If seed is not valid, an error is thrown.
- audiotools.core.util.read_sources(sources: List[str], remove_empty: bool = True, relative_path: str = '', ext: List[str] = ['.wav', '.flac', '.mp3', '.mp4'])[source]
Reads audio sources that can either be folders full of audio files, or CSV files that contain paths to audio files. CSV files that adhere to the expected format can be generated by
audiotools.data.preprocess.create_csv().
- Parameters
sources (List[str]) – List of audio sources to be converted into a list of lists of audio files.
remove_empty (bool, optional) – Whether or not to remove rows with an empty “path” from each CSV file, by default True.
- Returns
List of lists of rows of CSV files.
- Return type
list
- audiotools.core.util.sample_from_dist(dist_tuple: tuple, state: Optional[RandomState] = None)[source]
Samples from a distribution defined by a tuple. The first item in the tuple is the distribution type, and the rest of the items are arguments to that distribution. The distribution function is taken from the np.random.RandomState object.
- Parameters
dist_tuple (tuple) – Distribution tuple
state (np.random.RandomState, optional) – Random state, or seed to use, by default None
- Returns
Draw from the distribution.
- Return type
Union[float, int, str]
Examples
Sample from a uniform distribution:
>>> dist_tuple = ("uniform", 0, 1) >>> sample_from_dist(dist_tuple)
Sample from a constant distribution:
>>> dist_tuple = ("const", 0) >>> sample_from_dist(dist_tuple)
Sample from a normal distribution:
>>> dist_tuple = ("normal", 0, 0.5) >>> sample_from_dist(dist_tuple)
- audiotools.core.util.seed(random_seed, set_cudnn=False)[source]
Seeds all random states with the same random seed for reproducibility. Seeds numpy, random, and torch random generators. For full reproducibility, two further options must be set according to the torch documentation: https://pytorch.org/docs/stable/notes/randomness.html. To do this, set_cudnn must be True. It defaults to False, since setting it to True results in a performance hit.
- Parameters
random_seed (int) – Integer corresponding to the random seed to use.
set_cudnn (bool) – Whether or not to set cudnn into deterministic mode and turn off benchmark mode. Defaults to False.
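Examples
A minimal sketch seeding all generators before an experiment; set_cudnn=True trades speed for full determinism:
>>> from audiotools.core.util import seed
>>> seed(0)
>>> seed(0, set_cudnn=True)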