---
file_format: mystnb
kernelspec:
  name: python3
---

# Introduction

Object-oriented handling of audio signals, with fast augmentation routines, batching, padding, and more.

```{code-cell} ipython3
import torch
import audiotools
from audiotools import AudioSignal
from audiotools import post
import rich
import matplotlib.pyplot as plt
import markdown2 as md
from IPython.display import HTML

audiotools.core.playback.DEFAULT_EXTENSION = ".mp3"
state = audiotools.util.random_state(0)

spk = AudioSignal("../tests/audio/spk/f10_script4_produced.wav", offset=5, duration=5)
ir = AudioSignal("../tests/audio/ir/h179_Bar_1txts.wav")
nz = AudioSignal("../tests/audio/nz/f5_script2_ipad_balcony1_room_tone.wav")
```

## Playback and visualization

Let's first listen to the clean file and visualize it:

```{code-cell} ipython3
spk.specshow()
plt.show()
spk.embed(display=False)
```

We can also combine the above into a single widget, like so:

```{code-cell} ipython3
spk.widget()
```

## Mixing signals

Let's mix the speaker with noise at varying SNRs. Since the `mix` function is applied in place, we'll clone (a deep copy) before each mix to preserve the original signal in `spk`.

```{code-cell} ipython3
outputs = {}
for snr in [0, 10, 20]:
    output = spk.clone().mix(nz, snr=snr)
    outputs[f"snr={snr}"] = output
post.disp(outputs)
```

## Batching signals

We can collate a batch of excerpts taken at random offsets from one file, all with the same duration:

```{code-cell} ipython3
batch_size = 16
spk_batch = AudioSignal.batch([
    AudioSignal.excerpt('../tests/audio/spk/f10_script4_produced.wav', duration=2, state=state)
    for _ in range(batch_size)
])
HTML(md.markdown(spk_batch.markdown(), extras=["tables"]))
```

We can listen to different items in the batch:

```{code-cell} ipython3
outputs = {}
for idx in [0, 2, 5]:
    output = spk_batch[idx]
    outputs[f"batch_idx={idx}"] = output
post.disp(outputs)
```

We can mix each item in the batch at a different SNR:

```{code-cell} ipython3
tgt_snr = torch.linspace(-10, 10, batch_size)
spk_plus_nz_batch = spk_batch.clone().mix(nz, snr=tgt_snr)
```

Let's listen to the first and last item in the output:

```{code-cell} ipython3
outputs = {}
for idx in [0, -1]:
    output = spk_plus_nz_batch[idx]
    outputs[f"batch_idx={idx}"] = output
post.disp(outputs)
```

The first item was mixed at -10 dB SNR, and the last at 10 dB SNR.

## Perceptual loudness

In Descript, we auto-level to -24 dB LUFS. We can do the same thing for a batch of audio signals by using an implementation of the same LUFS algorithm used in FFmpeg. The implementation is fully differentiable, so it can be computed on the GPU. Let's see the loudness of each item in our batch.

```{code-cell} ipython3
print(spk_batch.loudness())
```

Now, let's auto-level each item in the batch to -24 dB LUFS.

```{code-cell} ipython3
output = spk_batch.clone().normalize(-24)
print(output.loudness())
```

Let's make sure the SNR-based mixing we did before was actually correct. Since `mix` re-levels the noise in place to hit each target SNR, the loudness difference between the clean speech and the noise should match the targets.

```{code-cell} ipython3
print(spk_batch.loudness() - nz.loudness())
print(tgt_snr)
```

Fairly close.

## Convolution

Next, let's convolve our speaker with an impulse response, to make it sound like they're in a room.

```{code-cell} ipython3
convolved = spk.clone().convolve(ir)
```

```{code-cell} ipython3
post.disp(convolved)
```

We can convolve every item in the batch with this impulse response.

```{code-cell} ipython3
spk_batch.clone().convolve(ir)
```
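Under the hood, convolution with an impulse response is typically carried out in the frequency domain. Here's a minimal sketch of that idea in plain `torch` (illustrative only, not audiotools' exact implementation; `fft_convolve_sketch` is a hypothetical helper):

```{code-cell} ipython3
def fft_convolve_sketch(x, h):
    # Zero-pad both tensors to the full linear-convolution length, so the
    # circular convolution implied by the FFT doesn't wrap around.
    n = x.shape[-1] + h.shape[-1] - 1
    y = torch.fft.irfft(torch.fft.rfft(x, n=n) * torch.fft.rfft(h, n=n), n=n)
    return y[..., : x.shape[-1]]  # trim back to the input length
```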
Or, if we have a batch of impulse responses, we can convolve a batch of speech signals with the batch of impulse responses.

```{code-cell} ipython3
ir_batch = AudioSignal.batch([
    AudioSignal('../tests/audio/ir/h179_Bar_1txts.wav')
    for _ in range(batch_size)
])
spk_batch.clone().convolve(ir_batch)
```

There's also some syntactic sugar for applying convolution.

```{code-cell} ipython3
spk_batch.clone() @ ir_batch  # Same as above.
```

## Equalization

Next, let's apply some equalization to the impulse response, to simulate different mic responses. First, we set the number of bands in the EQ.

```{code-cell} ipython3
n_bands = 6
```

Then, let's make a random EQ curve. The curve is in dB.

```{code-cell} ipython3
curve = -2.5 + 1 * torch.rand(n_bands)
```

Now, apply it to the impulse response.

```{code-cell} ipython3
eq_ir = ir.clone().equalizer(curve)
```

Then convolve with the signal.

```{code-cell} ipython3
output = spk.clone().convolve(eq_ir)
```

```{code-cell} ipython3
post.disp(output)
```

## Pitch shifting and time stretching

Pitch shifting and time stretching can be applied to signals:

```{code-cell} ipython3
outputs = {
    "original": spk,
    "pitch_shifted": spk.clone().pitch_shift(2),
    "time_stretched": spk.clone().time_stretch(0.8),
}
post.disp(outputs)
```

Like other transformations, they also get applied across an entire batch.

```{code-cell} ipython3
spk_batch.clone().pitch_shift(2)
spk_batch.clone().time_stretch(0.8)
```

## Codec transformations

This one is a bit wonky, but you can take audio, convert it into a highly compressed format, and then get the samples back out. This creates a sort of "Zoom-y" effect.

```{code-cell} ipython3
output = spk.clone().apply_codec("Ogg")
```

```{code-cell} ipython3
post.disp(output)
```

## Putting it all together

This is a fluent interface, so operations can be chained together easily. Let's augment an entire batch by chaining these effects together. We'll start from scratch, loading each batch fresh so that nothing gets overwritten inside the augmentation pipeline.

```{code-cell} ipython3
def load_batch(batch_size, state=None):
    spk_batch = AudioSignal.batch([
        AudioSignal.salient_excerpt('../tests/audio/spk/f10_script4_produced.wav', duration=5, state=state)
        for _ in range(batch_size)
    ])
    nz_batch = AudioSignal.batch([
        AudioSignal.excerpt('../tests/audio/nz/f5_script2_ipad_balcony1_room_tone.wav', duration=5, state=state)
        for _ in range(batch_size)
    ])
    ir_batch = AudioSignal.batch([
        AudioSignal('../tests/audio/ir/h179_Bar_1txts.wav')
        for _ in range(batch_size)
    ])
    return spk_batch, nz_batch, ir_batch
```
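Before building the full pipeline, it can be worth sanity-checking the loader. A quick illustrative check that everything comes back with the expected batch dimensions:

```{code-cell} ipython3
# Load a small batch and confirm shapes: (batch, channels, samples).
_spk, _nz, _ir = load_batch(2, state=audiotools.util.random_state(0))
print(_spk.batch_size, _spk.audio_data.shape)
print(_nz.batch_size, _nz.audio_data.shape)
print(_ir.batch_size, _ir.audio_data.shape)
```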
We'll apply the following pipeline, randomly sampling parameters for each effect:

1. Pitch shift the speech.
2. Time stretch the speech.
3. Equalize the noise.
4. Equalize the impulse response.
5. Convolve the speech with the impulse response.
6. Mix the speech and noise at some random SNR.

```{code-cell} ipython3
batch_size = 4

# Seed is given to the function for reproducibility.
def augment(seed):
    state = audiotools.util.random_state(seed)
    spk_batch, nz_batch, ir_batch = load_batch(batch_size, state)

    n_semitones = state.uniform(-2, 2)
    factor = state.uniform(0.8, 1.2)
    snr = state.uniform(10, 40, batch_size)

    # Make a copy so we have it later for training targets.
    clean_spk = spk_batch.clone()

    spk_batch = (
        spk_batch
        .pitch_shift(n_semitones)
        .time_stretch(factor)
    )

    # Augment the noise signal with equalization.
    n_bands = 6
    curve = -1 + 1 * state.rand(nz_batch.batch_size, n_bands)
    nz_batch = nz_batch.equalizer(curve)

    # Augment the impulse response to simulate microphone effects.
    n_bands = 6
    curve = -1 + 1 * state.rand(ir_batch.batch_size, n_bands)
    ir_batch = ir_batch.equalizer(curve)

    # Convolve with the impulse response, then mix with noise at the target SNR.
    noisy_spk = (
        spk_batch
        .convolve(ir_batch)
        .mix(nz_batch, snr=snr)
    )

    return clean_spk, noisy_spk
```

Let's augment and then listen to each item in the batch.

```{code-cell} ipython3
clean_spk, noisy_spk = augment(0)

sr = clean_spk.sample_rate
outputs = {}
for i in range(clean_spk.batch_size):
    _outputs = {
        "clean": clean_spk[i],
        "noisy": noisy_spk[i],
    }
    outputs[f"{i+1}"] = _outputs
post.disp(outputs)
```
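From here, clean/noisy pairs like these can be fed to a model as training targets and inputs, or written to disk for inspection. For example, a sketch that saves each pair as WAV files (the output directory and file names below are just illustrative):

```{code-cell} ipython3
import tempfile
from pathlib import Path

# Write each clean/noisy pair out for offline listening.
out_dir = Path(tempfile.mkdtemp())
for i in range(clean_spk.batch_size):
    clean_spk[i].write(out_dir / f"clean_{i}.wav")
    noisy_spk[i].write(out_dir / f"noisy_{i}.wav")
```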