godmother

Godmother is built from stems of Jlin and Holly's track "Expand", processed by a voice conversion model trained to turn Mat's voice into Holly's. The name "Godmother" came about because I was referred to (rather androgynously) as Spawn's Göttin or god-parent in a Swiss article about Spawn. The song has a music video made from composite portraits of Jlin and Holly; photos of the ensemble were used to produce Holly's hybrid album cover.

godmother (album mix)

Godmother is essentially a Vocaloid track created from unusual input, with a "neural" component that adds a sense of hyper-realism. Most of the texture of Godmother is a result of the vocoder analyzing drum sounds and resynthesizing them, with style transfer in between that makes the voices sound more like Holly. Jlin's drum stems were written in stereo, so I processed the channels individually, resulting in a lovely stereo field as these insect-like beatboxing voices fly in circles around your head.
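As a rough illustration of processing the channels individually, here is a minimal Python sketch. It assumes the stem is already loaded as interleaved left/right samples, and `convert` is a hypothetical stand-in for whatever mono voice conversion step is used; none of these names come from the actual project code.

```python
from typing import Callable, List, Sequence, Tuple

Mono = List[float]


def split_stereo(interleaved: Sequence[float]) -> Tuple[Mono, Mono]:
    """Split interleaved L/R samples into separate left and right channels."""
    if len(interleaved) % 2 != 0:
        raise ValueError("expected an even number of interleaved samples")
    return list(interleaved[0::2]), list(interleaved[1::2])


def merge_stereo(left: Sequence[float], right: Sequence[float]) -> List[float]:
    """Interleave two mono channels, trimming to the shorter one."""
    n = min(len(left), len(right))
    out: List[float] = []
    for i in range(n):
        out.extend((left[i], right[i]))
    return out


def process_stereo(
    interleaved: Sequence[float],
    convert: Callable[[Mono], Mono],  # hypothetical mono conversion step
) -> List[float]:
    """Run the same mono converter on each channel independently."""
    left, right = split_stereo(interleaved)
    return merge_stereo(convert(left), convert(right))
```

Because each channel is converted on its own, any differences between left and right in the source survive the conversion, which is one plausible reason the result keeps a wide stereo image.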

Spawn was never a monolithic system. "Spawn" is a fictional character in the album's narrative: a way to talk metaphorically about machine learning through the parent-child relationship. In reality, while our goal was to make something with machine learning, I also wanted it to sound like it was "made by AI" (whatever that meant). The resulting audio also needed to be roughly CD quality, so that it would be usable on the record. Most examples of audio machine learning at the time were very lo-fi, and not in a good way. Since I was hardly up to the task of designing a neural network architecture from scratch, I delved into the open source machine learning software that was available.

Finding machine learning software that worked with audio was difficult. My experiments using SampleRNN to process voices were a dead end, though one test can be heard on the track Birth. Since it was 2018 and pix2pix was then popular for image style transfer, I eventually tracked down voice conversion software called become-yukarin that built on pix2pix. The developer, K. Hiroshiba, had taken Vocaloid stems and made parallel recordings of their own voice. They designed and trained a model that could transform their voice into Yuzuki Yukari, one of many Vocaloid voices, itself based on the voice of Chihiro Ishiguro.

Ironically, one of the things that had initially inspired Holly was the paper A Neural Parametric Singing Synthesizer (2017), by researchers affiliated with the Universitat Pompeu Fabra in Barcelona. I learned much later that this was where the original Vocaloid algorithm was developed, and that what the paper described was basically Vocaloid too. Eventually I found become-yukarin documented in the paper Two-Stage Sequence-to-Sequence Neural Voice Conversion with Low-to-High Definition Spectrogram Mapping (2018).

The neural vocoder works similarly to the official Vocaloid software from Yamaha. It uses the open-source WORLD vocoder to split a sound into harmonic and inharmonic spectra and to detect its fundamental pitch (f0). It then trains a pix2pix style transfer network to turn one set of spectra into another, and resynthesizes vocal audio with WORLD. A final step trains a super-resolution network to improve the quality of the converted spectra.

Once we had some software to try, I instructed Holly and Mat to make a speech training dataset with broad phonetic coverage for English, as well as a singing dataset. We trained a model on the dataset with become-yukarin, then used the model to process isolated drum stems from Jlin's tune Expand. I experimented with sample rates, test data, and frequency transfer – the latter being the most useful, as you could pitch a recording into a higher register. You can hear our tests below:

speaking, 24k

copy f0 from source recording

mat talking | input | copy f0 | copy f0, x1.2 | copy f0, x2 | convert f0 | convert f0, x2
mat singing | input | copy f0 | copy f0, x1.2 | copy f0, x2 | convert f0 | convert f0, x2
lyra singing | input | copy f0 | copy f0, x1.2 | copy f0, x2 | convert f0 | convert f0, x2
jlin music | input | copy f0 | copy f0, x1.2 | copy f0, x2 | convert f0 | convert f0, x2
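The f0 options in the rows above can be sketched in a few lines of Python. This is an illustration under assumptions, not the code we ran: "copy" is taken to mean reusing the source pitch contour, optionally multiplied by a factor such as 1.2 or 2, and "convert" is assumed to be the log-Gaussian normalization commonly used in voice conversion, which shifts the source pitch into the target speaker's range. The function names are hypothetical, and unvoiced frames are assumed to be marked with 0 Hz.

```python
import math
from typing import List, Sequence, Tuple


def scale_f0(f0: Sequence[float], factor: float) -> List[float]:
    """Copy the source contour, multiplying voiced frames by `factor`."""
    return [hz * factor if hz > 0 else 0.0 for hz in f0]


def log_f0_stats(f0: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard deviation of log-f0 over voiced frames only."""
    voiced = [math.log(hz) for hz in f0 if hz > 0]
    if not voiced:
        raise ValueError("no voiced frames")
    mean = sum(voiced) / len(voiced)
    var = sum((v - mean) ** 2 for v in voiced) / len(voiced)
    return mean, math.sqrt(var)


def convert_f0(
    f0: Sequence[float],
    src_stats: Tuple[float, float],
    tgt_stats: Tuple[float, float],
) -> List[float]:
    """Map source pitch into the target range (assumed log-Gaussian method)."""
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    ratio = tgt_std / src_std if src_std > 0 else 1.0
    return [
        math.exp((math.log(hz) - src_mean) * ratio + tgt_mean) if hz > 0 else 0.0
        for hz in f0
    ]
```

For example, `scale_f0(contour, 2.0)` corresponds to the "copy f0, x2" column (one octave up), while `convert_f0(contour, log_f0_stats(mat_f0), log_f0_stats(holly_f0))` would correspond to "convert f0", with `mat_f0` and `holly_f0` standing in for pitch tracks gathered from each speaker's recordings.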

phrase 1: "a moth zig-zagged..."

mat | mat + sr | mat → holly | mat → holly + sr
holly → mat | holly → mat + sr | holly | holly + sr

phrase 2: "I assume moisture..."

mat | mat + sr | mat → holly | mat → holly + sr
holly → mat | holly → mat + sr | holly | holly + sr

mat tests

mat test 1 | mat → holly | mat → holly + sr
mat test 2 | mat → holly | mat → holly + sr
mat test 3 | mat → holly | mat → holly + sr
mat test 4 | mat → holly | mat → holly + sr
mat test 5 | mat → holly | mat → holly + sr
mat test 6 | mat → holly | mat → holly + sr

br song tests

br mat → holly + sr #1 + sr #2 + sr #3 + sr #4 + sr #5
br lyra → holly + sr #1 + sr #2 + sr #3 + sr #4 + sr #5

jlin tests

jlin → holly #1 + sr
jlin → holly #2 + sr
jlin → holly #3 + sr

speaking + singing network, 44k

network trained on a combination of speaking + singing datasets

br_mat 00 br_mat 01 br_mat 02 br_mat 03 br_mat 04
br_lyra 00 br_lyra 01 br_lyra 02 br_lyra 03 br_lyra 04

fling_across_the_yard holly_2018_master jlin_drums_summed
jlin_mix holly_2_master xpand_master

singing 20 singing 21 singing 22 singing 23 singing 24
singing 25 singing 26 singing 27 singing 28 singing 29

voice 20 voice 21 voice 22 voice 23 voice 24
voice 25 voice 26 voice 27 voice 28 voice 29

speaking, 44k

speaking conversion + speaking super resolution

br_mat 00 br_mat 01 br_mat 02 br_mat 03 br_mat 04
br_lyra 00 br_lyra 01 br_lyra 02 br_lyra 03 br_lyra 04
fling_across_the_yard holly_2018_master jlin_drums_summed

singing dataset thru speaking network

singing 20 singing 21 singing 22 singing 23 singing 24
singing 25 singing 26 singing 27 singing 28 singing 29

speaking network with converted f0

voice 20 voice 21 voice 22 voice 23 voice 24
voice 25 voice 26 voice 27 voice 28 voice 29

speaking dataset with input f0

voice 20 voice 21 voice 22 voice 23 voice 24
voice 25 voice 26 voice 27 voice 28 voice 29

speaking conversion + singing super-resolution

br_mat 00 br_mat 01 br_mat 02 br_mat 03 br_mat 04
br_lyra 00 br_lyra 01 br_lyra 02 br_lyra 03 br_lyra 04
fling_across_the_yard holly_2018_master jlin_drums_summed

monotone singing, 44k

singing conversion using source f0

br_mat 00 br_mat 01 br_mat 02 br_mat 03 br_mat 04
br_lyra 00 br_lyra 01 br_lyra 02 br_lyra 03 br_lyra 04
fling_across_the_yard holly_2018_master jlin_drums_summed

singing conversion using transformed (i.e. monotone) f0

br_mat 00 br_mat 01 br_mat 02 br_mat 03 br_mat 04
br_lyra 00 br_lyra 01 br_lyra 02 br_lyra 03 br_lyra 04
fling_across_the_yard holly_2018_master jlin_drums_summed

mat → holly dataset

singing 00 singing 01 singing 02 singing 03 singing 04
singing 05 singing 06 singing 07 singing 08 singing 09
singing 10 singing 12 singing 13 singing 14 singing 15
singing 16 singing 17 singing 18 singing 19 singing 20
singing 21 singing 22 singing 23 singing 24 singing 25
singing 26 singing 27 singing 28 singing 29 singing 30
singing 31 singing 32 singing 33 singing 34 singing 35
singing 36 singing 37 singing 38 singing 40 singing 41
singing 42
