The tune Godmother is built from stems of Jlin and Holly's track "Expand", processed by a voice conversion model trained to turn Mat's voice into Holly's voice. The name "Godmother" appeared because I was referred to (rather androgynously) as Spawn's Göttin ("goddess"), or god-parent, in a Swiss article about Spawn. The song has a music video made from composite portraits of Jlin and Holly; photos of the ensemble were used to produce Holly's hybrid album cover.
Godmother is essentially a Vocaloid track created from unusual input, with a "neural" component that adds a sense of hyper-realism. Most of the texture of Godmother results from the vocoder analyzing drum sounds and resynthesizing them, with a style transfer step in between that makes the voices sound more like Holly. Jlin's drum stems were written in stereo, so I processed the channels individually, resulting in a lovely stereo field as these insect-like beatboxing voices fly in circles around your head.
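The channel-wise processing could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the hypothetical `convert` function stands in for the trained voice conversion model, and is just an identity here.

```python
import numpy as np

def convert(mono):
    """Stand-in for the trained voice conversion model (identity here)."""
    return mono

def process_stereo(stereo):
    # stereo: (n_samples, 2) array. Run each channel through the model
    # independently, then re-interleave to preserve the stereo image.
    left, right = stereo[:, 0], stereo[:, 1]
    return np.stack([convert(left), convert(right)], axis=1)

# Hypothetical stand-in for an isolated stereo drum stem: 1 s of noise at 48 kHz.
drums = np.random.default_rng(0).standard_normal((48000, 2))
out = process_stereo(drums)
```

Because each channel goes through the model on its own, any channel differences in the source stem survive conversion, which is what produces the circling stereo field described above.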
Spawn was never a monolithic system. "Spawn" was a fictional character that is part of the album's narrative: a way to talk metaphorically about machine learning through a parent-child relationship. In retrospect, perhaps Holly gave too much credit to this character for taking an active role in creating the album. As any programmer can attest: what is a piece of software, but a miserable little pile of shell scripts?
In reality, while our goal was to make something with machine learning (and actually do it, not just fake it), I also wanted it to sound like it was "made by AI" (whatever that meant) when people eventually heard the recording. The resulting audio also needed to be roughly CD quality, so that it would be usable on the record. Most examples of audio machine learning were very lo-fi, and not in a good way. Since I was hardly up to the task of designing a neural network architecture from scratch, I delved into the open source machine learning software that was available at the time.
Finding machine learning software that worked with audio was difficult. My experiments using SampleRNN to process voices were mostly a dead end. (Our first test can be heard babbling on the track Birth.) Individual SampleRNN networks typically produced only one kind of sound, and were not really capable of controllable vocal synthesis.
Since it was 2018 and pix2pix was then popular for image style transfer, I eventually tracked down voice conversion software called become-yukarin that built on pix2pix. The developer, K. Hiroshiba, had taken Vocaloid stems and made parallel recordings of their own voice. They designed and trained a model that could transform their voice into Yuzuki Yukari, one of many Vocaloid voices, itself based on the voice of Chihiro Ishiguro.
Ironically, one of the things that had inspired Holly initially was the paper, A Neural Parametric Singing Synthesizer (Blaauw and Bonada, 2017, link), by researchers affiliated with the Universitat Pompeu Fabra in Barcelona. I learned much later that this was where the original Vocaloid algorithm was developed, and that what the paper described was essentially Vocaloid as well. Eventually I found become-yukarin documented in the paper, Two-Stage Sequence-to-Sequence Neural Voice Conversion with Low-to-High Definition Spectrogram Mapping (Hiroshiba et al. 2018, link).
The neural vocoder works similarly to the official Vocaloid software from Yamaha. It uses the open-source WORLD vocoder to divide a sound into harmonic and inharmonic spectra (analogous to buzz and hiss in an analog vocoder), as well as detecting the fundamental pitch. A pix2pix-style transfer network is trained to turn one set of spectra into another, and the converted spectra are then resynthesized into vocal audio with WORLD. The final step trains a super-resolution network to improve the quality of the converted spectra.
Once we had some software to try, I instructed Holly and Mat to make a speech training dataset with broad phonetic coverage for English, as well as a singing dataset. We trained a model on the dataset with become-yukarin, then used the model to process isolated drum stems from Jlin's tune Expand. I experimented with sample rates, test data, and frequency transfer – the latter being the most useful, as you could pitch a recording into a higher register. You can hear our tests below: