The tune Godmother is built from stems of Jlin and Holly's track "Expand", processed by a voice conversion model trained to turn Mat's voice into Holly's voice. The name "Godmother" appeared because I was referred to (rather androgynously) as Spawn's Göttin ("goddess"), or god-parent, in a Swiss article about Spawn. The song has a music video made from composite portraits of Jlin and Holly; photos of the ensemble were used to produce Holly's hybrid album cover.
Godmother is essentially a Vocaloid track created from unusual input, with a "neural" component that adds a sense of hyper-realism. Most of the texture of Godmother results from the vocoder analyzing drum sounds and resynthesizing them, with a style transfer step in between that makes the voices sound more like Holly. Jlin's drum stems were written in stereo, so I processed the channels individually, resulting in a lovely stereo field as these insect-like beatboxing voices fly in circles around your head.
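The channel-wise processing could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the hypothetical `convert` function stands in for the trained voice conversion model, and is just an identity here.

```python
import numpy as np

def convert(mono):
    """Stand-in for the trained voice conversion model (identity here)."""
    return mono

def process_stereo(stereo):
    # stereo: (n_samples, 2) array. Run each channel through the model
    # independently, then re-interleave to preserve the stereo image.
    left, right = stereo[:, 0], stereo[:, 1]
    return np.stack([convert(left), convert(right)], axis=1)

# Hypothetical stand-in for an isolated stereo drum stem: 1 s of noise at 48 kHz.
drums = np.random.default_rng(0).standard_normal((48000, 2))
out = process_stereo(drums)
```

Because each channel goes through the model on its own, any channel differences in the source stem survive conversion, which is what produces the circling stereo field described above.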
Spawn was never a monolithic system. "Spawn" was a fictional character that is part of the album's narrative: a way to talk metaphorically about machine learning through a parent-child relationship. In retrospect, perhaps Holly gave too much credit to this character for taking an active role in creating the album. As any programmer can attest: what is a piece of software, but a miserable little pile of shell scripts?
In reality, while our goal was to make something with machine learning (and actually do it, not just fake it), I also wanted it to sound like it was "made by AI" (whatever that meant) when people eventually heard the recording. The resulting audio also needed to be roughly CD quality, so that it would be usable on the record. Most examples of audio machine learning were very lo-fi, and not in a good way. Since I was hardly up to the task of designing a neural network architecture from scratch, I delved into the open source machine learning software that was available at the time.
Finding machine learning software that worked with audio was difficult. My experiments using SampleRNN to process voices were mostly a dead end. (Our first test can be heard babbling on the track Birth.) Individual SampleRNN networks typically produced only one kind of sound, and were not really capable of controllable vocal synthesis.
Since it was 2018 and pix2pix was then popular for image style transfer, I eventually tracked down voice conversion software called become-yukarin that built on pix2pix. The developer, K. Hiroshiba, had taken Vocaloid stems and made parallel recordings of their own voice. They designed and trained a model that could transform their voice into Yuzuki Yukari, one of many Vocaloid voices, itself based on the voice of Chihiro Ishiguro.
Ironically, one of the things that had inspired Holly initially was the paper, A Neural Parametric Singing Synthesizer (Blaauw and Bonada, 2017, link), by researchers affiliated with the Universitat Pompeu Fabra in Barcelona. I learned much later that this was where the original Vocaloid algorithm was developed, and that what the paper described was essentially Vocaloid as well. Eventually I found become-yukarin documented in the paper, Two-Stage Sequence-to-Sequence Neural Voice Conversion with Low-to-High Definition Spectrogram Mapping (Hiroshiba et al. 2018, link).
The neural vocoder works similarly to the official Vocaloid software from Yamaha. It uses the open-source WORLD vocoder to divide a sound into harmonic and inharmonic spectra (analogous to buzz and hiss in an analog vocoder), as well as detecting the fundamental pitch. A pix2pix-style transfer network is trained to turn one set of spectra into another, and the converted spectra are then resynthesized into vocal audio with WORLD. The final step trains a super-resolution network to improve the quality of the converted spectra.
Once we had some software to try, I instructed Holly and Mat to make a speech training dataset with broad phonetic coverage for English, as well as a singing dataset. We trained a model on the dataset with become-yukarin, then used the model to process isolated drum stems from Jlin's tune Expand. I experimented with sample rates, test data, and frequency transfer – the latter being the most useful, as you could pitch a recording into a higher register. You can hear our tests below: