In 2018, I taught machines to break glass. For the task I used SampleRNN, which trains a neural network on raw audio, then uses it to generate new audio. While I trained on a diverse set of datasets (see sonic gnarl), this page focuses on glass as a way to compare how SampleRNN unravels a single sound - when it works well, and when it works poorly.
The first success I had with SampleRNN was rather counter-intuitive. I had just figured out how to get it running, and Holly, Mat and I had just done some quick voice recordings in her kitchen, with the intent of making a dataset. The drinking glass in my hand made a nice bell-like tone when I tapped it, so we recorded this as well.
Later I would set up training jobs on these recordings. At the rate we could train using their desktop computer, each input recording would take six hours to process, resulting in a trained network of debatable quality and a couple minutes of sample audio. With such a slow, iterative process, I had a lot of time to think about what I wanted to train next. Results from the simple vocal recordings were mixed - the networks tended to get stuck in simple patterns, as I was to learn. Expecting more variety, I tried mixing inputs by layering the voices, and on a whim made a layered recording of the glass, pitched up and down tonally and occasionally reversed.
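That layering step amounts to something like the sketch below - pitch-shifted and reversed copies of one recording, mixed back together. I did it by hand in an audio editor; here librosa stands in for the pitch shifting, and the semitone shifts and file names are illustrative, not the values I actually used.

```python
# A minimal sketch of the layering technique: pitch-shifted and
# reversed copies of one recording, summed together.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("drinking_glass.wav", sr=None, mono=True)

layers = [y]
for steps in (-7, -3, 3, 7):  # shift down and up by a few semitones
    layers.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=steps))
layers.append(y[::-1])  # an occasional reversed copy

mix = np.sum(np.stack(layers), axis=0)
mix /= np.max(np.abs(mix))  # normalize to avoid clipping
sf.write("glass_layered.wav", mix, sr)
```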
SampleRNN could simulate the complex glass recording astoundingly well - counterintuitively, it seemed at the time. I spent the next few months training networks on a variety of complex sounds - recordings containing a mixture of tones, transients, and rhythms. The finest results can be heard from the Auctioneer and the Siren. But the glass always surprised me the most - the way the tones seem to bend into one another.
Glass Bells (source) Glass Bells (neural)
Meanwhile, Hito told me she'd found a company that was training AI to recognize the sound of breaking windows, and she wanted to try to make these sounds audible.
The video she'd recorded put the company's methodology on display: they were smashing windows in a disused airplane hangar in order to train a home security system. The breaking glass echoed in the cavernous space, smeared into a dramatic reverberation. Acoustically, it was a mess, but the image suggested we could make a neural sound from breaking glass that illustrated what training a neural network would do to the data. And perhaps, as Hito insisted, it could also be musical.
This section describes the results of my breaking-glass tests with SampleRNN. Indeed, I got good results surprisingly quickly. Rather than break any windows myself, I relied on recordings of breaking glass from a sound effects library (now lost, alas). For a dataset, I had just 2-3 minutes of breaking glass sounds strung together. I would do a training pass and listen to the results, edit the input to remove parts that seemed related to artifacts I didn't like (usually passages that were overly loud and noisy), and then retrain.
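I did that trimming by ear in an audio editor, but automated, the pass would look something like this sketch - drop any window whose loudness crosses a threshold and string the rest back together. The threshold, window size, and file names here are placeholders, not my actual editing criteria:

```python
# A rough sketch of the curation pass: drop half-second windows whose
# RMS loudness exceeds a threshold, keep the quieter passages.
import numpy as np
import soundfile as sf

audio, sr = sf.read("breaking_glass_dataset.wav")  # assumes mono

win = sr // 2  # half-second windows
kept = []
for start in range(0, len(audio) - win, win):
    chunk = audio[start:start + win]
    if np.sqrt(np.mean(chunk ** 2)) < 0.3:  # keep below the RMS threshold
        kept.append(chunk)

sf.write("breaking_glass_trimmed.wav", np.concatenate(kept), sr)
```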
These samples are from the same network, with different amounts of training. Be warned, the under-trained results can be noisy.
glass, best fit glass, good fit glass, undertrained
I ran more experiments with network hyperparameters that brought out other qualities of the glass, though the results were noisier. My research notes contain some fanciful descriptions of what these sounded like:
whistling glass fuzzy glass punch tumbling glass
For comparison, these are the first training results, which eventually led to the better results above. Still, you can hear the glass in there - either the rhythm of it breaking, or its tonal quality.
first test with foley foley, single channel from glass I recorded
The first test uses two channels from a stereo recording (in series), but the second uses just one channel. At the time I noted that "noise decreases the further it trains.. but there is more silence and the breaking is further apart." I trimmed noisier parts to get the good results above. The final example used my own recording and led to the musical glass below.
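For clarity, "two channels in series" just means the left channel of the stereo foley recording followed by the right. A sketch, with placeholder file names:

```python
# Turn a stereo recording into one long mono dataset: left channel,
# then right channel, back to back.
import numpy as np
import soundfile as sf

stereo, sr = sf.read("glass_foley.wav")  # assumes a stereo file, shape (n, 2)
in_series = np.concatenate([stereo[:, 0], stereo[:, 1]])
sf.write("glass_foley_series.wav", in_series, sr)
```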
Next I moved on to musical glass. I used the same technique as with the drinking glass - taking a recording, pitching it up and down, and layering it:
Glass Filing Cabinet (source) Glass Filing Cabinet (neural)
For this test, I made a recording at Hito's office at UdK. I had thought that we'd need to actually break glass for this project, so she had bought a half dozen wine glasses from a Getränkemarkt. When the time came, I was alone with the glasses, hammer in hand, staring at them and wondering whether I really needed to destroy them, thinking of that security firm and the sledgehammer.
I broke one glass and got a poor recording of it. The other glasses I clattered around the inside of a metal filing cabinet - which explains the clumsier sounds in this recording.
glass cabinet, very underfit glass cabinet, underfit glass cabinet, overfit glass cabinet, very overfit
Finally, once I had a network that performed well, I generated a rhythmic sound by restarting the network every quarter second or so, producing a string of samples that gave Hito her "musical neural glass".
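The restart trick is simple: generate a quarter second, reset the network, repeat, and concatenate the clips. In the sketch below, sample_from_model is a hypothetical stand-in for the SampleRNN sampling routine (here it just emits decaying noise so the script runs), and the sample rate and clip count are illustrative:

```python
# Restart generation every quarter second and string the clips together.
import numpy as np
import soundfile as sf

SR = 16000  # illustrative sample rate

def sample_from_model(n_samples, rng):
    """Hypothetical placeholder for the trained network's sampling call.
    Returns decaying noise so the sketch is runnable; swap in your
    SampleRNN generation routine."""
    env = np.exp(-np.linspace(0, 6, n_samples))
    return rng.standard_normal(n_samples) * env * 0.3

rng = np.random.default_rng(0)
clip_len = SR // 4  # restart roughly every quarter second
clips = [sample_from_model(clip_len, rng) for _ in range(32)]
sf.write("musical_neural_glass.wav", np.concatenate(clips), SR)
```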
To hedge against failure, I also made a "traditional" implementation of breaking glass using granular synthesis. The first demo uses glass hits generated by SampleRNN, plus glass detritus / grains I recorded, following a similar project that produced infinitely-sustained glass breaking. The second demo performs spectral resynthesis.
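For the curious, a minimal granular-synthesis sketch in the spirit of that fallback: short Hann-windowed grains drawn at random from a glass recording and scattered over an output buffer. Grain size, grain count, and file names are all illustrative:

```python
# Granular synthesis: scatter short windowed grains of a source
# recording across an output buffer.
import numpy as np
import soundfile as sf

src, sr = sf.read("glass_detritus.wav")
if src.ndim > 1:
    src = src.mean(axis=1)  # fold to mono

dur = 10.0                   # seconds of output
grain = int(0.05 * sr)       # 50 ms grains
out = np.zeros(int(dur * sr) + grain)
window = np.hanning(grain)

rng = np.random.default_rng(1)
for _ in range(2000):
    i = rng.integers(0, len(src) - grain)  # where to read a grain
    j = rng.integers(0, len(out) - grain)  # where to place it
    out[j:j + grain] += src[i:i + grain] * window

out /= np.max(np.abs(out))  # normalize to avoid clipping
sf.write("glass_granular.wav", out, sr)
```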
For more varied sound sources, see the page on SampleRNN.