teaching machines to break glass

In 2018, I taught machines to break glass. For the task I used SampleRNN, which trains a neural network on raw audio and then uses it to generate new sound. While I trained on a diverse set of datasets (see sonic gnarl), this page focuses on glass as a way to compare how SampleRNN unravels a single sound - when it works well, and when it works poorly.
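
SampleRNN is autoregressive: it predicts each audio sample from the ones before it, one quantized 8-bit value at a time, and generation just runs that prediction in a loop. A minimal sketch of the idea in Python - not the actual architecture, which stacks RNNs at several timescales, and with model as a stand-in for the trained network:

    import numpy as np

    def generate(model, seed, n_samples):
        """Autoregressive generation, one quantized sample at a time.
        `model` is a stand-in: any callable mapping recent samples to a
        probability distribution over 256 quantization levels."""
        audio = list(seed)
        for _ in range(n_samples):
            probs = model(audio[-1024:])   # condition on recent context
            audio.append(np.random.choice(256, p=probs))
        # map 8-bit levels back to floats in [-1, 1] (linear quantization)
        return np.array(audio) / 127.5 - 1.0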

the drinking glass

The first success I had with SampleRNN was rather counterintuitive. I had just figured out how to get it running, and Holly, Mat and I had just made some quick voice recordings in her kitchen, with the intent of building a dataset. The drinking glass in my hand made a nice bell-like tone when I tapped it, so we recorded that as well.

Later I set up training jobs on these recordings. At the rate we could train on their desktop computer, each input recording took six hours to process, yielding a trained network of debatable quality and a couple of minutes of sample audio. With such a slow, iterative process, I had a lot of time to think about what to train next. Results from the simple vocal recordings were mixed; as I would learn, the network tended to get stuck in simple patterns. Hoping for more variety, I tried layering the voices into a single input, and on a whim made a layered recording of the glass, pitched up and down and occasionally reversed.
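
That layered recording is easy to approximate in code. A minimal sketch of the prep, assuming librosa and soundfile are available; the filenames and pitch offsets are placeholders:

    import librosa
    import numpy as np
    import soundfile as sf

    # Load the source recording (placeholder filename).
    y, sr = librosa.load("glass_tap.wav", sr=16000, mono=True)

    layers = []
    for steps in (-5, -2, 0, 3, 7):          # arbitrary pitch offsets
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
        if steps % 2:                        # occasionally reverse a layer
            shifted = shifted[::-1]
        layers.append(shifted)

    # Sum the layers and normalize to avoid clipping.
    mix = np.sum(layers, axis=0)
    mix /= np.abs(mix).max()
    sf.write("glass_layered.wav", mix, sr)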

SampleRNN simulated the complex glass recording astoundingly well - counterintuitively, it seemed at the time. I spent the next few months training networks on a variety of complex sounds - recordings containing a mixture of tones, transients, and rhythms. The finest results can be heard from the Auctioneer and Siren. But the glass always surprised me the most - the way the tones seem to bend into one another.

Glass Bells (source) Glass Bells (neural)

Meanwhile, Hito told me she'd found a company that was training AI to recognize the sound of breaking windows, and she wanted to try to make these sounds audible.

When I saw the video she'd recorded, the company's methodology was on display: they were smashing windows in a disused airplane hangar to train a home security system. The breaking glass echoed in the cavernous space, smeared into a dramatic reverberation. Acoustically it was a mess, but the image suggested we could make a neural sound from breaking glass that illustrated what training a neural network does to its data. And perhaps, as Hito insisted, it could also be musical.

neural glass

This section describes the results of my breaking-glass tests with SampleRNN; I got good results surprisingly quickly. Rather than break any windows myself, I relied on recordings of breaking glass from a sound effects library (now lost, alas). For a dataset, I had just 2-3 minutes of breaking glass sounds strung together. I would do a training pass and listen to the results, edit the input to remove the parts that seemed responsible for qualities I didn't like (usually passages that were overly loud and noisy), and then retrain.
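
The pruning step can be roughed out in code too. A sketch where a simple RMS threshold stands in for my ear; the filenames and the threshold are hypothetical:

    import numpy as np
    import soundfile as sf

    y, sr = sf.read("glass_dataset.wav")
    if y.ndim > 1:
        y = y.mean(axis=1)                  # fold to mono

    win = int(0.25 * sr)                    # quarter-second windows
    keep = []
    for i in range(0, len(y) - win, win):
        chunk = y[i:i + win]
        rms = np.sqrt(np.mean(chunk ** 2))
        if rms < 0.3:                       # drop overly loud, noisy passages
            keep.append(chunk)

    sf.write("glass_dataset_trimmed.wav", np.concatenate(keep), sr)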

These samples come from the same network at different amounts of training. Be warned: the under-trained results can be noisy.

glass, best fit glass, good fit glass, undertrained

I ran more tests on network hyperparameters that brought out other qualities of the glass, though the results were noisier. My research notes contain some fanciful descriptions of what these sounded like (a sketch of the knobs involved follows the samples):

whistling glass fuzzy glass punch tumbling glass
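
For reference, the knobs involved in tests like these are things like the frame sizes of SampleRNN's tiers, the hidden dimension of the RNNs, and the quantization depth. The values below are illustrative only, not the settings behind the samples above:

    # Illustrative hyperparameter variations (hypothetical values).
    configs = {
        "whistling": dict(frame_sizes=(16, 4), rnn_dim=1024, q_levels=256),
        "fuzzy":     dict(frame_sizes=(8, 2),  rnn_dim=512,  q_levels=256),
        "tumbling":  dict(frame_sizes=(32, 8), rnn_dim=1024, q_levels=256),
    }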

Finally, these are the first training results, which eventually led to the better results above. Still, you can hear the glass in there - either the rhythm of it breaking, or its tonal quality.

first test with foley foley, single channel from glass I recorded

The first test uses the two channels of a stereo recording placed in series, while the second uses just one channel. At the time I noted that "noise decreases the further it trains.. but there is more silence and the breaking is further apart." Trimming the noisier parts of the input produced the good results above. The final example used my own recording and led to the musical glass below.
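
Putting the channels "in series" just means concatenating them into one long mono file. A sketch, with placeholder filenames:

    import numpy as np
    import soundfile as sf

    y, sr = sf.read("glass_foley_stereo.wav")   # shape: (n_samples, 2)
    mono_in_series = np.concatenate([y[:, 0], y[:, 1]])
    sf.write("glass_foley_series.wav", mono_in_series, sr)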

musical glass

Finally I moved on to musical glass. I used the same technique as the drinking glass - taking a recording, pitching it up and down, and layering it:

Glass Filing Cabinet (source) Glass Filing Cabinet (neural)

For this test, I made a recording in Hito's office at UdK. I had thought we'd need to actually break glass for this project, so she had bought a half dozen wine glasses from a Getränkemarkt (a beverage store). When the time came, I was alone with the glasses, hammer in hand, staring at them and wondering if I really needed to destroy them, thinking of that security firm and the sledgehammer.

I broke one glass and got a poor recording of it. The other glasses I clattered around the inside of a metal filing cabinet - which explains the clumsier sounds on this recording.

glass cabinet, very underfit glass cabinet, underfit glass cabinet, overfit glass cabinet, very overfit

Finally, once I had a network that performed well, I generated a rhythmic sound by restarting the network every quarter second or so, producing a string of samples that gave Hito her "musical neural glass" (a sketch of the trick follows the sample):

glass cabinet (chopped)
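
The trick itself is tiny. A sketch, assuming a hypothetical generate() that samples float audio from the trained network:

    import numpy as np

    def chopped(generate, sr=16000, seconds=0.25, n_hits=64):
        """String together many short generations, restarting the network
        each time so every segment begins a fresh 'hit'.
        `generate` is a hypothetical function returning float audio."""
        hop = int(seconds * sr)
        hits = [generate(n_samples=hop) for _ in range(n_hits)]
        return np.concatenate(hits)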

the old-fashioned way

To hedge against failure, I made a "traditional" implementation of glass breaking using granular synthesis. The first demo uses glass hits generated by SampleRNN alongside glass detritus / grains I recorded, inspired by a similar project that produced an infinitely sustained glass break. The second demo performs spectral resynthesis.
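
For reference, granular synthesis here means taking short windowed slivers of a recording and overlap-adding them at random offsets. A minimal sketch with numpy; the filenames and grain settings are made up:

    import numpy as np
    import soundfile as sf

    y, sr = sf.read("glass_detritus.wav")
    if y.ndim > 1:
        y = y.mean(axis=1)

    grain = int(0.05 * sr)                      # 50 ms grains
    window = np.hanning(grain)
    out = np.zeros(10 * sr)                     # 10 seconds of output

    rng = np.random.default_rng(0)
    for _ in range(2000):
        src = rng.integers(0, len(y) - grain)   # where to read a grain
        dst = rng.integers(0, len(out) - grain) # where to place it
        out[dst:dst + grain] += window * y[src:src + grain]

    out /= np.abs(out).max()
    sf.write("glass_granular.wav", out, sr)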

more neural recordings

For more varied sound sources, see the page on SampleRNN.
