
Google AI’s new audio generation framework “AudioLM” learns how to generate realistic voice and piano music by listening to audio alone


Audio signals, whether human speech, musical composition, or ambient noise, involve multiple levels of abstraction: speech, for example, can be analyzed at the level of prosody, syntax, grammar, and semantics.

Generating audio that is well organized and consistent across all these levels has typically been addressed by conditioning the generation process on annotations, such as text transcriptions for speech synthesis or MIDI representations for piano. However, this approach breaks down for audio characteristics that such annotations do not capture, like the speaker qualities needed to help people with speech difficulties recover their voice.

Language models have shown that they can capture high-level, long-term structure for many types of content. "Textless NLP" has recently advanced unconditioned speech generation: a Transformer trained on discretized speech units can produce meaningful speech without any textual annotations. However, such models are trained only on clean speech and can synthesize only a single speaker, so their diversity and acoustic quality remain constrained.

A new study from Google presents AudioLM, a framework for audio generation that learns to produce realistic speech and piano music just by listening to audio. AudioLM surpasses earlier systems and pushes the boundaries of audio generation, with applications in speech synthesis and computer-assisted music, thanks to its long-term consistency (e.g., syntax in speech) and high fidelity. In line with responsible-AI principles, the researchers also built a system to detect synthetic audio generated by AudioLM.

AudioLM uses two distinct types of audio tokens to address these issues. In the first step, semantic tokens are extracted with w2v-BERT, a self-supervised audio model. These tokens heavily downsample the audio signal, making it possible to model long audio sequences while capturing both local dependencies and long-term global structure.
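The extraction step described above can be sketched as follows. This is a toy illustration only, not the real w2v-BERT: the encoder here is a hypothetical stand-in that computes a few summary statistics per frame, and the 512-entry codebook is random. The point is the shape of the computation: frame the waveform, embed each frame, then quantize each embedding to a discrete token ID by nearest-centroid lookup, turning 16,000 samples per second into roughly 50 tokens per second.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_frames(audio, frame_len=400, hop=320):
    """Stand-in for a self-supervised encoder: one embedding per hop."""
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop : i * hop + frame_len] for i in range(n_frames)])
    # Pretend-embedding: simple per-frame summary statistics (3-dim).
    return np.stack([frames.mean(1), frames.std(1), np.abs(frames).max(1)], axis=1)

def quantize(embeddings, codebook):
    """Map each embedding to the ID of its nearest codebook centroid."""
    d = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)

audio = rng.standard_normal(16000)        # 1 s of audio at 16 kHz
codebook = rng.standard_normal((512, 3))  # hypothetical 512-entry codebook
tokens = quantize(encode_frames(audio), codebook)
print(tokens.shape)  # → (49,): ~50 discrete tokens for 1 s of audio
```

The heavy downsampling is what makes long sequences tractable for a Transformer: a token sequence two orders of magnitude shorter than the raw waveform.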

Training a system on semantic tokens alone captures long-term consistency, but audio reconstructed from these tokens has low fidelity. To overcome this restriction, the team uses the SoundStream neural codec to produce acoustic tokens that capture the fine details of the audio waveform and enable high-quality synthesis.
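SoundStream produces its acoustic tokens with residual vector quantization (RVQ): each quantizer stage encodes the residual left over by the previous stage, so a few small codebooks together describe the waveform precisely. The sketch below shows the principle on a single vector with random codebooks; the stage count, codebook sizes, and scaling are illustrative assumptions, not SoundStream's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; return one token per stage plus the reconstruction."""
    residual, tokens, recon = x.copy(), [], np.zeros_like(x)
    for cb in codebooks:
        idx = int(((residual[None, :] - cb) ** 2).sum(1).argmin())  # nearest centroid
        tokens.append(idx)
        recon += cb[idx]       # accumulate the chosen centroid
        residual -= cb[idx]    # the next stage quantizes what is left
    return tokens, recon

dim, n_stages, cb_size = 8, 4, 256
# Later stages use smaller centroids, so they refine rather than overwrite.
codebooks = [rng.standard_normal((cb_size, dim)) * 0.5 ** s for s in range(n_stages)]
x = rng.standard_normal(dim)

tokens, recon = rvq_encode(x, codebooks)
err = np.linalg.norm(x - recon) / np.linalg.norm(x)
print(len(tokens), err)  # 4 tokens for the vector, plus its relative reconstruction error
```

The coarse/fine split AudioLM exploits falls out of this structure: the first RVQ stages carry the coarse acoustic information, the later stages the fine detail.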

AudioLM is trained on audio alone, with no text or symbolic annotations. To represent an audio sequence hierarchically, from semantic tokens down to fine acoustic tokens, AudioLM chains several Transformer models. As in a textual language model, each stage learns to predict the next token from the tokens that precede it:

  1. In the first phase, semantic tokens are used to model the overall structure of the audio file.
  2. Second, the complete semantic token sequence and the previously generated coarse acoustic tokens are fed into the coarse acoustic model as conditioning, and the model predicts the next coarse acoustic tokens. This stage models acoustic qualities such as a speaker's voice in speech or the timbre of a musical instrument.
  3. Finally, the fine acoustic model refines the coarse acoustic tokens into fine acoustic tokens. The resulting sequence of acoustic tokens is fed into the SoundStream decoder to reconstruct an audio waveform. Once trained, AudioLM can be conditioned on a short audio clip and produce a smooth, coherent continuation.
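The three stages above can be sketched as one autoregressive pipeline. Everything here is a stand-in: the real stages are large Transformers, while `next_token` below is a hypothetical deterministic sampler, and the sequence lengths and vocabulary sizes are made up for illustration. What the sketch shows is the conditioning structure: each stage generates its tokens given everything produced by earlier stages.

```python
import numpy as np

def next_token(context, vocab_size):
    """Stand-in for a Transformer's next-token prediction (not a real model)."""
    seed = hash(tuple(context)) % (2**32)
    return int(np.random.default_rng(seed).integers(vocab_size))

def generate(prompt, length, vocab_size):
    """Autoregressively extend a token sequence, one token at a time."""
    seq = list(prompt)
    for _ in range(length):
        seq.append(next_token(seq, vocab_size))
    return seq

# Stage 1: semantic tokens set the long-term structure (prompt = audio prefix).
semantic = generate(prompt=[1, 2, 3], length=20, vocab_size=512)
# Stage 2: coarse acoustic tokens, conditioned on the full semantic sequence.
coarse = generate(prompt=semantic, length=40, vocab_size=1024)[len(semantic):]
# Stage 3: fine acoustic tokens, conditioned on semantic + coarse tokens;
# a SoundStream-style decoder would then map them back to a waveform.
fine = generate(prompt=semantic + coarse, length=80,
                vocab_size=1024)[len(semantic) + len(coarse):]
print(len(semantic), len(coarse), len(fine))  # → 23 40 80
```

Note how the hierarchy trades sequence length for detail: the semantic stage is short and structural, while the acoustic stages produce progressively more tokens per second of audio.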

To evaluate the results, human raters listened to audio samples and judged whether each was a genuine continuation of a recorded human voice or a synthetic continuation generated by AudioLM. The raters were correct only 51.2% of the time, barely above chance, meaning an ordinary listener can scarcely tell speech generated by AudioLM apart from real human speech.

Because people may mistake the brief speech samples generated by AudioLM for real human speech, the researchers took steps to reduce the risk of misuse. They trained a classifier that identifies AudioLM's synthetic speech 98.6% of the time, showing that simple audio classifiers can reliably detect AudioLM continuations even though human listeners find them (almost) indistinguishable from real speech. This is an essential first step toward preventing abuse, and future work could investigate technologies such as audio "watermarking" to further strengthen these protections.
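The paper does not describe the detector's architecture here, but the underlying idea is that generated audio carries statistical artifacts that a machine can pick up even when humans cannot. The toy below illustrates just that principle with a single hypothetical scalar feature (the feature, class means, and sample counts are all invented for illustration): real and synthetic samples differ slightly in the feature's distribution, so even a midpoint threshold separates them far better than chance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical scalar feature (e.g., some codec-artifact statistic) whose
# distribution shifts between real recordings and generated audio.
real = rng.normal(loc=0.0, scale=1.0, size=1000)
fake = rng.normal(loc=3.0, scale=1.0, size=1000)

threshold = (real.mean() + fake.mean()) / 2   # simple midpoint decision rule
preds_real = real > threshold                  # True = flagged as synthetic
preds_fake = fake > threshold
accuracy = np.concatenate([~preds_real, preds_fake]).mean()
print(round(float(accuracy), 3))  # far above the ~51% humans achieve
```

A real detector would of course learn its features and decision boundary from data rather than use a fixed threshold, but the asymmetry is the same: a small, consistent statistical signature is enough for near-perfect machine detection.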

The study authors believe their work will pave the way for future applications of AudioLM to a wider variety of audio and the incorporation of AudioLM into an encoder-decoder framework for conditioned tasks such as text-to-speech and speech-to-speech translation.

This Article is written as a research summary article by Marktechpost Staff based on the research paper 'AudioLM: a Language Modeling Approach to Audio Generation'. All Credit For This Research Goes To Researchers on This Project. Check out the paper, project and reference article.


Tanushree Shenwai is an intern consultant at MarktechPost. She is currently pursuing her B.Tech from Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast and has a keen interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring new technological advancements and applying them to real life.