On Thursday, researchers at Google announced a new generative AI model named MusicLM that can create 24 kHz music audio from text descriptions, such as "a calming violin melody backed by a distorted guitar riff". It can also transform a hummed melody into a different musical style and output music that lasts for several minutes.
MusicLM uses an AI model trained on what Google calls "a large dataset of unlabeled music", along with captions from MusicCaps, a new dataset consisting of 5,521 music-text pairs. MusicCaps gets its text descriptions from human experts and its matching audio clips from Google's AudioSet, a collection of over 2 million labeled 10-second sound clips drawn from YouTube videos.
In general, MusicLM works in two main parts: first, it takes a sequence of audio tokens (pieces of sound) and maps them to semantic tokens (words that represent meaning) in captions for training. The second part takes user captions and/or input audio and generates acoustic tokens (the pieces of sound that make up the resulting song output). The system builds on an earlier AI model called AudioLM (introduced by Google in September) along with other components such as SoundStream and MuLan.
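The real pipeline is far more involved, but the two-stage idea can be illustrated with a short, self-contained sketch. Nothing below is Google's code: the functions semantic_model, acoustic_model, and decode_to_waveform are placeholders that return random tokens and noise, and the token and sample rates are assumed purely for illustration.

```python
# Hypothetical sketch of a two-stage text-to-music pipeline like the one
# described above. All components are random stand-ins, not Google's models.
import numpy as np

rng = np.random.default_rng(0)

def semantic_model(caption: str, n_tokens: int = 50) -> np.ndarray:
    """Stage 1 (stub): map a text caption to coarse semantic tokens that
    would capture long-term structure such as melody and rhythm."""
    return rng.integers(0, 1024, size=n_tokens)

def acoustic_model(semantic_tokens: np.ndarray, tokens_per_semantic: int = 16) -> np.ndarray:
    """Stage 2 (stub): expand semantic tokens into fine-grained acoustic
    tokens, the units a neural audio codec could decode into sound."""
    return rng.integers(0, 4096, size=len(semantic_tokens) * tokens_per_semantic)

def decode_to_waveform(acoustic_tokens: np.ndarray, sample_rate: int = 24_000) -> np.ndarray:
    """Codec decoder (stub): turn acoustic tokens into a 24 kHz waveform.
    Here it simply emits noise of a plausible length."""
    samples_per_token = sample_rate // 50  # assumed token rate, for illustration only
    return rng.standard_normal(len(acoustic_tokens) * samples_per_token)

caption = "slow tempo, bass-and-drums-led reggae song, sustained electric guitar"
audio = decode_to_waveform(acoustic_model(semantic_model(caption)))
print(f"{len(audio) / 24_000:.1f} seconds of placeholder audio")
```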
Google claims that MusicLM outperforms previous AI music generators in both audio quality and adherence to text descriptions. On the MusicLM demonstration page, Google provides numerous examples of the AI model in action, creating audio from "rich captions" that describe the feel of the music, and even vocals (which so far are just gibberish). Here's an example of a rich caption they provide:
Slow tempo, bass-and-drums-led reggae song. Sustained electric guitar. High-pitched bongos with ringing tones. Vocals are relaxed with a laid-back feel, very expressive.
Google also shows off MusicLM's "Long Generation" (creating five-minute music clips from a simple prompt), "Story Mode" (which takes a sequence of text prompts and turns them into a changing series of musical pieces), "Text and Melody Conditioning" (which takes human humming or whistling as audio input and alters it to match the style specified in a prompt), and generating music that matches the mood of captions.

Further down the example page, Google dives into MusicLM's ability to reproduce particular instruments (e.g., flute, cello, guitar), different musical genres, different levels of musician experience, places (a prison break, a gym), time periods (a club in the 1950s), and more.
AI-generated music is by no means a new idea, but AI music-generation methods of earlier decades often created musical notation that was then played by hand or through a synthesizer, whereas MusicLM generates the music's raw audio directly. In December, we reported on Riffusion, a hobbyist AI project that can similarly create music from text descriptions, though not at high fidelity. Google references Riffusion in its MusicLM academic paper and says MusicLM surpasses it in quality.
In the MusicLM paper, its creators outline the potential risks of MusicLM, including "potential misappropriation of creative content" (i.e., copyright issues), potential bias against cultures underrepresented in the training data, and potential cultural appropriation issues. As a result, Google stresses the need for further work to address these risks and is withholding the code: "We have no plans to release any models at this time."
Google researchers are already looking ahead to future improvements: "Future work may focus on lyrics generation, along with improvement of text conditioning and vocal quality. Another aspect is the modeling of high-level song structure like introduction, verse and chorus. Modeling the music at a higher sample rate is an additional goal."
It's probably not too far-fetched to suggest that AI researchers will keep improving music-generation technology until anyone can create studio-quality music in any style simply by describing it, although no one can yet predict exactly when that goal will be reached or how it will affect the music industry. Stay tuned for further developments.