BOSSA - AI4Music Workshop @ NeurIPS 2025

BOSSA: Learning Music Style Through Cross-Modal Bootstrapping

Jingwei Zhao, Ziyu Wang, Gus Xia, and Ye Wang

Abstract: What is music style? Though often described using text labels such as "swing," "classical," or "emotional," the real style remains implicit and hidden in concrete music examples. In this paper, we introduce a cross-modal framework that learns implicit music styles from raw audio and applies the styles to symbolic music generation. Inspired by BLIP-2, our model leverages a Querying Transformer (Q-Former) to extract style representations from a large, pre-trained audio language model (LM), and further applies them to condition a symbolic LM for generating piano arrangements. We adopt a two-stage training strategy: contrastive learning to align style representations with symbolic expression, followed by generative modeling to perform music arrangement. We name our model BOSSA (BOotStrapping audio-to-Symbolic Arrangement). It generates piano performances jointly conditioned on a lead sheet (content) and a reference audio example (style), enabling controllable and stylistically faithful arrangement.
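To make the first training stage more concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) objective that pulls paired audio-style and symbolic embeddings together and pushes unpaired ones apart. It illustrates the general technique only and is not the BOSSA implementation: the function names, the plain-list embeddings, and the temperature of 0.07 are all assumptions made for this example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_logits(audio_embs, symbolic_embs, temperature):
    """Pairwise cosine similarities divided by the temperature."""
    a = [l2_normalize(v) for v in audio_embs]
    s = [l2_normalize(v) for v in symbolic_embs]
    return [[sum(x * y for x, y in zip(ai, sj)) / temperature for sj in s]
            for ai in a]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio_embs, symbolic_embs, temperature=0.07):
    """Average of audio-to-symbolic and symbolic-to-audio losses.

    Assumption: item i in audio_embs is paired with item i in symbolic_embs.
    The temperature value is a hypothetical default, not taken from the paper.
    """
    logits = similarity_logits(audio_embs, symbolic_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(transposed))

# Toy batch: two style embeddings and their (hypothetical) symbolic partners.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss close to zero, while swapping the partners gives a much larger loss. That gap is the signal a contrastive stage uses to align the two modalities.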

Highlight

Our model generates piano performances conditioned on a lead sheet (melody and chords, providing content) and a reference audio example (providing style). Below, we demonstrate its ability to apply diverse, freely chosen styles to the same content: an 8-bar excerpt from The Sound of Music.

An 8-bar lead sheet excerpt from The Sound of Music.

We present two arrangement results, each guided by a different reference audio: one in the style of Bossa Nova and the other in the style of Ragtime. In each case, the reference audio is shown on the left and the generated arrangement on the right. Click to explore and listen to different stylistic interpretations.

The Girl from Ipanema

Piano arrangement in the style of Bossa Nova.

The Entertainer

Piano arrangement in the style of Ragtime.

The rest of this page presents more arrangement demos.

Piano Cover Generation

When the lead sheet is paired with (or derived from) the audio, the task becomes piano cover generation. In this section, we demonstrate our model's ability to generate symbolic piano covers that capture the feel of the original audio. Demo pieces are drawn from the Ballroom, RWC-Pop, and POP909 datasets, which span a wide range of genres and instrumentation. We compare our model against three baselines: 1) PiCoGen2, 2) Audio2MIDI, and 3) a variant of our model without pre-training. To ensure a fair comparison, the lead sheet input to our model is automatically transcribed from the audio by Sheetsage, so that audio is the sole input modality for all methods. For each model, we cherry-pick the best result out of three generated samples.

Demo No.	Audio Input	Ours	PiCoGen2	Audio2MIDI	w/o Pre-Training
#01 (J-Pop)
#02 (Pop)
#03 (Jazz)
#04 (Jazz)
#05 (Rumba)
#06 (Tango)
#07 (Jazz)
#08 (Tango)
#09 (C-Pop)
#10 (C-Pop)

Audio-to-Symbolic Style Transfer

When the lead sheet is not paired with the audio, the task becomes cross-modal style transfer. In this case, our model arranges the lead sheet into a piano performance based on the style of the reference audio. In this demo section, we use reference audio from diverse genres, including classical and jazz, and lead sheets from the WikiMT and POP909 datasets, covering both Western and Eastern contemporary music. In each example, the reference audio is shown on the left, followed by four piano arrangements generated from different lead sheets.

Reference Audio: Samba