ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation

Junmin Gong, Yulin Song, Sean Zhao, Sen Wang, Shengyuan Xu, Joe Guo

📝 Abstract

🚀 We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast—under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM, and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style.

🌉 At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints—scaling from short loops to 10-minute compositions—while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). ⚡ Uniquely, this alignment is achieved through intrinsic reinforcement learning relying solely on the model's internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. 🎚️

🔮 Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities—such as cover generation, repainting, and vocal-to-BGM conversion—while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. 🎸

Model	AudioBox CE ↑	AudioBox CU ↑	AudioBox PC ↑	AudioBox PQ ↑	SongEval Coh. ↑	SongEval Mus. ↑	SongEval Mem. ↑	SongEval Cla. ↑	SongEval Nat. ↑	Style Align ↑	Lyric Align ↑
Udio-v1.5	7.45	7.65	6.15	8.03	4.15	3.96	4.09	3.93	3.86	34.9	24.8
Suno-v4.5	7.63	7.85	6.22	8.25	4.64	4.51	4.63	4.53	4.49	40.5	32.7
Suno-v5	7.69	7.87	6.51	8.29	4.72	4.62	4.71	4.63	4.56	46.8	34.2
Mureka-V7.6	7.44	7.71	6.35	8.13	4.43	4.29	4.35	4.29	4.21	36.2	22.4
MinMax-2.0	7.71	7.95	6.42	8.38	4.61	4.51	4.59	4.50	4.41	43.1	29.5
Yue	6.58	7.29	4.95	7.39	3.01	2.80	2.85	2.79	2.82	26.8	−4.6
ACE-Step 1.0	7.22	7.52	6.50	7.76	3.99	3.73	3.85	3.78	3.68	28.5	0.9
LeVo	7.61	7.78	5.92	8.31	3.55	3.35	3.32	3.31	3.20	29.4	−1.2
DiffRhythm 2	7.25	7.61	6.33	7.99	3.99	3.79	3.97	3.82	3.66	32.1	3.8
HeartMuLa	7.66	7.89	6.15	8.25	4.68	4.55	4.69	4.55	4.45	31.7	28.6
ACE-Step 1.5	7.42	8.09	6.47	8.35	4.72	4.67	4.72	4.66	4.59	39.1	26.3

Table 1: Comparison with commercial (top) and open-source (bottom) music generation models. Bold = best, underline = second best. ↑ higher is better.
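
The bold and underline markers from Table 1 are not visible in plain text, so the ranking can be recomputed from the numbers instead. The sketch below is only an illustration: the scores are copied from the table, while the names `METRICS`, `SCORES`, and `top_two` are hypothetical helpers, not part of ACE-Step. Ties are reported together, since the helper ranks by distinct score values.

```python
# Scores copied from Table 1; list order matches the table's metric columns.
METRICS = ["CE", "CU", "PC", "PQ", "Coh.", "Mus.", "Mem.", "Cla.", "Nat.",
           "Style Align", "Lyric Align"]

SCORES = {
    "Udio-v1.5":    [7.45, 7.65, 6.15, 8.03, 4.15, 3.96, 4.09, 3.93, 3.86, 34.9, 24.8],
    "Suno-v4.5":    [7.63, 7.85, 6.22, 8.25, 4.64, 4.51, 4.63, 4.53, 4.49, 40.5, 32.7],
    "Suno-v5":      [7.69, 7.87, 6.51, 8.29, 4.72, 4.62, 4.71, 4.63, 4.56, 46.8, 34.2],
    "Mureka-V7.6":  [7.44, 7.71, 6.35, 8.13, 4.43, 4.29, 4.35, 4.29, 4.21, 36.2, 22.4],
    "MinMax-2.0":   [7.71, 7.95, 6.42, 8.38, 4.61, 4.51, 4.59, 4.50, 4.41, 43.1, 29.5],
    "Yue":          [6.58, 7.29, 4.95, 7.39, 3.01, 2.80, 2.85, 2.79, 2.82, 26.8, -4.6],
    "ACE-Step 1.0": [7.22, 7.52, 6.50, 7.76, 3.99, 3.73, 3.85, 3.78, 3.68, 28.5, 0.9],
    "LeVo":         [7.61, 7.78, 5.92, 8.31, 3.55, 3.35, 3.32, 3.31, 3.20, 29.4, -1.2],
    "DiffRhythm 2": [7.25, 7.61, 6.33, 7.99, 3.99, 3.79, 3.97, 3.82, 3.66, 32.1, 3.8],
    "HeartMuLa":    [7.66, 7.89, 6.15, 8.25, 4.68, 4.55, 4.69, 4.55, 4.45, 31.7, 28.6],
    "ACE-Step 1.5": [7.42, 8.09, 6.47, 8.35, 4.72, 4.67, 4.72, 4.66, 4.59, 39.1, 26.3],
}


def top_two(metric: str) -> list[tuple[float, list[str]]]:
    """Return [(best value, models), (second-best value, models)] for one metric."""
    col = METRICS.index(metric)
    # Distinct score values, highest first (every metric is "higher is better").
    values = sorted({row[col] for row in SCORES.values()}, reverse=True)[:2]
    return [(v, [m for m, row in SCORES.items() if row[col] == v]) for v in values]


if __name__ == "__main__":
    for metric in METRICS:
        (best, best_models), (second, second_models) = top_two(metric)
        print(f"{metric:12s} best={best} {best_models}  second={second} {second_models}")
```

Running the script prints one line per metric, which makes it easy to see where ACE-Step 1.5 leads outright, where it ties, and where commercial models remain ahead.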

Generation Speed (4-min song on A100)

Figure: bars for Ours, a few models (20s), some models (2min), and most models (4min). Annotation: 10–120× faster than alternatives.
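
The "10–120× faster" range can be checked with simple arithmetic. The sketch below rests on two assumptions that the page does not state explicitly in the figure: that our time is about 2 seconds (the upper bound quoted in the abstract for a full song on an A100), and that the 20s, 2min, and 4min values belong to "a few", "some", and "most" models respectively. The names `OURS_SECONDS`, `ALTERNATIVES`, and `speedup` are illustrative only.

```python
# Assumption: "under 2 seconds per full song on an A100" from the abstract.
OURS_SECONDS = 2.0

# Assumption: figure values paired with their labels, converted to seconds.
ALTERNATIVES = {
    "a few models": 20,    # 20s
    "some models": 120,    # 2min
    "most models": 240,    # 4min
}


def speedup(other_seconds: float, ours_seconds: float = OURS_SECONDS) -> float:
    """Ratio of another model's generation time to ours."""
    return other_seconds / ours_seconds


if __name__ == "__main__":
    for label, seconds in ALTERNATIVES.items():
        print(f"{label}: {speedup(seconds):.0f}x faster")
```

Under these assumptions the ratios come out to 10×, 60×, and 120×, which matches the range quoted in the figure.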

🎵 Examples

Caption	Lyrics	ACE-Step generated

🏗️ Framework & Application
