ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation


Junmin Gong, Yulin Song, Sean Zhao, Sen Wang, Shengyuan Xu, Joe Guo

GitHub | Hugging Face | Paper (soon) | Demo

📝 Abstract

🚀 We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast—under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM, and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style.

🌉 At its core lies a novel hybrid architecture in which the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints, scaling from short loops to 10-minute compositions, and synthesizes metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). ⚡ Uniquely, the model is aligned through intrinsic reinforcement learning that relies solely on its own internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. 🎚️
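The planner-then-renderer data flow described above can be pictured with a minimal Python sketch. Everything here is hypothetical and for illustration only: `SongBlueprint`, `plan_song`, `render_audio`, the field names, and the 48 kHz default sample rate are assumptions, not part of the ACE-Step codebase. The two functions are stubs that only show what is handed from the LM stage to the DiT stage.

```python
from dataclasses import dataclass, field


@dataclass
class SongBlueprint:
    """Hypothetical container for what the planner hands to the DiT."""
    caption: str
    lyrics: str
    metadata: dict = field(default_factory=dict)  # e.g. duration, language


def plan_song(user_query: str, duration_s: float) -> SongBlueprint:
    """Stand-in for the LM planner: a real planner would generate these fields."""
    return SongBlueprint(
        caption=f"Expanded description of: {user_query}",
        lyrics="[verse]\n(placeholder)\n[chorus]\n(placeholder)",
        metadata={"duration_s": duration_s, "language": "en"},
    )


def render_audio(blueprint: SongBlueprint, sample_rate: int = 48_000) -> list[float]:
    """Stand-in for the DiT: returns silence of the planned length."""
    n_samples = int(blueprint.metadata["duration_s"] * sample_rate)
    return [0.0] * n_samples


if __name__ == "__main__":
    blueprint = plan_song("a mellow lo-fi loop", duration_s=2.0)
    audio = render_audio(blueprint)
    print(blueprint.caption, len(audio))
```

The point of the sketch is the separation of concerns: the short user query never reaches the renderer directly, only the expanded blueprint does.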

🔮 Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities—such as cover generation, repainting, and vocal-to-BGM conversion—while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. 🎸

                AudioBox ↑              SongEval ↑                        Style    Lyric
Model             CE    CU    PC    PQ  Coh.  Mus.  Mem.  Cla.  Nat.  Align ↑  Align ↑
--------------------------------------------------------------------------------------
Udio-v1.5       7.45  7.65  6.15  8.03  4.15  3.96  4.09  3.93  3.86     34.9     24.8
Suno-v4.5       7.63  7.85  6.22  8.25  4.64  4.51  4.63  4.53  4.49     40.5     32.7
Suno-v5         7.69  7.87  6.51  8.29  4.72  4.62  4.71  4.63  4.56     46.8     34.2
Mureka-V7.6     7.44  7.71  6.35  8.13  4.43  4.29  4.35  4.29  4.21     36.2     22.4
MinMax-2.0      7.71  7.95  6.42  8.38  4.61  4.51  4.59  4.50  4.41     43.1     29.5
--------------------------------------------------------------------------------------
Yue             6.58  7.29  4.95  7.39  3.01  2.80  2.85  2.79  2.82     26.8     −4.6
ACE-Step 1.0    7.22  7.52  6.50  7.76  3.99  3.73  3.85  3.78  3.68     28.5      0.9
LeVo            7.61  7.78  5.92  8.31  3.55  3.35  3.32  3.31  3.20     29.4     −1.2
DiffRhythm 2    7.25  7.61  6.33  7.99  3.99  3.79  3.97  3.82  3.66     32.1      3.8
HeartMuLa       7.66  7.89  6.15  8.25  4.68  4.55  4.69  4.55  4.45     31.7     28.6
ACE-Step 1.5    7.42  8.09  6.47  8.35  4.72  4.67  4.72  4.66  4.59     39.1     26.3

Table 1: Comparison with commercial (top) and open-source (bottom) music generation models. Bold = best, underline = second best. ↑ higher is better.
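Because the best and second-best markings depend on text formatting, it can help to recompute the per-column leaders directly from the numbers. The short Python sketch below does that with scores copied from Table 1; the helper names `best_models` and `columns_won` are illustrative, and ties are reported as shared wins.

```python
# Scores copied from Table 1; column order follows the table header.
COLUMNS = ["CE", "CU", "PC", "PQ", "Coh.", "Mus.", "Mem.", "Cla.", "Nat.", "Style", "Lyric"]
SCORES = {
    "Udio-v1.5":    [7.45, 7.65, 6.15, 8.03, 4.15, 3.96, 4.09, 3.93, 3.86, 34.9, 24.8],
    "Suno-v4.5":    [7.63, 7.85, 6.22, 8.25, 4.64, 4.51, 4.63, 4.53, 4.49, 40.5, 32.7],
    "Suno-v5":      [7.69, 7.87, 6.51, 8.29, 4.72, 4.62, 4.71, 4.63, 4.56, 46.8, 34.2],
    "Mureka-V7.6":  [7.44, 7.71, 6.35, 8.13, 4.43, 4.29, 4.35, 4.29, 4.21, 36.2, 22.4],
    "MinMax-2.0":   [7.71, 7.95, 6.42, 8.38, 4.61, 4.51, 4.59, 4.50, 4.41, 43.1, 29.5],
    "Yue":          [6.58, 7.29, 4.95, 7.39, 3.01, 2.80, 2.85, 2.79, 2.82, 26.8, -4.6],
    "ACE-Step 1.0": [7.22, 7.52, 6.50, 7.76, 3.99, 3.73, 3.85, 3.78, 3.68, 28.5, 0.9],
    "LeVo":         [7.61, 7.78, 5.92, 8.31, 3.55, 3.35, 3.32, 3.31, 3.20, 29.4, -1.2],
    "DiffRhythm 2": [7.25, 7.61, 6.33, 7.99, 3.99, 3.79, 3.97, 3.82, 3.66, 32.1, 3.8],
    "HeartMuLa":    [7.66, 7.89, 6.15, 8.25, 4.68, 4.55, 4.69, 4.55, 4.45, 31.7, 28.6],
    "ACE-Step 1.5": [7.42, 8.09, 6.47, 8.35, 4.72, 4.67, 4.72, 4.66, 4.59, 39.1, 26.3],
}


def best_models(column: str) -> list[str]:
    """Return every model that reaches the top score in a column (ties included)."""
    i = COLUMNS.index(column)
    top = max(row[i] for row in SCORES.values())
    return [model for model, row in SCORES.items() if row[i] == top]


def columns_won(model: str) -> list[str]:
    """Return the columns in which a model has the best (or tied-best) score."""
    return [col for col in COLUMNS if model in best_models(col)]


if __name__ == "__main__":
    for col in COLUMNS:
        print(f"{col:>6}: {', '.join(best_models(col))}")
    print("ACE-Step 1.5 leads on:", columns_won("ACE-Step 1.5"))
```

Run on these numbers, ACE-Step 1.5 leads or ties on CU and on all five SongEval columns (sharing Coh. with Suno-v5), while the CE, PC, PQ, and both alignment columns are led by commercial models.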

Generation Speed (4-min song on A100)
Ours: 2s
A few models: 20s
Some models: 2min
Most models: 4min
10–120× faster than alternatives
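The 10–120× range follows from dividing each comparison time by the 2s figure. As a quick check, the Python sketch below converts the chart labels to seconds and computes the ratios; the helper `to_seconds` and the dictionary keys are illustrative names, and the times are the ones shown in the chart.

```python
def to_seconds(label: str) -> float:
    """Parse chart labels such as '2s' or '4min' into seconds."""
    if label.endswith("min"):
        return float(label[:-3]) * 60
    return float(label.rstrip("s"))


OURS = to_seconds("2s")
OTHERS = {"a few models": "20s", "some models": "2min", "most models": "4min"}

# Speedup of the 2s figure over each comparison group.
speedups = {name: to_seconds(t) / OURS for name, t in OTHERS.items()}

# A 4-minute song generated in 2 seconds, expressed as a multiple of real time.
real_time_factor = to_seconds("4min") / OURS

if __name__ == "__main__":
    for name, ratio in speedups.items():
        print(f"{name}: {ratio:.0f}x")
    print(f"real-time factor: {real_time_factor:.0f}x")
```

The lower bound (10×) comes from the fastest comparison group at 20s and the upper bound (120×) from the slowest at 4min, which also means a 4-minute song is produced at 120 times real-time speed.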

🎵 Examples

[Audio examples: caption, lyrics, and the ACE-Step generated track]

🏗️ Framework & Application

ACE-Step Framework
Application Map