Create Your Own Free Music - 1 song in seconds!!!

3 months ago

If you follow me, or have at least seen some of my articles, then you probably know that I like to use locally hosted AI as much as possible. There are numerous reasons for that, but I won't bore you with them here.

Because of my incessant desire to run all AI locally (well, all that I can run locally on my current hardware anyway), I subscribe to a lot of different news, articles, and media to stay up to date on all things local AI. One of those is the ComfyUI Blog, and I get e-mail notifications every time something new pops up in the ComfyUI world. You can also just view the blog here https://blog.comfy.org or add it to a daily AI roundup agent if you are so inclined.

The specific post we are going to talk about is this one titled:

"ACE-Step 1.5 XL: Commercial-Grade Music Generation in ComfyUI"

https://blog.comfy.org/p/ace-step-15-xl-commercial-grade-music#media-36b3b62a-a82d-4dd6-86c6-adaa7a5200e7

If you want to skip to me actually making a song, head to the "Let's Create A Song!" header towards the bottom of this article.

Getting Everything Ready

There are 3 different 'models' to pick from. I had to test all of them OBVIOUSLY 😉

XL-Base — Most versatile, highest diversity. For maximum creative range.
XL-SFT — Peak audio quality, some loss in diversity. For clean final outputs.
XL-Turbo— 8 steps, ~6x faster, no CFG. For fast iteration.

I have used all 3, and after my initial testing I like to stick to XL-SFT specifically. But your mileage may vary.

Key Highlights

Commercial-Grade Quality — Evaluation metrics place output between Suno v4.5 and v5, with 4B parameters delivering richer audio than the 2B predecessors
Ultra-Fast Generation — Under 2 seconds per full song on an A100, under 10 seconds on an RTX 3090. xl-turbo cuts inference to just 8 steps (~6x faster than base/sft)
Flexible Duration — Generate anything from 10-second loops to full 10-minute compositions
1000+ Instruments and Styles — Fine-grained timbre description across a massive range of musical genres
50+ Language Lyrics — Prompt with lyrics for structure and style control in over 50 languages
Commercially Licensed — MIT license, trained on licensed music, royalty-free/public domain, and synthetic MIDI-to-Audio data

The blog post states that it can generate a full song in under 10 seconds on an RTX 3090. I have an RTX 5070 Ti, so it should generate a song in 6-8 seconds, on paper at least.

Assuming you already have ComfyUI installed and working...

Start ComfyUI, if it is not already running, and go to the 'Templates' section in the left-hand menu.

Then search for "ACE". I have the "Use Case" dropdown menu defaulted to only show Locally Hosted options because I am not paying for tokens... 😉

There are currently 4 Templates that we care about and they are all on the top row. Here they are:

Here is what each one will look like when you first open them up.

ACE Step v1 M2M Editing

Edit existing songs to change style and lyrics using ACE Step v1 M2M.

ACE-Step v1 Text to Instrumentals

Generate Instrumental music from text prompts using ACE Step v1.

ACE Step v1 Text to Song

Generate songs with vocals from text prompts using ACE Step v1.

Text to Song (New)

Generate songs from text prompts using ACE Step v1.

Manual Method

Once you are familiar with ComfyUI you can just download the JSON files and go more in depth with your local AI. But it is not for the faint of heart.

https://github.com/Comfy-Org/workflow_templates/blob/main/templates/audio_ace_step1_5_xl_sft.json

You could also just use it as another way to see which Models are used in workflows and download them directly from Hugging Face, if you like it that way!
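If you want to pull that model list out of a downloaded workflow file without opening it in ComfyUI, here is a minimal sketch. It assumes (I have not verified this against every template) that each model is stored somewhere in the JSON as an object with `name`, `url`, and `directory` keys, so it just walks the whole file looking for objects with that shape. The function names are mine, not anything from ComfyUI.

```python
import json
from typing import Any, Iterator


def iter_model_entries(node: Any) -> Iterator[dict]:
    """Walk parsed JSON and yield every object that looks like a model entry.

    Assumption: a model entry is any dict with name, url and directory keys
    (compared case-insensitively). The real template layout may differ.
    """
    if isinstance(node, dict):
        lowered = {key.lower(): value for key, value in node.items()}
        if {"name", "url", "directory"} <= lowered.keys():
            yield lowered
        for value in node.values():
            yield from iter_model_entries(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_model_entries(item)


def list_models(workflow_path: str) -> list[tuple[str, str, str]]:
    """Return unique (directory, name, url) tuples found in a workflow file."""
    with open(workflow_path, encoding="utf-8") as fh:
        workflow = json.load(fh)
    unique = {
        (entry["directory"], entry["name"], entry["url"])
        for entry in iter_model_entries(workflow)
    }
    return sorted(unique)
```

Calling `list_models("audio_ace_step1_5_xl_sft.json")` on a locally saved copy of the workflow should then give you one tuple per model the workflow references, if the assumption about the keys holds.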

Models:

Directory: diffusion_models
Name: acestep_v1.5_xl_sft_bf16.safetensors
URL: https://huggingface.co/Comfy-Org/ace_step_1.5_ComfyUI_files/resolve/main/split_files/diffusion_models/acestep_v1.5_xl_sft_bf16.safetensors

Directory: vae
Name: ace_1.5_vae.safetensors
URL: https://huggingface.co/Comfy-Org/ace_step_1.5_ComfyUI_files/resolve/main/split_files/vae/ace_1.5_vae.safetensors
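For anyone who prefers a script over clicking download links, the two files above could be fetched roughly like this. It is only a sketch: it assumes the default ComfyUI layout where models live under `<ComfyUI root>/models/<directory>/`, and `comfy_root` is a placeholder you would point at your own install. The two URLs are the ones listed above; the helper names are made up for this example.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# (directory, url) pairs copied from the model list in this post.
MODELS = [
    (
        "diffusion_models",
        "https://huggingface.co/Comfy-Org/ace_step_1.5_ComfyUI_files/resolve/main/"
        "split_files/diffusion_models/acestep_v1.5_xl_sft_bf16.safetensors",
    ),
    (
        "vae",
        "https://huggingface.co/Comfy-Org/ace_step_1.5_ComfyUI_files/resolve/main/"
        "split_files/vae/ace_1.5_vae.safetensors",
    ),
]


def target_path(comfy_root: str, directory: str, url: str) -> Path:
    """Build the destination path, assuming the default models/ layout."""
    filename = Path(urlparse(url).path).name
    return Path(comfy_root) / "models" / directory / filename


def download_models(comfy_root: str) -> None:
    """Download each model that is not already present on disk."""
    for directory, url in MODELS:
        dest = target_path(comfy_root, directory, url)
        if dest.exists():
            print(f"skip (already there): {dest}")
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Download to a temporary name first so an interrupted download
        # is never mistaken for a finished model file.
        partial = dest.with_name(dest.name + ".part")
        urlretrieve(url, str(partial))
        partial.replace(dest)
        print(f"saved: {dest}")
```

You would run it with something like `download_models("/path/to/ComfyUI")`. These are large files, so expect it to take a while.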

But I won't get into that here. If you know how, then there is no point in me explaining it. And if you don't know.... go ask your AI how to do it.

Just be careful downloading models willy-nilly...

Because if you are not careful, it can quickly get out of hand haha

Let's Create A Song!

For this walkthrough I am going to use the "ACE Step v1 Text to Song" Template.

Step 1: Generate the Lyrics

You can do this a million ways. For this example I am asking my local 'gpt-oss:20b' model on Ollama to generate the lyrics for me. I just typed up a short (crappy) list of what the lyrics should be about.
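If you would rather do this step from a script than from a chat window, a rough sketch using Ollama's local REST API is below. It assumes Ollama is running on its default port (11434) and that `gpt-oss:20b` has already been pulled. The prompt wording and the topic list are placeholders for illustration, not the exact prompt I typed.

```python
import json
from urllib.request import Request, urlopen

# Default local Ollama endpoint; change it if your setup differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(topics: list[str], model: str = "gpt-oss:20b") -> dict:
    """Build the JSON payload for a single, non-streamed generation."""
    bullet_list = "\n".join(f"- {topic}" for topic in topics)
    prompt = (
        "Write song lyrics with an intro, verses, a chorus and an outro. "
        "The lyrics should be about:\n" + bullet_list
    )
    return {"model": model, "prompt": prompt, "stream": False}


def generate_lyrics(topics: list[str], model: str = "gpt-oss:20b") -> str:
    """Send the prompt to the local Ollama server and return the text."""
    body = json.dumps(build_request(topics, model)).encode("utf-8")
    request = Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # Generous timeout: a 20B model can think for a while before answering.
    with urlopen(request, timeout=600) as response:
        return json.load(response)["response"]
```

A call like `generate_lyrics(["the Hive blockchain", "Splinterlands"])` would then return the lyrics as one string, ready to paste into ComfyUI.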

After thinking for 18.8 seconds this is what it created for me.

Title: “Hive High”

(Intro – short, upbeat synth riff)

Verse 1
Wake up, the sunrise is in the code,
I tap my phone, and the world is all my own.
From Steem’s spark to Hive’s blazing light,
We’ve built a universe where the future’s bright.

I post a story, a meme, a poem that shines,
The community’s humming, it’s time to rise.
Rewards in HIVE, the digital gold I keep,
Each up‑vote a boost, it’s the wealth we reap.

Chorus
Hive’s on the rise, we’re reaching for the moon,
Play, earn, share – we’re in a neon boom.
Splinterlands, Dungeon Cities, Holozing too,
The Hive’s alive, and the sky’s the limit for you.

Verse 2
In Splinterlands, I raise my deck of fire,
Turn the tide, the crowd goes wild with a cheer.
Dungeon Cities, build, explore, claim your land,
Every block of pixels earns you a stand.

Holozing brings the magic, a 3‑D delight,
We spin the stories, and we light the night.
Games, art, music – all wrapped in one chain,
We’re the community, we’re the game, we’re the gain.

Verse 3
From the ashes of old, a new era was born,
Hive split, it rose, the world’s reborn.
Governed by the people, power in our hands,
We shape the future with the swipe of a brand.

So post your thoughts, engage with friends so wide,
HIVE flows like music, on a joyous ride.
Soon the moon will glisten, the tokens will ascend,
On Hive we’re all dreamers, together we’ll transcend.

(Outro – a fade‑out of synths, a bright hook)

Hook (repeated) –
Hive, Hive, the future’s bright,
Hive, Hive, join the flight!

And now we put that into ComfyUI:

I also changed the Conditioning to:

anime opening, shonen anime, 90s rock, 90s pop rock, electric guitar, live drums, male vocals, powerful vocals, raw vocals, anthem, heroic, natural voice, no autotune
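The same two inputs can also be filled in from a script through ComfyUI's HTTP API instead of the UI. Treat this as a sketch under a few assumptions: ComfyUI is listening on its default address (127.0.0.1:8188), the workflow was exported in API format, and the text-encode node for ACE-Step exposes inputs named `tags` and `lyrics`. Check your own export, because the node and input names in your template may be different.

```python
import copy
import json
from urllib.request import Request, urlopen

# Default local ComfyUI endpoint for queueing a workflow.
COMFY_URL = "http://127.0.0.1:8188/prompt"


def set_song_inputs(workflow: dict, tags: str, lyrics: str) -> dict:
    """Return a copy of an API-format workflow with tags and lyrics filled in.

    Assumption: the relevant node is any node whose inputs contain both a
    "tags" and a "lyrics" key.
    """
    updated = copy.deepcopy(workflow)
    matches = 0
    for node in updated.values():
        inputs = node.get("inputs", {}) if isinstance(node, dict) else {}
        if "tags" in inputs and "lyrics" in inputs:
            inputs["tags"] = tags
            inputs["lyrics"] = lyrics
            matches += 1
    if matches == 0:
        raise ValueError("no node with both 'tags' and 'lyrics' inputs found")
    return updated


def queue_workflow(workflow: dict) -> dict:
    """Queue the workflow on the local ComfyUI server and return its reply."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    request = Request(
        COMFY_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)
```

With the API-format workflow loaded via `json.load`, you would call `queue_workflow(set_song_inputs(workflow, tags, lyrics))` and the song should show up in ComfyUI's output folder once the run finishes.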

The best I could do was upload the song to Google Drive and embed it here, but you may have to click 'Listen on Google Drive' to hear it. And I will preface it with this: it needs a LOT of work haha.

Listen to the track:

Listen on Google Drive

But this post is just to show you how easy it is to create your own song locally with AI. Granted, it did take ~20 seconds to generate the song, but it is still crazy that I can generate an entire song in under 30 seconds.

I generated a quick video I put on repeat so you can see the video and song all in one.

If that doesn't work you can view it on my YouTube channel that I use to post "Blockchain" videos here:

hive diy technology ai comfyui

4 comments

@stayoutoftherz 80

3 months ago

Not very convincing, I have to say. It lacks any melody.

@cherokee4life 68

3 months ago

well yeah the song is trash for sure. But that was just me throwing together lyrics and conditions and trying to not do too much touch up and rework. You can get decent quality songs if you play around with it long enough.

@nozem01 74

3 months ago

Imagine in a couple of years what AI can do... pretty cool work you're doing mate! 🤩

@cherokee4life 68

3 months ago

I mean give me real hardware to run my AI on and I can show you even crazier stuff it can do today. I am just limited with my little baby GPU 😉
