Sin descripción

1 Ramas

沉默の金 ac306c0bd8 feat(workflow): add GGUF metadata support to model conversion workflow		hace 5 meses
.github	ac306c0bd8 feat(workflow): add GGUF metadata support to model conversion workflow	hace 5 meses
cli	1cd0be76d6 fix(cli): correct sample count calculation for output audio buffer	hace 5 meses
include	5e006bccbc feat(cli): enhance argument parsing with proper error handling	hace 5 meses
scripts	5c0f9c8d53 feat(script): add name and description parameters to GGUF converter	hace 5 meses
src	5e006bccbc feat(cli): enhance argument parsing with proper error handling	hace 5 meses
tests	b8a5c45ae9 Initial commit	hace 5 meses
third_party	b8a5c45ae9 Initial commit	hace 5 meses
.gitignore	b8a5c45ae9 Initial commit	hace 5 meses
CMakeLists.txt	bf5823431c build(cmake): bump minimum CMake version to 3.17 and add CUDA driver stub linking	hace 5 meses
GGML_DEPENDENCY.md	b8a5c45ae9 Initial commit	hace 5 meses
README.md	fcf1eecf4c build: update CUDA build workflow with multi-version matrix	hace 5 meses
README.zh.md	fcf1eecf4c build: update CUDA build workflow with multi-version matrix	hace 5 meses

MelBandRoformer.cpp

中文 | English

High-performance C++ inference implementation for the Mel-Band-Roformer audio source separation model.

📖 Introduction

This project is a pure C++ inference engine for the Mel-Band-Roformer audio source separation model, built on the GGML tensor library. It theoretically supports most Mel-Band-Roformer models and is primarily used for extracting vocals or accompaniment from music.

✨ Key Features

🚀 High-Performance Inference: Supports CPU/GPU (CUDA, Vulkan) acceleration
📦 GGUF Model Format: Unified model file format for easy distribution
🎚️ Multiple Quantization Support: FP32/FP16/Q8_0/Q4_0/Q4_1/Q5_0/Q5_1
🔧 Easy Deployment: Only requires executable and GGML library
🎵 Complete Audio Pipeline: Built-in STFT/ISTFT and audio I/O
⚡ Pipeline Optimization: CPU preprocessing and GPU inference run in parallel

🚀 Quick Start

Download

Pre-built Binaries: Download executables for your platform from the Releases page
GGUF Models: Download pre-converted model files from MelBandRoformer-GGUF

Command Line Usage

./mel_band_roformer-cli <model.gguf> <input.wav> <output.wav> [options]

Options:
  --chunk-size <N>   Chunk size (in samples), defaults to model value
  --overlap <N>      Number of overlaps, defaults to model value
  --help, -h         Show help message

Parameter Description:

Parameter	Description
`--chunk-size`	Number of audio samples to process at once. Larger values require more VRAM but may improve processing efficiency. Default is typically 352800 (~8 seconds @44100Hz).
`--overlap`	Number of overlaps between chunks. Increasing this value can improve output quality as it helps reduce artifacts when reassembling chunks, but will increase inference time. Recommended value is 2-4.

Examples:

# Basic usage (using model defaults)
./mel_band_roformer-cli model.gguf song.wav vocals.wav

# Custom chunking parameters
./mel_band_roformer-cli model.gguf song.wav vocals.wav --chunk-size 352800 --overlap 2

# High quality mode (increase overlap to reduce artifacts)
./mel_band_roformer-cli model.gguf song.wav vocals.wav --overlap 4

Note: Input audio must be 44100 Hz. Stereo or mono is supported (auto-expanded).

🔧 Building from Source

Prerequisites

CMake >= 3.17
C++17 compatible compiler (MSVC 2019+, GCC 9+, Clang 10+)
GGML source code (submodule or local directory)

Getting GGML Dependency

The project supports multiple ways to obtain GGML:

# Option 1: Git Submodule (Recommended)
git submodule add https://github.com/ggerganov/ggml.git
git submodule update --init --recursive

# Option 2: Sibling Directory
cd ..
git clone https://github.com/ggerganov/ggml.git

# Option 3: Explicit Path
cmake -B build -DGGML_DIR=/path/to/ggml

See GGML_DEPENDENCY.md for details.

Build Commands

# CPU Build
cmake -B build
cmake --build build --config Release --parallel

# CUDA Acceleration (Recommended)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release --parallel

# Enable Tests
cmake -B build -DGGML_CUDA=ON -DMBR_BUILD_TESTS=ON
cmake --build build --config Release --parallel

CMake Options

Option	Default	Description
`GGML_CUDA`	`ON`	Enable CUDA backend
`MBR_BUILD_CLI`	`ON`	Build command line tool
`MBR_BUILD_TESTS`	`OFF`	Build test suite

📦 Model Conversion

If you need to convert models yourself, use convert_to_gguf.py to convert PyTorch weights to GGUF format.

Install Dependencies:

pip install torch numpy pyyaml librosa einops gguf

Conversion Command:

python scripts/convert_to_gguf.py \
    --ckpt model.ckpt \
    --config config.yaml \
    --out model.gguf \
    --dtype q8_0

Supported Quantization Types

Type	Precision	Size	Recommended Use
`fp32`	Highest	100%	Debugging/Baseline
`fp16`	High	50%	High precision needs
`q8_0`	Good	25%	Recommended (balance of precision and performance)
`q5_1`	Medium	18%	Resource constrained
`q4_0`	Lower	12.5%	Extreme compression

Note: The conversion script currently does not support K-Quant types (Q4_K, Q5_K, etc.). This is mainly because the gguf-py library has not yet implemented K-Quant quantization (only supports reading/dequantization), and most models do not meet the requirement that dim must be divisible by 256.

💻 C++ API

#include <mel_band_roformer/inference.h>
#include <mel_band_roformer/audio.h>

// 1. Load audio file
AudioBuffer input = AudioFile::Load("input.wav");

// 2. Initialize inference engine
Inference engine("model.gguf");

// 3. Get model's recommended inference parameters
int chunk_size = engine.GetDefaultChunkSize();   // e.g., 352800
int num_overlap = engine.GetDefaultNumOverlap(); // e.g., 2

// 4. Run inference (with progress callback)
auto stems = engine.Process(input.data, chunk_size, num_overlap,
    [](float progress) {
        std::cout << "Progress: " << int(progress * 100) << "%" << std::endl;
    });

// 5. Save result
AudioBuffer output{stems[0], 2, 44100, stems[0].size()};
AudioFile::Save("vocals.wav", output);

🏗️ Project Architecture

MelBandRoformer.cpp/
├── include/
│   └── mel_band_roformer/
│       ├── inference.h        # Inference Engine API
│       └── audio.h            # Audio I/O API
├── src/
│   ├── model.h/cpp            # Model weight loading & graph building (internal)
│   ├── inference.cpp          # Core inference logic (STFT → Network → ISTFT)
│   ├── stft.h                 # STFT/ISTFT implementation (Radix-2 FFT)
│   ├── audio.cpp              # Audio read/write implementation (dr_wav)
│   └── utils.h/cpp            # NPY loading, tensor comparison tools
├── third_party/
│   └── dr_libs/dr_wav.h       # dr_libs audio library
├── cli/
│   └── main.cpp               # Command line tool
├── scripts/
│   ├── convert_to_gguf.py      # PyTorch → GGUF conversion tool
│   ├── generate_test_data.py   # Test data generation script
│   └── generate_test_audio.py  # CI test audio generation (no external files needed)
├── tests/                     # Unit test suite
├── models/                    # Model file directory
└── CMakeLists.txt             # Build configuration

📐 Core Module Details

1. Model Loading (`model.h/cpp`)

The MelBandRoformer class is responsible for:

GGUF Weight Loading: Parsing hyperparameters and tensors from file
Buffer Generation: freq_indices, num_bands_per_freq, etc.
Computation Graph Building:
- BuildBandSplitGraph() - Band split layer
- BuildTransformersGraph() - Time-frequency Transformer stacking
- BuildMaskEstimatorGraph() - Mask estimator

2. Inference Engine (`inference.cpp`)

The Inference class implements the complete audio processing pipeline:

Input Audio → Chunking → STFT → Neural Network → Mask Application → ISTFT → Overlap-Add → Output

Key Methods:

Method	Function
`Process()`	Process complete audio (auto-chunking)
`ProcessChunk()`	Process a single audio chunk
`ComputeSTFT()`	Short-Time Fourier Transform
`PostProcessAndISTFT()`	Mask application and inverse transform

Pipeline Optimization:

Chunk N:   [CPU Preprocess] → [GPU Inference] → [CPU Postprocess]
Chunk N+1:                   [CPU Preprocess] → [GPU Inference] → [CPU Postprocess]
                              ↑ Parallel execution

3. STFT Implementation (`stft.h`)

Pure C++ implementation, numerically aligned with PyTorch torch.stft/istft:

Radix-2 Cooley-Tukey FFT: Efficient O(N log N) implementation
Hann Window: Periodic window function
Center Padding: Reflect mode padding
OpenMP Parallelization: Frame-level parallel acceleration

4. Audio I/O (`audio.h/cpp`)

Lightweight audio processing based on dr_libs:

Read: WAV file → float32 interleaved format
Write: float32 interleaved format → WAV file

🧪 Testing

Running Tests

# Set environment variables
$env:MBR_MODEL_PATH = "models/model.gguf"
$env:MBR_TEST_DATA_DIR = "test_data"

# Run all tests
ctest --test-dir build -C Release

# Run specific test
ctest --test-dir build -C Release -R test_inference

Test Suite

Test File	Verification Content
`test_audio`	Audio read/write functionality
`test_component_stft`	STFT/ISTFT numerical precision
`test_component_bandsplit`	Band split layer
`test_component_layers`	Transformer layers
`test_component_mask`	Mask estimator
`test_inference`	End-to-end inference
`test_chunking_logic`	Chunking overlap-add logic

Generating Test Data

First clone Music-Source-Separation-Training and install its dependencies:

git clone https://github.com/ZFTurbo/Music-Source-Separation-Training.git
cd Music-Source-Separation-Training
pip install -r requirements.txt
cd ..

python scripts/generate_test_data.py \
    --model-repo "Music-Source-Separation-Training" \
    --audio "test.wav" \
    --checkpoint "model.ckpt" \
    --output "test_data"

Acknowledgements

ggerganov/ggml - Efficient tensor library
ZFTurbo/Music-Source-Separation-Training - PyTorch reference implementation
dr_libs - Lightweight audio library

README.md