# BSRoformer.cpp [δΈ­ζ–‡](README.zh.md) | English High-performance C++ inference implementation for the **BS Roformer** and **Mel-Band-Roformer** audio source separation model. ## πŸ“– Introduction This project is a pure C++ inference engine for the **BS Roformer** and **Mel-Band-Roformer** audio source separation models, built on the [GGML](https://github.com/ggerganov/ggml) tensor library. It primarily used for extracting vocals or accompaniment from music. ### ✨ Key Features - πŸš€ **High-Performance Inference**: Supports CPU/GPU (CUDA, Vulkan) acceleration - πŸ—οΈ **Multi-Architecture**: Support for both **Mel-Band Roformer** and **BS Roformer** - πŸ“¦ **GGUF Model Format**: Unified model file format for easy distribution - 🎚️ **Multiple Quantization Support**: FP32/FP16/Q8_0/Q4_0/Q4_1/Q5_0/Q5_1 - πŸ”§ **Easy Deployment**: Only requires executable and GGML library - 🎡 **Complete Audio Pipeline**: Built-in STFT/ISTFT and audio I/O - ⚑ **Pipeline Optimization**: CPU preprocessing and GPU inference run in parallel --- ## πŸš€ Quick Start ### Download - **Pre-built Binaries**: Download executables for your platform from the [Releases](../../releases) page - **GGUF Models**: Download pre-converted model files from [BSRoformer-GGUF](https://huggingface.co/chenmozhijin/BSRoformer-GGUF) ### Command Line Usage ```bash ./bs_roformer-cli [options] Options: --chunk-size Chunk size (in samples), defaults to model value --overlap Number of overlaps, defaults to model value --help, -h Show help message ``` **Parameter Description:** | Parameter | Description | |-----------|-------------| | `--chunk-size` | Number of audio samples to process at once. Larger values require more VRAM but may improve processing efficiency. Default is typically 352800 (~8 seconds @44100Hz). | | `--overlap` | Number of overlaps between chunks. Increasing this value can **improve output quality** as it helps reduce artifacts when reassembling chunks, but will increase inference time. Recommended value is 2-4. | **Examples:** ```bash # Basic usage (using model defaults) ./bs_roformer-cli model.gguf song.wav vocals.wav # Custom chunking parameters ./bs_roformer-cli model.gguf song.wav vocals.wav --chunk-size 352800 --overlap 2 # High quality mode (increase overlap to reduce artifacts) ./bs_roformer-cli model.gguf song.wav vocals.wav --overlap 4 ``` > **Note**: Input audio must be **44100 Hz**. Stereo or mono is supported (auto-expanded). --- ## πŸ”§ Building from Source ### Prerequisites - CMake >= 3.17 - C++17 compatible compiler (MSVC 2019+, GCC 9+, Clang 10+) - GGML source code (submodule or local directory) ### Getting GGML Dependency The project supports multiple ways to obtain GGML: ```bash # Option 1: Git Submodule (Recommended) git submodule add https://github.com/ggerganov/ggml.git git submodule update --init --recursive # Option 2: Sibling Directory cd .. git clone https://github.com/ggerganov/ggml.git # Option 3: Explicit Path cmake -B build -DGGML_DIR=/path/to/ggml ``` See [GGML_DEPENDENCY.md](GGML_DEPENDENCY.md) for details. ### Build Commands ```bash # CPU Build cmake -B build cmake --build build --config Release --parallel # CUDA Acceleration (Recommended) cmake -B build -DGGML_CUDA=ON cmake --build build --config Release --parallel # Enable Tests cmake -B build -DGGML_CUDA=ON -DBSR_BUILD_TESTS=ON cmake --build build --config Release --parallel ``` ### CMake Options | Option | Default | Description | |--------|---------|-------------| | `GGML_CUDA` | `ON` | Enable CUDA backend | | `BSR_BUILD_CLI` | `ON` | Build command line tool | | `BSR_BUILD_TESTS` | `OFF` | Build test suite | > **Breaking Change:** build/test prefixes were renamed from `MBR_*` to `BSR_*` with no compatibility aliases. --- ## πŸ“¦ Model Conversion If you need to convert models yourself, use `convert_to_gguf.py` to convert PyTorch weights to GGUF format. **Install Dependencies:** ```bash pip install torch numpy pyyaml librosa einops gguf ``` **Conversion Command:** ```bash python scripts/convert_to_gguf.py \ --ckpt model.ckpt \ --config config.yaml \ --out model.gguf \ --dtype q8_0 # For BS Roformer (optional, usually auto-detected) python scripts/convert_to_gguf.py ... --arch bs ``` ### Supported Quantization Types | Type | Precision | Size | Recommended Use | |------|-----------|------|-----------------| | `fp32` | Highest | 100% | Debugging/Baseline | | `fp16` | High | 50% | High precision needs | | `q8_0` | Good | 25% | **Recommended** (balance of precision and performance) | | `q5_1` | Medium | 18% | Resource constrained | | `q4_0` | Lower | 12.5% | Extreme compression | > **Note**: The conversion script currently does not support K-Quant types (Q4_K, Q5_K, etc.). This is mainly because the gguf-py library has not yet implemented K-Quant quantization (only supports reading/dequantization), and most models do not meet the requirement that dim must be divisible by 256. --- ## πŸ’» C++ API ```cpp #include #include // 1. Load audio file AudioBuffer input = AudioFile::Load("input.wav"); // 2. Initialize inference engine Inference engine("model.gguf"); // 3. Get model's recommended inference parameters int chunk_size = engine.GetDefaultChunkSize(); // e.g., 352800 int num_overlap = engine.GetDefaultNumOverlap(); // e.g., 2 // 4. Run inference (with progress callback) auto stems = engine.Process(input.data, chunk_size, num_overlap, [](float progress) { std::cout << "Progress: " << int(progress * 100) << "%" << std::endl; }); // 5. Save result AudioBuffer output{stems[0], 2, 44100, stems[0].size()}; AudioFile::Save("vocals.wav", output); ``` --- ## πŸ—οΈ Project Architecture ``` BSRoformer.cpp/ β”œβ”€β”€ include/ β”‚ └── bs_roformer/ β”‚ β”œβ”€β”€ inference.h # Inference Engine API β”‚ └── audio.h # Audio I/O API β”œβ”€β”€ src/ β”‚ β”œβ”€β”€ model.h/cpp # Model weight loading & graph building (internal) β”‚ β”œβ”€β”€ inference.cpp # Core inference logic (STFT β†’ Network β†’ ISTFT) β”‚ β”œβ”€β”€ stft.h # STFT/ISTFT implementation (Radix-2 FFT) β”‚ β”œβ”€β”€ audio.cpp # Audio read/write implementation (dr_wav) β”‚ └── utils.h/cpp # NPY loading, tensor comparison tools β”œβ”€β”€ third_party/ β”‚ └── dr_libs/dr_wav.h # dr_libs audio library β”œβ”€β”€ cli/ β”‚ └── main.cpp # Command line tool β”œβ”€β”€ scripts/ β”‚ β”œβ”€β”€ convert_to_gguf.py # PyTorch β†’ GGUF conversion tool β”‚ β”œβ”€β”€ generate_test_data.py # Test data generation script β”‚ └── generate_test_audio.py # CI test audio generation (no external files needed) β”œβ”€β”€ tests/ # Unit test suite β”œβ”€β”€ models/ # Model file directory └── CMakeLists.txt # Build configuration ``` --- ## πŸ“ Core Module Details ### 1. Model Loading (`model.h/cpp`) The `BSRoformer` class is responsible for: - **GGUF Weight Loading**: Parsing hyperparameters and tensors from file - **Buffer Generation**: `freq_indices`, `num_bands_per_freq`, etc. - **Computation Graph Building**: - `BuildBandSplitGraph()` - Band split layer - `BuildTransformersGraph()` - Time-frequency Transformer stacking - `BuildMaskEstimatorGraph()` - Mask estimator ### 2. Inference Engine (`inference.cpp`) The `Inference` class implements the complete audio processing pipeline: ``` Input Audio β†’ Chunking β†’ STFT β†’ Neural Network β†’ Mask Application β†’ ISTFT β†’ Overlap-Add β†’ Output ``` **Key Methods**: | Method | Function | |--------|----------| | `Process()` | Process complete audio (auto-chunking) | | `ProcessChunk()` | Process a single audio chunk | | `ComputeSTFT()` | Short-Time Fourier Transform | | `PostProcessAndISTFT()` | Mask application and inverse transform | **Pipeline Optimization**: ``` Chunk N: [CPU Preprocess] β†’ [GPU Inference] β†’ [CPU Postprocess] Chunk N+1: [CPU Preprocess] β†’ [GPU Inference] β†’ [CPU Postprocess] ↑ Parallel execution ``` ### 3. STFT Implementation (`stft.h`) Pure C++ implementation, numerically aligned with PyTorch `torch.stft/istft`: - **Radix-2 Cooley-Tukey FFT**: Efficient O(N log N) implementation - **Hann Window**: Periodic window function - **Center Padding**: Reflect mode padding - **OpenMP Parallelization**: Frame-level parallel acceleration ### 4. Audio I/O (`audio.h/cpp`) Lightweight audio processing based on [dr_libs](https://github.com/mackron/dr_libs): - Read: WAV file β†’ `float32` interleaved format - Write: `float32` interleaved format β†’ WAV file --- ## πŸ§ͺ Testing ### Running Tests ```bash # Set environment variables $env:BSR_MODEL_PATH = "models/model.gguf" $env:BSR_TEST_DATA_DIR = "test_data" # Run all tests ctest --test-dir build -C Release # Run specific test ctest --test-dir build -C Release -R test_inference ``` ### Test Suite | Test File | Verification Content | |-----------|---------------------| | `test_audio` | Audio read/write functionality | | `test_component_stft` | STFT/ISTFT numerical precision | | `test_component_bandsplit` | Band split layer | | `test_component_layers` | Transformer layers | | `test_component_mask` | Mask estimator | | `test_inference` | End-to-end inference | | `test_chunking_logic` | Chunking overlap-add logic | ### Generating Test Data First clone [Music-Source-Separation-Training](https://github.com/ZFTurbo/Music-Source-Separation-Training) and install its dependencies: ```bash git clone https://github.com/ZFTurbo/Music-Source-Separation-Training.git cd Music-Source-Separation-Training pip install -r requirements.txt cd .. python scripts/generate_test_data.py \ --model-repo "Music-Source-Separation-Training" \ --audio "test.wav" \ --checkpoint "model.ckpt" \ --output "test_data" ``` --- ## Acknowledgements - [ggerganov/ggml](https://github.com/ggerganov/ggml) - Efficient tensor library - [ZFTurbo/Music-Source-Separation-Training](https://github.com/ZFTurbo/Music-Source-Separation-Training) - PyTorch reference implementation - [dr_libs](https://github.com/mackron/dr_libs) - Lightweight audio library