High-performance C++ inference implementation for the BS Roformer and Mel-Band-Roformer audio source separation models.
This project is a pure C++ inference engine for the BS Roformer and Mel-Band-Roformer audio source separation models, built on the GGML tensor library. It is primarily used for extracting vocals or accompaniment from music.
./bs_roformer-cli <model.gguf> <input.wav> <output.wav> [options]
Options:
--chunk-size <N> Chunk size (in samples), defaults to model value
--overlap <N> Number of overlaps, defaults to model value
--help, -h Show help message
Parameter Description:
| Parameter | Description |
|---|---|
| `--chunk-size` | Number of audio samples to process at once. Larger values require more VRAM but may improve processing efficiency. Default is typically 352800 (~8 seconds @ 44100 Hz). |
| `--overlap` | Number of overlaps between chunks. Increasing this value can improve output quality because it helps reduce artifacts when reassembling chunks, but it increases inference time. Recommended value is 2-4. |
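
To make the trade-off between the two options concrete, here is a minimal sketch of the arithmetic. It assumes that consecutive chunks start every `chunk_size / overlap` samples, which is a common convention in Roformer inference scripts but is not confirmed by this README. `ChunkPlan` and `PlanChunks` are hypothetical names, not part of the project's API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of the bs_roformer API.
struct ChunkPlan {
    double chunk_seconds;    // duration of one chunk
    std::size_t step;        // samples between consecutive chunk starts (assumed)
    std::size_t num_chunks;  // chunk start positions needed to cover the input
};

// Assumption: chunks start every chunk_size / overlap samples.
ChunkPlan PlanChunks(std::size_t total_samples, std::size_t chunk_size,
                     std::size_t overlap, std::size_t sample_rate = 44100) {
    assert(chunk_size > 0 && overlap > 0 && overlap <= chunk_size);
    ChunkPlan plan{};
    plan.chunk_seconds = static_cast<double>(chunk_size) / sample_rate;
    plan.step = chunk_size / overlap;
    // Ceiling division over the step size.
    plan.num_chunks = (total_samples + plan.step - 1) / plan.step;
    return plan;
}
```

Under this assumption, a 60-second file with the default chunk size needs 15 chunks at `--overlap 2` and 30 chunks at `--overlap 4`, which is why a higher overlap increases inference time roughly in proportion.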
Examples:
# Basic usage (using model defaults)
./bs_roformer-cli model.gguf song.wav vocals.wav
# Custom chunking parameters
./bs_roformer-cli model.gguf song.wav vocals.wav --chunk-size 352800 --overlap 2
# High quality mode (increase overlap to reduce artifacts)
./bs_roformer-cli model.gguf song.wav vocals.wav --overlap 4
Note: Input audio must be 44100 Hz. Stereo or mono is supported (auto-expanded).
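
As an illustration of what "auto-expanded" could mean for mono input, the sketch below duplicates each mono sample into an interleaved left/right pair. This is an assumption about the behaviour, not the project's actual code, and `ToInterleavedStereo` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: return interleaved stereo (L, R, L, R, ...) samples.
// Mono input is expanded by copying each sample to both channels.
std::vector<float> ToInterleavedStereo(const std::vector<float>& samples, int channels) {
    assert(channels == 1 || channels == 2);
    if (channels == 2) return samples;  // already interleaved L/R
    std::vector<float> stereo(samples.size() * 2);
    for (std::size_t i = 0; i < samples.size(); ++i) {
        stereo[2 * i] = samples[i];      // left
        stereo[2 * i + 1] = samples[i];  // right
    }
    return stereo;
}
```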
The project supports multiple ways to obtain GGML:
# Option 1: Git Submodule (Recommended)
git submodule add https://github.com/ggerganov/ggml.git
git submodule update --init --recursive
# Option 2: Sibling Directory
cd ..
git clone https://github.com/ggerganov/ggml.git
# Option 3: Explicit Path
cmake -B build -DGGML_DIR=/path/to/ggml
See GGML_DEPENDENCY.md for details.
# CPU Build
cmake -B build
cmake --build build --config Release --parallel
# CUDA Acceleration (Recommended)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release --parallel
# Enable Tests
cmake -B build -DGGML_CUDA=ON -DBSR_BUILD_TESTS=ON
cmake --build build --config Release --parallel
| Option | Default | Description |
|---|---|---|
| `GGML_CUDA` | ON | Enable CUDA backend |
| `BSR_BUILD_CLI` | ON | Build command line tool |
| `BSR_BUILD_TESTS` | OFF | Build test suite |
Breaking Change: build/test prefixes were renamed from `MBR_*` to `BSR_*` with no compatibility aliases.
If you need to convert models yourself, use convert_to_gguf.py to convert PyTorch weights to GGUF format.
Install Dependencies:
pip install torch numpy pyyaml librosa einops gguf
Conversion Command:
python scripts/convert_to_gguf.py \
--ckpt model.ckpt \
--config config.yaml \
--out model.gguf \
--dtype q8_0
# For BS Roformer (optional, usually auto-detected)
python scripts/convert_to_gguf.py ... --arch bs
| Type | Precision | Size | Recommended Use |
|---|---|---|---|
| `fp32` | Highest | 100% | Debugging/Baseline |
| `fp16` | High | 50% | High precision needs |
| `q8_0` | Good | 25% | Recommended (balance of precision and performance) |
| `q5_1` | Medium | 18% | Resource constrained |
| `q4_0` | Lower | 12.5% | Extreme compression |
Note: The conversion script currently does not support K-Quant types (Q4_K, Q5_K, etc.). This is mainly because the gguf-py library has not yet implemented K-Quant quantization (only supports reading/dequantization), and most models do not meet the requirement that dim must be divisible by 256.
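
The sizes in the table are rounded. GGML stores these formats in fixed-size blocks that also carry a scale (and, for `q5_1`, a minimum), so the real cost per weight is slightly higher. The sketch below works this out from GGML's block layouts and shows the divisibility check behind the K-Quant limitation. The struct and function names are hypothetical, and 384 is only an example dimension.

```cpp
#include <cassert>
#include <cstddef>

// Block layout of a GGML quantization format.
struct QuantFormat {
    const char* name;
    std::size_t block_elems;  // weights per block
    std::size_t block_bytes;  // bytes per block, including scale/min fields
};

constexpr QuantFormat kQ8_0{"q8_0", 32, 34};  // fp16 scale + 32 int8 values
constexpr QuantFormat kQ5_1{"q5_1", 32, 24};  // fp16 scale + fp16 min + 4 + 16 bytes
constexpr QuantFormat kQ4_0{"q4_0", 32, 18};  // fp16 scale + 16 bytes of nibbles
constexpr std::size_t kKQuantSuperBlock = 256;  // K-Quant super-block size

// Effective storage cost of one weight, in bits.
double BitsPerWeight(const QuantFormat& f) {
    return 8.0 * static_cast<double>(f.block_bytes) / static_cast<double>(f.block_elems);
}

// A tensor row can only be quantized if it is a whole number of blocks long.
bool RowIsQuantizable(std::size_t row_len, std::size_t block_elems) {
    assert(block_elems > 0);
    return row_len % block_elems == 0;
}
```

For example, `q8_0` costs 8.5 bits per weight (about 26.6% of fp32). A row of 384 elements fits the 32-element formats but not a 256-element K-Quant super-block.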
#include <iostream>
#include <bs_roformer/inference.h>
#include <bs_roformer/audio.h>
// 1. Load audio file
AudioBuffer input = AudioFile::Load("input.wav");
// 2. Initialize inference engine
Inference engine("model.gguf");
// 3. Get model's recommended inference parameters
int chunk_size = engine.GetDefaultChunkSize(); // e.g., 352800
int num_overlap = engine.GetDefaultNumOverlap(); // e.g., 2
// 4. Run inference (with progress callback)
auto stems = engine.Process(input.data, chunk_size, num_overlap,
    [](float progress) {
        std::cout << "Progress: " << int(progress * 100) << "%" << std::endl;
    });
// 5. Save result (first stem, stereo, 44100 Hz)
AudioBuffer output{stems[0], 2, 44100, stems[0].size()};
AudioFile::Save("vocals.wav", output);
BSRoformer.cpp/
├── include/
│   └── bs_roformer/
│       ├── inference.h            # Inference Engine API
│       └── audio.h                # Audio I/O API
├── src/
│   ├── model.h/cpp                # Model weight loading & graph building (internal)
│   ├── inference.cpp              # Core inference logic (STFT → Network → ISTFT)
│   ├── stft.h                     # STFT/ISTFT implementation (Radix-2 FFT)
│   ├── audio.cpp                  # Audio read/write implementation (dr_wav)
│   └── utils.h/cpp                # NPY loading, tensor comparison tools
├── third_party/
│   └── dr_libs/dr_wav.h           # dr_libs audio library
├── cli/
│   └── main.cpp                   # Command line tool
├── scripts/
│   ├── convert_to_gguf.py         # PyTorch → GGUF conversion tool
│   ├── generate_test_data.py      # Test data generation script
│   └── generate_test_audio.py     # CI test audio generation (no external files needed)
├── tests/                         # Unit test suite
├── models/                        # Model file directory
└── CMakeLists.txt                 # Build configuration
`model.h/cpp`

The `BSRoformer` class is responsible for:
- `freq_indices`, `num_bands_per_freq`, etc.
- `BuildBandSplitGraph()` - Band split layer
- `BuildTransformersGraph()` - Time-frequency Transformer stacking
- `BuildMaskEstimatorGraph()` - Mask estimator

`inference.cpp`

The `Inference` class implements the complete audio processing pipeline:
Input Audio → Chunking → STFT → Neural Network → Mask Application → ISTFT → Overlap-Add → Output
Key Methods:
| Method | Function |
|---|---|
| `Process()` | Process complete audio (auto-chunking) |
| `ProcessChunk()` | Process a single audio chunk |
| `ComputeSTFT()` | Short-Time Fourier Transform |
| `PostProcessAndISTFT()` | Mask application and inverse transform |
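
The final overlap-add stage of the pipeline can be sketched as a weighted sum: each processed chunk is multiplied by a window, accumulated at its offset, and the result is divided by the accumulated window weight. This is a generic, mono-only illustration under the assumption of a fixed step between chunks. It is not taken from `inference.cpp`, and `OverlapAdd` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of weighted overlap-add for a mono signal.
// chunks[c] holds the processed output for samples starting at c * step.
std::vector<float> OverlapAdd(const std::vector<std::vector<float>>& chunks,
                              const std::vector<float>& window,
                              std::size_t step, std::size_t total_len) {
    assert(step > 0);
    std::vector<float> out(total_len, 0.0f);
    std::vector<float> weight(total_len, 0.0f);
    for (std::size_t c = 0; c < chunks.size(); ++c) {
        assert(chunks[c].size() == window.size());
        const std::size_t start = c * step;
        for (std::size_t i = 0; i < window.size() && start + i < total_len; ++i) {
            out[start + i] += chunks[c][i] * window[i];
            weight[start + i] += window[i];
        }
    }
    // Normalize so overlapping regions are averaged rather than summed.
    for (std::size_t i = 0; i < total_len; ++i) {
        if (weight[i] > 0.0f) out[i] /= weight[i];
    }
    return out;
}
```

With a rectangular window, samples covered by two chunks become the plain average of both outputs. A tapered window would instead fade each chunk in and out, which is the usual way to suppress seams at chunk boundaries.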
Pipeline Optimization:
Chunk N: [CPU Preprocess] → [GPU Inference] → [CPU Postprocess]
Chunk N+1: [CPU Preprocess] → [GPU Inference] → [CPU Postprocess]
↑ Parallel execution
`stft.h`

Pure C++ implementation, numerically aligned with PyTorch `torch.stft`/`torch.istft`.
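
The source tree describes `stft.h` as a Radix-2 FFT implementation. For readers unfamiliar with the technique, here is a minimal sketch of an in-place iterative radix-2 Cooley-Tukey FFT. It is a textbook version written for illustration, not the code in `stft.h`, and it uses the forward sign convention `exp(-2πi·k/N)`, which matches `torch.stft`.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// In-place iterative radix-2 FFT. The length must be a power of two.
void Fft(std::vector<std::complex<double>>& a) {
    const std::size_t n = a.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    // Bit-reversal permutation puts inputs in the order the butterflies expect.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    // Butterfly passes over sub-transforms of length 2, 4, ..., n.
    const double pi = std::acos(-1.0);
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const std::complex<double> wlen =
            std::polar(1.0, -2.0 * pi / static_cast<double>(len));
        for (std::size_t i = 0; i < n; i += len) {
            std::complex<double> w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const std::complex<double> u = a[i + k];
                const std::complex<double> v = a[i + k + len / 2] * w;
                a[i + k] = u + v;
                a[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}
```

An STFT applies such a transform to each windowed frame of the signal. Frame lengths that are not powers of two need zero-padding or a different FFT algorithm.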
`audio.h/cpp`

Lightweight audio processing based on dr_libs:
- Read: WAV file → `float32` interleaved format
- Write: `float32` interleaved format → WAV file

# Set environment variables
$env:BSR_MODEL_PATH = "models/model.gguf"
$env:BSR_TEST_DATA_DIR = "test_data"
# Run all tests
ctest --test-dir build -C Release
# Run specific test
ctest --test-dir build -C Release -R test_inference
| Test File | Verification Content |
|---|---|
| `test_audio` | Audio read/write functionality |
| `test_component_stft` | STFT/ISTFT numerical precision |
| `test_component_bandsplit` | Band split layer |
| `test_component_layers` | Transformer layers |
| `test_component_mask` | Mask estimator |
| `test_inference` | End-to-end inference |
| `test_chunking_logic` | Chunking overlap-add logic |
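
Numerical precision tests of this kind typically compare the C++ output against a reference buffer and check the largest deviation. The sketch below shows one such metric. `MaxAbsError` is a hypothetical name, and neither it nor any particular tolerance is taken from the project's test suite.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest element-wise absolute difference
// between a computed buffer and its reference.
float MaxAbsError(const std::vector<float>& actual,
                  const std::vector<float>& expected) {
    assert(actual.size() == expected.size());
    float max_err = 0.0f;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        max_err = std::max(max_err, std::fabs(actual[i] - expected[i]));
    }
    return max_err;
}
```

A test would then assert that the returned value stays below a chosen threshold. Quantized models generally need a looser threshold than `fp32`.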
First clone Music-Source-Separation-Training and install its dependencies:
git clone https://github.com/ZFTurbo/Music-Source-Separation-Training.git
cd Music-Source-Separation-Training
pip install -r requirements.txt
cd ..
python scripts/generate_test_data.py \
--model-repo "Music-Source-Separation-Training" \
--audio "test.wav" \
--checkpoint "model.ckpt" \
--output "test_data"