This repository accompanies the paper:
M. Hidalgo-Araya et al., "A Probabilistic Generative Model for Spectral Speech Enhancement", 2025.
A comprehensive evaluation framework for virtual hearing aids using the VOICEBANK_DEMAND dataset with warped filter bank (WFB) preprocessing.
This repository provides the complete implementation and evaluation framework for the spectral speech enhancement model presented in the paper. It includes:
- Implementation: Full codebase for the Warped-Frequency Filter Bank (WFB) front-end and Speech Enhancement Model (SEM) backend
- Evaluation Pipeline: Automated evaluation on the VOICEBANK_DEMAND dataset with comprehensive metrics (PESQ, DNSMOS)
- Reproducibility: All configurations and scripts needed to reproduce the results reported in the paper
- Benchmark Comparisons: Automated generation of comparison tables
This repository provides a complete pipeline for:
- Dataset Preparation: Download, resample, and preprocess VOICEBANK_DEMAND dataset
- WFB Preprocessing: Create warped filter bank processed dataset for consistent evaluation
- Evaluation: Run evaluations for baseline and hearing aid algorithms using `run_evaluation.jl`
- Results Analysis: Generate summary tables and metrics organized by SNR and environment
- Benchmark Results: Automatically generate and update benchmark comparison tables in the README
The latest benchmark results comparing different hearing aid algorithms are automatically generated and displayed in the Benchmark Results section below. To update these results with the latest evaluation runs, simply run:
```bash
julia scripts/update_readme_benchmark.jl
```

This script automatically:
- Finds the latest runs for each hearing aid (excluding Baseline_clean)
- Generates comprehensive comparison tables for:
- Overall summary across all metrics
- Performance by SNR level (2.5, 7.5, 12.5, 17.5 dB)
- Performance by environment and SNR (bus, cafe, living, office, psquare)
- Updates the README with the latest results and configuration details
- Julia 1.11+: Required for all functionality
- Python 3.7+: Required for metrics evaluation (PESQ, DNSMOS)
- Git: For cloning and submodule management
- Clone the repository with submodules:
```bash
git clone --recursive <repository-url>
cd Spectral_Subtraction
```

- Install Julia dependencies:

```julia
using Pkg
Pkg.activate(".")
Pkg.instantiate()
```

- Install Python dependencies for metrics:

```bash
cd dependencies/HADatasets
python install_python_deps.py
cd ../..
```

Download the VOICEBANK_DEMAND dataset from the official source:
- Visit the official dataset page: https://datashare.ed.ac.uk/handle/10283/2791
- Download the dataset files
- Extract and place them in the following structure:
```
databases/VOICEBANK_DEMAND/
├── data/
│   ├── clean_testset_wav/    # Clean audio files
│   └── noisy_testset_wav/    # Noisy audio files
├── logfiles/
│   └── log_testset.txt       # SNR information
└── testset_txt/              # Text transcriptions
```
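As a quick sanity check before resampling, the expected layout can be verified with a few lines of Julia (paths taken from the tree above; this snippet is only a convenience, not part of the pipeline):

```julia
# Verify the expected VOICEBANK_DEMAND layout (stdlib only).
root = "databases/VOICEBANK_DEMAND"
for p in ("data/clean_testset_wav", "data/noisy_testset_wav",
          "logfiles/log_testset.txt", "testset_txt")
    ispath(joinpath(root, p)) || @warn "Missing expected path: $(joinpath(root, p))"
end
```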
Resample the VOICEBANK_DEMAND dataset to 16 kHz using Julia:
```julia
using HADatasets

# Create dataset instance pointing to the database directory
dataset = HADatasets.VOICEBANKDEMANDDataset("databases/VOICEBANK_DEMAND")

# Resample with default settings (16 kHz, 1.0 s minimum duration)
HADatasets.resample_data(dataset)
```

This creates:
```
databases/VOICEBANK_DEMAND_resampled/
├── clean_testset_wav/              # Resampled clean files
├── noisy_testset_wav/              # Resampled noisy files
└── logfiles/
    └── log_testset_resampled.txt   # Updated log file
```
Note: The resampled dataset preserves the same directory structure as the original, with all audio files resampled to 16 kHz.
Why WFB preprocessing is needed:
The hearing aid processing pipeline uses a Warped Filter Bank (WFB) that warps the frequency domain of the audio. Since PESQ is sensitive to changes in the data or missing samples, we need to ensure consistent preprocessing for fair evaluation.
The WFB preprocessing:
- Processes all audio through the BaselineHearingAid (which has unity gains, so the audio is unaltered except for the WFB warping)
- Creates a preprocessed dataset where all files have been through the same WFB pipeline
- Ensures that when we evaluate hearing aids, we compare against a consistent WFB-processed clean reference
Create the WFB dataset:
```bash
julia scripts/convert_to_wfb.jl
```

Or test with a limited number of samples first:

```bash
julia scripts/convert_to_wfb.jl --num-samples=10
```

This script:
- Loads the BaselineHearingAid configuration
- Processes all clean and noisy files from `VOICEBANK_DEMAND_resampled` through the WFB
- Creates `VOICEBANK_DEMAND_resampled_wfb/` with the same directory structure:

```
databases/VOICEBANK_DEMAND_resampled_wfb/
├── clean_testset_wav/    # WFB-processed clean files
├── noisy_testset_wav/    # WFB-processed noisy files
└── logfiles/             # Copied logfiles
```
Note: If the WFB dataset already exists, the script will detect it and skip processing with the following messages:
```
[Info: WFB dataset already exists and appears to be processed
[Info: Skipping conversion - dataset already processed
```
All evaluations, including baselines and hearing aid algorithms, are run using the run_evaluation.jl script:
Before evaluating hearing aids, establish baseline scores for comparison:
Baseline Best (Clean vs Clean) - Upper bound performance:
```bash
julia scripts/run_evaluation.jl configurations/baseline_clean/baseline_clean.toml
```

Baseline Unprocessed (Clean vs Noisy) - Lower bound performance:

```bash
julia scripts/run_evaluation.jl configurations/baseline_noise/baseline_noise.toml
```

Evaluate each hearing aid algorithm on the WFB-processed dataset:
```bash
# Evaluate SEM Hearing Aid
julia scripts/run_evaluation.jl configurations/SEMHearingAid/SEMHearingAid.toml

# Test with a single file first
julia scripts/run_evaluation.jl configurations/SEMHearingAid/SEMHearingAid.toml --single-file p257_001.wav

# Limit number of samples for testing
julia scripts/run_evaluation.jl configurations/SEMHearingAid/SEMHearingAid.toml --num-samples 50

# Custom checkpoint interval (save every N files)
julia scripts/run_evaluation.jl configurations/SEMHearingAid/SEMHearingAid.toml --checkpoint-interval 20

# Save processed output audio files
julia scripts/run_evaluation.jl configurations/SEMHearingAid/SEMHearingAid.toml --save-output
```

Results are organized in timestamped directories:
```
results/VOICEBANK_DEMAND/
├── BaselineHearingAid/
│   ├── run_<timestamp>/
│   │   ├── BaselineHearingAid.toml
│   │   └── table/
│   │       ├── results.csv                     # Complete results for all files
│   │       ├── overall_summary.csv             # Overall average scores
│   │       ├── summary_by_snr.csv              # Average scores by SNR level
│   │       ├── summary_by_environment_snr.csv  # Average scores by environment and SNR
│   │       └── checkpoint_*.csv                # Optional checkpoint files (if --checkpoint-interval used)
│   └── run_<timestamp>/
│       └── ...
└── SEMHearingAid/
    └── run_<timestamp>/
        └── ...
```
Each evaluation computes the following metrics:
- PESQ (Perceptual Evaluation of Speech Quality): 1-5 scale, higher is better
- SIG (Signal Quality from DNSMOS): 1-5 scale, higher is better
- BAK (Background Quality from DNSMOS): 1-5 scale, higher is better
- OVRL (Overall Quality from DNSMOS): 1-5 scale, higher is better
The evaluation automatically generates:
- `overall_summary.csv`: Overall average scores across all conditions
- `summary_by_snr.csv`: Average scores for each SNR level (2.5, 7.5, 12.5, 17.5 dB)
- `summary_by_environment_snr.csv`: Average scores per environment per SNR level
- `results.csv`: Complete results for all individual files
- Automatic checkpoints: Saved every N files (default: 10, configurable); checkpoint files are created when using the `--checkpoint-interval` option
- Resume capability: If evaluation is interrupted, checkpoints can be merged manually
- Final results: All results are saved to `results.csv` in the table directory
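One possible way to merge checkpoints by hand after an interrupted run is sketched below. It assumes CSV.jl and DataFrames.jl are available and that checkpoint files share the same columns as `results.csv`; the run path and output file name are placeholders.

```julia
using CSV, DataFrames   # assumed to be available in the project environment

# Merge checkpoint CSVs from an interrupted run into a single table.
table_dir = "results/VOICEBANK_DEMAND/SEMHearingAid/run_<timestamp>/table"  # placeholder path

checkpoints = filter(f -> startswith(f, "checkpoint_") && endswith(f, ".csv"),
                     readdir(table_dir))
isempty(checkpoints) && error("No checkpoint files found in $table_dir")

merged = reduce(vcat, (CSV.read(joinpath(table_dir, f), DataFrame) for f in checkpoints))
unique!(merged)   # drop rows duplicated across overlapping checkpoints
CSV.write(joinpath(table_dir, "results_merged.csv"), merged)
```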
After running evaluations for multiple hearing aids, you can automatically generate and update benchmark comparison tables in the README:
```bash
julia scripts/update_readme_benchmark.jl
```

This script:
- Finds the latest runs for each hearing aid (excluding Baseline_clean)
- Generates comprehensive comparison tables showing:
- Overall summary across all metrics (PESQ, SIG, BAK, OVRL)
- Performance breakdown by SNR level (2.5, 7.5, 12.5, 17.5 dB)
- Performance breakdown by environment and SNR (bus, cafe, living, office, psquare)
- Updates the README with the latest results and configuration details
The benchmark results are displayed in the Benchmark Results section below.
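For orientation, the "latest run" lookup can be approximated with plain Julia as sketched below. It relies only on the directory layout documented above (device directories containing `run_<timestamp>` folders) and may differ from the actual script in details such as the excluded baseline name.

```julia
# Locate the most recent run_<timestamp> directory for each device.
results_root = "results/VOICEBANK_DEMAND"

latest_runs = Dict{String,String}()
for device in readdir(results_root)
    device == "Baseline_clean" && continue           # excluded from benchmarks
    device_dir = joinpath(results_root, device)
    isdir(device_dir) || continue
    runs = filter(startswith("run_"), readdir(device_dir))
    isempty(runs) && continue
    latest_runs[device] = joinpath(device_dir, sort(runs)[end])  # timestamps sort lexicographically
end

for device in sort(collect(keys(latest_runs)))
    println(device, " => ", joinpath(latest_runs[device], "table", "overall_summary.csv"))
end
```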
This repository uses comprehensive speech quality assessment metrics to evaluate hearing aid algorithms. All metrics are computed using the HADatasets module, which provides standardized implementations of ITU-T and IEEE/ACM standards.
- Type: Intrusive (requires reference signal)
- Scale: 1-5 (higher is better)
- Standard: ITU-T P.862.2
- Use Case: Overall speech quality assessment
- Description: PESQ is a perceptual metric that predicts the subjective quality of speech as perceived by human listeners. It compares the processed/enhanced audio to the clean reference signal and provides a score that correlates with Mean Opinion Score (MOS) ratings.
Important Note: PESQ is sensitive to changes in the data or missing samples. This is why the evaluation pipeline uses WFB-processed clean audio as the reference, ensuring that both the processed output and reference have undergone the same WFB preprocessing for fair comparison.
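For a stand-alone spot check outside the pipeline, PESQ can be computed directly via PyCall and the Python `pesq` package listed in the dependencies. The snippet below is a hedged sketch: it assumes WAV.jl is available, the file paths are placeholders, and both signals are 16 kHz mono WAVs that have been through the same WFB preprocessing.

```julia
using WAV, PyCall   # assumes PyCall, WAV.jl, and the Python `pesq` package are installed

pypesq = pyimport("pesq")

# Placeholder paths: a WFB-processed clean reference and a processed output file.
ref, fs = wavread("databases/VOICEBANK_DEMAND_resampled_wfb/clean_testset_wav/p257_001.wav")
deg, _  = wavread("path/to/processed/p257_001.wav")

# Wide-band PESQ (ITU-T P.862.2); both signals must share the same sample rate.
score = pypesq.pesq(Int(fs), vec(ref), vec(deg), "wb")
println("PESQ (wb): ", score)
```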
- Type: Non-intrusive (no reference required)
- Scale: 1-5 (higher is better)
- Standard: Microsoft DNS Challenge P.835
- Use Case: Noise suppression quality assessment
- Description: DNSMOS is a deep learning-based metric that predicts subjective quality scores without requiring a clean reference signal. It follows the ITU-T P.835 subjective test framework to measure three key quality dimensions.
P.835 Dimensions:

- OVRL (Overall Quality): Overall audio quality assessment
  - Measures the overall perceived quality of the processed audio
  - Combines both speech and background noise quality perceptions
- SIG (Signal Quality): Speech quality assessment
  - Focuses specifically on the quality of the speech signal
  - Measures how natural and clear the speech sounds
- BAK (Background Quality): Background noise quality assessment
  - Evaluates the quality of the background/noise component
  - Measures how well noise is suppressed while preserving speech
The combination of PESQ and DNSMOS provides a comprehensive evaluation:
- PESQ provides an intrusive reference-based assessment, giving a direct comparison to the clean signal
- DNSMOS provides a non-intrusive assessment that doesn't require a reference, making it useful for real-world scenarios where clean references may not be available
- The three DNSMOS dimensions (OVRL, SIG, BAK) provide detailed insights into different aspects of speech enhancement performance
This evaluation framework adopts the ITU-T P.835 subjective test framework to measure speech enhancement quality across multiple dimensions, enabling comprehensive assessment of hearing aid algorithms for monaural speech enhancement tasks.
```
Spectral_Subtraction/
├── databases/
│   ├── VOICEBANK_DEMAND/                # Original dataset (downloaded)
│   ├── VOICEBANK_DEMAND_resampled/      # Resampled dataset (16 kHz)
│   └── VOICEBANK_DEMAND_resampled_wfb/  # WFB-processed dataset
├── configurations/
│   ├── BaselineHearingAid/
│   └── SEMHearingAid/
├── results/
│   └── VOICEBANK_DEMAND/                # Evaluation results
├── scripts/
│   ├── convert_to_wfb.jl                # WFB conversion script
│   ├── run_evaluation.jl                # Evaluation script
│   └── update_readme_benchmark.jl       # Benchmark results update script
├── src/
│   └── Experiments.jl                   # Main evaluation module
└── dependencies/
    ├── HADatasets/                      # Dataset and metrics module
    └── VirtualHearingAid/               # Hearing aid processing module
```
The Speech Enhancement Model (SEM) backend follows the model introduced in the paper: a probabilistic generative model used for Bayesian inference of speech and noise characteristics, enabling adaptive spectral enhancement.
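As intuition only (a generic Gaussian formulation, not necessarily the exact inference scheme used in the paper): if each warped-frequency band is modeled as a noisy observation $X_k = S_k + N_k$ with independent zero-mean Gaussian speech and noise components, the posterior mean of the speech coefficient reduces to a Wiener-style spectral gain,

$$
\hat{S}_k = \mathbb{E}[S_k \mid X_k] = \frac{\sigma_{S,k}^2}{\sigma_{S,k}^2 + \sigma_{N,k}^2}\, X_k = \frac{\xi_k}{1 + \xi_k}\, X_k, \qquad \xi_k = \frac{\sigma_{S,k}^2}{\sigma_{N,k}^2},
$$

so inferring the per-band speech and noise variances directly yields an adaptive enhancement gain.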
The WFB front-end provides perceptually aligned frequency warping for consistent evaluation: the input signal passes through a cascade of first-order all-pass filters, producing warped delay-line signals. A time-domain FIR structure with weights generates the output, while the warped signals are provided to the Speech Enhancement Model for inference and synthesis.
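The warped delay line itself is simple to sketch. Below is an illustrative implementation (not the repository's) of the cascade of first-order all-pass sections; the warping coefficient λ ≈ 0.576 is an assumed value approximating the Bark scale at 16 kHz, and the actual WFB parameters come from the TOML configurations.

```julia
# One first-order all-pass section: y[n] = -λ·x[n] + x[n-1] + λ·y[n-1]
function allpass_section(x::AbstractVector{<:Real}, λ::Real)
    y = similar(x, Float64)
    xprev = 0.0
    yprev = 0.0
    for n in eachindex(x)
        y[n] = -λ * x[n] + xprev + λ * yprev
        xprev = x[n]
        yprev = y[n]
    end
    return y
end

# Warped delay line: d_0 = input, d_k = allpass(d_{k-1}). The FIR output is a
# weighted sum of these taps, and the taps feed the SEM for inference.
function warped_delay_line(x::AbstractVector{<:Real}, K::Integer; λ::Real = 0.576)
    taps = Vector{Vector{Float64}}(undef, K + 1)
    taps[1] = Float64.(x)
    for k in 2:K+1
        taps[k] = allpass_section(taps[k-1], λ)
    end
    return taps
end
```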
- Input: WFB-processed noisy audio (`VOICEBANK_DEMAND_resampled_wfb/noisy_testset_wav/`)
- Processing: Pass through hearing aid algorithm
- Reference: WFB-processed clean audio (`VOICEBANK_DEMAND_resampled_wfb/clean_testset_wav/`)
- Metrics: Compare processed output to WFB-processed clean reference
- BaselineHearingAid: Unity gain processing (no noise reduction, WFB only)
- SEMHearingAid: Speech Enhancement Model (Bayesian inference)
To reproduce the results reported in the paper:
- Prepare the `VOICEBANK_DEMAND_resampled_wfb` dataset by following Steps 1 and 2 in this README.
- Run the hearing aid configurations:

```bash
julia scripts/run_evaluation.jl configurations/SEMHearingAid/SEMHearingAid.toml
```

- Update the README tables:

```bash
julia scripts/update_readme_benchmark.jl
```

- The results used in the paper correspond to the runs in `results/VOICEBANK_DEMAND/<Device>/run_<timestamp>/`.
This reproduces the tables in the paper's results section.
To add a new hearing aid algorithm:
- Implement the backend in `dependencies/VirtualHearingAid` (create a new `<Name>Backend` type).
- Create a configuration file in `configurations/<NewHearingAid>/<NewHearingAid>.toml` (see the sketch below):
  - `[parameters.hearingaid]` with `type = "<NewHearingAid>"`
  - `[parameters.frontend]` with the WFB parameters (nbands, fs, etc.)
  - `[parameters.backend.*]` for algorithm-specific parameters
- Run the evaluation:

```bash
julia scripts/run_evaluation.jl configurations/<NewHearingAid>/<NewHearingAid>.toml
```

- Update the benchmark tables:

```bash
julia scripts/update_readme_benchmark.jl
```
See existing configurations in configurations/ for examples of the TOML structure.
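A minimal sketch of the configuration layout described in step 2 is shown below, parsed with Julia's built-in TOML standard library. The section names follow the list above, but every parameter name and value here is a placeholder assumption; copy a real file from `configurations/` as your starting point.

```julia
using TOML

# Hypothetical configuration mirroring the documented key layout.
cfg_text = """
[parameters.hearingaid]
type = "MyNewHearingAid"      # placeholder type name

[parameters.frontend]
nbands = 32                   # assumed WFB parameter values
fs = 16000

[parameters.backend.mynewhearingaid]
alpha = 0.9                   # algorithm-specific placeholder
"""

cfg = TOML.parse(cfg_text)
@assert haskey(cfg["parameters"], "hearingaid")
println("Hearing aid type: ", cfg["parameters"]["hearingaid"]["type"])
```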
- Tested on: macOS / Linux, Julia 1.11+, Python 3.7+
- GPU: Not required. All models are CPU-friendly
- Storage: ~2 GB for the resampled dataset, ~4 GB for the WFB-processed dataset
- Missing files: Ensure the dataset is downloaded and extracted correctly
- Resampling errors: Check that audio files are valid WAV files
- WFB conversion fails: Verify BaselineHearingAid configuration exists
- Memory errors: Use `--num-samples` to process in smaller batches
- Checkpoint errors: Manually merge existing checkpoints if needed
- Metrics errors: Ensure Python dependencies are installed (see HADatasets README)
The metrics evaluation functionality relies on Python integration and the following optional dependencies:
- PyCall: Python integration (for full metrics functionality)
- pesq: Python PESQ implementation (MIT License)
- dnsmos_wrapper: Custom wrapper for Microsoft DNSMOS (Creative Commons Attribution 4.0 International)
These dependencies are automatically installed when running the Python installation script:
```bash
cd dependencies/HADatasets
python install_python_deps.py
```

DNSMOS is licensed under Creative Commons Attribution 4.0 International:
- Attribution Required: Must give appropriate credit to Microsoft
- Commercial Use: Allowed
- Modification: Allowed
- Distribution: Allowed
```bibtex
@inproceedings{reddy2022dnsmos,
  title        = {DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors},
  author       = {Reddy, Chandan KA and Gopal, Vishak and Cutler, Ross},
  booktitle    = {ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year         = {2022},
  organization = {IEEE}
}

@inproceedings{dubey2023icassp,
  title        = {ICASSP 2023 Deep Noise Suppression Challenge},
  author       = {Dubey, Harishchandra and Aazami, Ashkan and Gopal, Vishak and Naderi, Babak and Braun, Sebastian and Cutler, Ross and Gamper, Hannes and Golestaneh, Mehrsa and Aichner, Robert},
  booktitle    = {ICASSP},
  year         = {2023}
}

@misc{Valentini-Botinhao2017NoisySpeech,
  author       = {Valentini-Botinhao, Cassia},
  title        = {Noisy speech database for training speech enhancement algorithms and TTS models},
  year         = {2017},
  howpublished = {Edinburgh DataShare},
  doi          = {10.7488/ds/2117},
  url          = {https://doi.org/10.7488/ds/2117}
}
```

- ICASSP 2023 Deep Noise Suppression Challenge: Official challenge website and resources
- DNSMOS Implementation: Microsoft's DNS Challenge repository with DNSMOS implementation
- VoiceBank+Demand Dataset: Official dataset download page

