PyTorch implementation of DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors.
We recommend creating an anaconda environment:

```bash
conda create -n DiaPer python=3.7
conda activate DiaPer
```

Clone the repository:

```bash
git clone https://github.com/BUTSpeechFIT/DiaPer.git
```

Install the packages:

```bash
conda install pip
pip install git+https://github.com/fnlandini/transformers
conda install numpy
conda install -c conda-forge tensorboard
pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install safe_gpu
pip install yamlargparse==1.31.1
pip install scikit-learn==1.0.2
pip install decorator==5.1.1
pip install librosa==0.9.1
pip install setuptools==59.5.0
pip install h5py==3.8.0
pip install matplotlib==3.5.3
```

Other versions might work, but these were the settings used for this work.
Run the example:

```bash
./run_example.sh
```

If it works, you should be set.
To run the training, you can call:

```bash
python diaper/train.py -c examples/train.yaml
```

Note that in the example you need to define the train and validation data directories as well as the output directory. The rest of the parameters are the standard ones used in our publication. For adaptation or fine-tuning, the process is similar:

```bash
python diaper/train.py -c examples/finetune_adaptedmorespeakers.yaml
```

In that case, you will need to provide the path to the trained model that you want to adapt or fine-tune.
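As an orientation, the directories to fill in look roughly like the following sketch. The key names here are purely illustrative placeholders, not the actual ones; check the shipped `examples/train.yaml` for the real keys and format:

```yaml
# Illustrative sketch only -- key names and layout are hypothetical,
# see examples/train.yaml for the actual configuration file.
train-data-dir: /path/to/train_data_dir
valid-data-dir: /path/to/valid_data_dir
output-path: /path/to/output_dir
```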
To run the inference, you can call:

```bash
python diaper/infer.py -c examples/infer.yaml
```

Note that in the example you need to define the data, model and output directories.
Or, if you want to evaluate only one file:

```bash
python diaper/infer_single_file.py -c examples/infer.yaml --wav-dir <directory with wav file> --wav-name <filename without extension>
```
Note that in the example you need to define the model and output directories.
You can also run inference using the models we share, either with the usual approach or on a single file. For the model trained on simulated conversations (no fine-tuning):

```bash
python diaper/infer_single_file.py -c examples/infer_16k_10attractors.yaml --wav-dir examples --wav-name IS1009a
```

or with fine-tuning:

```bash
python diaper/infer_single_file.py -c examples/infer_16k_10attractors_AMIheadsetFT.yaml --wav-dir examples --wav-name IS1009a
```

You should obtain the same results as in examples/IS1009a_infer_16k_10attractors.rttm and examples/IS1009a_infer_16k_10attractors_AMIheadsetFT.rttm, respectively.
All models trained on publicly available and free data are shared inside the `models` folder. Both families of models, with 10 and 20 attractors, are available. If you want to use any of them, modify the infer files above to suit your needs: change `models_path` and `epochs` (and `rttms_dir`, where the output will be generated) to point to the model you want.
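For orientation, a hypothetical fragment of such an infer configuration, showing only the fields mentioned above (the key names `models_path`, `epochs` and `rttms_dir` are the ones to change; the values and surrounding layout are placeholders, so check the shipped `examples/infer*.yaml` files for the exact format):

```yaml
# Placeholder values -- adapt to the model you want to use.
models_path: /path/to/models/subfolder   # folder with the shared checkpoints
epochs: /path/or/ids/of/checkpoint(s)    # checkpoint(s) to average/use
rttms_dir: /path/to/output/rttms         # where the output RTTMs are written
```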
| DER and RTTMs | 10 attractors, without FT | 10 attractors, with FT | 20 attractors, without FT | 20 attractors, with FT | VAD+VBx+OSD |
|---|---|---|---|---|---|
| AISHELL-4 | 48.21% 📁 | 41.43% 📁 | 47.86% 📁 | 31.30% 📁 | 15.84% 📁 |
| AliMeeting (far) | 38.67% 📁 | 32.60% 📁 | 34.35% 📁 | 26.27% 📁 | 28.84% 📁 |
| AliMeeting (near) | 28.19% 📁 | 27.82% 📁 | 23.90% 📁 | 24.44% 📁 | 22.59% 📁 |
| AMI (array) | 57.07% 📁 | 49.75% 📁 | 52.29% 📁 | 50.97% 📁 | 34.61% 📁 |
| AMI (headset) | 36.36% 📁 | 32.94% 📁 | 35.08% 📁 | 30.49% 📁 | 22.42% 📁 |
| Callhome | 14.86% 📁 | 13.60% 📁 | -- | -- | 13.62% 📁 |
| CHiME6 | 78.25% 📁 | 70.77% 📁 | 77.51% 📁 | 69.94% 📁 | 70.42% 📁 |
| DIHARD 2 | 43.75% 📁 | 32.97% 📁 | 44.51% 📁 | 31.23% 📁 | 26.67% 📁 |
| DIHARD 3 full | 34.21% 📁 | 24.12% 📁 | 34.82% 📁 | 22.77% 📁 | 20.28% 📁 |
| DipCo | 48.26% 📁 | -- | 43.37% 📁 | -- | 49.22% 📁 |
| Mixer6 | 21.03% 📁 | 13.41% 📁 | 18.51% 📁 | 10.99% 📁 | 35.60% 📁 |
| MSDWild | 35.69% 📁 | 15.46% 📁 | 25.07% 📁 | 14.59% 📁 | 16.86% 📁 |
| RAMC | 38.05% 📁 | 21.11% 📁 | 32.08% 📁 | 18.69% 📁 | 18.19% 📁 |
| VoxConverse | 23.20% 📁 | -- | 22.10% 📁 | -- | 6.12% 📁 |
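The outputs behind the DER numbers above are RTTM files like those in `examples/`. As a minimal sketch (not part of this repository), assuming the standard RTTM layout (`SPEAKER <file-id> <chan> <onset> <duration> <NA> <NA> <speaker> <NA> <NA>`), the segments can be read like this:

```python
def read_rttm(lines):
    """Parse RTTM lines into (file_id, onset, duration, speaker) tuples."""
    segments = []
    for line in lines:
        fields = line.split()
        # Only SPEAKER entries carry diarization segments.
        if not fields or fields[0] != "SPEAKER":
            continue
        file_id = fields[1]
        onset, duration = float(fields[3]), float(fields[4])
        speaker = fields[7]
        segments.append((file_id, onset, duration, speaker))
    return segments

# Toy example with made-up values (not from the shared RTTMs):
example = [
    "SPEAKER IS1009a 1 0.50 2.30 <NA> <NA> spk0 <NA> <NA>",
    "SPEAKER IS1009a 1 2.80 1.10 <NA> <NA> spk1 <NA> <NA>",
]
segments = read_rttm(example)
total_speech = sum(dur for _, _, dur, _ in segments)  # total attributed speech
```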
In case of using the software, referencing results or finding the repository useful in any way, please cite:

```bibtex
@article{landini2023diaper,
  title={DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors},
  author={Landini, Federico and Diez, Mireia and Stafylakis, Themos and Burget, Luk{\'a}{\v{s}}},
  journal={arXiv preprint arXiv:2312.04324},
  year={2023}
}
```
If you did not use it for a publication but still found it useful, please let me know by email; I would love to hear about it :)
If you have comments or questions, please contact me at landini@fit.vutbr.cz