This code is useful for estimating the level of phonetic reduction in speech data. It was developed by Javier Vazquez and Nigel Ward of the Interactive Systems Group (ISG) of the University of Texas at El Paso (UTEP), in 2024-2025. Publications describing the data, methods, and intended uses are available via https://www.cs.utep.edu/nigel/reduction/
To use this code, fork the repository and then follow the example usage below.
1. Extract HuBERT features from your audio files:

from reduction_model import Reduction

reduction = Reduction()
hubert_features = reduction.extract(['audio1.wav','audio2.wav','audio3.wav'])
2a. Load the pretrained English model (although loading may fail if the pickle format or library versions have changed):
reduction.loadModel()
2b. Train your own model using our data. First download the npy files created at ISG, available at www.cs.utep.edu/nigel/reduction/, into default_data. Then:
reduction.default_fit()
2c. Train your own model using your own data.
reduction.fit(X=hubert_features,y=['labels.txt'])
In general, once you have the HuBERT features available, you can train a model on any subset of those features for which there are corresponding reduction labels. The label file is tab-delimited, with fields in the order Channel, Start Time, End Time, and Reduction Value. Channel is Left or Right for stereo audio and None for mono audio. Start Time and End Time give the region's timespan in seconds. Reduction Value is the annotator's judgment for that region. There are examples in the default_data folder.
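As an illustration, a label file could be created like this (the file name and values are made up; for the exact channel spellings, follow the examples in default_data):

rows = [('Left', 12.40, 13.10, 2.5),    # hypothetical stereo annotations
        ('Right', 14.05, 14.80, 1.0)]   # for mono audio the channel would be None
with open('labels.txt', 'w') as f:
    for channel, start, end, value in rows:
        f.write(f'{channel}\t{start}\t{end}\t{value}\n')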
You can find data for training from scratch at http://www.cs.utep.edu/nigel/reduction/annotations.zip (792KB). By default, you'll want to extract all the audio files and label files into default_data.
3a. Now that you have a trained model, you can use it to predict the reduction value for each frame. Frames occur every 20ms.
frame_predictions = reduction.predict(hubert_features[0])
For this, you will need features extracted as described in Step 1. The output will be the predicted reduction value vectors for each track of the audio.
3b. Alternatively, you can obtain per-region predictions using predict_utterances. Typically these regions will be words or phrases. For this, you need to provide a tab-delimited text file specifying the regions of interest, with each line giving the Channel, Start Time, and End Time, in the format described above. If there is a fourth field, for the label, it is ignored here.
utterance_predictions = reduction.predict_utterances(hubert_features, 'utterance_timeframes.txt')
Install a recent version of python from www.python.org, and then
py -m pip install torch torchaudio numpy scikit-learn
py -m pip install sox soundfile matplotlib  # optional
Note that you may need to use pip3 to get the modules in the right place, as in pip3 install numpy
Note that py may work better than python, since it invokes a newer interpreter, depending on your configuration.
After you fork the GitHub repo, you will get a directory with a subdirectory called REDUCTION. Within that, you'll find the Python files and a directory called default_data. The description above assumes that you've invoked py from inside the REDUCTION subdirectory. You'll also see documentation, a pickled model, and a tiny set of test data.
The downstream model (decision head) here is simple, as most of the work is done by the HuBERT features. To predict the level of reduction in a speech file, the file first needs to be converted into HuBERT features, so a major function of this code is to provide easy access to the HuBERT model available in PyTorch. In particular, it uses the HuBERT Base model, which extracts 12 layers of 768 features for every 20 ms of audio. The last layer is the one used here, as it had the best performance in pilot tests. Note that, depending on whether the audio is mono or stereo, the HuBERT model will produce features for one or two channels.
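For reference, here is a rough sketch of the kind of extraction involved, using torchaudio's HUBERT_BASE pipeline; this is illustrative, not necessarily the actual implementation in extract():

import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model()
waveform, sample_rate = torchaudio.load('audio1.wav')   # (channels, samples)
if sample_rate != bundle.sample_rate:                    # HuBERT expects 16 kHz audio
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
with torch.inference_mode():
    layers, _ = model.extract_features(waveform)         # one tensor per transformer layer
last_layer = layers[-1]                                   # shape: (channels, frames, 768)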
Note that reduction.extract() will return a list of torch tensors, each of which has three dimensions: number-of-tracks (usually 2), number of 20-ms frames, number of features per frame (which is 768).
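For example, assuming the extraction call from Step 1:

print(hubert_features[0].shape)   # e.g. torch.Size([2, number_of_frames, 768]) for a stereo file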
Note that feature extraction takes some time: for example, perhaps 5 minutes to process a 10-minute file.
The API is designed to work with lists of files, not individual files, let alone individual utterances or words. This is because the predictions are not trustworthy at the utterance or word level, so it only makes sense to use this as part of a workflow collecting statistics over substantial data, typically tens of minutes of dialog across multiple audio files.
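For example, a workflow along these lines (the file names are made up, and this assumes that predict() returns array-like per-frame values) could collect simple statistics over several dialogs:

import numpy as np
files = ['dialog1.wav', 'dialog2.wav', 'dialog3.wav']   # hypothetical file names
feats = reduction.extract(files)
per_file_means = [float(np.mean(np.asarray(reduction.predict(f)))) for f in feats]
print('mean predicted reduction per file:', per_file_means)
print('overall mean:', float(np.mean(per_file_means)))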
Note that the GitHub repo includes neither the audio files for training nor the derived npy files; this is only because they are large. Accordingly, the API is not very elegant. However, the code is flexible enough to modify, assuming a basic knowledge of Python.
To check that everything works, you can run the model on the tiny test data included in the repo:

testFeats = reduction.extract(['tinytest/redu-enun-test.wav'])
testPreds = reduction.predict(testFeats[0])
Now you can visualize the predictions with, for example
import matplotlib.pyplot as plt
plt.plot(testPreds)
plt.show()
Then you can line these predictions up with the human labels, which can be visualized with ELAN, to gauge quality. You can also compare testPreds to the predictions we obtained, which are in tinytest/redu-enum-test-predictions.npy
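For example, one rough way to compare them, assuming both are simple arrays of per-frame values:

import numpy as np
ourPreds = np.load('tinytest/redu-enum-test-predictions.npy')
mine = np.asarray(testPreds).flatten()
theirs = np.asarray(ourPreds).flatten()
n = min(len(mine), len(theirs))   # lengths may differ slightly
print('correlation:', np.corrcoef(mine[:n], theirs[:n])[0, 1])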
As a point of interest, the audio files were originally taken from the DRAL corpus, which is downloadable from https://www.cs.utep.edu/nigel/dral/, or from the Linguistic Data Consortium under Catalog number LDC2024S08