elpaco-escience/corpusparser

Data frame optimization

The heaviest task we want to perform on the input data frame is appending audio snippets, which involves opening a .wav file for each row.

Some of these rows point to the same .wav file, so we make sure each file is opened only once.
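
One way to open each file only once is to cache the loader by key. The following is a minimal sketch of that idea, not the package's implementation: `load_audio_once` and the `load_calls` counter are hypothetical names, and the loader returns a placeholder instead of reading a real .wav file.

```python
from functools import lru_cache

# Hypothetical stand-in for an expensive .wav read; it only records its calls.
load_calls = []

@lru_cache(maxsize=None)
def load_audio_once(key):
    """Pretend to open the file behind `key`; cached, so each key is loaded once."""
    load_calls.append(key)
    return (f"audio for {key}", 16000)  # placeholder (audio, rate) pair

# Illustrative keys; the repeated key is served from the cache.
keys = [
    "/public-dutch/dutch-01",
    "/public-spanish/spanish-01",
    "/public-spanish/spanish-01",
]
results = [load_audio_once(k) for k in keys]
print(len(load_calls))
```

Three rows are processed, but the loader body runs only for the two distinct keys.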

Input dataframe

from pandas import read_csv
df = read_csv("tests/test_public.csv")
df
   start_time  end_time  participant        utterance                            key  language                            uid
0         2.0       2.9        pablo        unit test         /public-dutch/dutch-01     dutch    dutch-public-000-000-000001
1         1.2       2.5        pablo  prueba de audio     /public-spanish/spanish-01   spanish  spanish-public-000-000-000001
2         1.9       2.9        pablo        los tests     /public-spanish/spanish-02   spanish  spanish-public-000-000-000002
3         1.9       2.9         none          nothing                  /missing_file   klingon         klingon-000-000-000001
4        10.2      12.5        pablo    out of bounds  /public-spanish/spanish-wrong   spanish  spanish-public-000-000-000003

Note that start_time and end_time are given in seconds.

Auxiliary functions:

This adapter helps us convert our syntax (using keys) into librosa's syntax (using filenames).

key = "/public-dutch/dutch-01"
from corpusparser.auxs import filename_from_key
filename_from_key(key)
'data/public-dutch/dutch-01.wav'
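
The mapping shown above can be reproduced with plain string formatting. This is only a sketch inferred from the example output: `filename_from_key_sketch` and its `root` and `extension` parameters are hypothetical, and the real `filename_from_key` may be implemented differently.

```python
def filename_from_key_sketch(key, root="data", extension=".wav"):
    """Hypothetical adapter: prefix the data folder and append the extension."""
    # Keys already start with "/", so no extra separator is added.
    return f"{root}{key}{extension}"

filename = filename_from_key_sketch("/public-dutch/dutch-01")
print(filename)  # data/public-dutch/dutch-01.wav
```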

Extract audio features

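# Import only the names used below instead of a wildcard import
from corpusparser.parsers import (
    audio_from_key,
    samplerate_from_key,
    subset_audio_from_key,
    extend_dataframe,
)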

Example of usage

Extract all audio

audio_from_key(key)
array([0.        , 0.        , 0.        , ..., 0.00112915, 0.00177002,
       0.00216675], dtype=float32)

Extract sample rate

samplerate_from_key(key)
24000

Extract an audio snippet

df[df["key"] == key].reset_index()
   index  start_time  end_time  participant  utterance                     key  language                          uid
0      0         2.0       2.9        pablo  unit test  /public-dutch/dutch-01     dutch  dutch-public-000-000-000001
snippet = subset_audio_from_key(df, key, row=0)
snippet
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:51: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  start_i = floor(start_time * rate)
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:52: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  end_i = ceil(end_time * rate)





array([-1.5258789e-04, -6.1035156e-05,  1.2207031e-04, ...,
        1.5258789e-04,  2.1362305e-04,  1.2207031e-04], dtype=float32)
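
The warnings above show that the snippet boundaries are computed as `floor(start_time * rate)` and `ceil(end_time * rate)`. The sketch below illustrates that conversion from seconds to sample indices on a plain list. The names `snippet_bounds` and `cut_snippet` are hypothetical, and returning an empty snippet for a window that does not fit is an assumption about the behaviour, not the package's code.

```python
from math import ceil, floor

def snippet_bounds(start_time, end_time, rate):
    """Convert a time window in seconds into (start, end) sample indices."""
    return floor(start_time * rate), ceil(end_time * rate)

def cut_snippet(audio, start_time, end_time, rate):
    """Return the samples inside the window, or an empty list if it does not fit."""
    start_i, end_i = snippet_bounds(start_time, end_time, rate)
    # Assumed policy: a window outside the recording yields no samples.
    if start_i < 0 or end_i > len(audio) or start_i >= end_i:
        return []
    return audio[start_i:end_i]

audio = list(range(40))  # 5 seconds of fake samples at 8 samples per second
inside = cut_snippet(audio, 0.5, 1.5, rate=8)     # samples 4 to 11
outside = cut_snippet(audio, 10.2, 12.5, rate=8)  # window past the end
print(len(inside), len(outside))
```

Rounding the start down and the end up means the snippet never drops a sample that overlaps the requested window.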

Append all audio snippets to dataframe

df = extend_dataframe(df)
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:22: UserWarning: PySoundFile failed. Trying audioread instead.
  audio, rate = librosa.core.load(filename_from_key(key), sr=sr, **kwargs) # sr=None uses the native sampling rate
/home/pablo/miniconda3/envs/ffmpeg-test/lib/python3.12/site-packages/librosa/core/audio.py:184: FutureWarning: librosa.core.audio.__audioread_load
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:25: UserWarning: Something went wrong with key: /missing_file
  warnings.warn(f"Something went wrong with key: {key}")
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:32: FutureWarning: PySoundFile failed. Trying audioread instead.
	Audioread support is deprecated in librosa 0.10.0 and will be removed in version 1.0.
  sr = librosa.get_samplerate(filename_from_key(key), **kwargs)
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:34: UserWarning: Something went wrong with key: /missing_file
  warnings.warn(f"Something went wrong with key: {key}")
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:32: FutureWarning: PySoundFile failed. Trying audioread instead.
	Audioread support is deprecated in librosa 0.10.0 and will be removed in version 1.0.
  sr = librosa.get_samplerate(filename_from_key(key), **kwargs)
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:34: UserWarning: Something went wrong with key: /missing_file
  warnings.warn(f"Something went wrong with key: {key}")
df
   start_time  end_time  participant        utterance                            key  language                            uid                                              audio   rate
0         2.0       2.9        pablo        unit test         /public-dutch/dutch-01     dutch    dutch-public-000-000-000001  [-0.00015258789, -6.1035156e-05, 0.00012207031...  24000
1         1.2       2.5        pablo  prueba de audio     /public-spanish/spanish-01   spanish  spanish-public-000-000-000001  [-0.0009460449, -0.00076293945, -0.00076293945...  16000
2         1.9       2.9        pablo        los tests     /public-spanish/spanish-02   spanish  spanish-public-000-000-000002  [0.0066223145, 0.007019043, 0.0073547363, 0.00...  16000
3         1.9       2.9         none          nothing                  /missing_file   klingon         klingon-000-000-000001                                                 []      0
4        10.2      12.5        pablo    out of bounds  /public-spanish/spanish-wrong   spanish  spanish-public-000-000-000003                                                 []  16000
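
In the output above, the row with the missing file ends up with an empty audio array and a rate of 0, after a warning. A pattern that produces this kind of result is to wrap the loader, warn on failure, and fall back to empty values. The sketch below assumes that pattern. `load_or_empty` and `fake_loader` are hypothetical names, and the actual error handling in the package may differ.

```python
import warnings

def load_or_empty(key, loader):
    """Call loader(key); on failure, warn and return an empty (audio, rate) pair."""
    try:
        return loader(key)
    except Exception:
        warnings.warn(f"Something went wrong with key: {key}")
        return [], 0

def fake_loader(key):
    # Hypothetical loader: only one key "exists" in this sketch.
    if key != "/public-dutch/dutch-01":
        raise FileNotFoundError(key)
    return [0.0, 0.1, 0.2], 24000

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeated ones
    good = load_or_empty("/public-dutch/dutch-01", fake_loader)
    missing = load_or_empty("/missing_file", fake_loader)

print(missing, len(caught))
```

With this design, one unreadable file does not abort the whole run, and rows with an empty audio column can be filtered out afterwards.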

(Optional) Listen to the snippets

From key

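# Import only the names used below instead of a wildcard import
from corpusparser.listeners import listen_audio_from_key, listen_snippet_from_df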
listen_audio_from_key(df, key = key, row = 0)
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:51: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  start_i = floor(start_time * rate)
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:52: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  end_i = ceil(end_time * rate)

From data frame index

listen_snippet_from_df(df, row = 0)
