Data frame optimization

The most expensive operation we want to perform on the input data frame is appending an audio snippet to each row. This involves opening a .wav file for each row.

Some of these rows point to the same .wav file, so we make sure each file is opened only once.
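One way to guarantee a single read per file is to memoize the loader on the key. The sketch below is only an illustration under assumptions, not the package's implementation: `load_audio` and `fake_read` are hypothetical names, and the disk read is simulated so the example runs without any audio files.

```python
from functools import lru_cache

read_log = []  # records every simulated disk read, for demonstration only


def fake_read(key):
    """Hypothetical stand-in for an expensive .wav read."""
    read_log.append(key)
    return [0.0, 0.1, 0.2], 16000  # (samples, sample rate)


@lru_cache(maxsize=None)
def load_audio(key):
    """Return (samples, rate) for a key, reading each file at most once."""
    return fake_read(key)


keys = ["/a", "/b", "/a", "/a"]  # several rows share the same key
loaded = [load_audio(k) for k in keys]
```

After this runs, `read_log` holds one entry per distinct key, even though four rows were processed.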

Input dataframe

from pandas import read_csv

df = read_csv("tests/test_public.csv")
df

	start_time	end_time	participant	utterance	key	language	uid
0	2.0	2.9	pablo	unit test	/public-dutch/dutch-01	dutch	dutch-public-000-000-000001
1	1.2	2.5	pablo	prueba de audio	/public-spanish/spanish-01	spanish	spanish-public-000-000-000001
2	1.9	2.9	pablo	los tests	/public-spanish/spanish-02	spanish	spanish-public-000-000-000002
3	1.9	2.9	none	nothing	/missing_file	klingon	klingon-000-000-000001
4	10.2	12.5	pablo	out of bounds	/public-spanish/spanish-wrong	spanish	spanish-public-000-000-000003

Please note that start_time and end_time are given in seconds.

Auxiliary functions:

This adapter helps us convert our syntax (using keys) into librosa's syntax (using filenames).

key = "/public-dutch/dutch-01"

from corpusparser.auxs import filename_from_key
filename_from_key(key)

'data/public-dutch/dutch-01.wav'
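Judging only from the input and output shown above, the mapping seems to prepend a data folder and append a .wav extension. A minimal sketch of such an adapter could look like this; `key_to_filename`, `root` and `extension` are hypothetical names, and the real `filename_from_key` may work differently.

```python
from pathlib import PurePosixPath


def key_to_filename(key, root="data", extension=".wav"):
    """Map a corpus key such as '/public-dutch/dutch-01' to a relative .wav path."""
    # Keys start with '/', so strip it before joining; otherwise the key
    # would be treated as an absolute path and the root would be discarded.
    return str(PurePosixPath(root) / (key.lstrip("/") + extension))


path = key_to_filename("/public-dutch/dutch-01")
```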

Extract audio features

from corpusparser.parsers import *

Example of usage

Extract all audio

audio_from_key(key)

array([0.        , 0.        , 0.        , ..., 0.00112915, 0.00177002,
       0.00216675], dtype=float32)

Extract sample rate

samplerate_from_key(key)

Extract an audio snippet

df[df["key"] == key].reset_index()

	index	start_time	end_time	participant	utterance	key	language	uid
0	0	2.0	2.9	pablo	unit test	/public-dutch/dutch-01	dutch	dutch-public-000-000-000001

snippet = subset_audio_from_key(df, key, row=0)
snippet

/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:51: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  start_i = floor(start_time * rate)
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:52: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  end_i = ceil(end_time * rate)

array([-1.5258789e-04, -6.1035156e-05,  1.2207031e-04, ...,
        1.5258789e-04,  2.1362305e-04,  1.2207031e-04], dtype=float32)
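The warnings above show that the snippet boundaries are computed as floor(start_time * rate) and ceil(end_time * rate). The sketch below illustrates that conversion from seconds to sample indices on plain floats, which is what the warning's suggested `float(ser.iloc[0])` would produce. `slice_by_time` is a hypothetical name, and returning an empty result for a window outside the recording is an assumption.

```python
from math import ceil, floor


def slice_by_time(audio, rate, start_time, end_time):
    """Return the samples between start_time and end_time (both in seconds)."""
    start_i = floor(float(start_time) * rate)  # first sample of the window
    end_i = ceil(float(end_time) * rate)       # one past the last sample
    if start_i < 0 or end_i > len(audio) or start_i >= end_i:
        return audio[:0]  # empty result, same type as the input
    return audio[start_i:end_i]


samples = list(range(40))  # 4 seconds of fake audio at 10 samples per second
snippet = slice_by_time(samples, rate=10, start_time=0.5, end_time=1.5)
```

The function only relies on `len` and slicing, so it accepts a NumPy array as well as a list.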

Append all audio snippets to dataframe

df = extend_dataframe(df)

/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:22: UserWarning: PySoundFile failed. Trying audioread instead.
  audio, rate = librosa.core.load(filename_from_key(key), sr=sr, **kwargs) # sr=None uses the native sampling rate
/home/pablo/miniconda3/envs/ffmpeg-test/lib/python3.12/site-packages/librosa/core/audio.py:184: FutureWarning: librosa.core.audio.__audioread_load
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:25: UserWarning: Something went wrong with key: /missing_file
  warnings.warn(f"Something went wrong with key: {key}")
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:32: FutureWarning: PySoundFile failed. Trying audioread instead.
	Audioread support is deprecated in librosa 0.10.0 and will be removed in version 1.0.
  sr = librosa.get_samplerate(filename_from_key(key), **kwargs)
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:34: UserWarning: Something went wrong with key: /missing_file
  warnings.warn(f"Something went wrong with key: {key}")
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:32: FutureWarning: PySoundFile failed. Trying audioread instead.
	Audioread support is deprecated in librosa 0.10.0 and will be removed in version 1.0.
  sr = librosa.get_samplerate(filename_from_key(key), **kwargs)
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:34: UserWarning: Something went wrong with key: /missing_file
  warnings.warn(f"Something went wrong with key: {key}")

df

	start_time	end_time	participant	utterance	key	language	uid	audio	rate
0	2.0	2.9	pablo	unit test	/public-dutch/dutch-01	dutch	dutch-public-000-000-000001	[-0.00015258789, -6.1035156e-05, 0.00012207031...	24000
1	1.2	2.5	pablo	prueba de audio	/public-spanish/spanish-01	spanish	spanish-public-000-000-000001	[-0.0009460449, -0.00076293945, -0.00076293945...	16000
2	1.9	2.9	pablo	los tests	/public-spanish/spanish-02	spanish	spanish-public-000-000-000002	[0.0066223145, 0.007019043, 0.0073547363, 0.00...	16000
3	1.9	2.9	none	nothing	/missing_file	klingon	klingon-000-000-000001	[]	0
4	10.2	12.5	pablo	out of bounds	/public-spanish/spanish-wrong	spanish	spanish-public-000-000-000003	[]	16000
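The table above suggests that a missing file yields an empty snippet with rate 0, and that an out-of-bounds window yields an empty snippet with the file's rate. The following sketch reproduces that behaviour under assumptions: `extend_with_audio`, `loader` and `fake_loader` are hypothetical names, and this is not the code behind `extend_dataframe`.

```python
import numpy as np
import pandas as pd


def extend_with_audio(df, loader):
    """Return a copy of df with 'audio' and 'rate' columns added.

    `loader` is a hypothetical callable mapping a key to (samples, rate);
    it is called once per unique key, not once per row.
    """
    cache = {}
    for key in df["key"].unique():
        try:
            cache[key] = loader(key)
        except Exception:
            # Assumption: an unreadable file yields no audio and a rate of 0.
            cache[key] = (np.array([], dtype=np.float32), 0)

    out = df.copy()
    audio_col, rate_col = [], []
    for row in out.itertuples(index=False):
        samples, rate = cache[row.key]
        start_i = int(np.floor(row.start_time * rate))
        end_i = int(np.ceil(row.end_time * rate))
        in_bounds = end_i <= len(samples)
        audio_col.append(samples[start_i:end_i] if in_bounds else samples[:0])
        rate_col.append(rate)
    out["audio"] = pd.Series(audio_col, index=out.index, dtype=object)
    out["rate"] = rate_col
    return out


calls = []  # keys passed to the loader, to show each file is read once


def fake_loader(key):
    """Simulated loader: 4 seconds of audio at 10 samples per second."""
    calls.append(key)
    if key == "/missing_file":
        raise FileNotFoundError(key)
    return np.arange(40, dtype=np.float32), 10


rows = pd.DataFrame({
    "start_time": [0.5, 1.0, 1.9],
    "end_time": [1.5, 2.0, 2.9],
    "key": ["/a", "/a", "/missing_file"],
})
extended = extend_with_audio(rows, fake_loader)
```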

(Optional) Listen to the snippets

From key

from corpusparser.listeners import *
listen_audio_from_key(df, key = key, row = 0)

/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:51: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  start_i = floor(start_time * rate)
/home/pablo/code/ffmpeg-test/corpusparser/parsers.py:52: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  end_i = ceil(end_time * rate)


From data frame index

listen_snippet_from_df(df, row = 0)
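Outside a notebook, a snippet can also be saved to disk and opened in any audio player. The sketch below is an alternative and not part of corpusparser: `write_snippet_wav` is a hypothetical helper that uses only the standard library and assumes the samples are floats in the range [-1, 1], as in the arrays shown earlier.

```python
import os
import struct
import tempfile
import wave


def write_snippet_wav(path, samples, rate):
    """Write float samples in [-1, 1] to a mono 16-bit PCM .wav file."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 2 bytes = 16 bits per sample
        wav.setframerate(int(rate))
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


out_path = os.path.join(tempfile.mkdtemp(), "snippet.wav")
write_snippet_wav(out_path, [0.0, 0.5, -0.5, 1.0], rate=16000)
```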

