Why does "film_clap_cond1" CLAP use "embed_mode": "text" instead of "audio"? #94

@xzmates

Dear Author, thank you for your significant contribution to audio generation. I have a question and would appreciate your answer.
In your source code, "get_basic_config()" in utils.py defines:
    "film_clap_cond1": {
        "cond_stage_key": "text",
        "conditioning_key": "film",
        "target": "audioldm2.latent_diffusion.modules.encoders.modules.CLAPAudioEmbeddingClassifierFreev2",
        "params": {
            "sampling_rate": 48000,
            "embed_mode": "text",
            "amodel": "HTSAT-base",
        },
    },
HTSAT-base is an audio encoder, so why is "embed_mode" set to "text" rather than "audio"? Can I switch the conditioning to CLAP's audio encoder, as follows?
    "film_clap_cond1": {
        "cond_stage_key": "audio",
        "conditioning_key": "film",
        "target": "audioldm2.latent_diffusion.modules.encoders.modules.CLAPAudioEmbeddingClassifierFreev2",
        "params": {
            "sampling_rate": 48000,
            "embed_mode": "audio",
            "amodel": "HTSAT-base",
        },
    },
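
To make the proposed change concrete, here is a minimal sketch of it as an override on the quoted entry. It only manipulates the dictionary and does not import or run audioldm2. The helper name with_audio_conditioning is hypothetical, and I have not verified that the data pipeline provides a batch entry under the key "audio" or that the rest of the model accepts this mode.

```python
import copy

# Conditioning entry as quoted from get_basic_config() in utils.py.
film_clap_cond1 = {
    "cond_stage_key": "text",
    "conditioning_key": "film",
    "target": "audioldm2.latent_diffusion.modules.encoders.modules.CLAPAudioEmbeddingClassifierFreev2",
    "params": {
        "sampling_rate": 48000,
        "embed_mode": "text",
        "amodel": "HTSAT-base",
    },
}


def with_audio_conditioning(cond):
    """Hypothetical helper: return a copy of the entry switched to audio mode."""
    new_cond = copy.deepcopy(cond)  # deep copy so the nested "params" dict is not shared
    new_cond["cond_stage_key"] = "audio"  # assumption: the batch exposes an "audio" key
    new_cond["params"]["embed_mode"] = "audio"
    return new_cond


audio_cond = with_audio_conditioning(film_clap_cond1)
```

Only "cond_stage_key" and "embed_mode" differ between the two entries; "amodel", "sampling_rate", "conditioning_key", and "target" stay the same, which is why I am unsure whether the change is sufficient on its own.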
