Labels: enhancement (New feature or request)
Related to #147
Auto models have the following cons when it comes to real-life multi-language usage:
- only basic models are available
- the models are too general: often (at least in my case) you only need 2 or 3 specific languages
- you can't get Intermediate Results, even when the specific languages you use do support them
- at least theoretically, a model specialized in a single language should have better precision than one generalized over a set of languages you never really use
- I think it is reasonable to assume that a single listening session will in most cases not mix languages, and when it does, it should be easy to restart listening on switch
- this could be less intuitive for some, but I think (at least in my case) it would better fit my workflow overall
- auto models might be slower, at least that's what's been suggested in the comments on STT language auto-detection #147 (comment)
Feature draft:
- say you have a single "favorite" model selected for each language that's enabled in the app
- you have an extra "language recognition" enumeration model
- in an "auto" language mode, every time an STT listening session starts, when you start speaking, the language recognition model runs first while input is buffered for the STT model
- after the language is recognized with enough certainty, the model is chosen based on the recognized language and the "favorite" model set for that language
- the audio buffered so far is passed to the STT model
- here we can probably assume that only the last 3-5 seconds at most will be needed, provided the language recognition model is fast and accurate enough
- (maybe the language recognition model could even point out the exact moment the speech actually starts?)
- the rest of the current listening session proceeds as normal
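The draft flow above could be sketched roughly like this. This is a minimal illustration only: `AutoLanguageSession`, the model objects, and their `classify`/`feed` methods are all hypothetical names invented for the sketch, not an existing API of this project or any STT library.

```python
from collections import deque

BUFFER_SECONDS = 5   # per the draft: keep only the last 3-5 s of audio
CHUNK_SECONDS = 0.5  # assumed size of one incoming audio chunk

class AutoLanguageSession:
    """Sketch of the draft's "auto" mode: buffer audio while a hypothetical
    language-ID model decides, then hand the buffered audio to the per-language
    "favorite" STT model and continue the session normally."""

    def __init__(self, lang_id_model, favorite_models, confidence_threshold=0.8):
        self.lang_id = lang_id_model      # hypothetical language-recognition model
        self.favorites = favorite_models  # e.g. {"en": model_en, "pl": model_pl}
        self.threshold = confidence_threshold
        # Ring buffer so only the last few seconds are retained while deciding.
        self.buffer = deque(maxlen=int(BUFFER_SECONDS / CHUNK_SECONDS))
        self.stt = None                   # chosen once the language is known

    def on_audio_chunk(self, chunk: bytes) -> None:
        if self.stt is None:
            # Still deciding: buffer the chunk and ask the language-ID model.
            self.buffer.append(chunk)
            lang, confidence = self.lang_id.classify(b"".join(self.buffer))
            if confidence >= self.threshold and lang in self.favorites:
                self.stt = self.favorites[lang]
                # Replay the audio held back so far into the chosen STT model.
                for buffered in self.buffer:
                    self.stt.feed(buffered)
                self.buffer.clear()
        else:
            # Language already decided: the session proceeds as normal.
            self.stt.feed(chunk)
```

Restarting listening on a language switch (as the draft assumes) would then just mean constructing a fresh session so language recognition runs again.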