Motivation
I propose extending flexeval to support Multimodal Language Models (MLMs), which take multimodal data as input and generate text as output. A typical example is Vision Language Models (VLMs), which accept text and images as input.
Since the primary difference between MLMs and standard LLMs lies in the input format, the necessary modifications are confined to input data handling.
Expected changes
Assuming MLMs are served separately and accessed via the OpenAI API schema, the main changes are needed in the template processing logic, which currently builds each message as:
messages = [{"role": "user", "content": input_utterance}]
This expects input_utterance to be a str.
MLMs require a structured list of dicts, like:
{
    "role": "user",
    "content": [
        {"type": "text", "text": "..."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        {"type": "image_url", "image_url": {"url": "/path/to/image.jpg"}},
    ],
}
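For illustration, a message of this shape could be assembled as in the minimal sketch below. The helper names `to_data_url` and `build_user_message` are hypothetical and not part of flexeval; the sketch only assumes the content layout shown above and the standard base64 data URL format.

```python
import base64
import mimetypes
from typing import Any


def to_data_url(image_bytes: bytes, filename: str = "image.jpg") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file name; fall back to a generic type.
    mime, _ = mimetypes.guess_type(filename)
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


def build_user_message(text: str, image_urls: list[str]) -> dict[str, Any]:
    """Build one user message with a text part followed by image parts."""
    content: list[dict[str, Any]] = [{"type": "text", "text": text}]
    content += [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    return {"role": "user", "content": content}
```

A call such as `build_user_message("Describe this image.", [to_data_url(raw_bytes, "cat.jpg")])` would then yield a dict with the same structure as the example above.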
Thus, the proposed implementation details are:
- Allow input_utterance to be parsed into list[dict[str, Any]] using ast.literal_eval or json.loads
- (Optional, but recommended) Support preprocessing functions, e.g., image resizing, passed as arguments.
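The two points above could look roughly like the following sketch. It is not the existing flexeval code: `parse_input_utterance`, `build_messages`, and the `preprocessors` argument are hypothetical names, and the fallback order (json.loads first, then ast.literal_eval, otherwise keep the plain string) is an assumption about how the parsing might be arranged.

```python
import ast
import json
from typing import Any, Callable, Sequence, Union

ContentPart = dict[str, Any]
Content = Union[str, list[ContentPart]]
# Hypothetical hook type: takes one content part and returns a (possibly modified) part.
Preprocessor = Callable[[ContentPart], ContentPart]


def parse_input_utterance(rendered: str) -> Content:
    """Return a list of content parts if the rendered template encodes one, else the string."""
    stripped = rendered.strip()
    if not stripped.startswith("["):
        return rendered  # ordinary text-only utterance
    # Try strict JSON first, then Python literal syntax (single quotes, trailing commas).
    for loader in (json.loads, ast.literal_eval):
        try:
            parsed = loader(stripped)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(parsed, list) and parsed and all(isinstance(p, dict) for p in parsed):
            return parsed
    return rendered  # not a list of dicts: keep the original string


def build_messages(
    rendered: str, preprocessors: Sequence[Preprocessor] = ()
) -> list[dict[str, Any]]:
    """Build the chat messages, applying optional per-part preprocessors in order."""
    content = parse_input_utterance(rendered)
    if isinstance(content, list):
        for preprocess in preprocessors:
            content = [preprocess(part) for part in content]
    return [{"role": "user", "content": content}]
```

With this arrangement, existing text-only templates keep producing a plain str, while templates that render a list of dicts are passed through in structured form. One design point to settle is that a text utterance which itself happens to be a valid list of dicts would be treated as structured content, so an explicit opt-in flag may be preferable to auto-detection.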
Referenced code: flexeval/flexeval/core/chat_dataset/template_based.py, line 109 at commit 811ebc9.