[Lisa++] Why can't I get results for image captioning with instance segmentation? I can't even replicate the camera lens example. #203

@zhitongcui

Description

I manually downloaded lisa_plus_7b and clip-vit-large-patch14 from Hugging Face and put them in a local folder. Then I ran `python chat_instance.py --precision='bf16'` on a single Tesla V100 GPU (32 GB memory).
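
For reference, here is a minimal sketch of how local checkpoint folders can stand in for Hub model names (the paths below are hypothetical, and the actual loading happens inside chat_instance.py, so treat this only as a sanity check that the downloaded folders are valid):

```python
from transformers import AutoTokenizer, CLIPImageProcessor

# Hypothetical local paths; point these at the downloaded folders.
LISA_PATH = "./lisa_plus_7b"
CLIP_PATH = "./clip-vit-large-patch14"

# from_pretrained accepts a local directory exactly like a Hub name,
# provided the folder contains the config, weight, and tokenizer files.
tokenizer = AutoTokenizer.from_pretrained(LISA_PATH)
image_processor = CLIPImageProcessor.from_pretrained(CLIP_PATH)
```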

I asked in the following way and got the results below: no camera lens was segmented, and only the original image was returned:
```
Please input your prompt: Please describe the two cameras. + DEFAULT_INST_CONV_SEG
Please input the image path: ./imgs/camera_lens.jpg
text_output: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: Please describe the two cameras. + DEFAULT_INST_CONV_SEG ASSISTANT: Sure, the image depicts two different sized cameras placed on a surface. One of the cameras is larger and has a front element attached to it, which appears to be a lens, and the other camera is smaller in size. The larger camera has a front element that is a lens as well, and it is placed on top of the smaller camera. The smaller camera has a front facing lens and is positioned directly below the larger lens in the image.
len(pred_masks): 1
[x.shape for x in pred_masks]: [torch.Size([0, 1620, 2160])]
./lisa_output/camera_lens_masked_img.jpg has been saved.
```
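
As far as I understand, the leading dimension of each tensor in pred_masks counts the predicted segments, so torch.Size([0, 1620, 2160]) means the model emitted no [SEG] tokens at all for this prompt. A small sketch mirroring the log above (the tensor is constructed here only for illustration):

```python
import torch

# Mirrors the run above: one image, zero predicted 1620x2160 masks.
pred_masks = [torch.empty(0, 1620, 2160)]

for i, masks in enumerate(pred_masks):
    if masks.shape[0] == 0:
        # No [SEG] tokens were generated, so there is nothing to overlay
        # and only the unmodified image gets written out.
        print(f"image {i}: 0 masks predicted; saving the original image")
```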

Here is another example, with some fruits in the image:
```
Please input your prompt: What kind of fruits in this image? Please describe and segment them to me one by one?
Please input the image path: ./imgs/fruits.jpg
text_output: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: What kind of fruits in this image? Please describe and segment them to me one by one? ASSISTANT: Sure, the segmentation results are:- Apples- Pears- Grapes- Plums (or dark colored fruit, it's difficult to determine the exact type from the image)- Kiwis-[ [SEG] ]
len(pred_masks): 1
[x.shape for x in pred_masks]: [torch.Size([1, 1200, 900])]
./lisa_output/fruits_mask_0.jpg has been saved.
./lisa_output/fruits_masked_img.jpg has been saved.
```
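
If I read the output correctly, each [SEG] token in the answer corresponds to one slice of the mask tensor, which is why the single [SEG] above yields torch.Size([1, 1200, 900]). A sketch of that correspondence (the variables are stand-ins for what the script returns; I am assuming its internals here):

```python
import torch

# Stand-ins mirroring the run above; in chat_instance.py these come back
# from the model's generation step (an assumption about the script).
text_output = "Sure, the segmentation results are:- Apples- ... -[ [SEG] ]"
pred_masks = [torch.zeros(1, 1200, 900)]

# One mask slice per [SEG] token emitted in the answer text.
assert pred_masks[0].shape[0] == text_output.count("[SEG]")
```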

I still cannot get any segmentation results.

However, if I ask in the following way, I do get segmentation results, but of very low quality:
```
Please input your prompt: Please segment the fruit in the image one by one. + DEFAULT_INSTANT_SEG
Please input the image path: ./imgs/fruits.jpg
text_output: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: Please segment the fruit in the image one by one. + DEFAULT_INSTANT_SEG ASSISTANT: Sure, the segmentation results are [SEG] , [SEG] , and [SEG] .
len(pred_masks): 1
[x.shape for x in pred_masks]: [torch.Size([3, 1200, 900])]
./lisa_output/fruits_mask_0.jpg has been saved.
./lisa_output/fruits_mask_1.jpg has been saved.
./lisa_output/fruits_mask_2.jpg has been saved.
./lisa_output/fruits_masked_img.jpg has been saved.
```
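
For completeness, here is a minimal sketch of how the saved fruits_mask_*.jpg and fruits_masked_img.jpg outputs could be reproduced from a [N, H, W] mask tensor; the thresholding and blending here are my own guesses, not necessarily what the script actually does:

```python
import cv2
import numpy as np
import torch

# Stand-in for the real prediction: three 1200x900 mask logit maps.
pred_masks = [torch.randn(3, 1200, 900)]
image = cv2.imread("./imgs/fruits.jpg")  # assumed to be 1200x900 pixels

for i, mask in enumerate((pred_masks[0] > 0).cpu().numpy()):
    color = np.random.randint(0, 256, size=3)
    # Blend a translucent random color over the masked pixels.
    image[mask] = (0.5 * image[mask] + 0.5 * color).astype(np.uint8)
    cv2.imwrite(f"./lisa_output/fruits_mask_{i}.jpg", mask.astype(np.uint8) * 255)

cv2.imwrite("./lisa_output/fruits_masked_img.jpg", image)
```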

**Is there something I am potentially overlooking? I would greatly appreciate any reply. Thanks! Also, I only installed the packages necessary for inference, without the flash-attn module. Is flash-attn required for inference?**
