[Lisa++] Why can't I get results for image captioning with instance segmentation? I can't even replicate the camera lens example. #203

@zhitongcui

Description

I manually downloaded lisa_plus_7b and clip-vit-large-patch14 from Hugging Face and put them in a local folder. Then I ran `python chat_instance.py --precision='bf16'` on a single Tesla V100 GPU (32 GB memory).
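
For reference, here is a minimal sketch of how local checkpoint folders can stand in for Hub model names (the paths below are hypothetical, and the actual loading happens inside chat_instance.py, so treat this only as a sanity check that the downloaded folders are valid):

```python
from transformers import AutoTokenizer, CLIPImageProcessor

# Hypothetical local paths; point these at the downloaded folders.
LISA_PATH = "./lisa_plus_7b"
CLIP_PATH = "./clip-vit-large-patch14"

# from_pretrained accepts a local directory exactly like a Hub name,
# provided the folder contains the config, weight, and tokenizer files.
tokenizer = AutoTokenizer.from_pretrained(LISA_PATH)
image_processor = CLIPImageProcessor.from_pretrained(CLIP_PATH)
```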

I asked in the following way and got the results below: no camera lens was segmented, and only the original image was returned:
```
Please input your prompt: Please describe the two cameras. + DEFAULT_INST_CONV_SEG
Please input the image path: ./imgs/camera_lens.jpg
text_output: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: Please describe the two cameras. + DEFAULT_INST_CONV_SEG ASSISTANT: Sure, the image depicts two different sized cameras placed on a surface. One of the cameras is larger and has a front element attached to it, which appears to be a lens, and the other camera is smaller in size. The larger camera has a front element that is a lens as well, and it is placed on top of the smaller camera. The smaller camera has a front facing lens and is positioned directly below the larger lens in the image.
len(pred_masks): 1
[x.shape for x in pred_masks]: [torch.Size([0, 1620, 2160])]
./lisa_output/camera_lens_masked_img.jpg has been saved.
```
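
As far as I understand, the leading dimension of each tensor in pred_masks counts the predicted segments, so torch.Size([0, 1620, 2160]) means the model emitted no [SEG] tokens at all for this prompt. A small sketch mirroring the log above (the tensor is constructed here only for illustration):

```python
import torch

# Mirrors the run above: one image, zero predicted 1620x2160 masks.
pred_masks = [torch.empty(0, 1620, 2160)]

for i, masks in enumerate(pred_masks):
    if masks.shape[0] == 0:
        # No [SEG] tokens were generated, so there is nothing to overlay
        # and only the unmodified image gets written out.
        print(f"image {i}: 0 masks predicted; saving the original image")
```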

Here is another example, with some fruits in the image:
```
Please input your prompt: What kind of fruits in this image? Please describe and segment them to me one by one?
Please input the image path: ./imgs/fruits.jpg
text_output: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: What kind of fruits in this image? Please describe and segment them to me one by one? ASSISTANT: Sure, the segmentation results are:- Apples- Pears- Grapes- Plums (or dark colored fruit, it's difficult to determine the exact type from the image)- Kiwis-[ [SEG] ]
len(pred_masks): 1
[x.shape for x in pred_masks]: [torch.Size([1, 1200, 900])]
./lisa_output/fruits_mask_0.jpg has been saved.
./lisa_output/fruits_masked_img.jpg has been saved.
```
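
If I read the output correctly, each [SEG] token in the answer corresponds to one slice of the mask tensor, which is why the single [SEG] above yields torch.Size([1, 1200, 900]). A sketch of that correspondence (the variables are stand-ins for what the script returns; I am assuming its internals here):

```python
import torch

# Stand-ins mirroring the run above; in chat_instance.py these come back
# from the model's generation step (an assumption about the script).
text_output = "Sure, the segmentation results are:- Apples- ... -[ [SEG] ]"
pred_masks = [torch.zeros(1, 1200, 900)]

# One mask slice per [SEG] token emitted in the answer text.
assert pred_masks[0].shape[0] == text_output.count("[SEG]")
```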

I still cannot get any segmentation results.

However, if I ask in the following way, I do get segmentation results, but of very low quality:
```
Please input your prompt: Please segment the fruit in the image one by one. + DEFAULT_INSTANT_SEG
Please input the image path: ./imgs/fruits.jpg
text_output: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: Please segment the fruit in the image one by one. + DEFAULT_INSTANT_SEG ASSISTANT: Sure, the segmentation results are [SEG] , [SEG] , and [SEG] .
len(pred_masks): 1
[x.shape for x in pred_masks]: [torch.Size([3, 1200, 900])]
./lisa_output/fruits_mask_0.jpg has been saved.
./lisa_output/fruits_mask_1.jpg has been saved.
./lisa_output/fruits_mask_2.jpg has been saved.
./lisa_output/fruits_masked_img.jpg has been saved.
```
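
For completeness, here is a minimal sketch of how the saved fruits_mask_*.jpg and fruits_masked_img.jpg outputs could be reproduced from a [N, H, W] mask tensor; the thresholding and blending here are my own guesses, not necessarily what the script actually does:

```python
import cv2
import numpy as np
import torch

# Stand-in for the real prediction: three 1200x900 mask logit maps.
pred_masks = [torch.randn(3, 1200, 900)]
image = cv2.imread("./imgs/fruits.jpg")  # assumed to be 1200x900 pixels

for i, mask in enumerate((pred_masks[0] > 0).cpu().numpy()):
    color = np.random.randint(0, 256, size=3)
    # Blend a translucent random color over the masked pixels.
    image[mask] = (0.5 * image[mask] + 0.5 * color).astype(np.uint8)
    cv2.imwrite(f"./lisa_output/fruits_mask_{i}.jpg", mask.astype(np.uint8) * 255)

cv2.imwrite("./lisa_output/fruits_masked_img.jpg", image)
```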

**Is there something I am potentially overlooking? I would greatly appreciate any reply. Thanks! Also, I only installed the packages necessary for inference, without the flash-attn module. Is flash-attn required for inference?**
