VLM-enriched object detection based navigation and fixed semantic navigation #967
base: dev
Conversation
Greptile Overview
Greptile Summary
This PR adds VLM-enriched object detection for precise navigation and fixes performance issues in semantic navigation.
Key Changes
New Navigation Skills:
- navigate_to_detected_object(): Navigate to objects detected by the vision system using an ObjectDB lookup
- list_detected_objects(): Query and list all detected objects for situational awareness
VLM Enrichment:
- ObjectDB now uses VLM to enrich YOLO labels (e.g., "person" → "person in white shirt")
- VLM enrichment occurs on object creation and re-enriches every 10 detections
- Fallback to YOLO labels when VLM fails or is disabled
Performance Fixes:
- RPC methods now return lightweight dicts instead of full Object3D objects (fixes 120s timeouts)
- SpatialMemory no longer fetches images from visual memory during queries
- Commented out verbose logging
Frame ID Changes:
- Changed TF frame from "map" to "world" across navigation and spatial perception modules
New Blueprint:
- Added agentic_detection blueprint combining detection, spatial memory, and agent capabilities
Critical Issue
The Object3D initialization never sets yolo_label from detection.name, causing the entire VLM enrichment and object tracking system to fail with None values.
Confidence Score: 1/5
- Critical bug will cause object detection and navigation to fail completely
- Object3D.__init__ never initializes yolo_label from detection.name, leaving it as None. This breaks: (1) object-tracking matching logic that compares yolo_label, (2) VLM enrichment that uses yolo_label in prompts and fallbacks, (3) object name assignment that falls back to yolo_label. The entire object detection navigation feature will fail.
- dimos/perception/detection/moduleDB.py - Missing yolo_label initialization in Object3D.__init__
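A minimal sketch of the fix the review is asking for; the class and field names follow the review, but the exact constructor signature is assumed:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Detection:
    name: str  # raw YOLO class label, e.g. "person"

class Object3D:
    def __init__(self, detection: Detection):
        # The missing line: seed yolo_label from the detection so that
        # tracking, VLM prompts, and fallbacks never see None
        self.yolo_label: Optional[str] = detection.name
        self.name: Optional[str] = None  # later overwritten by the VLM-enriched label

obj = Object3D(Detection(name="person"))
```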
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| dimos/perception/detection/moduleDB.py | 1/5 | Critical bug: Object3D.__init__ never initializes yolo_label from detection.name, causing None values throughout VLM enrichment and object tracking logic |
| dimos/agents/skills/navigation.py | 4/5 | Added navigate_to_detected_object and list_detected_objects skills. Changed frame_id from "map" to "world" throughout |
| dimos/perception/spatial_perception.py | 4/5 | Changed SpatialMemory base class to SkillModule, updated frame_id from "map" to "world", added 1-second sleep in start(), improved error handling |
| dimos/agents_deprecated/memory/spatial_vector_db.py | 5/5 | Commented out image fetching and logging to fix RPC timeouts (performance optimization) |
| dimos/robot/all_blueprints.py | 5/5 | Added new blueprint "unitree-go2-agentic-detection" to registry |
| dimos/robot/unitree_webrtc/unitree_go2_blueprints.py | 5/5 | Added agentic_detection blueprint combining detection, spatial_memory, utilization, llm_agent, and common agentic components |
Sequence Diagram
sequenceDiagram
participant User
participant Agent
participant NavigationSkill
participant ObjectDB
participant VLM
participant Navigator
User->>Agent: "Navigate to person in white shirt"
Agent->>NavigationSkill: navigate_to_detected_object("person in white shirt")
NavigationSkill->>ObjectDB: lookup("person in white shirt")
Note over ObjectDB: Search objects by name<br/>(matches VLM-enriched labels)
alt Object found
ObjectDB-->>NavigationSkill: [{"name": "person in white shirt", "pos_x": 1.2, "pos_y": 3.4, ...}]
NavigationSkill->>Navigator: set_goal(PoseStamped)
Navigator-->>NavigationSkill: Navigation complete
NavigationSkill-->>Agent: "Successfully navigated to 'person in white shirt'"
else Object not found
ObjectDB-->>NavigationSkill: []
NavigationSkill-->>Agent: "No detected object matching..."
end
Note over User,Navigator: Parallel: Detection & Enrichment Flow
participant Camera
participant Detection3D
Camera->>Detection3D: Image + PointCloud
Detection3D->>ObjectDB: Detection3DPC (name="person")
alt New object
ObjectDB->>ObjectDB: create_new_object()
Note over ObjectDB: BUG: yolo_label not set!<br/>Should be: yolo_label = detection.name
ObjectDB->>VLM: _enrich_with_vlm(obj)
Note over VLM: Uses obj.yolo_label (None!)<br/>in prompt and fallbacks
VLM-->>ObjectDB: rich_label
ObjectDB->>ObjectDB: Store as Object3D<br/>(name=rich_label, yolo_label=None)
else Existing object
ObjectDB->>ObjectDB: add_to_object()
alt Every 10 detections
ObjectDB->>VLM: Re-enrich
VLM-->>ObjectDB: Updated rich_label
end
end
dimos/agents/skills/navigation.py
Outdated
```python
except Exception:
    return f"Error querying ObjectDB for '{object_name}'"
```
[P3] Bare except Exception: swallows the actual error. Consider logging the exception for debugging: except Exception as e: logger.error(f"Error querying ObjectDB: {e}") before returning the error message.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
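A sketch of the suggested pattern, with a stand-in for the real ObjectDB RPC call:

```python
import logging

logger = logging.getLogger(__name__)

def query_object_db(object_name: str):
    # Stand-in for the real ObjectDB RPC, which may raise on timeout etc.
    raise RuntimeError("RPC timeout")

def lookup_object(object_name: str) -> str:
    try:
        query_object_db(object_name)
        return f"Found '{object_name}'"
    except Exception as e:
        # Log the underlying error instead of silently swallowing it
        logger.error(f"Error querying ObjectDB for '{object_name}': {e}")
        return f"Error querying ObjectDB for '{object_name}'"
```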
Additional Comments (1)
Greptile Overview
Greptile Summary
This PR introduces VLM-enriched object detection for navigation, enabling the robot to navigate to specific objects with rich descriptions (e.g., "person in white shirt" instead of just "person"). Key improvements include:
New Features:
- VLM enrichment pipeline that enhances YOLO labels with detailed descriptions using Qwen VLM
- Two new navigation skills: navigate_to_detected_object() for object-based navigation and list_detected_objects() for environment awareness
- New agentic_detection blueprint combining detection, navigation, spatial memory, and agent capabilities
- Lightweight RPC methods (lookup, get_all_detected_objects) that return dicts instead of full Object3D instances
Performance Fixes:
- RPC timeout fix: returning lightweight dicts reduces response time from 120s to 0.2s
- SpatialMemory improvements: TF timing fixes and image fetching removal reduce latency from 120s to 0.15s
- Reduced logging verbosity in spatial memory components
Architecture Changes:
- Frame reference migration: Changes from "map" frame to "world" frame across navigation and spatial perception
- SpatialMemory now extends SkillModule instead of Module
- Object tracking now separates YOLO labels (for matching) from VLM labels (for display)
The changes enable more precise object-based navigation by allowing agents to distinguish between similar objects based on visual characteristics.
Confidence Score: 1/5
- Critical runtime error will occur when agent_encode() is called
- The PR contains a P0 blocking bug in moduleDB.py (line 311) where len(obj.detections) is called on an integer field. This will cause a TypeError at runtime whenever agent_encode() is invoked. Additionally, there are P1 type-safety issues with potential None returns that violate the function contract. While the VLM enrichment feature and performance improvements are valuable, the critical bug must be fixed before merging.
- dimos/perception/detection/moduleDB.py requires immediate attention due to the TypeError bug in agent_encode()
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| dimos/agents/skills/navigation.py | 3/5 | Adds two new navigation skills (navigate_to_detected_object, list_detected_objects) and changes frame references from "map" to "world". Contains typos in error messages. |
| dimos/perception/detection/moduleDB.py | 1/5 | Major changes: VLM enrichment for object labels, new RPC methods (lookup, get_all_detected_objects), and lightweight dict returns. Critical bug: len() called on int in agent_encode(). |
| dimos/perception/spatial_perception.py | 4/5 | Changes SpatialMemory to extend SkillModule, adds 1s startup delay, changes frame references from "map" to "world", reduces logging verbosity. Minor warning message inconsistency. |
| dimos/agents_deprecated/memory/spatial_vector_db.py | 4/5 | Comments out logging and image retrieval from visual memory to improve performance. |
| dimos/robot/unitree_webrtc/unitree_go2_blueprints.py | 5/5 | Adds new agentic_detection blueprint combining detection with navigation, spatial memory, and agent capabilities. |
Sequence Diagram
sequenceDiagram
participant Agent
participant NavigationSkill
participant ObjectDB
participant VLM as Qwen VLM
participant Detection as YOLO Detection
participant Nav as Navigation
Detection->>ObjectDB: Detection3DPC (YOLO label)
ObjectDB->>ObjectDB: Check closest_object()
alt New Object
ObjectDB->>VLM: _enrich_with_vlm(obj)
VLM-->>ObjectDB: Rich description
ObjectDB->>ObjectDB: Create Object3D<br/>(yolo_label + vlm_label)
else Existing Object (every 10 detections)
ObjectDB->>ObjectDB: Add to existing
ObjectDB->>VLM: _enrich_with_vlm(obj)
VLM-->>ObjectDB: Updated description
end
Agent->>NavigationSkill: "Navigate to person in white"
NavigationSkill->>ObjectDB: lookup("person in white")
ObjectDB-->>NavigationSkill: [dict with pos_x, pos_y, frame_id]
NavigationSkill->>NavigationSkill: Create PoseStamped
NavigationSkill->>Nav: _navigate_to(goal_pose)
Nav-->>NavigationSkill: Success/Failure
NavigationSkill-->>Agent: "Successfully navigated to..."
Agent->>NavigationSkill: "What do you see?"
NavigationSkill->>ObjectDB: get_all_detected_objects()
ObjectDB-->>NavigationSkill: [list of dicts]
NavigationSkill-->>Agent: "I've detected N objects..."
```diff
 # we need at least 3 detectieons to consider it a valid object
 # for this to be serious we need a ratio of detections within the window of observations
-if len(obj.detections) < 4:  # type: ignore[arg-type]
+if len(obj.detections) < 4:
```
[P0] obj.detections is an int, not a list. Calling len() on it will cause TypeError: object of type 'int' has no len() at runtime. Should be if obj.detections < 4: instead.
Suggested change:
```diff
-if len(obj.detections) < 4:
+if obj.detections < 4:
```
```python
if not self.enable_vlm_enrichment or self.vlm_model is None:
    return obj.yolo_label  # Fall back to YOLO label
```
[P1] Function signature promises str return, but returns obj.yolo_label which is typed as str | None. If yolo_label is None (possible per field definition), this violates the type contract and could cause issues in callers expecting a string.
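One way to honor the `str` contract, sketched with an assumed default label:

```python
from typing import Optional

def fallback_label(yolo_label: Optional[str]) -> str:
    # Guarantee a str return even if yolo_label was never initialized
    return yolo_label if yolo_label is not None else "unknown object"
```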
```python
logger.warning(f"No image for {obj.track_id}, using YOLO label")
return obj.yolo_label
```

```python
prompt = f"Describe this {obj.yolo_label} in detail. Include color, appearance, and distinguishing features. Keep it concise (under 10 words)."
```
[P2] If obj.yolo_label is None, the prompt becomes "Describe this None in detail...", which is semantically incorrect. Should handle None case or assert yolo_label is not None before this point.
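A None-safe version of the prompt construction; the generic "object" fallback is an assumption, not the PR's behavior:

```python
from typing import Optional

def build_prompt(yolo_label: Optional[str]) -> str:
    # Avoid "Describe this None in detail..." when the label is missing
    label = yolo_label if yolo_label else "object"
    return (
        f"Describe this {label} in detail. Include color, appearance, "
        f"and distinguishing features. Keep it concise (under 10 words)."
    )
```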
```python
    return

if tf is None:
    logger.warning("No TF transform (map -> base_link) available")
```
[P3] Warning message says "(map -> base_link)" but code at line 223 actually requests "world" to "base_link" transform. Message should match the actual frame names.
Suggested change:
```diff
-logger.warning("No TF transform (map -> base_link) available")
+logger.warning("No TF transform (world -> base_link) available")
```
Additional Comments (2)
dimos/agents/skills/navigation.py
Outdated
```python
return f"Failed to reach '{object_name}'. Navigation was cancelled or failed."

@skill()
def list_detected_objects(self) -> str:
```
This should be in the ObjectDBModule; otherwise, if you have a blueprint without that module, it will fail.
dimos/agents/skills/navigation.py
Outdated
```python
# Get objects
try:
    get_all_rpc = self.get_rpc_calls("ObjectDBModule.get_all_detected_objects")
```
Yeah, see here: this is a hardcoded ObjectDBModule reference. Not good practice; it's easier to just put this skill in there directly.
```python
    agentic,
    MCPModule.blueprint(),
)
agentic_detection = autoconnect(
```
good
```python
    return "\n".join(lines)

def navigate_with_text(self, query: str) -> str:
```
Suggested change:
```diff
+@skill()
 def navigate_with_text(self, query: str) -> str:
```
…itance
- Add DetectedObjectInfo dataclass for passing object data to navigation (can move to dimos.msgs if needed)
- Improve lookup with hybrid substring/token matching
- Clean up skill logging
@greptile
Greptile Overview
Greptile Summary
Overview
This PR adds VLM-enriched object detection for precise robot navigation. Objects detected by YOLO are enhanced with rich descriptions from a Vision Language Model (e.g., "person" → "person in white shirt"), enabling navigation to specific objects like "go to the red coffee mug" or "navigate to person in white."
Key Features:
Architecture:
Critical Issues Found
🔴 Object tracking completely broken (moduleDB.py)
🔴 Navigation always reports failure (navigation.py): type mismatch in the result check
🔴 TypeError in object listing (moduleDB.py): calling len() on an integer field
Performance Improvements
The RPC optimizations are well-executed:
Confidence Score: 1/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant Agent as LLM Agent
participant NavSkill as NavigationSkillContainer
participant ObjectDB as ObjectDBModule
participant VLM as QwenVlModel
participant Detection as Detection3DModule
participant Nav as NavigationInterface
Note over Agent,Nav: VLM-Enriched Object Detection Navigation
Detection->>ObjectDB: ImageDetections3DPC
ObjectDB->>ObjectDB: add_detection(detection)
ObjectDB->>ObjectDB: create_new_object(detection)
ObjectDB->>VLM: query(image, prompt)
VLM-->>ObjectDB: rich_label (e.g., "person in white shirt")
ObjectDB->>ObjectDB: Store object with yolo_label + vlm_label
Agent->>NavSkill: navigate_to_detected_object("person in white")
NavSkill->>ObjectDB: lookup("person in white")
ObjectDB->>ObjectDB: Match objects by name/label
ObjectDB-->>NavSkill: [DetectedObjectInfo]
NavSkill->>NavSkill: Create goal_pose from obj.pose
NavSkill->>Nav: set_goal(goal_pose)
loop Poll navigation state
NavSkill->>Nav: get_state()
Nav-->>NavSkill: NavigationState
end
NavSkill->>Nav: is_goal_reached()
Nav-->>NavSkill: bool (success/failure)
NavSkill-->>Agent: "Successfully navigated to 'person in white shirt'"
Note over Agent,NavSkill: Alternative: List Detected Objects
Agent->>ObjectDB: list_detected_objects()
ObjectDB-->>Agent: "I've detected 5 object(s):\n- person in white shirt (12 detections)\n- red coffee mug (8 detections)"
5 files reviewed, 5 comments
dimos/agents/skills/navigation.py
Outdated
```python
result = self._navigate_to(goal_pose)

if result == "SUCCESS":
```
Type mismatch: _navigate_to() returns bool (True/False), but this code compares it to the string "SUCCESS". This condition will always be False, causing navigation to always report failure even when successful.
The method should check if result: instead of if result == "SUCCESS":.
Suggested change:
```diff
-if result == "SUCCESS":
+if result:
```
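A tiny repro of why the comparison always fails; the navigator here is a stub:

```python
def _navigate_to(goal_pose) -> bool:
    # Stub standing in for the real navigator, which returns a bool
    return True

result = _navigate_to(None)
buggy_check = (result == "SUCCESS")  # a bool never equals a string
fixed_check = bool(result)
```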
```diff
 # we need at least 3 detectieons to consider it a valid object
 # for this to be serious we need a ratio of detections within the window of observations
-if len(obj.detections) < 4:  # type: ignore[arg-type]
+if len(obj.detections) < 4:
```
obj.detections is an int, not a collection, so calling len() on it will raise TypeError: object of type 'int' has no len().
Suggested change:
```diff
-if len(obj.detections) < 4:
+if obj.detections < 4:
```
```python
"""Convert Object3D to DetectedObjectInfo."""
try:
    return DetectedObjectInfo(
        track_id=obj.track_id,
```
DetectedObjectInfo expects track_id: str (non-optional), but obj.track_id can be None (type is str | None from line 56). If track_id is None, this will cause a type error when creating the dataclass instance.
While the exception handler at line 366 will catch this and return None, objects with None track_ids should be filtered out before calling _to_detection_info() to avoid unnecessary exceptions.
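The suggested pre-filter, sketched with a minimal stand-in type:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrackedObject:
    track_id: Optional[str]

def objects_with_track_ids(objects: List[TrackedObject]) -> List[TrackedObject]:
    # Skip objects whose track_id is None instead of relying on the
    # except handler in _to_detection_info() to discard them
    return [o for o in objects if o.track_id is not None]
```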
```python
    return

if tf is None:
    logger.warning("No TF transform (map -> base_link) available")
```
Error message refers to "map -> base_link" but the code at line 223 actually queries "world -> base_link". The error message should be updated to match the actual frame IDs being used.
Suggested change:
```diff
-logger.warning("No TF transform (map -> base_link) available")
+logger.warning("No TF transform (world -> base_link) available")
```
Additional Comments (1)
```diff
 if not self._skill_started:
     raise ValueError(f"{self} has not been started.")
-tf = self.tf.get("map", "base_link", time_tolerance=2.0)
+tf = self.tf.get("world", "base_link", time_tolerance=2.0)
```
This was a breaking change that I think was never tested or caught.
```python
semantic map for a location matching the description.
CALL THIS SKILL FOR ONE SUBJECT AT A TIME. For example: "Go to the person wearing a blue shirt in the living room",
you should call this skill twice, once for the person wearing a blue shirt and once for the living room.

def navigate_to_detected_object(self, object_name: str) -> str:
```
object_name can't just be a string; it needs a proper type.
```python
if not self._skill_started:
    raise ValueError(f"{self} has not been started.")

lookup_rpc = self.get_rpc_calls("ObjectDBModule.lookup")
```
@paul-nechifor we need to fix this RPC handling for modules; it's super weird and hard-coded.
```python
goal_pose = PoseStamped(
    position=obj.pose.position,
    orientation=Quaternion(),
```
Let's calculate goal_pose via a separate method. This is the goal point, which is just a position. goal_pose requires /global_map, /odom, and the /goal_point; it picks the closest point within some buffer (accounting for the size of the robot) that is not in collision, and then points the right way using trig.
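A rough sketch of the orientation part of that idea: stop short of the target and face it using atan2. This omits the collision check against /global_map, and all names here are illustrative:

```python
import math
from dataclasses import dataclass

@dataclass
class Pose2D:
    x: float
    y: float
    yaw: float  # radians

def compute_goal_pose(robot: Pose2D, target_x: float, target_y: float,
                      standoff: float = 0.5) -> Pose2D:
    """Stop `standoff` meters short of the target, oriented toward it."""
    dx, dy = target_x - robot.x, target_y - robot.y
    dist = math.hypot(dx, dy)
    yaw = math.atan2(dy, dx)  # face the target
    if dist <= standoff:
        # Already within the buffer: keep position, just turn toward the target
        return Pose2D(robot.x, robot.y, yaw)
    ratio = (dist - standoff) / dist
    return Pose2D(robot.x + dx * ratio, robot.y + dy * ratio, yaw)
```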
```python
if success_msg:
    return success_msg

logger.info(f"No tagged location found for {query}")
```
revert
```python
goal_pose = PoseStamped(
    position=make_vector3(*robot_location.position),
    orientation=Quaternion.from_euler(Vector3(*robot_location.rotation)),
    frame_id="map",
```
bug fix from frame thing
```diff
 )
-logger.info(f"Added image vector {vector_id} with metadata: {metadata}")
+#logger.info(f"Added image vector {vector_id} with metadata: {metadata}")
```
revert
```python
logger = setup_logger()

@dataclass
class DetectedObjectInfo:
```
Use the Detection2D type, or extend it (maybe, but probably not) with something super minimal.
```python
    self._vlm_model = QwenVlModel()
    return self._vlm_model

def _enrich_with_vlm(self, obj: Object3D) -> str:
```
This can't be here. It needs to be in the VLM code in models/vl.
read base.py
Changes