The ActionPlanner has been simplified to use a deterministic workflow approach. No more complex multi-step reasoning - just clean, predictable paths based on captcha type.
- Classify the captcha type (checkbox, image_selection, drag_puzzle, text)
- Execute a deterministic workflow using only two tools:
  - `detect(prompt)` - Find objects in the image (returns bounding boxes)
  - `point(prompt)` - Find a specific location (returns x, y coordinates)
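As a rough sketch, stub versions of the two tools (with made-up return values) illustrate the shapes the examples below assume - a real implementation would query a vision model instead:

```python
def detect(prompt: str) -> list[tuple[int, int, int, int]]:
    """Find all objects matching the prompt.

    Returns bounding boxes as (x1, y1, x2, y2) tuples.
    Stub: hard-coded boxes stand in for a vision-model call.
    """
    return [(10, 10, 50, 50), (60, 10, 100, 50)]


def point(prompt: str) -> tuple[int, int]:
    """Find a single location matching the prompt.

    Returns an (x, y) coordinate.
    Stub: a hard-coded point stands in for a vision-model call.
    """
    return (30, 30)
```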
No planning needed - just call the tool directly:

```python
from src.planner import ActionPlanner

planner = ActionPlanner(backend="gemini")

# No need to classify - just find the checkbox
# In your solver:
# location = point_tool("checkbox center")
# click(location)
```

Deterministic path:
- Classify → get instruction
- Get detection target
- Detect all instances
- Click each one
```python
planner = ActionPlanner(backend="gemini")

# Step 1: Classify
classification = planner.classify("captcha.png")
# Returns: {"type": "image_selection", "instruction": "Select all traffic lights", "reasoning": "..."}

# Step 2: Get what to detect
target = planner.get_detection_target(
    instruction=classification["instruction"],
    image_path="captcha.png",
)
# Returns: "traffic light"

# Steps 3 & 4: In your solver
# boxes = detect_tool(target)  # Returns [(x1, y1, x2, y2), ...]
# for box in boxes:
#     center = get_center(box)
#     click(center)
```

Deterministic path:
- Classify → get instruction
- Get drag prompts (what to drag, where to drag)
- Point to find source
- Point to find destination
- Drag from source to destination
```python
planner = ActionPlanner(backend="gemini")

# Step 1: Classify
classification = planner.classify("captcha.png")
# Returns: {"type": "drag_puzzle", "instruction": "Drag the piece to complete the puzzle", ...}

# Step 2: Get prompts for point() tool
prompts = planner.get_drag_prompts(
    instruction=classification["instruction"],
    image_path="captcha.png",
)
# Returns: {"draggable_prompt": "puzzle piece", "destination_prompt": "empty slot"}

# Steps 3-5: In your solver
# source = point_tool(prompts["draggable_prompt"])
# destination = point_tool(prompts["destination_prompt"])
# drag(source, destination)
```

Deterministic path:
- Classify
- Read text
- Type it
```python
planner = ActionPlanner(backend="gemini")

# Step 1: Classify
classification = planner.classify("captcha.png")
# Returns: {"type": "text", "instruction": null, ...}

# Step 2: Read text
text = planner.read_text("captcha.png")
# Returns: "XyZ123"

# Step 3: In your solver
# type_text(text)
```

```python
planner = ActionPlanner(
    backend="gemini",     # or "ollama"
    model=None,           # Auto-selected based on backend
    gemini_api_key=None,  # Or set GEMINI_API_KEY env var
)
```

`classify(image_path)` - Classify the captcha type.

Returns:

```python
{
    "type": "checkbox" | "image_selection" | "drag_puzzle" | "text",
    "instruction": "instruction text or None",
    "reasoning": "brief explanation"
}
```

`get_detection_target(instruction, image_path)` - For image_selection captchas, get the object class to detect.
Example: "Select all traffic lights" → "traffic light"
`get_drag_prompts(instruction, image_path)` - For drag_puzzle captchas, get prompts for the point() tool.

Returns:

```python
{
    "draggable_prompt": "what to drag",
    "destination_prompt": "where to drag to"
}
```

`read_text(image_path)` - For text captchas, read the distorted text.

Returns: The text string to type.
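The complete example below calls a `get_center` helper that is not part of the planner. A minimal version, assuming boxes come back as `(x1, y1, x2, y2)` pixel tuples, might look like:

```python
def get_center(box: tuple[int, int, int, int]) -> tuple[int, int]:
    """Return the center point (x, y) of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, (y1 + y2) // 2)
```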
```python
from src.planner import ActionPlanner

def solve_captcha(image_path: str):
    planner = ActionPlanner(backend="gemini")

    # Step 1: Classify
    classification = planner.classify(image_path)
    captcha_type = classification["type"]
    instruction = classification["instruction"]

    # Step 2: Execute deterministic workflow
    if captcha_type == "checkbox":
        # Direct approach - no planner needed
        location = point("checkbox center")
        click(location)
    elif captcha_type == "image_selection":
        # Get what to detect
        target = planner.get_detection_target(instruction, image_path)
        # Detect and click all
        boxes = detect(target)
        for box in boxes:
            click(get_center(box))
    elif captcha_type == "drag_puzzle":
        # Get drag prompts
        prompts = planner.get_drag_prompts(instruction, image_path)
        # Find and drag
        source = point(prompts["draggable_prompt"])
        destination = point(prompts["destination_prompt"])
        drag(source, destination)
    elif captcha_type == "text":
        # Read and type
        text = planner.read_text(image_path)
        type_text(text)
```

- Predictable - Same input = same workflow path
- Debuggable - Easy to see what went wrong at each step
- Testable - Each method can be tested independently
- Simple - No complex multi-step reasoning or self-questioning
- Fast - Minimal LLM calls needed
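Because each step is a separate method, the workflow logic can be exercised without any LLM calls at all - for example, by stubbing the planner. A hypothetical sketch using `unittest.mock`:

```python
from unittest.mock import MagicMock

# Hypothetical test setup: stub the planner so the image_selection
# workflow can be verified without a model backend.
planner = MagicMock()
planner.classify.return_value = {
    "type": "image_selection",
    "instruction": "Select all traffic lights",
    "reasoning": "grid of tiles with a text instruction",
}
planner.get_detection_target.return_value = "traffic light"

# Drive the same two planner steps the solver would run.
classification = planner.classify("captcha.png")
assert classification["type"] == "image_selection"

target = planner.get_detection_target(
    instruction=classification["instruction"],
    image_path="captcha.png",
)
assert target == "traffic light"
```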
Old way (complex):

```python
action = planner.plan(image_path, context, elements, prompt_text)
# Returns complex PlannedAction with many optional fields
```

New way (simple):

```python
# Step 1: Classify
classification = planner.classify(image_path)

# Step 2-N: Deterministic workflow based on type
if classification["type"] == "checkbox":
    # point("checkbox center") → click
    ...
```