
Agent Development Guidelines

Project Overview

Goal: This agent safely answers data questions by querying a Postgres database, returning structured results, and rendering them in a UI. Its behavior is validated using scenario tests and end-to-end UI tests.

Framework: Mastra
Language: TypeScript

This project follows the Better Agents standard for building production-ready AI agents.

Agent Capabilities

The data analytics agent provides the following features:

  • Natural Language Querying: Accepts user questions in plain English about data stored in a Postgres database
  • Safe SQL Generation: Generates and executes SQL queries safely, preventing injection attacks and unauthorized access
  • Structured Results: Returns query results in structured formats (JSON, tables, charts)
  • UI Rendering: Displays results in an interactive web interface
  • Multi-turn Conversations: Supports follow-up questions and refinements to queries
  • Error Handling: Gracefully handles invalid queries, database errors, and provides helpful feedback
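The "Safe SQL Generation" capability can be sketched as a pre-execution guard. This is an illustrative sketch only; the function name and allow-list rules are assumptions, not this project's actual implementation:

```typescript
// Illustrative guard for generated SQL before it reaches Postgres.
// The rules below are assumptions for the sketch, not the project's real policy.
function isSafeSelect(sql: string): boolean {
  const stmt = sql.trim().replace(/;\s*$/, "");
  if (stmt.includes(";")) return false; // reject multi-statement payloads
  if (!/^select\b/i.test(stmt)) return false; // allow read-only SELECTs only
  if (/--|\/\*/.test(stmt)) return false; // reject comment-based smuggling
  return true;
}
```

User-supplied values should still be passed as parameters rather than interpolated, e.g. `client.query("SELECT * FROM orders WHERE id = $1", [id])` with node-postgres.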

Technical Architecture

  • Database Integration: Connects to Postgres databases with configurable connection parameters
  • AI Model: Uses Vercel AI SDK for natural language processing and SQL generation
  • Prompt Management: All prompts managed via LangWatch Prompt CLI for versioning and optimization
  • Testing: Comprehensive scenario tests for end-to-end validation
  • Instrumentation: Full LangWatch integration for monitoring and analytics

Core Principles

1. Scenario Agent Testing

Scenario allows for end-to-end validation of multi-turn conversations and real-world scenarios. Most agent functionality should be tested with Scenario tests, and these MUST be created and maintained strictly using the LangWatch MCP (do not access external Scenario docs).

CRITICAL: Every new agent feature MUST be tested with Scenario tests (use LangWatch MCP to access the docs) before considering it complete.

  • Write simulation tests for multi-turn conversations
  • Validate edge cases
  • Ensure business value is delivered
  • Test different conversation paths

Best practices:

  • NEVER check for regex or word matches in the agent's response; use judge criteria instead
  • Use functions in the Scenario scripts for things that can be checked deterministically (tool calls, database entries, etc.) instead of relying on the judge
  • For everything else, use judge criteria to check whether the agent reaches the desired goal
  • When something breaks, run a single scenario at a time to debug and iterate faster, not the whole suite
  • Write as few scenarios as possible and cover more ground with each one, as they are heavy to run
  • If the user made one request, a single scenario might be enough; run it at the end of the implementation to check that it works
  • ALWAYS consult the Scenario docs through the LangWatch MCP on how to install and write scenarios.
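The split between deterministic checks and judge criteria can be sketched like this. The message shape and tool name below are hypothetical; take the real Scenario types and API from the LangWatch MCP docs:

```typescript
// Hypothetical shapes for illustration only; the real Scenario API comes
// from the LangWatch MCP docs, not from this sketch.
type RecordedToolCall = { toolName: string; args: Record<string, unknown> };

// Deterministic check: run as a function step in the Scenario script.
function assertSqlToolCalled(calls: RecordedToolCall[]): void {
  if (!calls.some((c) => c.toolName === "run_sql_query")) {
    throw new Error("expected the agent to call run_sql_query");
  }
}

// Qualitative goals: leave these to judge criteria instead of regexes.
const judgeCriteria = [
  "Agent answers the question using the returned query results",
  "Agent does not fabricate numbers that are not in the results",
];
```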

2. Prompt Management

ALWAYS use LangWatch Prompt CLI for managing prompts:

  • Use the LangWatch MCP to learn about prompt management, search for Prompt CLI docs
  • Never hardcode prompts in your application code
  • Store all prompts in the prompts/ directory as YAML files; use "langwatch prompt create" to create a new prompt
  • Run langwatch prompt sync after changing a prompt to update the registry

Example prompt structure:

# prompts/my_prompt.yaml
model: gpt-4o
temperature: 0.7
messages:
  - role: system
    content: |
      Your system prompt here
  - role: user
    content: |
      {{ user_input }}

DO NOT use hardcoded prompts in your application code. Example:

BAD:

new Agent({ prompt: "You are a helpful assistant." })

GOOD:

import { LangWatch } from "langwatch";

const langwatch = new LangWatch({
  apiKey: process.env.LANGWATCH_API_KEY
});

const prompt = await langwatch.prompts.get("my_prompt");
new Agent({ prompt: prompt!.prompt });

Prompt fetching is very reliable when using the Prompt CLI because the files are local (double-check they were created with the CLI and are listed in the prompts.json file). DO NOT add try/catch around it, and DO NOT duplicate the prompt inline as a fallback.

Explore the prompt management Get Started and Data Model docs if you need more advanced usage, such as compiled prompts with variables or message lists.

3. Evaluations for specific cases

Only write evaluations for specific cases:

  • When a RAG is implemented, so we can evaluate the accuracy given many sample queries (using an LLM to compare expected with generated outputs)
  • For classification tasks, e.g. categorization, routing, simple true/false detection, etc
  • When the user asks and you are sure an agent scenario wouldn't test the behaviour better

This is because evaluations work best when you have many examples with a very clear definition of what is correct and what is not (that is, you can simply compare expected with generated outputs) and you are looking at single input/output pairs. This is not the case for multi-turn agent flows.

Create evaluations in Jupyter notebooks under tests/evaluations/:

  • Generate CSV example datasets yourself, with plenty of examples, to be read by pandas
  • Use the LangWatch Evaluations API to create evaluation notebooks and track the evaluation results
  • Use either a simple == comparison or a direct LLM call (e.g. OpenAI) to compare expected with generated outputs, if possible and not requested otherwise
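The "simple == comparison" scoring idea amounts to an exact-match pass rate over the dataset. The notebooks themselves live in tests/evaluations/ and use Python/pandas; the comparison logic is sketched here in TypeScript for illustration, and the Example shape is an assumption:

```typescript
// Sketch of exact-match scoring over expected vs generated outputs.
// The Example shape is an assumption for illustration.
type Example = { input: string; expected: string; generated: string };

function exactMatchPassRate(examples: Example[]): number {
  if (examples.length === 0) return 0;
  const passed = examples.filter(
    (e) => e.expected.trim() === e.generated.trim(),
  ).length;
  return passed / examples.length;
}
```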

4. General good practices

  • ALWAYS use the package manager CLI commands to init, add, and install new dependencies. DO NOT guess package versions, and DO NOT add them to the dependencies file by hand.
  • When setting up, remember to load dotenv for the tests so env vars are available
  • Double-check the guidelines in AGENTS.md at the end of the implementation.

Framework-Specific Guidelines

Vercel AI SDK Framework

Always use the Vercel MCP for learning:

  • The Vercel MCP server provides real-time documentation for Vercel AI SDK
  • Ask it questions about Vercel AI SDK APIs and best practices
  • Follow Vercel AI SDK's recommended patterns for agent development

When implementing agent features:

  1. Consult the Vercel MCP: "How do I [do X] in Vercel AI SDK?"
  2. Use Vercel AI SDK's unified provider architecture
  3. Follow Vercel AI SDK's TypeScript patterns and conventions
  4. Leverage Vercel AI SDK's framework integrations (Next.js, React, Svelte, Vue, Node.js)

Initial setup:

  1. Use pnpm init to create a new project
  2. Install dependencies: pnpm add ai @ai-sdk/openai (or other provider packages like @ai-sdk/anthropic, @ai-sdk/google)
  3. Set up TypeScript configuration
  4. Proceed with the user's request to implement the agent and test it out
  5. Run the agent using pnpm tsx src/index.ts or integrate with your chosen framework

Key Concepts:

  • Unified Provider Architecture: Consistent interface across multiple AI model providers
  • generateText: Generate text using any supported model
  • streamText: Stream text responses for real-time interactions
  • Framework Integration: Works with Next.js, React, Svelte, Vue, and Node.js

Project Structure

This project follows a standardized structure for production-ready agents:

|__ app/              # Main application code
|__ prompts/          # Versioned prompt files (YAML)
|_____ *.yaml
|__ tests/
|_____ evaluations/   # Jupyter notebooks for component evaluation
|________ *.ipynb
|_____ scenarios/     # End-to-end scenario tests
|________ *.test.ts
|__ prompts.json      # Prompt registry
|__ .env              # Environment variables (never commit!)

Development Workflow

When Starting a New Feature:

  1. Understand Requirements: Clarify what the agent should do
  2. Design the Approach: Plan which components you'll need
  3. Implement with Prompts: Use LangWatch Prompt CLI to create/manage prompts
  4. Write Unit Tests: Test deterministic components
  5. Create Evaluations: Build evaluation notebooks for probabilistic components
  6. Write Scenario Tests: Create end-to-end tests using Scenario
  7. Run Tests: Verify everything works before moving on

Always:

  • ✅ Version control your prompts
  • ✅ Write tests for new features
  • ✅ Use LangWatch MCP to learn best practices and to work with Scenario tests and evaluations
  • ✅ Follow the Agent Testing Pyramid
  • ✅ Document your agent's capabilities

Never:

  • ❌ Hardcode prompts in application code
  • ❌ Skip testing new features
  • ❌ Commit API keys or sensitive data
  • ❌ Optimize without measuring (use evaluations first)

Using LangWatch MCP

The LangWatch MCP server provides expert guidance on:

  • Prompt management with Prompt CLI
  • Writing and maintaining Scenario tests (use LangWatch MCP to learn)
  • Creating evaluations
  • Best practices for agent development

The MCP will provide up-to-date documentation and examples. For Scenario specifically, always navigate its documentation and examples through the LangWatch MCP instead of accessing it directly.


Getting Started

  1. Set up your environment: Copy .env.example to .env and fill in your API keys
  2. Learn the tools: Ask the LangWatch MCP about prompt management and testing
  3. Start building: Implement your agent in the app/ directory
  4. Write tests: Create scenario tests for your agent's capabilities
  5. Iterate: Use evaluations to improve your agent's performance

Remember: Building production-ready agents means combining great AI capabilities with solid software engineering practices. Follow these guidelines to create agents that are reliable, testable, and maintainable.