Goal: This agent safely answers data questions by querying a Postgres database, returning structured results, and rendering them in a UI. Its behavior is validated using scenario tests and end-to-end UI tests.
Framework: Mastra
Language: TypeScript
This project follows the Better Agents standard for building production-ready AI agents.
The data analytics agent provides the following features:
- Natural Language Querying: Accepts user questions in plain English about data stored in a Postgres database
- Safe SQL Generation: Generates and executes SQL queries safely, preventing injection attacks and unauthorized access
- Structured Results: Returns query results in structured formats (JSON, tables, charts)
- UI Rendering: Displays results in an interactive web interface
- Multi-turn Conversations: Supports follow-up questions and refinements to queries
- Error Handling: Gracefully handles invalid queries, database errors, and provides helpful feedback
- Database Integration: Connects to Postgres databases with configurable connection parameters
- AI Model: Uses Vercel AI SDK for natural language processing and SQL generation
- Prompt Management: All prompts managed via LangWatch Prompt CLI for versioning and optimization
- Testing: Comprehensive scenario tests for end-to-end validation
- Instrumentation: Full LangWatch integration for monitoring and analytics
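As one illustration of the "Safe SQL Generation" feature above, generated SQL could pass through a read-only guard before it is executed. The sketch below is a minimal example under stated assumptions: the function name `isReadOnlyQuery` and the keyword list are hypothetical, not part of this project or any library. It is deliberately conservative and is meant to complement, not replace, a read-only database role and parameterized queries.

```typescript
// Hypothetical guard for illustration; the name and keyword list are assumptions.
const FORBIDDEN_KEYWORDS = [
  "insert", "update", "delete", "drop", "alter",
  "truncate", "create", "grant", "revoke", "copy",
];

function isReadOnlyQuery(sql: string): boolean {
  // Drop one trailing semicolon and normalize case for matching.
  const normalized = sql.trim().replace(/;\s*$/, "").toLowerCase();

  // Any remaining semicolon suggests multiple statements: reject.
  if (normalized.includes(";")) return false;

  // Only statements starting with SELECT or WITH (CTEs) are allowed.
  if (!/^(select|with)\b/.test(normalized)) return false;

  // Reject write or DDL keywords that appear as whole words anywhere.
  for (const keyword of FORBIDDEN_KEYWORDS) {
    if (new RegExp("\\b" + keyword + "\\b").test(normalized)) return false;
  }
  return true;
}
```

A guard like this errs on the side of rejecting: a harmless query containing a string literal such as 'delete' or a semicolon would also be refused, which is usually the safer failure mode for generated SQL.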
Scenario allows for end-to-end validation of multi-turn conversations and real-world scenarios. Most agent functionality should be tested with Scenario tests, and these MUST be created and maintained strictly using the LangWatch MCP (do not access external Scenario docs).
CRITICAL: Every new agent feature MUST be tested with Scenario tests (use LangWatch MCP to access the docs) before considering it complete.
- Write simulation tests for multi-turn conversations
- Validate edge cases
- Ensure business value is delivered
- Test different conversation paths
Best practices:
- NEVER check for regex or word matches in the agent's response, use judge criteria instead
- Use functions on the Scenario scripts for things that can be checked deterministically (tool calls, database entries, etc) instead of relying on the judge
- For the rest, use the judge criteria to check if agent is reaching the desired goal and
- When a test is broken, run a single scenario at a time to debug and iterate faster, not the whole suite
- Write as few scenarios as possible, try to cover more ground with few scenarios, as they are heavy to run
- If user made 1 request, just 1 scenario might be enough, run it at the end of the implementation to check if it works
- ALWAYS consult the Scenario docs through the LangWatch MCP on how to install and write scenarios.
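The deterministic checks mentioned in the best practices above (tool calls, database entries, etc) can be written as plain functions. The following is a minimal sketch: the `Message` and `ToolCall` shapes and the helper names are hypothetical stand-ins, not the Scenario API, so consult the Scenario docs through the LangWatch MCP for the real message types before adapting it.

```typescript
// Hypothetical message shapes for illustration; the real Scenario types may differ.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
  toolCalls?: ToolCall[];
}

// Collects every call to the named tool across the conversation.
function findToolCalls(messages: Message[], toolName: string): ToolCall[] {
  const found: ToolCall[] = [];
  for (const message of messages) {
    for (const call of message.toolCalls || []) {
      if (call.name === toolName) found.push(call);
    }
  }
  return found;
}

// Throws when the tool was never called, so the check fails deterministically
// instead of depending on the judge's interpretation.
function assertToolCalled(messages: Message[], toolName: string): void {
  if (findToolCalls(messages, toolName).length === 0) {
    throw new Error('Expected tool "' + toolName + '" to be called');
  }
}
```

A helper of this kind keeps the judge criteria focused on whether the agent reached the goal, while facts that can be verified mechanically are asserted in code.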
ALWAYS use LangWatch Prompt CLI for managing prompts:
- Use the LangWatch MCP to learn about prompt management, search for Prompt CLI docs
- Never hardcode prompts in your application code
- Store all prompts in the prompts/ directory as YAML files; use "langwatch prompt create" to create a new prompt
- Run "langwatch prompt sync" after changing a prompt to update the registry
Example prompt structure:
# prompts/my_prompt.yaml
model: gpt-4o
temperature: 0.7
messages:
  - role: system
    content: |
      Your system prompt here
  - role: user
    content: |
      {{ user_input }}

DO NOT use hardcoded prompts in your application code. Example:
BAD:
Agent(prompt="You are a helpful assistant.")

GOOD (Python):
import langwatch

prompt = langwatch.prompts.get("my_prompt")
Agent(prompt=prompt.prompt)

GOOD (TypeScript):
import { LangWatch } from "langwatch";

const langwatch = new LangWatch({
  apiKey: process.env.LANGWATCH_API_KEY,
});

const prompt = await langwatch.prompts.get("my_prompt");
// Agent is a placeholder for your framework's agent constructor
Agent({ prompt: prompt!.prompt });

Prompt fetching is very reliable when using the Prompt CLI because the files are local (double check that they were created with the CLI and are listed in the prompts.json file). DO NOT add try/catch around it and DO NOT duplicate the prompt here as a fallback.
Explore the prompt management "get started" and "data model" docs if you need more advanced usage, such as compiled prompts with variables or message lists.
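To make the idea of a prompt "compiled" with variables concrete, the sketch below shows what substituting a placeholder such as {{ user_input }} amounts to. This is an illustration only: `renderTemplate` is a hypothetical helper, not the LangWatch implementation, and the real compile API should be looked up through the LangWatch MCP.

```typescript
// Illustrative stand-in for prompt compilation; not the LangWatch API.
function renderTemplate(
  template: string,
  variables: Record<string, string>
): string {
  // Replace each {{ name }} placeholder with the matching variable value.
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, name: string) => {
    const value = variables[name];
    if (value === undefined) {
      // Fail loudly rather than sending a prompt with an unfilled placeholder.
      throw new Error("Missing prompt variable: " + name);
    }
    return value;
  });
}
```

Failing on a missing variable, rather than leaving the placeholder in place, surfaces template mistakes during tests instead of in production conversations.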
Only write evaluations for specific cases:
- When a RAG is implemented, so we can evaluate the accuracy given many sample queries (using an LLM to compare expected with generated outputs)
- For classification tasks, e.g. categorization, routing, simple true/false detection, etc
- When the user asks and you are sure an agent scenario wouldn't test the behaviour better
This is because evaluations work well when you have many examples with a very clear definition of what is correct and what is not (that is, you can simply compare expected with generated) and you are looking at single input/output pairs. This is not the case for multi-turn agent flows.
Create evaluations in Jupyter notebooks under tests/evaluations/:
- Generate csv example datasets yourself to be read by pandas with plenty of examples
- Use LangWatch Evaluations API to create evaluation notebooks and track the evaluation results
- Unless requested otherwise, use either a simple == comparison or, where needed, a direct LLM call (e.g. openai) to compare expected with generated
- ALWAYS use the package manager cli commands to init, add and install new dependencies, DO NOT guess package versions, DO NOT add them to the dependencies file by hand.
- When setting up, remember to load dotenv for the tests so env vars are available
- Double check the guidelines on AGENTS.md after the end of the implementation.
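The simple == comparison mentioned for evaluations above boils down to an accuracy score over expected/generated pairs. The guidelines place evaluations in Jupyter notebooks with pandas; the sketch below only illustrates the comparison logic, written in TypeScript for consistency with the rest of the project. The `EvalRow` shape and the trimming and lowercasing step are assumptions, not part of the LangWatch Evaluations API.

```typescript
// Hypothetical row shape for an evaluation dataset.
interface EvalRow {
  input: string;
  expected: string;
  generated: string;
}

// Fraction of rows whose generated output equals the expected output.
// Normalization (trim + lowercase) is an assumption; drop it for strict equality.
function exactMatchAccuracy(rows: EvalRow[]): number {
  if (rows.length === 0) return 0;
  let correct = 0;
  for (const row of rows) {
    const expected = row.expected.trim().toLowerCase();
    const generated = row.generated.trim().toLowerCase();
    if (expected === generated) correct += 1;
  }
  return correct / rows.length;
}
```

This kind of score only makes sense for the single input/output cases listed above, such as classification or routing.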
Always use the Vercel MCP for learning:
- The Vercel MCP server provides real-time documentation for Vercel AI SDK
- Ask it questions about Vercel AI SDK APIs and best practices
- Follow Vercel AI SDK's recommended patterns for agent development
When implementing agent features:
- Consult the Vercel MCP: "How do I [do X] in Vercel AI SDK?"
- Use Vercel AI SDK's unified provider architecture
- Follow Vercel AI SDK's TypeScript patterns and conventions
- Leverage Vercel AI SDK's framework integrations (Next.js, React, Svelte, Vue, Node.js)
Initial setup:
- Use pnpm init to create a new project
- Install dependencies: pnpm add ai @ai-sdk/openai (or other provider packages like @ai-sdk/anthropic, @ai-sdk/google)
- Set up TypeScript configuration
- Proceed with the user definition request to implement the agent and test it out
- Run the agent using pnpm tsx src/index.ts or integrate with your chosen framework
Key Concepts:
- Unified Provider Architecture: Consistent interface across multiple AI model providers
- generateText: Generate text using any supported model
- streamText: Stream text responses for real-time interactions
- Framework Integration: Works with Next.js, React, Svelte, Vue, and Node.js
This project follows a standardized structure for production-ready agents:
|__ app/ # Main application code
|__ prompts/ # Versioned prompt files (YAML)
|_____ *.yaml
|__ tests/
|_____ evaluations/ # Jupyter notebooks for component evaluation
|________ *.ipynb
|_____ scenarios/ # End-to-end scenario tests
|________ *.test.ts
|__ prompts.json # Prompt registry
|__ .env # Environment variables (never commit!)
- Understand Requirements: Clarify what the agent should do
- Design the Approach: Plan which components you'll need
- Implement with Prompts: Use LangWatch Prompt CLI to create/manage prompts
- Write Unit Tests: Test deterministic components
- Create Evaluations: Build evaluation notebooks for probabilistic components
- Write Scenario Tests: Create end-to-end tests using Scenario
- Run Tests: Verify everything works before moving on
- ✅ Version control your prompts
- ✅ Write tests for new features
- ✅ Use LangWatch MCP to learn best practices and to work with Scenario tests and evaluations
- ✅ Follow the Agent Testing Pyramid
- ✅ Document your agent's capabilities
- ❌ Hardcode prompts in application code
- ❌ Skip testing new features
- ❌ Commit API keys or sensitive data
- ❌ Optimize without measuring (use evaluations first)
The LangWatch MCP server provides expert guidance on:
- Prompt management with Prompt CLI
- Writing and maintaining Scenario tests (use LangWatch MCP to learn)
- Creating evaluations
- Best practices for agent development
The MCP will provide up-to-date documentation and examples. For Scenario specifically, always navigate its documentation and examples through the LangWatch MCP instead of accessing it directly.
- Set up your environment: Copy .env.example to .env and fill in your API keys
- Learn the tools: Ask the LangWatch MCP about prompt management and testing
- Start building: Implement your agent in the app/ directory
- Write tests: Create scenario tests for your agent's capabilities
- Iterate: Use evaluations to improve your agent's performance
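For the environment setup step above, it can help to validate the required variables once at startup so a missing key fails fast. This is a minimal sketch under assumptions: LANGWATCH_API_KEY appears in this document, while DATABASE_URL, `AgentConfig`, and `loadConfig` are hypothetical names chosen for illustration.

```typescript
// Hypothetical config shape; adjust the fields to the variables your agent needs.
interface AgentConfig {
  langwatchApiKey: string;
  databaseUrl: string;
}

// Reads required variables from an env-like record and fails fast if any is missing.
function loadConfig(env: Record<string, string | undefined>): AgentConfig {
  const required = ["LANGWATCH_API_KEY", "DATABASE_URL"]; // DATABASE_URL is an assumed name
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error("Missing environment variables: " + missing.join(", "));
  }
  return {
    langwatchApiKey: env["LANGWATCH_API_KEY"] as string,
    databaseUrl: env["DATABASE_URL"] as string,
  };
}

// Typical usage after dotenv has been loaded:
// const config = loadConfig(process.env);
```

Taking the environment as a parameter, instead of reading process.env directly, keeps the function easy to exercise in tests with a plain object.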
- Scenario Documentation: https://scenario.langwatch.ai/
- Agent Testing Pyramid: https://scenario.langwatch.ai/best-practices/the-agent-testing-pyramid
- LangWatch Dashboard: https://app.langwatch.ai/
- Mastra Documentation: Use the Mastra MCP for up-to-date docs
Remember: Building production-ready agents means combining great AI capabilities with solid software engineering practices. Follow these guidelines to create agents that are reliable, testable, and maintainable.