
@katmayb katmayb commented Nov 18, 2025

@github-actions github-actions bot added the langsmith For docs changes to LangSmith label Nov 18, 2025
@github-actions

Mintlify preview ID generated: preview-evalsi-1763502859-c2d102b

@github-actions

Mintlify preview ID generated: preview-evalsi-1763668729-42c6cad

---

LangSmith makes building high-quality evaluations easy. This guide explains the key concepts of the LangSmith evaluation framework.
LLM outputs are non-deterministic, subjective quality often matters more than correctness, and real-world performance can diverge significantly from controlled tests. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
Suggested change
LLM outputs are non-deterministic, subjective quality often matters more than correctness, and real-world performance can diverge significantly from controlled tests. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
LLM outputs are non-deterministic, which makes response quality hard to assess. Evaluations (evals) are a way to break down what "good" looks like and measure it. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.

#### Heuristic

LLM-as-judge evaluators use LLMs to score the application's output. To use them, you typically encode the grading rules / criteria in the LLM prompt. They can be reference-free (e.g., check if system output contains offensive content or adheres to specific criteria). Or, they can compare task output to a reference output (e.g., check if the output is factually accurate relative to the reference).
_Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly.
Suggested change
_Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly.
_Code evaluators_ are deterministic, rule-based functions. They work well for checks such as verifying the structure of a chatbot's response, that a response is not empty, that generated code compiles, or that a classification matches exactly.
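For illustration, a minimal Python sketch of such a deterministic evaluator; the function signatures and feedback keys below are assumptions, not a specific LangSmith API:

```python
# A minimal sketch of a deterministic (code/heuristic) evaluator, for
# illustration only. The argument names and feedback shape are assumptions;
# adapt them to how your evaluation harness passes outputs and references.

def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1 if the predicted label matches the reference exactly."""
    predicted = str(outputs.get("label", "")).strip().lower()
    expected = str(reference_outputs.get("label", "")).strip().lower()
    return {"key": "exact_match", "score": int(predicted == expected)}


def non_empty_response(outputs: dict) -> dict:
    """Score 1 if the chatbot produced a non-empty response."""
    return {"key": "non_empty", "score": int(bool(str(outputs.get("response", "")).strip()))}
```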

We don't really use the term heuristic evaluators


Learn [how to analyze experiment results](/langsmith/analyze-an-experiment).

## Experiment configuration
imo this isn't conceptual anymore and not something I would want someone new to evals to learn right away. We should move it to the Set up evaluations > Evaluation techniques section.

tanushree-sharma commented Nov 24, 2025

WDYT about the following structure/content for the concepts page? Feels like there's a lot of info and I think the order can be better:

  • What to evaluate: Figure out what's important to measure. Split your system into its parts and evaluate each critical part with evals. Recommendation: Start with building examples of what good looks like for each part (manually curated examples).
  • Introduce offline and online evals: when to use each, and the offline and online eval targets. I also think we should separate out a new concept we're introducing (e.g., Datasets) from best practices for using that concept (e.g., Dataset Curation). Feels like a lot of info at once. We could move best practices to a new section at the bottom and cross-link?
  • Evaluation lifecycle
  • Evaluation types --> Link out to Set up evaluations > Evaluation types. Add a landing page that explains each (move the content that's here).
  • Best Practices


Before production deployment, use offline evaluations to validate functionality, benchmark different approaches, and build confidence.

1. Create a [dataset](/langsmith/manage-datasets) with representative test cases.
I like this flow, but I'd remove the numbered bullets for each of these sections. Since they're super high level, I think the numbered format makes them hard for someone new to follow.

1. Deploy the updated application.
1. Confirm the fix with online evaluations.
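As a rough sketch of the pre-deployment part of this loop, assuming the LangSmith Python SDK's `evaluate` helper (the exact import path and parameters can vary by SDK version), with a hypothetical dataset name, target function, and evaluator:

```python
# Sketch: running an offline evaluation against a curated dataset.
# Assumes the LangSmith Python SDK's `evaluate` helper; the dataset name,
# target function, and evaluator below are placeholders.
from langsmith import evaluate


def my_app(inputs: dict) -> dict:
    # Call your real application (chain, agent, etc.) here.
    return {"response": f"Echo: {inputs['question']}"}


def non_empty_response(outputs: dict) -> dict:
    return {"key": "non_empty", "score": int(bool(outputs.get("response", "").strip()))}


results = evaluate(
    my_app,                            # target being evaluated
    data="my-representative-dataset",  # dataset of test cases (hypothetical name)
    evaluators=[non_empty_response],
    experiment_prefix="baseline",
)
```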

## Core evaluation objects
@tanushree-sharma tanushree-sharma Nov 24, 2025

I'd rather frame it as: these are the kinds of evals you can run online vs offline. This section makes that really clear and then you can introduce the different concepts for offline and online

Thoughts on using the word evaluation "targets" instead of "objects"? Offline evaluations and online evaluations run on different targets; online evals operate on runs, while offline evals operate on examples

_Synthetic data generation_ creates additional examples artificially from existing ones. This approach works best when starting with several high-quality, hand-crafted examples, because the synthetic data typically uses these as templates. This provides a quick way to expand dataset size.
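As a loose sketch of that pattern (the `call_llm` helper, prompt, and seed content are placeholders, not a LangSmith API):

```python
# Sketch: generating synthetic examples from a few hand-crafted seeds.
# `call_llm` is a placeholder for whatever chat model client you use.

seed_examples = [
    {"question": "How do I reset my password?",
     "answer": "Go to Settings > Security and choose 'Reset password'."},
]


def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your chat model.")


def generate_synthetic_examples(seeds: list[dict]) -> list[str]:
    """Create one synthetic variation per hand-crafted seed example."""
    synthetic = []
    for seed in seeds:
        prompt = (
            "Write a new question/answer pair in the same style and domain as the "
            "example below, but on a different topic.\n\n"
            f"Q: {seed['question']}\nA: {seed['answer']}"
        )
        synthetic.append(call_llm(prompt))
    return synthetic
```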

### Splits
#### Splits
I would remove Splits and Versions from here because they're not really conceptual and are explained elsewhere.


![Offline](/langsmith/images/offline.png)

### Benchmarking
@tanushree-sharma tanushree-sharma Nov 24, 2025

idt Benchmarking, Unit tests, Regression tests, or Backtesting add value to this guide. I'd keep Pairwise though.

![Online](/langsmith/images/online.png)

## Testing
### Real-time monitoring
Same with these. Not super helpful as their own sections, I think. I'd rather weave these concepts into the Online Evals section.

@github-actions
Mintlify preview ID generated: preview-evalsi-1764016425-db5351a

Before building evaluations, identify what matters for your application. Break down your system into its critical components—LLM calls, retrieval steps, tool invocations, output formatting—and determine quality criteria for each.

A dataset is a collection of examples used for evaluating an application. An example is a test input, reference output pair.
**Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each critical component. These examples serve as your ground truth and inform which evaluation approaches to use. For instance:
Suggested change
**Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each critical component. These examples serve as your ground truth and inform which evaluation approaches to use. For instance:
**Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each component. These examples serve as your ground truth that the eval compares model outputs against. For instance:

- **RAG system**: Examples of good retrievals (relevant documents) and good answers (accurate, complete).
- **Agent**: Examples of correct tool selection and proper argument formatting.
Suggested change
- **Agent**: Examples of correct tool selection and proper argument formatting.
- **Agent**: Examples of correct tool selection, proper argument formatting, or the trajectory the agent took.
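To make the "start with manually curated examples" advice concrete, here is a rough sketch using the LangSmith Python SDK; the dataset name and example content are invented, and method signatures may differ across SDK versions:

```python
# Sketch: seeding a dataset with a few manually curated examples via the
# LangSmith Python SDK. Names and contents are illustrative placeholders.
from langsmith import Client

client = Client()
dataset = client.create_dataset(dataset_name="rag-quality-seed")

client.create_examples(
    inputs=[
        {"question": "What is our refund window?"},
        {"question": "Which plan includes SSO?"},
    ],
    outputs=[
        {"answer": "Refunds are available within 30 days of purchase."},
        {"answer": "SSO is included in the Enterprise plan."},
    ],
    dataset_id=dataset.id,
)
```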

## Evaluation lifecycle

As you develop and [deploy your application](/langsmith/deployments), your evaluation strategy evolves from pre-deployment testing to production monitoring. LLM applications progress through distinct phases, each requiring different evaluation approaches. During development and testing, offline evaluations validate functionality against curated datasets. After deployment, online evaluations monitor production behavior on live traffic. As applications mature, both evaluation types work together in an iterative feedback loop to improve quality continuously.
Suggested change
As you develop and [deploy your application](/langsmith/deployments), your evaluation strategy evolves from pre-deployment testing to production monitoring. LLM applications progress through distinct phases, each requiring different evaluation approaches. During development and testing, offline evaluations validate functionality against curated datasets. After deployment, online evaluations monitor production behavior on live traffic. As applications mature, both evaluation types work together in an iterative feedback loop to improve quality continuously.
As you develop and [deploy your application](/langsmith/deployments), your evaluation strategy evolves from pre-deployment testing to production monitoring. During development and testing, offline evaluations validate functionality against curated datasets. After deployment, online evaluations monitor production behavior on live traffic. As applications mature, both evaluation types work together in an iterative feedback loop to improve quality continuously.

## Core evaluation targets

Evaluations run on different targets depending on whether they are offline or online. Understanding these targets is essential for choosing the right evaluation approach.
Suggested change
Evaluations run on different targets depending on whether they are offline or online. Understanding these targets is essential for choosing the right evaluation approach.
Evaluations run on different targets depending on whether they are offline or online.

_Evaluators_ are functions that score application performance. They provide the measurement layer for both offline and online evaluation, adapting their inputs based on what data is available.

Pairwise evaluators allow you to compare the outputs of two versions of an application. This can use either a heuristic ("which response is longer"), an LLM (with a specific pairwise prompt), or human (asking them to manually annotate examples).
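For illustration, a trivial heuristic pairwise evaluator sketch ("which response is longer"); the function shape and return format are assumptions rather than a specific SDK signature:

```python
# Sketch: a heuristic pairwise evaluator that prefers the longer response.
# The argument names and return format are illustrative placeholders.

def prefer_longer(outputs_a: dict, outputs_b: dict) -> dict:
    """Compare two application versions' responses and report a preference."""
    len_a = len(outputs_a.get("response", ""))
    len_b = len(outputs_b.get("response", ""))
    if len_a == len_b:
        return {"key": "preference", "value": "tie"}
    return {"key": "preference", "value": "a" if len_a > len_b else "b"}
```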
### Defining and running evaluators
I'd remove this section since it has overlapping info

## Evaluators

### Pairwise
_Evaluators_ are functions that score application performance. They provide the measurement layer for both offline and online evaluation, adapting their inputs based on what data is available.
Suggested change
_Evaluators_ are functions that score application performance. They provide the measurement layer for both offline and online evaluation, adapting their inputs based on what data is available.
_Evaluators_ are functions that score application performance. They provide the measurement layer for both offline and online evaluation, adapting their inputs based on what data is available.
Run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)), via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground), or by configuring [rules](/langsmith/rules) to automatically run them on tracing projects or datasets.
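To tie this back to the LLM-as-judge idea mentioned earlier, here is a rough sketch of a reference-based judge evaluator; `call_judge` is a stand-in for whatever chat model you use, and the prompt and scoring format are assumptions:

```python
# Sketch: an LLM-as-judge evaluator that grades factual accuracy against a
# reference output. `call_judge` stands in for your judge model client.

def call_judge(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your judge model.")


def factual_accuracy(outputs: dict, reference_outputs: dict) -> dict:
    prompt = (
        "Grade the RESPONSE for factual accuracy against the REFERENCE.\n"
        "Reply with a single character: 1 if accurate, 0 if not.\n\n"
        f"RESPONSE: {outputs.get('response', '')}\n"
        f"REFERENCE: {reference_outputs.get('answer', '')}"
    )
    grade = call_judge(prompt).strip()
    return {"key": "factual_accuracy", "score": int(grade == "1")}
```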

![Comparison view](/langsmith/images/comparison-view.png)
### Evaluator outputs

Evaluators return one or more metrics as a dictionary or list of dictionaries. Each dictionary contains:
Suggested change
Evaluators return one or more metrics as a dictionary or list of dictionaries. Each dictionary contains:
Evaluators return **feedback**: the scores from an evaluation. Feedback is a dictionary or a list of dictionaries. Each dictionary contains:
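For reference, a minimal sketch of what such a return value could look like, assuming commonly used feedback fields such as `key`, `score`, and `comment`:

```python
# Sketch: possible evaluator return values — a single metric as a dict,
# or several metrics as a list of dicts. Field names are assumptions based
# on common LangSmith feedback conventions (key, score, comment).

single_metric = {"key": "conciseness", "score": 0.8, "comment": "Mostly to the point."}

multiple_metrics = [
    {"key": "helpfulness", "score": 1},
    {"key": "toxicity", "score": 0, "comment": "No problematic content detected."},
]
```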

![Example](/langsmith/images/example-concept.png)

Learn more about [managing datasets](/langsmith/manage-datasets).
I'd move this section up right after here since it's related/fundamental to offline evals: https://langchain-5e9cc07a-preview-evalsi-1764016425-db5351a.mintlify.app/langsmith/evaluation-concepts#experiment
