WIP update eval intro docs for clarity online offline user journey #1518
base: main
Conversation
Mintlify preview ID generated: preview-evalsi-1763502859-c2d102b

Mintlify preview ID generated: preview-evalsi-1763668729-42c6cad
LangSmith makes building high-quality evaluations easy. This guide explains the key concepts of the LangSmith evaluation framework. The building blocks of the LangSmith framework are:

LLM outputs are non-deterministic, subjective quality often matters more than correctness, and real-world performance can diverge significantly from controlled tests. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
Suggested change:
- LLM outputs are non-deterministic, subjective quality often matters more than correctness, and real-world performance can diverge significantly from controlled tests. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
+ LLM outputs are non-deterministic, which makes response quality hard to assess. Evaluations (evals) are a way to break down what "good" looks like and measure it. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
#### Heuristic

LLM-as-judge evaluators use LLMs to score the application's output. To use them, you typically encode the grading rules / criteria in the LLM prompt. They can be reference-free (e.g., check if system output contains offensive content or adheres to specific criteria). Or, they can compare task output to a reference output (e.g., check if the output is factually accurate relative to the reference).

_Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly.
Suggested change:
- _Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly.
+ _Code evaluators_ are deterministic, rule-based functions. They work well for checks such as verifying the structure of a chatbot's response, that the response is not empty, that generated code compiles, or that a classification matches exactly.
We don't really use the term heuristic evaluators
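For anyone new to the thread, a code evaluator in this sense (what the draft calls a heuristic evaluator) is just a deterministic function over the application's output. A minimal sketch in Python, assuming a dict-style output with an `answer` field (the field name and metric key are placeholders):

```python
def response_is_well_formed(outputs: dict) -> dict:
    """Deterministic, rule-based check: the response exists and is non-empty."""
    answer = outputs.get("answer", "")
    is_valid = isinstance(answer, str) and len(answer.strip()) > 0
    return {"key": "well_formed_response", "score": 1 if is_valid else 0}
```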
Learn [how to analyze experiment results](/langsmith/analyze-an-experiment).
## Experiment configuration
IMO this isn't conceptual anymore and not something I would want someone new to evals to learn right away. We should move it to the Set up evaluations > Evaluation techniques section.

WDYT about the following structure/content for the concepts page? Feels like there's a lot of info and I think the order can be better:
Before production deployment, use offline evaluations to validate functionality, benchmark different approaches, and build confidence.

1. Create a [dataset](/langsmith/manage-datasets) with representative test cases.
I like this flow, but I'd remove the numbered bullets for each of these sections. For someone new, since they're super high level, the bullets are hard to follow.
1. Deploy the updated application.
1. Confirm the fix with online evaluations.
## Core evaluation objects
I'd rather frame it as: these are the kinds of evals you can run online vs. offline. This section makes that really clear, and then you can introduce the different concepts for offline and online.

Thoughts on using the word evaluation "targets" instead of "objects"? Offline evaluations and online evaluations run on different targets; online evals operate on runs, while offline evals operate on examples.
_Synthetic data generation_ creates additional examples artificially from existing ones. This approach works best when starting with several high-quality, hand-crafted examples, because the synthetic data typically uses these as templates. This provides a quick way to expand dataset size.
### Splits
#### Splits
I would remove Splits and Versions from here because they're not really conceptual and they are explained elsewhere.
### Benchmarking
I don't think Benchmarking, Unit tests, Regression tests, and Backtesting add value to this guide. I'd keep Pairwise though.
## Testing
### Real-time monitoring
Same with these. Not super helpful as their own sections, I think. I'd rather weave these concepts into the Online Evals section.
Mintlify preview ID generated: preview-evalsi-1764016425-db5351a |
Before building evaluations, identify what matters for your application. Break down your system into its critical components—LLM calls, retrieval steps, tool invocations, output formatting—and determine quality criteria for each.
A dataset is a collection of examples used for evaluating an application. An example is a test input, reference output pair.

**Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each critical component. These examples serve as your ground truth and inform which evaluation approaches to use. For instance:
Suggested change:
- **Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each critical component. These examples serve as your ground truth and inform which evaluation approaches to use. For instance:
+ **Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each component. These examples serve as the ground truth that the eval compares model outputs against. For instance:
- **RAG system**: Examples of good retrievals (relevant documents) and good answers (accurate, complete).
- **Agent**: Examples of correct tool selection and proper argument formatting.
Suggested change:
- **Agent**: Examples of correct tool selection and proper argument formatting.
+ **Agent**: Examples of correct tool selection, proper formatting, or the trajectory that the agent took.
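To make the dataset/example relationship concrete, here is a minimal sketch using the LangSmith Python SDK (assuming a recent `langsmith` package; the dataset name and example fields are placeholders):

```python
from langsmith import Client

client = Client()

# A dataset is a collection of examples; each example pairs a test input
# with a reference output (the hand-curated "ground truth").
dataset = client.create_dataset("rag-golden-examples")

client.create_examples(
    inputs=[{"question": "What is LangSmith used for?"}],
    outputs=[{"answer": "Tracing, evaluating, and monitoring LLM applications."}],
    dataset_id=dataset.id,
)
```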
## Evaluation lifecycle
If you're getting a lot of traffic, how can you determine which runs are valuable to add to a dataset? There are a few techniques you can use:

As you develop and [deploy your application](/langsmith/deployments), your evaluation strategy evolves from pre-deployment testing to production monitoring. LLM applications progress through distinct phases, each requiring different evaluation approaches. During development and testing, offline evaluations validate functionality against curated datasets. After deployment, online evaluations monitor production behavior on live traffic. As applications mature, both evaluation types work together in an iterative feedback loop to improve quality continuously.
Suggested change:
- As you develop and [deploy your application](/langsmith/deployments), your evaluation strategy evolves from pre-deployment testing to production monitoring. LLM applications progress through distinct phases, each requiring different evaluation approaches. During development and testing, offline evaluations validate functionality against curated datasets. After deployment, online evaluations monitor production behavior on live traffic. As applications mature, both evaluation types work together in an iterative feedback loop to improve quality continuously.
+ As you develop and [deploy your application](/langsmith/deployments), your evaluation strategy evolves from pre-deployment testing to production monitoring. During development and testing, offline evaluations validate functionality against curated datasets. After deployment, online evaluations monitor production behavior on live traffic. As applications mature, both evaluation types work together in an iterative feedback loop to improve quality continuously.
## Core evaluation targets
Evaluators receive these inputs:

Evaluations run on different targets depending on whether they are offline or online. Understanding these targets is essential for choosing the right evaluation approach.
Suggested change:
- Evaluations run on different targets depending on whether they are offline or online. Understanding these targets is essential for choosing the right evaluation approach.
+ Evaluations run on different targets depending on whether they are offline or online.
_Evaluators_ are functions that score application performance. They provide the measurement layer for both offline and online evaluation, adapting their inputs based on what data is available.
Pairwise evaluators allow you to compare the outputs of two versions of an application. This can use either a heuristic ("which response is longer"), an LLM (with a specific pairwise prompt), or a human (asking them to manually annotate examples).
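For example, the "which response is longer" heuristic could be sketched as a plain function; the field names below are placeholders rather than the exact pairwise evaluator signature:

```python
def prefer_longer(outputs_a: dict, outputs_b: dict) -> dict:
    """Heuristic pairwise comparison: prefer whichever version gave the longer answer."""
    len_a = len(outputs_a.get("answer", ""))
    len_b = len(outputs_b.get("answer", ""))
    return {"key": "preferred_version", "value": "A" if len_a >= len_b else "B"}
```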
### Defining and running evaluators
I'd remove this section since it has overlapping info
## Evaluators

### Pairwise

_Evaluators_ are functions that score application performance. They provide the measurement layer for both offline and online evaluation, adapting their inputs based on what data is available.
Suggested change (addition after the evaluator definition):
+ Run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)), via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground), or by configuring [rules](/langsmith/rules) to automatically run them on tracing projects or datasets.
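As a rough sketch of the SDK path (assuming a recent `langsmith` Python package; the target, dataset name, and evaluator below are hypothetical):

```python
from langsmith import Client

def my_app(inputs: dict) -> dict:
    """Placeholder target: call your application here."""
    return {"answer": "..."}

def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Compare the application's answer against the reference answer."""
    return {
        "key": "exact_match",
        "score": int(outputs.get("answer") == reference_outputs.get("answer")),
    }

client = Client()
results = client.evaluate(
    my_app,                      # the application (target) being evaluated
    data="rag-golden-examples",  # dataset of input / reference output examples
    evaluators=[exact_match],    # offline evaluators to score each run
)
```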
### Evaluator outputs
Evaluators return one or more metrics as a dictionary or list of dictionaries. Each dictionary contains:
Suggested change:
- Evaluators return one or more metrics as a dictionary or list of dictionaries. Each dictionary contains:
+ Evaluators return **feedback**, the scores produced by evaluation. Feedback is a dictionary or list of dictionaries. Each dictionary contains:
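Concretely, the feedback an evaluator returns might look like this sketch, using the common `key` / `score` / `comment` fields (the metric name is hypothetical):

```python
def correctness(outputs: dict, reference_outputs: dict) -> dict:
    matched = outputs.get("answer") == reference_outputs.get("answer")
    return {
        "key": "correctness",          # name of the metric
        "score": 1 if matched else 0,  # numeric or boolean score
        "comment": None if matched else "Answer differs from the reference.",  # optional explanation
    }
```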
There are a few high-level approaches to LLM evaluation:

Learn more about [managing datasets](/langsmith/manage-datasets).
I'd move this section up right after here since it's related/fundamental to offline evals: https://langchain-5e9cc07a-preview-evalsi-1764016425-db5351a.mintlify.app/langsmith/evaluation-concepts#experiment
Preview
https://langchain-5e9cc07a-preview-evalsi-1764016425-db5351a.mintlify.app/langsmith/evaluation-concepts