WIP update eval intro docs for clarity online offline user journey #1518
base: main
Conversation
Mintlify preview ID generated: preview-evalsi-1763502859-c2d102b

Mintlify preview ID generated: preview-evalsi-1763668729-42c6cad
LangSmith makes building high-quality evaluations easy. This guide explains the key concepts of the LangSmith evaluation framework. The building blocks of the LangSmith framework are:

LLM outputs are non-deterministic, subjective quality often matters more than correctness, and real-world performance can diverge significantly from controlled tests. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
Suggested change:
- LLM outputs are non-deterministic, subjective quality often matters more than correctness, and real-world performance can diverge significantly from controlled tests. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
+ LLM outputs are non-deterministic, which makes response quality hard to assess. Evaluations (evals) are a way to break down what "good" looks like and measure it. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
#### Heuristic

LLM-as-judge evaluators use LLMs to score the application's output. To use them, you typically encode the grading rules / criteria in the LLM prompt. They can be reference-free (e.g., check if system output contains offensive content or adheres to specific criteria). Or, they can compare task output to a reference output (e.g., check if the output is factually accurate relative to the reference).

_Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly.
Suggested change:
- _Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly.
+ _Code evaluators_ are deterministic, rule-based functions. They work well for checks such as verifying the structure of a chatbot's response, that the response is not empty, that generated code compiles, or that a classification matches exactly.
We don't really use the term heuristic evaluators
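For anyone new to the thread, a code evaluator in this sense (what the draft calls a heuristic evaluator) is just a deterministic function over the application's output. A minimal sketch in Python, assuming a dict-style output with an `answer` field (the field name and metric key are placeholders):

```python
def response_is_well_formed(outputs: dict) -> dict:
    """Deterministic, rule-based check: the response exists and is non-empty."""
    answer = outputs.get("answer", "")
    is_valid = isinstance(answer, str) and len(answer.strip()) > 0
    return {"key": "well_formed_response", "score": 1 if is_valid else 0}
```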
Learn [how to analyze experiment results](/langsmith/analyze-an-experiment).
## Experiment configuration
IMO this isn't conceptual anymore and not something I would want someone new to evals to learn right away. We should move it to the Set up evaluations > Evaluation techniques section.

WDYT about the following structure/content for the concepts page? Feels like there's a lot of info and I think the order can be better:
Before production deployment, use offline evaluations to validate functionality, benchmark different approaches, and build confidence.

1. Create a [dataset](/langsmith/manage-datasets) with representative test cases.
I like this flow, but I'd remove the numbered bullets for each of these sections. For someone new, since they're super high level, the bullets are hard to follow.
1. Deploy the updated application.
1. Confirm the fix with online evaluations.
## Core evaluation objects
I'd rather frame it as: these are the kinds of evals you can run online vs. offline. This section makes that really clear, and then you can introduce the different concepts for offline and online.

Thoughts on using the word evaluation "targets" instead of "objects"? Offline evaluations and online evaluations run on different targets; online evals operate on runs, while offline evals operate on examples.
_Synthetic data generation_ creates additional examples artificially from existing ones. This approach works best when starting with several high-quality, hand-crafted examples, because the synthetic data typically uses these as templates. This provides a quick way to expand dataset size.
### Splits
#### Splits
I would remove Splits and Versions from here because they're not really conceptual and they are explained elsewhere.
### Benchmarking
I don't think Benchmarking, Unit tests, Regression tests, and Backtesting add value to this guide. I'd keep Pairwise though.
## Testing
### Real-time monitoring
Same with these. Not super helpful as their own sections, I think. I'd rather weave these concepts into the Online Evals section.
Mintlify preview ID generated: preview-evalsi-1764016425-db5351a |
Before building evaluations, identify what matters for your application. Break down your system into its critical components—LLM calls, retrieval steps, tool invocations, output formatting—and determine quality criteria for each.
A dataset is a collection of examples used for evaluating an application. An example is a test input, reference output pair.

**Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each critical component. These examples serve as your ground truth and inform which evaluation approaches to use. For instance:
Suggested change:
- **Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each critical component. These examples serve as your ground truth and inform which evaluation approaches to use. For instance:
+ **Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each component. These examples serve as the ground truth that the eval compares model outputs against. For instance:
- **RAG system**: Examples of good retrievals (relevant documents) and good answers (accurate, complete).
- **Agent**: Examples of correct tool selection and proper argument formatting.
Suggested change:
- **Agent**: Examples of correct tool selection and proper argument formatting.
+ **Agent**: Examples of correct tool selection, proper formatting, or the trajectory that the agent took.
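To make the dataset/example relationship concrete, here is a minimal sketch using the LangSmith Python SDK (assuming a recent `langsmith` package; the dataset name and example fields are placeholders):

```python
from langsmith import Client

client = Client()

# A dataset is a collection of examples; each example pairs a test input
# with a reference output (the hand-curated "ground truth").
dataset = client.create_dataset("rag-golden-examples")

client.create_examples(
    inputs=[{"question": "What is LangSmith used for?"}],
    outputs=[{"answer": "Tracing, evaluating, and monitoring LLM applications."}],
    dataset_id=dataset.id,
)
```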
## Evaluation lifecycle
If you're getting a lot of traffic, how can you determine which runs are valuable to add to a dataset? There are a few techniques you can use:

As you develop and [deploy your application](/langsmith/deployments), your evaluation strategy evolves from pre-deployment testing to production monitoring. LLM applications progress through distinct phases, each requiring different evaluation approaches. During development and testing, offline evaluations validate functionality against curated datasets. After deployment, online evaluations monitor production behavior on live traffic. As applications mature, both evaluation types work together in an iterative feedback loop to improve quality continuously.
Suggested change:
- As you develop and [deploy your application](/langsmith/deployments), your evaluation strategy evolves from pre-deployment testing to production monitoring. LLM applications progress through distinct phases, each requiring different evaluation approaches. During development and testing, offline evaluations validate functionality against curated datasets. After deployment, online evaluations monitor production behavior on live traffic. As applications mature, both evaluation types work together in an iterative feedback loop to improve quality continuously.
+ As you develop and [deploy your application](/langsmith/deployments), your evaluation strategy evolves from pre-deployment testing to production monitoring. During development and testing, offline evaluations validate functionality against curated datasets. After deployment, online evaluations monitor production behavior on live traffic. As applications mature, both evaluation types work together in an iterative feedback loop to improve quality continuously.
## Core evaluation targets
Evaluators receive these inputs:

Evaluations run on different targets depending on whether they are offline or online. Understanding these targets is essential for choosing the right evaluation approach.
Suggested change:
- Evaluations run on different targets depending on whether they are offline or online. Understanding these targets is essential for choosing the right evaluation approach.
+ Evaluations run on different targets depending on whether they are offline or online.
_Evaluators_ are functions that score application performance. They provide the measurement layer for both offline and online evaluation, adapting their inputs based on what data is available.
Pairwise evaluators allow you to compare the outputs of two versions of an application. This can use either a heuristic ("which response is longer"), an LLM (with a specific pairwise prompt), or a human (asking them to manually annotate examples).
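For example, the "which response is longer" heuristic could be sketched as a plain function; the field names below are placeholders rather than the exact pairwise evaluator signature:

```python
def prefer_longer(outputs_a: dict, outputs_b: dict) -> dict:
    """Heuristic pairwise comparison: prefer whichever version gave the longer answer."""
    len_a = len(outputs_a.get("answer", ""))
    len_b = len(outputs_b.get("answer", ""))
    return {"key": "preferred_version", "value": "A" if len_a >= len_b else "B"}
```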
### Defining and running evaluators
I'd remove this section since it has overlapping info
## Evaluators

### Pairwise

_Evaluators_ are functions that score application performance. They provide the measurement layer for both offline and online evaluation, adapting their inputs based on what data is available.
Suggested change (addition after the evaluator definition):
+ Run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)), via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground), or by configuring [rules](/langsmith/rules) to automatically run them on tracing projects or datasets.
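As a rough sketch of the SDK path (assuming a recent `langsmith` Python package; the target, dataset name, and evaluator below are hypothetical):

```python
from langsmith import Client

def my_app(inputs: dict) -> dict:
    """Placeholder target: call your application here."""
    return {"answer": "..."}

def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Compare the application's answer against the reference answer."""
    return {
        "key": "exact_match",
        "score": int(outputs.get("answer") == reference_outputs.get("answer")),
    }

client = Client()
results = client.evaluate(
    my_app,                      # the application (target) being evaluated
    data="rag-golden-examples",  # dataset of input / reference output examples
    evaluators=[exact_match],    # offline evaluators to score each run
)
```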
### Evaluator outputs
Evaluators return one or more metrics as a dictionary or list of dictionaries. Each dictionary contains:
Suggested change:
- Evaluators return one or more metrics as a dictionary or list of dictionaries. Each dictionary contains:
+ Evaluators return **feedback**, the scores produced by evaluation. Feedback is a dictionary or list of dictionaries. Each dictionary contains:
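Concretely, the feedback an evaluator returns might look like this sketch, using the common `key` / `score` / `comment` fields (the metric name is hypothetical):

```python
def correctness(outputs: dict, reference_outputs: dict) -> dict:
    matched = outputs.get("answer") == reference_outputs.get("answer")
    return {
        "key": "correctness",          # name of the metric
        "score": 1 if matched else 0,  # numeric or boolean score
        "comment": None if matched else "Answer differs from the reference.",  # optional explanation
    }
```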
There are a few high-level approaches to LLM evaluation:

Learn more about [managing datasets](/langsmith/manage-datasets).
I'd move this section up right after here since it's related/fundamental to offline evals: https://langchain-5e9cc07a-preview-evalsi-1764016425-db5351a.mintlify.app/langsmith/evaluation-concepts#experiment
Preview
https://langchain-5e9cc07a-preview-evalsi-1764016425-db5351a.mintlify.app/langsmith/evaluation-concepts