 `use_self_reflection` = False. When using `get_trustworthiness_score()` on
+"base" preset, a faster self-reflection is employed.
 
 By default, TLM uses the: "medium" `quality_preset`, "gpt-4.1-mini" base
 `model`, and `max_tokens` is set to 512. You can set custom values for these
@@ -550,11 +550,12 @@ def validate(
 strange prompts or prompts that are too vague/open-ended to receive a clearly defined 'good' response.
 TLM measures consistency via the degree of contradiction between sampled responses that the model considers plausible.
 
-num_self_reflections(int, default = 3): the number of self-reflections to perform where the LLM is asked to reflect on the given response and directly evaluate correctness/confidence.
-The maximum number of self-reflections currently supported is 3. Lower values will reduce runtimes/costs, but potentially also the reliability of trustworthiness scores.
-Reflection helps quantify aleatoric uncertainty associated with challenging prompts and catches responses that are noticeably incorrect/bad upon further analysis.
+use_self_reflection (bool, default = `True`): whether the LLM is asked to reflect on the given response and directly evaluate correctness/confidence.
+Setting this False disables reflection and will reduce runtimes/costs, but potentially also the reliability of trustworthiness scores.
+Reflection helps quantify aleatoric uncertainty associated with challenging prompts
+and catches responses that are noticeably incorrect/bad upon further analysis.
 
-similarity_measure ({"semantic", "string", "embedding", "embedding_large", "code", "discrepancy"}, default = "discrepancy"): how the
+similarity_measure ({"semantic", "string", "embedding", "embedding_large", "code", "discrepancy"}, default = "semantic"): how the
 trustworthiness scoring's consistency algorithm measures similarity between alternative responses considered plausible by the model.
 Supported similarity measures include - "semantic" (based on natural language inference),
 "embedding" (based on vector embedding similarity), "embedding_large" (based on a larger embedding model),
@@ -573,8 +574,6 @@ def validate(
 - name: Name of the evaluation criteria.
 - criteria: Instructions specifying the evaluation criteria.
 
-use_self_reflection (bool, default = `True`): deprecated. Use `num_self_reflections` instead.
-
 prompt: The prompt to use for the TLM call. If not provided, the prompt will be
 generated from the messages.
 
@@ -583,9 +582,6 @@ def validate(
 rewritten_question: The re-written query if it was provided by the client to Codex from a user to be
 used instead of the original query.
 
-tools: Tools to use for the LLM call. If not provided, it is assumed no tools were
-provided to the LLM.
-
 extra_headers: Send extra headers
 
 extra_query: Add additional query parameters to the request
 `use_self_reflection` = False. When using `get_trustworthiness_score()` on
+"base" preset, a faster self-reflection is employed.
 
 By default, TLM uses the: "medium" `quality_preset`, "gpt-4.1-mini" base
 `model`, and `max_tokens` is set to 512. You can set custom values for these
@@ -1123,11 +1118,12 @@ async def validate(
 strange prompts or prompts that are too vague/open-ended to receive a clearly defined 'good' response.
 TLM measures consistency via the degree of contradiction between sampled responses that the model considers plausible.
 
-num_self_reflections(int, default = 3): the number of self-reflections to perform where the LLM is asked to reflect on the given response and directly evaluate correctness/confidence.
-The maximum number of self-reflections currently supported is 3. Lower values will reduce runtimes/costs, but potentially also the reliability of trustworthiness scores.
-Reflection helps quantify aleatoric uncertainty associated with challenging prompts and catches responses that are noticeably incorrect/bad upon further analysis.
+use_self_reflection (bool, default = `True`): whether the LLM is asked to reflect on the given response and directly evaluate correctness/confidence.
+Setting this False disables reflection and will reduce runtimes/costs, but potentially also the reliability of trustworthiness scores.
+Reflection helps quantify aleatoric uncertainty associated with challenging prompts
+and catches responses that are noticeably incorrect/bad upon further analysis.
 
-similarity_measure ({"semantic", "string", "embedding", "embedding_large", "code", "discrepancy"}, default = "discrepancy"): how the
+similarity_measure ({"semantic", "string", "embedding", "embedding_large", "code", "discrepancy"}, default = "semantic"): how the
 trustworthiness scoring's consistency algorithm measures similarity between alternative responses considered plausible by the model.
 Supported similarity measures include - "semantic" (based on natural language inference),
 "embedding" (based on vector embedding similarity), "embedding_large" (based on a larger embedding model),
@@ -1146,8 +1142,6 @@ async def validate(
 - name: Name of the evaluation criteria.
 - criteria: Instructions specifying the evaluation criteria.
 
-use_self_reflection (bool, default = `True`): deprecated. Use `num_self_reflections` instead.
-
 prompt: The prompt to use for the TLM call. If not provided, the prompt will be
 generated from the messages.
 
@@ -1156,9 +1150,6 @@ async def validate(
 rewritten_question: The re-written query if it was provided by the client to Codex from a user to be
 used instead of the original query.
 
-tools: Tools to use for the LLM call. If not provided, it is assumed no tools were
-provided to the LLM.
-
 extra_headers: Send extra headers
 
 extra_query: Add additional query parameters to the request
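For illustration, the TLM options documented in this diff (default "medium" `quality_preset`, "gpt-4.1-mini" `model`, `max_tokens` 512, `use_self_reflection` enabled, and the "semantic" `similarity_measure` default after this change) could be collected into an options mapping like the sketch below. This is a minimal, hypothetical helper for merging user overrides onto the documented defaults; `merge_tlm_options` is not part of the library's API, and the keys/values are taken from the docstring text above.

```python
# Documented defaults from the docstring above. The merging/validation
# helper is illustrative only, not part of the actual library.
ALLOWED_SIMILARITY_MEASURES = {
    "semantic", "string", "embedding", "embedding_large", "code", "discrepancy",
}

DEFAULT_TLM_OPTIONS = {
    "quality_preset": "medium",       # documented default preset
    "model": "gpt-4.1-mini",          # documented default base model
    "max_tokens": 512,                # documented default token limit
    "use_self_reflection": True,      # reflection enabled by default
    "similarity_measure": "semantic", # default after this change
}

def merge_tlm_options(overrides=None):
    """Merge user overrides onto the documented defaults, rejecting
    unsupported similarity measures (hypothetical helper)."""
    opts = {**DEFAULT_TLM_OPTIONS, **(overrides or {})}
    if opts["similarity_measure"] not in ALLOWED_SIMILARITY_MEASURES:
        raise ValueError(
            f"unsupported similarity_measure: {opts['similarity_measure']!r}"
        )
    return opts

# Disabling self-reflection trades trustworthiness-score reliability for
# lower runtime/cost, as the docstring notes.
fast_opts = merge_tlm_options({"use_self_reflection": False, "max_tokens": 128})
```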