From 3ed7d3f784c7dc0ec760940436c75aed22f9dca8 Mon Sep 17 00:00:00 2001
From: Quirin Anton Klaus Hosse
Date: Wed, 30 Apr 2025 13:27:30 +0200
Subject: [PATCH] Update reinforcement_learning_terms.md

---
 docs/Cheatsheets/reinforcement_learning_terms.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/Cheatsheets/reinforcement_learning_terms.md b/docs/Cheatsheets/reinforcement_learning_terms.md
index e021c4f..d8c5937 100644
--- a/docs/Cheatsheets/reinforcement_learning_terms.md
+++ b/docs/Cheatsheets/reinforcement_learning_terms.md
@@ -69,7 +69,7 @@ Think of the advantage function as a measure of how much better it was to take t
 
 ## The Learning Process
 
-Most learning algorithms consider an *objective function* $J(\pi)$, which is a function that maps a policy $\pi$ to a real number. The goal of learning is then to find a policy $\pi^*$ that maximizes the objective function, i.e. $J(\pi^*) = \max_{\pi} J(\pi)$. A convenient choice for $J$ would be any of the Q function, value function, or advantage function. For our purposes we will focus on the advantage function, because the Proximal Policy Optimization (PPO) algorithm uses that as an bjective.
+Most learning algorithms consider an *objective function* $J(\pi)$, which is a function that maps a policy $\pi$ to a real number. The goal of learning is then to find a policy $\pi^*$ that maximizes the objective function, i.e. $J(\pi^*) = \max_{\pi} J(\pi)$. A convenient choice for $J$ would be any of the Q function, value function, or advantage function. For our purposes we will focus on the advantage function, because the Proximal Policy Optimization (PPO) algorithm uses that as an objective.
 
 ## Generalized Advantage Estimation
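A note on the corrected sentence, not part of the patch itself: the "objective" that PPO builds from the advantage function is its clipped surrogate objective. For reference, the standard formulation from Schulman et al. (2017), where $\hat{A}_t$ is the estimated advantage, $r_t(\theta)$ the probability ratio between the new and old policies, and $\epsilon$ the clip range (symbols shown here for context only; they are not defined in the patched file):

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$

The clip term bounds how far a single update can move the policy away from $\pi_{\theta_{\mathrm{old}}}$, which is why PPO maximizes this advantage-weighted surrogate rather than $J(\pi)$ directly.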