diff --git a/index.qmd b/index.qmd index bb10bf6..e086de7 100644 --- a/index.qmd +++ b/index.qmd @@ -21,21 +21,21 @@ Stanford, May 2025, THK ## Structure of this book The book has three parts which introduce fundamental models, present learning paradigms, and discuss assumptions. -### Background -We provide background on axioms underlying comparisons in **Chapter 1**. We discover key modeling assumptions It covers random preference models the Independence of Irrelevant Alternatives (IIA), and types of comparison data (binary rankings, accept-reject, lists). The chapter also discusses the main limitations of IIA based on heterogeneity. +### Part 1: Background +We provide background on axioms underlying comparisons in **Chapter 1**. We discover key modeling assumptions. It covers random preference models, the Independence of Irrelevant Alternatives (IIA), and types of comparison data (binary rankings, accept-reject, lists). The chapter also discusses the main limitations of IIA based on heterogeneity. -### Learning +### Part 2: Learning The second part introduces several approaches to learning from comparisons. - **Chapter 2** considers a setting where comparison data is given and studies both maximum likelihood and posterior-based learning of comparison models. It uses case studies from language modeling and robotics. We discuss the challenges in learning multimodal/heterogenous rewards that fail to satisfy IIA. -- **Chapter 3** considers active data collection of comparisons with the goal of optimal inference on comparison models using Various strategies are explored, including reducing the learner's variance, exploiting ambiguity and domain knowledge in ranking, with a case study from robotics. +- **Chapter 3** considers active data collection of comparisons with the goal of optimal inference on comparison models. 
Various strategies are explored, including reducing the learner's variance, exploiting ambiguity and domain knowledge in ranking, with a case study from robotics. -- **Chapter 4** studies processes where comparisons are used to guide decisions. We first set up the bandit approach to recommending maximal objects with respect to comparisons, and discuss dueling bandits. We then consider as well as reinforcement learning from human feedback (RLHF) to align language models that decide on which text to generate. We highlight the role of uncertainty quantification and exploration for decision-making. +- **Chapter 4** studies processes where comparisons are used to guide decisions. We first set up the bandit approach to recommending maximal objects with respect to comparisons, and discuss dueling bandits. We then consider reinforcement learning from human feedback (RLHF) to align language models that decide on which text to generate. We highlight the role of uncertainty quantification and exploration for decision-making. -- **Chapter 5** considers decision-making in the presence of heterogeneity. We first focus on dealing with heterogeneity to maximize average utility using **personalization**. We then discuss aggregation mechanisms that are voting-based and decisions that are independent of some some features of the outcome. +- **Chapter 5** considers decision-making in the presence of heterogeneity. We first focus on dealing with heterogeneity to maximize average utility using **personalization**. We then discuss aggregation mechanisms that are voting-based and decisions that are independent of some features of the outcome. -### Reflection +### Part 3: Reflection The final part of the book discusses limitations of comparison data, and opportunities resulting from stated preference data. - **Chapter 6** critiques machine learning from comparisons. 
It takes different disciplinary lenses, from social psychology, philosophy, and critical studies, to highlight where comparisons are limited in the expression of human preferences, and what are alternatives. @@ -43,15 +43,15 @@ The final part of the book discusses limitations of comparison data, and opportu - **Chapter 7** considers models that are broader than comparisons in our model, many of which we can think of as **stated preferences**. These are models in which value judgments are given in terms of Likert scales or textual descriptions. We propose ways in how such feedback can be merged with comparison data to better express preferences. ## How to engage with this book -Threre are three models of reading, and teaching with, this book. Chapter 1 is underlying all of the book, so is part of all of these pathways. +There are three models of reading, and teaching with, this book. Chapter 1 is underlying all of the book, so is part of all of these pathways. - - For practitioners and those teaching applied AI content, we recommend a reading of Chapters 1, 2, 4, and 7, which can be used as a sequence in an early graduate course on Machine Learning. It allows to highlight human data sources in an introductory machine learning course. + - For practitioners and those teaching applied AI content, we recommend a reading of Chapters 1, 2, 4, and 7, which can be used as a sequence in an early graduate course on Machine Learning. This sequence allows highlighting human data sources in an introductory machine learning course. - For people with background in discrete choice, we propose to skim Chapter 1, and study Chapters 2 and 4. These studies allow readers to integrate machine learning in their studies of discrete choice, demand models, and Industrial Organization. - - For those with deep background in machine learning, we propose to study chapters 2-4 and 7. 
These chapters maximize the amount of machine learning covered, and is suitable for a deep learning-based course of machine learning. + - For those with deep background in machine learning, we propose to study Chapters 2-4 and 7. These chapters maximize the amount of machine learning covered, and are suitable for a deep learning-based course on machine learning. - - For those interested in the methodological and theoretical foundations of machine leraning from comparisons, we recommend a reading of chapters 1, 5, 6, and 7. Chapter 1 and 5 study the underpinnings of revealed preferences and aggregation, chapter 6 critiques these assumptions, and chapter 7 looks at broader ways of eliciting preferences. It is suitable for critical study in a course on Computation and Society. + - For those interested in the methodological and theoretical foundations of machine learning from comparisons, we recommend a reading of Chapters 1, 5, 6, and 7. Chapters 1 and 5 study the underpinnings of revealed preferences and aggregation, Chapter 6 critiques these assumptions, and Chapter 7 looks at broader ways of eliciting preferences. It is suitable for critical study in a course on Computation and Society. ## Prior knowledge The book assumes knowledge of the fundamentals of statistics, linear algebra and machine learning. Many example code excerpts are written in `python`, and make experience in the `python` programming language valuable for readers. diff --git a/src/chap2.qmd b/src/chap2.qmd index 1e66632..58e9267 100644 --- a/src/chap2.qmd +++ b/src/chap2.qmd @@ -32,7 +32,7 @@ execute: By the end of this chapter you will be able to: - **Differentiate** deterministic preferences from *stochastic (random)* preferences and justify why randomness is essential for modelling noisy human choice. -- **Define** and **apply** the *Independence of Irrelevant Alternatives (IIA)* axiom, explaining how it collapses the full preference distribution to an $n$-parameter logit model. 
+- **Define** and **apply** the *Independence of Irrelevant Alternatives (IIA)* axiom, explaining how it collapses the full preference distribution to an $M$-parameter logit model. - **Derive** choice probabilities for binary comparisons (Bradley–Terry), accept–reject decisions (logistic regression), and full or partial rankings (Plackett–Luce) from a random-utility model with i.i.d. Gumbel shocks. - **Identify** and **compare** the main types of comparison data (full lists, choice-from-a-set, pairwise) and map each to the underlying random preference distribution. - **Simulate** preference data using the Ackley test function, and **visualize** how utility landscapes translate into observed choices. @@ -42,52 +42,62 @@ By the end of this chapter you will be able to: ::: +::: {.callout-note title="Notation"} +Throughout this book, we use the following conventions: +- Items are indexed by $j, j', k \in \{1, \ldots, M\}$ +- Users are indexed by $i \in \{1, \ldots, N\}$ +- Latent utility for user $i$ on item $j$ is $H_{ij} \in \mathbb{R}$ +- Item appeal (item-specific parameter) is $V_j \in \mathbb{R}$ +- Binary preference outcomes are random variables $Y_{jj'} \in \{0, 1\}$, where $Y_{jj'} = 1$ means item $j$ is preferred over $j'$ +- Probabilities are written as $p(\cdot)$ with conditioning after $\mid$ +- The sigmoid function is $\sigma(x) = 1/(1 + e^{-x})$ +::: This is a book about machine learning from human preferences. This first chapter is about generative models for human actions, in particular for *comparisons*. In classical supervised learning, comparisons implicitly arise for a trained model: If the logit of a particular output from a supervised learning model is higher for the label $y$ than $y'$ we would say that the model is *more likely* to produce $y$ than $y'$. While this introduces some amount of comparison on model outputs, it does not help us if the data is given by $y$ being preferred to $y'$, written $y \succ y'$. 
-First, hence, we introduce stochastic preferences as a model of preferences. We then discuss the most important assumption made in stochastic choice, the Independence of Irrelevant Alternatives (IIA), and discuss its advantages and pitfalls. Chapters 1-5 will restrict to comparisons, including binary comparisons, accept-reject decisions, and ranking lists. Other related data types, such as Likert scales, will be considered in @chapter-beyond. +First, hence, we introduce stochastic preferences as a model of preferences. We then discuss the most important assumption made in stochastic choice, the Independence of Irrelevant Alternatives (IIA), and discuss its advantages and pitfalls. Chapters 1-5 will restrict to comparisons, including binary comparisons, accept-reject decisions, and ranking lists. Other related data types, such as Likert scales, will be considered in a later chapter. ## Random Preferences as a Model of Comparisons {#sec-foundations} -We start with a set of **objects** $y \in Y$---be they products, robot trajectories, or language model responses. We will consider models to generate comparisons that are orders. For realism, but also for mathematical simplicity, we will assume in this book that the set $Y$ of objects is discrete and has $n$ objects. +We start with a set of **items** $j \in \{1, \ldots, M\}$---be they products, robot trajectories, or language model responses. We will consider models to generate comparisons that are orders. For realism, but also for mathematical simplicity, we will assume in this book that the set of items is discrete and has $M$ items. + +Comparisons may be random and are generated by random draws of (total) orders. (Total) Orders have two properties. -Comparisons may be random and are generated by random draws of (total) orders. (Total) Orders have two properties. 
+ - First, for two items $j, j'$ either $j \prec j'$ and/or $j \succ j'$ must hold, an assumption called *totality*:\footnote{One often also allows for preferences to capture a notion of equivalence, called indifference, which is unlikely to happen in random choice models we will learn from data. We use the benefit of simplified notation and restrict to no indifference.} Either $j$ is weakly preferred to $j'$ or $j'$ is weakly preferred to $j$. + - The second assumption is transitivity: if $j \succ j'$ and $j' \succ k$, then also $j \succ k$. - - First, for two objects $y, y'$ either $y \prec y'$ and/or $y \succ y'$ must hold, an assumption called *totality*:\footnote{One often also allows for preferences to capture a notion of equivalence, called indifference, which is unlikely to happen in random choice models we will learn from data. We use the benefit of simplified notation and restrict to no indifference.} Either $y$ is weakly preferred to $y'$ or $y'$ is weakly preferred to $y$. - - The second assumption is transitivity: if $y \succ y'$ and $y' \succ y''$, then also $y \succ y''$. - -In the following, we consider randomness as generated from a *decision-maker* who has an order, or preference relation, $\prec$ on a set of objects $Y$. We refer to the random object $\prec$ as the oracle preference. Each preference $\prec$ has an associated probability mass $\mathbb{P}[\mathord{\prec}]$, leading to an $(n!-1)$-dimensional vector encoding the full random preference set. (This might look like, and is already for small values of $n$ a large number. Reducing this representational complexity is a goal of this chapter.) +In the following, we consider randomness as generated from a *decision-maker* who has an order, or preference relation, $\prec$ on a set of items. We refer to the random object $\prec$ as the oracle preference. 
Each preference $\prec$ has an associated probability mass $p(\prec)$, leading to an $(M!-1)$-dimensional vector encoding the full random preference set. (This might look like, and is already for small values of $M$ a large number. Reducing this representational complexity is a goal of this chapter.) One might wonder why we need to have a random preference. Deterministic preferences are conceptually helpful constructs and are used broadly in the fields of Consumer theory (e.g., @mas1995microeconomic). However, they suffer when bringing them to data, as data is inherently noisy. -Even when allowing for randomness, assumptions we will impose in this section, be it transitivity or the Independence of Irrelevant Alternatives, are stark yet practical. In many situations they will fail, for good reasons. Whether these are human's inability to express rankings, contextual challenges of domains, or community norms, we will discuss them in, humans cannot clearly rank alternatives, their choices reflect individualistic norms, or they might have self-control pictures. Many of these wrinkles on the approach to preferences presented here is contained in @sec-beyond. Until then, we will make the fullest use of learning stochastic preferences. +Even when allowing for randomness, assumptions we will impose in this section, be it transitivity or the Independence of Irrelevant Alternatives, are stark yet practical. In many situations they will fail, for good reasons. Whether these are human's inability to express rankings, contextual challenges of domains, or community norms, we will discuss them in, humans cannot clearly rank alternatives, their choices reflect individualistic norms, or they might have self-control pictures. Many of these wrinkles on the approach to preferences presented here will be discussed in later chapters. Until then, we will make the fullest use of learning stochastic preferences. 
## Types of Comparison Data -There are different types of comparison data we may observe. We can relate them back to the population preferences $\prec$. +There are different types of comparison data we may observe. We can relate them back to the population preferences $\prec$. ### Full Preference Lists -The conceptually simplest and practically most verbose preference sampling is to get the full preference ranking, i.e. $L = (y_1, y_2, \dots, y_n)$, where $y_1 \succ y_2 \succ \cdots \succ y_n$. In this case, we know not only that $y_1$ is preferred to $y_2$, but also, by transitivity, that it is preferred to all other options. Similarly, we know that $y_2$ is preferred to all options but $y_1$, *etc.* In many cases, we do not observe full preferences as the cognitive load for humans is too high. +The conceptually simplest and practically most verbose preference sampling is to get the full preference ranking, i.e., $L = (j_1, j_2, \dots, j_M)$, where $j_1 \succ j_2 \succ \cdots \succ j_M$. In this case, we know not only that $j_1$ is preferred to $j_2$, but also, by transitivity, that it is preferred to all other options. Similarly, we know that $j_2$ is preferred to all options but $j_1$, *etc.* In many cases, we do not observe full preferences as the cognitive load for humans is too high. ### The Most-Preferred Element from a Subset: (Binary) Choices -Another type of sample is $(y, Y')$ where $y$ is the most preferred alternative from $Y'$ for a sampled preference. Formally, $y \prec y'$ for all $y' \in Y' \setminus \{y\}$---$y$ is preferred to all elements of $Y$ but $y'$. +Another type of sample is $(j, \mathcal{S})$ where $j$ is the most preferred alternative from subset $\mathcal{S}$ for a sampled preference. Formally, $j \succ k$ for all $k \in \mathcal{S} \setminus \{j\}$---$j$ is preferred to all elements of $\mathcal{S}$ other than itself. 
-Formally, the probability that we observe $(y, Y')$ is +Formally, the probability that we observe $(j, \mathcal{S})$ is $$ -\mathbb{P}[(y, Y')] = \sum_{\prec: y \prec y' \forall y' \in Y' \setminus \{y\}} \mathbb{P} [\mathord{\prec}]. +p(j \mid \mathcal{S}) = \sum_{\prec: j \succ k \; \forall k \in \mathcal{S} \setminus \{j\}} p(\prec). $$ -That is, the probability of observing $(y, Y')$ is given by the sum of all preferred samples $\prec$ such that $y$ is preferred to all $y'$ in $Y'$ other than $y$. +That is, the probability of observing $(j, \mathcal{S})$ is given by the sum of all preferred samples $\prec$ such that $j$ is preferred to all $k$ in $\mathcal{S}$ other than $j$. -If the choice is binary, $Y' = \{y, y'\}$, we also write $(y \succ y')$ for a sample $(x, \{x,y\})$. We highlight that these objects are random, and depend on the sample of $\prec$. Binary data is convenient and quick to elicit and has been prominently applied in language model finetuning and evaluation. +If the choice is binary, $\mathcal{S} = \{j, j'\}$, we write this as $Y_{jj'} = 1$ (item $j$ preferred over $j'$). We highlight that these outcomes are random, and depend on the sample of $\prec$. Binary data is convenient and quick to elicit and has been prominently applied in language model finetuning and evaluation. -Sometimes, particularly when a decision-maker is offered an object or "nothing", we will implicitly assume that there is an "outside option" $y_0$ in $Y$, allowing us to interpret $(y, \{y, y_0\})$ as "accepting" $y$, and $(y_0, \{y, y_0\})$ as rejecting it. Outside options can be thought of as fundamental limits to what a system designer can obtain. Consider a recommendation system. A user of that system might engage with content or not. In principle, instead of engaging, they will do something else. We do not model this in out set of objects $Y$ as a fundamental abstraction. 
*All models are wrong, but some are useful.* +Sometimes, particularly when a decision-maker is offered an item or "nothing", we will implicitly assume that there is an "outside option" indexed by $0$, allowing us to interpret $Y_{j0} = 1$ as "accepting" item $j$, and $Y_{j0} = 0$ as rejecting it. Outside options can be thought of as fundamental limits to what a system designer can obtain. Consider a recommendation system. A user of that system might engage with content or not. In principle, instead of engaging, they will do something else. We do not model this explicitly as a fundamental abstraction. *All models are wrong, but some are useful.* ### Mind the Context -Choices are often conditional, and data is given by $(x, L)$ (for list-based data), $(x, y, Y')$ (for general choice-based data), or $(x, y, y')$ for binary data. $x \in X$ is some *context*: the environment of a purchase, the goal of a robot, or a user prompt for a large language model. It can also be a prompt to the decision-maker, e.g., to human raters on whether they should pick preferences based on helplessness or harmlessness @ganguli2022redteaminglanguagemodels. The inclusion of context in learning allows for the generalization of preferences, as we will see in subsequent chapters. +Choices are often conditional, and data is given by $(i, L)$ (for list-based data), $(i, j, \mathcal{S})$ (for general choice-based data), or $(i, j, j')$ for binary data. Here $i$ indexes the *user* or *context*: the environment of a purchase, the goal of a robot, or a user prompt for a large language model. It can also be a prompt to the decision-maker, e.g., to human raters on whether they should pick preferences based on helpfulness or harmlessness @ganguli2022redteaminglanguagemodels. The inclusion of context in learning allows for the generalization of preferences, as we will see in subsequent chapters. 
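As a minimal sketch of the choice probability $p(j \mid \mathcal{S})$ defined earlier in this section, the following enumerates all $M!$ total orders for a small $M$ and sums the mass of the orders in which $j$ is ranked above every other element of $\mathcal{S}$. The helper names `all_orders` and `choice_probability`, and the Dirichlet-sampled distribution `p`, are illustrative assumptions and are not taken from the book's code.

```python
import itertools
import numpy as np

def all_orders(M):
    """All M! total orders; each tuple lists items from most to least preferred."""
    return list(itertools.permutations(range(1, M + 1)))

def choice_probability(j, S, orders, p):
    """p(j | S): total mass of the orders that rank j above every other item in S."""
    total = 0.0
    for order, mass in zip(orders, p):
        rank = {item: pos for pos, item in enumerate(order)}  # smaller position = more preferred
        if all(rank[j] < rank[k] for k in S if k != j):
            total += mass
    return total

M = 3
orders = all_orders(M)

# Assumption: an arbitrary distribution over the M! orders, for illustration only.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(len(orders)))

S = {1, 2, 3}
probs = {j: choice_probability(j, S, orders, p) for j in S}  # choice from the full set
binary = choice_probability(1, {1, 2}, orders, p)            # p(Y_{12} = 1)
```

Because every order has exactly one most-preferred element in $\mathcal{S}$, the values in `probs` sum to one, and `binary` together with the probability of the reverse comparison also sums to one. The enumeration is only feasible for small $M$, since the number of orders grows as $M!$.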
## Random Utility Models -An equivalent way to represent random preferences is to identify a sample $\prec$ with a vector $u_{\prec} = (u_{\mathord{\prec}} (y))_{y \in Y} \in \mathbb R^Y$ where $y \succ y'$ if and only if $u(y) > u(y')$. (For the concerned reader: We assume that $u(y) = u(y')$ happens with zero probability; and for discrete $Y$ such a vector always exists.) +An equivalent way to represent random preferences is to identify a sample $\prec$ with a vector of utilities $H = (H_j)_{j=1}^M \in \mathbb{R}^M$ where $j \succ j'$ if and only if $H_j > H_{j'}$. (For the concerned reader: We assume that $H_j = H_{j'}$ happens with zero probability; and for discrete item sets such a vector always exists.) To get a sense for different random utility models, we consider a particular model that has the complexity of many models in modern machine learning: The Ackley function. In this model, each alternative is represented by a $d$-dimensional vector $(x_1, \ldots, x_d) \in \mathbb{R}^d$, the Ackley function is given by $$ @@ -325,69 +335,69 @@ plt.scatter(items[non_preferred, 0], items[non_preferred, 1], c='purple', label= plt.legend() plt.show() ``` -Similarly, we can sample data for $(y, Y')$ for $y \in Y' \subseteq Y$. +Similarly, we can sample data for $(j, \mathcal{S})$ for $j \in \mathcal{S} \subseteq \{1, \ldots, M\}$. -## Mean Utilities +## Mean Utilities -We can view a random utility model $u_{\mathord{\prec}}$ as a deterministic part and a random part: +We can view a random utility model $H$ as a deterministic part and a random part: $$ -u_{\mathord{\prec}} (y) = u(y) + \varepsilon_{\mathord{\prec}y}. +H_j = V_j + \varepsilon_j. $$ -The vector $u(y)$ is deterministic, and a vector $(\varepsilon_{\mathord{\prec}y})_{y \in Y}$ of noise is independent for different $\prec$. We say that $u(y)$ is the mean utility and $\varepsilon_{\mathord{\prec}y}$ is the noise. There are different forms of noise possible. 
We will focus on a particular one (called Type-1 Extreme value), but others are also popular, for example Gaussian noise. +The vector $(V_j)_{j=1}^M$ is deterministic, and a vector $(\varepsilon_j)_{j=1}^M$ of noise is independent across items. We say that $V_j$ is the mean utility and $\varepsilon_j$ is the noise. When users are modeled explicitly (as in Chapter 3), the latent utility $H_{ij}$ depends on both user and item parameters; for now, we focus on the single-user or population-level case. There are different forms of noise possible. We will focus on a particular one (called Type-1 Extreme value), but others are also popular, for example Gaussian noise. -There are (at least) three different ways to view this noise: +There are (at least) three different ways to view this noise: - - Either it is capturing the heterogeneity of different decision-makers---a view that is taken in the Economics field of Industrial Organization. Under this view, observing $(y, Y')$ more frequently than $(y', Y')$ is a sign of there being a higher number of decision-makers preferring $y$ over $y'$ than the other way around. - - or as errors of a decision-maker's optimization of utilities $u(y)$. This view is endorsed in the literature on Bounded Rationality. Under this view, it cannot be directly concluded from frequent observation of $(y, Y')$ compared to $(y', Y')$ that $y$ is preferred to $y$. It might have been chosen in error. - - We can also view it as a belief of the designer about the preferences $(u(y))_{y \in Y}$. In this view, the posterior after observing data can be used to make claims about relative preferences. + - Either it is capturing the heterogeneity of different decision-makers---a view that is taken in the Economics field of Industrial Organization. Under this view, observing $(j, \mathcal{S})$ more frequently than $(j', \mathcal{S})$ is a sign of there being a higher number of decision-makers preferring $j$ over $j'$ than the other way around. 
+ - or as errors of a decision-maker's optimization of utilities $V_j$. This view is endorsed in the literature on Bounded Rationality. Under this view, it cannot be directly concluded from frequent observation of $(j, \mathcal{S})$ compared to $(j', \mathcal{S})$ that $j$ is preferred to $j'$. It might have been chosen in error. + - We can also view it as a belief of the designer about the preferences $(V_j)_{j=1}^M$. In this view, the posterior after observing data can be used to make claims about relative preferences. The interpretation will guide our decision-making predictions in Chapters 4 and 5. We next introduce a main way to simplify learning utility functions: The axiom of Independence of Irrelevant Alternatives. ## Independence of Irrelevant Alternatives -In later chapters, we will consider cases where we sample from the most preferred elements from all objects $(y,Y)$, which we call generation task. A simple assumption will allow us to recover the probabilities of $(y, Y)$, and in fact, the full distribution of $\prec$ from binary comparisons: The so-called Independence of Irrelevant Alternatives, introduced in @luce1959individual. This assumption not only allows us to easily identify a preference model, it will also massively reduce what is needed to be estimated from data: Instead of the full $n!-1$-dimensional object, it will be sufficient to learn $n$ values. +In later chapters, we will consider cases where we sample from the most preferred elements from all items, which we call the generation task. A simple assumption will allow us to recover the probabilities of choosing any item, and in fact, the full distribution of $\prec$ from binary comparisons: The so-called Independence of Irrelevant Alternatives, introduced in @luce1959individual. 
This assumption not only allows us to easily identify a preference model, it will also massively reduce what is needed to be estimated from data: Instead of the full $(M!-1)$-dimensional object, it will be sufficient to learn $M$ values. -IIA assumes that the relative likelihood of choosing $y$ compared to z does not change whether a third alternative $w$ is in the choice set or not. Formally, for every $Y' \subseteq Y$, $y,z \in Y'$, and $w \in Y \setminus Y'$, +IIA assumes that the relative likelihood of choosing $j$ compared to $k$ does not change whether a third alternative $\ell$ is in the choice set or not. Formally, for every $\mathcal{S} \subseteq \{1, \ldots, M\}$, $j, k \in \mathcal{S}$, and $\ell \notin \mathcal{S}$, $$ -\frac{\mathbb{P}[(y, Y')]}{\mathbb{P}[(z, Y')]} = \frac{\mathbb{P}[(y, Y' \cup \{w\})]}{\mathbb{P}[(z, Y' \cup \{w\})]}. +\frac{p(j \mid \mathcal{S})}{p(k \mid \mathcal{S})} = \frac{p(j \mid \mathcal{S} \cup \{\ell\})}{p(k \mid \mathcal{S} \cup \{\ell\})}. $$ -(In particular, it must be that $\mathbb{P}[(z, Y')] \neq 0$ and $\mathbb{P}[(z, Y' \cup \{w\})] \neq 0$.) That is, the relative probability of choosing $y$ over $y''$ and $y'$ over $y''$ should be independent of whether $z$ is present in the choice set $Y' \subseteq Y$. We will show that this single assumption is sufficient to make the choice model $n$-dimensional, making learning feasible. +(In particular, it must be that $p(k \mid \mathcal{S}) \neq 0$ and $p(k \mid \mathcal{S} \cup \{\ell\}) \neq 0$.) That is, the relative probability of choosing $j$ over $k$ should be independent of whether $\ell$ is present in the choice set. We will show that this single assumption is sufficient to make the choice model $M$-dimensional, making learning feasible. -First, to our primary example: All random utility models with *independent and identically distributed* noise terms satisfy IIA. (We ask the reader to convince themselves that the Ackerman function does not satisfy IIA.) 
+First, to our primary example: All random utility models with *independent and identically distributed* noise terms satisfy IIA. (We ask the reader to convince themselves that the Ackley function does not satisfy IIA.) ::: {.callout-tip title="theorem"} -A random utility model $u_{\mathord{\prec}}(y)$ satisfies IIA if and only if we can write it as $u_{\mathord{\prec}}(y) = u(y) + \varepsilon_{\mathord{\prec}y}$, where $u(y)$ is deterministic and $\varepsilon_{\mathord{\prec}y}$ is sampled independently and identically from the Gumbel distribution. The Gumbel distribution has cumulative distribution function $F(x) = e^{-e^{-x}}$. +A random utility model $H_j$ satisfies IIA if and only if we can write it as $H_j = V_j + \varepsilon_j$, where $V_j$ is deterministic and $\varepsilon_j$ is sampled independently and identically from the Gumbel distribution. The Gumbel distribution has cumulative distribution function $F(x) = e^{-e^{-x}}$. -::: +::: -This is quite strong, and an **equivalence**. If we are willing to assume IIA, it is sufficient to learn $n$ parameters to characterize the full distribution---an exponential decrease in parameters to learn. The Gumbel model may be unusual in particular for those with a stronger background in machine learning. A more familiar formulation arises for the probabilities of choice. +This is quite strong, and an **equivalence**. If we are willing to assume IIA, it is sufficient to learn $M$ parameters to characterize the full distribution---an exponential decrease in parameters to learn. The Gumbel model may be unusual in particular for those with a stronger background in machine learning. A more familiar formulation arises for the probabilities of choice. ::: {.callout-tip title="theorem"} -Assume a random preference model satisfies IIA, hence $u_{\mathord{\prec}} (y)= u(y) + \varepsilon_{\mathord{\prec}y}$. Then, the probabilities of lists are: +Assume a random preference model satisfies IIA, hence $H_j = V_j + \varepsilon_j$. 
Then, the probabilities of lists are: $$ -\mathbb{P}[(y_1 \succ y_2 \succ \cdots \succ y_n)] = \frac{e^{u(y_1)}}{\sum_{i=1}^n e^{u(y_1)}} \cdot \frac{e^{u(y_2)}}{\sum_{i=2}^n e^{u(y_i)}}\cdot \frac{e^{u(y_3)}}{\sum_{i=3}^n e^{u(y_1)}} \cdots \frac{e^{u(y_{n-1})}}{ e^{u(y_{n-1})} + e^{u(y_{n})}}. +p(j_1 \succ j_2 \succ \cdots \succ j_M) = \frac{e^{V_{j_1}}}{\sum_{m=1}^M e^{V_{j_m}}} \cdot \frac{e^{V_{j_2}}}{\sum_{m=2}^M e^{V_{j_m}}} \cdots \frac{e^{V_{j_{M-1}}}}{e^{V_{j_{M-1}}} + e^{V_{j_M}}}. $$ -For choices from sets, +For choices from sets, $$ -\mathbb{P} [(y, Y')] = \frac{e^{u(y)}}{\sum_{y' \in Y'} e^{u(y')}} = \operatorname{softmax}_y ((u(y'))_{y' \in Y'}). +p(j \mid \mathcal{S}) = \frac{e^{V_j}}{\sum_{k \in \mathcal{S}} e^{V_k}} = \operatorname{softmax}_j ((V_k)_{k \in \mathcal{S}}). $$ In particular, for binary comparisons $$ -\mathbb{P} [(y_1 \succ y_2)] = \frac{e^{u(y_1)}}{e^{u(y_1)} + e^{u(y_1)}} = \frac{1}{1 + e^{u(y_1) - u(y_2)}} = \sigma (u(y_1) - u(y_2)). +p(Y_{jj'} = 1) = \frac{e^{V_j}}{e^{V_j} + e^{V_{j'}}} = \frac{1}{1 + e^{-(V_j - V_{j'})}} = \sigma(V_j - V_{j'}). $$ -where $\sigma = 1/(1 + e^x)$ is the sigmoid function. +where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. ::: -In particular, the choice probabilities $\mathbb{P} [(y, Y)]$ are equivalent to the multi-class logistic regression model (also called multinomial logit model), and generation from this model, that is, sampling $y$ with probability $\mathbb{P}[(y, Y)]$ is given by softmax-sampling $\operatorname{softmax}((u(y))_{y \in Y})$. +In particular, the choice probabilities $p(j \mid \mathcal{S})$ are equivalent to the multi-class logistic regression model (also called multinomial logit model), and generation from this model, that is, sampling $j$ with probability $p(j \mid \{1, \ldots, M\})$ is given by softmax-sampling $\operatorname{softmax}((V_j)_{j=1}^M)$. -This model has many names, depending on the feedback type we consider. 
For binary comparison data, it is the Bradley-Terry model @bradley1952rank. If data is in forms of list, it is called the Plackett-Luce model @plackett1975analysis. For accept-reject sampling it is also called logistic regression. For choices from subsets $Y$, is is called the (discrete choice) logit model. We will usually call the model the **logit model** and specify the feedback type. +This model has many names, depending on the feedback type we consider. For binary comparison data, it is the Bradley-Terry model @bradley1952rank. If the data come in the form of lists, it is called the Plackett-Luce model @plackett1975analysis. For accept-reject sampling it is also called logistic regression. For choices from subsets, it is called the (discrete choice) logit model. We will usually call the model the **logit model** and specify the feedback type. -The IIA assumption has many desirable properties, such as stochastic transitivity and relativity. The reader is asked to prove them in the exercises to this chapter. The learning of, and optimization based on, the mean utilities $(u(y))_{y \in Y}$ is one of the central goal of this book. Supervised learning based on it will be covered in the next chapter. +The IIA assumption has many desirable properties, such as stochastic transitivity and regularity. The reader is asked to prove them in the exercises to this chapter. The learning of, and optimization based on, the mean utilities $(V_j)_{j=1}^M$ is one of the central goals of this book. Supervised learning based on it will be covered in the next chapter. -One note on identification, that is, whether different utility functions $u$ generate the same random preference model $\prec$. The models that are implied by $u(y)$ and by $u(y) + c$ for any constant $c \in \mathbb{R}$ are the same, which means that any learning of the function $u$, which we will engage in in the next chapters, will need to fix one of the values $u(y)$. 
If there is an outside option $y_0$, then, it is typically chosen $u(y_0) = 0$ and all mean utilities are in comparison to the outside option. +One note on identification, that is, on whether different utility vectors $V$ generate the same random preference model $\prec$. The models that are implied by $(V_j)$ and by $(V_j + c)$ for any constant $c \in \mathbb{R}$ are the same, which means that any learning of the utilities, which we will engage in in the next chapters, will need to fix one of the values $V_j$. If there is an outside option indexed by $0$, one typically sets $V_0 = 0$, so that all mean utilities are measured relative to the outside option. IIA has limitations, which might require to allow for more flexible specifications of noise and heterogeneity. @@ -397,10 +407,10 @@ IIA is surprisingly strong, but does not allow for choice probabilities that are ### IIA and Heterogeneity -A crucial shortcoming of IIA is that if sub-populations satisfy IIA this does not mean that the full population satisfies IIA. Assume that our population consists of sub-populations $i = 1, \dots, m$ which have mass $\alpha_1, \alpha_2, \dots, \alpha_m$, respectively, in the population, and that each of the groups has preferences satisfying IIA. Because of IIA, we can represent each sub-group's stochastic preferences with an average utility $u_i \colon Y \to \mathbb R$, $i = 1, 2, \dots, n$. The distribution of the full population is then given by a mixture of the sub-population preferences. For example, for binary comparisons +A crucial shortcoming of IIA is that if sub-populations satisfy IIA this does not mean that the full population satisfies IIA. Assume that our population consists of sub-populations $i = 1, \dots, N$ which have mass $\alpha_1, \alpha_2, \dots, \alpha_N$, respectively, in the population, and that each of the groups has preferences satisfying IIA. 
Because of IIA, we can represent each sub-group's stochastic preferences with an average utility vector $(V_j^{(i)})_{j=1}^M$ for group $i$. The distribution of the full population is then given by a mixture of the sub-population preferences. For example, for binary comparisons $$ -\mathbb{P} [y_1 \succ y_2 ] = \sum_{i=1}^n \alpha_i \mathbb{P} [y_1 \succ y_2 | \text{ group } i] = \sum_{i = 1}^m \alpha_i \sigma (u_i (y_1) - u_i(y_2)). +p(Y_{jj'} = 1) = \sum_{i=1}^N \alpha_i \, p(Y_{jj'} = 1 \mid \text{group } i) = \sum_{i=1}^N \alpha_i \, \sigma(V_j^{(i)} - V_{j'}^{(i)}). $$ Sadly, such mixtures are far from IIA, as we see in the following coding example. @@ -551,15 +561,15 @@ plt.grid(True) plt.show() ``` -Classic ways to solve this concern is to consider a model with explicit representation of heterogeneity, where $y \prec y'$ holds if $\alpha \sim F$ for some distribution $F$, and $(u(y))_{y \in Y} \alpha$ is a random utility model with independent error terms. For example, consider a logit random utility model +A classic way to address this concern is to consider a model with an explicit representation of heterogeneity, where preferences depend on a latent type $\alpha \sim F$ for some distribution $F$, and $(V_j)_{j=1}^M$ given $\alpha$ follows a random utility model with independent error terms. For example, consider a logit random utility model $$ -u(y) = \beta^\top x + \varepsilon_y +H_j = \beta^\top x_j + \varepsilon_j $$ and assume $\beta \sim N(\mu, \Sigma)$ is a normally distributed vector, a model called the random coefficients logit model. Equivalently, we can view this as a model with correlated utility shocks. ### Similar Options -A second limitation is only relevant if we move beyond binary choices, or observe preference lists. Let $Y = \{y, y', z\}$, where $y$ and $y'$ are (almost) identical and different from $z$. (In the classical example, $y, y'$ are red and blue buses, respectively, and $z$ is a train). 
Assume an IIA model given by average utility $u \colon Y \to \mathbb{R}$. As $y$ and $y'$ are almost identical, assume $u(y) = u(y')$. We have $\mathbb{P}[z, \{y, z\}] = \mathbb{P}[z, \{y', z\}]$. How do these values compare to $\mathbb{P}[z, \{y, y', z\}]$? It would be intuitive to think that $z$ is chosen with the same frequency, as there should not be more "demand" for object $z$ only because $y$ is cloned. This is not the case. +A second limitation is only relevant if we move beyond binary choices, or observe preference lists. Let $\{1, 2, 3\}$ be three items, where items $1$ and $2$ are (almost) identical and different from item $3$. (In the classical example, items $1, 2$ are red and blue buses, respectively, and item $3$ is a train). Assume an IIA model given by average utilities $(V_j)$. As items $1$ and $2$ are almost identical, assume $V_1 = V_2$. We have $p(3 \mid \{1, 3\}) = p(3 \mid \{2, 3\})$. How do these values compare to $p(3 \mid \{1, 2, 3\})$? It would be intuitive to think that item $3$ is chosen with the same frequency, as there should not be more "demand" for item $3$ only because item $1$ is cloned. This is not the case. ::: {.callout-note title="Code"} @@ -617,56 +627,58 @@ print("After splitting (Car, Red Bus, Blue Bus):", probs_after) print("After splitting, total bus share:", probs_after[1] + probs_after[2]) ``` -The choice probability of `car` is reduced. Why is our intuition making us think that $y$ and $y'$ should split their choice probability? One option is because we assume some correlation: If you like $y$ over $z$ then you should also like $y'$ over $z$, and vice versa. Hence, we would like the correlation between choice probabilities in random utility models. For example, if we allow in a logit random utility model the error terms in $y, y'$ to be perfectly correlated (and we break ties uniformly at random), then +The choice probability of `car` is reduced. 
Why is our intuition making us think that items $1$ and $2$ should split their choice probability? One option is because we assume some correlation: If you like item $1$ over item $3$ then you should also like item $2$ over item $3$, and vice versa. Hence, we would like correlation between choice probabilities in random utility models. For example, if we allow in a logit random utility model the error terms for items $1, 2$ to be perfectly correlated (and we break ties uniformly at random), then $$ -(\mathbb{P}[y, \{y, y', z\}], \mathbb{P}[y', \{y, y', z\}], \mathbb{P}[z, \{y, y', z\}]) = \left(\frac{\mathbb{P}[y, \{y, z\}]}{2}, \frac{\mathbb{P}[y', \{y', z\}]}{2}, \frac{\mathbb{P}[z, \{y, z\}]}{2}\right), +(p(1 \mid \{1, 2, 3\}), p(2 \mid \{1, 2, 3\}), p(3 \mid \{1, 2, 3\})) = \left(\frac{p(1 \mid \{1, 3\})}{2}, \frac{p(2 \mid \{2, 3\})}{2}, p(3 \mid \{1, 3\})\right), $$ -confirming our intuition. +confirming our intuition. This is the end of our discussion of the Independence of Irrelevant Alternatives. Additional features can be found in [@train2009discrete;@ben1985discrete;@mcfadden1981econometric] and the original paper for logit analysis @mcfadden1972conditional. -The next chapter is the first to study learning of average utility functions from preference data, and assumes that a dataset is given of (average) utility functions $u \colon Y \to \mathbb R$ for different types of sampling and for different notions of "inference". +The next chapter is the first to study learning of average utility functions from preference data, and assumes that a dataset is given of utility vectors $(V_j)_{j=1}^M$ for different types of sampling and for different notions of "inference". 
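
As a numerical sanity check of the logit theorem and of the identification remark, the following minimal sketch simulates a random utility model $H_j = V_j + \varepsilon_j$ with i.i.d. standard Gumbel shocks and compares the simulated choice shares with the softmax of $V$. The utility values, the helper names `softmax` and `simulate_choice_shares`, the number of draws, and the seed are all assumptions made for illustration; they do not come from the chapter's own code.

```python
import numpy as np


def softmax(v):
    """Logit choice probabilities for a vector of mean utilities."""
    z = np.exp(v - np.max(v))  # subtract the max for numerical stability
    return z / z.sum()


def simulate_choice_shares(v, n_draws=200_000, seed=0):
    """Empirical shares of argmax_j (V_j + eps_j) with i.i.d. Gumbel eps_j."""
    rng = np.random.default_rng(seed)
    noise = rng.gumbel(loc=0.0, scale=1.0, size=(n_draws, len(v)))
    choices = np.argmax(v + noise, axis=1)  # utility-maximizing item per draw
    return np.bincount(choices, minlength=len(v)) / n_draws


# Hypothetical mean utilities for M = 3 items (illustrative values only).
V = np.array([1.0, 0.5, -0.3])

shares = simulate_choice_shares(V)
print("simulated shares:", shares)
print("softmax(V):      ", softmax(V))

# Identification: adding a constant to every V_j leaves the model unchanged.
print("softmax(V + 3.7):", softmax(V + 3.7))
```

Under these assumptions the simulated shares should agree with `softmax(V)` up to Monte Carlo error, and the shifted utilities should reproduce the same probabilities, which is why one of the values $V_j$ has to be fixed before learning.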
| Notation | Meaning | Domain / Type | |---|---|---| -| $Y$ | Finite set of objects/alternatives | $\{y_1,\dots ,y_n\}$ | -| $n$ | Number of objects | $\lvert Y\rvert\in\mathbb N$ | -| $y,y',y''$ | Generic objects in $Y$ | Elements of $Y$ | -| $y_0$ | Outside (“no-choice”) option | Element of $Y$ (reference) | -| $\prec$ / $\succ$ | Weak preference relation / its strict part | Binary relation on $Y$ | -| $\mathord{\prec}$ | Random preference (draw of $\prec$) | RV over total orders on $Y$ | -| $L=(y_1,\dots ,y_n)$ | Full ranking (preference list) | Permutation of $Y$ | -| $(y,Y')$ | Observation that $y$ is chosen from subset $Y'$ | $y\in Y'\subseteq Y$ | -| $x\in X$ | Exogenous context / features | $X$ (arbitrary feature space) | -| $u(y)$ | Mean (deterministic) utility of $y$ | $\mathbb R$ | -| $u_{\mathord{\prec}}(y)$ | Random utility in draw $\mathord{\prec}$ | $\mathbb R$ | -| $\varepsilon_{\mathord{\prec}y}$ | Stochastic utility shock | $\mathbb R$ (i.i.d.) | -| $\mathbb P[\cdot]$ | Probability measure over preferences/choices | $[0,1]$ | -| $\operatorname{softmax}_y\bigl((u(y'))_{y'\in Y'}\bigr)$ | Logit/Plackett-Luce choice probability of $y$ from $Y'$ | $[0,1]$ | -| $\sigma(z)$ | Sigmoid $1/(1+e^{-z})$ | $[0,1]$ | -| $d$ | Dimensionality of feature vectors | $\mathbb N$ | -| $\boldsymbol{x}\in\mathbb R^{d}$ | Feature vector of an object | $\mathbb R^{d}$ | -| $\text{Ackley}(\boldsymbol{x})$ | Ackley test-function value | $\mathbb R$ | +| $M$ | Number of items | $\mathbb{N}$ | +| $j, j', k$ | Item indices | $\{1, \ldots, M\}$ | +| $0$ | Outside ("no-choice") option index | Reference item | +| $\prec$ / $\succ$ | Weak preference relation / its strict part | Binary relation on items | +| $L=(j_1,\dots ,j_M)$ | Full ranking (preference list) | Permutation of items | +| $(j, \mathcal{S})$ | Observation that $j$ is chosen from subset $\mathcal{S}$ | $j \in \mathcal{S} \subseteq \{1,\ldots,M\}$ | +| $Y_{jj'}$ | Binary preference outcome ($j$ vs $j'$) | $\{0, 1\}$ 
| +| $i$ | User/context index | $\{1, \ldots, N\}$ | +| $N$ | Number of users | $\mathbb{N}$ | +| $U_i$ | User embedding/appetite | $\mathbb{R}^K$ | +| $V_j$ | Item loading/appeal | $\mathbb{R}^K$ (or $\mathbb{R}$ when $K=1$) | +| $Z_j$ | Item offset | $\mathbb{R}$ | +| $H_{ij}$ | Latent utility for user $i$, item $j$ | $\mathbb{R}$ | +| $H_j$ | Latent utility (single-user case) | $\mathbb{R}$ | +| $\varepsilon_j$ | Stochastic utility shock for item $j$ | $\mathbb{R}$ (i.i.d.) | +| $p(\cdot)$ | Probability | $[0,1]$ | +| $p(j \mid \mathcal{S})$ | Logit/Plackett-Luce choice probability of $j$ from $\mathcal{S}$ | $[0,1]$ | +| $\sigma(x)$ | Sigmoid $1/(1+e^{-x})$ | $(0,1)$ | +| $d$ | Dimensionality of feature vectors | $\mathbb{N}$ | +| $\boldsymbol{x}_j \in\mathbb{R}^{d}$ | Feature vector of item $j$ | $\mathbb{R}^{d}$ | +| $\text{Ackley}(\boldsymbol{x})$ | Ackley test-function value | $\mathbb{R}$ | | $a,b,c$ | Ackley parameters | Scalars | -| $k$ | Number of preference samples | $\mathbb N$ | | $\alpha_i$ | Population weight of subgroup $i$ | $(0,1)$ with $\sum_i\alpha_i=1$ | -| $\beta$ | Random-coefficients vector in linear RUM | $\mathbb R^{d}$ | -| $\Sigma$ | Covariance matrix of $\beta$ | $\mathbb R^{d\times d}$ | +| $\beta$ | Random-coefficients vector in linear RUM | $\mathbb{R}^{d}$ | +| $\Sigma$ | Covariance matrix of $\beta$ | $\mathbb{R}^{d\times d}$ | -: **Table 1 — Notation used in Chapter “Background”.** {#tbl-notation} +: **Table 1 — Notation used in Chapter "Background".** {#tbl-notation} ## Discussion Questions -- How does modeling preferences as **random** (rather than deterministic) help us capture real-world choice behavior? -- What is the **Independence of Irrelevant Alternatives** (IIA) axiom, and why does it simplify the estimation of choice models? -- Why do i.i.d. Gumbel shocks in a random utility model lead to the Plackett–Luce (list) and softmax/logit (choice) formulas? 
-- In what ways can **binary comparisons**, **choice-from-a-set**, and **full rankings** each be seen as observations of the same underlying stochastic preference distribution? -- What are the practical advantages and drawbacks of eliciting **full preference lists** versus **pairwise comparisons** from human subjects? -- How does introducing an “outside option” $y_{0}$ allow us to interpret **accept–reject** data within the same logit framework? -- Why does mixing multiple IIA-satisfying sub-populations generally **violate** IIA at the aggregate level? -- Explain the “**red bus–blue bus**” problem: why does splitting a single alternative into two identical ones distort logit choice probabilities? -- How does context $x$ enter the random utility framework, and what role does it play in generalizing preferences to new situations? -- What identification issues arise from the fact that adding a constant to all utilities $u(y)$ does not change observable choice probabilities? +- How does modeling preferences as **random** (rather than deterministic) help us capture real-world choice behavior? +- What is the **Independence of Irrelevant Alternatives** (IIA) axiom, and why does it simplify the estimation of choice models? +- Why do i.i.d. Gumbel shocks in a random utility model lead to the Plackett–Luce (list) and softmax/logit (choice) formulas? +- In what ways can **binary comparisons**, **choice-from-a-set**, and **full rankings** each be seen as observations of the same underlying stochastic preference distribution? +- What are the practical advantages and drawbacks of eliciting **full preference lists** versus **pairwise comparisons** from human subjects? +- How does introducing an "outside option" (item $0$) allow us to interpret **accept–reject** data within the same logit framework? +- Why does mixing multiple IIA-satisfying sub-populations generally **violate** IIA at the aggregate level? 
+- Explain the "**red bus–blue bus**" problem: why does splitting a single alternative into two identical ones distort logit choice probabilities? +- How does user/context index $i$ enter the random utility framework, and what role does it play in generalizing preferences to new situations? +- What identification issues arise from the fact that adding a constant to all utilities $V_j$ does not change observable choice probabilities? - In what scenarios would you consider **relaxing** IIA and what additional model complexity does that introduce? ## Exercises @@ -674,13 +686,13 @@ We place ⭐, ⭐⭐, and ⭐⭐⭐ for exercises we deem relatively easy, mediu ### Properties of IIA Models ⭐ -Prove that if a preference model satisfies IIA, it will also satisfy $\mathbb{P}[(y, Y')] \le \mathbb{P}[(y, Y'')]$ for any $y \in Y$ and $Y' \subseteq Y'' \subseteq Y$ (called regularity) and for all $(x,y,z)$, if $\mathbb{P}[(x, \{x,y\})] \ge 0.5$ and $\mathbb{P}[(y, \{y,z\})] \ge 0.5$, then necessarily $\mathbb{P}[(x, \{x,z\})] \ge 0.5$. +Prove that if a preference model satisfies IIA, it will also satisfy $p(j \mid \mathcal{S}) \le p(j \mid \mathcal{S}')$ for any $\mathcal{S}' \subseteq \mathcal{S}$ and any $j \in \mathcal{S}'$ (called regularity), as well as stochastic transitivity: for all $(j, k, \ell)$, if $p(j \mid \{j, k\}) \ge 0.5$ and $p(k \mid \{k, \ell\}) \ge 0.5$, then necessarily $p(j \mid \{j, \ell\}) \ge 0.5$. ### Discrete Choice Models ⭐⭐ -Consider a linear random utility model $u(y)=\beta_i^\top x+\epsilon_i$ for $i=1, 2, \cdots, N$, where $\varepsilon_y$ is i.i.d. sampled from a Gumbel distribution. We would like to compute $\mathbb{P}[(y, Y)]$ and connect it to multi-class logistic regression. +Consider a linear random utility model $H_j = \beta^\top x_j + \varepsilon_j$ for $j = 1, 2, \ldots, M$, where $\varepsilon_j$ is i.i.d. sampled from a Gumbel distribution. We would like to compute $p(j \mid \{1, \ldots, M\})$ and connect it to multi-class logistic regression. -(a) First $\mathbb{P}[u(y)