diff --git a/.Rbuildignore b/.Rbuildignore
deleted file mode 100644
index dff09a7..0000000
--- a/.Rbuildignore
+++ /dev/null
@@ -1,4 +0,0 @@
-^requirements\.txt$
-^renv$
-^renv\.lock$
-^\.github$
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
index 3047166..1c02bb2 100644
--- a/.github/workflows/publish.yml
+++ b/.github/workflows/publish.yml
@@ -31,6 +31,13 @@ jobs:
       - name: Render to html
         run: quarto render --to html --profile html --no-clean
 
+      - name: Ensure PDF is in output
+        run: |
+          # Copy PDF to _book if it exists at root level
+          if [ -f "Machine-Learning-from-Human-Preferences.pdf" ]; then
+            cp "Machine-Learning-from-Human-Preferences.pdf" "_book/"
+          fi
+
       - name: Publish to GitHub Pages
         uses: quarto-dev/quarto-actions/publish@v2
         with:
diff --git a/.gitignore b/.gitignore
index c66a6a8..ff231f2 100644
--- a/.gitignore
+++ b/.gitignore
@@ -54,7 +54,7 @@ images/management/analysis_pipeline.pdf
 *.tex
 *.pdf
 *.idx
-*.ild
+*.ilg
 *.ind
 *.lock
 *.ipynb
diff --git a/Machine-Learning-from-Human-Preferences.idx b/Machine-Learning-from-Human-Preferences.idx
deleted file mode 100644
index 2384389..0000000
--- a/Machine-Learning-from-Human-Preferences.idx
+++ /dev/null
@@ -1,10 +0,0 @@
-\indexentry{de-identification|hyperindexformat{\seealso{anonymization}}}{232}
-\indexentry{anonymization|hyperindexformat{\seealso{de-identification}}}{232}
-\indexentry{analytic flexibility|hyperindexformat{\seealso{p-hacking}}}{232}
-\indexentry{p-hacking|hyperindexformat{\seealso{analytic flexibility}}}{232}
-\indexentry{Cohen's d|hyperindexformat{\seealso{standardized mean difference (SMD)}}}{232}
-\indexentry{standardized mean difference (SMD)|hyperindexformat{\seealso{Cohen's d}}}{232}
-\indexentry{APA|hyperindexformat{\see{American Psychological Association (APA)}}}{232}
-\indexentry{CDI|hyperindexformat{\see{Communicative Development Inventory}}}{232}
-\indexentry{DAG|hyperindexformat{\see{directed acyclic graph (DAG)}}}{232}
-\indexentry{blinding|hyperindexformat{\see{masking}}}{232}
diff --git a/Machine-Learning-from-Human-Preferences.ilg b/Machine-Learning-from-Human-Preferences.ilg
deleted file mode 100644
index f80a623..0000000
--- a/Machine-Learning-from-Human-Preferences.ilg
+++ /dev/null
@@ -1,6 +0,0 @@
-This is makeindex, version 2.17 [TeX Live 2024] (kpathsea + Thai support).
-Scanning input file Machine-Learning-from-Human-Preferences.idx....done (10 entries accepted, 0 rejected).
-Sorting entries....done (34 comparisons).
-Generating output file Machine-Learning-from-Human-Preferences.ind....done (37 lines written, 0 warnings).
-Output written in Machine-Learning-from-Human-Preferences.ind.
-Transcript written in Machine-Learning-from-Human-Preferences.ilg.
diff --git a/Machine-Learning-from-Human-Preferences.ind b/Machine-Learning-from-Human-Preferences.ind
deleted file mode 100644
index 2b1b4f7..0000000
--- a/Machine-Learning-from-Human-Preferences.ind
+++ /dev/null
@@ -1,37 +0,0 @@
-\begin{theindex}
-
-  \item analytic flexibility,
-    \hyperindexformat{\seealso{p-hacking}}{232}
-  \item anonymization,
-    \hyperindexformat{\seealso{de-identification}}{232}
-  \item APA,
-    \hyperindexformat{\see{American Psychological Association (APA)}}{232}
-
-  \indexspace
-
-  \item blinding, \hyperindexformat{\see{masking}}{232}
-
-  \indexspace
-
-  \item CDI,
-    \hyperindexformat{\see{Communicative Development Inventory}}{232}
-  \item Cohen's d,
-    \hyperindexformat{\seealso{standardized mean difference (SMD)}}{232}
-
-  \indexspace
-
-  \item DAG, \hyperindexformat{\see{directed acyclic graph (DAG)}}{232}
-  \item de-identification,
-    \hyperindexformat{\seealso{anonymization}}{232}
-
-  \indexspace
-
-  \item p-hacking,
-    \hyperindexformat{\seealso{analytic flexibility}}{232}
-
-  \indexspace
-
-  \item standardized mean difference (SMD),
-    \hyperindexformat{\seealso{Cohen's d}}{232}
-
-\end{theindex}
diff --git a/Machine-Learning-from-Human-Preferences.tex b/Machine-Learning-from-Human-Preferences.tex
deleted file mode 100644
index c1169e5..0000000
--- a/Machine-Learning-from-Human-Preferences.tex
+++ /dev/null
@@ -1,15404 +0,0 @@
-% Options for packages loaded elsewhere
-\PassOptionsToPackage{unicode}{hyperref}
-\PassOptionsToPackage{hyphens}{url}
-\PassOptionsToPackage{dvipsnames,svgnames,x11names}{xcolor}
-%
-\documentclass[
-  letterpaper,
-  numbers=noenddot,
-  DIV=11]{scrreprt}
-
-\usepackage{amsmath,amssymb}
-\usepackage{iftex}
-\ifPDFTeX
-  \usepackage[T1]{fontenc}
-  \usepackage[utf8]{inputenc}
-  \usepackage{textcomp} % provide euro and other symbols
-\else % if luatex or xetex
-  \usepackage{unicode-math}
-  \defaultfontfeatures{Scale=MatchLowercase}
-  \defaultfontfeatures[\rmfamily]{Ligatures=TeX,Scale=1}
-\fi
-\usepackage{lmodern}
-\ifPDFTeX\else
-  % xetex/luatex font selection
-\fi
-% Use upquote if available, for straight quotes in verbatim environments
-\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
-\IfFileExists{microtype.sty}{% use microtype if available
-  \usepackage[]{microtype}
-  \UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
-}{}
-\makeatletter
-\@ifundefined{KOMAClassName}{% if non-KOMA class
-  \IfFileExists{parskip.sty}{%
-    \usepackage{parskip}
-  }{% else
-    \setlength{\parindent}{0pt}
-    \setlength{\parskip}{6pt plus 2pt minus 1pt}}
-}{% if KOMA class
-  \KOMAoptions{parskip=half}}
-\makeatother
-\usepackage{xcolor}
-\setlength{\emergencystretch}{3em} % prevent overfull lines
-\setcounter{secnumdepth}{5}
-% Make \paragraph and \subparagraph free-standing
-\makeatletter
-\ifx\paragraph\undefined\else
-  \let\oldparagraph\paragraph
-  \renewcommand{\paragraph}{
-    \@ifstar
-      \xxxParagraphStar
-      \xxxParagraphNoStar
-  }
-  \newcommand{\xxxParagraphStar}[1]{\oldparagraph*{#1}\mbox{}}
-  \newcommand{\xxxParagraphNoStar}[1]{\oldparagraph{#1}\mbox{}}
-\fi
-\ifx\subparagraph\undefined\else
-  \let\oldsubparagraph\subparagraph
-  \renewcommand{\subparagraph}{
-    \@ifstar
-      \xxxSubParagraphStar
-      \xxxSubParagraphNoStar
-  }
-  \newcommand{\xxxSubParagraphStar}[1]{\oldsubparagraph*{#1}\mbox{}}
-  \newcommand{\xxxSubParagraphNoStar}[1]{\oldsubparagraph{#1}\mbox{}}
-\fi
-\makeatother
-
-\usepackage{color}
-\usepackage{fancyvrb}
-\newcommand{\VerbBar}{|}
-\newcommand{\VERB}{\Verb[commandchars=\\\{\}]}
-\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
-% Add ',fontsize=\small' for more characters per line
-\usepackage{framed}
-\definecolor{shadecolor}{RGB}{241,243,245}
-\newenvironment{Shaded}{\begin{snugshade}}{\end{snugshade}}
-\newcommand{\AlertTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
-\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
-\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.40,0.45,0.13}{#1}}
-\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
-\newcommand{\BuiltInTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
-\newcommand{\CharTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
-\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
-\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
-\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{#1}}
-\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{\textbf{#1}}}
-\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
-\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
-\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
-\newcommand{\ErrorTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
-\newcommand{\ExtensionTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
-\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
-\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0.28,0.35,0.67}{#1}}
-\newcommand{\ImportTok}[1]{\textcolor[rgb]{0.00,0.46,0.62}{#1}}
-\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
-\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{\textbf{#1}}}
-\newcommand{\NormalTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
-\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
-\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
-\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.68,0.00,0.00}{#1}}
-\newcommand{\RegionMarkerTok}[1]{\textcolor[rgb]{0.00,0.23,0.31}{#1}}
-\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
-\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
-\newcommand{\StringTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
-\newcommand{\VariableTok}[1]{\textcolor[rgb]{0.07,0.07,0.07}{#1}}
-\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.13,0.47,0.30}{#1}}
-\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
-
-\providecommand{\tightlist}{%
-  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}\usepackage{longtable,booktabs,array}
-\usepackage{calc} % for calculating minipage widths
-% Correct order of tables after \paragraph or \subparagraph
-\usepackage{etoolbox}
-\makeatletter
-\patchcmd\longtable{\par}{\if@noskipsec\mbox{}\fi\par}{}{}
-\makeatother
-% Allow footnotes in longtable head/foot
-\IfFileExists{footnotehyper.sty}{\usepackage{footnotehyper}}{\usepackage{footnote}}
-\makesavenoteenv{longtable}
-\usepackage{graphicx}
-\makeatletter
-\newsavebox\pandoc@box
-\newcommand*\pandocbounded[1]{% scales image to fit in text height/width
-  \sbox\pandoc@box{#1}%
-  \Gscale@div\@tempa{\textheight}{\dimexpr\ht\pandoc@box+\dp\pandoc@box\relax}%
-  \Gscale@div\@tempb{\linewidth}{\wd\pandoc@box}%
-  \ifdim\@tempb\p@<\@tempa\p@\let\@tempa\@tempb\fi% select the smaller of both
-  \ifdim\@tempa\p@<\p@\scalebox{\@tempa}{\usebox\pandoc@box}%
-  \else\usebox{\pandoc@box}%
-  \fi%
-}
-% Set default figure placement to htbp
-\def\fps@figure{htbp}
-\makeatother
-% definitions for citeproc citations
-\NewDocumentCommand\citeproctext{}{}
-\NewDocumentCommand\citeproc{mm}{%
-  \begingroup\def\citeproctext{#2}\cite{#1}\endgroup}
-\makeatletter
- % allow citations to break across lines
- \let\@cite@ofmt\@firstofone
- % avoid brackets around text for \cite:
- \def\@biblabel#1{}
- \def\@cite#1#2{{#1\if@tempswa , #2\fi}}
-\makeatother
-\newlength{\cslhangindent}
-\setlength{\cslhangindent}{1.5em}
-\newlength{\csllabelwidth}
-\setlength{\csllabelwidth}{3em}
-\newenvironment{CSLReferences}[2] % #1 hanging-indent, #2 entry-spacing
- {\begin{list}{}{%
-  \setlength{\itemindent}{0pt}
-  \setlength{\leftmargin}{0pt}
-  \setlength{\parsep}{0pt}
-  % turn on hanging indent if param 1 is 1
-  \ifodd #1
-   \setlength{\leftmargin}{\cslhangindent}
-   \setlength{\itemindent}{-1\cslhangindent}
-  \fi
-  % set entry spacing
-  \setlength{\itemsep}{#2\baselineskip}}}
- {\end{list}}
-\usepackage{calc}
-\newcommand{\CSLBlock}[1]{\hfill\break\parbox[t]{\linewidth}{\strut\ignorespaces#1\strut}}
-\newcommand{\CSLLeftMargin}[1]{\parbox[t]{\csllabelwidth}{\strut#1\strut}}
-\newcommand{\CSLRightInline}[1]{\parbox[t]{\linewidth - \csllabelwidth}{\strut#1\strut}}
-\newcommand{\CSLIndent}[1]{\hspace{\cslhangindent}#1}
-
-%%%%%%%%%%%%%%%%%%%%
-% start preamble.tex
-%%%%%%%%%%%%%%%%%%%%
-
-\input{resources/tex/boxes.tex}
-
-% page layout
-\usepackage{geometry}
-\geometry{
-  dvips=false, pdftex=false, vtex=false, % drivers can have unexpected behaviors
-  papersize={8in,10in}, % size specified by MIT Press
-  centering, % split margins equally
-  margin=.6in, % margins (must all be at least .5in)
-  includemp, includehead % include sidenotes & header in body
-  % showframe % show page structure (for debugging)
-}
-
-% set fonts
-% \setmainfont[]{ETbb}
-\setmainfont{ETbb}[
-  UprightFont = {*-Regular},
-  BoldFont = {*-Bold},
-  ItalicFont = {*-Italic},
-  BoldItalicFont = {*-BoldItalic},
-  Path = {./resources/fonts/ETbb/},
-  Extension = {.otf}
-]
-
-\setsansfont{SourceSansPro}[
-  UprightFont = {*-Regular},
-  % BoldFont = {*-Bold},
-  % ItalicFont = {*-Italic},
-  Path = {./resources/fonts/},
-  Extension = {.ttf}
-]
-
-% set font specifications
-\setkomafont{disposition}{\rmfamily\itshape}
-\addtokomafont{part}{\sffamily\scshape}
-\addtokomafont{partnumber}{\sffamily\scshape}
-\addtokomafont{chapter}{\sffamily\scshape}
-\setkomafont{partentry}{\sffamily\scshape}
-\setkomafont{chapterentry}{\sffamily\scshape}
-\addtokomafont{title}{\sffamily}%\scshape}
-\addtokomafont{subtitle}{\sffamily}%\scshape}
-% \addtokomafont{author}{\sffamily}
-\addtokomafont{pagehead}{\sffamily\scshape}
-\addtokomafont{pagenumber}{\sffamily\scshape}
-
-\usepackage{amsmath}
-\usepackage{unicode-math}
-
-% adjust spacing around section headers
-\RedeclareSectionCommand[
-  runin=false,
-  afterskip=0pt % remove extra space after for section
-]{section}
-\RedeclareSectionCommand[
-  runin=false,
-  afterskip=0pt % remove extra space after for subsection
-]{subsection}
-
-% only part number on part title pages
-\renewcommand{\partformat}{\thepart}
-
-% headers/footers
-\usepackage{scrlayer-scrpage}
-\KOMAoptions{headwidth=textwithmarginpar} % make header full width
-\automark{chapter}
-\clearpairofpagestyles
-\renewcommand{\chaptermark}[1]{\markboth{#1}{}} % prevent chaptermark from uppercasing
-\ihead{%
-  \ifnum\value{chapter}>0 \thechapter\hspace{3pt} \fi % include chapter number if not 0
-  \textsc{\leftmark} % then chapter name
-}
-\ohead{\pagemark}
-\pagestyle{scrheadings}
-
-% table of contents
-\usepackage[titles]{tocloft}
-\renewcommand{\cftpartfont}{\sffamily\scshape\Large} % part title
-\renewcommand{\cftpartpagefont}{\sffamily\scshape\large} % part page number
-\setlength{\cftbeforepartskip}{1.25em} % part vspace before
-\renewcommand{\cftchapfont}{\sffamily\scshape\large} % chapter title
-\renewcommand{\cftchappagefont}{\sffamily\scshape\large} % chapter page number
-\setlength{\cftbeforechapskip}{.05em} % chapter vspace before
-
-% set chapter numbers flushright
-\newcommand{\chapnumlen}{.5em}
-\renewcommand{\cftchappresnum}{\hfill}
-\renewcommand{\cftchapaftersnum}{\hspace*{\chapnumlen}}
-\addtolength{\cftchapnumwidth}{\chapnumlen}
-% \renewcommand{\cftchapnumwidth}{\chapnumlen}
-% \addtolength{\cftchapindent}{2em}
-
-% \setlength{\cftbeforechapskip}{.25em}
-% \setlength{\cftbeforepartskip}{1.5em}
-
-\newcommand{\partnumlen}{.75em}
-\renewcommand{\cftpartpresnum}{\hfill}
-\renewcommand{\cftpartaftersnum}{\hspace*{\partnumlen}}
-% \addtolength{\cftpartnumwidth}{\partnumlen}
-\setlength{\cftpartindent}{0em}
-% \renewcommand{\cftpartnumwidth}{\partnumlen}
-
-% \renewcommand{\cftpartnumwidth}{\cftpartpagewidth}
-% \renewcommand{\cftpartnumformat}[1]{\hfill{\bfseries #1}} % Adjust font weight/style if necessary
-% \renewcommand{\cftpartnumwidth}{\numlen} % Adjust this width as needed
-% \renewcommand{\cftpartleader}{\hfill} % Use this to add the space before the number
-
-% lists
-\usepackage{enumitem}
-\setlist[itemize]{
-  label={--} % en-dash as bullet symbol
-}
-
-\usepackage{threeparttable} % for papaja apa tables
-\setlength{\tabcolsep}{4pt} % horizontal space between table columns
-
-% styling for captions
-\usepackage[format=plain]{caption}
-\usepackage{marginfix} % load before sidenotes to improve sidenote positioning
-\usepackage{sidenotes}
-\usepackage{marginnote}
-\DeclareCaptionFont{caps}{\footnotesize}
-
-\captionsetup{
-  labelfont=caps,
-  textfont=caps,
-  skip=0pt,
-  belowskip=-6pt,
-  labelsep=newline
-}
-\DeclareCaptionStyle{sidecaption}{labelfont=caps,textfont=caps,skip=6pt,belowskip=0pt,labelsep=newline}
-\DeclareCaptionStyle{marginfigure}{labelfont=caps,textfont=caps,skip=6pt,belowskip=0pt,labelsep=newline}
-\DeclareCaptionStyle{margintable}{labelfont=caps,textfont=caps,skip=6pt,labelsep=newline}
-\DeclareCaptionStyle{longtable}{labelfont=caps,textfont=caps,skip=6pt,labelsep=newline}
-
-% reset sidenote counter at start of each chapter
-\let\oldchapter\chapter
-\def\chapter{%
-  \setcounter{sidenote}{1}%
-  \oldchapter
-}
-
-\usepackage{bbm}
-\usepackage{unicode-math}
-
-\usepackage{fvextra}
-\DefineVerbatimEnvironment{Highlighting}{Verbatim}{breaklines,commandchars=\\\{\}}
-
-% space above and below equations
-% \setlength{\abovedisplayskip}{0pt}
-% \setlength{\belowdisplayskip}{0pt}
-\usepackage[nodisplayskipstretch]{setspace}
-
- % override quarto box settings
-\ifdefined\Shaded\renewenvironment{Shaded}{\begin{tcolorbox}[enhanced, borderline west={3pt}{0pt}{shadecolor}, breakable, interior hidden, frame hidden, boxrule=0pt, sharp corners]}{\end{tcolorbox}}\fi
-
-% index
-\usepackage{imakeidx}
-\makeindex[intoc=true] %, columns=3, columnseprule=true, options=-s latex/indexstyles.ist]
-
-% temporary settings for copyediting
-% \setstretch{2}
-% \usepackage{lineno}
-% \linenumbers
-
-%%%%%%%%%%%%%%%%%%
-% end preamble.tex
-%%%%%%%%%%%%%%%%%%
-\makeatletter
-\@ifpackageloaded{tcolorbox}{}{\usepackage[skins,breakable]{tcolorbox}}
-\@ifpackageloaded{fontawesome5}{}{\usepackage{fontawesome5}}
-\definecolor{quarto-callout-color}{HTML}{909090}
-\definecolor{quarto-callout-note-color}{HTML}{0758E5}
-\definecolor{quarto-callout-important-color}{HTML}{CC1914}
-\definecolor{quarto-callout-warning-color}{HTML}{EB9113}
-\definecolor{quarto-callout-tip-color}{HTML}{00A047}
-\definecolor{quarto-callout-caution-color}{HTML}{FC5300}
-\definecolor{quarto-callout-color-frame}{HTML}{acacac}
-\definecolor{quarto-callout-note-color-frame}{HTML}{4582ec}
-\definecolor{quarto-callout-important-color-frame}{HTML}{d9534f}
-\definecolor{quarto-callout-warning-color-frame}{HTML}{f0ad4e}
-\definecolor{quarto-callout-tip-color-frame}{HTML}{02b875}
-\definecolor{quarto-callout-caution-color-frame}{HTML}{fd7e14}
-\makeatother
-\makeatletter
-\@ifpackageloaded{bookmark}{}{\usepackage{bookmark}}
-\makeatother
-\makeatletter
-\@ifpackageloaded{caption}{}{\usepackage{caption}}
-\AtBeginDocument{%
-\ifdefined\contentsname
-  \renewcommand*\contentsname{Table of contents}
-\else
-  \newcommand\contentsname{Table of contents}
-\fi
-\ifdefined\listfigurename
-  \renewcommand*\listfigurename{List of Figures}
-\else
-  \newcommand\listfigurename{List of Figures}
-\fi
-\ifdefined\listtablename
-  \renewcommand*\listtablename{List of Tables}
-\else
-  \newcommand\listtablename{List of Tables}
-\fi
-\ifdefined\figurename
-  \renewcommand*\figurename{Figure}
-\else
-  \newcommand\figurename{Figure}
-\fi
-\ifdefined\tablename
-  \renewcommand*\tablename{Table}
-\else
-  \newcommand\tablename{Table}
-\fi
-}
-\@ifpackageloaded{float}{}{\usepackage{float}}
-\floatstyle{ruled}
-\@ifundefined{c@chapter}{\newfloat{codelisting}{h}{lop}}{\newfloat{codelisting}{h}{lop}[chapter]}
-\floatname{codelisting}{Listing}
-\newcommand*\listoflistings{\listof{codelisting}{List of Listings}}
-\usepackage{amsthm}
-\theoremstyle{definition}
-\newtheorem{definition}{Definition}[chapter]
-\theoremstyle{plain}
-\newtheorem{proposition}{Proposition}[chapter]
-\theoremstyle{remark}
-\AtBeginDocument{\renewcommand*{\proofname}{Proof}}
-\newtheorem*{remark}{Remark}
-\newtheorem*{solution}{Solution}
-\newtheorem{refremark}{Remark}[chapter]
-\newtheorem{refsolution}{Solution}[chapter]
-\makeatother
-\makeatletter
-\makeatother
-\makeatletter
-\@ifpackageloaded{caption}{}{\usepackage{caption}}
-\@ifpackageloaded{subcaption}{}{\usepackage{subcaption}}
-\makeatother
-\makeatletter
-\@ifpackageloaded{algorithm}{}{\usepackage{algorithm}}
-\makeatother
-\makeatletter
-\@ifpackageloaded{algpseudocode}{}{\usepackage{algpseudocode}}
-\makeatother
-\makeatletter
-\@ifpackageloaded{caption}{}{\usepackage{caption}}
-\makeatother
-
-\usepackage{bookmark}
-
-\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
-\urlstyle{same} % disable monospaced font for URLs
-% Make links footnotes instead of hotlinks:
-\DeclareRobustCommand{\href}[2]{#2\footnote{\url{#1}}}
-\hypersetup{
-  pdftitle={Machine Learning from Human Preferences},
-  pdfauthor={Sang T. Truong; Andreas Haupt; Sanmi Koyejo},
-  colorlinks=true,
-  linkcolor={DarkBlue},
-  filecolor={Maroon},
-  citecolor={DarkGreen},
-  urlcolor={DarkGreen},
-  pdfcreator={LaTeX via pandoc}}
-
-
-\title{Machine Learning from Human Preferences}
-\author{Sang T. Truong \and Andreas Haupt \and Sanmi Koyejo}
-\date{2025-05-20}
-
-\begin{document}
-\newgeometry{}
-
-\begin{titlepage}
-\end{titlepage}
-
-\begin{titlepage}
-  \centering
-  {\usekomafont{title}\scshape\Huge Machine Learning from Human
-Preferences\par}\clearpage
-\end{titlepage}
-
-\begin{titlepage}
-  \begin{center}
-    {\usekomafont{title}\scshape\Huge Machine Learning from Human
-Preferences\par}
-    \vskip 1em
-    {\usekomafont{subtitle}\LARGE \par}
-    \vskip 1em
-    \setstretch{1.5}
-    {\usekomafont{author} Sang T. Truong, Andreas Haupt, and~Sanmi
-Koyejo \par}
-    \vfill
-    {\rmfamily\large Stanford University\\Stanford, CA}
-  \end{center}
-\end{titlepage}
-
-\begin{titlepage}
-  \vspace*{\fill}
-  {\rmfamily\scriptsize
-    © 2025 Stanford University\par
-    All rights reserved.
-  }
-  \vspace*{\fill}
-\end{titlepage}
-
-\restoregeometry{}
-\RecustomVerbatimEnvironment{verbatim}{Verbatim}{
-  showspaces = false,
-  showtabs = false,
-  breaksymbolleft={},
-  breaklines
-}
-\numberwithin{algorithm}{chapter}
-\algrenewcommand{\algorithmiccomment}[1]{\hskip3em$\rightarrow$ #1}
-
-\floatname{algorithm}{Algorithm}
-
-\numberwithin{algorithm}{chapter}
-
-\renewcommand*\contentsname{Table of Contents}
-{
-\hypersetup{linkcolor=}
-\setcounter{tocdepth}{1}
-\tableofcontents
-}
-
-\phantomsection\label{sec-intro}
-\bookmarksetup{startatroot}
-
-\chapter*{Introduction}
-\addcontentsline{toc}{chapter}{Introduction}
-
-\markboth{Introduction}{Introduction}
-
-Machine learning is increasingly shaping various aspects of our lives,
-from education and healthcare to scientific discovery. A key challenge
-in developing trustworthy intelligent systems is ensuring they align
-with human preferences. Learning from human feedback offers a promising
-solution to this challenge. This book introduces the foundations and
-practical applications of machine learning from human preferences.
-Instead of manually predefining the learning goal, the book presents
-preference-based learning that incorporates human feedback to guide the
-learning process, drawing insights from related fields such as
-economics, psychology, and human-computer interaction. By the end of
-this book, readers will be equipped with the key concepts and tools
-needed to design systems that effectively align with human preferences.
-
-The book is intended for researchers, practitioners, and students who
-are interested in intergrating machine learning with human-centered
-application. We assume some basic knowledge of probability and
-statistics, but provides sufficient background and references for the
-readers to follow the main ideas. The book also provides illustrative
-program examples and datasets. The field of machine learning from human
-preference is a vibrant area of research and practice with many open
-challenges and opportunities, and we hope that this book will inspire
-readers to further explore and advance this exciting field.
-
-We hope with the present book to both allow more use of human
-preferences in machine learning, and new data modalities as Artificial
-Intelligence systems become increasingly important.
-
-Stanford, May 2025, THK
-
-\section*{Structure of this book}\label{structure-of-this-book}
-\addcontentsline{toc}{section}{Structure of this book}
-
-\markright{Structure of this book}
-
-The book has three parts which introduce fundamental models, present
-learning paradigms, and discuss assumptions.
-
-\subsection*{Background}\label{background}
-\addcontentsline{toc}{subsection}{Background}
-
-We provide background on axioms underlying comparisons in
-\textbf{Chapter 1}. We discover key modeling assumptions It covers
-random preference models the Independence of Irrelevant Alternatives
-(IIA), and types of comparison data (binary rankings, accept-reject,
-lists). The chapter also discusses the main limitations of IIA based on
-heterogeneity.
-
-\subsection*{Learning}\label{learning}
-\addcontentsline{toc}{subsection}{Learning}
-
-The second part introduces several approaches to learning from
-comparisons.
-
-\begin{itemize}
-\item
-  \textbf{Chapter 2} considers a setting where comparison data is given
-  and studies both maximum likelihood and posterior-based learning of
-  comparison models. It uses case studies from language modeling and
-  robotics. We discuss the challenges in learning
-  multimodal/heterogenous rewards that fail to satisfy IIA.
-\item
-  \textbf{Chapter 3} considers active data collection of comparisons
-  with the goal of optimal inference on comparison models using Various
-  strategies are explored, including reducing the learner's variance,
-  exploiting ambiguity and domain knowledge in ranking, with a case
-  study from robotics.
-\item
-  \textbf{Chapter 4} studies processes where comparisons are used to
-  guide decisions. We first set up the bandit approach to recommending
-  maximal objects with respect to comparisons, and discuss dueling
-  bandits. We then consider as well as reinforcement learning from human
-  feedback (RLHF) to align language models that decide on which text to
-  generate. We highlight the role of uncertainty quantification and
-  exploration for decision-making.
-\item
-  \textbf{Chapter 5} considers decision-making in the presence of
-  heterogeneity. We first focus on dealing with heterogeneity to
-  maximize average utility using \textbf{personalization}. We then
-  discuss aggregation mechanisms that are voting-based and decisions
-  that are independent of some some features of the outcome.
-\end{itemize}
-
-\subsection*{Reflection}\label{reflection}
-\addcontentsline{toc}{subsection}{Reflection}
-
-The final part of the book discusses limitations of comparison data, and
-opportunities resulting from stated preference data.
-
-\begin{itemize}
-\item
-  \textbf{Chapter 6} critiques machine learning from comparisons. It
-  takes different disciplinary lenses, from social psychology,
-  philosophy, and critical studies, to highlight where comparisons are
-  limited in the expression of human preferences, and what are
-  alternatives.
-\item
-  \textbf{Chapter 7} considers models that are broader than comparisons
-  in our model, many of which we can think of as \textbf{stated
-  preferences}. These are models in which value judgments are given in
-  terms of Likert scales or textual descriptions. We propose ways in how
-  such feedback can be merged with comparison data to better express
-  preferences.
-\end{itemize}
-
-\section*{How to engage with this
-book}\label{how-to-engage-with-this-book}
-\addcontentsline{toc}{section}{How to engage with this book}
-
-\markright{How to engage with this book}
-
-Threre are three models of reading, and teaching with, this book.
-Chapter 1 is underlying all of the book, so is part of all of these
-pathways.
-
-\begin{itemize}
-\item
-  For practitioners and those teaching applied AI content, we recommend
-  a reading of Chapters 1, 2, 4, and 7, which can be used as a sequence
-  in an early graduate course on Machine Learning. It allows to
-  highlight human data sources in an introductory machine learning
-  course.
-\item
-  For people with background in discrete choice, we propose to skim
-  Chapter 1, and study Chapters 2 and 4. These studies allow readers to
-  integrate machine learning in their studies of discrete choice, demand
-  models, and Industrial Organization.
-\item
-  For those with deep background in machine learning, we propose to
-  study chapters 2-4 and 7. These chapters maximize the amount of
-  machine learning covered, and is suitable for a deep learning-based
-  course of machine learning.
-\item
-  For those interested in the methodological and theoretical foundations
-  of machine leraning from comparisons, we recommend a reading of
-  chapters 1, 5, 6, and 7. Chapter 1 and 5 study the underpinnings of
-  revealed preferences and aggregation, chapter 6 critiques these
-  assumptions, and chapter 7 looks at broader ways of eliciting
-  preferences. It is suitable for critical study in a course on
-  Computation and Society.
-\end{itemize}
-
-\section*{Prior knowledge}\label{prior-knowledge}
-\addcontentsline{toc}{section}{Prior knowledge}
-
-\markright{Prior knowledge}
-
-The book assumes knowledge of the fundamentals of statistics, linear
-algebra and machine learning. Many example code excerpts are written in
-\texttt{python}, and make experience in the \texttt{python} programming
-language valuable for readers.
-
-\section*{Additional Materials}\label{additional-materials}
-\addcontentsline{toc}{section}{Additional Materials}
-
-\markright{Additional Materials}
-
-Every chapter has problems for readers and slides for teaching of the
-material available. They are available on the
-\href{mlhp.stanford.edu}{book's website}.
-
-\bookmarksetup{startatroot}
-
-\chapter{Background}\label{background-1}
-
-Human preference modeling aims to capture humans' decision making
-processes in a probabilistic framework. Many problems would benefit from
-a quantitative perspective, enabling an understanding of how humans
-engage with the world. In this chapter, we will explore how one can
-model human preferences, including different formulations of such
-models, how one can optimize these models given data, and considerations
-one should understand to create such systems.
-
-\section{The Construction of Preference}\label{sec-foundations}
-
-\subsection{Axiom 1. Construction of Choices Set: Luce's Choice Axiom
-(Luce, 1959)}\label{axiom-1-preference-models-model-choice}
-
-Preference models model the preferred choices amongst a set of items.
-Preference models must enumerate the set of all possible choices
-included in a human decision. As such, we must ensure that the choices
-we enumerate capture the entire domain (collectively exhaustive) but are
-distinct (mutually exclusive) choices. A discrete set of choices is a
-constraint we canonically impose to ensure we can tractably model
-preferences and aptly estimate the parameters of preference models. We
-assume that if a new item is added to the choice set, the relative
-probabilities of choosing between the original items remain unchanged.
-This is known as the Independence of Irrelevant Alternatives (IIA)
-property from Luce's axiom of choices (\citeproc{ref-Luce1977}{Luce
-1977}).
-
-\subsection{Axiom 2. Preference Centers around Utility: Reciprocity
-(Block \& Marschak,
-1960)}\label{axiom-2-preference-centers-around-reward}
-
-Preference models are centered around the notion of reward, a scalar
-quantity representing the benefit or value an individual attains from
-selecting a given choice. We assume that the underlying reward mechanism
-of a human preference model captures the final decision output from a
-human. We use the notation \(u_{i,j}\) as the reward of person \(i\)
-choosing item \(j\). The reward is a random variable, decomposing into
-true reward \(u_{i,j}^*\) and a random noise \(\epsilon_{i,j}\):
-\(u_{i,j} = u_{i,j}^* + \epsilon_{i,j}\). McFadden
-(\citeproc{ref-mcfadden_conditional_1974}{1974}) posits that reward can
-further be decomposed into user-specific reward \(\theta_i\) and
-item-specific reward \(z_j\): \(u_{i,j}^* = \theta_i + z_j\). This
-decomposition indicates that for a single user, only the relative
-difference in reward matters to predict the choice among items, and the
-scale of rewards is important when comparing across users.
- -\subsection{Axiom 3. Preference captures decision-making: Wins as a -Sufficient Statistic (Bühlmann \& Huber, -1963)}\label{axiom-3-preference-captures-decision-making} - -Human preferences are classified into two categories: revealed -preferences and stated preferences. Revealed preferences are those one -can observe retroactively from existing data. The implicit -decision-making knowledge can be captured via learnable parameters and -their usage in models that represent relationships between input -decision attributes that may have little interpretability but enable -powerful models of human preference. Such data may be easier to acquire -and can reflect real-world outcomes (since they are, at least -theoretically, inherently based on human preferences). However, if we -fail to capture sufficient context in such data, human preference models -may not sufficiently capture human preferences. Stated preferences are -those individuals explicitly indicate in potentially experimental -conditions. The explicit knowledge may be leveraged by including -inductive biases during modeling (for example, the context used in a -model), which are reasonable assumptions for how a human would consider -a set of items. This may include controlled experiments or studies. This -may be harder to obtain and somewhat biased, as they can be hypothetical -or only accurately reflect a piece of the overall context of a decision. -However, they enable greater control of the decision-making process. - -\subsection{Axiom 4. Rationality: The Transitivity of odds (Good, -1955)}\label{human-rationality} - -The preference model assumes that humans are rational. Perfect -rationality posits that individuals make decisions that maximize their -reward, assuming they have complete information and the cognitive -ability to process this information to make optimal choices. Numerous -studies have shown that this assumption frequently fails to describe -actual human behavior. 
Bounded rationality acknowledges that individuals -operate within the limits of their information and cognitive -capabilities (\citeproc{ref-simon1972theories}{Simon 1972}). Here, -decisions are influenced by noise, resulting in probabilistic choice -behavior: while individuals aim to maximize their reward, noise can lead -to deviations from perfectly rational choices -(\citeproc{ref-miljkovic2005rational}{Miljkovic 2005}). Instead of -deterministic reward maximization, the decision maker chooses an -item with probability proportional to the exponential of the reward they receive for that -item. This probabilistic model can be operationalized with the Boltzmann -distribution. The utility of person \(i\) for item \(j\) is computed by a -function \(f_i: \mathbb{R}^d \rightarrow \mathbb{R}\) applied to -\(e_j \in \mathbb{R}^d\), an embedding of item \(j\). The probability -of item \(j\) being preferred by person \(i\) over all other -alternatives in the choice set \(\mathcal{C}\) is - -\[ -p_{ij} = p_i(j \succ j' \;\; \forall j' \in \mathcal{C},\, j' \neq j) = Z_i^{-1} \exp(f_i(e_j)) \text{ where } Z_i = \sum_{j' \in \mathcal{C}} \exp(f_i(e_{j'})) -\] - -One can extend the above model in various ways. For example, the above -model does not account for similar items. Consider the following -example when choosing a mode of transportation: car and train, with no -particular preference for either choice. Each is then chosen with probability -50\%. However, if the choice set instead lists 99 equally rewarding cars and one train, -the model assigns a 99\% probability to choosing a car. To -address this issue, various extensions have been proposed. For example, -we can introduce a similarity metric to cluster items. We want a metric -that acts more as a distance in the feature space with the following -properties: identity (an item is most similar to itself), symmetry (the -similarity of item \(j\) to \(j'\) is the same as that of \(j'\) to -\(j\)), and non-negativity (the similarity is never negative). 
-Under this extension, the probability of item \(j\) being preferred over -all other alternatives by person \(i\) is -\(p_{ij} / w_j, \text{ where } w_j = \sum_{j' \in \mathcal{C}} s(e_j, e_{j'})\). -This de-weights similar items, which is the desired effect for human -decision-making. - -\section{Models of Preferences and Decisions}\label{preference-model} - -Next, we explore ways humans can express their preferences, including -accept-reject sampling, pairwise sampling, rank-order sampling, -rating-scale sampling, best-worst scaling, and multiple-choice samples. -We will understand the process of collecting data through simulation -and, when appropriate, discuss the real-world application of these -models. Each item \(i\) is represented by a \(d=2\) dimensional vector -\(x^i\). There is only one user in the simulation, and they have a -latent reward function \(f\) that they use to compute the latent reward -of an item from the features. Here, the latent reward function is the -Ackley function \cite{ackley1987}. 
- -\begin{tcolorbox}[colframe=.grey, title=\faCode \enspace Code] - -\begin{Shaded} -\begin{Highlighting}[numbers=left,,] -\NormalTok{import numpy as np} -\NormalTok{np.random.seed(0)} - -\NormalTok{def ackley(X, a=20, b=0.2, c=2*np.pi):} -\NormalTok{ """} -\NormalTok{ Compute the Ackley function.} -\NormalTok{ Parameters:} -\NormalTok{ X: A NumPy array of shape (n, d) where each row is a d{-}dimensional point.} -\NormalTok{ a, b, c: Parameters of the Ackley function.} -\NormalTok{ Returns:} -\NormalTok{ A NumPy array of function values.} -\NormalTok{ """} -\NormalTok{ X = np.atleast\_2d(X)} -\NormalTok{ d = X.shape[1]} -\NormalTok{ sum\_sq = np.sum(X ** 2, axis=1)} -\NormalTok{ term1 = {-}a * np.exp({-}b * np.sqrt(sum\_sq / d))} -\NormalTok{ term2 = {-}np.exp(np.sum(np.cos(c * X), axis=1) / d)} -\NormalTok{ return term1 + term2 + a + np.e} -\end{Highlighting} -\end{Shaded} - -\end{tcolorbox} - -\begin{Shaded} -\begin{Highlighting}[numbers=left,,] -\ImportTok{import}\NormalTok{ numpy }\ImportTok{as}\NormalTok{ np} -\NormalTok{np.random.seed(}\DecValTok{0}\NormalTok{)} - -\KeywordTok{def}\NormalTok{ ackley(X, a}\OperatorTok{=}\DecValTok{20}\NormalTok{, b}\OperatorTok{=}\FloatTok{0.2}\NormalTok{, c}\OperatorTok{=}\DecValTok{2}\OperatorTok{*}\NormalTok{np.pi):} - \CommentTok{"""} -\CommentTok{ Compute the Ackley function.} -\CommentTok{ Parameters:} -\CommentTok{ X: A NumPy array of shape (n, d) where each row is a d{-}dimensional point.} -\CommentTok{ a, b, c: Parameters of the Ackley function.} -\CommentTok{ Returns:} -\CommentTok{ A NumPy array of function values.} -\CommentTok{ """} -\NormalTok{ X }\OperatorTok{=}\NormalTok{ np.atleast\_2d(X)} -\NormalTok{ d }\OperatorTok{=}\NormalTok{ X.shape[}\DecValTok{1}\NormalTok{]} -\NormalTok{ sum\_sq }\OperatorTok{=}\NormalTok{ np.}\BuiltInTok{sum}\NormalTok{(X }\OperatorTok{**} \DecValTok{2}\NormalTok{, axis}\OperatorTok{=}\DecValTok{1}\NormalTok{)} -\NormalTok{ term1 }\OperatorTok{=} \OperatorTok{{-}}\NormalTok{a 
}\OperatorTok{*}\NormalTok{ np.exp(}\OperatorTok{{-}}\NormalTok{b }\OperatorTok{*}\NormalTok{ np.sqrt(sum\_sq }\OperatorTok{/}\NormalTok{ d))} -\NormalTok{ term2 }\OperatorTok{=} \OperatorTok{{-}}\NormalTok{np.exp(np.}\BuiltInTok{sum}\NormalTok{(np.cos(c }\OperatorTok{*}\NormalTok{ X), axis}\OperatorTok{=}\DecValTok{1}\NormalTok{) }\OperatorTok{/}\NormalTok{ d)} - \ControlFlowTok{return}\NormalTok{ term1 }\OperatorTok{+}\NormalTok{ term2 }\OperatorTok{+}\NormalTok{ a }\OperatorTok{+}\NormalTok{ np.e} -\end{Highlighting} -\end{Shaded} - -We next define a function to visualize the surface: - -\begin{tcolorbox}[colframe=.grey, title=\faCode \enspace Code] - -\begin{Shaded} -\begin{Highlighting}[numbers=left,,] -\NormalTok{import matplotlib.pyplot as plt} -\NormalTok{from matplotlib.colors import LinearSegmentedColormap} -\NormalTok{ccmap = LinearSegmentedColormap.from\_list("ackley", ["\#f76a05", "\#FFF2C9"])} -\NormalTok{plt.rcParams.update(\{} -\NormalTok{ "font.size": 14,} -\NormalTok{ "axes.labelsize": 16,} -\NormalTok{ "xtick.labelsize": 14,} -\NormalTok{ "ytick.labelsize": 14,} -\NormalTok{ "legend.fontsize": 14,} -\NormalTok{ "axes.titlesize": 16,} -\NormalTok{\})} - -\NormalTok{def draw\_surface():} -\NormalTok{ inps = np.linspace({-}2, 2, 100)} -\NormalTok{ X, Y = np.meshgrid(inps, inps)} -\NormalTok{ grid = np.column\_stack([X.ravel(), Y.ravel()])} -\NormalTok{ Z = ackley(grid).reshape(X.shape)} - -\NormalTok{ plt.figure(figsize=(6, 5))} -\NormalTok{ contour = plt.contourf(X, Y, Z, 50, cmap=ccmap)} -\NormalTok{ plt.contour(X, Y, Z, levels=15, colors=\textquotesingle{}black\textquotesingle{}, linewidths=0.5, alpha=0.6)} -\NormalTok{ plt.colorbar(contour, label=r\textquotesingle{}$f(x)$\textquotesingle{}, ticks=[0, 3, 6])} -\NormalTok{ plt.xlim({-}2, 2)} -\NormalTok{ plt.ylim({-}2, 2)} -\NormalTok{ plt.xticks([{-}2, 0, 2])} -\NormalTok{ plt.yticks([{-}2, 0, 2])} -\NormalTok{ plt.xlabel(r\textquotesingle{}$x\_1$\textquotesingle{})} -\NormalTok{ 
plt.ylabel(r\textquotesingle{}$x\_2$\textquotesingle{})} -\end{Highlighting} -\end{Shaded} - -\end{tcolorbox} - -\begin{Shaded} -\begin{Highlighting}[numbers=left,,] -\ImportTok{import}\NormalTok{ matplotlib.pyplot }\ImportTok{as}\NormalTok{ plt} -\ImportTok{from}\NormalTok{ matplotlib.colors }\ImportTok{import}\NormalTok{ LinearSegmentedColormap} -\NormalTok{ccmap }\OperatorTok{=}\NormalTok{ LinearSegmentedColormap.from\_list(}\StringTok{"ackley"}\NormalTok{, [}\StringTok{"\#f76a05"}\NormalTok{, }\StringTok{"\#FFF2C9"}\NormalTok{])} -\NormalTok{plt.rcParams.update(\{} - \StringTok{"font.size"}\NormalTok{: }\DecValTok{14}\NormalTok{,} - \StringTok{"axes.labelsize"}\NormalTok{: }\DecValTok{16}\NormalTok{,} - \StringTok{"xtick.labelsize"}\NormalTok{: }\DecValTok{14}\NormalTok{,} - \StringTok{"ytick.labelsize"}\NormalTok{: }\DecValTok{14}\NormalTok{,} - \StringTok{"legend.fontsize"}\NormalTok{: }\DecValTok{14}\NormalTok{,} - \StringTok{"axes.titlesize"}\NormalTok{: }\DecValTok{16}\NormalTok{,} -\NormalTok{\})} -\NormalTok{plt.rcParams[}\StringTok{\textquotesingle{}text.usetex\textquotesingle{}}\NormalTok{] }\OperatorTok{=} \VariableTok{True} - -\KeywordTok{def}\NormalTok{ draw\_surface():} -\NormalTok{ inps }\OperatorTok{=}\NormalTok{ np.linspace(}\OperatorTok{{-}}\DecValTok{2}\NormalTok{, }\DecValTok{2}\NormalTok{, }\DecValTok{100}\NormalTok{)} -\NormalTok{ X, Y }\OperatorTok{=}\NormalTok{ np.meshgrid(inps, inps)} -\NormalTok{ grid }\OperatorTok{=}\NormalTok{ np.column\_stack([X.ravel(), Y.ravel()])} -\NormalTok{ Z }\OperatorTok{=}\NormalTok{ ackley(grid).reshape(X.shape)} - -\NormalTok{ plt.figure(figsize}\OperatorTok{=}\NormalTok{(}\DecValTok{6}\NormalTok{, }\DecValTok{5}\NormalTok{))} -\NormalTok{ contour }\OperatorTok{=}\NormalTok{ plt.contourf(X, Y, Z, }\DecValTok{50}\NormalTok{, cmap}\OperatorTok{=}\NormalTok{ccmap)} -\NormalTok{ plt.contour(X, Y, Z, levels}\OperatorTok{=}\DecValTok{15}\NormalTok{, 
colors}\OperatorTok{=}\StringTok{\textquotesingle{}black\textquotesingle{}}\NormalTok{, linewidths}\OperatorTok{=}\FloatTok{0.5}\NormalTok{, alpha}\OperatorTok{=}\FloatTok{0.6}\NormalTok{)} -\NormalTok{ plt.colorbar(contour, label}\OperatorTok{=}\VerbatimStringTok{r\textquotesingle{}$f(x)$\textquotesingle{}}\NormalTok{, ticks}\OperatorTok{=}\NormalTok{[}\DecValTok{0}\NormalTok{, }\DecValTok{3}\NormalTok{, }\DecValTok{6}\NormalTok{])} -\NormalTok{ plt.xlim(}\OperatorTok{{-}}\DecValTok{2}\NormalTok{, }\DecValTok{2}\NormalTok{)} -\NormalTok{ plt.ylim(}\OperatorTok{{-}}\DecValTok{2}\NormalTok{, }\DecValTok{2}\NormalTok{)} -\NormalTok{ plt.xticks([}\OperatorTok{{-}}\DecValTok{2}\NormalTok{, }\DecValTok{0}\NormalTok{, }\DecValTok{2}\NormalTok{])} -\NormalTok{ plt.yticks([}\OperatorTok{{-}}\DecValTok{2}\NormalTok{, }\DecValTok{0}\NormalTok{, }\DecValTok{2}\NormalTok{])} -\NormalTok{ plt.xlabel(}\VerbatimStringTok{r\textquotesingle{}$x\_1$\textquotesingle{}}\NormalTok{)} -\NormalTok{ plt.ylabel(}\VerbatimStringTok{r\textquotesingle{}$x\_2$\textquotesingle{}}\NormalTok{)} -\end{Highlighting} -\end{Shaded} - -\subsection{Item-wise Model}\label{item-wise-model} - -One method for data collection is accept-reject sampling, where the user -considers one item at a time and decides if they like it. Below is an -example survey using accept-reject sampling: - -We will use a simulation to familiarize ourselves with accept-reject -sampling. On the surface below, blue and red points correspond to accept -or reject points. 
- -\begin{tcolorbox}[colframe=.grey, title=\faCode \enspace Code] - -\begin{Shaded} -\begin{Highlighting}[numbers=left,,] -\NormalTok{d = 2} -\NormalTok{n\_items = 800} -\NormalTok{items = np.random.randn(n\_items, d)*0.5 + np.ones((n\_items, d))*0.5} -\NormalTok{rewards = ackley(items)} -\NormalTok{y = (rewards \textgreater{} rewards.mean())} -\NormalTok{draw\_surface()} -\NormalTok{plt.scatter(items[:, 0], items[:, 1], c=y, cmap=\textquotesingle{}coolwarm\textquotesingle{}, alpha=0.5)} -\NormalTok{plt.show()} -\end{Highlighting} -\end{Shaded} - -\end{tcolorbox} - -\begin{Shaded} -\begin{Highlighting}[numbers=left,,] -\NormalTok{d }\OperatorTok{=} \DecValTok{2} -\NormalTok{n\_items }\OperatorTok{=} \DecValTok{800} -\NormalTok{items }\OperatorTok{=}\NormalTok{ np.random.randn(n\_items, d)}\OperatorTok{*}\FloatTok{0.5} \OperatorTok{+}\NormalTok{ np.ones((n\_items, d))}\OperatorTok{*}\FloatTok{0.5} -\NormalTok{rewards }\OperatorTok{=}\NormalTok{ ackley(items)} -\NormalTok{y }\OperatorTok{=}\NormalTok{ (rewards }\OperatorTok{\textgreater{}}\NormalTok{ rewards.mean())} -\NormalTok{draw\_surface()} -\NormalTok{plt.scatter(items[:, }\DecValTok{0}\NormalTok{], items[:, }\DecValTok{1}\NormalTok{], c}\OperatorTok{=}\NormalTok{y, cmap}\OperatorTok{=}\StringTok{\textquotesingle{}coolwarm\textquotesingle{}}\NormalTok{, alpha}\OperatorTok{=}\FloatTok{0.5}\NormalTok{)} -\NormalTok{plt.show()} -\end{Highlighting} -\end{Shaded} - -\pandocbounded{\includegraphics[keepaspectratio]{src/chap2_files/figure-pdf/cell-4-output-1.pdf}} - -The binary choice model centers around one item. The model predicts, for -that item, after observing user choices in the past, whether that item -will be chosen. We use binary variable \(y \in \{0, 1\}\) to represent -whether the user will pick that choice in the next selection phase. We -denote \(P = p(y = 1)\). We can formally model \(y\) as a function of -the reward of the positive choice: \(y = \mathbb{I}[U>0]\). 
We explore -two cases based on the noise distribution: \(\psi\) is the logistic -function if the noise follows the logistic distribution, and the standard normal cumulative distribution function if -the noise follows the standard normal -distribution. Because both distributions are symmetric about zero, \[ -p(u_{i,j} > 0) = p(u_{i,j}^* + \epsilon > 0) = 1 - p( \epsilon < -u_{i,j}^*) = \psi(u_{i,j}^*). -\] - -A generalization of accept-reject sampling is rating-scale sampling. -Rating-scale sampling, such as the Likert scale, is a method in which -participants rate items on a fixed-point scale (e.g., 1 to 5, ``Strongly -Disagree'' to ``Strongly Agree'') to measure levels of preference -towards items (\citeproc{ref-harpe2015}{Harpe 2015}). Participants can -also mark a point on a continuous rating scale to indicate their -preference or attitude. Rating scales are commonly used in surveys, product reviews, and -psychological assessments, and the continuous variant provides a more nuanced measure -than discrete scales. Rating-scale sampling is simple for participants -to understand and use, provides rich data on the intensity of -preferences, and is flexible enough for various measurements (e.g., -agreement, satisfaction). However, rating-scale sampling methods also -have limitations. Ratings can be influenced by personal biases and -interpretations of scales, leading to subjectivity. There is also a central -tendency bias, where participants may avoid extreme ratings, so -responses cluster around the middle. Different participants might -interpret scale points differently, and fixed-point scales may not -capture the full nuance of participants' preferences or attitudes. 
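Before moving to the rating-scale simulation, the accept probability \(\psi(u_{i,j}^*)\) of the binary choice model above can be checked numerically. The following is a minimal sketch, not the book's implementation: the function names `accept_prob` and `simulate_accept_rate` are hypothetical, and the noise is assumed to have zero location and unit scale. It compares the closed-form probability under logistic and standard normal noise with a Monte Carlo estimate of \(p(u_{i,j}^* + \epsilon > 0)\).

```python
import math
import numpy as np

def accept_prob(u_star, noise="logistic"):
    """Closed-form P(u* + eps > 0) for symmetric, zero-centered noise."""
    if noise == "logistic":
        return 1.0 / (1.0 + math.exp(-u_star))               # logistic function
    if noise == "normal":
        return 0.5 * (1.0 + math.erf(u_star / math.sqrt(2)))  # standard normal CDF
    raise ValueError(f"unknown noise: {noise}")

def simulate_accept_rate(u_star, noise="logistic", n=200_000, seed=0):
    """Monte Carlo estimate of the same probability (assumed unit-scale noise)."""
    rng = np.random.default_rng(seed)
    eps = rng.logistic(size=n) if noise == "logistic" else rng.standard_normal(n)
    return float(np.mean(u_star + eps > 0))

for kind in ("logistic", "normal"):
    print(kind, accept_prob(0.7, kind), simulate_accept_rate(0.7, kind))
```

For the same true reward, the probit (normal) model gives a sharper accept probability than the logit model because unit-scale logistic noise has larger variance than standard normal noise.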
- -\begin{tcolorbox}[colframe=.grey, title=\faCode \enspace Code] - -\begin{Shaded} -\begin{Highlighting}[numbers=left,,] -\NormalTok{from matplotlib.colors import LinearSegmentedColormap} -\NormalTok{likert\_cmap = LinearSegmentedColormap.from\_list("likert\_scale", ["red", "blue"], N=5)} -\NormalTok{normalized = (rewards {-} rewards.min()) / (rewards.max() {-} rewards.min())} -\NormalTok{ratings = np.round(normalized * 4).squeeze()} - -\NormalTok{draw\_surface()} -\NormalTok{scatter = plt.scatter(items[:, 0], items[:, 1], c=ratings, cmap=likert\_cmap, alpha=0.5)} -\NormalTok{plt.show()} -\end{Highlighting} -\end{Shaded} - -\end{tcolorbox} - -\begin{Shaded} -\begin{Highlighting}[numbers=left,,] -\ImportTok{from}\NormalTok{ matplotlib.colors }\ImportTok{import}\NormalTok{ LinearSegmentedColormap} -\NormalTok{likert\_cmap }\OperatorTok{=}\NormalTok{ LinearSegmentedColormap.from\_list(}\StringTok{"likert\_scale"}\NormalTok{, [}\StringTok{"red"}\NormalTok{, }\StringTok{"blue"}\NormalTok{], N}\OperatorTok{=}\DecValTok{5}\NormalTok{)} -\NormalTok{normalized }\OperatorTok{=}\NormalTok{ (rewards }\OperatorTok{{-}}\NormalTok{ rewards.}\BuiltInTok{min}\NormalTok{()) }\OperatorTok{/}\NormalTok{ (rewards.}\BuiltInTok{max}\NormalTok{() }\OperatorTok{{-}}\NormalTok{ rewards.}\BuiltInTok{min}\NormalTok{())} -\NormalTok{ratings }\OperatorTok{=}\NormalTok{ np.}\BuiltInTok{round}\NormalTok{(normalized }\OperatorTok{*} \DecValTok{4}\NormalTok{).squeeze()} - -\NormalTok{draw\_surface()} -\NormalTok{scatter }\OperatorTok{=}\NormalTok{ plt.scatter(items[:, }\DecValTok{0}\NormalTok{], items[:, }\DecValTok{1}\NormalTok{], c}\OperatorTok{=}\NormalTok{ratings, cmap}\OperatorTok{=}\NormalTok{likert\_cmap, alpha}\OperatorTok{=}\FloatTok{0.5}\NormalTok{)} -\NormalTok{plt.show()} -\end{Highlighting} -\end{Shaded} - -\pandocbounded{\includegraphics[keepaspectratio]{src/chap2_files/figure-pdf/cell-5-output-1.pdf}} - -Suppose we have a single example with attributes \(z_i\) and wish to -know 
which of \(J\) rating levels an individual will choose. We can -define \(J - 1\) parameters, which act as thresholds on the reward -computed by \(u_i = u_{i,j}^*\) to classify the predicted choice among -these levels. For example, if there are three rating levels, we can -define parameters \(a, b \in \mathbb{R}\) with \(a < b\) such that \[ -y_i = -\begin{cases} - 1 & u < a \\ - 2 & a \le u < b \\ - 3 & \text{else} -\end{cases} -\] - -By assuming the noise distribution to be either logistic or standard -normal, we have \[ -\begin{split} - p(y_i = 1) & = p(u < a) = p(u_{i,j}^* + \epsilon < a) = \psi(a-u_{i,j}^*) \\ - p(y_i = 2) & = p(a \le u < b) = p(a - u_{i,j}^* \le \epsilon < b - u_{i,j}^*) = \psi(b-u_{i,j}^*) - \psi(a-u_{i,j}^*) \\ - p(y_i = 3) & = p(u \ge b) = p(u_{i,j}^* + \epsilon \ge b ) = p( \epsilon \ge b - u_{i,j}^*) = 1 - \psi(b-u_{i,j}^*) -\end{split} -\] - -Having the model, we next explore the estimation of model parameters. A -common approach for parameter estimation is maximum likelihood -(\citeproc{ref-book_estimation_casella}{Casella and Berger 1990}; -\citeproc{ref-book_estimation_bock}{Bock et al. 2015}). The likelihood -of a model is the probability of the observed data given the model -parameters; intuitively, we wish to maximize this likelihood, as that -would mean that our model associates observed human preferences with -high probability. Assuming our data is independent and identically -distributed (iid), the likelihood over the entire dataset of \(N\) observations is the joint -probability of all observed data. For the binary choice model -with logistic noise, it is - -\[\mathcal{L}(z, Y; \beta) = \prod_{i = 1}^N p(y = y_i | z_i; \beta) = \prod_{i = 1}^N \psi(u_{i,j}^*)^{y_i} \left(1 - \psi(u_{i,j}^*)\right)^{1 - y_i}, \quad \psi(u) = \frac{1}{1 + \exp(-u)}\] - -This objective can be optimized with a gradient-based method, such as -gradient descent (\citeproc{ref-gradient_descent}{Ruder 2016}). 
Gradient -descent computes the gradient of the objective (here, the negative log-likelihood) with respect -to the parameters of the model; the negative gradient gives the direction -in which the parameters must move to decrease the objective. Stochastic gradient descent (SGD) -estimates this gradient from a random subset of the data and makes an update step by subtracting the gradient from the parameters, -scaled by a factor called the learning rate. In the case of -logistic and Gaussian models, SGD may yield a challenging optimization -problem, as its stochasticity can lead to noisy updates, for example, if -certain examples or batches of examples are biased. Mitigations include -mini-batch SGD, in which multiple samples are randomly drawn from the -dataset at each iteration; smaller learning rates, which reduce the impact of -noisy gradient updates; and momentum and higher-order optimizers, which -reduce noise by using moving averages of gradients or provide better -estimates of the direction in which to update the parameters. 
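The update rule described above can be sketched as mini-batch SGD with momentum on the negative log-likelihood of a logistic choice model. This is a minimal illustration under stated assumptions, not the estimation procedure used in this chapter: it assumes a linear reward \(u_i^* = z_i^\top \beta\), and the function names, learning rate, momentum, and batch size are hypothetical choices.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def nll(beta, Z, y):
    """Average negative log-likelihood of the logistic binary choice model."""
    p = np.clip(sigmoid(Z @ beta), 1e-8, 1 - 1e-8)  # clip to avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def sgd_momentum(Z, y, lr=0.05, momentum=0.9, batch_size=32, n_steps=2000, seed=0):
    """Mini-batch SGD with momentum; all hyperparameters are illustrative."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(Z.shape[1])
    velocity = np.zeros_like(beta)
    for _ in range(n_steps):
        idx = rng.choice(len(y), size=batch_size, replace=False)  # random mini-batch
        # Gradient of the average NLL on the batch: Z^T (sigma(Z beta) - y) / B
        grad = Z[idx].T @ (sigmoid(Z[idx] @ beta) - y[idx]) / batch_size
        velocity = momentum * velocity - lr * grad  # moving average of gradients
        beta = beta + velocity                      # step against the gradient
    return beta

# Synthetic data from an assumed linear reward model
rng = np.random.default_rng(1)
Z = rng.normal(size=(500, 2))
beta_true = np.array([1.5, -1.0])
y = rng.binomial(1, sigmoid(Z @ beta_true)).astype(float)

beta_hat = sgd_momentum(Z, y)
print("NLL at zero init:", nll(np.zeros(2), Z, y))
print("NLL after SGD:   ", nll(beta_hat, Z, y))
```

A larger batch size or a smaller learning rate reduces the variance of `beta_hat` around the maximum-likelihood solution, at the cost of more computation per step or slower progress.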
- -\begin{tcolorbox}[colframe=.grey, title=\faCode \enspace Code] - -\begin{Shaded} -\begin{Highlighting}[numbers=left,,] -\NormalTok{import numpy as np} -\NormalTok{from scipy.optimize import minimize} -\NormalTok{from sklearn.metrics import roc\_auc\_score} -\NormalTok{from tqdm import tqdm} - -\NormalTok{\# Set random seed for reproducibility (optional)} -\NormalTok{np.random.seed(42)} - -\NormalTok{\# Number of users and items} -\NormalTok{num\_users = 50} -\NormalTok{num\_items = 100} - -\NormalTok{\# Generate user{-}specific and item{-}specific rewards} -\NormalTok{theta\_true = np.random.randn(num\_users)} -\NormalTok{z\_true = np.random.randn(num\_items)} - -\NormalTok{\# Define the logistic (sigmoid) function} -\NormalTok{def sigmoid(x):} -\NormalTok{ return 1.0 / (1.0 + np.exp({-}x))} - -\NormalTok{\# Generate observed choices using the logistic function} -\NormalTok{\# Compute probability matrix: shape (num\_users, num\_items)} -\NormalTok{probs = sigmoid(theta\_true[:, None] {-} z\_true[None, :])} -\NormalTok{\# Sample binary responses (0 or 1) from a Bernoulli distribution} -\NormalTok{data = np.random.binomial(1, probs)} - -\NormalTok{\# Mask out a fraction of the response matrix (80\% observed, 20\% missing)} -\NormalTok{mask = np.random.rand(num\_users, num\_items) \textgreater{} 0.2 \# boolean mask} -\NormalTok{\# Create a version of the data with missing values (not needed for optimization, but for reference)} -\NormalTok{data\_masked = data.copy().astype(float)} -\NormalTok{data\_masked[\textasciitilde{}mask] = np.nan} - -\NormalTok{\# Count of observed entries (used for averaging)} -\NormalTok{observed\_count = np.sum(mask)} - -\NormalTok{\# We will optimize over parameters theta and z.} -\NormalTok{\# Initialize estimates (random starting points)} -\NormalTok{theta\_init = np.random.randn(num\_users)} -\NormalTok{z\_init = np.random.randn(num\_items)} - -\NormalTok{\# Pack parameters into a single vector for the optimizer.} -\NormalTok{\# First 
num\_users elements are theta\_est, next num\_items are z\_est.} -\NormalTok{params\_init = np.concatenate([theta\_init, z\_init])} - -\NormalTok{def objective(params):} -\NormalTok{ """} -\NormalTok{ Computes the loss and gradient for the current parameters.} -\NormalTok{ Loss is defined as the negative log likelihood (averaged over observed entries).} -\NormalTok{ """} -\NormalTok{ \# Unpack parameters} -\NormalTok{ theta = params[:num\_users]} -\NormalTok{ z = params[num\_users:]} - -\NormalTok{ \# Compute difference and estimated probabilities} -\NormalTok{ diff = theta[:, None] {-} z[None, :] \# shape: (num\_users, num\_items)} -\NormalTok{ sigma = sigmoid(diff)} - -\NormalTok{ \# To avoid log(0), clip probabilities a little bit} -\NormalTok{ eps = 1e{-}8} -\NormalTok{ sigma = np.clip(sigma, eps, 1 {-} eps)} - -\NormalTok{ \# Compute negative log likelihood only on observed entries} -\NormalTok{ \# For each observed entry: if data == 1 then {-}log(sigma) else {-}log(1{-}sigma)} -\NormalTok{ log\_likelihood = data * np.log(sigma) + (1 {-} data) * np.log(1 {-} sigma)} -\NormalTok{ loss = {-}np.sum(mask * log\_likelihood) / observed\_count} - -\NormalTok{ \# Compute gradient with respect to the difference x = theta\_i {-} z\_j} -\NormalTok{ \# d(loss)/d(x) = sigma {-} data (for observed entries, zero otherwise)} -\NormalTok{ diff\_grad = (sigma {-} data) * mask \# shape: (num\_users, num\_items)} - -\NormalTok{ \# Gradients for theta: sum over items (axis 1)} -\NormalTok{ grad\_theta = np.sum(diff\_grad, axis=1) / observed\_count} -\NormalTok{ \# Gradients for z: negative sum over users (axis 0)} -\NormalTok{ grad\_z = {-}np.sum(diff\_grad, axis=0) / observed\_count} - -\NormalTok{ \# Pack gradients back into a single vector} -\NormalTok{ grad = np.concatenate([grad\_theta, grad\_z])} -\NormalTok{ return loss, grad} - -\NormalTok{\# Callback to track progress (optional)} -\NormalTok{iteration\_progress = tqdm()} - -\NormalTok{def callback(xk):} -\NormalTok{ 
iteration\_progress.update(1)} - -\NormalTok{\# Optimize using L{-}BFGS{-}B} -\NormalTok{result = minimize(} -\NormalTok{ fun=lambda params: objective(params),} -\NormalTok{ x0=params\_init,} -\NormalTok{ method="L{-}BFGS{-}B",} -\NormalTok{ jac=True,} -\NormalTok{ callback=callback,} -\NormalTok{ options=\{"maxiter": 100, "disp": True\}} -\NormalTok{)} -\NormalTok{iteration\_progress.close()} - -\NormalTok{\# Extract the estimated parameters} -\NormalTok{theta\_est = result.x[:num\_users]} -\NormalTok{z\_est = result.x[num\_users:]} - -\NormalTok{\# Compute final estimated probabilities} -\NormalTok{probs\_final = sigmoid(theta\_est[:, None] {-} z\_est[None, :])} - -\NormalTok{\# Compute AUC ROC on observed (training) and missing (test) entries} -\NormalTok{train\_probs = probs\_final[mask]} -\NormalTok{test\_probs = probs\_final[\textasciitilde{}mask]} -\NormalTok{train\_labels = data[mask]} -\NormalTok{test\_labels = data[\textasciitilde{}mask]} - -\NormalTok{auc\_train = roc\_auc\_score(train\_labels, train\_probs)} -\NormalTok{auc\_test = roc\_auc\_score(test\_labels, test\_probs)} - -\NormalTok{print(f"Train AUC: \{auc\_train:.4f\}")} -\NormalTok{print(f"Test AUC: \{auc\_test:.4f\}")} -\end{Highlighting} -\end{Shaded} - -\end{tcolorbox} - -\begin{Shaded} -\begin{Highlighting}[numbers=left,,] -\ImportTok{import}\NormalTok{ torch} -\ImportTok{import}\NormalTok{ torch.nn }\ImportTok{as}\NormalTok{ nn} -\ImportTok{import}\NormalTok{ torch.optim }\ImportTok{as}\NormalTok{ optim} -\ImportTok{from}\NormalTok{ torch.distributions }\ImportTok{import}\NormalTok{ Bernoulli} -\ImportTok{from}\NormalTok{ tqdm }\ImportTok{import}\NormalTok{ tqdm} - -\CommentTok{\# Set device} -\NormalTok{device }\OperatorTok{=}\NormalTok{ torch.device(}\StringTok{"cuda"} \ControlFlowTok{if}\NormalTok{ torch.cuda.is\_available() }\ControlFlowTok{else} \StringTok{"cpu"}\NormalTok{)} - -\CommentTok{\# Number of users and items} -\NormalTok{num\_users }\OperatorTok{=} \DecValTok{50} 
-\NormalTok{num\_items }\OperatorTok{=} \DecValTok{100} - -\CommentTok{\# Generate user{-}specific and item{-}specific rewards} -\NormalTok{theta }\OperatorTok{=}\NormalTok{ torch.randn(num\_users, device}\OperatorTok{=}\NormalTok{device, requires\_grad}\OperatorTok{=}\VariableTok{True}\NormalTok{)} -\NormalTok{z }\OperatorTok{=}\NormalTok{ torch.randn(num\_items, device}\OperatorTok{=}\NormalTok{device, requires\_grad}\OperatorTok{=}\VariableTok{True}\NormalTok{)} - -\CommentTok{\# Generate observed choices using logistic function} -\NormalTok{probs }\OperatorTok{=}\NormalTok{ torch.sigmoid(theta[:, }\VariableTok{None}\NormalTok{] }\OperatorTok{{-}}\NormalTok{ z[}\VariableTok{None}\NormalTok{, :])} -\NormalTok{data }\OperatorTok{=}\NormalTok{ Bernoulli(probs}\OperatorTok{=}\NormalTok{probs).sample()} - -\CommentTok{\# Mask out a fraction of the response matrix} -\NormalTok{mask }\OperatorTok{=}\NormalTok{ torch.rand\_like(data) }\OperatorTok{\textgreater{}} \FloatTok{0.2} \CommentTok{\# 80\% observed, 20\% missing} -\NormalTok{data\_masked }\OperatorTok{=}\NormalTok{ data.clone()} -\NormalTok{data\_masked[}\OperatorTok{\textasciitilde{}}\NormalTok{mask] }\OperatorTok{=} \BuiltInTok{float}\NormalTok{(}\StringTok{\textquotesingle{}nan\textquotesingle{}}\NormalTok{)} - -\CommentTok{\# Initialize parameters for EM algorithm} -\NormalTok{theta\_est }\OperatorTok{=}\NormalTok{ torch.randn(num\_users, device}\OperatorTok{=}\NormalTok{device, requires\_grad}\OperatorTok{=}\VariableTok{True}\NormalTok{)} -\NormalTok{z\_est }\OperatorTok{=}\NormalTok{ torch.randn(num\_items, device}\OperatorTok{=}\NormalTok{device, requires\_grad}\OperatorTok{=}\VariableTok{True}\NormalTok{)} - -\CommentTok{\# Optimizer} -\NormalTok{optimizer }\OperatorTok{=}\NormalTok{ optim.LBFGS([theta\_est, z\_est], lr}\OperatorTok{=}\FloatTok{0.1}\NormalTok{, max\_iter}\OperatorTok{=}\DecValTok{20}\NormalTok{, history\_size}\OperatorTok{=}\DecValTok{10}\NormalTok{, 
line\_search\_fn}\OperatorTok{=}\StringTok{"strong\_wolfe"}\NormalTok{)} - -\KeywordTok{def}\NormalTok{ closure():} -\NormalTok{ optimizer.zero\_grad()} -\NormalTok{ probs\_est }\OperatorTok{=}\NormalTok{ torch.sigmoid(theta\_est[:, }\VariableTok{None}\NormalTok{] }\OperatorTok{{-}}\NormalTok{ z\_est[}\VariableTok{None}\NormalTok{, :])} -\NormalTok{ loss }\OperatorTok{=} \OperatorTok{{-}}\NormalTok{(Bernoulli(probs}\OperatorTok{=}\NormalTok{probs\_est).log\_prob(data) }\OperatorTok{*}\NormalTok{ mask).mean()} -\NormalTok{ loss.backward()} - \ControlFlowTok{return}\NormalTok{ loss} - -\CommentTok{\# EM Algorithm} -\NormalTok{pbar }\OperatorTok{=}\NormalTok{ tqdm(}\BuiltInTok{range}\NormalTok{(}\DecValTok{100}\NormalTok{))} -\ControlFlowTok{for}\NormalTok{ iteration }\KeywordTok{in}\NormalTok{ pbar:} - \ControlFlowTok{if}\NormalTok{ iteration }\OperatorTok{\textgreater{}} \DecValTok{0}\NormalTok{:} -\NormalTok{ previous\_theta }\OperatorTok{=}\NormalTok{ theta\_est.clone()} -\NormalTok{ previous\_z }\OperatorTok{=}\NormalTok{ z\_est.clone()} -\NormalTok{ previous\_loss }\OperatorTok{=}\NormalTok{ loss.clone()} - -\NormalTok{ loss }\OperatorTok{=}\NormalTok{ optimizer.step(closure)} - - \ControlFlowTok{if}\NormalTok{ iteration }\OperatorTok{\textgreater{}} \DecValTok{0}\NormalTok{:} -\NormalTok{ d\_loss }\OperatorTok{=}\NormalTok{ (previous\_loss }\OperatorTok{{-}}\NormalTok{ loss).item()} -\NormalTok{ d\_theta }\OperatorTok{=}\NormalTok{ torch.norm(previous\_theta }\OperatorTok{{-}}\NormalTok{ theta\_est, p}\OperatorTok{=}\DecValTok{2}\NormalTok{).item()} -\NormalTok{ d\_z }\OperatorTok{=}\NormalTok{ torch.norm(previous\_z }\OperatorTok{{-}}\NormalTok{ z\_est, p}\OperatorTok{=}\DecValTok{2}\NormalTok{).item()} -\NormalTok{ grad\_norm }\OperatorTok{=}\NormalTok{ torch.norm(optimizer.param\_groups[}\DecValTok{0}\NormalTok{][}\StringTok{"params"}\NormalTok{][}\DecValTok{0}\NormalTok{].grad, p}\OperatorTok{=}\DecValTok{2}\NormalTok{).item()} -\NormalTok{ grad\_norm 
}\OperatorTok{+=}\NormalTok{ torch.norm(optimizer.param\_groups[}\DecValTok{0}\NormalTok{][}\StringTok{"params"}\NormalTok{][}\DecValTok{1}\NormalTok{].grad, p}\OperatorTok{=}\DecValTok{2}\NormalTok{).item()} -\NormalTok{ pbar.set\_postfix(\{}\StringTok{"grad\_norm"}\NormalTok{: grad\_norm, }\StringTok{"d\_theta"}\NormalTok{: d\_theta, }\StringTok{"d\_z"}\NormalTok{: d\_z, }\StringTok{"d\_loss"}\NormalTok{: d\_loss\})} - \ControlFlowTok{if}\NormalTok{ d\_loss }\OperatorTok{\textless{}} \FloatTok{1e{-}5} \KeywordTok{and}\NormalTok{ d\_theta }\OperatorTok{\textless{}} \FloatTok{1e{-}5} \KeywordTok{and}\NormalTok{ d\_z }\OperatorTok{\textless{}} \FloatTok{1e{-}5} \KeywordTok{and}\NormalTok{ grad\_norm }\OperatorTok{\textless{}} \FloatTok{1e{-}5}\NormalTok{:} - \ControlFlowTok{break} - -\CommentTok{\# Compute AUC ROC on observed and inferred data} -\ImportTok{from}\NormalTok{ torchmetrics }\ImportTok{import}\NormalTok{ AUROC} -\NormalTok{auroc }\OperatorTok{=}\NormalTok{ AUROC(task}\OperatorTok{=}\StringTok{"binary"}\NormalTok{)} -\NormalTok{probs\_final }\OperatorTok{=}\NormalTok{ torch.sigmoid(theta\_est[:, }\VariableTok{None}\NormalTok{] }\OperatorTok{{-}}\NormalTok{ z\_est[}\VariableTok{None}\NormalTok{, :])} -\NormalTok{train\_probs }\OperatorTok{=}\NormalTok{ probs\_final[mask]} -\NormalTok{test\_probs }\OperatorTok{=}\NormalTok{ probs\_final[}\OperatorTok{\textasciitilde{}}\NormalTok{mask]} -\NormalTok{train\_labels }\OperatorTok{=}\NormalTok{ data[mask]} -\NormalTok{test\_labels }\OperatorTok{=}\NormalTok{ data[}\OperatorTok{\textasciitilde{}}\NormalTok{mask]} -\NormalTok{auc\_train }\OperatorTok{=}\NormalTok{ auroc(train\_probs, train\_labels)} -\NormalTok{auc\_test }\OperatorTok{=}\NormalTok{ auroc(test\_probs, test\_labels)} -\BuiltInTok{print}\NormalTok{(}\SpecialStringTok{f"train auc: }\SpecialCharTok{\{}\NormalTok{auc\_train}\SpecialCharTok{\}}\SpecialStringTok{"}\NormalTok{)} -\BuiltInTok{print}\NormalTok{(}\SpecialStringTok{f"test auc: 
}\SpecialCharTok{\{}\NormalTok{auc\_test}\SpecialCharTok{\}}\SpecialStringTok{"}\NormalTok{)} -\end{Highlighting} -\end{Shaded} - -\begin{verbatim} -train auc: 0.8305394053459167 -test auc: 0.7656601071357727 -\end{verbatim} - -\subsection{Pairwise Model}\label{pairwise-model} - -In \emph{pairwise sampling}, participants compare two items to determine -which is preferred. One of the major advantages of this method is the -low cognitive demand for raters. Its disadvantage is the limited amount -of information content elicited by a sample. Below is a survey based on -pairwise sampling: - -\begin{tcolorbox}[colframe=.grey, title=\faCode \enspace Code] - -\begin{Shaded} -\begin{Highlighting}[numbers=left,,] -\NormalTok{n\_pairs = 10000} -\NormalTok{pair\_indices = np.random.randint(0, n\_items, size=(n\_pairs, 2))} -\NormalTok{\# Exclude pairs where both indices are the same} -\NormalTok{mask = pair\_indices[:, 0] != pair\_indices[:, 1]} -\NormalTok{pair\_indices = pair\_indices[mask]} - -\NormalTok{scores = np.zeros(n\_items, dtype=int)} -\NormalTok{wins = rewards[pair\_indices[:, 0]] \textgreater{} rewards[pair\_indices[:, 1]]} - -\NormalTok{\# For pairs where the first item wins:} -\NormalTok{\# {-} Increase score for the first item by 1} -\NormalTok{\# {-} Decrease score for the second item by 1} -\NormalTok{np.add.at(scores, pair\_indices[wins, 0], 1)} -\NormalTok{np.add.at(scores, pair\_indices[wins, 1], {-}1)} - -\NormalTok{\# For pairs where the second item wins or it\textquotesingle{}s a tie:} -\NormalTok{\# {-} Decrease score for the first item by 1} -\NormalTok{\# {-} Increase score for the second item by 1} -\NormalTok{np.add.at(scores, pair\_indices[\textasciitilde{}wins, 0], {-}1)} -\NormalTok{np.add.at(scores, pair\_indices[\textasciitilde{}wins, 1], 1)} - -\NormalTok{\# Determine preferred and non{-}preferred items based on scores} -\NormalTok{preferred = scores \textgreater{} 0} -\NormalTok{non\_preferred = scores \textless{} 0} - 
-\NormalTok{draw\_surface()} -\NormalTok{plt.scatter(items[preferred, 0], items[preferred, 1], c=\textquotesingle{}blue\textquotesingle{}, label=\textquotesingle{}Preferred\textquotesingle{}, alpha=0.5)} -\NormalTok{plt.scatter(items[non\_preferred, 0], items[non\_preferred, 1], c=\textquotesingle{}purple\textquotesingle{}, label=\textquotesingle{}Non{-}preferred\textquotesingle{}, alpha=0.5)} -\NormalTok{plt.legend()} -\NormalTok{plt.show()} -\end{Highlighting} -\end{Shaded} - -\end{tcolorbox} - -\begin{Shaded} -\begin{Highlighting}[numbers=left,,] -\NormalTok{n\_pairs }\OperatorTok{=} \DecValTok{10000} -\NormalTok{pair\_indices }\OperatorTok{=}\NormalTok{ np.random.randint(}\DecValTok{0}\NormalTok{, n\_items, size}\OperatorTok{=}\NormalTok{(n\_pairs, }\DecValTok{2}\NormalTok{))} -\CommentTok{\# Exclude pairs where both indices are the same} -\NormalTok{mask }\OperatorTok{=}\NormalTok{ pair\_indices[:, }\DecValTok{0}\NormalTok{] }\OperatorTok{!=}\NormalTok{ pair\_indices[:, }\DecValTok{1}\NormalTok{]} -\NormalTok{pair\_indices }\OperatorTok{=}\NormalTok{ pair\_indices[mask]} - -\NormalTok{scores }\OperatorTok{=}\NormalTok{ np.zeros(n\_items, dtype}\OperatorTok{=}\BuiltInTok{int}\NormalTok{)} -\NormalTok{wins }\OperatorTok{=}\NormalTok{ rewards[pair\_indices[:, }\DecValTok{0}\NormalTok{]] }\OperatorTok{\textgreater{}}\NormalTok{ rewards[pair\_indices[:, }\DecValTok{1}\NormalTok{]]} - -\CommentTok{\# For pairs where the first item wins:} -\CommentTok{\# {-} Increase score for the first item by 1} -\CommentTok{\# {-} Decrease score for the second item by 1} -\NormalTok{np.add.at(scores, pair\_indices[wins, }\DecValTok{0}\NormalTok{], }\DecValTok{1}\NormalTok{)} -\NormalTok{np.add.at(scores, pair\_indices[wins, }\DecValTok{1}\NormalTok{], }\OperatorTok{{-}}\DecValTok{1}\NormalTok{)} - -\CommentTok{\# For pairs where the second item wins or it\textquotesingle{}s a tie:} -\CommentTok{\# {-} Decrease score for the first item by 1} -\CommentTok{\# {-} Increase score for 
the second item by 1} -\NormalTok{np.add.at(scores, pair\_indices[}\OperatorTok{\textasciitilde{}}\NormalTok{wins, }\DecValTok{0}\NormalTok{], }\OperatorTok{{-}}\DecValTok{1}\NormalTok{)} -\NormalTok{np.add.at(scores, pair\_indices[}\OperatorTok{\textasciitilde{}}\NormalTok{wins, }\DecValTok{1}\NormalTok{], }\DecValTok{1}\NormalTok{)} - -\CommentTok{\# Determine preferred and non{-}preferred items based on scores} -\NormalTok{preferred }\OperatorTok{=}\NormalTok{ scores }\OperatorTok{\textgreater{}} \DecValTok{0} -\NormalTok{non\_preferred }\OperatorTok{=}\NormalTok{ scores }\OperatorTok{\textless{}} \DecValTok{0} - -\NormalTok{draw\_surface()} -\NormalTok{plt.scatter(items[preferred, }\DecValTok{0}\NormalTok{], items[preferred, }\DecValTok{1}\NormalTok{], c}\OperatorTok{=}\StringTok{\textquotesingle{}blue\textquotesingle{}}\NormalTok{, label}\OperatorTok{=}\StringTok{\textquotesingle{}Preferred\textquotesingle{}}\NormalTok{, alpha}\OperatorTok{=}\FloatTok{0.5}\NormalTok{)} -\NormalTok{plt.scatter(items[non\_preferred, }\DecValTok{0}\NormalTok{], items[non\_preferred, }\DecValTok{1}\NormalTok{], c}\OperatorTok{=}\StringTok{\textquotesingle{}purple\textquotesingle{}}\NormalTok{, label}\OperatorTok{=}\StringTok{\textquotesingle{}Non{-}preferred\textquotesingle{}}\NormalTok{, alpha}\OperatorTok{=}\FloatTok{0.5}\NormalTok{)} -\NormalTok{plt.legend()} -\NormalTok{plt.show()} -\end{Highlighting} -\end{Shaded} - -\pandocbounded{\includegraphics[keepaspectratio]{src/chap2_files/figure-pdf/cell-7-output-1.pdf}} - -The Bradley-Terry model compares the reward of choice over all others -(\citeproc{ref-bradley-terry-model}{Bradley and Terry 1952}) in the set -of \(J\) choices \(i \in \{1, 2, \dots, J\}\). Each choice can also have -its unique random noise variable representing the unobserved factor. -However, we can also choose to have all choices' unobserved factors -follow the same distribution (e.g., independent and identically -distributed, IID). 
The noise is represented as an extreme value -distribution, although we can choose alternatives such as a multivariate -Gaussian distribution: \(\epsilon \sim \mathcal{N}(0, \Sigma)\). If -\(\Sigma\) is not a diagonal matrix, we effectively model correlations -in the noise across choices, enabling us to avoid the IID assumption. In -the case of the extreme value distribution, we model the probability of -a user preferring choice \(i\), which we denote as -\(P_i = Z^{-1}\exp(u_{i}^*)\) where -\(Z = \sum_{j = 1}^{J} \exp(u_{j}^*)\) and \(u_j^*\) is the utility term -for choice \(j\). - -We can model an open-ended ranking of the available items with the -Plackett-Luce model, in which we jointly model the full sequence of -choice ordering (\citeproc{ref-plackett_luce}{Plackett 1975}). The -general form models the joint distribution as the product of conditional -probabilities, where each is conditioned on the preceding ranking terms. -Given an ordering of \(J\) choices \(\{y_1, \dots, y_J\}\), we factorize -the joint probability into conditionals. Each conditional takes the same -softmax form as above, restricted to the choices not yet ranked: \[ -p(y_1, \dots, y_J) = p(y_1) p(y_2 | y_1) \cdots p(y_J | y_{1:{J - 1}}) = \prod_{i = 1}^J \frac{\exp(u_{y_i}^*)}{\sum_{j \ge i} \exp(u_{y_j}^*)} -\] - -Pairwise sampling has proven useful in aligning large language models -(LLMs) with human preference. An LLM, such as GPT-4, Llama 3.2, or BERT, -typically refers to a large, pre-trained neural network that serves -as the basis for various downstream tasks. LLMs are pre-trained on a -massive corpus of text data, learning to understand language and -context. They are capable of multiple language-related tasks such as -text classification, language generation, and question answering. An LLM -should be aligned to respond correctly based on human preferences. A -promising approach is to train LLMs using reinforcement learning (RL) -with a reward model (RM) learned from human preference data, providing -a mechanism to score the quality of the generated text. 
This approach, -known as RL from human feedback (RLHF), leverages human feedback to -guide model training, allowing LLMs to better align with human -expectations while continuously improving performance. - -We discuss the reward model used in the Llama2 model. The Llama2 RM -(\citeproc{ref-2307.09288}{Touvron et al. 2023}) is initialized from the -pretrained Llama2 LLM. In the LLM, the last layer is a mapping -\(L: \mathbb{R}^D \rightarrow \mathbb{R}^V\), where \(D\) is the -embedding dimension from the transformer decoder stack and \(V\) is the -vocabulary size. To get the RM, we replace that last layer with a -randomly initialized scalar head that maps -\(L: \mathbb{R}^D \rightarrow \mathbb{R}^1\). It's important to -initialize the RM from the LLM it's meant to evaluate. The RM will have -the same ``knowledge'' as the LLM. This is particularly useful for -evaluation objectives such as ``Does the LLM know when it doesn't -know?''. However, in cases where the RM is simply evaluating helpfulness -or factuality, it may be helpful to have the RM know more. In addition, -the RM is on distribution for the LLM: it is initialized in a way where -it semantically understands the LLM's outputs. An RM is trained with -paired preferences (prompt history, accepted response, rejected -response). Prompt history is a multiturn history of user prompts and -model generations; the accepted response is the final model -generation preferred by an annotator, and the rejected response is the -unpreferred response. The RM is trained with maximum likelihood under the -Bradley-Terry model with an optional margin term \(m(r)\): - -\[p(y_c \succ y_r | x) = \sigma(r_\theta(x,y_c) - r_\theta(x,y_r) - m(r))\] - -The margin term increases the distance in scores specifically for -preference pairs annotators rate as easier to separate. Margins were -designed based on the observation that the sigmoid function, which is -used to normalize the raw reward model score, flattens out beyond the -range of \([-4, 4]\). 
Thus, the maximum possible margin is eight. A small -regularization term is often added to center the score distribution on -0. We consider two variants of preference rating-based margin. In the -small-margin variant, the ratings ``Significantly Better'', ``Better'', -``Slightly Better'', and ``Negligibly Better or Unsure'' are assigned -margins of \(1\), \(2/3\), \(1/3\), and \(0\), respectively. -In the large-margin variant, the same ratings are assigned margins of -\(3\), \(2\), \(1\), and \(0\), respectively. - -\subsection{List-wise Model}\label{list-wise-model} - -\emph{Multiple-choice sampling} involves participants selecting one item -from a set of alternatives. Multiple-choice sampling is simple for -participants to understand and reflects realistic decision-making -scenarios where individuals choose one item from many. It is beneficial -in complex choice scenarios, such as modes of transportation, where -choices are not independent (\citeproc{ref-bolt2009}{Bolt and Wollack -2009}). Multiple-choice sampling often relies on simplistic assumptions -such as the independence of irrelevant alternatives (IIA), which may not -always hold true. This method may also fail to capture the variation in -preferences among different individuals, as it typically records only -the most preferred choice without accounting for the relative importance -of other items. In \emph{rank-order sampling}, participants rank items -from most to least preferred. Used in voting, market research, and -psychology, it provides rich preference data but is more complex and -cognitively demanding than pairwise comparisons, especially for large -item sets. Participants may also rank inconsistently -(\citeproc{ref-ragain2019}{Ragain and Ugander 2019}). In -\emph{best-worst scaling} (BWS), participants are presented with items and -asked to identify the most and least preferred items. 
The primary -objective of BWS is to discern the relative importance or preference of -items, making it widely applicable in various fields such as market -research, health economics, and social sciences -(\citeproc{ref-campbell2015}{Campbell and Erdem 2015}). BWS provides -rich data on the relative importance of items, helps clarify -preferences, reduces biases found in traditional rating scales, and -results in rewards that are easy to interpret. However, BWS also has -limitations, including potential scale interpretation differences among -participants and design challenges to avoid biases, such as the order -effect or the context in which items are presented. - -\section{The Utility Function Class}\label{function-class} - -\subsection{Parametric and Nonparametric Function -Class}\label{parametric-and-nonparametric-function-class} - -The reward of an item can take a parametric form, such as -\(z_j = f_{\theta}(x_j)\). It can also take a nonparametric form, -which is commonly used in the ideal point model, where the reward of an -item \(j\) is calculated by the distance from the item to the human in -some embedding space (\citeproc{ref-huber1976ideal}{Huber 1976}). Given a -vector representation \(e_i\) of choice \(i\) and a vector \(v_n\) -representing an individual \(n\), we can use a distance function \(K\) -to model a stochastic reward function with the unobserved factors -following a specified distribution: -\(u_{n, i} = K(e_i, v_n) + \epsilon_{n, i}\). The intuition is that -both vectors exist in a shared \(d\)-dimensional space, and as such, we can -use geometry to match choices whose representations are closest to that -of a given individual (\citeproc{ref-ideal_point}{Jamieson and Nowak -2011}; \citeproc{ref-tatli2022distancepreferences}{Tatli, Nowak, and -Vinayak 2022}) when equipped with a distance metric. Certain distance -metrics, such as Euclidean distance or inner product, can easily be -biased by the scale of vectors. 
A distance measure such as cosine -similarity, which compensates for scale by normalizing the inner product -of two vectors by the product of their magnitudes, can mitigate this -bias yet may discard valuable information encoded by the length of the -vectors. Beyond the distance metric alone, this model places a strong -inductive bias that the individual and choice representations share a -common embedding space. In some contexts, this can be a robust bias to -add to the model (\citeproc{ref-idealpoints}{Greiner 2005}), but it is a -key factor one must consider before employing such a model, and it is a -key design choice for modeling. - -\subsection{Unimodal and Multimodal Function -Class}\label{unimodal-and-multimodal-function-class} - -So far, we have considered learning from data from one person with a -particular set of preferences or a group with similar preferences, but -this is not always the case. Consider a scenario where a user turns left -at an intersection (\citeproc{ref-myers2021learning}{Myers et al. -2021}). What would they do if they saw a car speeding down the road -approaching them? Following a timid driving pattern, some vehicles would -stop to let the other car go, preventing a collision. Other vehicles -would be more aggressive and try to make the turn before colliding with -the oncoming vehicle. Given the data of one of these driving patterns, -the model can make an appropriate decision. However, what if the model -was given data from both aggressive and timid drivers and does not know -which data corresponds to which type of driver? A naive preference -learning approach would result in a model trying to find a policy close -enough to both driving patterns. The group label is often unobserved -because it is expensive to obtain or a data point cannot be cleanly -separated into any group (e.g., a more timid driver can be aggressive -when they are in a hurry). - -Myers et al. 
(\citeproc{ref-myers2022learning}{2022}) formulate this -problem as learning a mixture of \(M\) linear reward functions on the -embedding space, where \(M\) is given. The reward of item \(j\) given by -expert \(m\) is \(f_m(e_j) = w^\top_m e_j,\) where \(w_m\) -is a vector of parameters corresponding to the \(m\)-th expert's -preferences. An unknown distribution over the reward parameters exists, -and we can represent this distribution with convex mixing coefficients -\(\alpha = [\alpha_1, \dots, \alpha_M]\). Consider a robot that performs -a set of trajectories and asks a user to rank all the trajectories. -The robot will be given back a set of trajectory rankings from \(M\) humans, -and the objective is to learn the underlying reward functions. Given a -ranking \(j_1 \succ \cdots \succ j_K\) from an unknown expert and defining -\(\theta = \{w_{1:M}, \alpha_{1:M}\}\), the probability of the ranking, -marginalized over the experts, is - -\[p(j_1 \succ \cdots \succ j_K | \theta) = \sum_{m = 1}^M \alpha_m \prod_{k = 1}^K p_{m,k}\] - -where \(p_{m,k}\) is the probability, under expert \(m\)'s reward -\(f_m\), that item \(j_k\) is chosen from the items not yet ranked, as in -the Plackett-Luce model. Then the posterior over the parameters is -\(p(\theta | Q_{1:T}, x_{1:T}) \propto p(\theta) \prod_t p(x_t | Q_{\leq t}, \theta) = p(\theta) \prod_t p(x_t | \theta, Q_t)\). -The proportionality follows from Bayes' rule and the assumption that -the queries at time step \(t\) are conditionally independent of the -parameters given the history. This assumption is reasonable because the -previous queries \& rankings ideally give all the information to inform -the choice of the next set. The final equality comes from the -assumption that the ranked queries are conditionally independent given -the parameters. The prior distribution is dependent on the use case. For -example, in the user studies conducted by the authors to verify this -method, they use a standard Gaussian prior for the reward weights and a -uniform prior on the \((M - 1)\)-simplex for the mixing coefficients, -which ensures that they add up to 1. 
We can then estimate the parameters by maximizing the -simplified posterior. - -Another setting with multimodal preferences is negotiation -(\citeproc{ref-kwon2021targeted}{Kwon et al. 2021}). Let's say there are -some shared items and two people with different utilities and desires -for items, where each person only knows their own utility. In the specific -case of \textbf{?@fig-negotiation}, Bob is a proposing agent and Alice -is a controlled agent who has many different ways of responding to Bob's -proposals. Different methods can be used to design Alice as an AI agent. -The first idea is reinforcement learning, where multiple rounds of -negotiation are simulated and the model observes how Bob -reacts. The authors of this setting (\citeproc{ref-kwon2021targeted}{Kwon et -al. 2021}) show that over time the model learns to ask for the same -thing over and over again, as Alice is not trained to be human-like or -negotiable, and just tries to maximize her own utility. The second -approach is supervised learning, where the model is trained on a -dataset of past negotiations. This results in Alice -being very agreeable. The two approaches thus produce polar opposite -behaviors, and it would be ideal to find a middle ground that combines -both of them. The authors proposed the targeted acquisition approach, -which is based on active learning ideas. The model asks diverse -questions in different cases and stages of negotiation, as humans do, -determining which questions are more valuable to ask throughout -learning. This approach led to fairer and more optimal results -than supervised or reinforcement learning -(\citeproc{ref-kwon2021targeted}{Kwon et al. 2021}). - -\subsection{Single Objective and Multi-Objective -Utility}\label{single-objective-and-multi-objective-utility} - -The industry has centered around optimizing for two primary reward -signals: helpfulness and harmlessness (safety). 
There are also other -axes, such as factuality, reasoning, tool use, code, and -multilingualism, but these are out of scope for us. The Llama2 authors -collected preference data from humans for each quality, with separate -guidelines. This presents a challenge for co-optimizing the final LLM -towards both goals. Two main approaches can be taken for RLHF in this -context: (1) train a unified reward model that integrates both datasets, or -(2) train two separate reward models, one for each quality, and optimize the -LLM toward both. Option 1 is difficult because of the tension between -helpfulness and harmlessness. They trade off against each other, -confusing an RM trained on both. The chosen solution was option 2, where -two RMs are used to train the LLM piecewise. The helpfulness RM is used -as the primary optimization term, while the harmlessness RM acts as a -penalty term, driving the behavior of the LLM away from unsafe territory -only when the LLM veers beyond a certain threshold. This is formalized -as follows, where \(R_s\), \(R_h\), and \(R_c\) are the safety, -helpfulness, and combined reward, respectively, and \(g\) and \(p\) are the -model generation and the user prompt: - -\[ -\begin{aligned} - R_c(g \mid p) = - \begin{cases} - R_s(g \mid p) & \text{if } \text{is\_safety}(p) \text{ or } R_s(g \mid p) < 0.15 \\ - R_h(g \mid p) & \text{otherwise} - \end{cases} -\end{aligned} -\] - -\subsection{Pretraining}\label{pretraining} - -RL often stumbles when it comes to devising reward functions aligning -with human intentions. Preference-based RL aims to solve this by -learning from human feedback, but this often demands a \emph{highly -impractical number of queries} or leads to oversimplified reward -functions that don't hold up in real-world tasks. As discussed in the -previous section, one may apply meta-learning so that the RL agent can -adapt to new tasks with fewer human queries, addressing the impractical -number of queries otherwise required. 
Hejna III and Sadigh -(\citeproc{ref-hejna2023few}{2023}) propose pre-training models on previous tasks with the -meta-learning method MAML (\citeproc{ref-finn2017model}{Finn, Abbeel, -and Levine 2017}), so that the meta-trained model can adapt to new -tasks with fewer queries. We consider settings where a state is denoted -as \(s\in S\), and an action is denoted as \(a\in A\), for state space -\(S\) and action space \(A\). The reward function -\(r: S\times A \to \mathbb{R}\) is unknown and needs to be learned by -eliciting human preferences. There are multiple tasks, each with its own -reward function and transition probabilities. We denote by -\(\hat{r}_\psi(s, a)\) a -learned estimate of the unknown ground-truth reward function \(r(s, a)\), -parameterized by \(\psi\). Accordingly, a reward model determines an RL -policy \(\phi\) by maximizing the accumulated rewards. Preferences -are learned from pairwise comparisons. For each pre-training task, there is a dataset -\(D\) consisting of binary preferences between pairs of trajectories. -The Bradley-Terry model is used to predict the preferred trajectory. - -To efficiently approximate the reward function \(r_\text{new}\) for a -new task with minimal queries, Hejna III and Sadigh -(\citeproc{ref-hejna2023few}{2023}) utilize a pre-trained reward -function \(\hat{r}_\psi\) that can be quickly fine-tuned using just a -few preference comparisons, leveraging the common structure across -tasks learned by pre-training on data from prior tasks. Although any -meta-learning method is compatible, Hejna III and Sadigh -(\citeproc{ref-hejna2023few}{2023}) opt for Model-Agnostic Meta-Learning (MAML) due -to its simplicity. After this meta-learning pre-training, -the meta-learned reward model can then be used for few-shot -preference-based RL during an online adaptation phase. 
Given a -pre-trained reward model \(\psi\), the active few-shot adaptation -iterates between finding an informative pair of trajectories to query the human -about and updating the reward model and the corresponding policy with the new data. -An informative pair is selected using the disagreement of an ensemble of -reward functions over the preference predictors. Specifically, -comparisons that maximize \(\mathbb{V}(p(e_j \succ e_{j'}))\) are -selected each time feedback is collected. - -The experiment tests the proposed method on the Meta-World benchmark -(\citeproc{ref-yu2020meta}{Yu et al. 2020}). Three baselines compared -with the proposed method are (1) Soft Actor-Critic (SAC) trained from -ground truth rewards, representing a performance upper bound, (2) PEBBLE -(\citeproc{ref-lee2021pebble}{Lee, Smith, and Abbeel 2021}), which does -not use information from prior tasks, and (3) Init, which initializes -the reward model with the pretrained weights from meta-learning but, -instead of adapting the reward model to the new task, performs -standard updates as in PEBBLE. The results show that the proposed method -outperforms all of the baseline methods. There are still some drawbacks. -For example, many of the queries the model picks to elicit human -preference are almost identical. Moreover, despite the improved query -complexity, an impractical number of queries still needs to be made. In -addition, the paper notes that the proposed method may be -even worse than training from scratch if the new task is too -far out of distribution. Designing a method that automatically balances -between using the prior information and training from scratch is an -important future direction. - -Zhou et al. 
(\citeproc{ref-zhou2019watch}{2019}) study a related -problem by asking the question, ``How can we efficiently learn both from -expert demonstrations and from trials where we only get binary feedback -from a human?'' This paper seeks to learn new tasks with the following -general problem setting: we only get one expert demonstration of the -target task; after seeing the expert demonstration, the robot tries to solve -the task one or more times; then the user (or some pre-defined reward -function) annotates each trial as a success or failure; the agent learns -from both the demos and the annotated trials to perform well on the -target task. A task \(i\) is described by the tuple -\(\{S, A, r_i, P_i\}\). \(S\) and \(A\) denote the state and action -spaces, respectively. \(r_i\) is the reward function -\(r_i : S \times A \to \mathbb{R}\), and \(P_i\) is the transition -dynamics function. \(S\) and \(A\) are shared across tasks. Learning -occurs in 3 phases. During the Watch phase, we give the agent \(K=1\) -demonstrations of the target task, and all demonstrations are -successful. In the Try phase, we use the agent learned during the Watch -phase to attempt the task for \(L\) trials. After the agent completes -the trials, humans (or pre-programmed reward functions) provide one -binary reward for each trial, indicating whether the trial was -successful. The expected output of this phase is \(L\) trajectories and -corresponding feedback. In the Learn phase, the agent must -learn from both the original expert demonstrations and the trials to -solve the target task. - -First, we are given a dataset of expert demonstrations that covers -hundreds of tasks and contains multiple demos for each task. -Importantly, no online interaction is needed for training, and this -method trains only with supervised learning. 
This section describes how -this paper trains an agent from the given expert demonstrations, and how -to incorporate the trials and human feedback into the loop. What we want -to obtain out of the Watch phase is a policy conditioned on a set of -expert demonstrations via meta-imitation learning. Given the -demonstrations \(\{d_{i,k}\}\) for task \(i\), we sample another, -different demonstration \(d_i^{\text{test}}\) from the same task, -where \(d_i^{\text{test}}\) is an example of optimal behavior given the -demonstrations. The policy is obtained by imitating actions taken on -\(d_i^{\text{test}}\) via maximum likelihood: - -\[\mathcal{L}^{\text{watch}}(\theta, \mathcal{D}_i^*) = \mathbb{E}_{\{d_{i,k}\} \sim \mathcal{D}_i^*} \mathbb{E}_{d_i^{\text{test}} \sim \mathcal{D}_i^* \setminus \{d_{i,k}\}} \mathbb{E}_{(s_t, a_t) \sim d_i^{\text{test}}} \big[- \log \pi_\theta^{\text{watch}} (a_t | s_t, \{d_{i,k}\}) \big]\] - -This corresponds to imitation learning by minimizing the negative -log-likelihood of the test trajectory actions, conditioning the policy -on the entire demo set. However, how is the conditioning on the demo set -achieved? In addition to using features obtained from the images of the -current state, the architecture uses features from frames sampled (in -order) from the demonstration episodes, which are concatenated together. -In the Try phase, when the agent is given a set of demonstrations -\(\{d_{i,k}\}\), we deploy the policy -\(\pi_\theta^{\text{watch}}(a | s, \{d_{i,k}\})\) to collect \(L\) -trials. There is no training involved in the Try phase; we simply -condition the policy on the given demonstrations. During the Watch -phase, the objective was to train a policy conditioned on demonstrations, -\(\pi_\theta^{\text{watch}}(a | s, \{d_{i,k}\})\). The authors of Watch, -Try, Learn use a similar strategy for the Learn -phase. We now want to train a policy that is conditioned on the -demonstrations, as well as the trials and binary feedback. 
We want to -learn -\(\pi_\phi^{\text{learn}}(a | s, \{d_{i,k}\}, \{\tau_{i, l}\})\). -To train the policy, we again use meta-imitation learning, where we -additionally sample yet another trajectory from the same task. -Concretely, we train policy parameters \(\phi\) to minimize the -following loss: -\[\mathcal{L}^{\text{learn}}(\phi, \mathcal{D}_i, \mathcal{D}_i^*) = \mathbb{E}_{(\{d_{i,k}\}, \{\tau_{i,l}\}) \sim \mathcal{D}_i} \mathbb{E}_{d_i^{\text{test}} \sim \mathcal{D}_i^* \setminus \{d_{i,k}\}} \mathbb{E}_{(s_t, a_t) \sim d_i^{\text{test}}} \big[- \log \pi_\phi^{\text{learn}} (a_t | s_t, \{d_{i,k}\}, \{\tau_{i,l}\}) \big]\] - -Three baselines are considered: (1) behavior cloning, which is simple imitation -learning based on maximum-likelihood training using data from all -tasks; (2) meta-imitation learning, which corresponds to running the -policy from the Watch phase without using any trial data, conditioning -only on the set of expert demonstrations and not on online trials; and -(3) behavior cloning + SAC, which pretrains a policy with behavior cloning on -all data and follows that with RL fine-tuning for the specific target -task, using the maximum-entropy algorithm SAC -(\citeproc{ref-haarnoja2018soft}{Haarnoja et al. 2018}). The proposed -approach significantly outperforms baselines on every task family: it is -far superior to behavior cloning and it significantly surpasses -meta-imitation learning on 3 out of 4 task families. - -\subsection{Other Considerations}\label{others-consideration} - -One key challenge is managing the bias and variance trade-off. Bias -refers to assumptions made during model design and training that can -skew predictions. For example, in ideal point models, we make the -assumption that the representations we use for individuals and choices -are aligned in the embedding space and that this representation is -sufficient to capture human preferences using distance metrics. 
However, -there are myriad cases in which this may break down, for example, if the -two sets of vectors follow different distributions, each with their own -unique biases. If the representations do not come from the same domain, -one may have little visibility into how a distance metric computes the -final reward value for a choice for a given individual. Some ways to -mitigate bias in human preference models include increasing the number -of parameters in a model (allowing for better learning of patterns in -the data) or removing inductive biases based on our assumptions of the -underlying data. On the other hand, variance refers to the model's -sensitivity to small changes in the input, which leads to significant -changes in the output. This phenomenon is often termed `overfitting' or -`overparameterization.' This behavior can occur in models that have many -parameters and learn correlations in the data that do not contribute to -learning human preferences but are artifacts of noise in the dataset -that one should ultimately ignore. One can address variance in models by -reducing the number of parameters or incorporating biases in the model -based on factors we can assume about the data. - -Another important consideration unique to human preference models is -that we wish to model individual preferences, and we may choose to do so -at arbitrary granularity. For example, we can fit models to a specific -individual or even multiple models for an individual, each for different -purposes or contexts. On the other end of the spectrum, we may create a -model to capture human preferences across large populations or the -world. Individual models may prove to be more powerful, as they do not -need to generalize across multiple individuals and can dedicate all of -their parameters to learning the preferences of a single user. 
In the -context of human behavior, this can be a significant advantage as any -two individuals can be arbitrarily different or even opposite in their -preferences. On the other hand, models that fit only one person can -tremendously overfit the training distribution and capture noise in the -data, which is not truly representative of human preferences. On the other end -of the spectrum, models fit to the entire world may be inadequate to -model human preferences for arbitrary individuals, especially those -whose data they have not been fit to. As such, these models may underfit the -given training distribution. They aim to generalize to many -people but may fail to capture the nuances of individual preferences, -especially for those whose data is not represented in the training set. -As a result, they may not perform well for arbitrary individuals within -the target population. Choosing the appropriate scope for a model is -crucial. It must balance the trade-off between overfitting to noise in -highly granular models and underfitting in broader models that may not -capture individual nuances. - -When training or using a reward model, LLM distribution shift is an -important factor to consider. With each fine-tuning round of the LLM, the RM -should be updated by collecting fresh human preferences on -generations from the new LLM. This ensures that the RM stays aligned -with the current distribution of the LLM and avoids drifting -off-distribution. In addition, the RM and LLM are coupled: an RM is -generally optimized to distinguish human preferences more efficiently -within the specific distribution of the LLM to be optimized. However, -this specialization poses a challenge: such an RM will underperform when -dealing with generations not aligned with this specific LLM -distribution, such as generations from a completely different LLM. Last -but not least, training RMs can be unstable and prone to overfitting, -especially with multiple training epochs. 
It's generally advisable to -limit the number of epochs during RM training to avoid this issue. - -\section{Exercises}\label{exercises} - -\subsection*{Question 1: Choice Modeling (15 -points)}\label{question-1-choice-modeling-15-points} -\addcontentsline{toc}{subsection}{Question 1: Choice Modeling (15 -points)} - -We discussed discrete choice modeling in the context of reward being a -linear function. Suppose we are deciding between \(N\) choices and that -the reward for each choice is given by -\(U_i=\beta_i\mathbf{x}+\epsilon_i\) for \(i=1, 2, \cdots, N\). We view -\(\mathbf{x}\) as the data point that is being conditioned on for -deciding which choice to select, and \(\beta_i\) as the weights driving -the linear reward model. The noise \(\epsilon_i\) is i.i.d. sampled from -a type of extreme value distribution called the \emph{Gumbel} -distribution. The standard Gumbel distribution is given by the density -function \(f(x)=e^{-(x+e^{-x})}\) and cumulative distribution function -\(F(x)=e^{-e^{-x}}.\) Fix \(i\). Our objective is to calculate -\(p(U_i\,\, \text{has max reward})\). - -\begin{enumerate} -\def\labelenumi{(\alph{enumi})} -\item - \textbf{(Written, 2 points)}. Set \(U_i=t\) and compute \(p(U_j