---
output: github_document
bibliography: ["readme.bib"]
---

```{r rmd-setup, include = FALSE}
library(prefio)
library(tibble)
library(dplyr)
knitr::opts_chunk$set(fig.path = "man/figures/")
```

# [prefio](https://fleverest.github.io/prefio/) <img src="man/figures/prefio.svg" width="160" align="right" alt="prefio hex sticker" />

<!-- badges: start -->
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/prefio)](https://cran.r-project.org/package=prefio)
[![R-CMD-check](https://github.com/fleverest/prefio/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/fleverest/prefio/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/fleverest/prefio/branch/main/graph/badge.svg)](https://app.codecov.io/gh/fleverest/prefio?branch=main)
<!-- badges: end -->

## Overview

Preferential datasets are used by many research communities, including those working on elections, recommender systems, computational social choice, and combinatorial optimization.

**prefio** provides a tidy format for dealing with preferences, along with a set of functions which enable users to perform a wide range of analyses.


## Installation

The package may be installed from CRAN via

```{r, eval = FALSE}
install.packages("prefio")
```

The development version can be installed from GitHub via

```{r, eval = FALSE}
# install.packages("remotes")
remotes::install_github("fleverest/prefio")
```


## Usage

**prefio** provides a tidy interface for processing data from tabular
formats as well as sourcing data from one of the unified
[PrefLib formats](https://preflib.org/format), including a convenient
method for downloading data files directly from PrefLib to your R session.


#### Casting from character vectors

The easiest way to try things out is to write preferences as strings, then cast to preferences. For example:

```{r}
preferences(c("Apple > Banana > Carrot", "Carrot > Banana = Apple"))
```

#### Processing long-format data

Preferential datasets come in many forms. A very common way to store them is in
long format, with one row per voter-item pair and separate item and rank
columns. For example, consider the following dataset of votes:

```{r echo = FALSE, results = 'asis'}
long <- tribble(
  ~ID, ~VoterLocation, ~Candidate, ~Rank,
  1, "Melbourne", "Allie", 1,
  1, "Melbourne", "Beatriz", 2,
  1, "Melbourne", "Charles", 3,
  2, "Wangaratta", "Allie", 3,
  2, "Wangaratta", "Beatriz", 2,
  2, "Wangaratta", "Charles", 1,
  3, "Geelong", "Allie", 2,
  3, "Geelong", "Beatriz", 1,
  3, "Geelong", "Charles", 3
)
knitr::kable(
  long,
  caption = "Three preferential votes, ranking three candidates in long-format."
)
```

Here, we summarise the votes in a new column of type `preferences`.

Note that, since we are just gathering preferential data from across
multiple rows here, the syntax is quite similar to `tidyr::pivot_wider`. Indeed, the function is based on it, and extra arguments are passed directly to `tidyr::pivot_wider` via `...`.

```{r}
long <- tribble(
  ~ID, ~VoterLocation, ~Candidate, ~Rank,
  1, "Melbourne", "Allie", 1,
  1, "Melbourne", "Beatriz", 2,
  1, "Melbourne", "Charles", 3,
  2, "Wangaratta", "Allie", 3,
  2, "Wangaratta", "Beatriz", 2,
  2, "Wangaratta", "Charles", 1,
  3, "Geelong", "Allie", 2,
  3, "Geelong", "Beatriz", 1,
  3, "Geelong", "Charles", 3
)

long |>
  long_preferences(
    vote,
    id_cols = c(ID, VoterLocation),
    rank_col = Rank,
    item_col = Candidate
  )
```
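
To make the analogy with `pivot_wider` concrete, the following sketch reshapes the same table with plain `tidyr::pivot_wider`, using the same ID columns. It only illustrates the reshaping step and is not a description of how `long_preferences` is implemented; the name `wide_ranks` is illustrative.

```{r, eval = FALSE}
library(tibble)

# The same long-format votes as in the example above.
long <- tribble(
  ~ID, ~VoterLocation, ~Candidate, ~Rank,
  1, "Melbourne", "Allie", 1,
  1, "Melbourne", "Beatriz", 2,
  1, "Melbourne", "Charles", 3,
  2, "Wangaratta", "Allie", 3,
  2, "Wangaratta", "Beatriz", 2,
  2, "Wangaratta", "Charles", 1,
  3, "Geelong", "Allie", 2,
  3, "Geelong", "Beatriz", 1,
  3, "Geelong", "Charles", 3
)

# Plain reshaping: one row per voter, one column per candidate,
# with the assigned rank as the cell value.
wide_ranks <- tidyr::pivot_wider(
  long,
  id_cols = c(ID, VoterLocation),
  names_from = Candidate,
  values_from = Rank
)
wide_ranks
```

The result has one row per voter and a numeric rank column per candidate, whereas `long_preferences` collects those ranks into a single `preferences` column.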

#### Processing wide-format data

Another common way to store preferential data is in wide-format, where each column represents a candidate/item and the values represent the rank assigned. Let's recreate our previous example but in wide-format:

```{r echo = FALSE, results = 'asis'}
wide <- tribble(
  ~ID, ~VoterLocation, ~Allie, ~Beatriz, ~Charles,
  1, "Melbourne", 1, 2, 3,
  2, "Wangaratta", 3, 2, 1,
  3, "Geelong", 2, 1, 3
)

knitr::kable(
  wide,
  caption = "Three preferential votes, ranking three candidates in wide-format."
)
```

Here, we summarise the votes in a new column of type `preferences`.

```{r}
wide <- tribble(
  ~ID, ~VoterLocation, ~Allie, ~Beatriz, ~Charles,
  1, "Melbourne", 1, 2, 3,
  2, "Wangaratta", 3, 2, 1,
  3, "Geelong", 2, 1, 3
)

wide |>
  wide_preferences(vote, Allie:Charles)
```
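
The wide and long layouts carry the same information, so one can be reshaped into the other. As a minimal sketch, `tidyr::pivot_longer` (from tidyr, not prefio) turns the wide table back into item/rank rows; the name `long_again` is illustrative.

```{r, eval = FALSE}
library(tibble)

# The same wide-format votes as in the example above.
wide <- tribble(
  ~ID, ~VoterLocation, ~Allie, ~Beatriz, ~Charles,
  1, "Melbourne", 1, 2, 3,
  2, "Wangaratta", 3, 2, 1,
  3, "Geelong", 2, 1, 3
)

# Stack the candidate columns into item/rank pairs:
# one row per voter-candidate combination.
long_again <- tidyr::pivot_longer(
  wide,
  cols = Allie:Charles,
  names_to = "Candidate",
  values_to = "Rank"
)
long_again
```

The result has the same columns as the long-format table used earlier (`ID`, `VoterLocation`, `Candidate`, `Rank`), so it could be passed to `long_preferences` in the same way.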


#### Reading from PrefLib

The [Netflix Prize](https://en.wikipedia.org/wiki/Netflix_Prize) was a
competition devised by Netflix to improve the accuracy of its recommendation
system. To facilitate this, Netflix released movie ratings from users of the
system. These ratings have been transformed into preference data and are
available from [PrefLib](https://www.preflib.org/data/ED/00004/) [@Bennett2007].
Each data set comprises rankings of a set of 3 or 4 movies selected at random.
Here we consider rankings for just one set of movies to illustrate the
functionality of **prefio**.

PrefLib data files such as these can be downloaded on the fly by specifying the
argument `from_preflib = TRUE` in the `read_preflib` function:

```{r}
netflix <- read_preflib("00004 - netflix/00004-00000138.soc", from_preflib = TRUE)
head(netflix)
```

Each row corresponds to a unique ordering of the four movies in the dataset.
The number of Netflix users that assigned each ordering is given in the
`frequency` column. In this case, the most common ordering (with 68 voters
specifying the same preferences) is the following:

```{r}
netflix$preferences[1]
```
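
Because the counts are stored in the `frequency` column, ordinary dplyr verbs can summarise them. The sketch below assumes the data behaves like a data frame with a numeric `frequency` column. The helper `add_share` and the toy table are hypothetical and not part of **prefio**; the toy counts are made up purely to show the shape of the result.

```{r, eval = FALSE}
library(dplyr)
library(tibble)

# Hypothetical helper (not part of prefio): adds each ordering's share of
# all voters and sorts the most common orderings first.
add_share <- function(data) {
  data |>
    mutate(share = frequency / sum(frequency)) |>
    arrange(desc(frequency))
}

# Toy stand-in with made-up counts; the orderings are plain strings here
# rather than a `preferences` column.
toy <- tibble(
  preferences = c("A > B > C", "B > A > C", "C > B > A"),
  frequency = c(2, 5, 3)
)
add_share(toy)

# With the data downloaded above, the call would be: add_share(netflix)
```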


#### Writing to PrefLib formats

**prefio** provides a convenient interface for writing preferential datasets to
PrefLib formats. To aid the user, the `preferences()` function automatically
calculates metrics of the dataset which are required for producing valid PrefLib
files. For example, we can write our example from earlier to a PrefLib format:

```{r}
long |>
  long_preferences(
    vote,
    id_cols = ID,
    rank_col = Rank,
    item_col = Candidate,
    unused_fn = list(VoterLocation = dplyr::first)
  ) |>
  write_preflib(preferences_col = vote)
```

Note that this produces four warnings. Each warning corresponds to a field which
is required by the official PrefLib format, but may not be necessary for
internal use-cases. If your goal is to publish some data to PrefLib, these
warnings must be resolved.


## Projects using **prefio**

The [PrefLib formatter for New South Wales Legislative Assembly Elections](https://github.com/fleverest/nswla_preflib)
uses **prefio** to process the public election datasets into PrefLib formats.

The R package [elections.dtree](https://github.com/fleverest/elections.dtree) uses **prefio** for tracking
ballots observed by the Dirichlet-tree model.

## References