---
execute:
  fig-height: 4
format:
  html:
    toc-depth: 3
---
# Creating Plots {#sec-plots1}

![A fuzzy monster in a beret and scarf, critiquing their own column graph on a canvas in front of them while other assistant monsters (also in berets) carry over boxes full of elements that can be used to customize a graph (like themes and geometric shapes). In the background is a wall with framed data visualizations. Stylized text reads “ggplot2: build a data masterpiece.”](img/ggplot2.png)

[Learn more about [ggplot2](https://ggplot2.tidyverse.org/)]{.aside}


Visualizing your data is hands down the most important thing you can learn to do. Seeing is critical to understanding. There are two audiences in mind when creating data visualizations:

1. **For your eyes only:** These are quick and dirty plots, without annotation. Meant to be looked at once or twice.
2. **To share with others:** These should have informative captions, axis labels, titles, colors as needed, etc. We'll see how to add these features throughout this course.

The functions from the `ggplot2` package, along with derivatives such as `ggpubr` and `sjPlot`, automatically do a lot of this work for you. While all of these plots can be made with base R plotting functions, we are intentionally choosing to highlight functions that create good quality plots with very little code and are quite extensible and flexible.


:::{.callout-note title = "🎓 Learning Objectives" icon=false}

After completing this lesson students will be able to create basic statistical data visualizations for one and two variables, using multiple approaches.

:::

:::{.callout-tip title = "👉 Prepare" icon=false}

1.  Open your Math 130 R Project.
2.  Right click and "save as" this lesson's [[Quarto notes file]](notes/plots1_notes.qmd) and save it into your `Math130/notes` folder.
3.  In the *Files* pane, open this Quarto file and Render this file.

:::


## The syntax of `ggplot`

The reason we use the functions in `ggplot2` is for consistency in the structure of its arguments. Here is a bare bones generic plotting function:

```r
ggplot(data, aes(x=x, y=y, col=col, fill=fill, group=group)) +  geom_THING()
```

### Required arguments {.unnumbered}

* `data`: What data set is this plot using? This is ALWAYS the first argument.
* `aes()`: This is the _aesthetics_ of the plot. What variable is on the x, and what is on the y? Do you want to color by another variable, perhaps fill some box by the value of another variable, or group by a variable?
* `geom_THING()`: Every plot has to have a geometry. What is the shape of the thing you want to plot? Do you want to plot points? Use `geom_point()`. Want to connect those points with a line? Use `geom_line()`. We will see many varieties in this lesson.
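
To make these three pieces concrete, here is a minimal sketch using a small made-up data frame (`toy`, with hypothetical columns `day` and `temp` that are not part of this lesson's data). The `data` and `aes()` stay the same in both plots; only the geometry changes.

```{r}
library(ggplot2)

# Hypothetical toy data, made up for illustration only
toy <- data.frame(day  = 1:5,
                  temp = c(61, 64, 70, 68, 73))

# Same data and aesthetics, two different geometries
p_points <- ggplot(toy, aes(x = day, y = temp)) + geom_point() # one dot per row
p_lines  <- ggplot(toy, aes(x = day, y = temp)) + geom_line()  # connect the values

p_points
p_lines
```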

:::{.callout-note icon = false title = "Meet the Penguins"}
The `palmerpenguins` data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.
[Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. <a href= "https://allisonhorst.github.io/palmerpenguins/">https://allisonhorst.github.io/palmerpenguins/</a>]{.aside}

```{r}
library(ggplot2); library(sjPlot)
library(ggpubr);  library(gtsummary)
pen <- palmerpenguins::penguins
```
:::

I am loading the `penguins` data set out of the `palmerpenguins` package and storing it into a data frame named `pen`. We will be exploring variables such as species, body weight, the island and flipper lengths.

```{r}
str(pen)
```


## One categorical variable

Both Nominal and Ordinal data types can be visualized using tables, barcharts or pie charts.

### Barchart

A barchart (or barplot) takes the frequency of each category and draws bars along the x-axis, where the height of each bar is determined by that category's frequency.
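
The frequencies behind a barchart are simply the counts of rows in each category. As a quick sketch, base R's `table()` reports them for `species`; these counts are the bar heights in the plots that follow. The data is reloaded here only so the chunk runs on its own, and the name `species_counts` is made up for this example.

```{r}
pen <- palmerpenguins::penguins

# Count how many penguins belong to each species
species_counts <- table(pen$species)
species_counts
```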

::: {.panel-tabset}

## `ggplot`

Using `ggplot2` with the `geom_bar()` geometry layer draws one wide bar per category and labels both axes for us.
```{r}
ggplot(pen, aes(x=species)) + geom_bar()
```

## `sjPlot`

Using the `plot_frq` function from the `sjPlot` package builds on the `geom_bar()` type plot from `ggplot`, but adds frequencies and relative percentages on the plot.
```{r}
plot_frq(pen, "species")
```

This single graph provides a lot of good information and is a recommended choice to use.

:::

### Pie charts

A pie chart is a circular statistical graphic which is divided into slices to illustrate percentages out of a whole. While pie charts are very widely used in the media and business, there are some major drawbacks in that "_humans are pretty bad at reading angles_" [(Ref: The Issue with Pie Chart)](https://www.data-to-viz.com/caveat/pie.html)

The approach is to pipe the results of a `table` to a `pie` using base R. But the results are kinda "meh".

:::: {.columns}
::: {.column width="50%"}

```{r}
#| eval: false
table(pen$species) |> pie()
```

Nicer pie charts using `ggplot2` or `ggpubr` functions require the data set to be pre-aggregated, and so we will come back to these approaches in a later lesson.

:::

::: {.column width="50%"}
```{r}
#| echo: false
table(pen$species) |> pie()
```
:::
::::


## One continuous variable

We will examine the chonkiness of the penguin (`body_mass_g`) using several types of appropriate visualizations including histograms, density plots, boxplots and violin plots.

### Histogram

Rather than showing the value of each observation, we prefer to think of the value as belonging to a _bin_. The heights of the bars in a histogram display the frequency of values that fall into each of those bins.

Since the x-axis is continuous the bars touch. This is unlike the barchart that has a categorical x-axis, and vertical bars that are separated.

::: {.panel-tabset}
## `ggplot2`

Using the `ggplot2` package we can create a histogram by adding the layer `geom_histogram()`.
```{r}
ggplot(pen, aes(x=body_mass_g)) + geom_histogram()
```

## `ggpubr`

In contrast to `ggplot2`'s common starter code and different `geom`etries, the `ggpubr` package uses specific functions for each type of plot. The `gghistogram` function makes a histogram very similar to the `ggplot2` default, but with a different theme applied (different appearance). Otherwise it's the same.

```{r}
gghistogram(pen, x="body_mass_g")
```

:::{.callout-important title = "Variables names in quotes"}
This is a feature of `ggpubr` functions - variable names are always in quotes.
:::

:::
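
The choice of bins changes how a histogram looks. `geom_histogram()` uses 30 bins unless told otherwise; the sketch below sets the bin width by hand instead using the `binwidth` argument. The widths of 100 and 500 grams are arbitrary choices for illustration, not recommendations.

```{r}
#| layout-ncol: 2
library(ggplot2)
pen <- palmerpenguins::penguins

# Narrow bins show more detail (and more noise)
p_narrow <- ggplot(pen, aes(x = body_mass_g)) + geom_histogram(binwidth = 100)
# Wide bins give a smoother, coarser picture
p_wide   <- ggplot(pen, aes(x = body_mass_g)) + geom_histogram(binwidth = 500)

p_narrow # left
p_wide   # right
```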

### Density curves

To get a better idea of the true shape of the distribution we can "smooth" out the bins and create what's called a `density` plot or curve. Notice that the shape of this distribution curve is much... "wigglier" than the histogram may have implied.

::: {.panel-tabset}
## `ggplot2`

With `ggplot2` we use the `geom_density()` geometry to produce a nicer looking density plot with minimal additional code.

```{r}
ggplot(pen, aes(x=body_mass_g)) + geom_density()
```

## `ggpubr`

And the `ggdensity` function from the `ggpubr` package creates a very similar density plot with a different default theme.

```{r}
ggdensity(pen, x="body_mass_g")
```

:::

### Boxplots

Another very common way to visualize the distribution of a continuous variable is using a boxplot. Boxplots are useful for quickly identifying where the bulk of your data lie. R specifically draws a "modified" boxplot where values that are considered outliers are plotted as dots.

::: {.panel-tabset}
## `ggplot2`

With `ggplot` you can create either a horizontal or vertical boxplot by specifying your numeric variable to be on either `x` or `y`. Notice the middle of the box is centered on 0; this is just a placeholder, and that axis has no inherent meaning.

```{r}
#| layout-ncol: 2
ggplot(pen, aes(x=body_mass_g)) + geom_boxplot() # left
ggplot(pen, aes(y=body_mass_g)) + geom_boxplot() # right
```

## `ggpubr`
You can also make a boxplot using the `ggboxplot` function from the `ggpubr` package; however, you must specify that the quantitative variable is on the `y` axis, otherwise it yells at you.

:::: {.columns}

::: {.column width="50%"}
```{r}
ggboxplot(pen, y="body_mass_g")
```
:::

::: {.column width="50%"}
```{r}
#| error: true
ggboxplot(pen, x="body_mass_g")
```
:::

::::


:::
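
The pieces of a boxplot can also be computed by hand. This sketch assumes the usual 1.5 × IQR rule for flagging outliers, which is the default used by `geom_boxplot()`. The object names (`mass`, `q`, `lower_fence`, `upper_fence`) are made up for this example.

```{r}
pen <- palmerpenguins::penguins
mass <- pen$body_mass_g[!is.na(pen$body_mass_g)] # drop missing values

# Quartiles: the box spans Q1 to Q3, with a line at the median
q <- quantile(mass, probs = c(0.25, 0.50, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Values beyond these fences would be drawn as individual dots
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

q
c(lower = lower_fence, upper = upper_fence)
sum(mass < lower_fence | mass > upper_fence) # number of flagged values
```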

## Two continuous variables

Visualizing the relationship between two continuous variables is done using a scatterplot. Let's compare the `flipper_length_mm` of a penguin to its `body_mass_g`.

::: {.panel-tabset}
## `ggplot2`

With `ggplot` we specify both the x and y variables, and add a `geom_point()` geometry layer.
```{r}
ggplot(pen, aes(x=flipper_length_mm, y=body_mass_g)) + geom_point()
```

## `ggpubr`

The `ggscatter` function creates a similar scatterplot.
```{r}
ggscatter(pen, x="flipper_length_mm", y="body_mass_g")
```
:::


### Adding trend lines

The two most common trend lines added to a scatterplot are the "best fit" straight line and the "loess" (low-ess) smoother line. Adding a trend line to this plot using base R is a bit trickier, so we won't bother.

::: {.panel-tabset}
## `ggplot2`

A trend line can be added by adding a `geom_smooth()` layer.

```{r}
ggplot(pen, aes(x=flipper_length_mm, y=body_mass_g)) + geom_point() +
  geom_smooth()
```

Here the point-wise confidence interval for this loess line is shown in grey. If you want to turn the confidence interval off, use `se=FALSE`.

We can add a second `geom_smooth()` layer so that the `lm` (linear model) line is drawn in red and the loess line (by not specifying a method) is drawn in blue.

```{r}
ggplot(pen, aes(x=flipper_length_mm, y=body_mass_g)) + geom_point() +
  geom_smooth(se=FALSE, method="lm", color="red") +
  geom_smooth(se=FALSE, color="blue")
```


## `ggpubr`

You can add _either_ a linear model line _or_ a loess line to a `ggscatter` using the `add=` argument.
```{r}
#| layout-ncol: 2
ggscatter(pen, x="flipper_length_mm", y="body_mass_g", add = "loess")
ggscatter(pen, x="flipper_length_mm", y="body_mass_g", add = "reg.line")
```
:::

### Correlation Coefficient

The _correlation coefficient_ (denoted $r$) is a summary number that describes the direction and strength of a linear relationship between two continuous variables. The `ggscatter` function can add this number to the plot using the `cor.coef` argument.

```{r}
#| fig-height: 4
#| fig-width: 4
ggscatter(pen, x="flipper_length_mm", y="body_mass_g",
          add = "reg.line", cor.coef = TRUE)
```

General guidelines for interpretation of _strength_:

* $|r| > .7$ - strong relationship
* $0.3 < |r| < 0.7$ - moderate relationship
* $|r| < .3$ - weak relationship

The _direction_ of the relationship is determined by the sign of $r$. A positive value indicates a positive relationship (as x increases, so does y), and a negative value indicates a negative relationship (as x increases, y decreases).

In this case we have a **strong positive relationship** between flipper length and body mass of penguins.
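
The same number can be computed directly with base R's `cor()`. Because a couple of penguins are missing these measurements, this sketch passes `use = "complete.obs"` so that only rows with both values recorded are used; without it the result would be `NA`. The name `r` is made up for this example.

```{r}
pen <- palmerpenguins::penguins

# Pearson correlation between flipper length and body mass,
# using only penguins with both measurements recorded
r <- cor(pen$flipper_length_mm, pen$body_mass_g, use = "complete.obs")
round(r, 2)
```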

## One continuous vs. one categorical

The tactic here is to create an appropriate plot for a continuous variable, and then `fill` the geometric area or `color` the lines depending on the level of the categorical variable.

### Histograms

Neither `fill`ing nor `color`ing the histogram bars by group using `ggplot2` works well due to the overlap.

::: {.panel-tabset}
## `ggplot2`

:::: {.columns}

::: {.column width="50%"}
```{r}
#| source-line-numbers: "2"
ggplot(pen, aes(x=body_mass_g,
                   fill=species)) +
  geom_histogram()

```
:::

::: {.column width="50%"}
```{r}
#| source-line-numbers: "2"
ggplot(pen, aes(x=body_mass_g,
                   color=species)) +
  geom_histogram()
```
:::

::::

## `ggpubr`

The defaults for `gghistogram` automatically adjust the transparency of the histogram bars to make the overlap a little less troublesome, but it doesn't always work well.

```{r}
gghistogram(pen, x = "body_mass_g", fill = "species")
```

:::
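
Part of the difficulty is that `geom_histogram()` stacks the groups on top of each other by default. One possible workaround, sketched here, is to draw every group from the baseline with `position = "identity"` and make the bars semi-transparent with `alpha` (0 = clear, 1 = opaque). The value 0.5 is an arbitrary choice, and the name `p_overlay` is made up for this example.

```{r}
library(ggplot2)
pen <- palmerpenguins::penguins

# Overlay (rather than stack) the three species, with see-through bars
p_overlay <- ggplot(pen, aes(x = body_mass_g, fill = species)) +
  geom_histogram(position = "identity", alpha = 0.5)
p_overlay
```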


### Density curves

Similar to histograms, you can `fill` or `color` the density curves depending on the group.

::: {.panel-tabset}
## `ggplot-fill`

It's still hard to see some groups due to the overlap, so we adjust the transparency by applying a value to `alpha` inside the `geom_density` layer. Alpha is a measure of transparency, from 0=clear to 1=opaque.
```{r}
#| source-line-numbers: "2"
ggplot(pen, aes(x=body_mass_g, fill=species)) +
  geom_density(alpha=.3)
```

## `ggplot-color`

You could also just color the lines and leave the fill alone.
```{r}
ggplot(pen, aes(x=body_mass_g, color=species)) + geom_density()
```

## `ggpubr`

The `ggdensity` function also has `color` and `fill` options, where the transparency of the density plots is automatically handled.
```{r}
#| layout-ncol: 2
ggdensity(pen, x="body_mass_g", color = "species") # left
ggdensity(pen, x="body_mass_g", fill = "species") # right
```

:::


### Boxplots

To create grouped boxplots, put the continuous variable on one axis, and the categorical on the other axis.

::: {.panel-tabset}
## `ggplot2`


```{r}
#| layout-ncol: 2
ggplot(pen, aes(x=body_mass_g, y=species)) + geom_boxplot() # left
ggplot(pen, aes(x=species, y=body_mass_g)) + geom_boxplot() # right
```

If you want an additional color feature (and the corresponding legend), you can either `fill` or `color` the boxes by the same categorical variable.
```{r}
#| layout-ncol: 2
ggplot(pen, aes(x=body_mass_g, y=species, fill = species)) + geom_boxplot() # left
ggplot(pen, aes(x=species, y=body_mass_g, color = species)) + geom_boxplot() # right
```


## `ggpubr`
Not much difference in the style between the `ggplot2` and `ggboxplot` versions. This method uses _slightly_ less code.
```{r}
#| layout-ncol: 2
ggboxplot(pen, y="body_mass_g", fill = "species") # left
ggboxplot(pen, y="body_mass_g", color = "species") # right
```

:::


## Two categorical variables

Recall from Section @sec-intro-tables that frequency tables are a common way to summarize categorical variables, and that both the frequency and relative percent are important summary numbers. Those percentages are even more important when comparing the joint distribution of two categorical variables.

Cross-tabs, cross-tabulations and two-way tables are different names for the same thing, and can be created by using the `table()` and `tbl_summary` functions. The values in each cell are the number of observations in that combination of characteristics. [We use this tactic in Chapter @sec-dm to check our recodes.]{.aside}

Let's explore the relationship between the penguins' sex and species.

### Frequency and proportion tables

::: {.panel-tabset}
## base R

The first argument `species` specifies the levels that show on the rows, the second argument  `sex` specifies the columns.

```{r}
table(pen$species, pen$sex)
```

## `tbl_summary`

To achieve the same ordering with species on the rows and sex on the columns, we `include="species"` and set `by = "sex"`.
```{r}
tbl_summary(pen, include = "species", by = "sex")
```
:::

There are 73 female Adelie penguins, and 61 male Gentoo penguins.

#### Proportions

By default, when we ask for proportions we get the **cell** proportions. That is, the percent out of _all_ penguins in that data set that have that combination of traits. The percents add up to 1 across the entire table.

::: {.panel-tabset}
## base R

```{r}
table(pen$species, pen$sex) |> prop.table()
```

## `tbl_summary`

Note that `gtsummary` tables tend to round pretty heavily.

```{r}
tbl_summary(pen, include = "species", by = "sex", percent = "cell")
```

:::

21.9% of all penguins are female Adelies, and 18.3% of all penguins are male Gentoos.

:::{.callout-note title = "Comparing percentages"}
More often than not, we want to compare the percents of one variable within each level of the other. For example, is the male to female ratio the same for each species?
:::

#### Row percents {#sec-row-pct-table}

To compare the distribution of sex (columns) within each of the species (rows) we need row percentages. The percentages now add up to 1 across the rows and the comparison groups are each species.

::: {.panel-tabset}
## base R

Specify `margin=1` inside the `prop.table()`
```{r}
table(pen$species, pen$sex) |> prop.table(margin=1)  |> round(3)
```

I added a `round` function to the end of this because no one needs that many decimal places.

## `tbl_summary`

```{r}
tbl_summary(pen, include = "species", by = "sex", percent = "row")
```
:::
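
As a quick check that row proportions sum to 1 within each species, the sketch below applies `rowSums()` to the row-proportion table. The name `row_props` is made up for this example.

```{r}
pen <- palmerpenguins::penguins

# Proportion of each sex within each species
row_props <- table(pen$species, pen$sex) |> prop.table(margin = 1)

# Each species' proportions should add up to 1
rowSums(row_props)
```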

50% _of Adelie penguins_ are male, but 51.3% _of Gentoo penguins_ are male.


#### Column percents

To compare the distribution of species (rows) within each of the columns (sex) we need column percentages. The percentages now add up to 1 down the columns and the comparison groups are male and female.

::: {.panel-tabset}
## base R

Specify `margin=2` in `prop.table()`
```{r}
table(pen$species, pen$sex) |> prop.table(margin=2) |> round(3)
```

## `tbl_summary`

```{r}
tbl_summary(pen, include = "species", by = "sex", percent = "column")
```
:::

44% _of female penguins_ are Adelie species, 36% _of male penguins_ are Gentoo.


### Stacked bar charts

Sometimes pictures are better than tables, so let's try to apply the same `fill` tactic that we used in the last section.

```{r}
ggplot(pen, aes(x=species, fill=sex)) + geom_bar()
```

:::{.callout-important title = "Stacked barcharts are the ggplot2 default"}
Stacked barcharts are generally only useful when comparing percents out of a whole. To get the correct view you can add a `position = "fill"` argument to the `geom_bar()` layer.
:::

```{r}
#| source-line-numbers: "2"
ggplot(pen, aes(x=species, fill=sex)) +
  geom_bar(position = "fill") +
  ylab("Proportion")
```

### Side by side bar charts

::: {.panel-tabset}
## `ggplot2`

Add the argument `position = "dodge"` inside the `geom_bar` layer to put the bars side by side.
```{r}
#| source-line-numbers: "2"
ggplot(pen, aes(x=species, fill=sex)) +
  geom_bar(position = "dodge")
```

## `sjPlot`

The `plot_xtab` function is the two-way table analog of `plot_frq`: it creates a barchart with clear labels on the bars for the N and % (with NA values dropped). Note you have to use dollar sign notation here for the variables.

```{r}
plot_xtab(x = pen$species, grp = pen$sex)
```

By default this plots the vertical axis as percents, not counts, and it shows the marginal total for the variable that's on the x-axis. We can remove the total by setting `show.total = FALSE`.

```{r}
#| source-line-numbers: "2"
plot_xtab(x = pen$species, grp = pen$sex,
          show.total = FALSE)
```
:::

As before, typically we aren't interested in comparing proportions out of the whole, but proportions out of one of the two margins (variables).


### Comparing Percents

Our eyes make comparisons best when the bars are physically close to each other. So generally we want to put the groups we want to compare _within_ on the x-axis, and have separate bars for each level of the variable we want to compare _across_.

See each tab for how to compare the distribution of `sex` within each `species`. This corresponds to the **row** percents from @sec-row-pct-table.

::: {.panel-tabset}
## `ggplot2`

We have to pre-aggregate the data to calculate the row proportions (using `prop.table(margin = 1)`) before we create the plot. Notice how converting the table to a data frame changes the variable names. Put `species` (`Var1`) on the x-axis, `fill` by `sex` (`Var2`), and `weight` the `Freq` variable. [I also changed the y axis label here to make it more clear that the heights of the bars are a proportion]{.aside}
```{r}
# Row proportions: the distribution of sex within each species
(pen.table <- table(pen$species, pen$sex) |> prop.table(margin = 1) |>
    as.data.frame()) # Printing to show the variable names have changed

ggplot(pen.table, aes(x=Var1, fill=Var2, weight = Freq)) +
  geom_bar(position = "dodge")  + ylab("Proportion")
```

## `sjPlot`

Put `species` on the x-axis and `grp` by `sex`, and specify `margin = "row"`.
```{r}
#| source-line-numbers: "2"
plot_xtab(x = pen$species, grp = pen$sex, show.total = FALSE,
          margin = "row")
```

:::

:::{.callout-tip title = "👉 Your Turn" icon=false}
Modify the code above to create a plot to compare the distribution of species within each sex.

<details>
  <summary> ggplot2 solution </summary>
```{r}
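# Column proportions: the distribution of species within each sex
pen.col <- table(pen$species, pen$sex) |> prop.table(margin = 2) |> as.data.frame()
ggplot(pen.col, aes(x=Var2, fill=Var1, weight = Freq)) +
  geom_bar(position = "dodge")  + ylab("Proportion")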
```
</details>

<details>
  <summary> sjPlot solution </summary>
```{r}
plot_xtab(x = pen$sex, grp = pen$species, show.total = FALSE, margin = "row")
```
</details>

:::