Simplify `expand` #290

eutwt · 2021-08-28T17:34:18Z

It looks like expand can be simplified so that it's just a CJ in j (plus careful name construction). I changed expected expressions in the tests, but all the outputs are unchanged.

I see that you did the original implementation @markfairbanks, so I'd appreciate it if you could take a look and tell me whether there's some cases not handled correctly in the simplified version.

Example with a lot of groups:

# expand1 = dtplyr master
# expand2 = This PR

library(dplyr)
library(bench)
library(data.table)
library(dtplyr)

df <-
  data.table(grpvar = rep(1:1e4, each = 10)) %>% 
    group_by(grpvar) %>%
    mutate(numvar = sample(50, n(), T),
           charvar = sample(c(rownames(mtcars), state.name), n(), TRUE)) %>%
    as.data.table()

#### Example with group_by

out_expand1 <- df %>% group_by(grpvar) %>% expand1(numvar, charvar) %>% as.data.table()
out_expand2 <- df %>% group_by(grpvar) %>% expand2(numvar, charvar) %>% as.data.table()

identical(out_expand1, out_expand2)
# [1] TRUE

mark(
  out_expand1 = df %>% group_by(grpvar) %>% expand1(numvar, charvar) %>% as.data.table(),
  out_expand2 = df %>% group_by(grpvar) %>% expand2(numvar, charvar) %>% as.data.table(),
  check = FALSE
)
# # A tibble: 2 × 13
#   expression       min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
#   <bch:expr>  <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>
# 1 out_expand1    9.32s  9.32s     0.107    1.01GB     1.39     1    13      9.32s <NULL> <Rpro…
# 2 out_expand2    1.67s  1.67s     0.597  202.68MB     2.39     1     4      1.67s <NULL> <Rpro…
# # … with 2 more variables: time <list>, gc <list>


#### Example without group_by

out_expand1 <- df %>% expand1(grpvar, numvar, charvar) %>% as.data.table()
out_expand2 <- df %>% expand2(grpvar, numvar, charvar) %>% as.data.table()

identical(out_expand1, setkey(out_expand2, NULL))
# [1] TRUE

mark(
  out_expand1 = df %>% expand1(grpvar, numvar, charvar) %>% as.data.table(),
  out_expand2 = df %>% expand2(grpvar, numvar, charvar) %>% as.data.table(),
  check = FALSE
)
# # A tibble: 2 × 13
#   expression       min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
#   <bch:expr>  <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>
# 1 out_expand1    2.35s  2.35s     0.426    2.14GB        0     1     0      2.35s <NULL> <Rpro…
# 2 out_expand2    1.15s  1.15s     0.873   783.3MB        0     1     0      1.15s <NULL> <Rpro…
# # … with 2 more variables: time <list>, gc <list>

eutwt · 2021-08-28T21:41:02Z

I think the section in the description below should be removed because it isn't accurate for dtplyr. But that's unrelated to this PR so I'll leave it.

#' # Note that all defined, but not necessarily present, levels of the
#' # factor variable `size` are retained.

eutwt · 2021-08-28T22:06:14Z

Previously I added a commit to remove grouping variables from CJ input. But I think that's a separate issue so I've removed it from this PR. For reference, here's how that works in dtplyr master, with data.frames, and in this PR:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(dtplyr)

##### dtplyr master
lazy_dt(data.frame(a = 1, b = 2)) %>% 
  group_by(a) %>% 
  expand(a, b) 
#> Error in colnamesInt(x, names(on), check_dups = FALSE): argument specifying columns specify non existing column(s): cols[1]='a'

##### data.frame output
data.frame(a = 1, b = 2) %>% 
  group_by(a) %>% 
  expand(a, b) 
#> # A tibble: 1 × 2
#> # Groups:   a [1]
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2

##### This PR
lazy_dt(data.frame(a = 1, b = 2)) %>% 
  group_by(a) %>% 
  expand(a, b) %>% print %>% 
  as_tibble()
#> Source: local data table [1 x 3]
#> Groups: a
#> Call:   `_DT1`[, CJ(a = a, b = b, unique = TRUE), keyby = .(a)]
#> 
#>       a     a     b
#>   <dbl> <dbl> <dbl>
#> 1     1     1     2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
#> Error: Column name `a` must not be duplicated.
#> Use .name_repair to specify repair.

hadley · 2021-08-30T12:58:42Z

Did the previous implementation expand factor levels? Or has the documentation always been wrong?

eutwt · 2021-08-30T13:11:30Z

The previous implementation doesn't expand factor levels. Example below

library(tidyr)
library(dtplyr)
df <- lazy_dt(data.frame(a = factor('a', levels = c('a', 'b'))))

df %>% 
  expand(a) %>% 
  as_tibble()
#> # A tibble: 1 × 1
#>   a    
#>   <fct>
#> 1 a

^{Created on 2021-08-30 by the reprex package (v2.0.1)}

hadley · 2021-08-30T13:20:55Z

Ok, in that case, can you please file an issue to either fix it or make it clear in the docs that it's not currently supported?

eutwt · 2021-08-30T19:36:17Z

I'd like to see whether this simplification can be merged before fixing the factor issue, because I think that will be easier to fix if we can use the simpler logic.

eutwt · 2021-08-31T13:23:59Z

I just looked over #250 and actually the same simplification is used there, but only when there are no groups. If there are groups #250 falls back to the current implementation. I think we can use the simpler version in both cases though.

markfairbanks · 2021-09-02T14:25:53Z

Sorry about the late response on this one, just got back from vacation last night. I'll look over this today.

markfairbanks · 2021-09-02T14:41:11Z

@eutwt This implementation looks better at first glance. I had to implement expand() differently than I did in tidytable so it would work with dtplyr's lazy evaluation, so it's good to see there's a better way to do it.

One thing I noticed - in the case where expand() is applied to a group variable, it looks like your implementation duplicates the grouping column.

lazy_dt(data.frame(a = 1, b = 2)) %>% 
  group_by(a) %>% 
  expand(a, b) %>% print %>% 
  as_tibble()
#> Source: local data table [1 x 3]
#> Groups: a
#> Call:   `_DT1`[, CJ(a = a, b = b, unique = TRUE), keyby = .(a)]
#> 
#>       a     a     b
#>   <dbl> <dbl> <dbl>
#> 1     1     1     2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
#> Error: Column name `a` must not be duplicated.
#> Use .name_repair to specify repair.

There are now two "a" columns, and using as_tibble() leads to an error in this case.

eutwt · 2021-09-02T14:56:56Z

Thanks for the review. Yeah I wasn't sure whether to deal with the case of supplying group vars to expand in this PR. That also gives an error with the current implementation, but for a different reason (variable doesn't exist). On second thought I probably should get rid of the error though. Will come back and do that when I have more time. Or feel free to modify the PR yourself, either way.

One thing to note is that you can also redefine the grouping vars in tidyr, so we can't just remove the newly supplied field if a group var is supplied. Definitely possible to deal with this, just requires some additional logic.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

data.frame(a = 1, b = 2) %>% 
  group_by(a) %>% 
  expand(a = 5, b) 
#> # A tibble: 1 × 2
#> # Groups:   a [1]
#>       a     b
#>   <dbl> <dbl>
#> 1     5     2

^{Created on 2021-09-02 by the reprex package (v2.0.1)}

markfairbanks · 2021-09-02T15:47:01Z

One thing to note is that you can also redefine the grouping vars in tidyr

Ah yep, that's a good catch.

One handy thing here about data.table is we can delete column(s) using dt[, c("col1", "col2") := NULL]. When you do this it deletes only the first occurrence of that column. It looks like your expand() implementation creates the new column as the second column, so we can use this method to delete the duplicates.

Another advantage of this method is it will delete it by reference, so no extra copy is made by using select() language.

It looks like data.table errors when trying to delete grouping columns (which makes sense), so we just have to add a little bit of extra logic to ungroup the output then regroup after the duplicates are deleted.

dots_names <- names(dots)

out <- step_subset_j(
   data,
   j = expr(CJ(!!!dots, unique = TRUE)),
   vars = c(data$groups, dots_names)
 )

# Delete duplicate columns if group vars are expanded
if (any(dots_names %in% out$groups)) {
  group_vars <- out$groups
  expanded_group_vars <- dots_names[dots_names %in% group_vars]

  out <- ungroup(out)
  out <- step_subset(out, j = expr(!!expanded_group_vars := NULL))
  out <- group_by(out, !!!syms(group_vars))
}

out

Which leads to this output

lazy_dt(data.frame(a = 1, b = 2)) %>% 
  group_by(a) %>% 
  expand2(a = 5, b)
#> Source: local data table [1 x 2]
#> Groups: a
#> Call:   `_DT1`[, CJ(a = 5, b = b, unique = TRUE), keyby = .(a)][, `:=`("a", 
#>     NULL)]
#> 
#>       a     b
#>   <dbl> <dbl>
#> 1     5     2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

eutwt · 2021-09-02T17:23:08Z

Thanks, that does seem like the right way to implement it. I went ahead and added that implementation plus a test for expanding group vars.

Co-Authored-By: Mark Fairbanks <[email protected]>

markfairbanks · 2021-09-02T18:05:20Z

@hadley Do you want to review before it's merged?

eutwt · 2021-09-04T03:40:04Z

One last commit. I had to change c(data$groups, dots_names) to union(data$groups, dots_names) so that out$vars doesn't have duplicates.

- Change `c(data$groups, dots_names)` to `union(data$groups, dots_names)` so that `out$names` doesn't have duplicates. - Rearrange args so they match order in function signature

hadley · 2021-09-07T12:24:53Z

@markfairbanks you're welcome to merge it since you reviewed it 😄

markfairbanks · 2021-09-07T12:45:58Z

Thanks @eutwt 👍

eutwt added 2 commits August 28, 2021 00:19

simplify

c5812ca

update query expressions in tests

467b292

eutwt changed the title ~~Simplify expand~~ Simplify expand Aug 28, 2021

use step_subset_j

7c5934c

eutwt force-pushed the simplify_expand branch 2 times, most recently from 94b28e9 to 536aa3c Compare August 28, 2021 21:49

hadley requested a review from markfairbanks August 30, 2021 12:58

eutwt mentioned this pull request Aug 30, 2021

Document expand() output with unrealized factor levels #292

Closed

eutwt force-pushed the simplify_expand branch 2 times, most recently from 68ff367 to 7c5934c Compare September 2, 2021 17:30

Handle expanding group vars + test

9e16137

Co-Authored-By: Mark Fairbanks <[email protected]>

markfairbanks approved these changes Sep 2, 2021

View reviewed changes

eutwt force-pushed the simplify_expand branch from 248255c to c2c8998 Compare September 4, 2021 03:37

Prevent duplicates in out$vars, c -> union

cf7870a

- Change `c(data$groups, dots_names)` to `union(data$groups, dots_names)` so that `out$names` doesn't have duplicates. - Rearrange args so they match order in function signature

eutwt force-pushed the simplify_expand branch from c2c8998 to cf7870a Compare September 4, 2021 03:41

markfairbanks merged commit 6f14068 into tidyverse:master Sep 7, 2021

Simplify expand #290

Simplify expand #290

Uh oh!

Conversation

eutwt commented Aug 28, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eutwt commented Aug 28, 2021

Uh oh!

eutwt commented Aug 28, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hadley commented Aug 30, 2021

Uh oh!

eutwt commented Aug 30, 2021

Uh oh!

hadley commented Aug 30, 2021

Uh oh!

eutwt commented Aug 30, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eutwt commented Aug 31, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

markfairbanks commented Sep 2, 2021

Uh oh!

markfairbanks commented Sep 2, 2021

Uh oh!

eutwt commented Sep 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

markfairbanks commented Sep 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eutwt commented Sep 2, 2021

Uh oh!

markfairbanks commented Sep 2, 2021

Uh oh!

eutwt commented Sep 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hadley commented Sep 7, 2021

Uh oh!

markfairbanks commented Sep 7, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Simplify `expand` #290

Simplify `expand` #290

eutwt commented Aug 28, 2021 •

edited

Loading

eutwt commented Aug 28, 2021 •

edited

Loading

eutwt commented Aug 30, 2021 •

edited

Loading

eutwt commented Aug 31, 2021 •

edited

Loading

eutwt commented Sep 2, 2021 •

edited

Loading

markfairbanks commented Sep 2, 2021 •

edited

Loading

eutwt commented Sep 4, 2021 •

edited

Loading