Skip to content

Conversation

@eutwt
Copy link
Collaborator

@eutwt eutwt commented Aug 28, 2021

It looks like expand can be simplified so that it's just a CJ in j (plus careful name construction). I changed expected expressions in the tests, but all the outputs are unchanged.

I see that you did the original implementation @markfairbanks, so I'd appreciate it if you could take a look and tell me whether there's some cases not handled correctly in the simplified version.

Example with a lot of groups:

# expand1 = dtplyr master
# expand2 = This PR

library(dplyr)
library(bench)
library(data.table)
library(dtplyr)

df <-
  data.table(grpvar = rep(1:1e4, each = 10)) %>% 
    group_by(grpvar) %>%
    mutate(numvar = sample(50, n(), T),
           charvar = sample(c(rownames(mtcars), state.name), n(), TRUE)) %>%
    as.data.table()

#### Example with group_by

out_expand1 <- df %>% group_by(grpvar) %>% expand1(numvar, charvar) %>% as.data.table()
out_expand2 <- df %>% group_by(grpvar) %>% expand2(numvar, charvar) %>% as.data.table()

identical(out_expand1, out_expand2)
# [1] TRUE

mark(
  out_expand1 = df %>% group_by(grpvar) %>% expand1(numvar, charvar) %>% as.data.table(),
  out_expand2 = df %>% group_by(grpvar) %>% expand2(numvar, charvar) %>% as.data.table(),
  check = FALSE
)
# # A tibble: 2 × 13
#   expression       min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
#   <bch:expr>  <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>
# 1 out_expand1    9.32s  9.32s     0.107    1.01GB     1.39     1    13      9.32s <NULL> <Rpro…
# 2 out_expand2    1.67s  1.67s     0.597  202.68MB     2.39     1     4      1.67s <NULL> <Rpro…
# # … with 2 more variables: time <list>, gc <list>


#### Example without group_by

out_expand1 <- df %>% expand1(grpvar, numvar, charvar) %>% as.data.table()
out_expand2 <- df %>% expand2(grpvar, numvar, charvar) %>% as.data.table()

identical(out_expand1, setkey(out_expand2, NULL))
# [1] TRUE

mark(
  out_expand1 = df %>% expand1(grpvar, numvar, charvar) %>% as.data.table(),
  out_expand2 = df %>% expand2(grpvar, numvar, charvar) %>% as.data.table(),
  check = FALSE
)
# # A tibble: 2 × 13
#   expression       min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
#   <bch:expr>  <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>
# 1 out_expand1    2.35s  2.35s     0.426    2.14GB        0     1     0      2.35s <NULL> <Rpro…
# 2 out_expand2    1.15s  1.15s     0.873   783.3MB        0     1     0      1.15s <NULL> <Rpro…
# # … with 2 more variables: time <list>, gc <list>

@eutwt eutwt changed the title Simplify expand Simplify expand Aug 28, 2021
@eutwt
Copy link
Collaborator Author

eutwt commented Aug 28, 2021

I think the section in the description below should be removed because it isn't accurate for dtplyr. But that's unrelated to this PR so I'll leave it.

#' # Note that all defined, but not necessarily present, levels of the
#' # factor variable `size` are retained.

@eutwt eutwt force-pushed the simplify_expand branch 2 times, most recently from 94b28e9 to 536aa3c Compare August 28, 2021 21:49
@eutwt
Copy link
Collaborator Author

eutwt commented Aug 28, 2021

Previously I added a commit to remove grouping variables from CJ input. But I think that's a separate issue so I've removed it from this PR. For reference, here's how that works in dtplyr master, with data.frames, and in this PR:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(dtplyr)

##### dtplyr master
lazy_dt(data.frame(a = 1, b = 2)) %>% 
  group_by(a) %>% 
  expand(a, b) 
#> Error in colnamesInt(x, names(on), check_dups = FALSE): argument specifying columns specify non existing column(s): cols[1]='a'

##### data.frame output
data.frame(a = 1, b = 2) %>% 
  group_by(a) %>% 
  expand(a, b) 
#> # A tibble: 1 × 2
#> # Groups:   a [1]
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2

##### This PR
lazy_dt(data.frame(a = 1, b = 2)) %>% 
  group_by(a) %>% 
  expand(a, b) %>% print %>% 
  as_tibble()
#> Source: local data table [1 x 3]
#> Groups: a
#> Call:   `_DT1`[, CJ(a = a, b = b, unique = TRUE), keyby = .(a)]
#> 
#>       a     a     b
#>   <dbl> <dbl> <dbl>
#> 1     1     1     2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
#> Error: Column name `a` must not be duplicated.
#> Use .name_repair to specify repair.

@hadley
Copy link
Member

hadley commented Aug 30, 2021

Did the previous implementation expand factor levels? Or has the documentation always been wrong?

@hadley hadley requested a review from markfairbanks August 30, 2021 12:58
@eutwt
Copy link
Collaborator Author

eutwt commented Aug 30, 2021

The previous implementation doesn't expand factor levels. Example below

library(tidyr)
library(dtplyr)
df <- lazy_dt(data.frame(a = factor('a', levels = c('a', 'b'))))

df %>% 
  expand(a) %>% 
  as_tibble()
#> # A tibble: 1 × 1
#>   a    
#>   <fct>
#> 1 a

Created on 2021-08-30 by the reprex package (v2.0.1)

@hadley
Copy link
Member

hadley commented Aug 30, 2021

Ok, in that case, can you please file an issue to either fix it or make it clear in the docs that it's not currently supported?

@eutwt
Copy link
Collaborator Author

eutwt commented Aug 30, 2021

I'd like to see whether this simplification can be merged before fixing the factor issue, because I think that will be easier to fix if we can use the simpler logic.

@eutwt
Copy link
Collaborator Author

eutwt commented Aug 31, 2021

I just looked over #250 and actually the same simplification is used there, but only when there are no groups. If there are groups #250 falls back to the current implementation. I think we can use the simpler version in both cases though.

@markfairbanks
Copy link
Collaborator

Sorry about the late response on this one, just got back from vacation last night. I'll look over this today.

@markfairbanks
Copy link
Collaborator

@eutwt This implementation looks better at first glance. I had to implement expand() differently than I did in tidytable so it would work with dtplyr's lazy evaluation, so it's good to see there's a better way to do it.

One thing I noticed - in the case where expand() is applied to a group variable, it looks like your implementation duplicates the grouping column.

lazy_dt(data.frame(a = 1, b = 2)) %>% 
  group_by(a) %>% 
  expand(a, b) %>% print %>% 
  as_tibble()
#> Source: local data table [1 x 3]
#> Groups: a
#> Call:   `_DT1`[, CJ(a = a, b = b, unique = TRUE), keyby = .(a)]
#> 
#>       a     a     b
#>   <dbl> <dbl> <dbl>
#> 1     1     1     2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
#> Error: Column name `a` must not be duplicated.
#> Use .name_repair to specify repair.

There are now two "a" columns, and using as_tibble() leads to an error in this case.

@eutwt
Copy link
Collaborator Author

eutwt commented Sep 2, 2021

Thanks for the review. Yeah I wasn't sure whether to deal with the case of supplying group vars to expand in this PR. That also gives an error with the current implementation, but for a different reason (variable doesn't exist). On second thought I probably should get rid of the error though. Will come back and do that when I have more time. Or feel free to modify the PR yourself, either way.

One thing to note is that you can also redefine the grouping vars in tidyr, so we can't just remove the newly supplied field if a group var is supplied. Definitely possible to deal with this, just requires some additional logic.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

data.frame(a = 1, b = 2) %>% 
  group_by(a) %>% 
  expand(a = 5, b) 
#> # A tibble: 1 × 2
#> # Groups:   a [1]
#>       a     b
#>   <dbl> <dbl>
#> 1     5     2

Created on 2021-09-02 by the reprex package (v2.0.1)

@markfairbanks
Copy link
Collaborator

markfairbanks commented Sep 2, 2021

One thing to note is that you can also redefine the grouping vars in tidyr

Ah yep, that's a good catch.

One handy thing here about data.table is we can delete column(s) using dt[, c("col1", "col2") := NULL]. When you do this it deletes only the first occurrence of that column. It looks like your expand() implementation creates the new column as the second column, so we can use this method to delete the duplicates.

Another advantage of this method is it will delete it by reference, so no extra copy is made by using select() language.

It looks like data.table errors when trying to delete grouping columns (which makes sense), so we just have to add a little bit of extra logic to ungroup the output then regroup after the duplicates are deleted.

dots_names <- names(dots)

out <- step_subset_j(
   data,
   j = expr(CJ(!!!dots, unique = TRUE)),
   vars = c(data$groups, dots_names)
 )

# Delete duplicate columns if group vars are expanded
if (any(dots_names %in% out$groups)) {
  group_vars <- out$groups
  expanded_group_vars <- dots_names[dots_names %in% group_vars]

  out <- ungroup(out)
  out <- step_subset(out, j = expr(!!expanded_group_vars := NULL))
  out <- group_by(out, !!!syms(group_vars))
}

out

Which leads to this output

lazy_dt(data.frame(a = 1, b = 2)) %>% 
  group_by(a) %>% 
  expand2(a = 5, b)
#> Source: local data table [1 x 2]
#> Groups: a
#> Call:   `_DT1`[, CJ(a = 5, b = b, unique = TRUE), keyby = .(a)][, `:=`("a", 
#>     NULL)]
#> 
#>       a     b
#>   <dbl> <dbl>
#> 1     5     2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

@eutwt
Copy link
Collaborator Author

eutwt commented Sep 2, 2021

Thanks, that does seem like the right way to implement it. I went ahead and added that implementation plus a test for expanding group vars.

@eutwt eutwt force-pushed the simplify_expand branch 2 times, most recently from 68ff367 to 7c5934c Compare September 2, 2021 17:30
@markfairbanks
Copy link
Collaborator

@hadley Do you want to review before it's merged?

@eutwt
Copy link
Collaborator Author

eutwt commented Sep 4, 2021

One last commit. I had to change c(data$groups, dots_names) to union(data$groups, dots_names) so that out$vars doesn't have duplicates.

- Change `c(data$groups, dots_names)` to `union(data$groups, dots_names)` so
  that `out$names` doesn't have duplicates.

- Rearrange args so they match order in function signature
@hadley
Copy link
Member

hadley commented Sep 7, 2021

@markfairbanks you're welcome to merge it since you reviewed it 😄

@markfairbanks markfairbanks merged commit 6f14068 into tidyverse:master Sep 7, 2021
@markfairbanks
Copy link
Collaborator

Thanks @eutwt 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants