Simplify expand
#290
|
I think the section in the description below should be removed because it isn't accurate for dtplyr. But that's unrelated to this PR so I'll leave it.
#' # Note that all defined, but not necessarily present, levels of the
#' # factor variable `size` are retained. |
Force-pushed from 94b28e9 to 536aa3c
|
Previously I added a commit to remove grouping variables from
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(dtplyr)
##### dtplyr master
lazy_dt(data.frame(a = 1, b = 2)) %>%
group_by(a) %>%
expand(a, b)
#> Error in colnamesInt(x, names(on), check_dups = FALSE): argument specifying columns specify non existing column(s): cols[1]='a'
##### data.frame output
data.frame(a = 1, b = 2) %>%
group_by(a) %>%
expand(a, b)
#> # A tibble: 1 × 2
#> # Groups: a [1]
#> a b
#> <dbl> <dbl>
#> 1 1 2
##### This PR
lazy_dt(data.frame(a = 1, b = 2)) %>%
group_by(a) %>%
expand(a, b) %>% print %>%
as_tibble()
#> Source: local data table [1 x 3]
#> Groups: a
#> Call: `_DT1`[, CJ(a = a, b = b, unique = TRUE), keyby = .(a)]
#>
#> a a b
#> <dbl> <dbl> <dbl>
#> 1 1 1 2
#>
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
#> Error: Column name `a` must not be duplicated.
#> Use .name_repair to specify repair. |
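The duplicate-name error at the end of the "This PR" output can be reproduced without dtplyr. A minimal sketch, assuming only that tibble is installed; the data frame `dup` is a made-up stand-in for a collected result that carries two `a` columns:

```r
library(tibble)

# Hypothetical stand-in for the collected result: two columns named "a".
dup <- data.frame(a = 1, a = 1, b = 2, check.names = FALSE)

# as_tibble() requires unique names by default, so this call errors.
res <- tryCatch(as_tibble(dup), error = function(e) "error")

# Requesting name repair avoids the error, but it renames both `a` columns
# instead of dropping one of them.
repaired <- as_tibble(dup, .name_repair = "unique")
names(repaired)
```

Name repair only works around the symptom, which suggests the extra column is better handled in the dtplyr step itself.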
|
Did the previous implementation expand factor levels? Or has the documentation always been wrong? |
|
The previous implementation doesn't expand factor levels. Example below
library(tidyr)
library(dtplyr)
df <- lazy_dt(data.frame(a = factor('a', levels = c('a', 'b'))))
df %>%
expand(a) %>%
as_tibble()
#> # A tibble: 1 × 1
#> a
#> <fct>
#> 1 a
Created on 2021-08-30 by the reprex package (v2.0.1) |
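For comparison, here is a sketch of the plain data frame behaviour that the documentation note describes, assuming only that tidyr is installed. The data frame mirrors the one in the reprex above but is not wrapped in `lazy_dt()`:

```r
library(tidyr)

# `a` has two defined levels, but only "a" is present in the data.
df <- data.frame(a = factor("a", levels = c("a", "b")))

# On a data frame, expand() uses every defined level of a factor,
# so the result has one row per level.
res <- expand(df, a)
nrow(res)
as.character(res$a)
```

The dtplyr reprex returns a single row for the same input, which is the gap between the docs and the implementation discussed here.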
|
Ok, in that case, can you please file an issue to either fix it or make it clear in the docs that it's not currently supported? |
|
I'd like to see whether this simplification can be merged before fixing the factor issue, because I think that will be easier to fix if we can use the simpler logic. |
|
Sorry about the late response on this one, just got back from vacation last night. I'll look over this today. |
|
@eutwt This implementation looks better at first glance. I had to implement
One thing I noticed - in the case where
lazy_dt(data.frame(a = 1, b = 2)) %>%
group_by(a) %>%
expand(a, b) %>% print %>%
as_tibble()
#> Source: local data table [1 x 3]
#> Groups: a
#> Call: `_DT1`[, CJ(a = a, b = b, unique = TRUE), keyby = .(a)]
#>
#> a a b
#> <dbl> <dbl> <dbl>
#> 1 1 1 2
#>
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
#> Error: Column name `a` must not be duplicated.
#> Use .name_repair to specify repair.
There are now two "a" columns, and using |
|
Thanks for the review. Yeah I wasn't sure whether to deal with the case of supplying group vars to expand in this PR. That also gives an error with the current implementation, but for a different reason (variable doesn't exist). On second thought I probably should get rid of the error though. Will come back and do that when I have more time. Or feel free to modify the PR yourself, either way.
One thing to note is that you can also redefine the grouping vars in tidyr, so we can't just remove the newly supplied field if a group var is supplied. Definitely possible to deal with this, just requires some additional logic.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
data.frame(a = 1, b = 2) %>%
group_by(a) %>%
expand(a = 5, b)
#> # A tibble: 1 × 2
#> # Groups: a [1]
#> a b
#> <dbl> <dbl>
#> 1 5 2
Created on 2021-09-02 by the reprex package (v2.0.1) |
Ah yep, that's a good catch. One handy thing here about data.table is we can delete column(s) using
Another advantage of this method is it will delete it by reference, so no extra copy is made by using
It looks like data.table errors when trying to delete grouping columns (which makes sense), so we just have to add a little bit of extra logic to ungroup the output then regroup after the duplicates are deleted.
dots_names <- names(dots)
out <- step_subset_j(
  data,
  j = expr(CJ(!!!dots, unique = TRUE)),
  vars = c(data$groups, dots_names)
)
# Delete duplicate columns if group vars are expanded
if (any(dots_names %in% out$groups)) {
  group_vars <- out$groups
  expanded_group_vars <- dots_names[dots_names %in% group_vars]
  out <- ungroup(out)
  out <- step_subset(out, j = expr(!!expanded_group_vars := NULL))
  out <- group_by(out, !!!syms(group_vars))
}
out
Which leads to this output
lazy_dt(data.frame(a = 1, b = 2)) %>%
group_by(a) %>%
expand2(a = 5, b)
#> Source: local data table [1 x 2]
#> Groups: a
#> Call: `_DT1`[, CJ(a = 5, b = b, unique = TRUE), keyby = .(a)][, `:=`("a",
#> NULL)]
#>
#> a b
#> <dbl> <dbl>
#> 1 5 2
#>
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results |
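The delete-by-reference point made in this comment can be seen in plain data.table. This is only a sketch, assuming data.table is installed; the table `dt` and its column `tmp` are made up for illustration and are not part of the PR:

```r
library(data.table)

dt <- data.table(a = 1, b = 2, tmp = 3)
addr_before <- address(dt)

# `:= NULL` removes the column in place; nothing is reassigned.
dt[, tmp := NULL]

names(dt)
# The table still lives at the same address, so no copy was made.
identical(address(dt), addr_before)
```

This is why the generated call in the output above ends with a `:=` step rather than a column selection.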
|
Thanks, that does seem like the right way to implement it. I went ahead and added that implementation plus a test for expanding group vars. |
Force-pushed from 68ff367 to 7c5934c
Co-Authored-By: Mark Fairbanks <[email protected]>
|
@hadley Do you want to review before it's merged? |
Force-pushed from 248255c to c2c8998
|
One last commit. I had to change |
- Change `c(data$groups, dots_names)` to `union(data$groups, dots_names)` so that `out$names` doesn't have duplicates.
- Rearrange args so they match order in function signature
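The difference between the two calls in that commit message is easy to check in base R. A small sketch, where `groups` and `dots_names` are illustrative values standing in for `data$groups` and the names of the expanded columns:

```r
# Illustrative values: one grouping variable that is also being expanded.
groups <- "a"
dots_names <- c("a", "b")

# c() simply concatenates, so the shared name appears twice.
with_c <- c(groups, dots_names)

# union() keeps the first occurrence of each name and drops repeats.
with_union <- union(groups, dots_names)

with_c
with_union
```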
Force-pushed from c2c8998 to cf7870a
|
@markfairbanks you're welcome to merge it since you reviewed it 😄 |
|
Thanks @eutwt 👍 |
It looks like `expand` can be simplified so that it's just a `CJ` in `j` (plus careful name construction). I changed expected expressions in the tests, but all the outputs are unchanged.
I see that you did the original implementation @markfairbanks, so I'd appreciate it if you could take a look and tell me whether there are any cases not handled correctly in the simplified version.
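The idea in this description, a `CJ` call evaluated in `j`, can be sketched directly in data.table. This is not the dtplyr implementation, only the kind of call it would translate to, and the table `dt` is made up for illustration:

```r
library(data.table)

dt <- data.table(a = c(1, 1, 2), b = c("x", "y", "y"))

# CJ() builds the cross join of its arguments. Inside `[`, the column
# names in `j` refer to the table's columns, and unique = TRUE drops
# repeated values before crossing.
out <- dt[, CJ(a = a, b = b, unique = TRUE)]

nrow(out)
out$a
out$b
```

The result includes the combination of `a = 2` and `b = "x"`, which does not occur in the original rows.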
Example with a lot of groups: