Clarification on use of filter_records vs where when calculating proportions

Thank you as always for your hard work on this library! It's making it much easier to work with complex survey stats directly within Python.

I am using svy to calculate confidence intervals and proportions for American Community Survey data from IPUMS at the PUMA level. Here is some relevant setup code (which is based on [this documentation page](https://svylab.com/learn/notes/posts/svy-vs-r-comparison/#successive-difference-replication-sdr)):

```
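import svy

# df_acs_for_cis: the ACS data from IPUMS, loaded earlier (not shown)
# 80 SDR replicate weights in columns prefixed with "repwtp"
rep_weights_acs = svy.RepWeights(method="sdr", prefix="repwtp", n_reps=80)
design_acs = svy.Design(wgt="perwt", rep_wgts=rep_weights_acs)
sample_acs_unfiltered = svy.Sample(data=df_acs_for_cis, design=design_acs)

dv = "income_at_least_60k"  # outcome variable
iv = "unique_puma_code"     # grouping variable
method = "replication"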
```

[The documentation](https://svylab.com/docs/svy/tutorials/estimation.html) indicates that, when calculating proportions for a subpopulation (in my case, unmarried women aged 25 to 34), I should use the `where` argument within `estimation.prop()` to apply this filter:

```
df_props = sample_acs_unfiltered.estimation.prop(
    y=dv, method=method, by=iv, drop_nulls=True,
    where={"married": 0, "sex_and_age_cat": ["Female_25_to_34"]},
).to_polars().to_pandas().query(
    f"({dv} == '1') & (unique_puma_code in ['101301', '101403', '101801'])"
)
```
Here is the output of this code:

```
	unique_puma_code	income_at_least_60k	est	se	lci	uci
46	101301	1	0.132114	0.006378	0.119928	0.145333
55	101403	1	0.156039	0.004503	0.147285	0.165213
73	101801	1	0.107120	0.006737	0.094430	0.121287
```

However, I'm finding that this produces standard errors that are roughly five to six times smaller than those I'm obtaining within R. (They actually match the standard errors that I calculated for an unfiltered copy of the dataset.)

Here's some relevant R code for reference. (Much of it derives from [Exploring Complex Survey Data Analysis Using R](https://tidy-survey-r.github.io/tidy-survey-book/c05-descriptive-analysis.html).)

```
library(dplyr)
library(srvyr)

df_des <- df %>%
  as_survey_rep(
    weight = perwt,
    repweights = matches("repwtp[0-9]+"),
    type = "ACS", mse = TRUE
  )

df_des_filtered <- df_des %>%
  filter(
    (unique_puma_code %in% c("0101301", "0101403", "0101801")) &
      (sex_and_age_cat == "Female_25_to_34") & (married == 0)
  )

df_props <- df_des_filtered %>%
  group_by(unique_puma_code, income_at_least_60k) %>%
  summarize(p = survey_prop(vartype = c("se", "ci"))) %>%
  filter(income_at_least_60k == 1)

df_props
```
Output of this R code:
```
    unique_puma_code	income_at_least_60k	p	p_se	p_low	p_upp
    101301	1	0.1321138	0.03715514	0.07340712	0.226304
    101403	1	0.1560392	0.02282513	0.11546344	0.2075286
    101801	1	0.1071201	0.03741219	0.05037602	0.2134165
```

Meanwhile, when I use svy's `wrangling.filter_records()` feature to filter the dataset before calling `estimation.prop()`, I obtain confidence intervals that are very close, though not identical, to R's. (I imagine that the differences are due to variations in the design object/sample setups or the confidence-interval calculation method.)

```
svy_sample = sample_acs_unfiltered.wrangling.filter_records(
    {"married": 0, "sex_and_age_cat": ["Female_25_to_34"]}
)

df_props = svy_sample.estimation.prop(
    y=dv, method=method, by=iv, drop_nulls=True
).to_polars().to_pandas().query(
    f"({dv} == '1') & (unique_puma_code in ['101301', '101403', '101801'])"
)
```

Output of this `svy` code:

```
	unique_puma_code	income_at_least_60k	est	se	lci	uci
31	101301	1	0.132114	0.037010	0.074133	0.224449	
37	101403	1	0.156039	0.022790	0.115837	0.206928	
49	101801	1	0.107120	0.037339	0.052275	0.206942	
```
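
On the small remaining gap between these intervals and R's: as a rough check (my own guess at the method, not something I have confirmed in svy's or srvyr's source), a logit-transformed interval seems to reproduce the bounds from `est` and `se`. The sketch below uses a hypothetical helper, `logit_ci`, and takes the critical value as an input. The value 1.9905 is approximately the 97.5% t quantile for 79 degrees of freedom (80 replicates minus 1), which is an assumption on my part.

```python
from math import exp, log


def logit_ci(est, se, crit):
    """Confidence interval for a proportion, built on the logit scale.

    Hypothetical helper for illustration; not part of svy or srvyr.
    """
    logit = log(est / (1.0 - est))
    # Delta-method standard error of logit(est)
    se_logit = se / (est * (1.0 - est))
    lower = logit - crit * se_logit
    upper = logit + crit * se_logit
    # Back-transform both bounds to the probability scale
    return 1.0 / (1.0 + exp(-lower)), 1.0 / (1.0 + exp(-upper))


# PUMA 101301 from the filter_records output above.
# crit=1.9905 approximates the t quantile for 79 df (assumed, not confirmed).
lci, uci = logit_ci(est=0.132114, se=0.037010, crit=1.9905)
print(lci, uci)
```

With these inputs I get bounds that agree with the svy `lci`/`uci` columns to about four decimal places. Feeding in R's `p` and `p_se` for the same PUMA with a critical value near 2.015 gives R's bounds, so part of the gap may come from the two tools using different degrees of freedom, though I have not verified that.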

Given these comparisons: to achieve accurate confidence intervals for a subpopulation, should I be using the `filter_records` setup or the `where` setup? The former seems to be the better approach, since its output comes much closer to the R-based output. However, if I'm applying the R code incorrectly and the `where`-based confidence intervals are actually the more accurate ones, do let me know.
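
In case it helps with diagnosing this, here is a minimal sketch of how I understand the SDR standard error for a subpopulation proportion can be checked by hand: restrict the rows to the subpopulation, recompute the weighted proportion under the full weight and under each replicate weight, and apply the 4/R factor that the Census Bureau documents for ACS replicate weights (R = 80 for the real data). The function names and toy numbers are made up for illustration, and the toy example uses only 4 replicates. This is not svy's implementation.

```python
from math import sqrt


def weighted_prop(y, weights):
    """Weighted share of rows where y == 1."""
    total = sum(weights)
    return sum(w for yi, w in zip(y, weights) if yi == 1) / total


def sdr_prop_se(y, full_weights, rep_weights):
    """Point estimate and SDR standard error of a proportion.

    y, full_weights: rows already restricted to the subpopulation.
    rep_weights: list of replicate-weight columns, each aligned with y.
    """
    est = weighted_prop(y, full_weights)
    rep_ests = [weighted_prop(y, rw) for rw in rep_weights]
    # SDR variance: 4/R times the squared deviations from the full-sample estimate
    variance = 4.0 / len(rep_weights) * sum((r - est) ** 2 for r in rep_ests)
    return est, sqrt(variance)


# Toy data (hypothetical): 4 respondents, 4 replicate-weight columns
y = [1, 0, 1, 0]
perwt = [10, 20, 30, 40]
repwts = [
    [15, 20, 30, 40],
    [10, 25, 30, 40],
    [10, 20, 35, 40],
    [10, 20, 30, 45],
]
est, se = sdr_prop_se(y, perwt, repwts)
print(est, se)
```

Centering the squared deviations on the full-sample estimate corresponds to `mse = TRUE` in the R code above.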
