Clarification on use of filter_records vs where when calculating proportions #2

@kburchfiel

Thank you as always for your hard work on this library! It's making it much easier to work with complex survey stats directly within Python.

I am using svy to calculate proportions and their confidence intervals for American Community Survey data from IPUMS at the PUMA level. Here is some relevant setup code (which is based on this documentation page):

rep_weights_acs = svy.RepWeights(method="sdr", prefix="repwtp", n_reps=80)
design_acs = svy.Design(wgt="perwt", rep_wgts=rep_weights_acs)
sample_acs_unfiltered = svy.Sample(data=df_acs_for_cis, design=design_acs)

dv = 'income_at_least_60k'
iv = 'unique_puma_code'
method = 'replication'

The documentation indicates that, when calculating proportions for a subpopulation (in my case, unmarried women aged 25 to 34), I should use the where argument within estimation.prop() to apply this filter:

df_props = (
    sample_acs_unfiltered.estimation.prop(
        y=dv, method=method, by=iv, drop_nulls=True,
        where={"married": 0, "sex_and_age_cat": ["Female_25_to_34"]},
    )
    .to_polars()
    .to_pandas()
    .query(f"({dv} == '1') & (unique_puma_code in ['101301', '101403', '101801'])")
)

Here is the output of this code:

    unique_puma_code  income_at_least_60k  est       se        lci       uci
46  101301            1                    0.132114  0.006378  0.119928  0.145333
55  101403            1                    0.156039  0.004503  0.147285  0.165213
73  101801            1                    0.107120  0.006737  0.094430  0.121287

However, I'm finding that this produces standard errors that are much smaller than those I'm obtaining within R. (They actually matched the standard errors that I calculated for an unfiltered copy of the dataset.)
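
As a point of reference, in a plain replicate-weight estimator a subpopulation proportion only ever uses the rows inside the subpopulation, so restricting with a mask and dropping rows beforehand should give the same standard error, and that standard error should normally be larger than the full-sample one. The sketch below illustrates this with synthetic data, assuming the usual ACS successive-difference-replication formula, variance = (4 / R) * sum((theta_r - theta)^2). The function name `sdr_prop_se` and all of the data are hypothetical, and this is not svy's implementation.

```python
import numpy as np


def sdr_prop_se(y, weights, rep_weights, mask=None):
    """Weighted proportion of y == 1 and its SDR standard error.

    mask marks the subpopulation rows; None means every row is used.
    """
    if mask is None:
        mask = np.ones(len(y), dtype=bool)
    # Rows outside the subpopulation get zero weight in every replicate.
    w = np.where(mask, weights, 0.0)
    rw = np.where(mask[:, None], rep_weights, 0.0)
    est = np.sum(w * y) / np.sum(w)
    rep_est = (rw * y[:, None]).sum(axis=0) / rw.sum(axis=0)
    n_reps = rep_weights.shape[1]
    # SDR variance: 4 / R times the squared deviations of the replicate estimates.
    var = 4.0 / n_reps * np.sum((rep_est - est) ** 2)
    return est, np.sqrt(var)


# Synthetic data only: a 0/1 outcome, a roughly 10% subpopulation,
# and 80 replicate weights built as random perturbations of the main weight.
rng = np.random.default_rng(0)
n, n_reps = 2000, 80
y = rng.integers(0, 2, size=n).astype(float)
in_subpop = rng.random(n) < 0.1
weights = rng.uniform(50, 150, size=n)
rep_weights = weights[:, None] * rng.uniform(0.5, 1.5, size=(n, n_reps))

# Subpopulation via a mask, subpopulation via dropping rows, and the full sample.
est_mask, se_mask = sdr_prop_se(y, weights, rep_weights, mask=in_subpop)
est_filt, se_filt = sdr_prop_se(
    y[in_subpop], weights[in_subpop], rep_weights[in_subpop]
)
est_all, se_all = sdr_prop_se(y, weights, rep_weights)

print(f"mask:     est={est_mask:.4f}, se={se_mask:.4f}")
print(f"filtered: est={est_filt:.4f}, se={se_filt:.4f}")
print(f"full:     est={est_all:.4f}, se={se_all:.4f}")
```

In this toy setup the masked and filtered results agree, which is why a where-based standard error that matches the unfiltered dataset looks suspicious.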

Here's some relevant R code for reference. (Much of the code derives from Exploring Complex Survey Data Analysis Using R.)

df_des <- df %>%
  as_survey_rep(
    weight = perwt,
    repweights = matches("repwtp[0-9]+"),
    type = "ACS",
    mse = TRUE
  )

df_des_filtered <- df_des %>%
  filter(
    (unique_puma_code %in% c("0101301", "0101403", "0101801")) &
      (sex_and_age_cat == "Female_25_to_34") & (married == 0)
  )

df_props <- df_des_filtered %>%
  group_by(unique_puma_code, income_at_least_60k) %>%
  summarize(p = survey_prop(vartype = c("se", "ci"))) %>%
  filter(income_at_least_60k == 1)

df_props

Output of this R code:

unique_puma_code  income_at_least_60k  p          p_se        p_low       p_upp
101301            1                    0.1321138  0.03715514  0.07340712  0.226304
101403            1                    0.1560392  0.02282513  0.11546344  0.2075286
101801            1                    0.1071201  0.03741219  0.05037602  0.2134165

Meanwhile, though, when I use svy's wrangling.filter_records() feature to filter the dataset before calling estimation.prop(), I obtain confidence intervals that are very close to, though not identical to, R's. (I imagine that the differences are due to variations in the design object/sample setups or the confidence-interval calculation method.)

svy_sample = sample_acs_unfiltered.wrangling.filter_records(
    {"married": 0, "sex_and_age_cat": ["Female_25_to_34"]}
)

df_props = (
    svy_sample.estimation.prop(y=dv, method=method, by=iv, drop_nulls=True)
    .to_polars()
    .to_pandas()
    .query(f"({dv} == '1') & (unique_puma_code in ['101301', '101403', '101801'])")
)

Output of this svy code:

    unique_puma_code  income_at_least_60k  est       se        lci       uci
31  101301            1                    0.132114  0.037010  0.074133  0.224449
37  101403            1                    0.156039  0.022790  0.115837  0.206928
49  101801            1                    0.107120  0.037339  0.052275  0.206942
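
To put numbers on the gap, here is a small sketch that compares the standard errors from the three output tables above. The values are copied by hand from those tables; the variable names are only for illustration.

```python
# Standard errors copied from the three output tables above.
pumas = ["101301", "101403", "101801"]
se_where = [0.006378, 0.004503, 0.006737]      # svy, where argument
se_filter = [0.037010, 0.022790, 0.037339]     # svy, filter_records first
se_r = [0.03715514, 0.02282513, 0.03741219]    # R (srvyr)

# Ratio of the R standard error to each svy standard error, per PUMA.
ratios_where = [r / w for r, w in zip(se_r, se_where)]
ratios_filter = [r / f for r, f in zip(se_r, se_filter)]

for puma, rw, rf in zip(pumas, ratios_where, ratios_filter):
    print(f"{puma}: R / where = {rw:.2f}, R / filter_records = {rf:.4f}")
```

The R standard errors are roughly five to six times the where-based ones, but within about half a percent of the filter_records ones.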

Given these comparisons: to achieve accurate confidence intervals for a subpopulation, should I be using the filter_records setup or the where setup? The former seems to be the better approach, since its output comes much closer to the R-based output. However, if I'm applying the R code incorrectly, and the where-based confidence intervals are actually more accurate than the R-based ones as well, do let me know.
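
On the remaining small difference between the filter_records and R intervals: one way to probe the guess about the confidence-interval method is to back out the critical value each interval implies. The sketch below assumes that both tools build a logit-transformed interval, logit(p) ± m * se / (p * (1 - p)). That is an assumption made for this check and is not confirmed for either library. The helper names are hypothetical, and the inputs are the PUMA 101301 rows copied from the outputs above.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def implied_multipliers(est, se, lci, uci):
    """Critical values implied by the lower and upper bounds,
    assuming the interval is symmetric on the logit scale."""
    se_logit = se / (est * (1.0 - est))  # delta-method SE on the logit scale
    lower = (logit(est) - logit(lci)) / se_logit
    upper = (logit(uci) - logit(est)) / se_logit
    return lower, upper


# (est, se, lci, uci) for PUMA 101301, copied from the outputs above.
svy_row = (0.132114, 0.037010, 0.074133, 0.224449)   # svy, filter_records
r_row = (0.1321138, 0.03715514, 0.07340712, 0.226304)  # R

svy_lower, svy_upper = implied_multipliers(*svy_row)
r_lower, r_upper = implied_multipliers(*r_row)

print(f"svy: lower={svy_lower:.3f}, upper={svy_upper:.3f}")
print(f"R:   lower={r_lower:.3f}, upper={r_upper:.3f}")
```

If the lower and upper multipliers agree within each tool, the logit assumption is at least consistent with the output. On these rows the implied multiplier comes out at roughly 1.99 for svy and roughly 2.02 for R, which would fit the two tools using different degrees of freedom for the critical value, though that is only a guess.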
