Thank you as always for your hard work on this library! It's making it much easier to work with complex survey stats directly within Python.
I am using svy to calculate confidence intervals and proportions for American Community Survey data from IPUMS at the PUMA level. Here is some relevant setup code (which is based on this documentation page):
rep_weights_acs = svy.RepWeights(method="sdr", prefix="repwtp", n_reps=80)
design_acs = svy.Design(wgt="perwt", rep_wgts=rep_weights_acs)
sample_acs_unfiltered = svy.Sample(data=df_acs_for_cis, design=design_acs)
dv = 'income_at_least_60k'
iv = 'unique_puma_code'
method = 'replication'
The documentation indicates that, when calculating proportions for a subpopulation (in my case, unmarried women aged 25 to 34), I should use the where argument within estimation.prop() to apply this filter:
df_props = (
    sample_acs_unfiltered.estimation.prop(
        y=dv, method=method, by=iv, drop_nulls=True,
        where={"married": 0, "sex_and_age_cat": ["Female_25_to_34"]},
    )
    .to_polars()
    .to_pandas()
    .query(f"({dv} == '1') & (unique_puma_code in ['101301', '101403', '101801'])")
)
Here is the output of this code:
unique_puma_code income_at_least_60k est se lci uci
46 101301 1 0.132114 0.006378 0.119928 0.145333
55 101403 1 0.156039 0.004503 0.147285 0.165213
73 101801 1 0.107120 0.006737 0.094430 0.121287
However, I'm finding that this produces standard errors that are much smaller than those I'm obtaining within R. (They actually matched the standard errors that I calculated for an unfiltered copy of the dataset.)
Here's some relevant R code for reference: (Much of the code derives from Exploring Complex Survey Data Analysis Using R.)
df_des <- df %>% as_survey_rep(
weight = perwt,
repweights = matches("repwtp[0-9]+"), type = "ACS", mse = TRUE)
df_des_filtered <- df_des %>% filter(
(unique_puma_code %in% c("0101301", "0101403", "0101801")) &
(sex_and_age_cat == "Female_25_to_34") & (married == 0))
df_props <- df_des_filtered %>% group_by(
unique_puma_code, income_at_least_60k) %>% summarize(p = survey_prop(
vartype=c("se", "ci"))) %>% filter(income_at_least_60k == 1)
df_props
Output of this R code:
unique_puma_code income_at_least_60k p p_se p_low p_upp
101301 1 0.1321138 0.03715514 0.07340712 0.226304
101403 1 0.1560392 0.02282513 0.11546344 0.2075286
101801 1 0.1071201 0.03741219 0.05037602 0.2134165
Meanwhile, though, when I use svy's wrangling.filter_records() feature to filter the dataset before calling estimation.prop(), I obtain confidence intervals that are very close, though not identical, to R's. (I imagine that the differences are due to variations in the design object/sample setups or the confidence-interval calculation method.)
svy_sample = sample_acs_unfiltered.wrangling.filter_records(
    {"married": 0, "sex_and_age_cat": ["Female_25_to_34"]}
)
df_props = (
    svy_sample.estimation.prop(y=dv, method=method, by=iv, drop_nulls=True)
    .to_polars()
    .to_pandas()
    .query(f"({dv} == '1') & (unique_puma_code in ['101301', '101403', '101801'])")
)
Output of this svy code:
unique_puma_code income_at_least_60k est se lci uci
31 101301 1 0.132114 0.037010 0.074133 0.224449
37 101403 1 0.156039 0.022790 0.115837 0.206928
49 101801 1 0.107120 0.037339 0.052275 0.206942
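To probe the confidence-interval part of the difference, here is a minimal sketch of a logit-transformed interval. This is only my guess at the kind of calculation involved, not something I have confirmed in either library's source. The helper name logit_ci and the critical value (a t quantile for 79 degrees of freedom, i.e. 80 replicates minus 1) are my own assumptions. The inputs are the est and se from the 101301 row above.

```python
import math


def logit_ci(p, se, crit):
    """Confidence interval for a proportion, built on the logit scale.

    p    : estimated proportion (strictly between 0 and 1)
    se   : standard error of p on the probability scale
    crit : critical value (z or t quantile) chosen by the caller
    """
    logit = math.log(p / (1.0 - p))
    # Delta method: standard error of logit(p)
    se_logit = se / (p * (1.0 - p))
    lower = logit - crit * se_logit
    upper = logit + crit * se_logit

    def inv_logit(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Back-transform the endpoints to the probability scale
    return inv_logit(lower), inv_logit(upper)


# Assumption: two-sided 95% t critical value with 79 degrees of freedom
T_CRIT_79_DF = 1.99045

lci, uci = logit_ci(0.132114, 0.037010, T_CRIT_79_DF)
print(round(lci, 4), round(uci, 4))
```

If an interval of this form lands close to the lci and uci columns above, then the remaining gap with R may come down to the standard error itself plus a different critical value, though I have not verified which degrees of freedom R uses here.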
Given these comparisons: to obtain accurate confidence intervals for a subpopulation, should I be using the filter_records() setup or the where setup? The former seems to be the better approach, since its output comes much closer to the R-based output. However, if I'm applying the R code incorrectly and the where-based confidence intervals are actually the more accurate ones, do let me know.
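In case it helps with diagnosing this, here is a minimal hand-rolled sketch of how I understand the SDR (successive difference replication) standard error for a subpopulation proportion: recompute the weighted proportion with each replicate weight, restricted to the subpopulation, and take 4/R times the sum of squared deviations from the full-sample estimate. The function names and the toy data are hypothetical and purely illustrative. This does not use svy and is not meant to describe its internals.

```python
import math


def weighted_prop(y, w, in_domain):
    """Weighted proportion of y == 1 among records flagged in_domain."""
    num = sum(wi * yi for yi, wi, d in zip(y, w, in_domain) if d)
    den = sum(wi for wi, d in zip(w, in_domain) if d)
    return num / den


def sdr_prop_se(y, full_wt, rep_wts, in_domain):
    """Return (estimate, SDR standard error) for a domain proportion.

    rep_wts is a list of replicate-weight columns, each aligned with y.
    Variance = (4 / R) * sum over replicates of (theta_r - theta)^2.
    """
    theta = weighted_prop(y, full_wt, in_domain)
    reps = [weighted_prop(y, rw, in_domain) for rw in rep_wts]
    var = 4.0 / len(rep_wts) * sum((t - theta) ** 2 for t in reps)
    return theta, math.sqrt(var)


# Toy data (hypothetical): 4 records, only the first 2 are in the domain
y = [1, 0, 1, 0]
full_wt = [10, 20, 30, 40]
in_domain = [True, True, False, False]
rep_wts = [
    [12, 18, 30, 40],  # perturbs in-domain weights
    [8, 22, 30, 40],   # perturbs in-domain weights
    [10, 20, 25, 45],  # perturbs only out-of-domain weights
    [10, 20, 35, 35],  # perturbs only out-of-domain weights
]

theta, se = sdr_prop_se(y, full_wt, rep_wts, in_domain)
print(theta, se)
```

In this sketch the last two replicates change only out-of-domain weights and so contribute nothing to the variance. That is why I would expect a subpopulation standard error to reflect the subpopulation's records, and why standard errors that match the unfiltered dataset look suspicious to me.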