Good morning,
I'm now updating my logistic-regression workflow to use svy rather than samplics. As a test, I wrote the following code to produce a logistic regression in svy:
import pandas as pd # I'm more familiar with Pandas than Polars, but the code could of course be rewritten to eliminate the need for Pandas
import polars as pl
import svy
# The data within this .csv file is entirely fictional
df_survey = pd.read_csv('https://raw.githubusercontent.com/ifstudies/carsurveydata/refs/heads/main/car_survey_updated.csv')
# Confirming that there are no NaN values in the dataset:
print(df_survey.isna().sum().sum()) # returns 0, confirming that no NaN values exist
# The following code is based on:
# https://svylab.com/docs/svy/tutorials/sample_quicktour.html
design = svy.Design(wgt = 'Weight')
sample = svy.Sample(pl.DataFrame(df_survey), design=design)
print(sample)
unfiltered_logit_model = sample.glm.fit(y = "Enjoy_Driving_Fast_Strongly_Agree",
x = [svy.Cat('Car_Color')],
family = 'binomial',
link = 'logit')
print(unfiltered_logit_model)
df_unfiltered_logit_model = unfiltered_logit_model.to_polars().to_pandas()
Here's the first part of the output of print(unfiltered_logit_model):
GLM: Binomial (logit)
Modeling : Enjoy_Driving_Fast_Strongly_Agree
n : 1059 DF Residuals : 1058
Deviance : 233.4935 Scale : 1.0000
AIC : 239.4935 BIC : 254.3888
R-squared : 0.07168 R-sq (adj) : 0.06992
F-stat (adj) : 26.66671 Prob (F-adj) : <0.001
And here's a look at df_unfiltered_logit_model:
term estimate std_err conf_low conf_high statistic p_value df
0 _intercept_ -0.097241 0.114417 -0.321751 0.127270 -0.849880 3.955839e-01 1058
1 Car_Color_Red 0.022538 0.174665 -0.320192 0.365268 0.129035 8.973545e-01 1058
2 Car_Color_White -1.295713 0.193773 -1.675936 -0.915491 -6.686768 3.687131e-11 1058
Meanwhile, here's my equivalent R code:
library(tidyverse)
library(survey)
library(srvyr)
library(broom)
df_survey <- read_csv('https://raw.githubusercontent.com/ifstudies/carsurveydata/refs/heads/main/car_survey_updated.csv')
survey_des <- df_survey %>% as_survey_design(
weights = 'Weight',
# pps = "brewer",
# variance = "YG",
nest = TRUE)
unfiltered_regression <- survey_des %>% svyglm(design = .,
formula = Enjoy_Driving_Fast_Strongly_Agree ~ Car_Color, family = quasibinomial)
Here's what unfiltered_regression looks like:
Independent Sampling design (with replacement)
Called via srvyr
Sampling variables:
- ids: `1`
- weights: Weight
Call: svyglm(formula = Enjoy_Driving_Fast_Strongly_Agree ~ Car_Color,
design = ., family = quasibinomial)
Coefficients:
(Intercept) Car_ColorRed Car_ColorWhite
-0.09724 0.02254 -1.29571
Degrees of Freedom: 1058 Total (i.e. Null); 1056 Residual
Null Deviance: 1415
Residual Deviance: 1334 AIC: NA
And here's the output of tidy(unfiltered_regression):
term<chr> estimate<dbl> std.error<dbl> statistic<dbl> p.value<dbl>
(Intercept) -0.09724094 0.1144172 -0.8498803 3.955843e-01
Car_ColorRed 0.02253795 0.1746653 0.129035 8.973545e-01
Car_ColorWhite -1.29571315 0.1937727 -6.6867684 3.690394e-11
These results are nearly identical. However, svy's Car_ColorWhite p-value (3.687131e-11) is slightly lower than R's (3.690394e-11). I also noticed that the residual degrees of freedom according to R (1056) appear to differ from those in svy (1058), though perhaps I'm just misunderstanding the output.
Do you know what might be causing this discrepancy? The estimates, standard errors, and test statistics for the Car_ColorWhite coefficient all match exactly; it's just the p-value that differs. This makes me wonder whether R and svy might be calculating p-values slightly differently in this case.
Good morning,
I'm now updating my logistic-regression workflow to use svy rather than samplics. As a test, I wrote the following code to produce a logistic regression in svy:
Here's the first part of the output of
print(unfiltered_logit_model):And here's a look at df_unfiltered_logit_model:
Meanwhile, here's my equivalent R code:
Here's what
unfiltered_regressionlooks like:And here's the output of
tidy(unfiltered_regression):These results are nearly identical. However, svy's
Car_ColorWhitep-value (3.687131e-11) is slightly lower than R's (3.690394e-11). I also noticed that the residual degrees of freedom according to R (1056) appear to differ from those in svy (1058), though perhaps I'm just misunderstanding the output.Do you know what might be causing this discrepancy? The estimates, standard errors, and test statistics for the
Car_ColorWhitecoefficient all match exactly; it's just the p-value that differs. This makes me wonder whether R and svy might be calculating p-values slightly differently in this case.