Skip to content

Getting slightly different p values in R and svy within a logistic regression #5

@kburchfiel

Description

@kburchfiel

Good morning,

I'm now updating my logistic-regression workflow to use svy rather than samplics. As a test, I wrote the following code to produce a logistic regression in svy:

import pandas as pd # I'm more familiar with Pandas than Polars, but the code could of course be rewritten to eliminate the need for Pandas
import polars as pl
import svy
# The data within this .csv file is entirely fictional
df_survey = pd.read_csv('https://raw.githubusercontent.com/ifstudies/carsurveydata/refs/heads/main/car_survey_updated.csv')
# Confirming that there are no NaN values in the dataset:
print(df_survey.isna().sum().sum()) # returns 0, confirming that no NaN values exist

# The following code is based on:
# https://svylab.com/docs/svy/tutorials/sample_quicktour.html
design = svy.Design(wgt = 'Weight')
sample = svy.Sample(pl.DataFrame(df_survey), design=design)
print(sample)

unfiltered_logit_model = sample.glm.fit(y = "Enjoy_Driving_Fast_Strongly_Agree",
                                        x = [svy.Cat('Car_Color')],
                                        family = 'binomial',
                                        link = 'logit')
print(unfiltered_logit_model)

df_unfiltered_logit_model = unfiltered_logit_model.to_polars().to_pandas()

Here's the first part of the output of print(unfiltered_logit_model):

GLM: Binomial (logit)
  Modeling : Enjoy_Driving_Fast_Strongly_Agree
  n            : 1059          DF Residuals : 1058
  Deviance     : 233.4935      Scale        : 1.0000
  AIC          : 239.4935      BIC          : 254.3888
  R-squared    : 0.07168       R-sq (adj)   : 0.06992
  F-stat (adj) : 26.66671      Prob (F-adj) : <0.001

And here's a look at df_unfiltered_logit_model:

	term	estimate	std_err	conf_low	conf_high	statistic	p_value	df
0	_intercept_	-0.097241	0.114417	-0.321751	0.127270	-0.849880	3.955839e-01	1058
1	Car_Color_Red	0.022538	0.174665	-0.320192	0.365268	0.129035	8.973545e-01	1058
2	Car_Color_White	-1.295713	0.193773	-1.675936	-0.915491	-6.686768	3.687131e-11	1058

Meanwhile, here's my equivalent R code:

library(tidyverse)
library(survey)
library(srvyr)
library(broom)
df_survey <- read_csv('https://raw.githubusercontent.com/ifstudies/carsurveydata/refs/heads/main/car_survey_updated.csv')
survey_des <- df_survey %>% as_survey_design(
  weights = 'Weight',
  # pps = "brewer",
  # variance = "YG",
  nest = TRUE)
unfiltered_regression <- survey_des %>% svyglm(design = ., 
formula = Enjoy_Driving_Fast_Strongly_Agree ~ Car_Color, family = quasibinomial)

Here's what unfiltered_regression looks like:

Independent Sampling design (with replacement)
Called via srvyr
Sampling variables:
  - ids: `1` 
  - weights: Weight 

Call:  svyglm(formula = Enjoy_Driving_Fast_Strongly_Agree ~ Car_Color, 
    design = ., family = quasibinomial)

Coefficients:
   (Intercept)    Car_ColorRed  Car_ColorWhite  
      -0.09724         0.02254        -1.29571  

Degrees of Freedom: 1058 Total (i.e. Null);  1056 Residual
Null Deviance:	    1415 
Residual Deviance: 1334 	AIC: NA

And here's the output of tidy(unfiltered_regression):

term<chr>	estimate<dbl>	std.error<dbl>	statistic<dbl>	p.value<dbl>
(Intercept)	-0.09724094	0.1144172	-0.8498803	3.955843e-01
Car_ColorRed	0.02253795	0.1746653	0.129035	8.973545e-01
Car_ColorWhite	-1.29571315	0.1937727	-6.6867684	3.690394e-11

These results are nearly identical. However, svy's Car_ColorWhite p-value (3.687131e-11) is slightly lower than R's (3.690394e-11). I also noticed that the residual degrees of freedom according to R (1056) appear to differ from those in svy (1058), though perhaps I'm just misunderstanding the output.

Do you know what might be causing this discrepancy? The estimates, standard errors, and test statistics for the Car_ColorWhite coefficient all match exactly; it's just the p-value that differs. This makes me wonder whether R and svy might be calculating p-values slightly differently in this case.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions