Getting slightly different p values in R and svy within a logistic regression

Good morning, 

I'm now updating my logistic-regression workflow to use svy rather than samplics. As a test, I wrote the following code to produce a logistic regression in svy:

```
import pandas as pd # I'm more familiar with Pandas than Polars, but the code could of course be rewritten to eliminate the need for Pandas
import polars as pl
import svy
# The data within this .csv file is entirely fictional
df_survey = pd.read_csv('https://raw.githubusercontent.com/ifstudies/carsurveydata/refs/heads/main/car_survey_updated.csv')
# Confirming that there are no NaN values in the dataset:
print(df_survey.isna().sum().sum()) # returns 0, confirming that no NaN values exist

# The following code is based on:
# https://svylab.com/docs/svy/tutorials/sample_quicktour.html
design = svy.Design(wgt = 'Weight')
sample = svy.Sample(pl.DataFrame(df_survey), design=design)
print(sample)

unfiltered_logit_model = sample.glm.fit(y = "Enjoy_Driving_Fast_Strongly_Agree",
                                        x = [svy.Cat('Car_Color')],
                                        family = 'binomial',
                                        link = 'logit')
print(unfiltered_logit_model)

df_unfiltered_logit_model = unfiltered_logit_model.to_polars().to_pandas()
```

Here's the first part of the output of `print(unfiltered_logit_model)`:

```
GLM: Binomial (logit)
  Modeling : Enjoy_Driving_Fast_Strongly_Agree
  n            : 1059          DF Residuals : 1058
  Deviance     : 233.4935      Scale        : 1.0000
  AIC          : 239.4935      BIC          : 254.3888
  R-squared    : 0.07168       R-sq (adj)   : 0.06992
  F-stat (adj) : 26.66671      Prob (F-adj) : <0.001
```

And here's a look at df_unfiltered_logit_model:

```
	term	estimate	std_err	conf_low	conf_high	statistic	p_value	df
0	_intercept_	-0.097241	0.114417	-0.321751	0.127270	-0.849880	3.955839e-01	1058
1	Car_Color_Red	0.022538	0.174665	-0.320192	0.365268	0.129035	8.973545e-01	1058
2	Car_Color_White	-1.295713	0.193773	-1.675936	-0.915491	-6.686768	3.687131e-11	1058
```

Meanwhile, here's my equivalent R code:

```
library(tidyverse)
library(survey)
library(srvyr)
library(broom)
df_survey <- read_csv('https://raw.githubusercontent.com/ifstudies/carsurveydata/refs/heads/main/car_survey_updated.csv')
survey_des <- df_survey %>% as_survey_design(
  weights = 'Weight',
  # pps = "brewer",
  # variance = "YG",
  nest = TRUE)
unfiltered_regression <- survey_des %>% svyglm(design = ., 
formula = Enjoy_Driving_Fast_Strongly_Agree ~ Car_Color, family = quasibinomial)
```

Here's what `unfiltered_regression` looks like:

```
Independent Sampling design (with replacement)
Called via srvyr
Sampling variables:
  - ids: `1` 
  - weights: Weight 

Call:  svyglm(formula = Enjoy_Driving_Fast_Strongly_Agree ~ Car_Color, 
    design = ., family = quasibinomial)

Coefficients:
   (Intercept)    Car_ColorRed  Car_ColorWhite  
      -0.09724         0.02254        -1.29571  

Degrees of Freedom: 1058 Total (i.e. Null);  1056 Residual
Null Deviance:	    1415 
Residual Deviance: 1334 	AIC: NA
```

And here's the output of `tidy(unfiltered_regression)`:

```
term<chr>	estimate<dbl>	std.error<dbl>	statistic<dbl>	p.value<dbl>
(Intercept)	-0.09724094	0.1144172	-0.8498803	3.955843e-01
Car_ColorRed	0.02253795	0.1746653	0.129035	8.973545e-01
Car_ColorWhite	-1.29571315	0.1937727	-6.6867684	3.690394e-11
```

These results are nearly identical. However, svy's `Car_ColorWhite` p-value (3.687131e-11) is slightly lower than R's (`3.690394e-11`). I also noticed that the residual degrees of freedom according to R (1056) appear to differ from those in svy (1058), though perhaps I'm just misunderstanding the output.

Do you know what might be causing this discrepancy? The estimates, standard errors, and test statistics for the `Car_ColorWhite` coefficient all match exactly; it's just the p-value that differs. This makes me wonder whether R and svy might be calculating p-values slightly differently in this case.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Getting slightly different p values in R and svy within a logistic regression #5

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Getting slightly different p values in R and svy within a logistic regression #5

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions