feat: adding 3 functions for custom school data audits #63

nm3224 · 2025-11-24T20:38:30Z

changes & context:

feat: added func for finding duplicates across files based on primary keys (course, cohort and semester files
feat: added func for checking consistency between credits earned vs attempted
feat: added func for checking consistency between p/f flag column and grades column

feat: added func for finding duplicates across files based on primary keys (course, cohort and semester files feat: added func for checking consistency between credits earned vs attempted featu: added func for checking consistency between p/f flag column and grades column

vishpillai123

I think we may need a follow-up PR for consolidation/refactoring/renaming, but I think this is okay for now.

src/edvise/data_audit/eda.py

vishpillai123 · 2025-12-01T15:52:31Z

src/edvise/data_audit/eda.py

+    return dupes
+
+
+def check_credit_earned_attempted_consistency(


This looks really similar to Mesh's function he added last week:validate_credit_consistency(...). But it looks like his function compares between the course and semester file, while this function compares within a dataframe. I think there's definitely some room for consolidation here.

At the least, I'm wondering if we can name this function differently?

yea i agree, i had the same thoughts... do you want me to brainstorm consolidating in this PR or a new one?
renaming is easy, I can do that in this PR- any suggestions?

see new rename @vishpillai123

@vishpillai123 potential new func combining the two:

import typing as t import pandas as pd def check_earned_vs_attempted( df: pd.DataFrame, *, earned_col: str, attempted_col: str, ) -> t.Dict[str, pd.DataFrame]: """ Row-wise checks that: 1. credits_earned <= credits_attempted 2. credits_earned = 0 when credits_attempted = 0 """ earned = pd.to_numeric(df[earned_col], errors="coerce") attempted = pd.to_numeric(df[attempted_col], errors="coerce") earned_gt_attempted = earned > attempted earned_when_no_attempt = (attempted == 0) & (earned > 0) mask = earned_gt_attempted | earned_when_no_attempt anomalies = df[mask].copy() anomalies["earned_gt_attempted"] = earned_gt_attempted[mask] anomalies["earned_when_no_attempt"] = earned_when_no_attempt[mask] summary = pd.DataFrame( { "earned_gt_attempted": [int(earned_gt_attempted.sum())], "earned_when_no_attempt": [int(earned_when_no_attempt.sum())], "total_anomalous_rows": [int(mask.sum())], } ) return {"anomalies": anomalies, "summary": summary} def validate_credit_consistency( semester_df: pd.DataFrame, course_df: pd.DataFrame, cohort_df: pd.DataFrame, *, id_col: str = "student_id", sem_col: str = "semester_code", # course-level credit columns course_credits_attempted_col: str = "credits_attempted", course_credits_earned_col: str = "credits_earned", # semester-level credit columns semester_credits_attempted_col: str = "number_of_semester_credits_attempted", semester_credits_earned_col: str = "number_of_semester_credits_earned", semester_courses_count_col: str = "number_of_semester_courses_enrolled", # cohort-level credit columns (df_cohort) cohort_credits_attempted_col: str = "inst_tot_credits_attempted", cohort_credits_earned_col: str = "inst_tot_credits_earned", credit_tol: float = 0.0, ) -> t.Dict[str, t.Any]: """ Unified credit validation: 1) Reconcile semester-level aggregates (semester_df) against course-level details (course_df). 2) Check row-wise consistency of credits earned vs attempted on cohort_df (df_cohort). """ # ---------- 1) RECONCILIATION: semester vs course ---------- c = course_df[ [id_col, sem_col, course_credits_attempted_col, course_credits_earned_col] ].copy() s = semester_df[ [ id_col, sem_col, semester_credits_attempted_col, semester_credits_earned_col, semester_courses_count_col, ] ].copy() agg = ( c.groupby([id_col, sem_col], dropna=False) .agg( course_sum_attempted=(course_credits_attempted_col, "sum"), course_sum_earned=(course_credits_earned_col, "sum"), course_count=(course_credits_attempted_col, "size"), ) .reset_index() ) merged = s.merge(agg, on=[id_col, sem_col], how="left", indicator="_merge_agg") merged["has_course_rows"] = merged["_merge_agg"].eq("both") merged["course_sum_attempted"] = merged["course_sum_attempted"].fillna(0.0) merged["course_sum_earned"] = merged["course_sum_earned"].fillna(0.0) merged["course_count"] = merged["course_count"].fillna(0.0) # Ensure numeric semester columns for col in ( semester_credits_attempted_col, semester_credits_earned_col, semester_courses_count_col, ): if not pd.api.types.is_numeric_dtype(merged[col]): merged[col] = pd.to_numeric(merged[col], errors="coerce") merged["diff_attempted"] = ( merged["course_sum_attempted"] - merged[semester_credits_attempted_col] ) merged["diff_earned"] = ( merged["course_sum_earned"] - merged[semester_credits_earned_col] ) merged["diff_courses"] = merged["course_count"] - merged[ semester_courses_count_col ].fillna(0.0) merged["match_attempted"] = merged["diff_attempted"].abs() <= credit_tol merged["match_earned"] = merged["diff_earned"].abs() <= credit_tol merged["match_courses"] = merged["diff_courses"] == 0.0 mismatches = ( merged.loc[ ~( merged["match_attempted"] & merged["match_earned"] & merged["match_courses"] ), [ id_col, sem_col, semester_credits_attempted_col, "course_sum_attempted", "diff_attempted", semester_credits_earned_col, "course_sum_earned", "diff_earned", semester_courses_count_col, "course_count", "diff_courses", "has_course_rows", ], ] .sort_values([id_col, sem_col]) .reset_index(drop=True) ) reconciliation_summary = { "total_semesters_in_semester_file": int(len(s)), "unique_student_semesters_in_courses": int(len(agg)), "rows_with_mismatches": int(len(mismatches)), } # ---------- 2) COHORT-LEVEL ROW-WISE CHECKS ---------- cohort_checks = check_earned_vs_attempted( cohort_df, earned_col=cohort_credits_earned_col, attempted_col=cohort_credits_attempted_col, ) return { "reconciliation_summary": reconciliation_summary, "reconciliation_mismatches": mismatches, "reconciliation_merged_detail": merged, "cohort_anomalies": cohort_checks["anomalies"], "cohort_anomalies_summary": cohort_checks["summary"], }

Nice, this looks great! Thanks for consolidating. Can you update the PR with this? It's okay if you replace Mesh's previous one. This is just a refactor.

I'll comment out both and add this (just incase we for whatever reason need the old code and this breaks- i haven't tested it)

I would prefer to not have old code commented out here since this our production repo and we want to keep it clean in terms of tech debt. If we haven't tested it yet, can we create a couple unit tests to make sure it runs as we expect? You can have chat help out with writing a couple unit tests using either mock or pytest.

hmm ok, let me remove it

@vishpillai123 just noting @Mesh-ach function takes less parameters, so that's a modification that will be needed in re-running anything on the new version going forward.

vishpillai123 · 2025-12-01T15:55:52Z

src/edvise/data_audit/eda.py

+    pf_col="pass_fail_flag",
+    credits_col="credits_earned",
+    *,
+    passing_grades=(


I think we should create a constants file within eda similar to feature generation.

Then, we can refer to those constants from here.

that's a good idea! should that happen in this PR or a future one?

We can create another PR for this, but leave this as is for now.

vishpillai123 · 2025-12-01T15:59:36Z

src/edvise/data_audit/eda.py

+    no_credits_with_passing = (pff == True) & credits.notna() & (credits == 0)
+
+    # Grade vs PF disagreement (only where both known)
+    grade_pf_disagree = pfg.notna() & pff.notna() & (pfg != pff)


So is grade_pf_disagree just a sum of earned_with_failing_grade and no_credits_with_passing_grade?

Can we name these with the prefix of rows_with_ or rows_ or something?

it notes all rows where the p/f flag column doesn't match the "expected" p/f flag based on the grade. We can rename. that's easy!

see new rename @vishpillai123

…, rename function fix: rename variables with pre-fix "rows_with" for additional clarity fix: rename function

…kind/edvise into custom_data_audit_functions

fix: style fix: consolidating validate_credit_consistency func with validate_credit_consistency_single_df into validates_credit_consistency; commenting old functions in bottom incase ever needed

nm3224 added 2 commits November 24, 2025 15:37

style: ruff formatting

e3ca2b2

vishpillai123 approved these changes Dec 1, 2025

View reviewed changes

nm3224 added 9 commits December 1, 2025 11:55

fix: rename variables with pre-fix "rows_with" for additional clarity…

ff4061e

…, rename function fix: rename variables with pre-fix "rows_with" for additional clarity fix: rename function

Merge branch 'develop' into custom_data_audit_functions

731bee1

fix: style

084a8d0

Merge branch 'custom_data_audit_functions' of https://github.com/data…

3a4aa39

…kind/edvise into custom_data_audit_functions

fix: style, consolidating functions

f0654b2

fix: style fix: consolidating validate_credit_consistency func with validate_credit_consistency_single_df into validates_credit_consistency; commenting old functions in bottom incase ever needed

Merge branch 'develop' into custom_data_audit_functions

9537081

fix: style

e6611f9

fix: removing deprecated code per suggestion

a52730e

fix: changing "validates" to "validate" in function name

415e242

nm3224 merged commit 8b597b1 into develop Dec 1, 2025
6 checks passed

nm3224 deleted the custom_data_audit_functions branch December 1, 2025 22:29

feat: adding 3 functions for custom school data audits #63

feat: adding 3 functions for custom school data audits #63

Uh oh!

Conversation

nm3224 commented Nov 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

changes & context:

Uh oh!

vishpillai123 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nm3224 commented Nov 24, 2025 •

edited

Loading