-
Notifications
You must be signed in to change notification settings - Fork 1
feat: adding 3 functions for custom school data audits #63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
feat: added func for finding duplicates across files based on primary keys (course, cohort and semester files feat: added func for checking consistency between credits earned vs attempted featu: added func for checking consistency between p/f flag column and grades column
vishpillai123
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may need a follow-up PR for consolidation/refactoring/renaming, but I think this is okay for now.
src/edvise/data_audit/eda.py
Outdated
| return dupes | ||
|
|
||
|
|
||
| def check_credit_earned_attempted_consistency( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really similar to Mesh's function he added last week:validate_credit_consistency(...). But it looks like his function compares between the course and semester file, while this function compares within a dataframe. I think there's definitely some room for consolidation here.
At the least, I'm wondering if we can name this function differently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea i agree, i had the same thoughts... do you want me to brainstorm consolidating in this PR or a new one?
renaming is easy, I can do that in this PR- any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see new rename @vishpillai123
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vishpillai123 potential new func combining the two:
import typing as t
import pandas as pd
def check_earned_vs_attempted(
df: pd.DataFrame,
*,
earned_col: str,
attempted_col: str,
) -> t.Dict[str, pd.DataFrame]:
"""
Row-wise checks that:
1. credits_earned <= credits_attempted
2. credits_earned = 0 when credits_attempted = 0
"""
earned = pd.to_numeric(df[earned_col], errors="coerce")
attempted = pd.to_numeric(df[attempted_col], errors="coerce")
earned_gt_attempted = earned > attempted
earned_when_no_attempt = (attempted == 0) & (earned > 0)
mask = earned_gt_attempted | earned_when_no_attempt
anomalies = df[mask].copy()
anomalies["earned_gt_attempted"] = earned_gt_attempted[mask]
anomalies["earned_when_no_attempt"] = earned_when_no_attempt[mask]
summary = pd.DataFrame(
{
"earned_gt_attempted": [int(earned_gt_attempted.sum())],
"earned_when_no_attempt": [int(earned_when_no_attempt.sum())],
"total_anomalous_rows": [int(mask.sum())],
}
)
return {"anomalies": anomalies, "summary": summary}
def validate_credit_consistency(
semester_df: pd.DataFrame,
course_df: pd.DataFrame,
cohort_df: pd.DataFrame,
*,
id_col: str = "student_id",
sem_col: str = "semester_code",
# course-level credit columns
course_credits_attempted_col: str = "credits_attempted",
course_credits_earned_col: str = "credits_earned",
# semester-level credit columns
semester_credits_attempted_col: str = "number_of_semester_credits_attempted",
semester_credits_earned_col: str = "number_of_semester_credits_earned",
semester_courses_count_col: str = "number_of_semester_courses_enrolled",
# cohort-level credit columns (df_cohort)
cohort_credits_attempted_col: str = "inst_tot_credits_attempted",
cohort_credits_earned_col: str = "inst_tot_credits_earned",
credit_tol: float = 0.0,
) -> t.Dict[str, t.Any]:
"""
Unified credit validation:
1) Reconcile semester-level aggregates (semester_df) against
course-level details (course_df).
2) Check row-wise consistency of credits earned vs attempted
on cohort_df (df_cohort).
"""
# ---------- 1) RECONCILIATION: semester vs course ----------
c = course_df[
[id_col, sem_col, course_credits_attempted_col, course_credits_earned_col]
].copy()
s = semester_df[
[
id_col,
sem_col,
semester_credits_attempted_col,
semester_credits_earned_col,
semester_courses_count_col,
]
].copy()
agg = (
c.groupby([id_col, sem_col], dropna=False)
.agg(
course_sum_attempted=(course_credits_attempted_col, "sum"),
course_sum_earned=(course_credits_earned_col, "sum"),
course_count=(course_credits_attempted_col, "size"),
)
.reset_index()
)
merged = s.merge(agg, on=[id_col, sem_col], how="left", indicator="_merge_agg")
merged["has_course_rows"] = merged["_merge_agg"].eq("both")
merged["course_sum_attempted"] = merged["course_sum_attempted"].fillna(0.0)
merged["course_sum_earned"] = merged["course_sum_earned"].fillna(0.0)
merged["course_count"] = merged["course_count"].fillna(0.0)
# Ensure numeric semester columns
for col in (
semester_credits_attempted_col,
semester_credits_earned_col,
semester_courses_count_col,
):
if not pd.api.types.is_numeric_dtype(merged[col]):
merged[col] = pd.to_numeric(merged[col], errors="coerce")
merged["diff_attempted"] = (
merged["course_sum_attempted"] - merged[semester_credits_attempted_col]
)
merged["diff_earned"] = (
merged["course_sum_earned"] - merged[semester_credits_earned_col]
)
merged["diff_courses"] = merged["course_count"] - merged[
semester_courses_count_col
].fillna(0.0)
merged["match_attempted"] = merged["diff_attempted"].abs() <= credit_tol
merged["match_earned"] = merged["diff_earned"].abs() <= credit_tol
merged["match_courses"] = merged["diff_courses"] == 0.0
mismatches = (
merged.loc[
~(
merged["match_attempted"]
& merged["match_earned"]
& merged["match_courses"]
),
[
id_col,
sem_col,
semester_credits_attempted_col,
"course_sum_attempted",
"diff_attempted",
semester_credits_earned_col,
"course_sum_earned",
"diff_earned",
semester_courses_count_col,
"course_count",
"diff_courses",
"has_course_rows",
],
]
.sort_values([id_col, sem_col])
.reset_index(drop=True)
)
reconciliation_summary = {
"total_semesters_in_semester_file": int(len(s)),
"unique_student_semesters_in_courses": int(len(agg)),
"rows_with_mismatches": int(len(mismatches)),
}
# ---------- 2) COHORT-LEVEL ROW-WISE CHECKS ----------
cohort_checks = check_earned_vs_attempted(
cohort_df,
earned_col=cohort_credits_earned_col,
attempted_col=cohort_credits_attempted_col,
)
return {
"reconciliation_summary": reconciliation_summary,
"reconciliation_mismatches": mismatches,
"reconciliation_merged_detail": merged,
"cohort_anomalies": cohort_checks["anomalies"],
"cohort_anomalies_summary": cohort_checks["summary"],
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this looks great! Thanks for consolidating. Can you update the PR with this? It's okay if you replace Mesh's previous one. This is just a refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll comment out both and add this (just incase we for whatever reason need the old code and this breaks- i haven't tested it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to not have old code commented out here since this our production repo and we want to keep it clean in terms of tech debt. If we haven't tested it yet, can we create a couple unit tests to make sure it runs as we expect? You can have chat help out with writing a couple unit tests using either mock or pytest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm ok, let me remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vishpillai123 just noting @Mesh-ach function takes less parameters, so that's a modification that will be needed in re-running anything on the new version going forward.
| pf_col="pass_fail_flag", | ||
| credits_col="credits_earned", | ||
| *, | ||
| passing_grades=( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should create a constants file within eda similar to feature generation.
Then, we can refer to those constants from here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a good idea! should that happen in this PR or a future one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can create another PR for this, but leave this as is for now.
src/edvise/data_audit/eda.py
Outdated
| no_credits_with_passing = (pff == True) & credits.notna() & (credits == 0) | ||
|
|
||
| # Grade vs PF disagreement (only where both known) | ||
| grade_pf_disagree = pfg.notna() & pff.notna() & (pfg != pff) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is grade_pf_disagree just a sum of earned_with_failing_grade and no_credits_with_passing_grade?
Can we name these with the prefix of rows_with_ or rows_ or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it notes all rows where the p/f flag column doesn't match the "expected" p/f flag based on the grade. We can rename. that's easy!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see new rename @vishpillai123
…, rename function fix: rename variables with pre-fix "rows_with" for additional clarity fix: rename function
…kind/edvise into custom_data_audit_functions
fix: style fix: consolidating validate_credit_consistency func with validate_credit_consistency_single_df into validates_credit_consistency; commenting old functions in bottom incase ever needed
changes & context: