Skip to content

Conversation

@nm3224
Copy link
Collaborator

@nm3224 nm3224 commented Nov 24, 2025

changes & context:

  • feat: added func for finding duplicates across files based on primary keys (course, cohort and semester files
  • feat: added func for checking consistency between credits earned vs attempted
  • feat: added func for checking consistency between p/f flag column and grades column

feat: added func for finding duplicates across files based on primary keys (course, cohort and semester files
feat: added func for checking consistency between credits earned vs attempted
featu: added func for checking consistency between p/f flag column and grades column
Copy link
Collaborator

@vishpillai123 vishpillai123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may need a follow-up PR for consolidation/refactoring/renaming, but I think this is okay for now.

return dupes


def check_credit_earned_attempted_consistency(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really similar to Mesh's function he added last week:validate_credit_consistency(...). But it looks like his function compares between the course and semester file, while this function compares within a dataframe. I think there's definitely some room for consolidation here.

At the least, I'm wondering if we can name this function differently?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea i agree, i had the same thoughts... do you want me to brainstorm consolidating in this PR or a new one?
renaming is easy, I can do that in this PR- any suggestions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see new rename @vishpillai123

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vishpillai123 potential new func combining the two:

import typing as t
import pandas as pd

def check_earned_vs_attempted(
    df: pd.DataFrame,
    *,
    earned_col: str,
    attempted_col: str,
) -> t.Dict[str, pd.DataFrame]:
    """
    Row-wise checks that:
      1. credits_earned <= credits_attempted
      2. credits_earned = 0 when credits_attempted = 0
    """
    earned = pd.to_numeric(df[earned_col], errors="coerce")
    attempted = pd.to_numeric(df[attempted_col], errors="coerce")

    earned_gt_attempted = earned > attempted
    earned_when_no_attempt = (attempted == 0) & (earned > 0)
    mask = earned_gt_attempted | earned_when_no_attempt

    anomalies = df[mask].copy()
    anomalies["earned_gt_attempted"] = earned_gt_attempted[mask]
    anomalies["earned_when_no_attempt"] = earned_when_no_attempt[mask]

    summary = pd.DataFrame(
        {
            "earned_gt_attempted": [int(earned_gt_attempted.sum())],
            "earned_when_no_attempt": [int(earned_when_no_attempt.sum())],
            "total_anomalous_rows": [int(mask.sum())],
        }
    )

    return {"anomalies": anomalies, "summary": summary}


def validate_credit_consistency(
    semester_df: pd.DataFrame,
    course_df: pd.DataFrame,
    cohort_df: pd.DataFrame,
    *,
    id_col: str = "student_id",
    sem_col: str = "semester_code",
    # course-level credit columns
    course_credits_attempted_col: str = "credits_attempted",
    course_credits_earned_col: str = "credits_earned",
    # semester-level credit columns
    semester_credits_attempted_col: str = "number_of_semester_credits_attempted",
    semester_credits_earned_col: str = "number_of_semester_credits_earned",
    semester_courses_count_col: str = "number_of_semester_courses_enrolled",
    # cohort-level credit columns (df_cohort)
    cohort_credits_attempted_col: str = "inst_tot_credits_attempted",
    cohort_credits_earned_col: str = "inst_tot_credits_earned",
    credit_tol: float = 0.0,
) -> t.Dict[str, t.Any]:
    """
    Unified credit validation:

    1) Reconcile semester-level aggregates (semester_df) against
       course-level details (course_df).

    2) Check row-wise consistency of credits earned vs attempted
       on cohort_df (df_cohort).
    """

    # ---------- 1) RECONCILIATION: semester vs course ----------

    c = course_df[
        [id_col, sem_col, course_credits_attempted_col, course_credits_earned_col]
    ].copy()
    s = semester_df[
        [
            id_col,
            sem_col,
            semester_credits_attempted_col,
            semester_credits_earned_col,
            semester_courses_count_col,
        ]
    ].copy()

    agg = (
        c.groupby([id_col, sem_col], dropna=False)
        .agg(
            course_sum_attempted=(course_credits_attempted_col, "sum"),
            course_sum_earned=(course_credits_earned_col, "sum"),
            course_count=(course_credits_attempted_col, "size"),
        )
        .reset_index()
    )

    merged = s.merge(agg, on=[id_col, sem_col], how="left", indicator="_merge_agg")
    merged["has_course_rows"] = merged["_merge_agg"].eq("both")
    merged["course_sum_attempted"] = merged["course_sum_attempted"].fillna(0.0)
    merged["course_sum_earned"] = merged["course_sum_earned"].fillna(0.0)
    merged["course_count"] = merged["course_count"].fillna(0.0)

    # Ensure numeric semester columns
    for col in (
        semester_credits_attempted_col,
        semester_credits_earned_col,
        semester_courses_count_col,
    ):
        if not pd.api.types.is_numeric_dtype(merged[col]):
            merged[col] = pd.to_numeric(merged[col], errors="coerce")

    merged["diff_attempted"] = (
        merged["course_sum_attempted"] - merged[semester_credits_attempted_col]
    )
    merged["diff_earned"] = (
        merged["course_sum_earned"] - merged[semester_credits_earned_col]
    )
    merged["diff_courses"] = merged["course_count"] - merged[
        semester_courses_count_col
    ].fillna(0.0)

    merged["match_attempted"] = merged["diff_attempted"].abs() <= credit_tol
    merged["match_earned"] = merged["diff_earned"].abs() <= credit_tol
    merged["match_courses"] = merged["diff_courses"] == 0.0

    mismatches = (
        merged.loc[
            ~(
                merged["match_attempted"]
                & merged["match_earned"]
                & merged["match_courses"]
            ),
            [
                id_col,
                sem_col,
                semester_credits_attempted_col,
                "course_sum_attempted",
                "diff_attempted",
                semester_credits_earned_col,
                "course_sum_earned",
                "diff_earned",
                semester_courses_count_col,
                "course_count",
                "diff_courses",
                "has_course_rows",
            ],
        ]
        .sort_values([id_col, sem_col])
        .reset_index(drop=True)
    )

    reconciliation_summary = {
        "total_semesters_in_semester_file": int(len(s)),
        "unique_student_semesters_in_courses": int(len(agg)),
        "rows_with_mismatches": int(len(mismatches)),
    }

    # ---------- 2) COHORT-LEVEL ROW-WISE CHECKS ----------

    cohort_checks = check_earned_vs_attempted(
        cohort_df,
        earned_col=cohort_credits_earned_col,
        attempted_col=cohort_credits_attempted_col,
    )

    return {
        "reconciliation_summary": reconciliation_summary,
        "reconciliation_mismatches": mismatches,
        "reconciliation_merged_detail": merged,
        "cohort_anomalies": cohort_checks["anomalies"],
        "cohort_anomalies_summary": cohort_checks["summary"],
    }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this looks great! Thanks for consolidating. Can you update the PR with this? It's okay if you replace Mesh's previous one. This is just a refactor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll comment out both and add this (just incase we for whatever reason need the old code and this breaks- i haven't tested it)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to not have old code commented out here since this our production repo and we want to keep it clean in terms of tech debt. If we haven't tested it yet, can we create a couple unit tests to make sure it runs as we expect? You can have chat help out with writing a couple unit tests using either mock or pytest.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ok, let me remove it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vishpillai123 just noting @Mesh-ach function takes less parameters, so that's a modification that will be needed in re-running anything on the new version going forward.

pf_col="pass_fail_flag",
credits_col="credits_earned",
*,
passing_grades=(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should create a constants file within eda similar to feature generation.

Then, we can refer to those constants from here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good idea! should that happen in this PR or a future one?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can create another PR for this, but leave this as is for now.

no_credits_with_passing = (pff == True) & credits.notna() & (credits == 0)

# Grade vs PF disagreement (only where both known)
grade_pf_disagree = pfg.notna() & pff.notna() & (pfg != pff)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is grade_pf_disagree just a sum of earned_with_failing_grade and no_credits_with_passing_grade?

Can we name these with the prefix of rows_with_ or rows_ or something?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it notes all rows where the p/f flag column doesn't match the "expected" p/f flag based on the grade. We can rename. that's easy!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see new rename @vishpillai123

…, rename function

fix: rename variables with pre-fix "rows_with" for additional clarity
fix: rename function
fix: style
fix: consolidating validate_credit_consistency func with validate_credit_consistency_single_df into validates_credit_consistency; commenting old functions in bottom incase ever needed
@nm3224 nm3224 merged commit 8b597b1 into develop Dec 1, 2025
6 checks passed
@nm3224 nm3224 deleted the custom_data_audit_functions branch December 1, 2025 22:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants