fix(metrics-authenticity): correct authenticity logic to match paper definition. #86
base: main
Conversation
📝 Walkthrough
This pull request refactors the authenticity metric calculation in the AlphaPrecision class within the statistical evaluation module. The calculation shifts from a direct distance comparison to a nearest-neighbor-based approach that determines, for each synthetic point, whether its nearest real point is closer to another real point than to the synthetic point. The implementation removes the previous default centering behavior for the embedding center parameter. Corresponding unit tests are updated with adjusted tolerance values and one modified expected value to reflect the new calculation logic.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
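To make the described check concrete, here is a minimal sketch of that nearest-neighbor authenticity logic, assuming scikit-learn's NearestNeighbors; the helper and variable names are illustrative and this is not the PR's exact implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def authenticity_sketch(x_real: np.ndarray, x_syn: np.ndarray) -> float:
    """Illustrative nearest-neighbor authenticity check (not the module's exact code)."""
    nn = NearestNeighbors(n_neighbors=2).fit(x_real)

    # Distance from each real point to its nearest *other* real point
    # (column 0 is the zero self-distance, so take column 1).
    real_to_real = nn.kneighbors(x_real)[0][:, 1]

    # For every synthetic point, find its nearest real point and the distance to it.
    real_to_synth, real_to_synth_idx = nn.kneighbors(x_syn, n_neighbors=1)
    real_to_synth = real_to_synth.squeeze(axis=1)
    real_to_synth_idx = real_to_synth_idx.squeeze(axis=1)

    # A synthetic point counts as authentic when its nearest real point is closer
    # to another real point than it is to the synthetic point.
    authentic = real_to_real[real_to_synth_idx] <= real_to_synth
    return float(np.mean(authentic))
```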
Pre-merge checks and finishing touches: ✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/midst_toolkit/evaluation/quality/synthcity/statistical_eval.py (2)
120-120: Minor: Remove unnecessary whitespace change. This appears to be an unintentional whitespace-only change.
164-171: Consider vectorizing the authenticity calculation for better performance. The new authenticity logic correctly implements the paper definition by considering all synthetic samples. However, the explicit Python loop can be replaced with vectorized NumPy operations for significantly better performance, especially with large datasets.
Apply this diff to vectorize the calculation:
```diff
-        authen = []
-        for syn_idx in range(x_syn.shape[0]):
-            d_real_to_synthetic = closest_real_to_synth_distance[syn_idx]
-            d_real_to_real = real_to_real[closest_real_to_synthetic_idx_list[syn_idx]]
-            is_authentic = d_real_to_real <= d_real_to_synthetic
-            authen.append(is_authentic)
-
-        authenticity = np.mean(authen)
+        authen = real_to_real[closest_real_to_synthetic_idx_list] <= closest_real_to_synth_distance
+        authenticity = np.mean(authen)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/midst_toolkit/evaluation/quality/synthcity/statistical_eval.py (2 hunks)
- tests/unit/evaluation/quality/test_alpha_precision.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: run-code-check
- GitHub Check: unit-tests
- GitHub Check: integration-tests
🔇 Additional comments (1)
tests/unit/evaluation/quality/test_alpha_precision.py (1)
57-62: Based on my investigation, I cannot run the test directly due to missing dependencies in the sandbox environment, but the concerns raised here are valid based on testing best practices.
The pytest.approx default tolerance is a relative tolerance of 1e-6 (one part in a million), and the change from abs=1e-8 to abs=1e-2 represents a 1,000,000× relaxation of the precision requirement. Absolute and relative tolerances matter for CI across different systems and architectures, with larger tolerances typically needed for single-precision versus double-precision arithmetic.
The authenticity_naive change from 0.9905 to 0.7695 (a ~22% drop) is substantial and requires justification. Without access to the implementation changes that prompted this adjustment, I cannot definitively verify whether the tolerance increase is justified.
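For reference, a small illustration of how those absolute tolerances behave, reusing a couple of the values mentioned above (the assertions below are a standalone sketch, not lines from the test suite):

```python
import pytest

# pytest.approx accepts a result when its difference from the expected value is
# within the tolerance band; abs= widens that band to an absolute amount
# (the default relative tolerance of 1e-6 still applies as a floor).
assert 0.9905185186 == pytest.approx(0.9905185185185185, abs=1e-8)  # diff ~1e-10: passes
assert 0.7695 != pytest.approx(0.9905185185185185, abs=1e-8)        # diff ~0.22: well outside the band
assert 0.9850 == pytest.approx(0.9905185185185185, abs=1e-2)        # diff ~0.0055: inside the ±0.01 band
```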
Overall, great catch on this bug. Really silly error on their part. Just a few suggestions.
```python
assert pytest.approx(0.05994074074074074, abs=1e-8) == quality_results["delta_precision_alpha_naive"]
assert pytest.approx(0.005229629629629584, abs=1e-8) == quality_results["delta_coverage_beta_naive"]
assert pytest.approx(0.9905185185185185, abs=1e-8) == quality_results["authenticity_naive"]
assert pytest.approx(0.9732668369518944, abs=1e-2) == quality_results["delta_precision_alpha_OC"]
```
These are still very loose. I think we talked about just going back to the original values I had proposed and not worrying about the cluster variations, but let me know if I'm misremembering.
Yes right. It's a mistake on my part.
```python
d_real_to_synthetic = closest_real_to_synth_distance[syn_idx]
d_real_to_real = real_to_real[closest_real_to_synthetic_idx_list[syn_idx]]
is_authentic = d_real_to_real <= d_real_to_synthetic
authen.append(is_authentic)
```
I may be wrong, but I believe we can forgo the for loop here by just doing

```python
is_authentic = closest_real_to_synth_distance < real_to_real[closest_real_to_synthetic_idx_list]
authenticity = np.mean(is_authentic.astype(int))
```
You may have to do a touch of reshaping on these tensors but multi-indexing on numpy arrays normally works I think.
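A quick standalone check of that indexing idea with made-up arrays (the names mirror the snippet above, but the data is random and this is not the module's actual tensors):

```python
import numpy as np

rng = np.random.default_rng(0)
closest_real_to_synth_distance = rng.random(5)              # one distance per synthetic point
closest_real_to_synthetic_idx_list = rng.integers(0, 3, 5)  # index of each synthetic point's nearest real point
real_to_real = rng.random(3)                                # each real point's nearest-real distance

# Loop version, mirroring the comparison done in the current implementation.
loop_result = np.array([
    real_to_real[closest_real_to_synthetic_idx_list[i]] <= closest_real_to_synth_distance[i]
    for i in range(5)
])

# Fancy-indexing version: gather real_to_real at the nearest-real indices in one shot.
vectorized_result = real_to_real[closest_real_to_synthetic_idx_list] <= closest_real_to_synth_distance

assert np.array_equal(loop_result, vectorized_result)
```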
```python
# See which one is bigger
authen = real_to_real[real_to_synth_args] < real_to_synth
```
This error brought to you by someone who couldn't be bothered to actually fully understand what the nearest neighbor function does...yikes.
```python
else:
    assert pytest.approx(0.9732668369518944, abs=1e-8) == quality_results["delta_precision_alpha_OC"]
    assert pytest.approx(0.47238271604938276, abs=1e-8) == quality_results["delta_coverage_beta_OC"]
    assert pytest.approx(0.5102592592592593, abs=1e-8) == quality_results["authenticity_OC"]
```
This is probably a naive question, but does your change affect the authenticity_OC metric as well? If so, then removing this makes sense. Just want to check.
Yes, exactly. The process for calculating authenticity is shared between them; the only difference is that for the OC variant the data is first embedded using the one-layer NN.
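As a hedged sketch of that relationship (random toy data; `embed` is a hypothetical stand-in for the trained one-layer network, and `authenticity_sketch` is the illustrative helper from earlier in this thread, not the toolkit's real entry point):

```python
import numpy as np

rng = np.random.default_rng(1)
x_real = rng.normal(size=(100, 4))
x_syn = rng.normal(size=(100, 4))

# Hypothetical one-layer embedding standing in for the OC model's forward pass.
weights, bias = rng.normal(size=(4, 8)), rng.normal(size=8)

def embed(x: np.ndarray) -> np.ndarray:
    return np.tanh(x @ weights + bias)

# The same routine computes both metrics; only the inputs differ.
authenticity_naive = authenticity_sketch(x_real, x_syn)
authenticity_oc = authenticity_sketch(embed(x_real), embed(x_syn))
```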
```python
# Check naive authenticity as the _OC metric depends on a 1-layer NN training
# which may give different results on different architectures
expected_authenticity = 0.0
```
Based on a scan of the tests (correct me if I'm wrong), none of them have an expected authenticity greater than 0. I think we want to see at least one where we get something non-zero.
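One hedged way to get a non-zero expectation with toy data (not the project's actual fixtures; `authenticity_sketch` again refers to the illustrative helper sketched earlier): real points clustered tightly and synthetic points shifted far away should score close to 1, since no synthetic sample is a near-copy of a real one.

```python
import numpy as np
import pytest


def test_authenticity_nonzero_for_distant_synthetic_points():
    # Real data tightly clustered at the origin; synthetic data shifted far away,
    # so no synthetic sample sits closer to its nearest real point than that real
    # point sits to another real point.
    rng = np.random.default_rng(42)
    x_real = rng.normal(loc=0.0, scale=0.1, size=(100, 2))
    x_syn = rng.normal(loc=5.0, scale=0.1, size=(100, 2))

    assert authenticity_sketch(x_real, x_syn) == pytest.approx(1.0)
```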
fix(metrics): correct authenticity calculation to align with formal definition
PR Type
Fix
Short Description
This PR corrects the authenticity metric calculation so that it considers all synthetic samples, not only those that are closest to a real data point.