Refactor synthetic_data to use variable_elimination with evidence#93
Refactor synthetic_data to use variable_elimination with evidence#93ryan112358 wants to merge 3 commits intomasterfrom
Conversation
This change refactors `MarkovRandomField.synthetic_data` to use `variable_elimination` with the `evidence` parameter for generating conditional distributions. It implements a hybrid approach to avoid OOM errors on large potentials: 1. First, `self.project` is used to obtain the joint marginal of the target column and its parents. This leverages cached marginals and efficient elimination. 2. Then, `variable_elimination` is called on this projected factor with the vector evidence. This correctly handles slicing and normalization per row. This simplifies the code by removing manual conditional probability calculations and ensures consistency with `variable_elimination` logic. It also preserves the stratified sampling logic for `method='round'`. Co-authored-by: ryan112358 <8495634+ryan112358@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
This commit refactors `MarkovRandomField.synthetic_data` to use `variable_elimination` directly with the `evidence` parameter, replacing the previous manual conditional probability logic. This is cleaner and leverages existing marginal oracles. To support this efficiently with vector evidence (large N), `variable_elimination` in `src/mbi/marginal_oracles.py` has been improved to use a lazy slicing strategy. Instead of immediately slicing all potentials (which can explode memory usage to N * prod(domain)), it treats evidence variables as "retained" and delays slicing until it is beneficial (when the factor size reduction outweighs the cost of introducing the N dimension) or unavoidable. It also enforces consistency to ensure sliced and unsliced versions of the same variable are not mixed. Additionally, `test/test_markov_random_field.py` was updated to correctly pass log-space potentials to `MarkovRandomField`, as `variable_elimination` expects log-potentials. This fixes a latent bug exposed by the refactor. Co-authored-by: ryan112358 <8495634+ryan112358@users.noreply.github.com>
This commit fixes CI failures in `pytype` for `src/mbi/marginal_oracles.py`.
1. Added missing `import numpy as np`.
2. Updated `variable_elimination` type hint for `evidence` to allow `jax.Array` and `np.ndarray`.
3. Changed `evidence = evidence or {}` to explicit `if evidence is None: evidence = {}` to help pytype's inference engine understand that `evidence` is never None when accessed later.
Co-authored-by: ryan112358 <8495634+ryan112358@users.noreply.github.com>
Refactor synthetic_data to use variable_elimination with evidence.
The
synthetic_datamethod was manually computing conditional probabilities by projecting onto parent cliques and performing division and cumsum. This refactor replaces that logic withmarginal_oracles.variable_elimination, which now supports anevidenceparameter (including vector evidence).To prevent Out-Of-Memory errors when slicing large potentials with vector evidence (which expands the domain by N), the implementation first projects the model onto the relevant clique (target + parents) using
self.project. This reduces the factor size before applying the evidence viavariable_elimination.The sampling logic for
method='sample'andmethod='round'is preserved, ensuring that the statistical properties of the synthetic data (especially conditional correlations) are maintained.PR created automatically by Jules for task 12391338670753120305 started by @ryan112358