NullRepair is an LLM-based tool that automatically repairs nullability errors reported by NullAway based on safe usage regions. It is built on top of NullAwayAnnotator.
The preprint LLM-Based Repair of Static Nullability Errors describes the tool and the conducted experiments in detail.
Please cite the paper if you use NullRepair in your research:
@misc{karimipour2026llmbasedrepairstaticnullability,
title={LLM-Based Repair of Static Nullability Errors},
author={Nima Karimipour and Pascal Joos and Michael Pradel and Martin Kellogg and Manu Sridharan},
year={2026},
eprint={2507.20674},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2507.20674},
}
The VS Code Dev Container extension with VS Code.
Tested with Docker 29.4.1
OpenAI API key (for running NullRepair and the baselines, not needed for reproducing the experiment results from the log files).
Do not open the VS Code Dev Container before completing steps 1 to 3.
1. Clone this repository using SSH.
2. Run `git submodule update --init --recursive` to initialize the mini-swe-agent submodule.
3. Run `bash checkout_benchmarks.sh`. This clones the target projects for the experiment into `benchmarks`. Refer to these folders for any commits created by NullRepair.
4. Reopen the project in a dev container using the VS Code Dev Container extension. All required dependencies and setup steps are then executed automatically. Wait until the postCreateCommand finishes and the terminal is ready to use.
5. Activate the Python environment, if it is not yet active, by running `source .venv/bin/activate`.
6. Configure the OpenAI API key by running the script `set_openai_key.py` (`python3 set_openai_key.py`) and pasting the key when prompted. This writes the API key to the mini-SWE-agent configuration file and adds it to a .env file. The key is needed to run NullRepair and the baselines, which use the OpenAI API. For a lightweight reproduction of the experiment results from the log files, the API key is not needed.
Run the smoke test to verify that all components are correctly installed and functional.
No OpenAI API key is required.
The test checks Java, the NullRepair JAR, the benchmark projects, the Python packages, and runs the full NullRepair pipeline in disabled mode (static analysis only, no LLM call). It takes roughly 2-3 minutes.
python3 smoke_test.py
Expected output:
============================================================
1/3 Prerequisites
============================================================
[PASS] Java is available
(openjdk version "21.0.10" 2026-01-20)
[PASS] NullRepair JAR is built
[PASS] Benchmark projects are checked out
============================================================
2/3 Python evaluation packages
============================================================
[PASS] pandas, numpy, scikit-learn are importable
============================================================
3/3 NullRepair end-to-end pipeline (no LLM)
============================================================
Running NullRepair on one error of 'eureka' in disabled mode.
Exercises: NullAway static analysis, build, annotation injection,
and git integration — no API call is made.
$ java -jar annotator-core/build/libs/annotator-core-1.3.16-SNAPSHOT.jar eureka --mode disabled --selectedErrorIds 2 --depth 1
ANNOTATOR VERSION: 3, BUILD: 6
Received arguments: eureka, --mode, disabled, --selectedErrorIds, 2, --depth, 1
Running eureka benchmark in disabled mode.
Resolve remaining errors mode: DISABLED
Selected error IDs: [2]
Configuring logging for benchmark: eureka, branch: joos/disabled-3
Root path for logs: /home/vscode/NullRepair/evaluation_data/logs/eureka/disabled-3
Running on branch name: joos/disabled-3
Starting annotator...
Preprocessing...
Annotating...true
Max Depth level: 1
Analyzing at level 1, Scheduling for: 5 builds for: 14 fixes
Processing 20% [============> ] 1/5 (0:00:00 / 0:00:00)
...
Processing 100% [===================================================] 5/5 (0:00:11 / 0:00:00)
Finished annotating.
Commiting changes to branch joos/disabled-3...
[PASS] Pipeline runs end-to-end
============================================================
SMOKE TEST PASSED — NullRepair is correctly installed.
============================================================
This requires an OpenAI API key to be set up as described in 1.3., as NullRepair queries the OpenAI API to generate fixes.
A small example run where NullRepair is run on three nullability errors of project eureka can be executed with the following command:
java -jar annotator-core/build/libs/annotator-core-1.3.16-SNAPSHOT.jar eureka --mode advanced --selectedErrorIds 2,4,5
Expected output (truncated):
ANNOTATOR VERSION: 3, BUILD: 6
Received arguments: eureka, --mode, advanced, --selectedErrorIds, 2,4,5
Running eureka benchmark in advanced mode.
Resolve remaining errors mode: ADVANCED
Selected error IDs: [2, 4, 5]
Configuring logging for benchmark: eureka, branch: joos/advanced-3
Root path for logs: /home/vscode/NullRepair/evaluation_data/logs/eureka/advanced-3
Running on branch name: joos/advanced-3
Starting annotator...
Preprocessing...
Annotating...false
Loading cache...
Loaded 0 entries from cache.
Max Depth level: 1
Analyzing at level 1, Scheduling for: 5 builds for: 14 fixes
Processing 100% [===============================================================================================================================================] 5/5 (0:00:18 / 0:00:00)
2 : TOP LEVEL CALL TO FIX ERROR: Type='METHOD_NO_INIT', message='initializer method does not guarantee @NonNull field serverConfig (line 106) is initialized along all control-flow paths (remember to check for exceptions or early returns).'
/home/vscode/benchmarks/eureka/eureka-core/src/main/java/com/netflix/eureka/RateLimitingFilter.java:114
public RateLimitingFilter() {}
Resetting NullAwayCodeFix state.
Sending request to OpenAI...
Response received from OpenAI.
Token usage - Uncached Prompt: 406, Cached Prompt: 0, Completion: 497, Total: 903
Cached response
Finished processing.
Time taken to fix error: 3713 ms
>>> PROGRESS [====== ] 1 / 3 (33%) <<<
Writing log to file...
Logging ChatGPT token usage...
Calculating run metrics...
Running tests...
Trying to commit changes...
Commiting changes...
4 : TOP LEVEL CALL TO FIX ERROR: Type='DEREFERENCE_NULLABLE', message='dereferenced expression resourceRecordSetWithHostedZone is @Nullable'
/home/vscode/benchmarks/eureka/eureka-core/src/main/java/com/netflix/eureka/aws/Route53Binder.java:277
resourceRecordSetWithHostedZone
...
Logs are then located at evaluation_data/logs/eureka/advanced-3 and the changes made by NullRepair are committed to the branch joos/advanced-3 in the target project repository at benchmarks/eureka.
Correspondence of approach names in the paper and the repository:
| Paper name | Repository name |
|---|---|
| NullRepair | advanced |
| SinglePrompt baseline | basic |
| mini-SWE-agent baseline | agent_baseline |
See evaluation_data/logs for the logs of executed runs and benchmarks for the target projects.
The log files are organized by project and by experiment configuration (used approach and per-patch/combined mode). Each run creates a new log folder.
For example, for the run on eureka of NullRepair in per-patch mode, refer to evaluation_data/logs/eureka/advanced-evaluation-run-gpt5.1 for the logs of the run.
The logs are structured as follows:
- `app.log` contains the complete execution log of the run.
- `log-<errorID>.log` contains the execution log for the specific errorID.
- `test-log-<errorID>.log` contains the log of the test execution after fixing the specific errorID.
- `metrics.tsv` contains the metrics on fix success for each error of the run.
- `token_usage.tsv` contains token usage information for each error of the run.
- `commits.tsv` contains the commit information for each error of the run, where a fix was created.
- `timers.tsv` logs the end-to-end time taken for the run.
For the created fixes, you can check the commit history of the respective run's branch (joos/<log-folder-name>) in the target project repository (e.g., for NullRepair per-patch on eureka it is the branch joos/advanced-evaluation-run-gpt5.1 of benchmarks/eureka).
Each fix made by NullRepair is committed separately with a commit message that includes the error ID and the error message and is then reverted in a subsequent commit (for the per-patch mode).
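The naming scheme above (log folder, target repository, and branch all derived from the project and the log folder name) can be captured in a small helper. This is only an illustrative sketch and not part of the repository; the function name `run_locations` is made up for this example, and all paths are relative to the repository root.

```python
from pathlib import PurePosixPath


def run_locations(project: str, run_folder: str) -> dict:
    """Return where the logs and commits of one run live (hypothetical helper)."""
    return {
        # Logs: evaluation_data/logs/<project>/<log-folder-name>
        "log_dir": PurePosixPath("evaluation_data/logs") / project / run_folder,
        # Target project checkout: benchmarks/<project>
        "repo_dir": PurePosixPath("benchmarks") / project,
        # Branch with the created fixes: joos/<log-folder-name>
        "branch": f"joos/{run_folder}",
    }


loc = run_locations("eureka", "advanced-evaluation-run-gpt5.1")
print(loc["log_dir"])
print(f"git -C {loc['repo_dir']} log --oneline {loc['branch']}")
```

The second printed line is a plain `git log` invocation that lists the fix and revert commits of the run's branch.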
Aggregated stats on the runs, plots, and manual inspection results can be found in evaluation_data/evaluation_results, organized as follows:
- `per_patch/`. Corresponds to: RQ1-C1 (Table 2), RQ2 (Table 4). Per-patch level results of the different approaches. Shows success rates, failing tests, token usage, and timing per benchmark project and in total.
- `combined/`. Corresponds to: RQ1-C1 (Table 2), RQ1-C2 (Table 3). Combined level results of the different approaches. Shows success rates, failing tests, token usage, and timing per benchmark project and in total.
- `manual_inspection/`. Corresponds to: RQ1-C3. The 75-sample manual inspection dataset, per-reviewer initial scores, the consolidated scoring file (manual_inspection_scoring_with_classification.tsv), and derived statistics. The scoring file manual_inspection_scoring_with_classification.tsv is most relevant for reproducing and validating the manual inspection results. For all 75 samples, it contains the full reasoning of the reviewers' initial scoring, the subsequent consolidation discussions, and the final consolidated scores. Import it into Google Sheets, Excel, or a similar spreadsheet tool for better readability.
- `venn_diagrams/`. Not included in the accepted paper version. Venn diagrams showing the overlap across approaches in resolved errors, in resolved errors with no failing tests, and in manual inspection scores.
- `stats_excluding_preliminary_study_projects/`. Corresponds to: Threats to Validity. Results with the three preliminary-study projects (conductor, litiengine, retrofit) excluded.
The TSV files contain all raw numbers used in the paper. The tables below map each paper column to the corresponding stat-file field, so the numbers can be directly verified.
Table 2 — per-patch mode (RQ1-C1) (files: evaluation_results/per_patch/evaluation_stats_<approach>_per_patch_evaluation_gpt5.1.tsv (separated by approach)):
| Paper column | Stat-file field | Notes |
|---|---|---|
| Errors | total_target_errors | |
| G (generated, no compile errors) | generated_patches_no_compilation_errors | = generated_patches − error_introducing_patches |
| R (resolved, no new errors) | resolving_patches_and_no_new_errors | Not resolving_patches_incl_new_errors, which also counts patches that resolved the target error but introduced a new one |
| TE (triggered new nullability error) | trigger_new_error_patches | |
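The relation stated in the Notes column for G can be checked mechanically on a per-patch stat file. The following is a minimal sketch, assuming the file is tab-separated with a header row and that the three field names appear as column headers; the `project` column and the two sample rows are hypothetical placeholders, not real results.

```python
import csv
import io


def check_generated_identity(tsv_text: str) -> list:
    """Return indices of rows where G != generated_patches - error_introducing_patches."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    mismatches = []
    for index, row in enumerate(reader):
        expected = float(row["generated_patches"]) - float(row["error_introducing_patches"])
        actual = float(row["generated_patches_no_compilation_errors"])
        if abs(actual - expected) > 1e-9:
            mismatches.append(index)
    return mismatches


# Hypothetical sample: the second row deliberately violates the identity.
sample = (
    "project\tgenerated_patches\terror_introducing_patches\t"
    "generated_patches_no_compilation_errors\n"
    "demo-a\t10\t2\t8\n"
    "demo-b\t5\t1\t3\n"
)
print(check_generated_identity(sample))  # [1]
```

To run it on a real stat file, pass the file contents instead of `sample`, for example `Path(stat_file).read_text()`.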
Table 2 — combined mode (RQ1-C1) (files: evaluation_results/combined/evaluation_stats_<approach>_combined__evaluation_gpt5.1.tsv (separated by approach)):
| Paper column | Stat-file field | Notes |
|---|---|---|
| Errors | total_target_errors | |
| R (resolved when selectively applied) | resolved_target_errors | = total_target_errors − remaining_errors |
Table 3 — test failures in combined mode (RQ1-C2) (same combined stat files):
| Paper column | Stat-file field | Notes |
|---|---|---|
| Number of Failing Unit Tests | total_test_failures | |
Manual inspection results — (RQ1-C3):
Per-approach (tool) score counts and averages come from evaluation_results/manual_inspection/scoring_stats/manual_inspection_statistics.tsv (approach names are prefixed with hash_):
| Paper text | Stat-file field | Value for NullRepair (hash_advanced) |
|---|---|---|
| Likely acceptable (score 1) | Count_Score_1 | 29 (39%) |
| Needs work (score 2) | Count_Score_2 | 16 (21%) |
| Likely unacceptable (score 3) | Count_Score_3 | 30 (40%) |
| Average score | Average_Score | 2.01 |
| Overall wins (outperforms both baselines) | Overall_Wins | 28 |
| Overall ties (tied for best with ≥ 1 baseline) | Overall_Ties | 26 |
| Overall losses (outperformed by any baseline) | Overall_Losses | 21 |
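As a quick consistency check, the percentages, the average score, and the win/tie/loss total follow directly from the counts in the table. The snippet below only redoes that arithmetic with the values listed above; it does not read any file.

```python
# Score counts for NullRepair (hash_advanced), copied from the table above.
counts = {1: 29, 2: 16, 3: 30}
outcomes = {"wins": 28, "ties": 26, "losses": 21}

total = sum(counts.values())
shares = {score: round(100 * n / total) for score, n in counts.items()}
average = sum(score * n for score, n in counts.items()) / total

print(total)                            # 75
print(shares)                           # {1: 39, 2: 21, 3: 40}
print(round(average, 2))                # 2.01
print(sum(outcomes.values()) == total)  # True
```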
The Correctness of Classification metrics from the paper are based on the column Correctness of classification by NullRepair in evaluation_data/evaluation_results/manual_inspection/manual_inspection_scoring_with_classification.tsv.
Table 4 — efficiency metrics (RQ2) (files: evaluation_results/per_patch/evaluation_stats_<approach>_per_patch_evaluation_gpt5.1.tsv (separated by approach)):
The paper reports per-project and per-error averages; the stat files store totals and per-error averages.
Refer to the last row of the per-patch stat files for the totals across all projects, and the avg_ fields for the per-error averages.
| Paper metric | Stat-file field | Conversion |
|---|---|---|
| Time (min) (per error) | avg_execution_time_sec | ÷ 60 to get minutes |
| Number of Prompts (per error) | avg_agent_cycles | Direct |
| Total Tokens (k) (per error) | avg_tokens | ÷ 1000 |
| Cost (USD) (per error) | avg_monetary_cost | Direct |
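The conversions in the table can be applied to one stat-file row as sketched below. The function name `to_paper_units` and the example values are hypothetical; only the four field names come from the table.

```python
def to_paper_units(row: dict) -> dict:
    """Convert the avg_ fields of one stat-file row to the units used in Table 4."""
    return {
        "time_min": float(row["avg_execution_time_sec"]) / 60,  # seconds to minutes
        "prompts": float(row["avg_agent_cycles"]),              # used directly
        "tokens_k": float(row["avg_tokens"]) / 1000,            # tokens to thousands
        "cost_usd": float(row["avg_monetary_cost"]),            # used directly
    }


# Hypothetical row values, not taken from the stat files.
example_row = {
    "avg_execution_time_sec": "150",
    "avg_agent_cycles": "4",
    "avg_tokens": "12500",
    "avg_monetary_cost": "0.05",
}
converted = to_paper_units(example_row)
print(converted)
```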
Pre-computed results are already present in evaluation_data/evaluation_results/. To recompute them from the log files, run the single wrapper script from the repository root:
python3 reproduce_results.py
Reproduced output files are written with a _reproduced suffix, so they sit alongside the originals without overwriting them.
Refer to the mapping of paper table metrics to stat-file fields in section 2.3. to verify the numbers in the paper against the reproduced stat files.
This script runs the following steps in order:
1. Evaluation statistics (`evaluation_scripts/calculate_evaluation_stats.py`). Corresponds to: RQ1-C1 (Table 2), RQ1-C2 (Table 3), RQ2 (Table 4). Aggregates per-error metrics (total generated patches, total resolved errors, failing tests, token usage, cost) for all six experiment configurations (NullRepair / SinglePrompt / mini-SWE-agent × per-patch / combined). Outputs six TSV files to `evaluation_data/evaluation_results/per_patch/` and `evaluation_data/evaluation_results/combined/` with names such as `evaluation_stats_advanced_per_patch_reproduced.tsv`.
2. Patch file-count statistics (`evaluation_scripts/patch_file_count_stats.py`). Analyses how many Java files each generated patch touches, broken down by approach and outcome. Prints a summary table and writes `evaluation_data/evaluation_results/per_patch/patch_file_count_stats_reproduced.csv`.
3. Manual inspection score analysis (`evaluation_scripts/manual_inspection/analyze_manual_inspection_scores.py`). Corresponds to: RQ1-C3. Reads the consolidated 75-sample manual inspection file and computes per-tool score distributions, win/loss/tie counts, and pairwise matchup tables. Outputs `evaluation_data/evaluation_results/manual_inspection/scoring_stats/manual_inspection_statistics_reproduced.tsv` and a `_reproduced_pairwise.tsv` companion.
4. Inter-rater agreement (`evaluation_scripts/manual_inspection/calculate_inter_rater_agreement.py`). Additional analysis for RQ1-C3. Computes Cohen's Kappa across the three reviewer pairs over all scored patches and writes the full report to `evaluation_data/evaluation_results/manual_inspection/scoring_stats/agreement_analysis_reproduced.txt`.
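For readers unfamiliar with the agreement measure in the last step, the following is a minimal, dependency-free sketch of Cohen's Kappa for one reviewer pair, applied to all three pairs. It is not the repository's script, which may be implemented differently; the reviewer names and scores are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equally long lists of categorical scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scores (1 = likely acceptable, 2 = needs work, 3 = likely unacceptable).
reviewers = {
    "reviewer_1": [1, 1, 2, 3, 3, 2],
    "reviewer_2": [1, 2, 2, 3, 3, 3],
    "reviewer_3": [1, 1, 2, 3, 3, 2],
}
kappas = {
    (a, b): cohens_kappa(reviewers[a], reviewers[b])
    for a, b in combinations(reviewers, 2)
}
for pair, kappa in kappas.items():
    print(pair, round(kappa, 3))
```

A value of 1 means perfect agreement, and 0 means agreement no better than chance.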
Two additional kinds of figures can be created using Jupyter:
- Venn diagrams: open and run `evaluation_scripts/create_venn_diagrams.ipynb`.
- Manual inspection score plot. Corresponds to: RQ1-C3 (Figure 7). Open and run `evaluation_scripts/manual_inspection/manual_inspection_plot.ipynb`.
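Once `reproduce_results.py` has written the `_reproduced` files, an original and a reproduced stat file can be compared cell by cell. The sketch below assumes both files are tab-separated with identical headers and row order; the function names and the sample contents are hypothetical, and the two file paths have to be chosen by hand because the original and reproduced file names differ in more than the suffix.

```python
import csv
import io
import math


def load_tsv(text):
    """Parse TSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def same_cell(a, b, tol=1e-9):
    """Compare numerically when both cells parse as numbers, else as plain values."""
    try:
        return math.isclose(float(a), float(b), rel_tol=tol, abs_tol=tol)
    except (TypeError, ValueError):
        return a == b


def diff_tsv(original_text, reproduced_text):
    """Return (row_index, column, original, reproduced) for every differing cell."""
    original, reproduced = load_tsv(original_text), load_tsv(reproduced_text)
    diffs = []
    if len(original) != len(reproduced):
        diffs.append((None, "<row count>", len(original), len(reproduced)))
    for index, (orig_row, repro_row) in enumerate(zip(original, reproduced)):
        for column, value in orig_row.items():
            if column not in repro_row or not same_cell(value, repro_row[column]):
                diffs.append((index, column, value, repro_row.get(column)))
    return diffs


# Hypothetical contents: "10" and "10.0" count as equal, "3" and "4" do not.
original_sample = "project\ttotal_target_errors\tremaining_errors\ndemo\t10\t3\n"
reproduced_sample = "project\ttotal_target_errors\tremaining_errors\ndemo\t10.0\t4\n"
print(diff_tsv(original_sample, reproduced_sample))
```

For real files, pass `Path(original_file).read_text()` and `Path(reproduced_file).read_text()`; an empty list means the reproduced numbers match.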
Follow the installation steps in 1.3. and then run one of the following commands to run a large-scale experiment on a target project.
Run either NullRepair (advanced), the SinglePrompt baseline (basic), or the mini-SWE-agent baseline (agent_baseline).
By default, the project is reset for each error (patch-level analysis). Pass --combined to stack successful error patches (aggregate-level analysis).
The following commands run the experiment on project eureka.
Run NullRepair on project eureka:
java -jar annotator-core/build/libs/annotator-core-1.3.16-SNAPSHOT.jar eureka --mode advanced
Run SinglePrompt baseline on project eureka:
java -jar annotator-core/build/libs/annotator-core-1.3.16-SNAPSHOT.jar eureka --mode basic
Run mini-SWE-agent baseline on project eureka:
java -jar annotator-core/build/libs/annotator-core-1.3.16-SNAPSHOT.jar eureka --mode agent_baseline
List of all projects: conductor, eureka, glide, gson, jadx, libgdx, litiengine, mockito, retrofit, spring-boot, wala-util, zuul
Experiments on different projects can be run in parallel (with sufficient memory). However, multiple experiments on the same project cannot be run simultaneously.
The logs of each run are stored in a new folder in evaluation_data/logs with the name of the project and experiment mode.
To calculate the aggregated evaluation results from these log files, run the individual scripts in evaluation_scripts described in section 3 with adapted input and output paths.
If you want to run all experiments on all projects with all three modes and both patch-level and aggregate-level analysis, you can run the following script:
python3 run_nullrepair_and_baselines.py
This is long-running and expensive (about 4 days and 175 USD). We recommend running the experiments in smaller batches, which also allows for parallelization.
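One possible way to run a smaller batch is sketched below: different projects run in parallel, while all configurations of the same project run one after another, as required above. This is an illustrative sketch and not a replacement for `run_nullrepair_and_baselines.py`; the helper names are made up, and it assumes that `--combined` is a plain flag without a value. With `dry_run=True` (the default) it only prints the commands.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

JAR = "annotator-core/build/libs/annotator-core-1.3.16-SNAPSHOT.jar"
MODES = ["advanced", "basic", "agent_baseline"]


def build_command(project, mode, combined=False):
    """Build the command line for one experiment configuration."""
    command = ["java", "-jar", JAR, project, "--mode", mode]
    if combined:
        command.append("--combined")  # assumption: flag takes no value
    return command


def commands_for_project(project):
    """All six configurations (three modes, per-patch and combined) of one project."""
    return [build_command(project, mode, combined)
            for mode in MODES for combined in (False, True)]


def run_project(project, dry_run=True):
    """Run the configurations of one project sequentially."""
    for command in commands_for_project(project):
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


def run_batch(projects, max_parallel=2, dry_run=True):
    """Run several projects in parallel, never two runs of the same project."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        list(pool.map(lambda project: run_project(project, dry_run), projects))


run_batch(["eureka", "gson"])
```

Choose `max_parallel` according to the available memory, and run from the repository root so that the relative JAR path resolves.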
You can run NullRepair on new Java projects.
The following instructions assume that the target project uses Gradle.
The setup of a new project is illustrated with the example project https://github.com/cbeust/jcommander/tree/3-lts with Java 17.
1. Create a fork of the project if you do not have write access. We have created a fork for the example project here: https://github.com/Pascal-Joos/jcommander.
2. Clone the target project into the benchmarks directory and check out the branch you want to run on:
   `cd benchmarks`
   `git clone git@github.com:Pascal-Joos/jcommander.git`
   `cd jcommander`
   `git checkout 3-lts`
3. Create a new branch from this branch named `nimak/auto-code-fix`:
   `git checkout -b nimak/auto-code-fix`
4. Update the `build.gradle` or `build.gradle.kts` file to include the NullAway dependency and annotation processing. Commit these changes. See the following commit for an example of how to do this: https://github.com/Pascal-Joos/jcommander/commit/f608a5ae8a069d05f588a4d5b1b0c130b7594bbd. This includes adding a file prepare.sh.
5. Run the `prepare.sh` script to prepare the project for NullRepair.
6. Run the gradlew spotlessApply command and commit the changes:
   `./gradlew spotlessApply`
7. Add the project to the list of target projects, with an adequate configuration, in annotator-core/src/main/java/edu/ucr/cs/riple/core/Main.java. For our example, after line 139 add the following:
   `benchmarks.put("jcommander", new Benchmark("com.beust.jcommander", "jcommander", "compileJava", "test"));`
8. Rebuild the annotator-core module to include the new project in the configuration. Run these commands from the root of the repository:
   `./gradlew spotlessApply`
   `./gradlew build -x test`
9. First run NullAwayAnnotator on the project to add nullability annotations to the code, without running NullRepair. Then commit these changes to the `nimak/auto-code-fix` branch. This way, the changes made by NullRepair are clearer and the project is in a clean state before running NullRepair.
   `java -jar annotator-core/build/libs/annotator-core-1.3.16-SNAPSHOT.jar jcommander --mode disabled`
10. Finally, run NullRepair on the project:
   `java -jar annotator-core/build/libs/annotator-core-1.3.16-SNAPSHOT.jar jcommander --mode advanced`
   If you want the changes that NullRepair applies to be pushed to the remote repository directly, add the `--pushCommits` flag.
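Before starting a run on a new project, it can help to confirm that the files touched by the steps above are present. The following sketch is not part of NullRepair; it assumes that the Gradle build file, the Gradle wrapper (`gradlew`), and `prepare.sh` all sit in the root of the cloned project, which may not hold for every project. The function name and the temporary demo directory are hypothetical.

```python
import tempfile
from pathlib import Path


def missing_setup_files(project_dir):
    """List the expected setup files that are absent from the project root."""
    project_dir = Path(project_dir)
    missing = []
    # Either the Groovy or the Kotlin build script is sufficient.
    if not any((project_dir / name).is_file()
               for name in ("build.gradle", "build.gradle.kts")):
        missing.append("build.gradle or build.gradle.kts")
    for name in ("gradlew", "prepare.sh"):
        if not (project_dir / name).is_file():
            missing.append(name)
    return missing


# Demo on a throwaway directory that lacks prepare.sh.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "build.gradle.kts").write_text("// placeholder\n")
    (root / "gradlew").write_text("#!/bin/sh\n")
    demo_missing = missing_setup_files(root)
print(demo_missing)  # ['prepare.sh']
```

For the example project, the call would be `missing_setup_files("benchmarks/jcommander")`, run from the repository root.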
Key parameters are set in source files and require rebuilding after a change (step 8 of section 5).
LLM model — edit modelName in Config.java:
If the model's pricing is not yet listed in the MODEL_PRICING map in ChatGPT.java, add the pricing information to the map. Any OpenAI-compatible model name can be used.
Per-error cost budget — edit COST_LIMIT in ChatGPT.java:
NullRepair aborts LLM calls for an error once this limit is reached.
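The idea of a per-error cost budget can be illustrated as follows. This Python sketch is not the Java implementation in ChatGPT.java: the model name, the prices, and the limit are hypothetical placeholders, and only the three token categories mirror the token usage line printed in the run log.

```python
# Hypothetical prices in USD per one million tokens (not the MODEL_PRICING values).
PRICING = {
    "example-model": {"uncached_prompt": 1.00, "cached_prompt": 0.10, "completion": 8.00},
}
COST_LIMIT_USD = 0.50  # hypothetical per-error budget


def call_cost(model, uncached_prompt, cached_prompt, completion):
    """Cost in USD of one LLM call, given its token counts."""
    price = PRICING[model]
    return (uncached_prompt * price["uncached_prompt"]
            + cached_prompt * price["cached_prompt"]
            + completion * price["completion"]) / 1_000_000


class ErrorBudget:
    """Accumulates the cost spent on one error and reports when the limit is hit."""

    def __init__(self, limit=COST_LIMIT_USD):
        self.limit = limit
        self.spent = 0.0

    def record(self, cost):
        self.spent += cost

    def exhausted(self):
        return self.spent >= self.limit


budget = ErrorBudget()
# Token counts taken from the example log line; prices are made up.
budget.record(call_cost("example-model", 406, 0, 497))
print(round(budget.spent, 6), budget.exhausted())
```

A caller would check `budget.exhausted()` before each further request for the same error and stop once it returns true.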
To modify the cost limit and cycle limit for the mini-SWE-agent baseline, edit agentCostLimit and agentCycleLimit in Config.java:
Analysis depth — controls how many levels of the call graph are explored when building context. Pass --depth <n> on the command line (default: 6):
java -jar annotator-core/build/libs/annotator-core-1.3.16-SNAPSHOT.jar eureka --mode advanced --depth 3
NullRepair extends NullAwayAnnotator. The main entry point is annotator-core/src/main/java/edu/ucr/cs/riple/core/Main.java. The three repair modes are implemented in annotator-core/src/main/java/edu/ucr/cs/riple/core/checkers/nullaway/codefix/:
| Class | Mode |
|---|---|
| AdvancedNullAwayCodeFix | advanced (NullRepair) |
| BasicNullAwayCodeFix | basic (SinglePrompt baseline) |
| AgentBaselineNullAwayCodeFix | agent_baseline (mini-SWE-agent baseline) |
LLM communication is handled by ChatGPT.java in the same package. Configuration is managed by Config.java.