Time to first leaderboard entry: ~30 minutes. By the end of this tutorial you will have two reference baselines, your own run, and enough intuition to know whether your result is any good.
Before modeling, you need to understand what you're predicting and who you're predicting it for. Skipping this step is why most hackathon runs head in the wrong direction.
The dataset is a real-world credit scoring dataset (HELOC, Home Equity Line of Credit). Each row is a loan application. The target column is `RiskPerformance`: `Good` means the applicant repaid on time, `Bad` means they defaulted.
Rejecthon splits the data into three groups with fundamentally different roles:
```mermaid
flowchart TB
    ALL["All applicants"] --> OBS["✅ observed\n~60%\nFunded. We know if they defaulted."]
    ALL --> REJ["❓ rejected\n~25%\nTurned away. Outcomes unknown."]
    ALL --> EVAL["🔒 evaluation\n~15%\nHeld out. Used only to score models."]
```
| Population | What it is | Your model sees it? |
|---|---|---|
| `observed` | Applicants with known default outcomes | ✅ Used for training |
| `rejected` | Applicants with hidden outcomes | ✅ Used for training (with inferred labels) |
| `evaluation` | Held-out set | ❌ Only used to score your final model |
The evaluation population is sacred. Nobody's model ever trains on it. This keeps the comparison honest.
```bash
uv sync
```

Then open a notebook or Python REPL:
```python
from rejecthon import submit, leaderboard, show, preview

created = submit("first-run")
```

!!! note "What just happened?"
    The first `submit()` call does several things automatically:

    1. Downloads and prepares the HELOC dataset (takes ~1 minute the first time)
    2. Trains and stores the `hgb_observed` baseline (observed applicants only)
    3. Trains and stores the `hgb_estimated` baseline (reject inference with a default model)
    4. Trains **your** run using reject inference with the same default model
    5. Scores all runs on the evaluation set and stores them on the leaderboard

    Subsequent calls are fast because the dataset is already prepared.
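Steps 3 and 4 rely on reject inference: rejected applicants have no recorded outcome, so an inference model supplies one. The tutorial does not say which technique `rejecthon` uses internally. As a minimal sketch, assuming a technique known as fuzzy augmentation, each rejected row can be added to the training set twice, once per label, weighted by the inferred probability. The names `infer_p_bad` and `augment`, the `utilization` feature, and the toy rows are all hypothetical.

```python
# Illustrative sketch only: fuzzy augmentation, not necessarily what rejecthon does.

def infer_p_bad(row):
    """Hypothetical stand-in for an inference model trained on observed rows.

    Returns an inferred probability of default from a single toy feature.
    """
    return min(max(row["utilization"], 0.0), 1.0)


def augment(observed, rejected):
    """Build (row, label, weight) training triples.

    Observed rows keep their known label with weight 1.
    Each rejected row appears twice: as Bad with weight p, as Good with weight 1 - p.
    """
    training = [(row, row["label"], 1.0) for row in observed]
    for row in rejected:
        p_bad = infer_p_bad(row)
        training.append((row, "Bad", p_bad))
        training.append((row, "Good", 1.0 - p_bad))
    return training


observed = [
    {"utilization": 0.2, "label": "Good"},
    {"utilization": 0.9, "label": "Bad"},
]
rejected = [{"utilization": 0.7}]

training = augment(observed, rejected)
for row, label, weight in training:
    print(label, round(weight, 2))
```

The weighted triples could then be passed to any estimator that accepts per-sample weights, for example through the `sample_weight` argument that many scikit-learn estimators take in `fit`.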
The return value is a small dictionary:
```python
print(created)
# {
#     "run_id": "abc-123-...",
#     "roc_auc": 0.7843,
#     "brier_score": 0.1621,
#     "log_loss": 0.4892,
#     "delta_vs_observed": +0.0124,
#     "notes": None,
# }
```

`delta_vs_observed` is the most important number. It tells you how much ROC-AUC you gained (positive) or lost (negative) compared to the honest observed-only baseline.
```python
show()
```

This prints a color-coded table. Green means you're beating the baseline. 🏆 marks your leading run.
If you prefer a DataFrame for further analysis:
```python
lb = leaderboard()
print(lb)
```

```text
┌────────────────┬──────────┬────────┬──────────┬─────────────┬──────────┬──────┐
│ Name           │ ROC-AUC  │ Brier  │ Log Loss │ vs Observed │ Baseline │ Notes│
├────────────────┼──────────┼────────┼──────────┼─────────────┼──────────┼──────┤
│ first-run 🏆   │ 0.7843   │ 0.1621 │ 0.4892   │ +0.0124     │          │      │
│ hgb_observed   │ 0.7719   │ 0.1604 │ 0.4843   │ 0.0000      │ ✓        │      │
│ hgb_estimated  │ 0.7801   │ 0.1617 │ 0.4871   │ +0.0082     │ ✓        │      │
└────────────────┴──────────┴────────┴──────────┴─────────────┴──────────┴──────┘
```
| Column | What to look for |
|---|---|
| vs Observed | Green = beating the control. Your primary goal. |
| ROC-AUC | Higher is better. The ranking metric. |
| Brier / Log Loss | Lower is better. Checks probability quality. |
| Baseline | ✓ marks the seeded reference runs — never retrained. |
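For intuition about what the three metric columns measure, here is a dependency-free sketch on four made-up predictions. The function names and toy data are illustrative only, and `rejecthon` presumably relies on a library implementation instead. Labels use 1 for `Bad` and 0 for `Good`, which is an assumption about the encoding.

```python
import math


def roc_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood, with probabilities clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Toy data: 1 = Bad (defaulted), 0 = Good (repaid). Encoding is an assumption.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print(f"ROC-AUC:  {roc_auc(y_true, y_prob):.4f}")
print(f"Brier:    {brier_score(y_true, y_prob):.4f}")
print(f"Log loss: {log_loss(y_true, y_prob):.4f}")
```

ROC-AUC depends only on the ordering of the predictions, so it stays the same under any order-preserving rescaling of the probabilities. Brier score and log loss do change, which is why the table treats them as a separate check on probability quality.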
You now have one run on the board. The useful question is not "is 0.7843 good?" — it's "is it better than the baseline, and by how much?"
```python
from rejecthon import compare

result = compare("first-run", "hgb_observed")
print(result)
# {
#     "name_a": "first-run",
#     "name_b": "hgb_observed",
#     "roc_auc": +0.0124,      # positive = first-run is better
#     "brier_score": +0.0017,  # positive = first-run is slightly worse here (higher Brier)
#     "log_loss": +0.0049,     # positive = first-run is slightly worse here (higher log loss)
#     "winner": "first-run",
# }
```

!!! tip "Always compare, not just read"
    A leaderboard rank can be misleading when runs are close. `compare()` shows the per-metric deltas, so you can see exactly what improved and what didn't.
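To make the sign convention concrete, the deltas in the example output can be reproduced from the leaderboard values for `first-run` and `hgb_observed`. This is a sketch of the arithmetic only: `metric_deltas` is a hypothetical helper, not part of `rejecthon`, and it says nothing about how `compare()` picks its winner.

```python
# Higher is better for ROC-AUC; lower is better for Brier score and log loss.
HIGHER_IS_BETTER = {"roc_auc": True, "brier_score": False, "log_loss": False}


def metric_deltas(a, b):
    """Return (a - b, improved) for each metric, where improved is from a's view."""
    out = {}
    for name, higher_better in HIGHER_IS_BETTER.items():
        delta = round(a[name] - b[name], 4)
        improved = delta > 0 if higher_better else delta < 0
        out[name] = (delta, improved)
    return out


# Values copied from the example leaderboard.
first_run = {"roc_auc": 0.7843, "brier_score": 0.1621, "log_loss": 0.4892}
hgb_observed = {"roc_auc": 0.7719, "brier_score": 0.1604, "log_loss": 0.4843}

deltas = metric_deltas(first_run, hgb_observed)
for name, (delta, improved) in deltas.items():
    print(f"{name:12s} {delta:+.4f} {'better' if improved else 'worse'}")
```

A positive delta is good news only for ROC-AUC. For the two probability metrics, a positive delta means the first run is worse on that metric.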
Now try a second approach:
```python
from sklearn.ensemble import ExtraTreesClassifier
from rejecthon import submit, compare

second = submit(
    "extra-trees",
    scoring_model=ExtraTreesClassifier(n_estimators=300, random_state=42),
    notes="trying a different algorithm",
)

# How much did this actually improve?
result = compare("extra-trees", "first-run")
print(f"ROC-AUC delta vs first-run: {result['roc_auc']:+.4f}")
```

This is the core loop of the hackathon:

- Form a hypothesis: "more trees might help" / "better calibration might help" / "the inference step is the bottleneck"
- Submit with `notes=` so you remember what you changed
- `compare()` against the previous run: did it actually improve?
- Decide: keep pushing in this direction, or try something else
Scoring below the baseline happens to everyone, and it's useful information.
A run that scores below `hgb_observed` tells you that this approach is adding noise, not signal: the reject inference step is hurting more than helping.
When this happens:
- Check `compare()` carefully. Did ROC-AUC drop but Brier improve? The model may be trading ranking quality for probability quality, which is sometimes intentional.
- Try calibration. A classifier without proper probability calibration often hurts the inference step because it produces bad training targets.
- Simplify the inference model. An overpowered inference model can overfit on the observed population and produce overconfident probabilities for the rejected population.
- Reset and start fresh if you have many speculative runs cluttering the leaderboard:

```python
from rejecthon import reset

reset()  # removes all your runs, keeps hgb_observed and hgb_estimated
```

!!! warning "reset() is permanent"
    `reset()` deletes all non-baseline runs. It cannot be undone. Use `delete_run("name")` if you only want to remove one specific run.
```python
from rejecthon import report

stored = report(created["run_id"])
print(stored.metrics.roc_auc(data_source="test"))
```

The `report()` function returns a full skore `EstimatorReport`. You get calibration curves, ROC plots, and full metric breakdowns, which are useful for understanding why a model scored as it did, not just how.
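As a rough idea of what a calibration curve summarizes, the sketch below bins predicted probabilities and compares the mean prediction in each bin with the observed default rate. It is a plain-Python illustration on made-up data, not the skore implementation, and `reliability_table` is a hypothetical name.

```python
def reliability_table(y_true, y_prob, n_bins=5):
    """Group predictions into equal-width bins over [0, 1].

    Returns one (mean_predicted, observed_rate, count) tuple per non-empty bin.
    A well-calibrated model has mean_predicted close to observed_rate.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))
    table = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            obs_rate = sum(y for y, _ in members) / len(members)
            table.append((mean_pred, obs_rate, len(members)))
    return table


# Toy data: 1 = defaulted, 0 = repaid (the encoding is an assumption).
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.05, 0.15, 0.30, 0.35, 0.55, 0.65, 0.85, 0.95]

table = reliability_table(y_true, y_prob)
for mean_pred, obs_rate, count in table:
    print(f"predicted {mean_pred:.2f}  observed {obs_rate:.2f}  n={count}")
```

Large gaps between the predicted and observed columns suggest the probabilities are off even when the ranking, and therefore ROC-AUC, looks fine.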
While you're experimenting, run names are reusable after deletion:
```python
from rejecthon import delete_run, submit

delete_run("first-run")  # free the name

# Resubmit under the same name with a better model.
# MyNewModel is a placeholder for your own estimator class.
submit("first-run", scoring_model=MyNewModel())
```

Seeded baseline names (`hgb_observed`, `hgb_estimated`) cannot be deleted.
!!! tip "Track your hypothesis with notes"
    Pass `notes=` to record what you were testing:

    ```python
    submit(
        "extra-trees-300",
        scoring_model=ExtraTreesClassifier(n_estimators=300),
        notes="testing more trees, no calibration",
    )
    ```

    The note appears in the leaderboard and helps you remember what changed between runs.
Not sure what's already been submitted? Use preview():
```python
from rejecthon import preview

state = preview()
# {
#     "data_ready": True,
#     "baselines_present": ["hgb_observed", "hgb_estimated"],
#     "run_count": 3,
#     "top_run": {"name": "first-run", "roc_auc": 0.7843},
# }
```

| `delta_vs_observed` | Interpretation |
|---|---|
| > +0.01 | Clear win — reject inference is helping |
| 0 to +0.01 | Marginal improvement — check Brier and log loss too |
| ~0.00 | No benefit — reject inference isn't helping here |
| Negative | Reject inference is hurting — the extra noise outweighs the information |
Also check that brier_score and log_loss don't get significantly worse when roc_auc improves. A model that ranks well but produces bad probabilities is a liability in production.
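The interpretation table can be folded into a small helper for use in a notebook. This is a sketch: `interpret_delta` is a hypothetical function, and the `tolerance` used to decide what counts as "~0.00" is an assumed value, since the table does not define one.

```python
def interpret_delta(delta, tolerance=0.001):
    """Map delta_vs_observed to the bands in the interpretation table.

    tolerance is an assumed cutoff for treating a delta as roughly zero.
    """
    if abs(delta) <= tolerance:
        return "No benefit"
    if delta < 0:
        return "Reject inference is hurting"
    if delta > 0.01:
        return "Clear win"
    return "Marginal improvement"


for delta in (0.0124, 0.0082, 0.0, -0.005):
    print(f"{delta:+.4f} -> {interpret_delta(delta)}")
```

With the values from the example leaderboard, `first-run` (+0.0124) falls in the clear-win band and `hgb_estimated` (+0.0082) in the marginal band.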
```python
from rejecthon import submit, show

# One line to get on the board
created = submit("my-hypothesis")
print(f"ROC-AUC: {created['roc_auc']:.4f}")
print(f"vs baseline: {created['delta_vs_observed']:+.4f}")

# Color-coded leaderboard
show()
```

- The Three Populations: explore the dataset before modeling
- Submit a Model: bring your own estimator
- Try a New Algorithm: swap models systematically
- Read the Leaderboard: know if your result is real