Getting Started

Time to first leaderboard entry: ~30 minutes. By the end of this tutorial you will have two reference baselines, your own run, and enough intuition to know whether your result is any good.


Before you write a single model

First understand what you're predicting and who you're predicting it for. Skipping this step is why many hackathon runs head in the wrong direction.

The dataset: HELOC credit applications

The data is a real-world credit-scoring dataset: HELOC (Home Equity Line of Credit) applications. Each row is one loan application. The target column is RiskPerformance: Good means the applicant repaid on time; Bad means they defaulted.
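
To make the target concrete, here is a minimal sketch of turning RiskPerformance into a binary label. The toy rows and the choice of Bad as the positive class (1) are assumptions for illustration; the tutorial does not say how Rejecthon encodes the target internally.

```python
# Toy rows standing in for HELOC applications (values are illustrative only).
applications = [
    {"RiskPerformance": "Good"},
    {"RiskPerformance": "Bad"},
    {"RiskPerformance": "Good"},
    {"RiskPerformance": "Good"},
]

# Assumption: "Bad" is the positive class (1), so a score reads as risk of default.
TARGET_MAP = {"Good": 0, "Bad": 1}
y = [TARGET_MAP[row["RiskPerformance"]] for row in applications]

# Share of applicants who defaulted in this toy sample.
bad_rate = sum(y) / len(y)
print(y, f"bad rate = {bad_rate:.2f}")
```

With this encoding, a predicted probability is the probability of default, which is the reading used for the metric sketches in this tutorial.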

The three populations

Rejecthon splits the data into three groups with fundamentally different roles:

flowchart TB
    ALL["All applicants"] --> OBS["✅ observed\n~60%\nFunded. We know if they defaulted."]
    ALL --> REJ["❓ rejected\n~25%\nTurned away. Outcomes unknown."]
    ALL --> EVAL["🔒 evaluation\n~15%\nHeld out. Used only to score models."]
| Population | What it is | Your model sees it? |
|---|---|---|
| observed | Applicants with known default outcomes | ✅ Used for training |
| rejected | Applicants with hidden outcomes | ✅ Used for training (with inferred labels) |
| evaluation | Held-out set | ❌ Only used to score your final model |

The evaluation population is sacred. Nobody's model ever trains on it. This keeps the comparison honest.


Step 1: Install and open a Python session

uv sync

Then open a notebook or Python REPL:

from rejecthon import submit, leaderboard, show, preview

Step 2: Submit your first run

created = submit("first-run")

!!! note "What just happened?"
    The first `submit()` call does several things automatically:

1. Downloads and prepares the HELOC dataset (takes ~1 minute the first time)
2. Trains and stores the `hgb_observed` baseline (observed applicants only)
3. Trains and stores the `hgb_estimated` baseline (reject inference with a default model)
4. Trains **your** run using reject inference with the same default model
5. Scores all runs on the evaluation set and stores them on the leaderboard

Subsequent calls are fast — the dataset is already prepared.

The return value is a small dictionary:

print(created)
# {
#   "run_id":           "abc-123-...",
#   "roc_auc":          0.7843,
#   "brier_score":      0.1621,
#   "log_loss":         0.4892,
#   "delta_vs_observed": +0.0124,
#   "notes":            None,
# }

delta_vs_observed is the most important number. It tells you how much ROC-AUC you gained (positive) or lost (negative) compared to the honest observed-only baseline.


Step 3: Read the leaderboard

show()

This prints a color-coded table. Green means you're beating the baseline. 🏆 marks your leading run.

If you prefer a DataFrame for further analysis:

lb = leaderboard()
print(lb)

Annotated leaderboard output

┌────────────────┬──────────┬────────┬──────────┬─────────────┬──────────┬──────┐
│ Name           │ ROC-AUC  │ Brier  │ Log Loss │ vs Observed │ Baseline │ Notes│
├────────────────┼──────────┼────────┼──────────┼─────────────┼──────────┼──────┤
│ first-run 🏆   │ 0.7843   │ 0.1621 │ 0.4892   │ +0.0124     │          │      │
│ hgb_observed   │ 0.7719   │ 0.1604 │ 0.4843   │ 0.0000      │    ✓     │      │
│ hgb_estimated  │ 0.7801   │ 0.1617 │ 0.4871   │ +0.0082     │    ✓     │      │
└────────────────┴──────────┴────────┴──────────┴─────────────┴──────────┴──────┘
| Column | What to look for |
|---|---|
| vs Observed | Green = beating the control. Your primary goal. |
| ROC-AUC | Higher is better. The ranking metric. |
| Brier / Log Loss | Lower is better. Checks probability quality. |
| Baseline | ✓ marks the seeded reference runs, which are never retrained. |
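
For intuition about what the three metric columns measure, here is a small from-scratch sketch using their standard textbook definitions. The four labels and probabilities are made up, and 1 is assumed to mean Bad; Rejecthon's own scoring code may be implemented differently.

```python
import math

y_true = [0, 0, 1, 1]            # assumption: 1 = Bad (default), 0 = Good
y_prob = [0.1, 0.4, 0.35, 0.8]   # predicted probability of Bad

def roc_auc(y_true, y_prob):
    """Fraction of (Bad, Good) pairs where the Bad applicant gets the higher score."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)  # ties count half
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

def log_loss(y_true, y_prob):
    """Mean negative log-likelihood; confident wrong predictions are punished hardest."""
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_prob))
    return -total / len(y_true)

print(f"ROC-AUC:  {roc_auc(y_true, y_prob):.4f}")      # 0.7500
print(f"Brier:    {brier_score(y_true, y_prob):.4f}")  # 0.1581
print(f"Log loss: {log_loss(y_true, y_prob):.4f}")     # 0.4723
```

ROC-AUC depends only on the ordering of the scores, while Brier and log loss depend on the probability values themselves. That is why a run can improve one while getting worse on the others.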

Step 4: Try something different — and compare it

You now have one run on the board. The useful question is not "is 0.7843 good?" — it's "is it better than the baseline, and by how much?"

from rejecthon import compare

result = compare("first-run", "hgb_observed")
print(result)
# {
#   "name_a":    "first-run",
#   "name_b":    "hgb_observed",
#   "roc_auc":   +0.0124,      # positive = first-run is better
#   "brier_score": +0.0017,    # positive = first-run is slightly worse here (higher Brier)
#   "log_loss":  +0.0049,      # positive = first-run is slightly worse here (higher log loss)
#   "winner":    "first-run",
# }

!!! tip "Always compare, not just read"
    A leaderboard rank can be misleading when runs are close. `compare()` shows the per-metric deltas, so you can see exactly what improved and what didn't.
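
The sign conventions in the compare() output are easy to misread, so here is a sketch of the arithmetic using the example leaderboard numbers. `compare_metrics` is a hypothetical helper, not the rejecthon API, and deciding the winner on ROC-AUC alone is an assumption.

```python
# Metrics copied from the example leaderboard in this tutorial.
first_run = {"roc_auc": 0.7843, "brier_score": 0.1621, "log_loss": 0.4892}
hgb_observed = {"roc_auc": 0.7719, "brier_score": 0.1604, "log_loss": 0.4843}

# Direction of "better" per metric: +1 means higher is better, -1 means lower is better.
DIRECTION = {"roc_auc": +1, "brier_score": -1, "log_loss": -1}

def compare_metrics(a, b):
    """Hypothetical helper: deltas are a minus b, plus a per-metric 'a improved' flag."""
    deltas = {k: round(a[k] - b[k], 4) for k in DIRECTION}
    improved = {k: deltas[k] * DIRECTION[k] > 0 for k in DIRECTION}
    return deltas, improved

deltas, improved = compare_metrics(first_run, hgb_observed)
for metric in DIRECTION:
    verdict = "better" if improved[metric] else "not better"
    print(f"{metric:<12} {deltas[metric]:+.4f}  ({verdict})")

# Assumption: the winner is decided on ROC-AUC alone.
winner = "first-run" if improved["roc_auc"] else "hgb_observed"
print("winner:", winner)
```

Here all three deltas are positive, but only the ROC-AUC delta is an improvement: the Brier score and log loss both rose slightly, which means slightly worse probabilities.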

Now try a second approach:

from sklearn.ensemble import ExtraTreesClassifier
from rejecthon import submit, compare

second = submit(
    "extra-trees",
    scoring_model=ExtraTreesClassifier(n_estimators=300, random_state=42),
    notes="trying a different algorithm",
)

# How much did this actually improve?
result = compare("extra-trees", "first-run")
print(f"ROC-AUC delta vs first-run: {result['roc_auc']:+.4f}")

This is the core loop of the hackathon:

  1. Form a hypothesis — "more trees might help" / "better calibration might help" / "the inference step is the bottleneck"
  2. Submit with notes= — so you remember what you changed
  3. compare() against the previous run — did it actually improve?
  4. Decide — keep pushing in this direction, or try something else

What to do when a run is worse

This happens to everyone, and it's useful information.

A run that scores below hgb_observed tells you: this approach is adding noise, not signal. The reject inference step is hurting more than helping.

When this happens:

  • Check compare() carefully. Did ROC-AUC drop but Brier improve? The model may be trading ranking quality for probability quality — sometimes intentional.
  • Try calibration. An uncalibrated classifier often hurts the inference step, because its distorted probabilities become the training targets for the rejected population.
  • Simplify the inference model. An overpowered inference model can overfit to the observed population and produce overconfident probabilities for the rejected population.
  • Reset and start fresh if you have many speculative runs cluttering the leaderboard:
from rejecthon import reset

reset()  # removes all your runs, keeps hgb_observed and hgb_estimated

!!! warning "reset() is permanent"
    `reset()` deletes all non-baseline runs. It cannot be undone. Use `delete_run("name")` if you only want to remove one specific run.
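
The calibration suggestion in the list above can be sketched with scikit-learn's `CalibratedClassifierCV`. The synthetic data, the sigmoid method, and the 5-fold setting are assumptions chosen for illustration. Whether calibration helps on HELOC is something to test, not something this sketch shows.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: y == 1 means "Bad").
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Uncalibrated model, as in the "extra-trees" run.
raw = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Same model wrapped in cross-validated sigmoid (Platt) calibration.
calibrated = CalibratedClassifierCV(
    ExtraTreesClassifier(n_estimators=100, random_state=42),
    method="sigmoid",
    cv=5,
).fit(X_train, y_train)

# Compare probability quality on held-out rows; lower Brier is better.
for name, model in [("raw", raw), ("calibrated", calibrated)]:
    prob_bad = model.predict_proba(X_test)[:, 1]
    print(f"{name:>10}: Brier = {brier_score_loss(y_test, prob_bad):.4f}")
```

If submit() accepts any scikit-learn classifier as scoring_model, as the ExtraTreesClassifier example suggests, the calibrated wrapper could be passed the same way. The tutorial does not confirm this, so treat it as something to try.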


Step 5: Inspect why a run scored the way it did

from rejecthon import report

stored = report(created["run_id"])
print(stored.metrics.roc_auc(data_source="test"))

The report() function returns a full skore EstimatorReport. You get calibration curves, ROC plots, and full metric breakdowns — useful for understanding why a model scored as it did, not just how.
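
To read a calibration curve, it helps to know what it plots. The sketch below builds the underlying table from scratch: predictions are grouped into equal-width bins, and each bin's mean predicted probability is compared with its observed Bad rate. The data is made up and the function name is hypothetical; skore computes its own curves.

```python
def reliability_table(y_true, y_prob, n_bins=5):
    """Return (mean predicted, observed Bad rate, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[i].append((t, p))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            bad_rate = sum(t for t, _ in members) / len(members)
            rows.append((mean_pred, bad_rate, len(members)))
    return rows

# Toy data (assumption: 1 = Bad); probabilities are predicted risk of Bad.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.05, 0.15, 0.30, 0.35, 0.55, 0.65, 0.85, 0.95]

rows = reliability_table(y_true, y_prob)
for mean_pred, bad_rate, count in rows:
    print(f"predicted {mean_pred:.2f}  observed {bad_rate:.2f}  n={count}")
```

A well-calibrated model has predicted and observed values close to each other in every bin. Large gaps at the high or low end are the overconfidence that hurts Brier score and log loss.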


Step 6: Delete, refine, resubmit

While you're experimenting, run names are reusable after deletion:

from rejecthon import delete_run, submit

delete_run("first-run")          # free the name
submit("first-run",              # resubmit under the same name with a better model
       scoring_model=MyNewModel())  # MyNewModel is a placeholder for your own estimator

Seeded baseline names (hgb_observed, hgb_estimated) cannot be deleted.

!!! tip "Track your hypothesis with notes"
    Pass `notes=` to record what you were testing:

    ```python
    submit(
        "extra-trees-300",
        scoring_model=ExtraTreesClassifier(n_estimators=300),
        notes="testing more trees, no calibration",
    )
    ```

    The note appears in the leaderboard and helps you remember what changed between runs.

Step 7: Check workspace state at any time

Not sure what's already been submitted? Use preview():

from rejecthon import preview

state = preview()
# {
#   "data_ready":       True,
#   "baselines_present": ["hgb_observed", "hgb_estimated"],
#   "run_count":        3,
#   "top_run":          {"name": "first-run", "roc_auc": 0.7843},
# }

What a good result looks like

| delta_vs_observed | Interpretation |
|---|---|
| > +0.01 | Clear win: reject inference is helping |
| 0 to +0.01 | Marginal improvement: check Brier and log loss too |
| ~0.00 | No benefit: reject inference isn't helping here |
| Negative | Reject inference is hurting: the extra noise outweighs the information |

Also check that brier_score and log_loss don't get significantly worse when roc_auc improves. A model that ranks well but produces bad probabilities is a liability in production.
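
The interpretation bands above can be written as a small helper. `interpret_delta` is a hypothetical function, not part of rejecthon, and the tolerance used for "~0.00" is an assumption: the tutorial does not define how close to zero counts as no benefit.

```python
def interpret_delta(delta_vs_observed, tol=0.0005):
    """Map a ROC-AUC delta against hgb_observed to the tutorial's interpretation bands.

    tol is an assumed cutoff for "~0.00"; adjust it to taste.
    """
    if abs(delta_vs_observed) < tol:
        return "No benefit"
    if delta_vs_observed < 0:
        return "Reject inference is hurting"
    if delta_vs_observed > 0.01:
        return "Clear win"
    return "Marginal improvement"

# Deltas taken from the example leaderboard, plus two made-up cases.
for name, delta in [("first-run", 0.0124), ("hgb_estimated", 0.0082),
                    ("flat-run", 0.0), ("noisy-run", -0.0030)]:
    print(f"{name:<14} {delta:+.4f}  {interpret_delta(delta)}")
```

A single ROC-AUC delta on one evaluation set is itself noisy, so treat the band boundaries as rules of thumb, not hard thresholds.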


The fast path (summary)

from rejecthon import submit, show

# One line to get on the board
created = submit("my-hypothesis")
print(f"ROC-AUC: {created['roc_auc']:.4f}")
print(f"vs baseline: {created['delta_vs_observed']:+.4f}")

# Color-coded leaderboard
show()

Next steps