Getting Started

Time to first leaderboard entry: ~30 minutes. By the end of this tutorial you will have two reference baselines, your own run, and enough intuition to know whether your result is any good.

Before you write a single model

You need to understand what you're predicting and who you're predicting it for. Skipping this step is how most hackathon runs end up heading in the wrong direction.

The dataset: HELOC credit applications

The data is a real-world credit scoring dataset (HELOC: Home Equity Line of Credit). Each row is a loan application. The target column is RiskPerformance: Good means the applicant repaid on time; Bad means they defaulted.
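
As a minimal sketch of what that target looks like in code, the snippet below encodes RiskPerformance as a binary label on a toy DataFrame. The feature column name and the choice of Bad as the positive class (1) are assumptions made for illustration; Rejecthon may encode the target differently.

```python
import pandas as pd

# Toy stand-in for a few HELOC rows; "feature_a" is a hypothetical column name.
applications = pd.DataFrame(
    {
        "feature_a": [72, 55, 81, 60],
        "RiskPerformance": ["Good", "Bad", "Good", "Bad"],
    }
)

# Assumption: treat "Bad" (default) as the positive class, encoded as 1.
applications["is_bad"] = (applications["RiskPerformance"] == "Bad").astype(int)

bad_rate = applications["is_bad"].mean()
print(applications)
print(f"Default rate: {bad_rate:.2f}")
```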

The three populations

Rejecthon splits the data into three groups with fundamentally different roles:

flowchart TB
    ALL["All applicants"] --> OBS["✅ observed\n~60%\nFunded. We know if they defaulted."]
    ALL --> REJ["❓ rejected\n~25%\nTurned away. Outcomes unknown."]
    ALL --> EVAL["🔒 evaluation\n~15%\nHeld out. Used only to score models."]

| Population | What it is | Your model sees it? |
|---|---|---|
| `observed` | Applicants with known default outcomes | ✅ Used for training |
| `rejected` | Applicants with hidden outcomes | ✅ Used for training (with inferred labels) |
| `evaluation` | Held-out set | ❌ Only used to score your final model |

The evaluation population is sacred. Nobody's model ever trains on it. This keeps the comparison honest.
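
The roles of the three populations can be sketched on synthetic data. This is an illustration only: the random 60/25/15 assignment, the column names, and the synthetic outcomes are all assumptions, and a real lender rejects applicants according to a policy rather than at random.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Synthetic applicants with a binary outcome (1 = defaulted).
applicants = pd.DataFrame(
    {
        "score": rng.normal(size=n),
        "defaulted": rng.integers(0, 2, size=n).astype(float),
    }
)

# Assumption: a purely random 60/25/15 assignment, for illustration only.
applicants["population"] = rng.choice(
    ["observed", "rejected", "evaluation"], size=n, p=[0.60, 0.25, 0.15]
)

# Outcomes of rejected applicants are never seen by the modeler.
applicants.loc[applicants["population"] == "rejected", "defaulted"] = np.nan

# Training may use observed and rejected rows; evaluation rows are held out.
train = applicants[applicants["population"] != "evaluation"]
holdout = applicants[applicants["population"] == "evaluation"]
print(applicants["population"].value_counts(normalize=True).round(2))
```

The key point is the last two assignments: nothing from `holdout` ever reaches training, and the rejected rows in `train` carry no labels until an inference step supplies them.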

Step 1: Install and open a Python session

uv sync

Then open a notebook or Python REPL:

from rejecthon import submit, leaderboard, show, preview

Step 2: Submit your first run

created = submit("first-run")

!!! note "What just happened?"
    The first `submit()` call does several things automatically:

    1. Downloads and prepares the HELOC dataset (takes ~1 minute the first time)
    2. Trains and stores the `hgb_observed` baseline (observed applicants only)
    3. Trains and stores the `hgb_estimated` baseline (reject inference with a default model)
    4. Trains **your** run using reject inference with the same default model
    5. Scores all runs on the evaluation set and stores them on the leaderboard

Subsequent calls are fast — the dataset is already prepared.

The return value is a small dictionary:

print(created)
# {
#   "run_id":           "abc-123-...",
#   "roc_auc":          0.7843,
#   "brier_score":      0.1621,
#   "log_loss":         0.4892,
#   "delta_vs_observed": +0.0124,
#   "notes":            None,
# }

delta_vs_observed is the most important number. It tells you how much ROC-AUC you gained (positive) or lost (negative) compared to the observed-only baseline (hgb_observed), which trains without any inferred labels.

Step 3: Read the leaderboard

show()

This prints a color-coded table. Green means you're beating the baseline. 🏆 marks your leading run.

If you prefer a DataFrame for further analysis:

lb = leaderboard()
print(lb)

Annotated leaderboard output

┌────────────────┬──────────┬────────┬──────────┬─────────────┬──────────┬──────┐
│ Name           │ ROC-AUC  │ Brier  │ Log Loss │ vs Observed │ Baseline │ Notes│
├────────────────┼──────────┼────────┼──────────┼─────────────┼──────────┼──────┤
│ first-run 🏆   │ 0.7843   │ 0.1621 │ 0.4892   │ +0.0124     │          │      │
│ hgb_observed   │ 0.7719   │ 0.1604 │ 0.4843   │ 0.0000      │    ✓     │      │
│ hgb_estimated  │ 0.7801   │ 0.1617 │ 0.4871   │ +0.0082     │    ✓     │      │
└────────────────┴──────────┴────────┴──────────┴─────────────┴──────────┴──────┘

| Column | What to look for |
|---|---|
| vs Observed | Green = beating the control. Your primary goal. |
| ROC-AUC | Higher is better. The ranking metric. |
| Brier / Log Loss | Lower is better. Checks probability quality. |
| Baseline | ✓ marks the seeded reference runs, which are never retrained. |
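
To make the three metric columns concrete, here is a minimal sketch that computes them with scikit-learn on four made-up predictions. The arrays are illustrative, and the encoding (1 = defaulted) is an assumption rather than something confirmed about Rejecthon's internals.

```python
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score

y_true = [0, 0, 1, 1]               # 1 = defaulted (assumed encoding)
p_default = [0.1, 0.4, 0.35, 0.8]   # predicted probability of default

roc_auc = roc_auc_score(y_true, p_default)    # ranking quality, higher is better
brier = brier_score_loss(y_true, p_default)   # mean squared error of the probabilities
ll = log_loss(y_true, p_default)              # penalizes confident mistakes heavily

print(f"ROC-AUC: {roc_auc:.4f}  Brier: {brier:.4f}  Log loss: {ll:.4f}")
```

ROC-AUC depends only on the ordering of the predictions, while Brier and log loss depend on the probability values themselves, which is why a run can improve on one and get worse on the others.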

Step 4: Try something different — and compare it

You now have one run on the board. The useful question is not "is 0.7843 good?" — it's "is it better than the baseline, and by how much?"

from rejecthon import compare

result = compare("first-run", "hgb_observed")
print(result)
# {
#   "name_a":    "first-run",
#   "name_b":    "hgb_observed",
#   "roc_auc":   +0.0124,      # positive = first-run is better
#   "brier_score": +0.0017,    # positive = first-run is slightly worse here (higher Brier)
#   "log_loss":  +0.0049,      # positive = first-run is slightly worse here (higher log loss)
#   "winner":    "first-run",
# }

!!! tip "Always compare, not just read"
    A leaderboard rank can be misleading when runs are close. `compare()` shows the per-metric deltas, so you can see exactly what improved and what didn't.
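
The deltas in the example output are plain differences (run A minus run B), and their signs mean different things per metric. The sketch below reproduces that arithmetic from the leaderboard values; `metric_deltas` is a hypothetical helper written for this illustration, not part of the rejecthon API.

```python
from typing import Dict


def metric_deltas(run_a: Dict[str, float], run_b: Dict[str, float]) -> Dict[str, float]:
    """Return run_a minus run_b for every metric the two runs share."""
    return {name: round(run_a[name] - run_b[name], 4) for name in run_a if name in run_b}


# Metric values taken from the annotated leaderboard output.
first_run = {"roc_auc": 0.7843, "brier_score": 0.1621, "log_loss": 0.4892}
hgb_observed = {"roc_auc": 0.7719, "brier_score": 0.1604, "log_loss": 0.4843}

deltas = metric_deltas(first_run, hgb_observed)

# Sign conventions differ: a positive ROC-AUC delta is good,
# while positive Brier and log-loss deltas mean run A is worse.
improved = {
    "roc_auc": deltas["roc_auc"] > 0,
    "brier_score": deltas["brier_score"] < 0,
    "log_loss": deltas["log_loss"] < 0,
}
print(deltas)
print(improved)
```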

Now try a second approach:

from sklearn.ensemble import ExtraTreesClassifier
from rejecthon import submit, compare

second = submit(
    "extra-trees",
    scoring_model=ExtraTreesClassifier(n_estimators=300, random_state=42),
    notes="trying a different algorithm",
)

# How much did this actually improve?
result = compare("extra-trees", "first-run")
print(f"ROC-AUC delta vs first-run: {result['roc_auc']:+.4f}")

This is the core loop of the hackathon:

1. Form a hypothesis: "more trees might help" / "better calibration might help" / "the inference step is the bottleneck"
2. Submit with notes=, so you remember what you changed
3. compare() against the previous run: did it actually improve?
4. Decide: keep pushing in this direction, or try something else

What to do when a run is worse

This happens to everyone, and it's useful information.

A run that scores below hgb_observed tells you: this approach is adding noise, not signal. The reject inference step is hurting more than helping.

When this happens:

1. Check compare() carefully. Did ROC-AUC drop but Brier improve? The model may be trading ranking quality for probability quality, which is sometimes intentional.
2. Try calibration. A classifier without proper probability calibration often hurts the inference step because it produces bad training targets.
3. Simplify the inference model. An overpowered inference model can overfit to the observed population and produce overconfident probabilities for the rejected population.
4. Reset and start fresh if you have many speculative runs cluttering the leaderboard:

from rejecthon import reset

reset()  # removes all your runs, keeps hgb_observed and hgb_estimated

!!! warning "reset() is permanent"
    `reset()` deletes all non-baseline runs and cannot be undone. Use `delete_run("name")` if you only want to remove one specific run.
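
On the calibration suggestion in the list above, one possible approach is to wrap a classifier in scikit-learn's `CalibratedClassifierCV`. The sketch below uses synthetic data, and the choice of isotonic calibration with 3 folds is an assumption, not a recommendation from the Rejecthon authors.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw_model = ExtraTreesClassifier(n_estimators=100, random_state=42)

# Wrap the classifier so its probabilities are recalibrated with
# cross-validated isotonic regression (the wrapper clones raw_model).
calibrated_model = CalibratedClassifierCV(raw_model, method="isotonic", cv=3)

raw_model.fit(X_train, y_train)
calibrated_model.fit(X_train, y_train)

proba_cal = calibrated_model.predict_proba(X_test)
brier_raw = brier_score_loss(y_test, raw_model.predict_proba(X_test)[:, 1])
brier_cal = brier_score_loss(y_test, proba_cal[:, 1])
print(f"Brier raw: {brier_raw:.4f}  calibrated: {brier_cal:.4f}")

# Assumption: because the wrapper is itself a scikit-learn classifier, it could
# presumably be passed to submit() as scoring_model=calibrated_model.
```

Calibration does not always lower the Brier score, so compare both numbers before committing to it.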

Step 5: Inspect why a run scored the way it did

from rejecthon import report

stored = report(created["run_id"])
print(stored.metrics.roc_auc(data_source="test"))

The report() function returns a full skore EstimatorReport. You get calibration curves, ROC plots, and full metric breakdowns, which help you understand why a model scored as it did and not just what it scored.

Step 6: Delete, refine, resubmit

While you're experimenting, run names are reusable after deletion:

from rejecthon import delete_run, submit

delete_run("first-run")          # free the name
submit("first-run",              # resubmit with a better model
       scoring_model=MyNewModel())  # MyNewModel is a placeholder for your own estimator

Seeded baseline names (hgb_observed, hgb_estimated) cannot be deleted.

!!! tip "Track your hypothesis with notes"
    Pass `notes=` to record what you were testing:

    ```python
    from sklearn.ensemble import ExtraTreesClassifier
    from rejecthon import submit

    submit(
        "extra-trees-300",
        scoring_model=ExtraTreesClassifier(n_estimators=300),
        notes="testing more trees, no calibration",
    )
    ```

    The note appears in the leaderboard and helps you remember what changed between runs.

Step 7: Check workspace state at any time

Not sure what's already been submitted? Use preview():

from rejecthon import preview

state = preview()
# {
#   "data_ready":       True,
#   "baselines_present": ["hgb_observed", "hgb_estimated"],
#   "run_count":        3,
#   "top_run":          {"name": "first-run", "roc_auc": 0.7843},
# }

What a good result looks like

| `delta_vs_observed` | Interpretation |
|---|---|
| > +0.01 | Clear win: reject inference is helping |
| 0 to +0.01 | Marginal improvement: check Brier and log loss too |
| ~0.00 | No benefit: reject inference isn't helping here |
| Negative | Reject inference is hurting: the extra noise outweighs the information |

Also check that brier_score and log_loss don't get significantly worse when roc_auc improves. A model that ranks well but produces bad probabilities is a liability in production.
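
The bands in the table can be turned into a small helper for quick triage. `interpret_delta` is a hypothetical function written for this illustration, and the tolerance that decides what counts as "~0.00" is an assumed value, since the table does not define it.

```python
def interpret_delta(delta: float, tolerance: float = 0.001) -> str:
    """Map delta_vs_observed to the interpretation bands.

    `tolerance` (how close to zero counts as "~0.00") is an assumed value.
    """
    if abs(delta) <= tolerance:
        return "No benefit"
    if delta < 0:
        return "Reject inference is hurting"
    if delta > 0.01:
        return "Clear win"
    return "Marginal improvement"


for value in (0.0124, 0.0082, 0.0004, -0.0050):
    print(f"{value:+.4f} -> {interpret_delta(value)}")
```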

The fast path (summary)

from rejecthon import submit, show

# One line to get on the board
created = submit("my-hypothesis")
print(f"ROC-AUC: {created['roc_auc']:.4f}")
print(f"vs baseline: {created['delta_vs_observed']:+.4f}")

# Color-coded leaderboard
show()

Next steps

- The Three Populations: explore the dataset before modeling
- Submit a Model: bring your own estimator
- Try a New Algorithm: swap models systematically
- Read the Leaderboard: know if your result is real
