AI Research Engineer — Python • Systems • EvalOps/RLHF Tooling
I build RLHF evaluation systems, multi-agent orchestration workflows, and production-minded tooling for LLM post-training.
Friday: "I should learn RLHF infrastructure"
Sunday night:
- ✅ 6 data quality detectors running on 160K preference pairs
- ✅ PostgreSQL pipeline storing every signal
- ✅ Trained two reward models (clean vs unfiltered data)
- ✅ Orchestrated 6 LLM agents in parallel to build an RLHF data-quality pipeline
- ✅ Started a GPT implementation from scratch

I don't do tutorials. I ship.
| Project | What It Does | Stack |
|---|---|---|
| RLHF Data Quality System | Detects problematic preference pairs in RLHF training data. Flagged 12,693 examples (7.9%) in Anthropic's HH-RLHF dataset. | PyTorch, PostgreSQL, sentence-transformers |
| GPT From Scratch | Transformer implementation from bigram → attention → CUDA kernels. Training on War and Peace, not TinyShakespeare. | PyTorch, CUDA |
| Multi-Agent Orchestration | First iteration of a multi-agent coding system: two Claude agents plus Gemini and Codex build production code in parallel under shared contracts and coordination protocols. I designed the orchestration and approval loop. | Claude, Gemini, Codex API, Python |

| Project | Description | Link |
|---|---|---|
| NLWeb (Microsoft Open Source) | Identified an explicitly documented CI/CD gap in NLWeb and implemented the pipeline (Ruff linting, mypy checks, pytest matrix, Docker validation, Dependabot automation). | PR #397 |

3 years as an RLHF contractor. Selected into Alignerr’s elite “Alignerrd” group of top-tier programmers after standout evaluation performance. I’ve created rubrics, graded models, and seen exactly how preference data breaks.
Now I build systems to catch those problems automatically.
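A minimal sketch of what two such checks could look like. It is illustrative only and is not the project's code: `PreferencePair`, the detector names, and the 0.95 threshold are assumptions, and it uses the standard library's `difflib` for text similarity rather than the embedding-based tooling listed in the stack.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class PreferencePair:
    """One preference example: a prompt plus a chosen and a rejected response."""
    prompt: str
    chosen: str
    rejected: str


def near_duplicate(pair: PreferencePair, threshold: float = 0.95) -> bool:
    """Flag pairs whose two responses are almost the same text.

    The threshold is an assumed value, not one taken from the project.
    """
    matcher = SequenceMatcher(None, pair.chosen, pair.rejected, autojunk=False)
    return matcher.ratio() >= threshold


def empty_response(pair: PreferencePair) -> bool:
    """Flag pairs where either response is blank after stripping whitespace."""
    return not pair.chosen.strip() or not pair.rejected.strip()


# Hypothetical registry mapping a detector name to its predicate.
DETECTORS = {
    "near_duplicate": near_duplicate,
    "empty_response": empty_response,
}


def flag(pair: PreferencePair) -> list[str]:
    """Return the name of every detector that fires on this pair."""
    return [name for name, check in DETECTORS.items() if check(pair)]
```

Each detector is a plain predicate, so adding a new signal means adding one function and one registry entry, and the list returned by `flag` can be stored per pair for later filtering.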
| Category | Tools |
|---|---|
| Languages | Python, TypeScript/JavaScript, Java, SQL |
| ML/LLM | PyTorch, Hugging Face, sentence-transformers |
| Backend | FastAPI, REST APIs, PostgreSQL, SQLite |
| Infra | Docker, GitHub Actions, Google Colab, Linux |
- Ship first, polish later
- Interfaces + invariants before implementation
- Tests that prove behavior
- Logging/metrics as first-class citizens
- If it takes more than a weekend, break it down

Research Engineer • Applied Evals • EvalOps • Data Quality Engineering • ML Systems

