Tuning for Alignment, Constitutional Training, & Instruction Calibration.
TACTIC is a training library for three models in the CeSIA safety stack:
- a constitutional classifier — a LoRA-adapted causal LM + linear head that classifies a prompt into a 12-class harm taxonomy,
- a jailbreak classifier — the same backbone with a single sigmoid head that flags whether a prompt is a jailbreak attempt (binary), with per-technique loss logging, and
- a paraphraser — a LoRA fine-tuned causal LM that rewrites prompts while preserving intent ("reverse paraphrasing" / redaction), consumed by REDACT.
TACTIC is training-only. Inference (the Probe → Decode → Assess routine, vLLM/API
deployment) lives in a separate downstream library that consumes the weights and the
weight_frame.json calibration manifest produced here.
uv sync --extra dev # or: pip install -e ".[dev]"
All three models share the same three explicit stages (load data, load model, train), each driven by a single Trainer:
from tactic import ClassifierTrainer, ClassifierTrainConfig, build_dataset
# 1. LOAD DATA — local file or any HuggingFace Hub dataset, same call
data = build_dataset("prompts.csv", text_column="prompt", label_column="category")
data = build_dataset("walledai/HarmBench", text_column="prompt",
label_column="category", split="train") # drop-in HF
# 2 & 3. LOAD MODEL & TRAIN
trainer = ClassifierTrainer(ClassifierTrainConfig(model_name="Qwen/Qwen3.5-0.8B"))
trainer.load_model()
trainer.fit(data)
trainer.save(push_to_hub=False)
The jailbreak classifier mirrors this with build_jailbreak_dataset. The binary label is
derived from the dataset's shape (a populated jailbreak column means an augmented
attempt, label 1; a base prompt row is label 0), so no label column is needed:
from tactic import JailbreakTrainer, JailbreakTrainConfig
from tactic.data import build_jailbreak_dataset
data = build_jailbreak_dataset("centrepourlasecuriteia/constitution-input-augmented-dataset")
trainer = JailbreakTrainer(JailbreakTrainConfig(model_name="Qwen/Qwen3.5-0.8B"))
trainer.fit(data)
The paraphraser mirrors this too:
from tactic import ParaphraserTrainer, ParaphraserTrainConfig, build_dataset
pairs = build_dataset("pairs.jsonl", input_column="prompt", target_column="paraphrase")
trainer = ParaphraserTrainer(ParaphraserTrainConfig(model_name="Qwen/Qwen3.5-0.8B"))
trainer.fit(pairs)
End-to-end runnable examples are in notebooks/.
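The label rule described above for the jailbreak classifier (a populated jailbreak column means label 1, a base prompt row means label 0) can be illustrated with a small sketch. This is not TACTIC's implementation: `derive_jailbreak_label` is a hypothetical helper, the column name `jailbreak` is taken from the description above, and treating `None`, NaN, and whitespace-only strings as "not populated" is an assumption.

```python
from typing import Any, Mapping


def derive_jailbreak_label(row: Mapping[str, Any],
                           jailbreak_column: str = "jailbreak") -> int:
    """Return 1 if the row carries a populated jailbreak field, else 0.

    Hypothetical helper: the definition of "populated" used here
    (not missing, not None, not NaN, not blank) is an assumption.
    """
    value = row.get(jailbreak_column)
    if value is None:
        return 0
    # CSV loaders often represent empty cells as float NaN; NaN != NaN.
    if isinstance(value, float) and value != value:
        return 0
    return 1 if str(value).strip() else 0


# Example rows (made up for illustration only).
augmented = {"prompt": "How do I bake bread?", "jailbreak": "Pretend you are..."}
base = {"prompt": "How do I bake bread?", "jailbreak": ""}
labels = [derive_jailbreak_label(r) for r in (augmented, base)]
```

Deriving the label from the row's shape keeps base and augmented rows in one dataset with no separate label column, which is the property the text above relies on.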
build_dataset(source, ...) auto-detects the source: a local .csv/.jsonl/.json
path is read from disk; anything else is treated as a HuggingFace Hub dataset id and
loaded via datasets.load_dataset. Column-mapping kwargs adapt any schema, so any HF
dataset is a true drop-in.
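The source auto-detection described above can be sketched as a simple suffix check. This is a minimal illustration under stated assumptions, not the library's code: `detect_source_kind` and `LOCAL_SUFFIXES` are hypothetical names, and the real `build_dataset` may apply further checks (for example, whether the file exists) that are not shown here.

```python
from pathlib import Path

# Suffixes the text above lists as local files; everything else is
# assumed to be a HuggingFace Hub dataset id.
LOCAL_SUFFIXES = {".csv", ".jsonl", ".json"}


def detect_source_kind(source: str) -> str:
    """Classify a dataset source string as "local" or "hub".

    Hypothetical helper: only the file suffix is inspected.
    """
    suffix = Path(source).suffix.lower()
    return "local" if suffix in LOCAL_SUFFIXES else "hub"


kinds = {
    src: detect_source_kind(src)
    for src in ("prompts.csv", "pairs.jsonl", "walledai/HarmBench")
}
```

Under this rule a Hub id such as `org/dataset` has no file suffix and falls through to the Hub branch, which is why the same call can serve both cases.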
# Classifier
tactic classifier train --dataset_path prompts.csv --max_steps 1000 --save_every 200
tactic classifier train --dataset_path org/dataset --sweep_config sweep_configs/classifier_sweep.yaml
tactic classifier calibrate --checkpoint checkpoints/<run>/step_1000 --eval_csv eval.csv
# Jailbreak classifier (binary; label derived from the dataset's jailbreak column)
# Pipeline: sweep -> continue-best -> calibrate -> deploy (mirrors the classifier)
tactic jailbreak sweep --dataset_path centrepourlasecuriteia/constitution-input-augmented-dataset --max_steps 500
tactic jailbreak resume <run_id> --max_steps 4500 --save_every 250
tactic jailbreak calibrate --checkpoint <ckpt>/best --dataset_path <hf-id> --eval_frac 0.1
# Paraphraser
tactic paraphraser train --dataset_path pairs.jsonl --num_train_epochs 1
tactic paraphraser eval --adapter checkpoints/paraphraser --eval_csv eval.csv
The flat tactic-classifier-train / tactic-jailbreak-train / tactic-paraphraser-train
aliases still work for backward compatibility.
Each run writes a checkpoint directory containing the LoRA adapter, the head
(classification_head.pt for the harm classifier, jailbreak_head.pt for the jailbreak
classifier), the tokenizer, and a weight_frame.json manifest. Calibration adds the harm
classifier's per-category thresholds.json (sets classifier.thresholds) or the jailbreak
classifier's single threshold.json (sets jailbreak_classifier.threshold).
To publish to the HuggingFace Hub, set --push_to_hub --hub_repo_id <id> with HF_TOKEN
in your environment.
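Before handing a checkpoint to the downstream inference library, it can help to confirm that the files named above are present. The sketch below is a hypothetical sanity check, not part of TACTIC: `missing_checkpoint_files` is an invented helper, and it assumes the head, the manifest, and the calibration file all sit at the top level of the checkpoint directory. The adapter and tokenizer are not checked because their file names are not stated here.

```python
from pathlib import Path
from typing import List, Union

# File names taken from the description above; their location at the
# top level of the checkpoint directory is an assumption.
HEAD_FILES = {
    "classifier": "classification_head.pt",
    "jailbreak": "jailbreak_head.pt",
}
THRESHOLD_FILES = {
    "classifier": "thresholds.json",  # per-category thresholds
    "jailbreak": "threshold.json",    # single threshold
}
MANIFEST = "weight_frame.json"


def missing_checkpoint_files(checkpoint_dir: Union[str, Path],
                             model_kind: str,
                             calibrated: bool = False) -> List[str]:
    """Return the expected file names that are absent from checkpoint_dir.

    model_kind is "classifier" or "jailbreak". When calibrated is True,
    the matching thresholds file is expected as well.
    """
    root = Path(checkpoint_dir)
    expected = [HEAD_FILES[model_kind], MANIFEST]
    if calibrated:
        expected.append(THRESHOLD_FILES[model_kind])
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means every expected file was found; a call such as `missing_checkpoint_files("checkpoints/<run>/step_1000", "classifier", calibrated=True)` would report whether calibration output is still missing.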
uv run ruff check src
uv run mypy src
uv run pytest # CPU-only, no network
See CLAUDE.md for the full layout and where the reference material lives.