TACTIC

Tuning for Alignment, Constitutional Training, & Instruction Calibration.

TACTIC is a training library for three models in the CeSIA safety stack:

a constitutional classifier — a LoRA-adapted causal LM + linear head that classifies a prompt into a 12-class harm taxonomy,
a jailbreak classifier — the same backbone with a single sigmoid head that flags whether a prompt is a jailbreak attempt (binary), with per-technique loss logging, and
a paraphraser — a LoRA fine-tuned causal LM that rewrites prompts while preserving intent ("reverse paraphrasing" / redaction), consumed by REDACT.
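The two classifier heads above differ mainly in output shape. The following is a minimal pure-Python sketch of that difference, not TACTIC's implementation: the toy hidden size, the random weights, and the helper names (`linear_head`, `sigmoid`) are all assumptions for illustration. How the 12 harm logits are normalized is not stated here, so the sketch leaves them as raw logits.

```python
import math
import random
from typing import List

HIDDEN_SIZE = 8        # toy size (assumption); a real backbone is far wider
NUM_HARM_CLASSES = 12  # from the text: 12-class harm taxonomy


def linear_head(hidden: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Apply a linear head: one output logit per row of `weights`."""
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weights, bias)]


def sigmoid(x: float) -> float:
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(0)
# Stand-in for the backbone's pooled representation of one prompt.
hidden = [rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]

# Harm classifier head: 12 rows, so 12 logits (one per category).
harm_w = [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_HARM_CLASSES)]
harm_logits = linear_head(hidden, harm_w, [0.0] * NUM_HARM_CLASSES)

# Jailbreak head: a single row, so one logit passed through a sigmoid.
jb_w = [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]]
jailbreak_prob = sigmoid(linear_head(hidden, jb_w, [0.0])[0])
```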

TACTIC is training-only. Inference (the Probe → Decode → Assess routine, vLLM/API deployment) lives in a separate downstream library that consumes the weights and the weight_frame.json calibration manifest produced here.

Install

uv sync --extra dev          # or: pip install -e ".[dev]"

The pipeline: load → train → save

All three models share the same explicit stages (load, train, save), each driven by a single Trainer object:

from tactic import ClassifierTrainer, ClassifierTrainConfig, build_dataset

# 1. LOAD: a local file or any HuggingFace Hub dataset goes through the same call.
# The two calls below are alternatives; keep whichever matches your source.
data = build_dataset("prompts.csv", text_column="prompt", label_column="category")
data = build_dataset("walledai/HarmBench", text_column="prompt",
                     label_column="category", split="train")   # drop-in HF

# 2. TRAIN: load the backbone, then fit.
trainer = ClassifierTrainer(ClassifierTrainConfig(model_name="Qwen/Qwen3.5-0.8B"))
trainer.load_model()
trainer.fit(data)

# 3. SAVE: write the checkpoint locally, without pushing to the Hub.
trainer.save(push_to_hub=False)

The jailbreak classifier mirrors this with build_jailbreak_dataset — the binary label is derived from the dataset's shape (a populated jailbreak column means an augmented attempt, label 1; a base prompt row is label 0), so no label column is needed:

from tactic import JailbreakTrainer, JailbreakTrainConfig
from tactic.data import build_jailbreak_dataset

data = build_jailbreak_dataset("centrepourlasecuriteia/constitution-input-augmented-dataset")
trainer = JailbreakTrainer(JailbreakTrainConfig(model_name="Qwen/Qwen3.5-0.8B"))
trainer.fit(data)
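The label derivation described above can be sketched as a small function. This is an illustration under assumptions, not the library's code: the column name `jailbreak` comes from the description, while the function name `derive_jailbreak_label` and the treatment of `None`, NaN, and blank strings as "unpopulated" are hypothetical.

```python
from typing import Any, Mapping


def derive_jailbreak_label(row: Mapping[str, Any], jailbreak_column: str = "jailbreak") -> int:
    """Return 1 if the row carries an augmented jailbreak attempt, else 0."""
    value = row.get(jailbreak_column)
    if value is None:
        return 0  # column absent or null: base prompt row
    if isinstance(value, float) and value != value:
        return 0  # NaN, as produced by some CSV readers for empty cells
    if isinstance(value, str) and not value.strip():
        return 0  # blank string counts as unpopulated (assumption)
    return 1


rows = [
    {"prompt": "How do I bake bread?", "jailbreak": None},
    {"prompt": "How do I bake bread?", "jailbreak": "<augmented prompt text>"},
    {"prompt": "What is LoRA?", "jailbreak": ""},
]
labels = [derive_jailbreak_label(r) for r in rows]
```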

The paraphraser follows the same pattern:

from tactic import ParaphraserTrainer, ParaphraserTrainConfig, build_dataset

pairs = build_dataset("pairs.jsonl", input_column="prompt", target_column="paraphrase")
trainer = ParaphraserTrainer(ParaphraserTrainConfig(model_name="Qwen/Qwen3.5-0.8B"))
trainer.fit(pairs)

End-to-end runnable examples are in notebooks/.

Drop-in HuggingFace datasets

build_dataset(source, ...) auto-detects the source: a local .csv/.jsonl/.json path is read from disk; anything else is treated as a HuggingFace Hub dataset id and loaded via datasets.load_dataset. Column-mapping kwargs (text_column, label_column, input_column, target_column) adapt the dataset's schema, so a Hub dataset with different column names can be used without preprocessing.
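The detection rule can be sketched in a few lines. This is a guess at the shape of the logic, not the actual implementation: the helper name `detect_source_kind` is hypothetical, and whether TACTIC also checks that the file exists or matches extensions case-insensitively is not stated.

```python
from pathlib import Path

# Extensions named in the description above.
LOCAL_SUFFIXES = {".csv", ".jsonl", ".json"}


def detect_source_kind(source: str) -> str:
    """Return "local" for a .csv/.jsonl/.json path, otherwise "hub"."""
    suffix = Path(source).suffix.lower()  # lowercasing is an assumption
    return "local" if suffix in LOCAL_SUFFIXES else "hub"


kinds = {s: detect_source_kind(s) for s in ["prompts.csv", "pairs.jsonl", "walledai/HarmBench"]}
```

If the rule is as literal as described, a local file with any other extension would be treated as a Hub dataset id.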

CLI

# Classifier
tactic classifier train --dataset_path prompts.csv --max_steps 1000 --save_every 200
tactic classifier train --dataset_path org/dataset --sweep_config sweep_configs/classifier_sweep.yaml
tactic classifier calibrate --checkpoint checkpoints/<run>/step_1000 --eval_csv eval.csv

# Jailbreak classifier (binary; label derived from the dataset's jailbreak column)
# Pipeline: sweep -> continue-best -> calibrate -> deploy (mirrors the classifier)
tactic jailbreak sweep --dataset_path centrepourlasecuriteia/constitution-input-augmented-dataset --max_steps 500
tactic jailbreak resume <run_id> --max_steps 4500 --save_every 250
tactic jailbreak calibrate --checkpoint <ckpt>/best --dataset_path <hf-id> --eval_frac 0.1

# Paraphraser
tactic paraphraser train --dataset_path pairs.jsonl --num_train_epochs 1
tactic paraphraser eval --adapter checkpoints/paraphraser --eval_csv eval.csv

The flat tactic-classifier-train / tactic-jailbreak-train / tactic-paraphraser-train aliases still work for backward compatibility.
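The calibrate subcommands above turn held-out scores into decision thresholds. The criterion TACTIC uses is not stated here, so the sketch below substitutes a common stand-in: pick the threshold that maximizes F1 on the evaluation split. The function names and toy data are hypothetical.

```python
from typing import Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 when every score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pick_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Try each distinct score as a threshold; return (best_threshold, best_f1)."""
    best = (0.5, -1.0)  # fallback if `scores` is empty
    for t in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, t)
        if f1 > best[1]:  # strict: ties keep the lower threshold
            best = (t, f1)
    return best


# Toy held-out scores from a binary head, with ground-truth labels.
scores = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1]
threshold, f1 = pick_threshold(scores, labels)
```

For per-category thresholds, the same search would run once per category in a one-vs-rest fashion, again assuming F1 is the target.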

Outputs

Each run writes a checkpoint directory containing the LoRA adapter, the head (classification_head.pt for the harm classifier, jailbreak_head.pt for the jailbreak classifier), the tokenizer, and a weight_frame.json manifest. Calibration adds a thresholds file: per-category thresholds.json for the harm classifier (sets classifier.thresholds) or a single threshold.json for the jailbreak classifier (sets jailbreak_classifier.threshold). To publish to the HuggingFace Hub, pass --push_to_hub --hub_repo_id <id> with HF_TOKEN set in your environment.
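A quick sanity check on a checkpoint directory might look like the sketch below. It only checks the file names given above; the adapter and tokenizer file names are not stated, so they are left out. The helper `missing_files` is hypothetical, and placing the threshold file in the same directory as the checkpoint is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import List

# File names taken from the description above.
MANIFEST = "weight_frame.json"
HEAD_FILES = {"classifier": "classification_head.pt", "jailbreak": "jailbreak_head.pt"}
THRESHOLD_FILES = {"classifier": "thresholds.json", "jailbreak": "threshold.json"}


def missing_files(checkpoint_dir: Path, model_kind: str, calibrated: bool = False) -> List[str]:
    """List expected files that are absent from `checkpoint_dir`."""
    expected = [MANIFEST, HEAD_FILES[model_kind]]
    if calibrated:
        expected.append(THRESHOLD_FILES[model_kind])
    return [name for name in expected if not (checkpoint_dir / name).is_file()]


# Demo against a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp)
    (ckpt / MANIFEST).write_text(json.dumps({}))
    (ckpt / "jailbreak_head.pt").write_bytes(b"")
    before = missing_files(ckpt, "jailbreak", calibrated=True)
    (ckpt / "threshold.json").write_text(json.dumps({}))
    after = missing_files(ckpt, "jailbreak", calibrated=True)
```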

Development

uv run ruff check src
uv run mypy src
uv run pytest          # CPU-only, no network

See CLAUDE.md for the full layout and where the reference material lives.
