Skip to content

CentreSecuriteIA/TACTIC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TACTIC

Tuning for Alignment, Constitutional Training, & Instruction Calibration.

TACTIC is a training library for three models in the CeSIA safety stack:

  • a constitutional classifier — a LoRA-adapted causal LM + linear head that classifies a prompt into a 12-class harm taxonomy,
  • a jailbreak classifier — the same backbone with a single sigmoid head that flags whether a prompt is a jailbreak attempt (binary), with per-technique loss logging, and
  • a paraphraser — a LoRA fine-tuned causal LM that rewrites prompts while preserving intent ("reverse paraphrasing" / redaction), consumed by REDACT.

TACTIC is training-only. Inference (the Probe → Decode → Assess routine, vLLM/API deployment) lives in a separate downstream library that consumes the weights and the weight_frame.json calibration manifest produced here.

Install

uv sync --extra dev          # or: pip install -e ".[dev]"

The pipeline: load → train → save

Both models share the same three explicit stages, driven by one Trainer:

from tactic import ClassifierTrainer, ClassifierTrainConfig, build_dataset

# 1. LOAD DATA — local file or any HuggingFace Hub dataset, same call
data = build_dataset("prompts.csv", text_column="prompt", label_column="category")
data = build_dataset("walledai/HarmBench", text_column="prompt",
                     label_column="category", split="train")   # drop-in HF

# 2 & 3. LOAD MODEL & TRAIN
trainer = ClassifierTrainer(ClassifierTrainConfig(model_name="Qwen/Qwen3.5-0.8B"))
trainer.load_model()
trainer.fit(data)
trainer.save(push_to_hub=False)

The jailbreak classifier mirrors this with build_jailbreak_dataset — the binary label is derived from the dataset's shape (a populated jailbreak column means an augmented attempt, label 1; a base prompt row is label 0), so no label column is needed:

from tactic import JailbreakTrainer, JailbreakTrainConfig
from tactic.data import build_jailbreak_dataset

data = build_jailbreak_dataset("centrepourlasecuriteia/constitution-input-augmented-dataset")
trainer = JailbreakTrainer(JailbreakTrainConfig(model_name="Qwen/Qwen3.5-0.8B"))
trainer.fit(data)

The paraphraser mirrors this too:

from tactic import ParaphraserTrainer, ParaphraserTrainConfig, build_dataset

pairs = build_dataset("pairs.jsonl", input_column="prompt", target_column="paraphrase")
trainer = ParaphraserTrainer(ParaphraserTrainConfig(model_name="Qwen/Qwen3.5-0.8B"))
trainer.fit(pairs)

End-to-end runnable examples are in notebooks/.

Drop-in HuggingFace datasets

build_dataset(source, ...) auto-detects the source: a local .csv/.jsonl/.json path is read from disk; anything else is treated as a HuggingFace Hub dataset id and loaded via datasets.load_dataset. Column-mapping kwargs adapt any schema, so any HF dataset is a true drop-in.

CLI

# Classifier
tactic classifier train --dataset_path prompts.csv --max_steps 1000 --save_every 200
tactic classifier train --dataset_path org/dataset --sweep_config sweep_configs/classifier_sweep.yaml
tactic classifier calibrate --checkpoint checkpoints/<run>/step_1000 --eval_csv eval.csv

# Jailbreak classifier (binary; label derived from the dataset's jailbreak column)
# Pipeline: sweep -> continue-best -> calibrate -> deploy (mirrors the classifier)
tactic jailbreak sweep --dataset_path centrepourlasecuriteia/constitution-input-augmented-dataset --max_steps 500
tactic jailbreak resume <run_id> --max_steps 4500 --save_every 250
tactic jailbreak calibrate --checkpoint <ckpt>/best --dataset_path <hf-id> --eval_frac 0.1

# Paraphraser
tactic paraphraser train --dataset_path pairs.jsonl --num_train_epochs 1
tactic paraphraser eval --adapter checkpoints/paraphraser --eval_csv eval.csv

The flat tactic-classifier-train / tactic-jailbreak-train / tactic-paraphraser-train aliases still work for back-compat.

Outputs

Each run writes a checkpoint directory containing the LoRA adapter, the head (the harm classifier's classification_head.pt or the jailbreak jailbreak_head.pt), the tokenizer, and a weight_frame.json manifest. Calibration adds the harm classifier's per-category thresholds.json (sets classifier.thresholds) or the jailbreak detector's single threshold.json (sets jailbreak_classifier.threshold). Set --push_to_hub --hub_repo_id <id> (with HF_TOKEN in your environment) to publish to the HuggingFace Hub.

Development

uv run ruff check src
uv run mypy src
uv run pytest          # CPU-only, no network

See CLAUDE.md for the full layout and where the reference material lives.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors