Training datasets for cognitive security models.
Datasets for training models to detect malicious intent hidden in natural language.
Location: datasets/cognitive-security/v1/
| File | Samples | Description |
|---|---|---|
| `attacks.jsonl` | 530 | Attack seed scenarios with metadata |
| `sft_train.jsonl` | 314 | SFT training data (instruction/response) |
| `dpo_train.jsonl` | 285 | DPO preference pairs (chosen/rejected) |
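
As a quick sanity check after downloading, the sample counts in the table can be compared against the files on disk. The following is a minimal sketch using only the standard library; the helper names `count_jsonl_records` and `check_counts` are illustrative and not part of this repository, and it assumes each file holds one JSON object per line.

```python
import json
from pathlib import Path

# Expected sample counts, copied from the table above.
EXPECTED_COUNTS = {
    "attacks.jsonl": 530,
    "sft_train.jsonl": 314,
    "dpo_train.jsonl": 285,
}


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file, checking that each parses as JSON."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # raises json.JSONDecodeError on a malformed line
            count += 1
    return count


def check_counts(root):
    """Return {filename: (expected, actual)} for every file whose count differs."""
    mismatches = {}
    for name, expected in EXPECTED_COUNTS.items():
        actual = count_jsonl_records(Path(root) / name)
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches
```

Calling `check_counts("datasets/cognitive-security/v1")` should return an empty dict if the files match the table.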
Attack categories covered:
- Prompt injection (direct override, instruction escape, context manipulation)
- Jailbreaks (roleplay, encoding, hypothetical, gradual escalation)
- Social engineering (authority claims, urgency, trust exploitation)
- Indirect injection (hidden instructions, data poisoning)
- Data security (PII exposure, credential leakage, secret extraction)
- Toxic flows (read-then-exfil, cross-context, privilege chains)
- Resource exhaustion (infinite loops, polling attacks)
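
For filtering or reporting, the categories above can be held in a small lookup table. This is only one possible representation: the dictionary keys and technique strings are copied from the list, but whether the dataset files use these exact labels is an assumption, and `ATTACK_TAXONOMY` and `category_of` are illustrative names.

```python
# Category -> techniques, transcribed from the list above.
# Assumption: the labels stored in the dataset files may be spelled differently.
ATTACK_TAXONOMY = {
    "Prompt injection": ["direct override", "instruction escape", "context manipulation"],
    "Jailbreaks": ["roleplay", "encoding", "hypothetical", "gradual escalation"],
    "Social engineering": ["authority claims", "urgency", "trust exploitation"],
    "Indirect injection": ["hidden instructions", "data poisoning"],
    "Data security": ["PII exposure", "credential leakage", "secret extraction"],
    "Toxic flows": ["read-then-exfil", "cross-context", "privilege chains"],
    "Resource exhaustion": ["infinite loops", "polling attacks"],
}


def category_of(technique):
    """Return the category that lists this technique, or None if it is not listed."""
    wanted = technique.strip().lower()
    for category, techniques in ATTACK_TAXONOMY.items():
        # Compare case-insensitively so "Roleplay" and "roleplay" both match.
        if wanted in (name.lower() for name in techniques):
            return category
    return None
```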
The DPO dataset includes rejected responses demonstrating these failure patterns:
| Mode | Description |
|---|---|
| MISSED_DECEPTION | Takes cover story at face value, helps with harmful request |
| PARTIAL_COMPLIANCE | Refuses obvious part but helps with something almost as harmful |
| INFORMATION_LEAK | Reveals detection logic to attacker |
| SOFT_REFUSAL | Weak refusal that could be pushed past |
| WRONG_THREAT | Detects threat but misidentifies the actual attack vector |
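
If the failure modes need to be handled in code, an enum keeps the five names in one place. The sketch below is hedged: this README does not state whether DPO samples carry a failure-mode label, so the `failure_mode` field name is hypothetical, as are the `FailureMode` and `tally_failure_modes` names.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """Failure modes from the table above; values are the table descriptions."""
    MISSED_DECEPTION = "Takes cover story at face value, helps with harmful request"
    PARTIAL_COMPLIANCE = "Refuses obvious part but helps with something almost as harmful"
    INFORMATION_LEAK = "Reveals detection logic to attacker"
    SOFT_REFUSAL = "Weak refusal that could be pushed past"
    WRONG_THREAT = "Detects threat but misidentifies the actual attack vector"


def tally_failure_modes(records, field="failure_mode"):
    """Count records per failure mode.

    `field` is a hypothetical key name; records without a recognized
    label under that key are skipped.
    """
    counts = Counter()
    for record in records:
        label = record.get(field)
        if isinstance(label, str) and label in FailureMode.__members__:
            counts[FailureMode[label]] += 1
    return counts
```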
```python
from datasets import load_dataset
dataset = load_dataset("json", data_files="datasets/cognitive-security/v1/sft_train.jsonl")
```

```python
from datasets import load_dataset
dataset = load_dataset("json", data_files="datasets/cognitive-security/v1/dpo_train.jsonl")
# Each sample has:
# - prompt: The attack input
# - chosen: Correct response (detects and refuses)
# - rejected: Flawed response (demonstrates failure mode)
```

License: Apache 2.0