Skip to content

infernet-org/guardrails

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

Guardrails

Training datasets for cognitive security models.

Datasets

Cognitive Security v1

Datasets for training models to detect malicious intent hidden in natural language.

Location: datasets/cognitive-security/v1/

File Samples Description
attacks.jsonl 530 Attack seed scenarios with metadata
sft_train.jsonl 314 SFT training data (instruction/response)
dpo_train.jsonl 285 DPO preference pairs (chosen/rejected)

Attack Categories Covered

  • Prompt injection (direct override, instruction escape, context manipulation)
  • Jailbreaks (roleplay, encoding, hypothetical, gradual escalation)
  • Social engineering (authority claims, urgency, trust exploitation)
  • Indirect injection (hidden instructions, data poisoning)
  • Data security (PII exposure, credential leakage, secret extraction)
  • Toxic flows (read-then-exfil, cross-context, privilege chains)
  • Resource exhaustion (infinite loops, polling attacks)

DPO Failure Modes

The DPO dataset includes rejected responses demonstrating these failure patterns:

Mode Description
MISSED_DECEPTION Takes cover story at face value, helps with harmful request
PARTIAL_COMPLIANCE Refuses obvious part but helps with something almost as harmful
INFORMATION_LEAK Reveals detection logic to attacker
SOFT_REFUSAL Weak refusal that could be pushed past
WRONG_THREAT Detects threat but misidentifies the actual attack vector

Usage

SFT Training

from datasets import load_dataset

dataset = load_dataset("json", data_files="datasets/cognitive-security/v1/sft_train.jsonl")

DPO Training

from datasets import load_dataset

dataset = load_dataset("json", data_files="datasets/cognitive-security/v1/dpo_train.jsonl")

# Each sample has:
# - prompt: The attack input
# - chosen: Correct response (detects and refuses)
# - rejected: Flawed response (demonstrates failure mode)

License

Apache 2.0

About

Datasets for training models to detect malicious intent hidden in natural language.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors