Two-Stage Real-Time DDoS Detection

A production-grade, flow-level DDoS detection system trained on 72.5 million real network flows from two independent datasets. It targets 13 distinct attack types (12 of 13 are detected at a rate of at least 70% on the offline test set) and needs no payload inspection, so it also works on encrypted traffic at ISP scale.

Results

Metric	Offline (CIC Test Set)	Live Simulation (115K inferences)
Recall	94.9%	99.0%
Precision	72.8%	99.8%
False Positive Rate	2.6%	0.2%
ROC-AUC	0.9934	—
Attack types detected (≥70%)	12/13	8/8
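
For reference, recall, precision, and false positive rate follow the usual confusion-matrix definitions. The sketch below shows the arithmetic; the counts are hypothetical and are not the project's data. Precision depends on how rare attacks are in the evaluated traffic, so a low false positive rate can coexist with moderate precision.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Recall, precision, and false positive rate from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),     # share of attack windows that were flagged
        "precision": tp / (tp + fp),  # share of flagged windows that were attacks
        "fpr": fp / (fp + tn),        # share of benign windows that were flagged
    }


# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=90, fp=5, tn=895, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```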

Key result: A slow-rate SYN evasion attack (throttled below rule thresholds) is detected 0% of the time by Stage 1 and 100% of the time by the Stage 2 ML ensemble, which shows that the ML layer catches attacks the rules cannot.

Architecture

Raw Network Flows (Zeek conn.log)
        │
        ▼
 01_build_windows.py   →  94 features × 3 time windows (2s / 10s / 60s)
        │
        ▼
 02_train.py           →  XGBoost (70%) + LightGBM (30%) ensemble
        │
        ├──► 05_demo.py          Real CIC test set evaluation + plots
        ├──► 06_two_stage.py     Comprehensive 20-scenario simulation
        └──► src/realtime/
              ├── simulate_attacks.py   Live flow generator
              ├── monitor_realtime.py   Rich terminal dashboard
              └── track_performance.py  30-min checkpoint tracker

Two-stage design:

Stage 1 (Rules): O(1) arithmetic checks — catches volumetric SYN, UDP, DNS, NTP floods instantly
Stage 2 (ML): XGBoost + LightGBM ensemble — catches evasive, application-layer attacks the rules miss
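
The two-stage flow can be sketched roughly as follows. This is a minimal illustration, not the code in 06_two_stage.py: the `WindowStats` type, the rule thresholds, and the `classify` function are hypothetical, and `ml_score` stands in for the trained ensemble. The default of 0.15 mirrors the `--s2-threshold 0.15` flag used in Step 4.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WindowStats:
    syn_frac: float   # fraction of flows in the window that are SYN-only
    flow_rate: float  # flows per second in the window


# Hypothetical rule thresholds; the project's real values are not stated here.
SYN_FRAC_LIMIT = 0.8
FLOW_RATE_LIMIT = 1000.0


def stage1_rules(w: WindowStats) -> bool:
    """O(1) arithmetic check for an obvious volumetric SYN flood."""
    return w.syn_frac > SYN_FRAC_LIMIT and w.flow_rate > FLOW_RATE_LIMIT


def classify(
    w: WindowStats,
    ml_score: Callable[[WindowStats], float],
    s2_threshold: float = 0.15,
) -> str:
    """Run the cheap rules first; fall back to the ML score only if they pass."""
    if stage1_rules(w):
        return "attack (stage 1)"
    if ml_score(w) >= s2_threshold:
        return "attack (stage 2)"
    return "benign"


# A throttled SYN attack stays under the rate rule but can still be scored by ML.
slow_syn = WindowStats(syn_frac=0.95, flow_rate=50.0)
print(classify(slow_syn, ml_score=lambda w: 0.6))
```

Running the rules first keeps the common volumetric case cheap, and the model is only evaluated on windows the rules let through.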

Setup

# Clone and create environment
git clone <repo-url>
cd DDOS
python -m venv venv
venv\Scripts\activate   # Windows PowerShell; on Linux/macOS use: source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Usage

Step 1 — Build Feature Windows

venv\Scripts\python.exe src\pipeline\01_build_windows.py `
    --cic    data\flows_parquet_cicddos\CICDDOS2019.parquet `
    --mawi   data\flows_parquet_mawi\MAWI.parquet `
    --labels data\label_intervals.json `
    --out    data\windows `
    --stride 2.0 --mawi-frac 0.10 --rows-per-shard 500000
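
To illustrate what windowing with a 2-second stride means, here is a minimal sketch that computes one feature (flow rate) over the three trailing windows. It is an assumption-laden illustration rather than the logic of 01_build_windows.py: the function names and the `flow_rate_<N>s` keys are hypothetical, and the real script computes 94 features per destination.

```python
from bisect import bisect_right
from typing import Iterator

WINDOWS = (2.0, 10.0, 60.0)  # trailing window lengths in seconds
STRIDE = 2.0                 # matches --stride 2.0


def flow_rates(sorted_ts: list[float], t_end: float) -> dict[str, float]:
    """Flows per second in each trailing window (t_end - w, t_end]."""
    hi = bisect_right(sorted_ts, t_end)
    rates = {}
    for w in WINDOWS:
        lo = bisect_right(sorted_ts, t_end - w)
        rates[f"flow_rate_{int(w)}s"] = (hi - lo) / w
    return rates


def iter_windows(timestamps: list[float]) -> Iterator[tuple[float, dict[str, float]]]:
    """Yield (window end time, features) every STRIDE seconds."""
    ts = sorted(timestamps)
    if not ts:
        return
    n_steps = int((ts[-1] - ts[0]) // STRIDE) + 1
    for i in range(1, n_steps + 1):
        t_end = ts[0] + i * STRIDE  # multiply instead of accumulating to avoid drift
        yield t_end, flow_rates(ts, t_end)


flow_times = [0.5, 1.0, 1.5, 9.0]  # hypothetical flow start times in seconds
for t_end, feats in iter_windows(flow_times):
    print(t_end, feats)
```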

Step 2 — Train the Ensemble

# Train with sf_frac removed (recommended — v2_no_sf)
venv\Scripts\python.exe src\pipeline\02_train.py `
    --windows data\windows --out models\v2_no_sf `
    --fast --exclude-features sf_frac
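
At inference time the two models' outputs are combined with the 70/30 weights shown in the architecture. The sketch below assumes the blend is a weighted average of the two predicted attack probabilities, compared against the threshold listed for meta.json (0.662); the function names are hypothetical and the actual blending code in 02_train.py may differ.

```python
XGB_WEIGHT, LGB_WEIGHT = 0.7, 0.3  # blend weights from the architecture diagram
THRESHOLD = 0.662                  # decision threshold listed for meta.json


def blend(p_xgb: float, p_lgb: float) -> float:
    """Weighted average of the two models' attack probabilities (assumed form)."""
    return XGB_WEIGHT * p_xgb + LGB_WEIGHT * p_lgb


def is_attack(p_xgb: float, p_lgb: float, threshold: float = THRESHOLD) -> bool:
    return blend(p_xgb, p_lgb) >= threshold


# Hypothetical scores: the XGBoost output dominates the decision.
print(is_attack(p_xgb=0.9, p_lgb=0.5))  # blended score 0.78
print(is_attack(p_xgb=0.5, p_lgb=0.9))  # blended score 0.62
```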

Step 3 — Evaluate on Real CIC Test Data

venv\Scripts\python.exe src\pipeline\05_demo.py `
    --model models\v2_no_sf --windows data\windows `
    --exclude-features sf_frac

Step 4 — Run the Comprehensive Simulation

venv\Scripts\python.exe src\pipeline\06_two_stage.py `
    --model models\v2_no_sf --s2-threshold 0.15

Step 5 — Live Real-Time Demo (3 terminals)

# Terminal 1: Attack simulator
Remove-Item data\live_stream.jsonl -ErrorAction SilentlyContinue
venv\Scripts\python.exe src\realtime\simulate_attacks.py

# Terminal 2: Live dashboard
venv\Scripts\python.exe src\realtime\monitor_realtime.py --model models\v2_no_sf

# Terminal 3: Performance tracker (30-min checkpoints)
venv\Scripts\python.exe src\realtime\track_performance.py
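
The simulator and the dashboard communicate through data\live_stream.jsonl, one JSON object per line. A reader for such a file might look like the sketch below. This is an assumption about the pattern, not the code in monitor_realtime.py; the function name and the `ts` / `dst_ip` fields are hypothetical.

```python
import io
import json
from typing import Iterator, TextIO


def read_new_flows(stream: TextIO) -> Iterator[dict]:
    """Yield each complete JSON line available from the stream's current position."""
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:
            return  # nothing more to read for now
        if not line.endswith("\n"):
            stream.seek(pos)  # writer is mid-line; rewind and retry on the next poll
            return
        if line.strip():
            yield json.loads(line)


# In-memory stand-in for the live file; the last record is still being written.
demo = io.StringIO(
    '{"ts": 1.0, "dst_ip": "10.0.0.1"}\n'
    '{"ts": 1.2, "dst_ip": "10.0.0.1"}\n'
    '{"ts": 1.'
)
flows = list(read_new_flows(demo))
print(len(flows))
```

A live reader would open the file once and call `read_new_flows` in a loop with a short sleep, so that partially written lines are picked up on a later pass.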

Generate Traffic Graph On-Demand

venv\Scripts\python.exe src\realtime\plot_traffic.py --bin 10 --last 60

Project Structure

DDOS/
├── src/
│   ├── pipeline/
│   │   ├── features.py            # 94-feature engineering library (shared)
│   │   ├── 01_build_windows.py    # Flow → windowed feature extraction
│   │   ├── 02_train.py            # XGBoost + LightGBM training
│   │   ├── 05_demo.py             # Real CIC test set evaluation
│   │   └── 06_two_stage.py        # 20-scenario comprehensive simulation
│   └── realtime/
│       ├── simulate_attacks.py    # Live flow generator (writes JSONL)
│       ├── monitor_realtime.py    # Rich terminal dashboard
│       ├── track_performance.py   # 30-min checkpoint tracker + graphs
│       └── plot_traffic.py        # Traffic volume spike visualizer
├── models/
│   ├── v2_no_sf/
│   │   ├── feature_cols.json      # 94 feature names (ordered)
│   │   └── meta.json              # threshold=0.662, blend weights
│   │   # xgb.json + lgb.txt excluded (.gitignore) — retrain from source
├── data/
│   └── label_intervals.json       # 9,260 CIC attack time intervals
├── requirements.txt
└── README.md
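
Because feature_cols.json stores the feature names in order, any code that scores a window has to build its input vector in exactly that order. A minimal sketch, assuming the file is a plain JSON array of strings (the helper names are hypothetical):

```python
import json
from pathlib import Path


def load_feature_order(model_dir: str) -> list[str]:
    """Read the ordered feature names saved next to the model (assumed JSON array)."""
    return json.loads((Path(model_dir) / "feature_cols.json").read_text())


def to_vector(features: dict[str, float], order: list[str]) -> list[float]:
    """Arrange a feature dict into the column order the model was trained with."""
    missing = [name for name in order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [features[name] for name in order]


# Hypothetical three-feature order for illustration.
order = ["flow_rate", "syn_frac", "H_src_ip"]
vector = to_vector({"syn_frac": 0.9, "H_src_ip": 3.2, "flow_rate": 120.0}, order)
print(vector)
```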

Features Engineered (94 total)

Category	Features	What They Detect
Volume	`flow_rate`, `pkt_rate`, `byte_rate`	Flooding intensity
Protocol	`syn_frac`, `dns_frac`, `frac_udp`	Attack type identity
Asymmetry	`frac_zero_resp`, `inbound_frac`	No-response floods
Amplification	`byte_ratio`, `large_flow_frac`, `ntp_frac`	Reflection attacks
Entropy	`H_src_ip`, `H_flow_bytes`	Botnet uniformity
Cross-window	`accel_pkt_rate`, `tw_std_byte_rate`	Attack ramping detection
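
As an example of one feature family, an entropy feature such as `H_src_ip` can be computed as the Shannon entropy of the source addresses seen in a window. The sketch below assumes base-2 Shannon entropy; features.py may use a different base or a normalized variant.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def shannon_entropy(values: Iterable[Hashable]) -> float:
    """Shannon entropy in bits of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# One source sending every flow gives 0 bits; four equally active sources give 2 bits.
print(shannon_entropy(["10.0.0.1"] * 8))
print(shannon_entropy(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]))
```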

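All features use stateless log1p() normalization. Nothing is fitted to the training data, so the transform itself cannot drift when the model is applied to traffic from a different network.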
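
A short sketch of why log1p() is stateless: the output depends only on the input value, whereas a fitted scaler also depends on the data it was fitted on. The function name and the clamping of negative inputs to zero are assumptions made for this illustration.

```python
import math


def normalize(value: float) -> float:
    """Compress a heavy-tailed, non-negative rate with log(1 + x)."""
    return math.log1p(max(value, 0.0))  # clamp at 0 (assumption) to keep the log defined


# The same raw rate maps to the same output on any dataset.
for rate in (0.0, 10.0, 1_000.0, 1_000_000.0):
    print(rate, round(normalize(rate), 3))
```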

Key Engineering Decisions

Problem	Root Cause	Fix	Impact
MemoryError at 82%	Unbounded per-destination buffer	15K row flush limit	32GB → 4GB RAM
40% false positives on real traffic	QuantileTransformer domain shift	Replaced with `log1p()`	FPR: 40% → 2.6%
sf_frac feature leak	UDP attacks have sf_frac=0 by definition	Ablation → remove sf_frac	Recall: 93.9% → 94.9%, NTP: +12.3%
0% ML scores in simulation	t_span=30s vs 120s flow density mismatch	Fixed t_span=120	SYN score: 0.009 → 0.587
XGBoost crash on load	CUDA-specific JSON format	`tree_method=hist` (CPU)	Full portability
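
The first fix in the table, capping each per-destination buffer at 15K rows, can be sketched as follows. The class and callback names are hypothetical; this shows the pattern, not the code in 01_build_windows.py.

```python
from collections import defaultdict
from typing import Callable

FLUSH_LIMIT = 15_000  # rows per destination, from the table above


class DestinationBuffer:
    """Per-destination row buffer that flushes once a destination hits the limit."""

    def __init__(self, on_flush: Callable[[str, list], None], limit: int = FLUSH_LIMIT):
        self._rows: dict[str, list] = defaultdict(list)
        self._on_flush = on_flush
        self._limit = limit

    def add(self, dst_ip: str, row: dict) -> None:
        rows = self._rows[dst_ip]
        rows.append(row)
        if len(rows) >= self._limit:
            self.flush(dst_ip)  # memory per destination never exceeds the limit

    def flush(self, dst_ip: str) -> None:
        rows = self._rows.pop(dst_ip, [])
        if rows:
            self._on_flush(dst_ip, rows)

    def flush_all(self) -> None:
        for dst_ip in list(self._rows):
            self.flush(dst_ip)


# Small limit so the behaviour is visible: 7 rows produce batches of 3, 3, and 1.
batches: list[int] = []
buf = DestinationBuffer(on_flush=lambda dst, rows: batches.append(len(rows)), limit=3)
for i in range(7):
    buf.add("10.0.0.1", {"i": i})
buf.flush_all()
print(batches)
```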

Datasets

Note: Raw datasets are not included in this repository due to size (≈74 GB).

Dataset	Source	Size	Role
CIC-DDoS2019	University of New Brunswick, Canada	72.5M flows	Training + evaluation
MAWI Backbone	WIDE Project, Japan	~10% sample	Independent benign-traffic validation

Download links:

CIC-DDoS2019: https://www.unb.ca/cic/datasets/ddos-2019.html
MAWI: https://mawi.wide.ad.jp/

Per-Attack Detection Rates (Live 30-min Session)

Attack	Detection Rate
LDAP Amplification	100.0%
Slow SYN (Rule-Evasion)	100.0% — ML only, rules fire 0%
NTP Amplification	99.7%
SNMP Amplification	99.7%
MSSQL Amplification	99.6%
UDP Flood	98.4%
DNS Amplification	98.3%
SYN Flood	97.0%
Normal Traffic FPR	0.2%

Related Work

The two-stage (rules + ML ensemble) architecture and the evaluation data used here draw on:

Cloudflare Magic Transit engineering blog
Arbor Networks ATLAS system
Sharafaldin et al., CIC-DDoS2019, IEEE ICCST 2019

License

MIT License — see LICENSE for details.
