A production-grade, flow-level DDoS detection system trained on 72.5 million real network flows from two independent datasets. Detects 13 distinct attack types without payload inspection — works on encrypted traffic at ISP scale.
| Metric | Offline (CIC Test Set) | Live Simulation (115K inferences) |
|---|---|---|
| Recall | 94.9% | 99.0% |
| Precision | 72.8% | 99.8% |
| False Positive Rate | 2.6% | 0.2% |
| ROC-AUC | 0.9934 | — |
| Attack types detected (≥70% detection rate) | 12/13 | 8/8 |
Key result: A slow-rate SYN evasion attack (throttled below the Stage-1 rule thresholds) is detected 0% of the time by Stage 1 and 100% of the time by the Stage-2 ML ensemble, which shows that the ML layer catches attacks the rules cannot.
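The recall, precision, and false positive rate reported in the table above follow the standard confusion-matrix definitions. The sketch below is illustrative only: the function name `detection_metrics` and all counts are hypothetical and are not taken from the project's evaluation scripts.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recall, precision, and false positive rate from confusion-matrix counts."""
    return {
        # Share of attack windows that were flagged.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of flagged windows that really were attacks.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of benign windows that were flagged by mistake.
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Made-up counts: 100 attack windows and 1,000 benign windows.
metrics = detection_metrics(tp=95, fp=26, tn=974, fn=5)
print(metrics)
```

Precision depends on the ratio of attack to benign windows in the evaluation set, while FPR does not. With the made-up counts above, an FPR of 2.6% still yields a precision below 80%, because benign windows outnumber attack windows ten to one.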
Raw Network Flows (Zeek conn.log)
│
▼
01_build_windows.py → 94 features × 3 time windows (2s / 10s / 60s)
│
▼
02_train.py → XGBoost (70%) + LightGBM (30%) ensemble
│
├──► 05_demo.py Real CIC test set evaluation + plots
├──► 06_two_stage.py Comprehensive 20-scenario simulation
└──► src/realtime/
├── simulate_attacks.py Live flow generator
├── monitor_realtime.py Rich terminal dashboard
└── track_performance.py 30-min checkpoint tracker
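To make the multi-window idea in the pipeline above concrete, here is a minimal sketch that computes two of the listed features (flow rate and SYN fraction) over trailing 2s, 10s, and 60s windows. It is not the code in `01_build_windows.py`: the function name, the flow record layout (`ts`, `is_syn`), and the `_2s` / `_10s` / `_60s` suffixes are assumptions made for illustration.

```python
WINDOWS_S = (2.0, 10.0, 60.0)  # window lengths in seconds, as in the diagram


def window_features(flows, t_end, windows=WINDOWS_S):
    """Compute per-window features for flows ending at time t_end.

    `flows` is a list of dicts with a 'ts' timestamp in seconds and an
    'is_syn' flag (hypothetical record layout).
    """
    feats = {}
    for w in windows:
        # Keep flows whose timestamp falls in the half-open window (t_end - w, t_end].
        recent = [f for f in flows if t_end - w < f["ts"] <= t_end]
        n = len(recent)
        tag = f"{int(w)}s"
        feats[f"flow_rate_{tag}"] = n / w
        feats[f"syn_frac_{tag}"] = sum(f["is_syn"] for f in recent) / n if n else 0.0
    return feats


# One flow per second for a minute; only the last two flows are SYNs.
flows = [{"ts": float(t), "is_syn": t >= 58} for t in range(60)]
feats = window_features(flows, t_end=59.0)
print(feats)
```

In this toy trace the SYN fraction is 1.0 in the 2s window but only 0.2 in the 10s window, which is the kind of contrast between short and long windows that cross-window features can pick up.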
Two-stage design:
- Stage 1 (Rules): O(1) arithmetic checks — catches volumetric SYN, UDP, DNS, NTP floods instantly
- Stage 2 (ML): XGBoost + LightGBM ensemble — catches evasive, application-layer attacks the rules miss
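The two stages above can be sketched as a short decision function. This is a simplified illustration, not the logic in `06_two_stage.py`: the rule thresholds are invented, only SYN and UDP rules are shown, and the ensemble is assumed to be a weighted average of the two models' attack probabilities using the 70/30 split from the diagram. The default threshold of 0.662 is the value listed for `meta.json`; the simulation script accepts its own `--s2-threshold`.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical Stage-1 limits, chosen only for illustration.
SYN_RATE_LIMIT = 1000.0  # SYN flows per second
UDP_RATE_LIMIT = 5000.0  # UDP flows per second


@dataclass
class WindowStats:
    flow_rate: float  # flows per second in the window
    syn_frac: float   # fraction of flows that are SYN-only
    frac_udp: float   # fraction of flows that are UDP


def stage1_rules(w: WindowStats) -> Optional[str]:
    """O(1) arithmetic checks; returns the name of the rule that fired, if any."""
    if w.flow_rate * w.syn_frac > SYN_RATE_LIMIT:
        return "syn_flood"
    if w.flow_rate * w.frac_udp > UDP_RATE_LIMIT:
        return "udp_flood"
    return None


def stage2_score(p_xgb: float, p_lgb: float, w_xgb: float = 0.7, w_lgb: float = 0.3) -> float:
    """Weighted blend of the two model probabilities (assumed blending scheme)."""
    return w_xgb * p_xgb + w_lgb * p_lgb


def classify(w: WindowStats, p_xgb: float, p_lgb: float, threshold: float = 0.662) -> str:
    """Rules first; fall back to the ML score when no rule fires."""
    rule = stage1_rules(w)
    if rule is not None:
        return f"stage1:{rule}"
    if stage2_score(p_xgb, p_lgb) >= threshold:
        return "stage2:ml"
    return "benign"


# A volumetric flood trips the rule; a throttled SYN attack is left to the ML stage.
flood = classify(WindowStats(5000.0, 0.9, 0.0), p_xgb=0.0, p_lgb=0.0)
slow_syn = classify(WindowStats(200.0, 0.9, 0.0), p_xgb=0.9, p_lgb=0.8)
normal = classify(WindowStats(50.0, 0.1, 0.2), p_xgb=0.05, p_lgb=0.10)
print(flood, slow_syn, normal)
```

In a real deployment `p_xgb` and `p_lgb` would come from the trained models' probability outputs for the window's feature vector; they are passed in directly here so the sketch runs without model files.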
# Clone and create environment
git clone <repo-url>
cd DDOS
python -m venv venv
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

# Build windowed features from the flow data
venv\Scripts\python.exe src\pipeline\01_build_windows.py `
--cic data\flows_parquet_cicddos\CICDDOS2019.parquet `
--mawi data\flows_parquet_mawi\MAWI.parquet `
--labels data\label_intervals.json `
--out data\windows `
--stride 2.0 --mawi-frac 0.10 --rows-per-shard 500000

# Train with sf_frac removed (recommended: v2_no_sf)
venv\Scripts\python.exe src\pipeline\02_train.py `
--windows data\windows --out models\v2_no_sf `
--fast --exclude-features sf_frac

# Evaluate on the CIC test set
venv\Scripts\python.exe src\pipeline\05_demo.py `
--model models\v2_no_sf --windows data\windows `
--exclude-features sf_frac

# Run the 20-scenario two-stage simulation
venv\Scripts\python.exe src\pipeline\06_two_stage.py `
--model models\v2_no_sf --s2-threshold 0.15

# Terminal 1: Attack simulator
Remove-Item data\live_stream.jsonl -ErrorAction SilentlyContinue
venv\Scripts\python.exe src\realtime\simulate_attacks.py
# Terminal 2: Live dashboard
venv\Scripts\python.exe src\realtime\monitor_realtime.py --model models\v2_no_sf
# Terminal 3: Performance tracker (30-min checkpoints)
venv\Scripts\python.exe src\realtime\track_performance.py

# Plot traffic volume
venv\Scripts\python.exe src\realtime\plot_traffic.py --bin 10 --last 60

DDOS/
├── src/
│ ├── pipeline/
│ │ ├── features.py # 94-feature engineering library (shared)
│ │ ├── 01_build_windows.py # Flow → windowed feature extraction
│ │ ├── 02_train.py # XGBoost + LightGBM training
│ │ ├── 05_demo.py # Real CIC test set evaluation
│ │ └── 06_two_stage.py # 20-scenario comprehensive simulation
│ └── realtime/
│ ├── simulate_attacks.py # Live flow generator (writes JSONL)
│ ├── monitor_realtime.py # Rich terminal dashboard
│ ├── track_performance.py # 30-min checkpoint tracker + graphs
│ └── plot_traffic.py # Traffic volume spike visualizer
├── models/
│ ├── v2_no_sf/
│ │ ├── feature_cols.json # 94 feature names (ordered)
│ │ └── meta.json # threshold=0.662, blend weights
│ │ # xgb.json + lgb.txt excluded (.gitignore) — retrain from source
├── data/
│ └── label_intervals.json # 9,260 CIC attack time intervals
├── requirements.txt
└── README.md
| Category | Features | What They Detect |
|---|---|---|
| Volume | flow_rate, pkt_rate, byte_rate | Flooding intensity |
| Protocol | syn_frac, dns_frac, frac_udp | Attack type identity |
| Asymmetry | frac_zero_resp, inbound_frac | No-response floods |
| Amplification | byte_ratio, large_flow_frac, ntp_frac | Reflection attacks |
| Entropy | H_src_ip, H_flow_bytes | Botnet uniformity |
| Cross-window | accel_pkt_rate, tw_std_byte_rate | Attack ramping detection |
All features use stateless log1p() normalization — no fitted state, no domain shift.
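As an illustration of two ideas from this section, the sketch below computes a Shannon entropy over source IPs (in the spirit of `H_src_ip`) and applies `log1p()` to raw feature values. It is not taken from `features.py`: the function names, the use of base-2 logarithms, and the clamping of negative inputs to zero are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(values) -> float:
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def log1p_features(raw: dict) -> dict:
    """Apply log1p to every feature; needs no fitted parameters.

    Negative inputs are clamped to 0 (an assumption made for this sketch).
    """
    return {name: math.log1p(max(value, 0.0)) for name, value in raw.items()}


# Four distinct sources seen equally often give 2 bits of entropy;
# a single repeated source gives 0 bits.
h_spread = shannon_entropy(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
h_single = shannon_entropy(["10.0.0.1"] * 5)

scaled = log1p_features({"flow_rate": 0.0, "byte_rate": 1_000_000.0})
print(h_spread, h_single, scaled)
```

Because `log1p` depends only on the input value, the same transform applies unchanged to any dataset, whereas a fitted transform such as `QuantileTransformer` carries the training distribution with it.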
| Problem | Root Cause | Fix | Impact |
|---|---|---|---|
| MemoryError at 82% | Unbounded per-destination buffer | 15K row flush limit | 32GB → 4GB RAM |
| 40% false positives on real traffic | QuantileTransformer domain shift | Replaced with log1p() | FPR: 40% → 2.6% |
| sf_frac feature leak | UDP attacks have sf_frac=0 by definition | Ablation → remove sf_frac | Recall: 93.9% → 94.9%, NTP: +12.3% |
| 0% ML scores in simulation | t_span=30s vs 120s flow density mismatch | Fixed t_span=120 | SYN score: 0.009 → 0.587 |
| XGBoost crash on load | CUDA-specific JSON format | tree_method=hist (CPU) | Full portability |
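The first fix in the table, capping the per-destination buffer, can be sketched as follows. This is a hypothetical reconstruction of the idea rather than the code in `01_build_windows.py`: the class name, the callback interface, and the row format are all assumptions. Only the 15K row limit comes from the table.

```python
from collections import defaultdict

FLUSH_LIMIT = 15_000  # rows per destination before a forced flush (from the table)


class BoundedDstBuffer:
    """Per-destination row buffer that flushes once a destination hits the limit."""

    def __init__(self, flush_limit=FLUSH_LIMIT, on_flush=None):
        self.flush_limit = flush_limit
        # on_flush(dst_ip, rows) receives each flushed batch, e.g. to write a shard.
        self.on_flush = on_flush or (lambda dst_ip, rows: None)
        self._rows = defaultdict(list)

    def add(self, dst_ip, row):
        buf = self._rows[dst_ip]
        buf.append(row)
        if len(buf) >= self.flush_limit:
            self.flush(dst_ip)

    def flush(self, dst_ip):
        # Remove the batch from memory before handing it off.
        rows = self._rows.pop(dst_ip, [])
        if rows:
            self.on_flush(dst_ip, rows)

    def flush_all(self):
        for dst_ip in list(self._rows):
            self.flush(dst_ip)


# Small limit so the behaviour is visible: 7 rows are flushed as 3 + 3 + 1.
flushed = []
buf = BoundedDstBuffer(flush_limit=3, on_flush=lambda dst, rows: flushed.append((dst, len(rows))))
for i in range(7):
    buf.add("10.0.0.1", {"seq": i})
buf.flush_all()
print(flushed)
```

With a cap in place, memory use per destination is bounded by the flush limit instead of growing with the size of the heaviest destination, which matters most during floods aimed at a single victim.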
Note: Raw datasets are not included in this repository due to size (≈74 GB).
| Dataset | Source | Size | Role |
|---|---|---|---|
| CIC-DDoS2019 | University of New Brunswick, Canada | 72.5M flows | Training + evaluation |
| MAWI Backbone | WIDE Project, Japan | ~10% sample | Honest benign validation |
Download links:
- CIC-DDoS2019: https://www.unb.ca/cic/datasets/ddos-2019.html
- MAWI: https://mawi.wide.ad.jp/
| Attack | Detection Rate |
|---|---|
| LDAP Amplification | 100.0% |
| Slow SYN (Rule-Evasion) | 100.0% — ML only, rules fire 0% |
| NTP Amplification | 99.7% |
| SNMP Amplification | 99.7% |
| MSSQL Amplification | 99.6% |
| UDP Flood | 98.4% |
| DNS Amplification | 98.3% |
| SYN Flood | 97.0% |
| Normal Traffic FPR | 0.2% |
This system follows the two-stage (rules + ML ensemble) architecture and uses the dataset described in:
- Cloudflare Magic Transit engineering blog
- Arbor Networks ATLAS system
- Sharafaldin et al., CIC-DDoS2019, IEEE ICCST 2019
MIT License — see LICENSE for details.