bit-soham/DDOS-Detection

Two-Stage Real-Time DDoS Detection

A production-grade, flow-level DDoS detection system trained on 72.5 million real network flows from two independent datasets. Detects 13 distinct attack types without payload inspection — works on encrypted traffic at ISP scale.

Python XGBoost LightGBM License: MIT


Results

| Metric | Offline (CIC Test Set) | Live Simulation (115K inferences) |
|---|---|---|
| Recall | 94.9% | 99.0% |
| Precision | 72.8% | 99.8% |
| False Positive Rate | 2.6% | 0.2% |
| ROC-AUC | 0.9934 | |
| Attack types detected (≥70%) | 12/13 | 8/8 |
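
For reference, recall, precision, and false positive rate follow the standard confusion-matrix definitions. The sketch below uses made-up counts purely for illustration; the function name and the numbers are assumptions, not values taken from this repository.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),               # share of attack samples caught
        "precision": tp / (tp + fp),            # share of alerts that were real attacks
        "false_positive_rate": fp / (fp + tn),  # share of benign samples flagged
    }


# Illustrative counts only (hypothetical, not the project's results).
metrics = detection_metrics(tp=90, fp=10, tn=890, fn=10)
print(metrics)
```

Precision depends on the attack-to-benign ratio of the evaluated traffic, so it can differ widely between two evaluations even when recall and false positive rate are similar.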

Key result: A slow-rate SYN evasion attack (throttled below the Stage-1 rule thresholds) is detected 0% of the time by Stage 1 and 100% of the time by the Stage-2 ML ensemble, which shows that the ML layer catches attacks the rules miss.


Architecture

Raw Network Flows (Zeek conn.log)
        │
        ▼
 01_build_windows.py   →  94 features × 3 time windows (2s / 10s / 60s)
        │
        ▼
 02_train.py           →  XGBoost (70%) + LightGBM (30%) ensemble
        │
        ├──► 05_demo.py          Real CIC test set evaluation + plots
        ├──► 06_two_stage.py     Comprehensive 20-scenario simulation
        └──► src/realtime/
              ├── simulate_attacks.py   Live flow generator
              ├── monitor_realtime.py   Rich terminal dashboard
              └── track_performance.py  30-min checkpoint tracker
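
One plausible reading of the 70/30 ensemble in the diagram is a weighted average of the two models' attack probabilities, compared against the stored decision threshold. The sketch below assumes exactly that; the function names are hypothetical, and the default threshold of 0.662 is the value noted next to `meta.json` in the project structure.

```python
def blend_scores(p_xgb: float, p_lgb: float,
                 w_xgb: float = 0.7, w_lgb: float = 0.3) -> float:
    """Weighted average of two attack probabilities (assumed blending scheme)."""
    return w_xgb * p_xgb + w_lgb * p_lgb


def is_attack(p_xgb: float, p_lgb: float, threshold: float = 0.662) -> bool:
    """Flag a window when the blended score reaches the threshold."""
    return blend_scores(p_xgb, p_lgb) >= threshold


print(is_attack(0.9, 0.5))  # XGBoost is confident, LightGBM is unsure
print(is_attack(0.5, 0.9))  # the reverse case carries less weight
```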

Two-stage design:

  • Stage 1 (Rules): O(1) arithmetic checks — catches volumetric SYN, UDP, DNS, NTP floods instantly
  • Stage 2 (ML): XGBoost + LightGBM ensemble — catches evasive, application-layer attacks the rules miss
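
The control flow of the two stages can be sketched as follows. This is a minimal illustration, not the code in `06_two_stage.py`: the rule limits and feature names are hypothetical, the ML scorer is passed in as a plain callable, and the default Stage-2 threshold mirrors the `--s2-threshold 0.15` value used in the Usage section.

```python
from typing import Callable, Dict

# Hypothetical rule limits; the repository's actual thresholds are not shown here.
RULE_LIMITS = {"syn_rate": 1000.0, "udp_rate": 5000.0}


def stage1_rules(window: Dict[str, float]) -> bool:
    """Constant-time arithmetic checks against fixed per-feature limits."""
    return any(window.get(name, 0.0) > limit for name, limit in RULE_LIMITS.items())


def classify(window: Dict[str, float],
             ml_score: Callable[[Dict[str, float]], float],
             s2_threshold: float = 0.15) -> str:
    """Run the cheap rules first; call the ML scorer only if no rule fires."""
    if stage1_rules(window):
        return "attack (stage 1)"
    if ml_score(window) >= s2_threshold:
        return "attack (stage 2)"
    return "benign"


# A throttled SYN attack stays under the rule limit but still scores high in Stage 2.
print(classify({"syn_rate": 5000.0}, ml_score=lambda w: 0.0))
print(classify({"syn_rate": 200.0}, ml_score=lambda w: 0.6))
print(classify({"syn_rate": 5.0}, ml_score=lambda w: 0.01))
```

Ordering the stages this way keeps the expensive model off the hot path during large volumetric floods, where the rules alone are sufficient.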

Setup

# Clone into a folder named DDOS and create a virtual environment (Windows)
git clone <repo-url> DDOS
cd DDOS
python -m venv venv
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Usage

Step 1 — Build Feature Windows

venv\Scripts\python.exe src\pipeline\01_build_windows.py `
    --cic    data\flows_parquet_cicddos\CICDDOS2019.parquet `
    --mawi   data\flows_parquet_mawi\MAWI.parquet `
    --labels data\label_intervals.json `
    --out    data\windows `
    --stride 2.0 --mawi-frac 0.10 --rows-per-shard 500000
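
To make the windowing idea concrete, here is a small sketch of trailing windows of 2 s, 10 s, and 60 s evaluated every `stride` seconds. It only counts flows per window; the real `01_build_windows.py` computes many more features, and the function name and the half-open window convention `[t - w, t)` are assumptions.

```python
from bisect import bisect_left

WINDOWS = (2.0, 10.0, 60.0)  # window lengths in seconds, as in the architecture diagram


def window_counts(timestamps, stride=2.0):
    """Count flows in each trailing window [t - w, t) at every stride tick.

    `timestamps` are flow start times in seconds. Returns a list of
    (t, {window_length: flow_count}) tuples.
    """
    ts = sorted(timestamps)
    rows = []
    if not ts:
        return rows
    t = ts[0] + stride
    while t <= ts[-1] + stride:
        hi = bisect_left(ts, t)  # number of flows that started before t
        rows.append((t, {w: hi - bisect_left(ts, t - w) for w in WINDOWS}))
        t += stride
    return rows


rows = window_counts([0.0, 0.5, 1.0, 3.0, 9.5])
for t, counts in rows:
    print(t, counts)
```

Keeping the timestamps sorted lets each window be resolved with two binary searches instead of a scan over all flows.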

Step 2 — Train the Ensemble

# Train with sf_frac removed (recommended — v2_no_sf)
venv\Scripts\python.exe src\pipeline\02_train.py `
    --windows data\windows --out models\v2_no_sf `
    --fast --exclude-features sf_frac

Step 3 — Evaluate on Real CIC Test Data

venv\Scripts\python.exe src\pipeline\05_demo.py `
    --model models\v2_no_sf --windows data\windows `
    --exclude-features sf_frac

Step 4 — Run the Comprehensive Simulation

venv\Scripts\python.exe src\pipeline\06_two_stage.py `
    --model models\v2_no_sf --s2-threshold 0.15

Step 5 — Live Real-Time Demo (3 terminals)

# Terminal 1: Attack simulator
Remove-Item data\live_stream.jsonl -ErrorAction SilentlyContinue
venv\Scripts\python.exe src\realtime\simulate_attacks.py

# Terminal 2: Live dashboard
venv\Scripts\python.exe src\realtime\monitor_realtime.py --model models\v2_no_sf

# Terminal 3: Performance tracker (30-min checkpoints)
venv\Scripts\python.exe src\realtime\track_performance.py
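
The simulator and the dashboard communicate through a JSONL file, so the reader has to pick up newly appended lines without re-reading the whole file. The sketch below shows one way to do that by remembering a byte offset; it is an illustration only, and the function name and the `proto` field are hypothetical rather than the repository's actual schema.

```python
import json
import os
import tempfile


def read_new_records(path, offset=0):
    """Return (records, new_offset) for complete JSON lines appended since `offset`."""
    records = []
    with open(path, "rb") as fh:
        fh.seek(offset)
        while True:
            line = fh.readline()
            if not line.endswith(b"\n"):  # end of file or a half-written line
                break
            offset = fh.tell()
            if line.strip():
                records.append(json.loads(line))
    return records, offset


# Demo: two complete lines plus one partial line that is finished later.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "live_stream.jsonl")
    with open(path, "w") as fh:
        fh.write('{"proto": "tcp"}\n{"proto": "udp"}\n{"proto":')
    first, pos = read_new_records(path)
    with open(path, "a") as fh:
        fh.write(' "icmp"}\n')
    second, pos = read_new_records(path, pos)

print(first, second)
```

Stopping at the first line without a trailing newline avoids parsing a record that the writer has not finished yet.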

Generate Traffic Graph On-Demand

venv\Scripts\python.exe src\realtime\plot_traffic.py --bin 10 --last 60

Project Structure

DDOS/
├── src/
│   ├── pipeline/
│   │   ├── features.py            # 94-feature engineering library (shared)
│   │   ├── 01_build_windows.py    # Flow → windowed feature extraction
│   │   ├── 02_train.py            # XGBoost + LightGBM training
│   │   ├── 05_demo.py             # Real CIC test set evaluation
│   │   └── 06_two_stage.py        # 20-scenario comprehensive simulation
│   └── realtime/
│       ├── simulate_attacks.py    # Live flow generator (writes JSONL)
│       ├── monitor_realtime.py    # Rich terminal dashboard
│       ├── track_performance.py   # 30-min checkpoint tracker + graphs
│       └── plot_traffic.py        # Traffic volume spike visualizer
├── models/
│   ├── v2_no_sf/
│   │   ├── feature_cols.json      # 94 feature names (ordered)
│   │   └── meta.json              # threshold=0.662, blend weights
│   │   # xgb.json + lgb.txt excluded (.gitignore) — retrain from source
├── data/
│   └── label_intervals.json       # 9,260 CIC attack time intervals
├── requirements.txt
└── README.md

Features Engineered (94 total)

| Category | Features | What They Detect |
|---|---|---|
| Volume | flow_rate, pkt_rate, byte_rate | Flooding intensity |
| Protocol | syn_frac, dns_frac, frac_udp | Attack type identity |
| Asymmetry | frac_zero_resp, inbound_frac | No-response floods |
| Amplification | byte_ratio, large_flow_frac, ntp_frac | Reflection attacks |
| Entropy | H_src_ip, H_flow_bytes | Botnet uniformity |
| Cross-window | accel_pkt_rate, tw_std_byte_rate | Attack ramping detection |
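
As an illustration of two of these categories, the sketch below computes a Shannon entropy (in the spirit of `H_src_ip`) and a simple fraction (in the spirit of `syn_frac`) over the flows of one window. The exact definitions in `features.py` may differ; the base-2 logarithm and the helper names are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(values) -> float:
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def fraction(flags) -> float:
    """Share of True entries, e.g. the share of flows that are bare SYNs."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


# Four distinct sources give maximal entropy for four flows; one source gives zero.
print(shannon_entropy(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]))
print(shannon_entropy(["10.0.0.1"] * 4))
print(fraction([True, False, False, False]))
```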

All features use stateless log1p() normalization — no fitted state, no domain shift.
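
The point of a stateless transform is that it needs no fitting step, so the same function applies unchanged to training data and to live traffic. A minimal sketch using the standard library follows; the wrapper name `normalize` is hypothetical.

```python
import math


def normalize(value: float) -> float:
    """log1p(x) = ln(1 + x): defined at x = 0 and compresses heavy-tailed rates."""
    return math.log1p(value)


# Rates spanning five orders of magnitude land in a narrow, comparable range.
rates = [0.0, 9.0, 999.0, 99999.0]
scaled = [normalize(r) for r in rates]
print(scaled)
```

A fitted transformer such as QuantileTransformer instead stores the training distribution, which is why it can misbehave when live traffic is distributed differently from the training set.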


Key Engineering Decisions

| Problem | Root Cause | Fix | Impact |
|---|---|---|---|
| MemoryError at 82% | Unbounded per-destination buffer | 15K row flush limit | 32GB → 4GB RAM |
| 40% false positives on real traffic | QuantileTransformer domain shift | Replaced with log1p() | FPR: 40% → 2.6% |
| sf_frac feature leak | UDP attacks have sf_frac=0 by definition | Ablation → remove sf_frac | Recall: 93.9% → 94.9%, NTP: +12.3% |
| 0% ML scores in simulation | t_span=30s vs 120s flow density mismatch | Fixed t_span=120 | SYN score: 0.009 → 0.587 |
| XGBoost crash on load | CUDA-specific JSON format | tree_method=hist (CPU) | Full portability |
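
The first fix in the table, a per-destination buffer with a flush limit, can be sketched as below. This is an assumed design rather than the repository's code: the class name, the `sink` list, and the method names are hypothetical, and only the 15K row limit comes from the table.

```python
from collections import defaultdict


class BoundedFlowBuffer:
    """Per-destination row buffer that flushes once it reaches `max_rows`."""

    def __init__(self, max_rows=15_000, sink=None):
        self.max_rows = max_rows
        self.sink = sink if sink is not None else []  # receives (dst_ip, rows) batches
        self._rows = defaultdict(list)

    def add(self, dst_ip, row):
        buf = self._rows[dst_ip]
        buf.append(row)
        if len(buf) >= self.max_rows:
            self.flush(dst_ip)

    def flush(self, dst_ip):
        rows = self._rows.pop(dst_ip, [])
        if rows:
            self.sink.append((dst_ip, rows))

    def pending(self, dst_ip):
        return len(self._rows.get(dst_ip, []))


# Small limit so the flush behaviour is visible.
buffer = BoundedFlowBuffer(max_rows=3)
for i in range(7):
    buffer.add("192.0.2.1", {"flow_id": i})
print(len(buffer.sink), buffer.pending("192.0.2.1"))
```

Capping each buffer bounds peak memory by the number of active destinations times the flush limit, instead of letting a single heavily targeted destination grow without limit.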

Datasets

Note: Raw datasets are not included in this repository due to size (≈74 GB).

| Dataset | Source | Size | Role |
|---|---|---|---|
| CIC-DDoS2019 | University of New Brunswick, Canada | 72.5M flows | Training + evaluation |
| MAWI Backbone | WIDE Project, Japan | ~10% sample | Honest benign validation |

Download links:


Per-Attack Detection Rates (Live 30-min Session)

| Attack | Detection Rate |
|---|---|
| LDAP Amplification | 100.0% |
| Slow SYN (Rule-Evasion) | 100.0% (ML only; rules fire 0%) |
| NTP Amplification | 99.7% |
| SNMP Amplification | 99.7% |
| MSSQL Amplification | 99.6% |
| UDP Flood | 98.4% |
| DNS Amplification | 98.3% |
| SYN Flood | 97.0% |
| Normal Traffic FPR | 0.2% |
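
A per-attack detection rate is the share of samples carrying a given attack label that the system flagged. The sketch below shows that bookkeeping on a handful of made-up records; the function name and the input format are assumptions, and the numbers are not the project's results.

```python
from collections import defaultdict


def per_attack_detection_rate(records):
    """records: iterable of (attack_label, was_flagged) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for label, flagged in records:
        totals[label] += 1
        hits[label] += int(flagged)
    return {label: hits[label] / totals[label] for label in totals}


# Illustrative records only.
sample = [("SYN Flood", True)] * 3 + [("SYN Flood", False)] + [("UDP Flood", True)] * 2
rates = per_attack_detection_rate(sample)
print(rates)
```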

Related Work

This system follows the two-stage (rules + ML ensemble) approach and builds on the dataset described in:

  • Cloudflare Magic Transit engineering blog
  • Arbor Networks ATLAS system
  • Sharafaldin et al., "Developing Realistic Distributed Denial of Service (DDoS) Attack Dataset and Taxonomy" (CIC-DDoS2019), IEEE ICCST 2019

License

MIT License — see LICENSE for details.
