PROJECT OVERVIEW & WORKFLOW

ML-Guided Materials Database: Visual Summary


🎯 PROJECT IN ONE SENTENCE

Use machine learning to predict which crystal structures are most likely to be stable, then validate only those with expensive DFT calculations, reducing computational cost by 10-100×.


📊 THE PROBLEM

Traditional Approach (SLOW & EXPENSIVE)

New Composition: "Li₃FeO₄"
         ↓
Test ALL 230 space groups
         ↓
Run 230 × 5 = 1150 DFT calculations
         ↓
Each takes 2-10 hours
         ↓
Total: 2,300 - 11,500 CPU hours
         ↓
Find 1 stable structure

Cost: ~$5,000-20,000 per composition
Time: Weeks


🚀 OUR SOLUTION (FAST & SMART)

ML-Guided Approach

New Composition: "Li₃FeO₄"
         ↓
ML predicts top 3 space groups (0.1 seconds)
         ↓
Test only those 3 + nearby structures
         ↓
Run 3 × 5 = 15 DFT calculations (rapid)
         ↓
Keep best 3 candidates
         ↓
Run 3 high-accuracy DFT calculations
         ↓
Stable structure found and confirmed

Cost: ~$100-500 per composition (roughly 40-50× cheaper than $5,000-20,000)
Time: 1-2 days (roughly 14× faster, taking the traditional route as 2-4 weeks)
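The calculation counts in the two flows above follow from simple counting. A minimal sketch, assuming 5 trial structures per space group and a flat 2-10 CPU hours per DFT run; the function name `dft_budget` and its parameters are illustrative and not part of the project code:

```python
def dft_budget(n_space_groups, structures_per_group, n_refinements=0,
               hours_per_calc=(2, 10)):
    """Return (number of DFT runs, (min CPU hours, max CPU hours))."""
    n_calcs = n_space_groups * structures_per_group + n_refinements
    low, high = hours_per_calc
    return n_calcs, (n_calcs * low, n_calcs * high)


# Traditional: every space group, 5 structures each.
traditional = dft_budget(230, 5)

# ML-guided: top 3 space groups, 5 structures each, plus 3 high-accuracy runs.
ml_guided = dft_budget(3, 5, n_refinements=3)

print("traditional:", traditional)
print("ml-guided:  ", ml_guided)
```

The flat per-run cost is a simplification: the high-accuracy runs are slower than the rapid ones, so the ML-guided CPU-hour range is only a rough estimate.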


🔄 COMPLETE WORKFLOW DIAGRAM

┌──────────────────────────────────────────────────────────────────┐
│                    INPUT: COMPOSITION                            │
│                      "Li₃FeO₄"                                   │
└────────────────────────┬─────────────────────────────────────────┘
                         │
                         ↓
┌──────────────────────────────────────────────────────────────────┐
│              STAGE 1: ML PRE-SCREENING                           │
│  ┌────────────────────────────────────────────────────────┐      │
│  │  Model 1: Space Group Predictor                        │      │
│  │  Input: Composition features (132 descriptors)         │      │
│  │  Output: Top-5 space groups with probabilities         │      │
│  │  Example: [(227, 0.45), (225, 0.23), (141, 0.12)...]   │      │
│  └────────────────────────────────────────────────────────┘      │
│                        ↓                                         │
│  ┌────────────────────────────────────────────────────────┐      │
│  │  Model 2: Formation Energy Predictor                   │      │
│  │  Input: Composition + predicted SG                     │      │
│  │  Output: E_f = -3.2 ± 0.4 eV/atom                      │      │
│  └────────────────────────────────────────────────────────┘      │
│                        ↓                                         │
│  ┌────────────────────────────────────────────────────────┐      │
│  │  Model 3: Hull Distance Predictor                      │      │
│  │  Output: E_hull = 0.015 ± 0.05 eV/atom                 │      │
│  │  Decision: Likely STABLE (< 0.05 threshold)            │      │
│  └────────────────────────────────────────────────────────┘      │
│                        ↓                                         │
│  ┌────────────────────────────────────────────────────────┐      │
│  │  Model 4: Volume Predictor                             │      │
│  │  Output: V = 145 ± 8 Å³                                │      │
│  └────────────────────────────────────────────────────────┘      │
└────────────────────────┬─────────────────────────────────────────┘
                         │
                    ✅ PASS: E_hull < 0.05
                         │
                         ↓
┌──────────────────────────────────────────────────────────────────┐
│         STAGE 2: STRUCTURE GENERATION (CONSTRAINED)              │
│  ┌────────────────────────────────────────────────────────┐      │
│  │  USPEX / CALYPSO / Particle Swarm Optimization         │      │
│  │  Constraints from ML:                                  │      │
│  │  • Search only SG: 227, 225, 141                       │      │
│  │  • Volume range: 137-153 Å³ (from ML prediction)       │      │
│  │  • Generate: 300 structures (not 3000!)                │      │
│  └────────────────────────────────────────────────────────┘      │
│                        ↓                                         │
│           15 unique candidate structures                         │
└────────────────────────┬─────────────────────────────────────────┘
                         │
                         ↓
┌──────────────────────────────────────────────────────────────────┐
│          STAGE 3: RAPID DFT SCREENING (TIER 1)                   │
│  ┌────────────────────────────────────────────────────────┐      │
│  │  DFT Settings:                                         │      │
│  │  • Functional: PBE (fast)                              │      │
│  │  • k-points: 4×4×4 mesh (64 k-points)                  │      │
│  │  • Convergence: Medium                                 │      │
│  │  • Time: 2-5 hours per structure                       │      │
│  └────────────────────────────────────────────────────────┘      │
│                        ↓                                         │
│  Compare with ML predictions:                                    │
│  • Structure 1: E_f = -3.1 eV/atom ✓ (close to ML: -3.2)         │
│  • Structure 2: E_f = -2.9 eV/atom ✓                             │
│  • Structure 3: E_f = -3.15 eV/atom ✓ BEST                       │
│  • ... (12 more)                                                 │
│                        ↓                                         │
│  Filter: Keep top 3 candidates by energy                         │
└────────────────────────┬─────────────────────────────────────────┘
                         │
                         ↓
┌──────────────────────────────────────────────────────────────────┐
│        STAGE 4: HIGH-ACCURACY DFT (TIER 2)                       │
│  ┌────────────────────────────────────────────────────────┐      │
│  │  DFT Settings:                                         │      │
│  │  • Functional: SCAN / r²SCAN (accurate)                │      │
│  │  • k-points: 8×8×8 mesh (512 k-points)                 │      │
│  │  • Convergence: Tight                                  │      │
│  │  • Phonons: Yes (check dynamic stability)              │      │
│  │  • Time: 10-24 hours per structure                     │      │
│  └────────────────────────────────────────────────────────┘      │
│                        ↓                                         │
│  Final Results:                                                  │
│  Structure 3 (SG 227):                                           │
│  • E_f = -3.18 eV/atom                                           │
│  • E_hull = 0.000 eV/atom (STABLE! On convex hull)               │
│  • No imaginary phonon modes ✓                                   │
│  • Volume = 146.2 Å³                                             │
└────────────────────────┬─────────────────────────────────────────┘
                         │
                         ↓
┌──────────────────────────────────────────────────────────────────┐
│              STAGE 5: DATABASE ENTRY                             │
│  ┌────────────────────────────────────────────────────────┐      │
│  │  Material: Li₃FeO₄                                     │      │
│  │  Space Group: 227 (Fd-3m)                              │      │
│  │  Formation Energy: -3.18 ± 0.08 eV/atom                │      │
│  │  Energy Above Hull: 0.000 eV/atom (STABLE)             │      │
│  │  Volume: 146.2 ± 0.5 Å³                                │      │
│  │  Confidence: 95%                                       │      │
│  │  ──────────────────────────────────────────────────    │      │
│  │  Provenance:                                           │      │
│  │  • ML prediction: 2025-11-01                           │      │
│  │  • DFT validation: SCAN functional                     │      │
│  │  • Sources: MP, OQMD, JARVIS (training data)           │      │
│  │  • Uncertainty: From ensemble of 5 ML models           │      │
│  └────────────────────────────────────────────────────────┘      │
└──────────────────────────────────────────────────────────────────┘
                         │
                         ↓
                   ✅ COMPLETE!
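The hand-off from Stage 1 to Stage 2, and the Stage 3 filter, can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: the `Prescreen` class, `search_constraints`, and `keep_best` are hypothetical names, and only the three space-group probabilities shown in the diagram are used.

```python
from dataclasses import dataclass

HULL_THRESHOLD = 0.05  # eV/atom, the pass threshold shown in Stage 1


@dataclass
class Prescreen:
    sg_probs: dict       # space group number -> predicted probability
    e_hull: float        # predicted energy above hull, eV/atom
    volume: float        # predicted cell volume, Å^3
    volume_sigma: float  # predicted volume uncertainty, Å^3


def search_constraints(p, top_k=3):
    """Return Stage 2 constraints, or None if the hull gate fails."""
    if p.e_hull >= HULL_THRESHOLD:
        return None
    ranked = sorted(p.sg_probs, key=p.sg_probs.get, reverse=True)
    return {
        "space_groups": ranked[:top_k],
        "volume_range": (p.volume - p.volume_sigma,
                         p.volume + p.volume_sigma),
    }


def keep_best(energies, n=3):
    """Stage 3 filter: indices of the n lowest formation energies."""
    return sorted(range(len(energies)), key=energies.__getitem__)[:n]


# Values taken from the diagram above.
example = Prescreen({227: 0.45, 225: 0.23, 141: 0.12},
                    e_hull=0.015, volume=145.0, volume_sigma=8.0)
constraints = search_constraints(example)
best = keep_best([-3.1, -2.9, -3.15])
print(constraints, best)
```

Using one standard deviation for the volume window matches the 137-153 Å³ range in Stage 2; a wider window would trade search cost for robustness.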

📈 COMPUTATIONAL SAVINGS

Cost Comparison

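| Approach        | # DFT Calcs | CPU Hours    | $ Cost   | Time      | Success Rate |
|-----------------|-------------|--------------|----------|-----------|--------------|
| Traditional     | 1,150       | 2,300-11,500 | $5k-20k  | 2-4 weeks | ~95%         |
| Random sampling | 100         | 200-1,000    | $500-2k  | 3-7 days  | ~60%         |
| Our ML-guided   | 18          | 36-180       | $100-500 | 1-2 days  | ~85%         |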

Improvement:

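  • 64× fewer DFT calculations (1,150 vs 18)
  • 40-50× cheaper in compute cost ($5k-20k vs $100-500)
  • About 14× faster time to result (2-4 weeks vs 1-2 days)
  • Success rate only about 10 percentage points lower (~85% vs ~95%)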
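The improvement factors can be recomputed directly from the table rows. A quick check, pairing low with low and high with high endpoints and taking 2-4 weeks as 14-28 days (the dictionary layout and `ratio_range` helper are illustrative):

```python
# Rows of the cost comparison table.
traditional = {"calcs": 1150, "cost": (5000, 20000), "days": (14, 28)}
ml_guided = {"calcs": 18, "cost": (100, 500), "days": (1, 2)}


def ratio_range(a, b):
    """Ratio of matching endpoints (low/low, high/high) of two ranges."""
    return a[0] / b[0], a[1] / b[1]


calc_ratio = traditional["calcs"] / ml_guided["calcs"]
cost_ratio = ratio_range(traditional["cost"], ml_guided["cost"])
time_ratio = ratio_range(traditional["days"], ml_guided["days"])

print(f"calculations: {calc_ratio:.0f}x fewer")
print(f"cost: {cost_ratio[1]:.0f}-{cost_ratio[0]:.0f}x cheaper")
print(f"time: {time_ratio[0]:.0f}x faster")
```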

🧠 MACHINE LEARNING MODELS

Model Pipeline Summary

┌────────────────────────────────────────────────────────┐
│  INPUT: Composition String                             │
│         "Fe₂O₃"                                        │
└────────────────┬───────────────────────────────────────┘
                 │
                 ↓
┌────────────────────────────────────────────────────────┐
│  FEATURIZATION                                         │
│  Convert to 132 numerical features:                    │
│  • Elemental properties (weighted): 80 features        │
│  • Stoichiometry: 15 features                          │
│  • Crystal chemistry: 12 features                      │
│  • Historical patterns: 25 features                    │
└────────────────┬───────────────────────────────────────┘
                 │
                 ↓
┌─────────────┬─────────────┬─────────────┬──────────────┐
│  Model A    │  Model B    │  Model C    │  Model D     │
│  Space      │ Formation   │  Hull       │  Volume      │
│  Group      │  Energy     │  Distance   │  Predictor   │
└─────────────┴─────────────┴─────────────┴──────────────┘
       ↓             ↓             ↓             ↓
   Top 5 SGs       E_f±σ       E_hull±σ         V±σ

Model Specifications

Model A: Space Group Prediction

  • Architecture: CrabNet (Composition Transformer)
  • Training data: 1.5M structures from MP+OQMD+AFLOW+JARVIS
  • Performance: 85% top-5 accuracy
  • Output: Probability distribution over 230 space groups

Model B: Formation Energy

  • Architecture: Roost or CrabNet
  • Training: 1M formation energies (corrected for functional)
  • Performance: MAE = 0.12 eV/atom
  • Output: E_f with uncertainty estimate
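One common way to obtain an uncertainty estimate like this is to train several models and report the spread of their predictions, which is consistent with the "ensemble of 5 ML models" noted in the workflow. A minimal sketch with hypothetical member predictions (the numbers below are made up for illustration):

```python
from statistics import mean, stdev


def ensemble_estimate(predictions):
    """Mean and sample standard deviation over ensemble members."""
    return mean(predictions), stdev(predictions)


# Hypothetical E_f predictions (eV/atom) from five independently trained models.
member_predictions = [-3.25, -3.10, -3.30, -3.15, -3.20]
e_f, sigma = ensemble_estimate(member_predictions)
print(f"E_f = {e_f:.2f} ± {sigma:.2f} eV/atom")
```

The ensemble spread captures disagreement between models, not systematic error shared by all of them, so it is best read as a lower bound on the true uncertainty.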

Model C: Hull Distance

  • Architecture: Multi-task with Model B
  • Training: Hull distances from all databases
  • Performance: MAE = 0.04 eV/atom
  • Output: E_hull, binary stability prediction

Model D: Volume Prediction

  • Architecture: Random Forest or Neural Network
  • Input: Composition + predicted space group
  • Performance: MAPE = 4.5%
  • Output: Cell volume
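The performance figures above use three standard metrics: top-k accuracy for the classifier, mean absolute error (MAE) for the energies, and mean absolute percentage error (MAPE) for the volume. A plain-Python sketch of each, run on toy data rather than project results:

```python
def top_k_accuracy(prob_rows, true_labels, k=5):
    """Fraction of samples whose true label is among the k most probable."""
    hits = 0
    for probs, label in zip(prob_rows, true_labels):
        ranked = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += label in ranked
    return hits / len(true_labels)


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: two compositions with predicted space-group probabilities.
toy_probs = [{227: 0.5, 225: 0.3, 141: 0.2}, {227: 0.6, 62: 0.3, 14: 0.1}]
toy_acc = top_k_accuracy(toy_probs, [225, 14], k=2)
toy_mae = mae([-3.0, -2.0], [-3.1, -1.8])
toy_mape = mape([100.0, 200.0], [105.0, 190.0])
print(toy_acc, toy_mae, toy_mape)
```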

🔧 KEY FEATURES (DESCRIPTORS)

The 132 Features Explained

Category 1: Elemental Properties (Weighted by Stoichiometry)

  • Atomic radius (mean, range, std)
  • Electronegativity (mean, range, std)
  • Ionization energy (mean, range, std)
  • Atomic mass (mean, range, std)
  • Valence electrons (mean, sum, std)
  • Example for Fe₂O₃:
    • mean_radius = 0.4×0.72 + 0.6×0.66 = 0.684 Å
    • range_electronegativity = 3.44 - 1.83 = 1.61
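The Fe₂O₃ example can be reproduced with a few helper functions. A minimal sketch, assuming the radius and electronegativity values quoted above; the function names are illustrative and not taken from the descriptor guide:

```python
import math


def atomic_fractions(counts):
    """Convert element counts, e.g. {'Fe': 2, 'O': 3}, to atomic fractions."""
    total = sum(counts.values())
    return {el: n / total for el, n in counts.items()}


def weighted_mean(fractions, prop):
    return sum(f * prop[el] for el, f in fractions.items())


def weighted_std(fractions, prop):
    mu = weighted_mean(fractions, prop)
    return math.sqrt(sum(f * (prop[el] - mu) ** 2 for el, f in fractions.items()))


def prop_range(fractions, prop):
    values = [prop[el] for el in fractions]
    return max(values) - min(values)


radius = {"Fe": 0.72, "O": 0.66}              # Å, values from the example
electronegativity = {"Fe": 1.83, "O": 3.44}   # Pauling scale

frac = atomic_fractions({"Fe": 2, "O": 3})
mean_radius = weighted_mean(frac, radius)
std_radius = weighted_std(frac, radius)
en_range = prop_range(frac, electronegativity)
print(f"{mean_radius:.3f} Å, range EN = {en_range:.2f}")
```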

Category 2: Composition

  • Number of elements (2 for Fe₂O₃)
  • Stoichiometry ratios ([0.4, 0.6])
  • Mixing entropy: -Σ(f×ln f) = 0.67
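The mixing entropy value can be checked in a couple of lines. This sketch uses the natural logarithm and leaves out the Boltzmann constant, which matches the 0.67 quoted for fractions [0.4, 0.6]:

```python
import math


def mixing_entropy(fractions):
    """Ideal mixing entropy -sum(f * ln f), in units of k_B per atom."""
    return -sum(f * math.log(f) for f in fractions if f > 0)


s_mix = mixing_entropy([0.4, 0.6])
print(f"{s_mix:.2f}")
```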

Category 3: Crystal Chemistry

  • Radius ratio: r_cation/r_anion = 0.51
  • Ionic character: 0.57 (predominantly ionic)
  • Tolerance factor (for perovskites)
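For reference, the Goldschmidt tolerance factor of an ABX₃ perovskite is t = (r_A + r_X) / (√2 (r_B + r_X)). A small sketch follows; the SrTiO₃ radii are approximate Shannon ionic radii chosen for illustration and do not come from this project, and the function names are hypothetical:

```python
import math


def radius_ratio(r_cation, r_anion):
    """Cation-to-anion radius ratio."""
    return r_cation / r_anion


def tolerance_factor(r_a, r_b, r_x):
    """Goldschmidt tolerance factor for an ABX3 perovskite."""
    return (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))


# Approximate ionic radii in Å for SrTiO3 (illustrative inputs).
t = tolerance_factor(r_a=1.44, r_b=0.605, r_x=1.40)
print(f"t = {t:.3f}")
```

Values of t close to 1 are usually associated with the ideal cubic perovskite structure, which is why the factor is only meaningful for perovskite-type compositions.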

Category 4: Historical

  • Space group frequency for oxides
  • Prototype similarity (corundum-like for Fe₂O₃)
  • Typical hull distances for this chemistry

📊 DATABASE RECONCILIATION

The Multi-Database Problem

Same material, different values across databases:

Fe₂O₃ Properties:
┌──────────────┬──────────┬──────────┬──────────┐
│  Property    │    MP    │   OQMD   │  JARVIS  │
├──────────────┼──────────┼──────────┼──────────┤
│  Volume (Å³) │  101.2   │  101.5   │   99.8   │
│  E_f (eV)    │  -2.51   │  -2.48   │  -2.53   │
│  Band gap    │  2.2 eV  │  2.0 eV  │  2.1 eV  │
└──────────────┴──────────┴──────────┴──────────┘

Our Solution: Uncertainty Quantification

Unified Entry:
Fe₂O₃
├─ Volume: 100.8 ± 0.7 Å³
│  └─ Sources: {MP: 101.2, OQMD: 101.5, JARVIS: 99.8}
├─ Formation Energy: -2.51 ± 0.03 eV/atom
│  └─ Corrected for functional differences
└─ Confidence: 94%
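The unified volume is consistent with a simple mean and spread over the three sources. A minimal sketch, assuming the uncertainty is the population standard deviation (the summary does not say which estimator is used); `reconcile` is a hypothetical helper, and the functional correction applied to formation energies is not shown:

```python
from statistics import mean, pstdev


def reconcile(values_by_source):
    """Merge per-database values into one entry with a spread-based uncertainty."""
    values = list(values_by_source.values())
    return {
        "value": mean(values),
        "uncertainty": pstdev(values),
        "sources": dict(values_by_source),
    }


volumes = {"MP": 101.2, "OQMD": 101.5, "JARVIS": 99.8}  # Å^3, from the table
entry = reconcile(volumes)
print(f"Volume: {entry['value']:.1f} ± {entry['uncertainty']:.1f} Å^3")
```

Keeping the per-source values inside the entry preserves provenance, so a later change of estimator or weighting does not require re-querying the databases.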

🎯 EXPECTED OUTCOMES

After 16 Weeks

Technical Achievements:

  • ✅ ML models trained on 1.5M+ structures
  • ✅ 1,000+ new materials validated with DFT
  • ✅ 10-100× speedup demonstrated
  • ✅ Unified database with uncertainties

Publications:

  1. Main paper: Methodology (npj Computational Materials)
  2. Database paper: Description (Scientific Data)
  3. Case studies: Applications to specific chemistries

Software:

  • Open-source Python package
  • Web interface for queries
  • REST API for programmatic access
  • Integration with Materials Project

Impact:

  • Enable rapid discovery for experimentalists
  • Standard tool for materials screening
  • Reduce computational waste
  • Accelerate clean energy technologies

💡 INNOVATION HIGHLIGHTS

What Makes This Unique?

  1. Hierarchical approach

    • ML filters → rapid DFT → accurate DFT
    • Not seen in existing databases
  2. Multi-database reconciliation

    • First systematic approach
    • Uncertainty quantification built-in
  3. Phase prediction capability

    • Beyond T=0K, P=0
    • Temperature and pressure dependence
  4. Quality assurance

    • Every entry validated
    • Provenance tracking
    • Confidence scores

📚 RESOURCES CREATED

For Your Team

  1. Main Proposal (50 pages)

    • Complete project description
    • Scientific background
    • Implementation details
    • Timeline and milestones
  2. Descriptor Reference (30 pages)

    • All 132 features explained
    • Implementation examples
    • Best practices
  3. Quick Start Guide (20 pages)

    • Week 1 action items
    • Complete code examples
    • Troubleshooting
  4. This Visual Summary (10 pages)

    • Big picture overview
    • Workflow diagrams
    • Expected outcomes

🎓 LEARNING OUTCOMES

Skills Your Team Will Master

Technical:

  • Materials database APIs (MP, OQMD, JARVIS, AFLOW)
  • Machine learning (PyTorch/TensorFlow)
  • DFT calculations (VASP/Quantum ESPRESSO)
  • Database design (MongoDB/PostgreSQL)
  • Web development (API creation)

Scientific:

  • Thermodynamic stability theory
  • Crystal structure prediction
  • Electronic structure methods
  • Statistical analysis
  • Uncertainty quantification

Professional:

  • Large-scale project management
  • Scientific writing and publishing
  • Conference presentations
  • Collaborative research
  • Open-source development

🚀 GETTING STARTED CHECKLIST

Week 1 To-Do (Each Person)

Day 1: Setup

  • Install Python, conda, essential packages
  • Get Materials Project API key
  • Test data download (10 materials)
  • Join team communication channel

Day 2: Exploration

  • Download 1000 test materials
  • Explore data structure
  • Plot space group distribution
  • Understand key properties

Day 3: Features

  • Install matminer
  • Generate 132 features for dataset
  • Analyze feature correlations
  • Save featurized dataset

Day 4: First Model

  • Train Random Forest baseline
  • Achieve >30% top-1 accuracy
  • Plot feature importance
  • Save model

Day 5: Analysis & Meeting

  • Create visualizations
  • Prepare presentation
  • Attend team meeting
  • Plan Week 2

📞 CONTACT & COLLABORATION

Project Resources:

  • 📄 Full Proposal: ML_Materials_Database_Proposal.md
  • 🧪 Descriptor Guide: Descriptor_Reference_Guide.md
  • 🚀 Quick Start: Quick_Start_Week1_Guide.md
  • 📊 This Summary: Project_Overview_Visual.md

Code Repository: (To be created)

  • GitHub: [your-repo-here]

Communication:

  • Team Slack/Discord
  • Weekly meetings: Fridays 3pm
  • Office hours: By appointment

🎉 FINAL THOUGHTS

Why This Matters

Every major technology breakthrough requires new materials:

  • Better batteries → electric vehicles, grid storage
  • Efficient solar cells → renewable energy
  • Quantum computers → computational revolution
  • Green catalysts → sustainable chemistry

Traditional discovery: 10-20 years from lab to market

Our approach can help reduce this to 2-5 years by:

  • Predicting stable materials before synthesis
  • Reducing computational waste
  • Providing high-confidence targets for experimentalists

You're Not Just Building a Database

You're creating:

  • A tool that will accelerate discovery
  • A methodology that will be adopted widely
  • Publications that will be highly cited
  • Skills that will define your career
  • Impact on real-world technology

🏁 READY TO CHANGE MATERIALS SCIENCE?

Next Steps:

  1. 📖 Read the full proposal
  2. 🧪 Review descriptor guide
  3. 🚀 Start Week 1 tasks
  4. 👥 Connect with team
  5. 💪 Let's build something amazing!

"The best way to predict the future is to invent it." - Alan Kay

Now let's invent the future of materials discovery! 🚀🔬⚗️


DOCUMENT MAP

Project Documentation
│
├── 📘 ML_Materials_Database_Proposal.md (50 pages)
│   └─ Complete scientific proposal
│      ├─ Background & motivation
│      ├─ Detailed methodology
│      ├─ Implementation plan
│      ├─ Timeline & deliverables
│      └─ Budget & resources
│
├── 🧪 Descriptor_Reference_Guide.md (30 pages)
│   └─ Feature engineering manual
│      ├─ All 132 descriptors explained
│      ├─ Code examples
│      ├─ Best practices
│      └─ Implementation templates
│
├── 🚀 Quick_Start_Week1_Guide.md (20 pages)
│   └─ Hands-on week 1 tutorial
│      ├─ Day-by-day tasks
│      ├─ Complete code examples
│      ├─ Troubleshooting
│      └─ Expected results
│
└── 📊 Project_Overview_Visual.md (THIS FILE)
    └─ High-level summary
       ├─ Workflow diagrams
       ├─ Key concepts
       └─ Expected outcomes

Start with this file, then dive deeper into the others!


Version 1.0 | October 31, 2025