Use machine learning to predict which crystal structures are most likely to be stable, then validate only those with expensive DFT calculations, reducing computational cost by 10-100×.
Traditional approach:
New Composition: "LiβFeOβ"
↓
Test ALL 230 space groups
↓
Run 230 × 5 = 1,150 DFT calculations
↓
Each takes 2-10 hours
↓
Total: 2,300 - 11,500 CPU hours
↓
Find 1 stable structure
Cost: ~$5,000-20,000 per composition
Time: Weeks
ML-guided approach:
New Composition: "LiβFeOβ"
↓
ML predicts top 3 space groups (0.1 seconds)
↓
Test only those 3 + nearby structures
↓
Run 3 × 5 = 15 DFT calculations (rapid)
↓
Keep best 3 candidates
↓
Run 3 high-accuracy DFT calculations
↓
Stable structure found and confirmed
Cost: ~$100-500 per composition (10-50× cheaper)
Time: 1-2 days (about 10× faster)
INPUT: COMPOSITION "LiβFeOβ"

↓

STAGE 1: ML PRE-SCREENING
- Model 1: Space Group Predictor
  - Input: Composition features (132 descriptors)
  - Output: Top-5 space groups with probabilities
  - Example: [(227, 0.45), (225, 0.23), (141, 0.12), ...]
- Model 2: Formation Energy Predictor
  - Input: Composition + predicted SG
  - Output: E_f = -3.2 ± 0.4 eV/atom
- Model 3: Hull Distance Predictor
  - Output: E_hull = 0.015 ± 0.05 eV/atom
  - Decision: Likely STABLE (< 0.05 eV/atom threshold)
- Model 4: Volume Predictor
  - Output: V = 145 ± 8 Å³

↓ PASS: E_hull < 0.05 eV/atom

STAGE 2: STRUCTURE GENERATION (CONSTRAINED)
- Method: USPEX / CALYPSO / Particle Swarm Optimization
- Constraints from ML:
  - Search only SG: 227, 225, 141
  - Volume range: 137-153 Å³ (from ML prediction)
  - Generate: 300 structures (not 3000!)
- Result: 15 unique candidate structures

↓

STAGE 3: RAPID DFT SCREENING (TIER 1)
- DFT Settings:
  - Functional: PBE (fast)
  - k-points: 4×4×4 (~500 k-points)
  - Convergence: Medium
  - Time: 2-5 hours per structure
- Compare with ML predictions:
  - Structure 1: E_f = -3.1 eV/atom (close to ML: -3.2)
  - Structure 2: E_f = -2.9 eV/atom
  - Structure 3: E_f = -3.15 eV/atom (BEST)
  - ... (12 more)
- Filter: Keep top 3 candidates by energy

↓

STAGE 4: HIGH-ACCURACY DFT (TIER 2)
- DFT Settings:
  - Functional: SCAN / r²SCAN (accurate)
  - k-points: 8×8×8 (~2000 k-points)
  - Convergence: Tight
  - Phonons: Yes (check dynamic stability)
  - Time: 10-24 hours per structure
- Final Results, Structure 3 (SG 227):
  - E_f = -3.18 eV/atom
  - E_hull = 0.000 eV/atom (STABLE! On convex hull)
  - No imaginary phonon modes
  - Volume = 146.2 Å³

↓

STAGE 5: DATABASE ENTRY
- Material: LiβFeOβ
- Space Group: 227 (Fd-3m)
- Formation Energy: -3.18 ± 0.08 eV/atom
- Energy Above Hull: 0.000 eV/atom (STABLE)
- Volume: 146.2 ± 0.5 Å³
- Confidence: 95%
- Provenance:
  - ML prediction: 2025-11-01
  - DFT validation: SCAN functional
  - Sources: MP, OQMD, JARVIS (training data)
  - Uncertainty: From ensemble of 5 ML models

↓

COMPLETE!
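
The two decision points in the pipeline (the Stage 1 stability gate and the Stage 3 "keep the best by energy" filter) can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: `Candidate`, `passes_ml_gate`, and `select_for_tier2` are hypothetical names, and the space groups assigned to Structures 1 and 2 are placeholders, since the diagram only states the space group of Structure 3.

```python
from dataclasses import dataclass

HULL_THRESHOLD = 0.05  # eV/atom, the Stage 1 pass criterion
TIER2_KEEP = 3         # candidates promoted from rapid to high-accuracy DFT


@dataclass
class Candidate:
    label: str
    space_group: int
    rapid_dft_energy: float  # Tier 1 formation energy, eV/atom


def passes_ml_gate(predicted_e_hull: float) -> bool:
    """Stage 1: continue only if the predicted hull distance is below threshold."""
    return predicted_e_hull < HULL_THRESHOLD


def select_for_tier2(candidates, keep=TIER2_KEEP):
    """Stage 3 filter: keep the lowest-energy (most negative E_f) candidates."""
    return sorted(candidates, key=lambda c: c.rapid_dft_energy)[:keep]


# Tier 1 energies for the three structures listed in Stage 3.
# Space groups for Structures 1 and 2 are placeholders (assumption).
tier1 = [
    Candidate("Structure 1", 225, -3.10),
    Candidate("Structure 2", 141, -2.90),
    Candidate("Structure 3", 227, -3.15),
]

shortlist = []
if passes_ml_gate(0.015):  # predicted E_hull from Stage 1
    shortlist = select_for_tier2(tier1, keep=2)

print([c.label for c in shortlist])
```

Sorting by energy and slicing keeps the filter independent of how many candidates Stage 2 produces; with the full set of 15 structures the default `keep=3` would apply.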
| Approach | # DFT Calcs | CPU Hours | $ Cost | Time | Success Rate |
|---|---|---|---|---|---|
| Traditional | 1,150 | 2,300-11,500 | $5k-20k | 2-4 weeks | ~95% |
| Random sampling | 100 | 200-1,000 | $500-2k | 3-7 days | ~60% |
| Our ML-guided | 18 | 36-180 | $100-500 | 1-2 days | ~85% |
Improvement:
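- 64× fewer DFT calculations (1,150 vs. 18)
- 13× cheaper in compute cost (2,300 vs. 180 CPU hours, the least favorable comparison)
- 10× faster time to result
- Success rate only 10 percentage points lower (~85% vs. ~95%)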
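
The improvement figures can be checked directly against the table. The sketch below assumes the 13× figure compares the cheapest traditional run with the most expensive ML-guided run in CPU hours; that reading is an inference from the numbers, not something the table states.

```python
# Figures copied from the comparison table; ranges are (low, high).
traditional = {"dft_calcs": 1150, "cpu_hours": (2300, 11500)}
ml_guided = {"dft_calcs": 18, "cpu_hours": (36, 180)}

# Ratio of DFT calculation counts.
calc_ratio = traditional["dft_calcs"] / ml_guided["dft_calcs"]

# Least favorable CPU-hour comparison:
# cheapest traditional run vs. most expensive ML-guided run.
worst_case_cpu_ratio = traditional["cpu_hours"][0] / ml_guided["cpu_hours"][1]

# Success rates as fractions; the gap is expressed in percentage points.
success_gap_points = round((0.95 - 0.85) * 100)

print(f"{calc_ratio:.0f}x fewer DFT calculations")
print(f"{worst_case_cpu_ratio:.0f}x fewer CPU hours (least favorable case)")
print(f"{success_gap_points} percentage points lower success rate")
```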
INPUT: Composition String "Fe₂O₃"

↓

FEATURIZATION: convert to 132 numerical features
- Elemental properties (weighted): 80 features
- Stoichiometry: 15 features
- Crystal chemistry: 12 features
- Historical patterns: 25 features

↓

| Model | Task | Output |
|---|---|---|
| Model A | Space Group | Top 5 SGs |
| Model B | Formation Energy | E_f ± σ |
| Model C | Hull Distance | E_hull ± σ |
| Model D | Volume Predictor | V ± σ |
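
The ± σ attached to each output can come from an ensemble, as the pipeline's provenance entry suggests ("ensemble of 5 ML models"). A minimal sketch, assuming each ensemble member returns one scalar prediction; `ensemble_estimate` is a hypothetical helper and the five values are made up for illustration:

```python
from statistics import mean, stdev


def ensemble_estimate(predictions):
    """Return (mean, sample standard deviation) across ensemble members."""
    return mean(predictions), stdev(predictions)


# Hypothetical formation-energy predictions (eV/atom) from five models.
member_predictions = [-3.25, -3.10, -3.30, -3.15, -3.20]

e_f, sigma = ensemble_estimate(member_predictions)
print(f"E_f = {e_f:.2f} ± {sigma:.2f} eV/atom")
```

The spread across members is only a proxy for model uncertainty; it says nothing about systematic errors shared by all five models.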
Model A: Space Group Prediction
- Architecture: CrabNet (Composition Transformer)
- Training data: 1.5M structures from MP+OQMD+AFLOW+JARVIS
- Performance: 85% top-5 accuracy
- Output: Probability distribution over 230 space groups
Model B: Formation Energy
- Architecture: Roost or CrabNet
- Training: 1M formation energies (corrected for functional)
- Performance: MAE = 0.12 eV/atom
- Output: E_f with uncertainty estimate
Model C: Hull Distance
- Architecture: Multi-task with Model B
- Training: Hull distances from all databases
- Performance: MAE = 0.04 eV/atom
- Output: E_hull, binary stability prediction
Model D: Volume Prediction
- Architecture: Random Forest or Neural Network
- Input: Composition + predicted space group
- Performance: MAPE = 4.5%
- Output: Cell volume
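
Top-5 accuracy, the metric quoted for Model A, counts a prediction as correct when the true space group is among the five most probable classes. The sketch below shows the general top-k computation on toy data; `top_k_accuracy` is a hypothetical helper and the probabilities are invented for illustration.

```python
def top_k_accuracy(probabilities, true_labels, k=5):
    """Fraction of samples whose true label is among the k most probable classes."""
    hits = 0
    for probs, truth in zip(probabilities, true_labels):
        # Rank class labels from most to least probable and keep the first k.
        ranked = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += truth in ranked
    return hits / len(true_labels)


# Toy predictions: space group number -> probability (values are made up).
predictions = [
    {227: 0.45, 225: 0.23, 141: 0.12, 62: 0.20},
    {227: 0.10, 225: 0.50, 141: 0.30, 62: 0.10},
    {227: 0.05, 225: 0.15, 141: 0.20, 62: 0.60},
]
truths = [227, 141, 225]

top1 = top_k_accuracy(predictions, truths, k=1)
top2 = top_k_accuracy(predictions, truths, k=2)
print(top1, top2)
```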
Category 1: Elemental Properties (Weighted by Stoichiometry)
- Atomic radius (mean, range, std)
- Electronegativity (mean, range, std)
- Ionization energy (mean, range, std)
- Atomic mass (mean, range, std)
- Valence electrons (mean, sum, std)
- Example for Fe₂O₃:
  - mean_radius = 0.4×0.72 + 0.6×0.66 = 0.684 Å
  - range_electronegativity = 3.44 - 1.83 = 1.61
Category 2: Composition
- Number of elements (2 for Fe₂O₃)
- Stoichiometry ratios ([0.4, 0.6])
- Mixing entropy: -Σ(f × ln f) = 0.67
Category 3: Crystal Chemistry
- Radius ratio: r_cation/r_anion = 0.51
- Ionic character: 0.57 (predominantly ionic)
- Tolerance factor (for perovskites)
Category 4: Historical
- Space group frequency for oxides
- Prototype similarity (corundum-like for Fe₂O₃)
- Typical hull distances for this chemistry
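
The Fe₂O₃ numbers quoted in Categories 1 and 2 can be reproduced with a few lines of Python. This sketch uses only the radii, electronegativities, and atomic fractions given in the examples above; the variable names are illustrative and do not come from the project's code.

```python
import math

# Inputs quoted in the Fe2O3 examples above.
fractions = {"Fe": 0.4, "O": 0.6}             # atomic fractions
radius = {"Fe": 0.72, "O": 0.66}              # Å
electronegativity = {"Fe": 1.83, "O": 3.44}   # Pauling scale

# Stoichiometry-weighted mean of an elemental property.
mean_radius = sum(fractions[el] * radius[el] for el in fractions)

# Range = max - min over the elements present.
en_range = max(electronegativity.values()) - min(electronegativity.values())

# Ideal mixing entropy (in units of k_B): -sum(f * ln f).
mixing_entropy = -sum(f * math.log(f) for f in fractions.values())

print(f"mean_radius = {mean_radius:.3f} Å")
print(f"range_electronegativity = {en_range:.2f}")
print(f"mixing_entropy = {mixing_entropy:.2f}")
```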
Same material, different values across databases:
Fe₂O₃ Properties:

| Property | MP | OQMD | JARVIS |
|---|---|---|---|
| Volume (Å³) | 101.2 | 101.5 | 99.8 |
| E_f (eV/atom) | -2.51 | -2.48 | -2.53 |
| Band gap (eV) | 2.2 | 2.0 | 2.1 |

Unified Entry: Fe₂O₃
- Volume: 100.8 ± 0.7 Å³
  - Sources: {MP: 101.2, OQMD: 101.5, JARVIS: 99.8}
- Formation Energy: -2.51 ± 0.03 eV/atom
  - Corrected for functional differences
- Confidence: 94%
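
The unified volume is consistent with taking the mean and the population standard deviation of the three database values. The sketch below reproduces that number under this assumption; `reconcile` is a hypothetical helper. It is not applied to the formation energy, because the unified entry states that those values are first corrected for functional differences, and the correction is not described here.

```python
from statistics import mean, pstdev


def reconcile(values_by_source):
    """Combine per-database values into (mean, population standard deviation)."""
    values = list(values_by_source.values())
    return mean(values), pstdev(values)


# Fe2O3 cell volumes (Å³) from the table above.
volume = {"MP": 101.2, "OQMD": 101.5, "JARVIS": 99.8}

v_mean, v_spread = reconcile(volume)
print(f"Volume: {v_mean:.1f} ± {v_spread:.1f} Å³")
```

Keeping the per-source values alongside the reconciled number, as the unified entry does, lets a user see which database drives the spread.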
Technical Achievements:
- ✓ ML models trained on 1.5M+ structures
- ✓ 1,000+ new materials validated with DFT
- ✓ 10-100× speedup demonstrated
- ✓ Unified database with uncertainties
Publications:
- Main paper: Methodology (NPJ Computational Materials)
- Database paper: Description (Scientific Data)
- Case studies: Applications to specific chemistries
Software:
- Open-source Python package
- Web interface for queries
- REST API for programmatic access
- Integration with Materials Project
Impact:
- Enable rapid discovery for experimentalists
- Standard tool for materials screening
- Reduce computational waste
- Accelerate clean energy technologies
- Hierarchical approach
  - ML filters → rapid DFT → accurate DFT
  - Not seen in existing databases
- Multi-database reconciliation
  - First systematic approach
  - Uncertainty quantification built-in
- Phase prediction capability
  - Beyond T=0K, P=0
  - Temperature and pressure dependence
- Quality assurance
  - Every entry validated
  - Provenance tracking
  - Confidence scores
- Main Proposal (50 pages)
  - Complete project description
  - Scientific background
  - Implementation details
  - Timeline and milestones
- Descriptor Reference (30 pages)
  - All 132 features explained
  - Implementation examples
  - Best practices
- Quick Start Guide (20 pages)
  - Week 1 action items
  - Complete code examples
  - Troubleshooting
- This Visual Summary (10 pages)
  - Big picture overview
  - Workflow diagrams
  - Expected outcomes
Technical:
- Materials database APIs (MP, OQMD, JARVIS, AFLOW)
- Machine learning (PyTorch/TensorFlow)
- DFT calculations (VASP/Quantum Espresso)
- Database design (MongoDB/PostgreSQL)
- Web development (API creation)
Scientific:
- Thermodynamic stability theory
- Crystal structure prediction
- Electronic structure methods
- Statistical analysis
- Uncertainty quantification
Professional:
- Large-scale project management
- Scientific writing and publishing
- Conference presentations
- Collaborative research
- Open-source development
Day 1: Setup
- Install Python, conda, essential packages
- Get Materials Project API key
- Test data download (10 materials)
- Join team communication channel
Day 2: Exploration
- Download 1000 test materials
- Explore data structure
- Plot space group distribution
- Understand key properties
Day 3: Features
- Install matminer
- Generate 132 features for dataset
- Analyze feature correlations
- Save featurized dataset
Day 4: First Model
- Train Random Forest baseline
- Achieve >30% top-1 accuracy
- Plot feature importance
- Save model
Day 5: Analysis & Meeting
- Create visualizations
- Prepare presentation
- Attend team meeting
- Plan Week 2
Project Resources:
- Full Proposal: ML_Materials_Database_Proposal.md
- Descriptor Guide: Descriptor_Reference_Guide.md
- Quick Start: Quick_Start_Week1_Guide.md
- This Summary: Project_Overview_Visual.md
Code Repository: (To be created)
- GitHub: [your-repo-here]
Communication:
- Team Slack/Discord
- Weekly meetings: Fridays 3pm
- Office hours: By appointment
Every major technology breakthrough requires new materials:
- Better batteries → electric vehicles, grid storage
- Efficient solar cells → renewable energy
- Quantum computers → computational revolution
- Green catalysts → sustainable chemistry
Traditional discovery: 10-20 years from lab to market
Our approach can help reduce this to 2-5 years by:
- Predicting stable materials before synthesis
- Reducing computational waste
- Providing high-confidence targets for experimentalists
You're creating:
- A tool that will accelerate discovery
- A methodology that will be adopted widely
- Publications that will be highly cited
- Skills that will define your career
- Impact on real-world technology
Next Steps:
- Read the full proposal
- Review descriptor guide
- Start Week 1 tasks
- Connect with team
- Let's build something amazing!
"The best way to predict the future is to invent it." - Alan Kay
Now let's invent the future of materials discovery!
Project Documentation
- ML_Materials_Database_Proposal.md (50 pages)
  - Complete scientific proposal
  - Background & motivation
  - Detailed methodology
  - Implementation plan
  - Timeline & deliverables
  - Budget & resources
- Descriptor_Reference_Guide.md (30 pages)
  - Feature engineering manual
  - All 132 descriptors explained
  - Code examples
  - Best practices
  - Implementation templates
- Quick_Start_Week1_Guide.md (20 pages)
  - Hands-on week 1 tutorial
  - Day-by-day tasks
  - Complete code examples
  - Troubleshooting
  - Expected results
- Project_Overview_Visual.md (THIS FILE)
  - High-level summary
  - Workflow diagrams
  - Key concepts
  - Expected outcomes

Start with this file, then dive deeper into the others!
Version 1.0 | October 31, 2025