Machine Learning for Scientific Discovery
Tianhao Li · Ph.D. Candidate, Johns Hopkins
I build ML/AI systems for expensive scientific search and decision problems: active learning, surrogate modeling, LLM-assisted workflows, and closed-loop pipelines connecting models with high-fidelity simulation backends. Previously M.S. at Duke University.
I am a Ph.D. candidate in Materials Science and Engineering at Johns Hopkins University, working with Prof. Corey Oses, and an Amazon AI PhD Fellow (2025–2027).
My work focuses on building ML/AI systems for expensive scientific search and decision problems—active learning for data-efficient model adaptation, surrogate modeling over large combinatorial spaces, LLM-assisted discovery workflows, and closed-loop pipelines connecting machine learning with high-fidelity simulation backends.
Previously, I completed my M.S. at Duke University and my B.S. at Changsha University of Science and Technology.
Seeking Summer 2026 applied-science / research internships in ML for scientific discovery, simulation, or agentic workflows. Also open to research collaborations that bridge ML methods with domain-science backends.
Education & Honors
Current Projects
Active Learning for Scientific Simulations
Data-efficient active-learning workflow for adapting pretrained ML interatomic potentials to ultra-complex disordered materials. Reduced required simulation labels from 20,000 to 930 (95% reduction) while improving energy MAE from 0.061 to 0.026 eV/atom. Trained on >1.5M atom-step trajectories.
ML/AI Infrastructure for Scientific Discovery
Built and maintain a structured scientific data asset of 194,760 simulation records for model training, benchmarking, and retrieval. Contributed to LLM-assisted discovery interfaces supporting tool-calling, structured retrieval, and natural-language exploration over scientific datasets.
ML Screening for Fuel-Cell Catalysts
End-to-end ML workflow for feature engineering, surrogate modeling, and multi-objective ranking over 20,000+ compositions. Physics-informed random-forest models used to down-select platinum-free catalyst candidates under activity, stability, and cost constraints.
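As a rough illustration of what multi-objective down-selection can look like, the sketch below keeps only the candidates that no other candidate beats on every objective (a Pareto filter). It is not the project's actual ranking code: the function name `pareto_front`, the column order (activity error, instability, cost), and the convention that lower is better for all three are assumptions made for this example.

```python
import numpy as np

def pareto_front(scores):
    """Return indices of rows not dominated by any other row (lower is better).

    `scores` is an (n_candidates, n_objectives) array. The column meaning
    (activity error, instability, cost) is a hypothetical choice for this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    keep = []
    for i, row in enumerate(scores):
        # Another row dominates `row` if it is no worse on every objective
        # and strictly better on at least one.
        no_worse = np.all(scores <= row, axis=1)
        better = np.any(scores < row, axis=1)
        if not np.any(no_worse & better):
            keep.append(i)
    return keep

# Toy surrogate predictions for four hypothetical compositions.
toy_scores = [
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],  # worse than the first row on every objective
    [0.0, 3.0, 1.0],  # best activity, worst stability: a trade-off
    [1.0, 1.0, 2.0],  # ties the first row except for a higher cost
]
print(pareto_front(toy_scores))
```

This pairwise check is quadratic in the number of candidates, which is fine for a toy set; screening tens of thousands of compositions would more likely pre-filter on hard constraints first.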
Closed-Loop AI Discovery Pipelines
Developing closed-loop learning workflows combining surrogate models, automated large-scale screening, and escalation to high-fidelity evaluation under uncertainty. Building API-connected infrastructure for experiment–model feedback and campaign-scale prioritization.
Visual Tour
Key figures from recent publications — each captures a core idea, method, or finding.
SOAP-guided active learning for disordered materials
A closed-loop fine-tuning workflow adapts pretrained ML interatomic potentials to ultra-complex disordered oxides. SOAP-similarity selection reaches target MAE with ~109 samples, vs. thousands for random sampling.
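One plausible reading of similarity-guided selection is sketched below: from a pool of unlabeled structures, greedily pick the ones whose descriptor vectors are least similar to anything already labeled. This is a minimal sketch and not the published workflow. It assumes each structure has already been reduced to a fixed-length descriptor vector (for example, an averaged SOAP vector), uses cosine similarity as the similarity measure, and the function name `select_least_similar` is hypothetical.

```python
import numpy as np

def select_least_similar(pool, labeled, k):
    """Greedily choose up to k pool rows least similar to the labeled set.

    `pool` and `labeled` are 2-D arrays of nonzero descriptor vectors
    (assumed precomputed, e.g. averaged SOAP vectors).
    """
    pool = np.asarray(pool, dtype=float)
    labeled = np.asarray(labeled, dtype=float)
    # Normalize rows so that dot products are cosine similarities.
    pool_n = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    lab_n = labeled / np.linalg.norm(labeled, axis=1, keepdims=True)
    # For each candidate, similarity to its closest labeled structure.
    max_sim = (pool_n @ lab_n.T).max(axis=1)
    chosen = []
    for _ in range(min(k, len(pool))):
        idx = int(np.argmin(max_sim))
        chosen.append(idx)
        # Treat the new pick as labeled and update every candidate's
        # closest-similarity score against it.
        max_sim = np.maximum(max_sim, pool_n @ pool_n[idx])
        # Never pick the same candidate twice.
        max_sim[idx] = np.inf
    return chosen

# Toy 2-D descriptors: the labeled set already covers the [1, 0] direction.
toy_pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_labeled = [[1.0, 0.0]]
print(select_least_similar(toy_pool, toy_labeled, 2))
```

In a closed loop, the selected structures would be sent to the high-fidelity simulator for labels, the model fine-tuned, and the selection repeated until the target error is reached.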
AI-driven search for Pt-free fuel-cell catalysts
End-to-end ML pipeline screens 26,334 quinary HEA compositions under activity, stability, and sustainability constraints — matching Pt-like d-band behavior while using earth-abundant elements.
LLM-assisted exploration of ~200K simulation records
The CHAOS database curates 194,760 first-principles records across high-entropy oxides. A natural-language interface (CHAOS-GPT) turns user prompts into SQL-style retrieval and on-the-fly analysis.
High-entropy materials powering green energy
Invited review synthesizing the landscape of high-entropy materials for hydrogen generation & storage, batteries, electronics, catalysis, thermoelectrics, and biofuel applications.
Revisiting thermoelectrics with a high-entropy design
Invited review examining how configurational entropy — from doping through alloying to full high-entropy compositions — reshapes phonon scattering, band convergence, and thermoelectric performance.
Projects & Tools
S4E-MatForge
Led end-to-end curation of an open materials dataset with ~200K records: schema design, metadata standardization, QA/QC, packaging, and ML-ready release. Hosted on Hugging Face.
CHAOS-GPT
LLM-assisted screening interface for scientific candidate triage and dataset exploration. Supports tool-calling, structured retrieval, and natural-language queries over large scientific data repositories.
HE Fuel-Cell Discovery Pipeline
Open ML pipeline for feature engineering, surrogate modeling, and candidate screening in large composition spaces. Reusable code, standardized datasets, and analysis artifacts for multi-objective material discovery.
Selected Works
- Active-Learning Deep-Learning Models for Scientific Simulations
- The Search for High-Entropy Fuel-Cell Catalysts Using Disorder Descriptors
- High Entropy Powering Green Energy: Hydrogen, Batteries, Electronics, and Catalysis
- Beyond the Four Core Effects: Revisiting Thermoelectrics with a High-Entropy Design
- Effects of H₂–H₃ Phase Transition Reversibility on Ni-Rich Layered Oxide Cathodes
Full list on Google Scholar · ORCID: 0009-0008-5486-5940
Selected Presentations
Professional Activities
Reviewer for Nano Materials Science, Nanotechnology, and Journal of Alloys and Compounds — topics spanning ML for materials, high-entropy alloys, and energy-materials characterization.
Co-editing a Special Issue on data-driven design of high-entropy materials — coordinating submissions, peer review, and editorial decisions.
Mentored junior graduate students and research interns on ML-for-science projects at Johns Hopkins — from onboarding to paper-ready deliverables.
Teaching Experience
Weekly office hours for 40 engineering students; organized four quizzes and graded programming projects with detailed feedback.
Led five workshops on numerical integration, Monte Carlo, DFT+VASP, high-throughput AFLOW, and AI-driven discovery. Authored starter code and unit tests; graded ~40 submissions.
Skills
Get in Touch
I'm always happy to discuss research collaborations, questions about my work, or computational materials science in general.