This repository collects papers on Large Language Models (LLMs) for Chemistry. In addition to LLMs, many excellent works build on Pretrained Language Models (PLMs) such as BERT, BART, and T5; these are also worth including to foster future research. To differentiate between PLMs and LLMs, we highlight the titles of PLM-based papers in italic font.
We also collect useful links to prominent teams and popular projects.
😎 You are welcome to recommend missing papers and helpful links by opening an Issue or a Pull Request.
2022.05 Foundation Models of Scientific Knowledge for Chemistry: Opportunities, Challenges and Lessons Learned. ACL Workshop
2022.11 Galactica: A large language model for science. arXiv
2022.11 Is GPT-3 all you need for machine learning for chemistry? NIPS2022 Workshop
2023.08 Fine-tuning GPT-3 for machine learning electronic and functional properties of organic molecules. Chemical Science
2023.08 HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science. EMNLP2023
2023.10 MatChat: A Large Language Model and Application Service Platform for Materials Science. Chinese Physics B
2024.01 ChemDFM: Dialogue Foundation Model for Chemistry. arXiv
2024.01 Structured information extraction from scientific text with large language models. Nature Communications
2024.02 Leveraging large language models for predictive chemistry. Nature Machine Intelligence
2024.03 SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning. arXiv
2024.03 Domain-Agnostic Molecular Generation with Chemical Feedback. ICLR2024
2024.04 ChemLLM: A Chemical Large Language Model. arXiv
2024.04 BatGPT-Chem: A Foundation Large Model For Chemical Engineering. ChemRxiv
2024.04 Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models. ICLR2024
2024.04 LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset. arXiv
2024.05 nach0: Multimodal Natural and Chemical Languages Foundation Model. Chemical Science
2024.06 Fine-tuning large language models for chemical text mining. Chemical Science
2024.06 MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction. arXiv
2024.06 SynAsk: Unleashing the Power of Large Language Models in Organic Synthesis. arXiv
2024.06 PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes. arXiv
2024.09 SciDFM: A Large Language Model with Mixture-of-Experts for Science. arXiv

2023.03 Uni-Mol: A Universal 3D Molecular Representation Learning Framework. ICLR
2023.05 DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs. arXiv
2023.06 MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter. EMNLP2023
2023.06 MolFM: A Multimodal Molecular Foundation Model. arXiv
2023.08 BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine. arXiv
2023.09 3D-MoLM: Towards 3D Molecule-Text Interpretation in Language Models. ICLR2024
2023.11 InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery. arXiv
2023.12 MoleculeGPT: Instruction Following Large Language Models for Molecular Property Prediction. NIPS Workshop
2024.01 MolTC: Towards Molecular Relational Modeling In Language Models. ACL2024
2024.01 ReactXT: Understanding Molecular “Reaction-ship” via Reaction-Contextualized Molecule-Text Pretraining. ACL2024
2024.03 GIT-Mol: A multi-modal large language model for molecular science with graph, image, and text. arXiv
2024.06 HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment. arXiv
2024.06 3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization. ICLR2025
2024.06 MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension. arXiv
2024.07 MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations. Bioinformatics
2024.08 UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation. arXiv
2024.08 ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area. arXiv
2024.09 ChemDFM-X: Towards Large Multimodal Model for Chemistry. arXiv
2025.02 Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model. arXiv
2025.02 Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization. arXiv

2023.09 Generative Retrieval-Augmented Ontologic Graph and Multiagent Strategies for Interpretive Large Language Model-Based Materials Design. ACS Engineering Au
2023.10 Large language models for chemistry robotics. Autonomous Robots
2023.10 Monte Carlo Thought Search: Large Language Model Querying for Complex Scientific Reasoning in Catalyst Design. EMNLP2023
2023.11 Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis. arXiv
2023.12 Autonomous chemical research with large language models. Nature
2024.01 Structured Chemistry Reasoning with Large Language Models. ICML2024
2024.01 ChemReasoner: Heuristic Search over a Large Language Model's Knowledge Space using Quantum-Chemical Feedback. ICML2024
2024.02 An Autonomous Large Language Model Agent for Chemical Literature Data Mining. arXiv
2024.03 From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery. AAAI2024
2024.03 DRAK: Unlocking Molecular Insights with Domain-Specific Retrieval-Augmented Knowledge in LLMs. arXiv
2024.04 Integrating Chemistry Knowledge in Large Language Models via Prompt Engineering. arXiv
2024.04 Large Language Models are In-Context Molecule Learners. arXiv
2024.04 A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions. arXiv
2024.04 Large Language Models Open New Way of AI-Assisted Molecule Design for Chemists. ChemRxiv
2024.05 Augmenting large language models with chemistry tools. Nature Machine Intelligence
2024.05 ChatMOF: an artificial intelligence system for predicting and generating metal-organic frameworks using large language models. Nature Communications
2024.06 LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval and Distillation. arXiv
2025.01 ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning. ICLR2025
2025.03 MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses. ICLR2025

2017.09 Crowdsourcing multiple choice science questions. ACL Workshop
2020.09 ChemistryQA: A Complex Question Answering Dataset from Chemistry. OpenReview
2023.01 Assessment of chemistry knowledge in large language models that generate code. Digital Discovery
2023.03 Do Large Language Models Understand Chemistry? A Conversation with ChatGPT. Journal of Chemical Information and Modeling
2023.06 Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective. TKDE
2023.07 Can Large Language Models Empower Molecular Property Prediction? arXiv
2023.10 ReLM: Leveraging Language Models for Enhanced Chemical Reaction Prediction. arXiv
2023.10 GPT-MolBERTa: GPT Molecular Features Language Model for molecular property. arXiv
2023.12 What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks. NeurIPS2023
2023.12 SciMT-Safety: Control Risk for Potential Misuse of Artificial Intelligence in Science. arXiv
2024.01 SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research. AAAI2024
2024.01 SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis. arXiv
2024.02 Scientific Language Modeling: A Quantitative Review of Large Language Models in Molecular Science. arXiv
2024.02 Building a Dataset for Language+Molecules. arXiv
2024.03 Benchmarking Large Language Models for Molecule Prediction Tasks. arXiv
2024.03 MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension. arXiv
2024.02 SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. arXiv
2024.04 Are large language models superhuman chemists? arXiv
2024.06 SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models. arXiv
2024.07 ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering. arXiv
2024.09 VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning. arXiv
2024.09 ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models. arXiv
2024.10 Unraveling Molecular Structure: A Multimodal Spectroscopic Dataset for Chemistry. NIPS2024
2024.10 MassSpecGym: A benchmark for the discovery and identification of molecules. NIPS2024
2024.10 Can LLMs Solve Molecule Puzzles? A Multimodal Benchmark for Molecular Structure Elucidation. NIPS2024
2024.10 ∇²DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials. NIPS2024
2024.12 TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation. arXiv

2023.04 A Systematic Survey of Chemical Pre-trained Models. IJCAI2023
2023.09 Large Language Models in Molecular Discovery. NIPS2023 Workshop
2024.01 Scientific Large Language Models: A Survey on Biological & Chemical Domains. arXiv
2024.01 From Words to Molecules: A Survey of Large Language Models in Chemistry. IJCAI2024
2024.03 Bridging Text and Molecule: A Survey on Multimodal Frameworks for Molecule. arXiv
2024.03 Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey. arXiv
2024.06 A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery. arXiv
2024.07 A Review of Large Language Models and Autonomous Agents in Chemistry. arXiv