Materials & Chemistry Datasets

# Awesome Materials & Chemistry Datasets

About¶

A curated list of the most useful datasets in materials science and chemistry for training machine learning and AI foundation models. This includes experimental, computational, and literature-mined datasets—prioritizing open-access resources and community contributions.

This project aims to: - Catalog the best datasets by domain, type, quality, and size - Support reproducible research in AI for chemistry and materials - Provide a community-driven resource with contributions from researchers and developers

Contributing¶

Want to add a new dataset or improve metadata?

Fork the repository
Edit the appropriate dataset list or add a new entry
Submit a pull request with a brief description and download link OR
Submit as an issue

Datasets¶

Computational Datasets¶

Dataset	Domain	Size	Type	Format
BOOM: Benchmarks for Out-Of-distribution Molecules	Small molecules	10 Out-Of-Distribution Tasks (1M+ entries)	Computational	CSV
MSR-ACC/TAE25	Small molecules	77k CCSD(T)/CBS atomization energies	Computational	JSON
OMat24 (Meta)	Inorganic crystals	110M DFT entries	Computational	JSON/HDF5
OMol25 (Meta)	Molecular chemistry	100M+ DFT calculations	Computational	LMDB
OMC25	Molecular crystals	>27M structures	Computational	Zarr
Materials Project (LBL)	Inorganic crystals	500k+ compounds	Computational	JSON/API
Open Catalyst 2020 (OC20)	Catalysis (surfaces)	1.2M relaxations	Computational	JSON/HDF5
AFLOW	Inorganic materials	3.5M materials	Computational	REST API
OQMD	Inorganic solids	1M+ compounds	Computational	SQL/CSV
JARVIS-DFT (NIST)	3D/2D materials	40k+ entries	Computational	JSON/API
Carolina Materials DB	Hypothetical crystals	214k structures	Computational	JSON
NOMAD	Various DFT/MD	>19M calculations	Computational	JSON
MatPES	DFT Potential Energy Surfaces	~400,000 structures from 300K MD simulations	Computational	JSON
Vector-QM24	Small organic and inorganic molecules	836k conformational isomers	Computational	JSON
AIMNet2 Dataset	Non-metallic compounds	20M hybrid DFT calculations	Computational	JSON
RDB7	Barrier height and enthalpy for small organic reactions	12k CCSD(T)-F12 calculations	Computational	CSV
RDB19-Rad	ΔG of activation and of reaction for organic reactions in 40 common solvents	5.6k DFT + COSMO-RS calculations	Computational	CSV
QCML	Small molecules consisting of up to 8 heavy atoms	14.7B Semi-empirical + 33.5M DFT calculations	Computational	TFDS
QM9	Small organic molecules	134k molecules with quantum properties	Experimental	SDF/CSV
QM7/QM7b	Small molecules	7k molecules with atomization energies	Experimental	SDF/CSV
QMugs	Drug-like molecules	665 k mol / 2 M conf	Computational	HDF5
C2DB	2-D materials	~4 000 entries	Computational	JSON/API
ANI-1x / 1ccx	Small organic mol	5 M (DFT) + 0.5 M (CCSD)	Computational	HDF5
CoRE MOF 2019	Metal-organic frameworks	14 763 structures	Computational	CIF/JSON
QMOF Database	Metal-organic frameworks	20k+ structures (DFT)	Computational	CIF/JSON
Catalysis-Hub	Surface reactions	>100 k energies	Computational	JSON/API
ODAC23	MOF + CO₂/H₂O adsorption	38 M DFT calcs	Computational	HDF5
MOFX-DB	Gas adsorption in MOFs	3 M isotherm pts	Computational	CSV/HDF5
LeMat-Bulk	Inorganic materials (bulk)	6.7M structures (5.9M materials)	Computational	HuggingFace Dataset
LeMat-Traj	Inorganic materials (trajectories)	113M structures	Computational	HuggingFace Dataset
NeurIPS Open Polymer Prediction 2025	Polymers	~1,500 test polymers with MD-derived properties	Computational	CSV
Carbon Data	Carbon materials	22.9M atoms, 546 trajectories	Computational	EXTXYZ
MSR-ACC/TAE25	Small molecules (up to Ar)	76,879 total atomization energies	Computational	HDF5/CSV
DFT Solvation Energy Dataset	Small molecules	651,290 solvation energies in 5 solvents	Computational	CSV/JSON
MD Simulated Monomer Properties	Small molecules	410 molecules with thermodynamic properties	Computational	CSV/JSON
Multimodal Spectroscopic Dataset	Molecular spectroscopy	790k molecules with simulated spectra	Computational	HDF5/JSON
PubChemQCR	Small molecules (relaxation)	3.5M trajectories / 300M conformations	Computational	HuggingFace Dataset
MP-ALOE	Universal MLIPs (89 elements)	~1M r2SCAN DFT calculations	Computational	JSONL/MACE
Alexandria DB	Inorganic (1D–3D)	>5 M DFT calcs (PBE)	Computational	JSON/OPTIMADE/LMDB
Quantum‑Chemical Bonding DB (LOBSTER)	Solid‑state bonding analysis	1,520 compounds	Computational	JSON
MultixcQM9 (OpenQDC)	Small molecules (QM9, multi‑XC)	133k molecules	Computational	Torch/NumPy
SPICE (OpenQDC)	Drug‑like molecules	1 M conformers (energies & forces)	Computational	Torch/ASE
Matbench v0.1	Benchmarks (13 tasks)	10 datasets	Benchmark/Comp	CSV/HDF5
Matbench Discovery	Stability, κ, structures	Multiple files	Benchmark/Comp	CSV/ZIP
Materials Cloud Archives	Various DFT/MD workflows	1,000+ datasets	Computational	HDF5/JSON/CIF
MS25	MLIP benchmark (6 material systems)	Multi-system benchmark suite	Computational/Benchmark	HDF5
RadonPy Polymer Properties Data	Polymer	~1070 MD-calculated Properties	Computational	CSV
SHNITSEL Data	Organic Molecules	418,870 Post-HF-calculated Ground- and Excited-states Properties	Computational	XARRAY
Frustrated Lewis Pairs Database	Small Molecules	146 Metal-free FLPs	Computational	HTML
AQCat25	Catalysis	13.5M frames / 5K materials	Computational	Parquet/ASE DB
OMol25 Electronic Structures	Molecular chemistry	4M+ calculations	Computational	Raw DFT outputs
Unrestricted CCSD(T) Dataset For Organic Molecule Reactions	Organic reactions	3119 configurations	Computational
MC-PDFT-OPESf	Reaction kinetics	Diels-Alder reaction	Computational
Quantum Cluster Database	Nanoclusters	63,015 clusters	Computational	CSV/JSON
The Cambridge Cluster Database	Mixed Clusters	Multiple Files	Computational	Multiple Types
Battery Electrolyte Solvation/Ionization	Organic molecules	Thousands of molecules	Computational

Experimental Datasets¶

Dataset	Domain	Size	Type	Format
Crystallography Open Database (COD)	Crystal structures	~525k entries	Experimental	CIF/SMILES
NIST ICSD (subset)	Inorganic structures	~290k structures	Experimental	CIF
CSD (Cambridge)	Organic crystals	~1.3M structures	Experimental	CIF
opXRD	Crystal structures	92552 (2179 labeled)	Experimental	JSON
MDR SuperCon	Superconductivity	legacy superconductor database w/ material composition, structure, properties, and processes	Mixed
ChEMBL	Bioactive molecules	2.3M+ compounds with bioactivity data	Experimental	JSON/SDF
MoleculeNet	Molecular properties	700k+ compounds across 17 datasets	Mixed	CSV/SDF
ESOL	Aqueous solubility	1,128 compounds with solubility data	Experimental	CSV
FreeSolv	Hydration free energy	643 molecules with experimental data	Experimental	CSV
Lipophilicity	Octanol/water distribution	4,200 compounds with logD values	Experimental	CSV
PCBA	Bioassay screening	400k+ compounds, 128 bioassays	Experimental	CSV
HIV	Antiviral screening	41k compounds with HIV inhibition data	Experimental	CSV
BACE	Beta-secretase inhibitors	1,522 compounds with IC50 data	Experimental	CSV
BBBP	Blood-brain barrier permeability	2,053 compounds with permeability data	Experimental	CSV
Tox21	Toxicity screening	8k compounds, 12 toxicity targets	Experimental	CSV
ToxCast	High-throughput toxicity	8k compounds, 600+ assays	Experimental	CSV
SIDER	Drug side effects	1,427 drugs with adverse reactions	Experimental	CSV
ClinTox	Clinical trial toxicity	1,491 compounds with FDA approval status	Experimental	CSV
PDBbind	Protein-ligand binding	19k complexes with binding affinities	Experimental	PDB/SDF
BindingDB	Protein-ligand binding	2.8M+ binding data points	Experimental	CSV/SDF
ProtBENCH	Drug-target interactions	Protein family-specific datasets	Experimental	CSV
PDBench	Protein sequence design	595 protein structures, 40 architectures	Experimental	PDB
PDB-Struct	Structure-based protein design	Comprehensive protein design benchmark	Experimental	PDB
HTEM-DB	Thin-film composition libraries	140 k+ samples	Experimental	JSON/API
OCx24	Electrocatalyst inks	572 samples (+DFT)	Experimental	CSV
Polymer Genome	Polymers	20 k polymers	Experimental + Comp	CSV/JSON
CoRE MOF 2024	Metal-organic frameworks	40k+ experimental MOFs	Experimental	CIF
SAIR	Protein-ligand binding	1M+ complexes, 5.2M structures, 2.5TB	Experimental	3D/CSV
Anion Solvation DB	Anion solvation	~26k properties	Mixed	CSV
BigSolDB	Organic molecule solubility	~54k exp. values	Experimental	CSV
StarryData2	Experimental properties	Figshare dump (2023/2024)	Experimental	CSV/JSON
CRIPT Polymer Data	Polymers (synthesis, properties)	Growing community DB	Mixed	JSON/API
Catechol Benchmark	Solvent selection / Reaction yield	1200+ process conditions	Experimental	CSV
Leeds Solubility Data	Solubility	2.3k measurements	Experimental	CSV
BigSolDB 2.0	Solubility	103k+ values	Experimental	CSV/XLSX
OpenExp	Chemical reactions	274k pairs	Experimental	Varies
Battery Imaging Library (BIL)	Battery imaging	80+ scans, >500B voxels	Experimental	Various

LLM Training Datasets¶

Dataset	Domain	Size	Type	Format
ChemPile	Chemistry	75B+ tokens	LLM Training	Mixed
SmolInstruct	Small molecules	3.3M samples	LLM Training	JSON
CAMEL	Chemistry	20K problem-solution pairs	LLM Training	JSON
ChemNLP	Chemistry	Extensive, many combined datasets	LLM Training	JSON
ChemQA	Chemistry	Multimodal QA dataset	LLM Training	JSON
ChemLLMBench	Chemistry	8 chemistry tasks benchmark	LLM Training	JSON
ChemistryQA	Chemistry	4,500 questions across 200 topics	LLM Training	JSON
MaScQA	Materials Science	640 QA pairs	LLM Training	XLSX
SciCode	Research Coding in Physics, Math, Material Science, Biology, and Chemistry	338 subproblems	LLM Training	JSON
ChemData 700K	Chemistry (9 core tasks)	730K Q-A instruction pairs	LLM Training	JSON
MatSci-Instruct (HoneyBee)	Materials science	≈55K verified instructions	LLM Training	JSON
MoleculeQA	Molecular properties & safety	62K multiple-choice QA pairs	LLM Training	JSON
BioInstruct 25K	Biomedical / biochemistry	25K GPT-4 generated instructions	LLM Training	JSON
Lab-Bench	Biology	2,400+ questions for biology agents	LLM Training	JSON
ChemBench 4K	Chemistry competency benchmark	4,100 single-choice questions	LLM Training	JSON
GPQA Diamond	Biology, Physics, Chemistry	448 multiple-choice questions	LLM Training	JSON
MaCBench	Chemistry and materials science	Vision-language tasks	LLM Training	JSON
ChemBench	Chemistry	2,700+ question-answer pairs	LLM Training	JSON
MatText	Materials property prediction	2M structures	LLM Training	HuggingFace Dataset
SciAssess	Scientific literature analysis	Benchmark for LLMs in science	LLM Training	JSON
ZINC20-ML	Drug-like molecules (SMILES)	≈1B molecules	LLM Training	SMILES
PMC Open Access Subset	Biomedical full-text	3.4M+ articles	LLM Training	XML
MatScholar Task-Schema QA (MatSci-NLP)	Materials science (7 NLP tasks)	Tens of thousands of examples	LLM Training	JSON
Mol-Instructions	Chemistry	molecular, protein, and biochemical instructions	LLM Training	HuggingFace Dataset
USPTO-LLM	Chemical reactions	247K reactions	LLM Training	JSON/Graph
ChemRxivQuest	Chem literature QA	970 QA pairs	LLM Training	JSON
USPTO-Lowe	Patent reactions	1.8 M reactions	Literature-mined	RXN/SMILES
MolTextNet	Small molecules with text	2.5M molecule-text pairs	LLM Training	HuggingFace Dataset
MolOpt-Instructions	Molecule optimization	1.18M instruction-based optimization tasks	LLM Training	HuggingFace Dataset
TextEdge	Crystal properties	Crystal text descriptions with properties	LLM Training	JSON
LAMBench-TrainingSet-v1	Materials structures	19.8M structures for Large Atom Models	LLM Training	Various
LLM4Mat	Materials property prediction	1.9M crystal structures, 45 properties, 3 modalities	LLM Training	Various
LLM-EO	Transition metal complexes / Optimization	1.37M TMC space explored	LLM Training	GitHub
Flavor Analysis and Recognition Transformer	Molecular taste prediction	Multi-class taste classification dataset	LLM Training	SMILES/JSON
SCQA (Solar Cell QA)	Solar cells	47K QA pairs	LLM Training	JSON
ScienceQA	K–12 science, multimodal MCQs w/ lectures & explanations	21,208 Qs	LLM Training/Eval	JSON
SciBench	College-level scientific problem solving (math/chem/phys)	Open & closed sets	LLM Eval	PDF/JSON
MegaScience	Scientific reasoning (7 disciplines)	1.25M instances (650k reasoning questions from 12k textbooks)	LLM Training	HuggingFace Dataset
Mat-Instructions	Inorganic materials	~30k instructions	LLM Training	JSON
Open Materials Guide (OMG)	Materials synthesis	17K synthesis recipes	LLM Training	JSON
ChemDFM	Chemistry	34B tokens / 2.7M instructions	LLM Training	HuggingFace
ChemTable	Chemistry Tables	Large-scale benchmark	LLM Training/Benchmark	JSON
ChemCoTBench	Molecular reasoning	Annotated datasets	LLM Training/Benchmark	HuggingFace Dataset

Literature-mined & Text Datasets¶

Dataset	Domain	Size	Type	Format
PubChem	Molecules & data	119M compounds	Literature	SMILES/SDF
Open Reaction Database (ORD)	Synthetic reactions	~1M reactions	Experimental/Lit	JSON
PatCID (IBM)	Chemical image data	81M images / 13M mols	Literature	PNG/SMILES
MatScholar	NLP corpus (materials)	5M+ abstracts	Literature	JSON/Graph
Matbench (metadata/text tasks)	Text/meta ML tasks	13 tasks	Literature/Benchmark	CSV
OpenQDC Hub	QM molecules & reactions	1.5 B geometries	Literature/Computational	Python API/NPZ
L2M3 - Large Language Model MOF Miner	Metal-organic frameworks	from >40k articles	Literature-mined	CSV

🌊 Computational Fluid Dynamics, PDE & Engineering Datasets¶

Dataset	Domain	Size	Type	Format
PDEBench	PDE solving / Scientific ML	Multiple datasets	Benchmark / Simulation	HDF5/PyTorch
BLASTNet	Fluid mechanics / Reacting flows	17 TB	Simulation / CFD	HDF5/NPY
Johns Hopkins Turbulence DB (JHTDB)	DNS/LES turbulence (9 canonical flows)	≈ 350 TB	Simulation	Web API / HDF5 cutouts
Airfoil CFD 2k	1,830 airfoils × 25 AoA × 3 Re	~6 GB (250 k cases)	Simulation	HDF5
PDEArena (collection)	2-D Navier–Stokes, Shallow-Water, 3-D Maxwell	≈ 100 GB (4 datasets)	Simulation	Torch / HDF5
WeatherBench 2	Global weather reanalysis (ERA5, 1979-2023)	≈ 5 TB	Reanalysis	NetCDF/Zarr
UT Austin Channel-DNS Suite	Incompressible channel flow Re_τ 180 – 5200	≈ 10 TB	Simulation	Binary / ASCII
Compressible TPC DNS DB	Compressible channel flow (25 M, Re_τ*)	~2 GB	Simulation	TXT tables
Curated RANS ↔ DNS Dataset	29 geometries, 4 RANS models w/ DNS/LES labels	1.1 GB	Simulation	HDF5/CSV
NASA Common Research Model (CRM)	Aircraft CRM geom. + wind-tunnel & CFD results	Multi-GB	Mixed (Exp + Sim)	CAD / CSV / Tecplot
Darcy-Flow (FNO)	2-D porous-media pressure fields (∇·k∇u = f)	≈ 1 GB (10 k samples)	Simulation	HDF5
HiFi-TURB LES/DNS	High-fidelity LES/DNS for complex 3D flows	Multi-case suite	Simulation (DNS/LES)	HDF5/NetCDF
NASA High Lift Prediction Workshop (HLPW)	High-lift aircraft configurations	Multi-GB	Mixed (exp + CFD)	CAD/CSV/Tecplot
High-Speed TBL DNS DB	Compressible turbulent boundary layers	DNS database	Simulation	HDF5
ML Turbulence (Kaggle)	RANS Reynolds stress tensor data	~GB scale	Benchmark/Simulation	CSV/HDF5

Proprietary Datasets (for reference)¶

Dataset	Domain	Size	Use Case Notes
CAS Registry	Chemical substances	250M+ substances	Industry standard for molecule indexing
Reaxys (Elsevier)	Reactions & properties	Millions of reactions	Rich curated literature reaction data
Citrine Informatics DB	Experimental materials	Private	Materials ML platform w/ industry data
CSD (Cambridge)	Organic crystals	1.3M+	Gold-standard X-ray structures
PoLyInfo	Polymers & properties	500k+ data points / Experimental	Polymer properties from literature sources

Dataset Resources¶

The Materials Data Facility - Over 100 TB of open materials data. #TODO list some of these in the tables above
Foundry-ML search Foundry - 61 structured datasets ready for download through a Python client #TODO list some of these in the tables above

TODO¶

Add all OpenQDC datasets https://www.openqdc.io/datasets
A dataset on solubilities of gases in polymers (15 000 experimental measurements of 79 gases' uptakes (0.01–50 wt%) in 102 different polymers, pressures from 1 × 10−3 to 7 × 102 bar and temperatures from 233 to 508 K, includes nearly 500 solvent–polymer systems). Optimized structures of various repeating units are included. Should it be of interest for you, it is available here: Data
Add Materials Cloud Datasets
Classify Atomly. A bit challenging with non-English
Look into adding NOMAD for experimental data as well
Add A Quantum-Chemical Bonding Database for Solid-State Materials Part 1: https://zenodo.org/records/8091844 Part 2: https://zenodo.org/records/8092187
Add QM datasets. http://quantum-machine.org/datasets/
Find link for | ChemRxivQuest | Chemistry literature QA | 970 curated QA pairs | LLM Training | JSON | CC BY 4.0 | Open | ChemRxivQuest |
Find new link for USPTO-Reactions | USPTO Reactions | Organic reactions | 1.8M reactions | Literature | RXN/SMILES | Open | Open |
Find dataset link for | SciCUEval | Multidomain scientific comprehension (bio/chem/phys/matsci) | 10 sub-datasets | LLM Eval | JSON/PDF | Open | Open |
Find dataset for | MatSciKB | Materials science KB | 38.5k entries (20k papers, 3.6k Wikipedia, 1.9k textbooks, 10.5k datasets) | Literature | Structured text | Open | Open |

License¶

This project is licensed under the MIT License. Each dataset listed has its own license, noted in the table. Always check the source's license before using the data in your project.

Acknowledgements¶

The primary effort of Ben Blaiszik on this project was performed under financial assistance award 70NANB24H049 / MML24-1001 from the National Institute of Standards and Technology (NIST).

Thanks to the open data and research communities including: - Meta AI FAIR - The Materials Data Facility / Foundry-ML - NIST JARVIS and Materials Project - LBL, MIT, CCDC, FIZ Karlsruhe - Contributors to Open Catalyst, PubChem, ORD, and AFLOW - Developers of open chemistry toolkits (RDKit, Open Babel)

Citation¶

If this repository was helpful in your work, feel free to cite or star the repo. You can also reference the underlying dataset publications linked above.

Changelog¶

This Changelog is autogenerated, there may be errors.

October 2025¶

Added 18 new datasets focusing on catalysis, reaction kinetics, cluster chemistry, experimental solubility, literature mining, and foundation models to enhance resources for computational chemistry and machine learning applications.

🧮 Computational Datasets (7 datasets)¶

AQCat25: The AQCat25 dataset provides a large and diverse collection of 13.5 million DFT calculation trajectories, encompassing approximately 5K materials and 47K intermediate-catalyst systems. It is designed to complement existing large-scale datasets by providing calculations at higher fidelity and including critical spin-polarized systems, which are essential for accurately modeling many industrially relevant catalysts.
OMol25 Electronic Structures Dataset: The OMol25 Electronic Structures dataset includes the raw DFT outputs, electronic densities, wavefunctions, and molecular orbital information for over 4M million high-accuracy quantum chemical calculations. We see this as a transformative opportunity to develop higher quality partial charges, partial spins, and advanced electronic features to unlock the next generation of physics-informed ML models.
Unrestricted CCSD(T) Dataset For Organic Molecule Reactions: Dataset of 3119 organic molecule configurations at gold-standard quantum accuracy with automated workflows for unrestricted CCSD(T) calculations. Includes a transferable MLIP trained on UCCSD(T) data, showing significant improvements in force and activation energy accuracy.
MC-PDFT-OPESf: This work combines multi-configuration pair-density functional theory (MC-PDFT) as an accurate and eﬃcient multireference electronic structure method with on-the-fly probability enhanced sampling flooding (OPESf) as an enhanced sampling method capable of accelerating reactive transitions. MC-PDFT–OPESf provides reaction rates in agreement with experiments at a fraction of the computational cost required by conventional unbiased ab-initio calculations.
Quantum Cluster Database: A database of 63015 low-energy atomically precise nanoclusters for 55 elements across the periodic table, including main group and transition metal elements.
Cambridge Cluster Database: A collection of results from global optimizations for a variety of cluster systems, including Lennard-Jones, metal, molecular, and ionic clusters. The database is continuously updated with new results from published papers.
Battery Electrolyte Solvation/Ionization: This dataset presents molecular properties critical for battery electrolyte design, specifically solvation energies, ionization potentials, and electron affinities for thousands of organic molecules from QM9, EGP, GDB17, and ZINC.

🧪 Experimental Datasets (5 datasets)¶

Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning, providing the first-ever transient flow dataset for machine learning benchmarking, covering over 1200 process conditions. This dataset focuses on solvent selection, a task that is particularly difficult to model theoretically.
BNNLab/Solubility_data: Leeds Solubility Data: Curated solubility data in organic solvents and water and descriptors for solubility prediction.
BigSolDB 2.0: A comprehensive dataset of 103,944 experimentally measured solubility values of 1,448 organic compounds in 213 solvents reported in 1,595 literature peer-reviewed articles.
OpenExp: Features 274,439 pairs of chemical reactions and their corresponding step-by-step instructions of experimental procedures. This dataset, compiled from the USPTO-Applications and ORD databases.
Battery Imaging Library (BIL): An open, curated collection of multi-modal and multi-length scale battery imaging datasets, featuring over 80 scans and 500 billion voxels of data from single particles to full cells.

📚 LLM Training Datasets (5 datasets)¶

Mat-Instructions: A large-scale inorganic material instruction dataset with ~30k instruction-response pairs, designed to unlock the potential of LLMs in materials science.
Open Materials Guide (OMG): A dataset of 17K high-quality, expert-verified synthesis recipes from open-access literature, which forms the basis for the AlchemyBench benchmark for LLM-guided synthesis prediction.
ChemDFM: A pioneering LLM for chemistry trained on 34B tokens from chemical literature and textbooks, and fine-tuned using 2.7M instructions. As a result, it can understand and reason with chemical knowledge in free-form dialogue.
ChemTable: A large-scale benchmark of real-world chemical tables curated from the experimental sections of literature. ChemTable supports table recognition and table understanding tasks to advance scientific reasoning in chemistry.
ChemCoTBench: A reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations to formalize chemical problem-solving into transparent, step-by-step workflows for tasks like molecular optimization and reaction prediction.

📖 Literature-mined & Text Datasets (1 dataset)¶

L2M3 (Large Language Model MOF Miner): A database of MOF synthesis conditions and properties extracted from over 40,000 research articles using LLMs, enabling analysis of synthesis-structure-property relationships.

August 2025¶

Enhanced scientific reasoning capabilities and machine learning interatomic potential benchmarking with 6 new high-quality datasets for AI scientists and materials researchers.

🧮 Computational Datasets (3 datasets)¶

OMC25: A collection of over 27 million molecular crystal structures containing 12 elements and up to 300 atoms in the unit cell. The dataset was generated from dispersion-inclusive density functional theory (DFT) relaxation trajectories of over 230,000 randomly generated molecular crystal structures of around 50,000 organic molecules.
MS25: Comprehensive benchmark dataset for evaluating machine learning interatomic potentials (MLIPs) across 6 diverse materials systems including MgO surfaces, liquid water, zeolites, catalytic Pt surface reactions, high-entropy alloys, and disordered Zr-oxides. Evaluates 5 MLIP architectures (MACE, NequIP, Allegro, MTP, Torch-ANI) with focus on derived physical observables beyond traditional energy/force metrics. Demonstrates that equivariant MLIPs offer 1.5-2× improvements over nonequivariant models in complex systems, while highlighting the importance of explicit validation of physical properties rather than relying solely on error metrics.
Added the * Frustrated Lewis Pairs Database ** of 146 Metal-free FLPs

📚 LLM Training Datasets (4 datasets)¶

MaCBench: A comprehensive benchmark for evaluating how vision-language models handle real-world chemistry and materials science tasks across three core aspects: data extraction, experimental understanding, and results interpretation. Reveals fundamental limitations in spatial reasoning and cross-modal information synthesis in leading models.
ChemBench: A cutting-edge framework to evaluate the chemical knowledge and reasoning capabilities of large language models (LLMs). It includes over 2,700 curated question-answer pairs across diverse chemistry topics and uniquely encodes chemical semantics, enabling models to process and reason about molecules and equations.
MatText: A comprehensive benchmarking framework spanning multiple representations and model scales, which finds that LLMs consistently fail to capture coordinate information while excelling at category patterns. This geometric blindness persists regardless of model size, dataset scale, or text representation strategy.
MegaScience: Large-scale scientific reasoning dataset featuring 1.25 million high-quality instances across 7 scientific disciplines. Includes TextbookReasoning component with 650k reasoning questions extracted from 12,000 university-level textbooks, providing truthful reference answers for training AI scientists. Developed through systematic ablation studies and comprehensive evaluation across 15 benchmarks, demonstrating superior performance and training efficiency compared to existing open-source scientific datasets.

July 2025¶

Expanded the collection into new scientific domains with 31 new datasets, introducing benchmarks for physics-based machine learning, adding comprehensive quantum mechanics datasets, expanding materials science resources, and enhancing scientific evaluation benchmarks.

🌊 Computational Fluid Dynamics, PDE & Engineering Datasets (15 datasets)¶

PDEBench: A comprehensive benchmark suite for scientific machine learning featuring a wide range of Partial Differential Equations. It provides large, ready-to-use datasets for challenging physics problems, supporting both forward and inverse modeling.
BLASTNet: A 17 TB collection of high-fidelity fluid mechanics simulation datasets for ML applications in automotive, propulsion, and energy sectors. It includes code and pre-trained models for tasks like turbulence modeling and spatio-temporal prediction.
JHTDB: multi-terabyte DNS/LES portal with isotropic, channel, MHD, boundary-layer and atmospheric datasets.
Airfoil CFD 2k: DOE/NREL benchmark: 1,830 shapes × 250 k RANS simulations; HDF5 + AWS mirror.
PDEArena: Hugging-Face org offering Navier–Stokes, Shallow-Water & Maxwell tensors; MIT license.
WeatherBench 2: ERA5-derived Zarr cubes for data-driven medium-range forecasting; MIT.
UT Austin DNS Suite: public HTTP server with Reτ 180–5200 channel data & statistics.
Compressible TPC DNS DB: 25 Reynolds–Mach cases, plain-text statistics (Mendeley Data).
Curated RANS ↔ DNS: Scientific Data descriptor + Kaggle mirror for ML turbulence closures.
NASA CRM: open CAD, grids, wind-tunnel Cp & force/moment datasets for the community benchmark.
Darcy Flow (FNO): canonical permeability→pressure dataset used in FNO/PINO papers.
HiFi-TURB LES/DNS: EU-funded project providing high-fidelity Large Eddy Simulation and Direct Numerical Simulation datasets for complex 3D turbulent flows, supporting advanced turbulence modeling and AI/ML applications in computational fluid dynamics.
NASA High Lift Prediction Workshop (HLPW): Multi-phase workshop datasets featuring high-lift aircraft configurations with comprehensive experimental validation data, CAD geometries, and CFD solutions for aerodynamic modeling and validation.
High-Speed TBL DNS DB: Specialized database of Direct Numerical Simulation data for compressible turbulent boundary layers, providing detailed flow field information for high-speed aerodynamic applications and turbulence model development.
ML Turbulence (Kaggle): Community-contributed dataset featuring RANS Reynolds stress tensor data with ground truth labels, providing a standardized benchmark for machine learning approaches to turbulence modeling.

🧮 Computational Datasets (7 datasets)¶

PubChemQCR: A massive dataset of molecular relaxation trajectories for ~3.5 million small molecules, containing over 300 million conformations with energy and force labels. It is the largest public dataset of its kind, designed to accelerate the development of machine learning interatomic potentials (MLIPs).
MP-ALOE: Nearly 1 million DFT calculations using the accurate r2SCAN meta-generalized gradient approximation, covering 89 elements. Created using active learning and primarily consisting of off-equilibrium structures, MP-ALOE is designed for training universal machine learning interatomic potentials (UMLIPs) with strong performance on thermochemical properties, force prediction, and physical soundness under extreme conditions.
Alexandria DB: Massive computational materials database containing over 5 million DFT calculations using PBE functional for 1D-3D inorganic materials. Provides OPTIMADE-compliant API access and LMDB format for high-performance materials screening and property prediction workflows.
Quantum-Chemical Bonding DB (LOBSTER): Specialized dataset providing detailed bonding analysis for 1,520 solid-state compounds using LOBSTER methodology. Enables understanding of chemical bonding in crystalline materials through projected crystal orbital Hamilton populations and related descriptors.
MultixcQM9 & SPICE (OpenQDC): Enhanced quantum chemistry datasets within the OpenQDC framework. MultixcQM9 provides multi-exchange correlation functional data for 133k small molecules, while SPICE offers 1 million conformers with energies and forces for drug-like molecules, both optimized for machine learning applications.
Matbench v0.1 & Discovery: Comprehensive benchmarking suites for materials property prediction featuring 13 standardized tasks across 10 datasets. Matbench Discovery specifically targets stability prediction, thermal conductivity, and structure generation with rigorous evaluation protocols.
Materials Cloud Archives: Centralized repository of over 1,000 computational datasets from various DFT and molecular dynamics workflows. Provides standardized access to diverse materials science calculations with comprehensive metadata and version control.

📚 LLM Training Datasets (5 datasets)¶

LLM-EO (Evolutionary Optimization): A framework that integrates LLMs into evolutionary algorithms for optimizing transition metal complexes. This approach leverages the chemical knowledge of LLMs to surpass traditional genetic algorithms, enabling flexible, multi-objective optimization without complex mathematical formulations.
Flavor Analysis and Recognition Transformer: A state-of-the-art machine learning model dataset for predicting molecular taste from chemical structures. Built on ChemBERTa transformer architecture, it classifies molecules across four taste categories (sweet, bitter, sour, umami) with >91% accuracy, enabling interpretability through gradient-based visualizations and applications in flavor compound discovery and rational food design.
SCQA (Solar Cell QA): Domain-specific question-answering dataset containing 47,268 QA pairs about solar cell properties, auto-generated using ChemDataExtractor. Fine-tuning language models on this dataset achieves F1-scores exceeding general-English QA datasets by 10-20%, demonstrating the value of domain-specific training data for specialized scientific applications.
ScienceQA: Comprehensive K-12 science education dataset with 21,208 multimodal multiple-choice questions including lectures and explanations. Supports development of educational AI systems and scientific reasoning capabilities in language models.
SciBench: College-level scientific problem-solving benchmark covering mathematics, chemistry, and physics with both open and closed evaluation sets. Enables systematic assessment of LLM performance on advanced scientific reasoning tasks.

🧪 Experimental Datasets (4 datasets)¶

Anion Solvation DB: Comprehensive compilation of 26,000+ solvation properties including 8,241 experimental pKa values across 8 solvents, 5,536 computed gas-phase acidities, and over 12,000 solvation energies for anions and neutral compounds computed using COSMO-RS. Bridges experimental and computational approaches for understanding anion behavior in different solvation environments.
BigSolDB: Extensive experimental solubility database containing 54,273 measured solubility values across temperature range 243.15-403.15 K in various organic solvents and water. Features diverse chemical space coverage with interactive t-SNE exploration tool and comprehensive statistical analysis for QSPR model development.
StarryData2: Large-scale experimental properties dataset from Figshare spanning 2023-2024, providing comprehensive experimental measurements across diverse materials and chemical systems for machine learning model validation and training.
CRIPT Polymer Data: Community-driven polymer database featuring synthesis procedures, characterization data, and properties. Enables standardized data sharing and collaborative research in polymer science through structured JSON API access.

June 2025¶

Added 28 new high-quality datasets spanning polymer science, drug discovery, carbon materials, spectroscopy, MOF databases, foundation model training, and materials knowledge bases:

🧮 Computational Datasets (15 datasets)¶

NeurIPS Open Polymer Prediction 2025: Kaggle competition dataset for predicting 5 key polymer properties (Tg, FFV, Tc, density, Rg) from SMILES structures using MD simulation ground truth. Includes ~1,500 test polymers.
Carbon Data: 22.9 million atom dataset with synthetic energy labels from C-GAP-17 potential, featuring 546 carbon trajectories across diverse densities and temperatures. Captures nanotubes, graphitic films, diamond, and amorphous carbon environments.
MSR-ACC/TAE25: Microsoft Research's comprehensive dataset of 76,879 total atomization energies computed at CCSD(T)/CBS level using W1-F12 protocol. Exhaustively covers chemical space for elements up to argon with sub-chemical accuracy (±1 kcal/mol).
DFT Solvation Energy Dataset: 651,290 computed solvation energies for 130,258 molecules from QM9 dataset across 5 solvents (acetone, ethanol, acetonitrile, DMSO, water). Achieves 0.5 kcal/mol MAE for small molecules with accompanying ML models and web interface.
MD Simulated Monomer Properties: GPU-accelerated molecular dynamics dataset of thermodynamic properties for 410 molecules, generated through active learning pipeline. Includes validation against experimental data and automated simulation workflow.
Multimodal Spectroscopic Dataset: Comprehensive spectroscopic dataset with simulated 1H-NMR, 13C-NMR, HSQC-NMR, Infrared, and Mass spectra for 790k molecules from patent reactions. Enables multimodal foundation model development for structure elucidation and functional group prediction.
QMugs: 665k drug-like molecules with ~2M conformers, featuring quantum mechanical properties at both semi-empirical (GFN2-xTB) and DFT (ωB97X-D/def2-SVP) levels.
C2DB (Computational 2D Materials Database): ~4,000 two-dimensional materials with computed structural, electronic, magnetic, and optical properties.
ANI-1x / ANI-1ccx: 5 million DFT and 500k CCSD(T) calculations for organic molecules, supporting machine learning potential development.
CoRE MOF 2019: 14,763 computation-ready metal-organic frameworks with solvent and charge balancing, suitable for high-throughput screening.
QMOF Database: Comprehensive database of quantum-chemical properties for 20,000+ metal-organic frameworks derived from high-throughput periodic density functional theory calculations.
Catalysis-Hub Surface Reactions: Over 100,000 adsorption and reaction energies on catalytic surfaces, accessible via a Python/GraphQL API.
ODAC23 (Open DAC 2023): 38 million DFT calculations of CO₂/H₂O adsorption on 8,400 MOFs, aimed at direct-air-capture sorbent discovery.
MOFX-DB: Over 3 million simulated adsorption data points across 160,000 MOFs and 286 zeolites for various gases.
Enhanced QCML dataset entry with more comprehensive description of coverage and properties

🧪 Experimental Datasets (5 datasets)¶

SAIR (Structurally Augmented IC50 Repository): Largest public protein–ligand binding dataset with over 1 million complexes and 5.2 million cofolded 3D structures (2.5TB total). Combines experimental binding affinities from ChEMBL/BindingDB with Boltz-1x predicted structures.
CoRE MOF 2024: Updated database of over 40,000 experimentally reported metal-organic frameworks from literature through early 2024. Includes pre-computed material properties for high-throughput material-process screening and carbon-capture applications.
HTEM-DB (High-Throughput Experimental Materials Database): More than 140,000 composition–process–property data points from combinatorial sputtering experiments, with optical, electrical, and structural measurements.
OCx24 (Open Catalyst Experiments 2024): 572 synthesized catalyst inks evaluated with matched XRF/XRD and DFT adsorption energies, bridging the gap between simulation and laboratory data.
Khazana / Polymer Genome: Approximately 20,000 polymers with DFT-calculated properties and experimental dielectric data, supporting machine learning on soft materials.

📚 LLM Training Datasets (5 datasets)¶

MolTextNet: 2.5 million high-quality molecule-text pairs from ChEMBL35, featuring GPT-4o-mini generated descriptions 10x longer than existing datasets. Integrates structural features, computed properties, bioactivity data, and synthetic complexity for multimodal molecular modeling.
MolOpt-Instructions: 1.18 million instruction-based molecule optimization tasks for fine-tuning LLMs on drug discovery. Supports interactive human-machine dialogue for molecule optimization through the DrugAssist framework, enabling expert feedback integration and iterative refinement.
TextEdge: Benchmark dataset for predicting crystal properties from natural language text descriptions. Demonstrates superior performance of LLM-based approaches over traditional GNN methods, with improvements of 8% on band gap prediction and 65% on unit cell volume prediction.
LAMBench-TrainingSet-v1: Massive training dataset for Large Atom Models (LAMs) containing 19.8 million valid structures from the OpenLAM Initiative. Includes 1 million structures on the convex hull for advancing generative modeling and materials science applications.
LLM4Mat: Comprehensive benchmark dataset for evaluating LLMs in materials property prediction, containing 1.9M crystal structures from 10 data sources with 45 distinct properties. Features three input modalities (crystal composition, CIF, text description) with 4.7M, 615.5M, and 3.1B tokens respectively.

📖 Literature-mined & Text Datasets (3 datasets)¶

MatSciKB: Comprehensive materials science knowledge base with 38,469 curated entries across 16 categories. Integrates ArXiv papers (20,384), Wikipedia articles (3,620), textbooks (1,930), datasets (10,473), formulas (57), and GPT-generated examples (2,005) with efficient CRUD operations for research applications.
ChemRxivQuest: 970 curated question–answer pairs spanning 17 chemistry subfields, designed for retrieval-augmented generation and factuality assessments.
USPTO-Lowe Reactions (1976–2016): 1.8 million atom-mapped reactions extracted from US patents, serving as a benchmark for reaction prediction and retrosynthesis models.

📚 Enhanced Literature & Benchmark Resources (2 datasets)¶

Matbench (metadata/text tasks): Extended benchmarking suite providing 13 standardized tasks for text-based and metadata-driven materials property prediction. Enables systematic evaluation of natural language processing approaches in materials science applications.
OpenQDC Hub: Comprehensive quantum chemistry database aggregating 1.5 billion molecular geometries and quantum mechanical properties. Provides unified Python API access to diverse quantum chemistry datasets with standardized formats for large-scale machine learning applications.

⸻

Earlier Updates¶

For changes made earlier than the changelog entries, please see the repository commit history.

Source

Materials & Chemistry Datasets

About¶

Table of Contents¶

Contributing¶

Datasets¶

Computational Datasets¶

Experimental Datasets¶

LLM Training Datasets¶

Literature-mined & Text Datasets¶

🌊 Computational Fluid Dynamics, PDE & Engineering Datasets¶

Proprietary Datasets (for reference)¶

Dataset Resources¶

TODO¶

Other Links¶

License¶

Acknowledgements¶

Citation¶

Changelog¶

October 2025¶

🧮 Computational Datasets (7 datasets)¶

🧪 Experimental Datasets (5 datasets)¶

📚 LLM Training Datasets (5 datasets)¶

📖 Literature-mined & Text Datasets (1 dataset)¶

August 2025¶

🧮 Computational Datasets (3 datasets)¶

📚 LLM Training Datasets (4 datasets)¶

July 2025¶

🌊 Computational Fluid Dynamics, PDE & Engineering Datasets (15 datasets)¶

🧮 Computational Datasets (7 datasets)¶

📚 LLM Training Datasets (5 datasets)¶

🧪 Experimental Datasets (4 datasets)¶

June 2025¶

🧮 Computational Datasets (15 datasets)¶

🧪 Experimental Datasets (5 datasets)¶

📚 LLM Training Datasets (5 datasets)¶

📖 Literature-mined & Text Datasets (3 datasets)¶

📚 Enhanced Literature & Benchmark Resources (2 datasets)¶

Earlier Updates¶

Comments