# Awesome Materials & Chemistry Datasets

A curated list of the most useful datasets in materials science and chemistry for training machine learning and AI foundation models. It covers experimental, computational, and literature-mined datasets, prioritizing open-access resources and community contributions.

This project aims to:

  • Catalog the best datasets by domain, type, quality, and size
  • Support reproducible research in AI for chemistry and materials
  • Provide a community-driven resource with contributions from researchers and developers


How to Use

  • Explore datasets by domain or data type using the tables below
  • Click the access links to explore or download the data
  • Sort/filter by quality, size, and suitability for ML models
  • Fork the repo and submit a pull request to add new datasets
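
As a rough illustration of the sort/filter step, the sketch below parses a Markdown pipe table, ranks rows by an approximate size, and keeps only openly licensed entries. It assumes the tables are available as pipe-delimited Markdown text. The helper names `parse_table`, `approx_size`, and `open_and_permissive` are hypothetical and not part of this repository, and the set of licenses treated as permissive is an illustrative assumption.

```python
import re

# Excerpt: three rows taken from the Computational Datasets table.
TABLE = """\
| Dataset | Domain | Size | Type | Format | License | Access |
| --- | --- | --- | --- | --- | --- | --- |
| OMat24 (Meta) | Inorganic crystals | 110M DFT entries | Computational | JSON/HDF5 | CC BY 4.0 | Open |
| Materials Project (LBL) | Inorganic crystals | 500k+ compounds | Computational | JSON/API | CC BY 4.0 | Open |
| QCML | Small molecules consisting of up to 8 heavy atoms | 14.7B Semi-empirical + 33.5M DFT calculations | Computational | TFDS | CC BY-NC 4.0 | Open |
"""

# Multipliers for the k/M/B suffixes used in the Size column.
SUFFIX = {"k": 1e3, "m": 1e6, "b": 1e9}


def parse_table(text):
    """Turn a Markdown pipe table into a list of dicts keyed by header."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("|")]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header and separator rows
        cells = [c.strip() for c in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def approx_size(size):
    """Best-effort value of the first number in a Size cell, or 0 if none."""
    m = re.search(r"(\d[\d,]*(?:\.\d+)?)([kKmMbB])?\b", size)
    if m is None:
        return 0
    number = float(m.group(1).replace(",", ""))
    return number * SUFFIX.get((m.group(2) or "").lower(), 1)


def open_and_permissive(rows, licenses=("CC BY 4.0", "CC0 1.0", "MIT")):
    """Keep rows that are open access and carry one of the given licenses."""
    return [r for r in rows if r["Access"] == "Open" and r["License"] in licenses]


rows = parse_table(TABLE)
by_size = sorted(rows, key=lambda r: approx_size(r["Size"]), reverse=True)
print([r["Dataset"] for r in by_size])
print([r["Dataset"] for r in open_and_permissive(rows)])
```

Size cells are free text, so `approx_size` only reads the first number it finds; entries such as "Private" or "Extensive, many combined datasets" sort as 0 and need manual review.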

Contributing

Want to add a new dataset or improve metadata?

  1. Fork the repository
  2. Edit the appropriate dataset list or add a new entry
  3. Submit a pull request with a brief description and source
  4. Use the following fields:
       • Dataset Name
       • Domain
       • Type (Computational, Experimental, Literature-mined)
       • Size
       • Access (Open/Restricted/Proprietary)
       • Format (JSON, CSV, CIF, HDF5, SMILES, etc.)
       • License
       • Access Link
       • Notes or Use Cases
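
The fields above can be sketched as a small record type that checks the enumerated values and renders a table row. This is only an illustration: the class `DatasetEntry`, its attribute names, the example dataset "ExampleDB", and its URL are all hypothetical. The column order follows the table headers in this list, and linking the dataset name to the access link is an assumption about how rows are formatted.

```python
from dataclasses import dataclass

# Allowed values as listed in the contributing fields; the tables also use
# other Type labels (e.g. "Mixed", "LLM Training"), so extend these as needed.
TYPES = {"Computational", "Experimental", "Literature-mined"}
ACCESS = {"Open", "Restricted", "Proprietary"}


@dataclass
class DatasetEntry:
    """Hypothetical record holding the fields requested for a new entry."""
    name: str
    domain: str
    data_type: str
    size: str
    access: str
    file_format: str
    license: str
    link: str
    notes: str = ""

    def __post_init__(self):
        # Reject values outside the enumerations given in the field list.
        if self.data_type not in TYPES:
            raise ValueError(f"unknown type: {self.data_type!r}")
        if self.access not in ACCESS:
            raise ValueError(f"unknown access level: {self.access!r}")

    def to_row(self) -> str:
        """Render a pipe-table row: Dataset, Domain, Size, Type, Format, License, Access."""
        cells = [
            f"[{self.name}]({self.link})",  # assumption: name links to the source
            self.domain,
            self.size,
            self.data_type,
            self.file_format,
            self.license,
            self.access,
        ]
        return "| " + " | ".join(cells) + " |"


# Placeholder entry, not a real dataset.
entry = DatasetEntry(
    name="ExampleDB",
    domain="Inorganic crystals",
    data_type="Computational",
    size="10k structures",
    access="Open",
    file_format="CIF",
    license="CC BY 4.0",
    link="https://example.org/exampledb",
    notes="Illustrative only",
)
print(entry.to_row())
```

The notes field is kept on the record but not rendered, since the main tables have no notes column; it would belong in the pull request description instead.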

Datasets

Computational Datasets

| Dataset | Domain | Size | Type | Format | License | Access |
| --- | --- | --- | --- | --- | --- | --- |
| OMat24 (Meta) | Inorganic crystals | 110M DFT entries | Computational | JSON/HDF5 | CC BY 4.0 | Open |
| OMol25 (Meta) | Molecular chemistry | 100M+ DFT calculations | Computational | LMDB | CC BY 4.0 | Open |
| Materials Project (LBL) | Inorganic crystals | 500k+ compounds | Computational | JSON/API | CC BY 4.0 | Open |
| Open Catalyst 2020 (OC20) | Catalysis (surfaces) | 1.2M relaxations | Computational | JSON/HDF5 | CC BY 4.0 | Open |
| AFLOW | Inorganic materials | 3.5M materials | Computational | REST API | Open | Open |
| OQMD | Inorganic solids | 1M+ compounds | Computational | SQL/CSV | Open | Open |
| JARVIS-DFT (NIST) | 3D/2D materials | 40k+ entries | Computational | JSON/API | Open | Open |
| Carolina Materials DB | Hypothetical crystals | 214k structures | Computational | JSON | CC BY 4.0 | Open |
| NOMAD | Various DFT/MD | >19M calculations | Computational | JSON | CC BY 4.0 | Open |
| MatPES | DFT Potential Energy Surfaces | ~400,000 structures from 300 K MD simulations | Computational | JSON | | Open |
| Vector-QM24 | Small organic and inorganic molecules | 836k conformational isomers | Computational | JSON | Placeholder | Open |
| AIMNet2 Dataset | Non-metallic compounds | 20M hybrid DFT calculations | Computational | JSON | Open | Open |
| RDB7 | Barrier height and enthalpy for small organic reactions | 12k CCSD(T)-F12 calculations | Computational | CSV | Open | Open |
| RDB19-Rad | ΔG of activation and of reaction for organic reactions in 40 common solvents | 5.6k DFT + COSMO-RS calculations | Computational | CSV | Open | Open |
| QCML | Small molecules consisting of up to 8 heavy atoms | 14.7B Semi-empirical + 33.5M DFT calculations | Computational | TFDS | CC BY-NC 4.0 | Open |
| QM9 | Small organic molecules | 134k molecules with quantum properties | Computational | SDF/CSV | CC BY 4.0 | Open |
| QM7/QM7b | Small molecules | 7k molecules with atomization energies | Computational | SDF/CSV | CC BY 4.0 | Open |

Experimental Datasets

| Dataset | Domain | Size | Type | Format | License | Access |
| --- | --- | --- | --- | --- | --- | --- |
| Crystallography Open Database (COD) | Crystal structures | ~525k entries | Experimental | CIF/SMILES | CC0 1.0 | Open |
| NIST ICSD (subset) | Inorganic structures | ~290k structures | Experimental | CIF | Proprietary | Restricted |
| CSD (Cambridge) | Organic crystals | ~1.3M structures | Experimental | CIF | Proprietary | Restricted |
| opXRD | Crystal structures | 92,552 (2,179 labeled) | Experimental | JSON | CC BY 4.0 | Open |
| MDR SuperCon | Superconductivity | Legacy superconductor database w/ material composition, structure, properties, and processes | | Mixed | CC BY 4.0 | Open |
| ChEMBL | Bioactive molecules | 2.3M+ compounds with bioactivity data | Experimental | JSON/SDF | CC BY-SA 3.0 | Open |
| MoleculeNet | Molecular properties | 700k+ compounds across 17 datasets | Mixed | CSV/SDF | Various | Open |
| ESOL | Aqueous solubility | 1,128 compounds with solubility data | Experimental | CSV | Open | Open |
| FreeSolv | Hydration free energy | 643 molecules with experimental data | Experimental | CSV | CC BY 4.0 | Open |
| Lipophilicity | Octanol/water distribution | 4,200 compounds with logD values | Experimental | CSV | Open | Open |
| PCBA | Bioassay screening | 400k+ compounds, 128 bioassays | Experimental | CSV | Open | Open |
| HIV | Antiviral screening | 41k compounds with HIV inhibition data | Experimental | CSV | Open | Open |
| BACE | Beta-secretase inhibitors | 1,522 compounds with IC50 data | Experimental | CSV | Open | Open |
| BBBP | Blood-brain barrier permeability | 2,053 compounds with permeability data | Experimental | CSV | Open | Open |
| Tox21 | Toxicity screening | 8k compounds, 12 toxicity targets | Experimental | CSV | Open | Open |
| ToxCast | High-throughput toxicity | 8k compounds, 600+ assays | Experimental | CSV | Open | Open |
| SIDER | Drug side effects | 1,427 drugs with adverse reactions | Experimental | CSV | Open | Open |
| ClinTox | Clinical trial toxicity | 1,491 compounds with FDA approval status | Experimental | CSV | Open | Open |
| PDBbind | Protein-ligand binding | 19k complexes with binding affinities | Experimental | PDB/SDF | Open | Open |
| BindingDB | Protein-ligand binding | 2.8M+ binding data points | Experimental | CSV/SDF | CC BY 4.0 | Open |
| ProtBENCH | Drug-target interactions | Protein family-specific datasets | Experimental | CSV | GPL-3.0 | Open |
| PDBench | Protein sequence design | 595 protein structures, 40 architectures | Experimental | PDB | MIT | Open |
| PDB-Struct | Structure-based protein design | Comprehensive protein design benchmark | Experimental | PDB | Open | Open |

LLM Training Datasets

| Dataset | Domain | Size | Type | Format | License | Access |
| --- | --- | --- | --- | --- | --- | --- |
| ChemPile | Chemistry | 75B+ tokens | LLM Training | Mixed | Open | Open |
| SmolInstruct | Small molecules | 3.3M samples | LLM Training | JSON | CC BY 4.0 | Open |
| CAMEL | Chemistry | 20K problem-solution pairs | LLM Training | JSON | Open | Open |
| ChemNLP | Chemistry | Extensive, many combined datasets | LLM Training | JSON | Open | Open |
| ChemQA | Chemistry | Multimodal QA dataset | LLM Training | JSON | Open | Open |
| ChemLLMBench | Chemistry | 8 chemistry tasks benchmark | LLM Training | JSON | Open | Open |
| ChemistryQA | Chemistry | 4,500 questions across 200 topics | LLM Training | JSON | Open | Open |
| MaScQA | Materials Science | 640 QA pairs | LLM Training | XLSX | Open | Open |
| SciCode | Research Coding in Physics, Math, Material Science, Biology, and Chemistry | 338 subproblems | LLM Training | JSON | Open | Open |
| ChemData 700K | Chemistry (9 core tasks) | 730K Q-A instruction pairs | LLM Training | JSON | CC BY-NC 4.0 | Open |
| MatSci-Instruct (HoneyBee) | Materials science | ≈55K verified instructions | LLM Training | JSON | CC BY 4.0 | Open |
| MoleculeQA | Molecular properties & safety | 62K multiple-choice QA pairs | LLM Training | JSON | MIT | Open |
| BioInstruct 25K | Biomedical / biochemistry | 25K GPT-4 generated instructions | LLM Training | JSON | MIT | Open |
| Lab-Bench | Biology | 2,400+ questions for biology agents | LLM Training | JSON | Open | Open |
| ChemBench 4K | Chemistry competency benchmark | 4,100 single-choice questions | LLM Training | JSON | CC BY-NC 4.0 | Open |
| GPQA Diamond | Biology, Physics, Chemistry | 448 multiple-choice questions | LLM Training | JSON | Open | Open |
| SciAssess | Scientific literature analysis | Benchmark for LLMs in science | LLM Training | JSON | Open | Open |
| ZINC20-ML | Drug-like molecules (SMILES) | ≈1B molecules | LLM Training | SMILES | ZINC License | Open |
| PMC Open Access Subset | Biomedical full-text | 3.4M+ articles | LLM Training | XML | Various CC | Open |
| MatScholar Task-Schema QA (MatSci-NLP) | Materials science (7 NLP tasks) | Tens of thousands of examples | LLM Training | JSON | CC BY 4.0 | Open |
| Mol-Instructions | Chemistry | Molecular, protein, and biochemical instructions | LLM Training | HuggingFace Dataset | Open | Open |
| USPTO-LLM | Chemical reactions | 247K reactions | LLM Training | JSON/Graph | CC BY 4.0 | Open |

Literature-mined & Text Datasets

| Dataset | Domain | Size | Type | Format | License | Access |
| --- | --- | --- | --- | --- | --- | --- |
| PubChem | Molecules & data | 119M compounds | Literature | SMILES/SDF | Public Domain | Open |
| Open Reaction Database (ORD) | Synthetic reactions | ~1M reactions | Experimental/Lit | JSON | CC BY 4.0 | Open |
| PatCID (IBM) | Chemical image data | 81M images / 13M mols | Literature | PNG/SMILES | Open | Open |
| MatScholar | NLP corpus (materials) | 5M+ abstracts | Literature | JSON/Graph | Open | Open |

Proprietary Datasets (for reference)

| Dataset | Domain | Size | Access | Use Case / Notes |
| --- | --- | --- | --- | --- |
| CAS Registry | Chemical substances | 250M+ substances | Proprietary | Industry standard for molecule indexing |
| Reaxys (Elsevier) | Reactions & properties | Millions of reactions | Proprietary | Rich curated literature reaction data |
| Citrine Informatics DB | Experimental materials | Private | Proprietary | Materials ML platform w/ industry data |
| CSD (Cambridge) | Organic crystals | 1.3M+ | Proprietary | Gold-standard X-ray structures |
| PoLyInfo | Polymers & properties | 500k+ data points / Experimental | Proprietary | Polymer properties from literature sources |

Dataset Resources

  • The Materials Data Facility: over 100 TB of open materials data. TODO: list some of these in the tables above.
  • Foundry-ML (search Foundry): 61 structured datasets ready for download through a Python client. TODO: list some of these in the tables above.

TODO

  • Classify and add CRIPT for polymer data
  • Classify and add Polymer Genome and other datasets from Khazana
  • Add a dataset on solubilities of gases in polymers: 15,000 experimental measurements of the uptake (0.01–50 wt%) of 79 gases in 102 different polymers, at pressures from 1 × 10⁻³ to 7 × 10² bar and temperatures from 233 to 508 K, covering nearly 500 solvent–polymer systems. Optimized structures of various repeating units are included. Available here: Data
  • Add Materials Cloud Datasets
  • Classify Atomly (somewhat challenging because the resource is not in English)
  • Look into adding NOMAD for experimental data as well
  • Review Alexandria Materials
  • Add A Quantum-Chemical Bonding Database for Solid-State Materials Part 1: https://zenodo.org/records/8091844 Part 2: https://zenodo.org/records/8092187
  • Add QM datasets. http://quantum-machine.org/datasets/
  • Find link for | ChemRxivQuest | Chemistry literature QA | 970 curated QA pairs | LLM Training | JSON | CC BY 4.0 | Open | ChemRxivQuest |
  • Find new link for USPTO-Reactions | USPTO Reactions | Organic reactions | 1.8M reactions | Literature | RXN/SMILES | Open | Open |


License

This project is licensed under the MIT License. Each dataset listed has its own license, noted in the table. Always check the source's license before using the data in your project.


Acknowledgements

Thanks to the open data and research communities, including:

  • Meta AI FAIR
  • The Materials Data Facility / Foundry-ML
  • NIST JARVIS and Materials Project
  • LBL, MIT, CCDC, FIZ Karlsruhe
  • Contributors to Open Catalyst, PubChem, ORD, and AFLOW
  • Developers of open chemistry toolkits (RDKit, Open Babel)


Citation

If this repository was helpful in your work, feel free to cite or star the repo. You can also reference the underlying dataset publications linked above.
