# Awesome Materials & Chemistry Datasets
A curated list of the most useful datasets in materials science and chemistry for training machine learning and AI foundation models. This includes experimental, computational, and literature-mined datasets—prioritizing open-access resources and community contributions.
This project aims to:
- Catalog the best datasets by domain, type, quality, and size
- Support reproducible research in AI for chemistry and materials
- Provide a community-driven resource with contributions from researchers and developers
## Table of Contents
- How to Use
- Contributing
- Datasets
  - Computational (DFT, MD)
  - Experimental
  - LLM Training
  - Literature-mined & Text
  - Proprietary
- License
- Acknowledgements
## How to Use
- Explore datasets by domain or data type using the tables below
- Click the access links to explore or download the data
- Sort/filter by quality, size, and suitability for ML models
- Fork the repo and submit a pull request to add new datasets
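The tables in this list are plain pipe-delimited text, so the sorting and filtering described above can also be scripted. The following is a minimal sketch, assuming a table has been copied into a string in the same layout used here. The helper names `split_row`, `parse_table`, and `parse_size` are hypothetical and not part of this repository, and the size parser only reads the first number in a Size cell together with an attached `k`, `M`, or `B` suffix.

```python
import re

# A few rows copied from the tables in this list, with the table header.
TABLE = """\
Dataset | Domain | Size | Type | Format | License | Access |
---|---|---|---|---|---|---|
Materials Project (LBL) | Inorganic crystals | 500k+ compounds | Computational | JSON/API | CC BY 4.0 | Open |
OMat24 (Meta) | Inorganic crystals | 110M DFT entries | Computational | JSON/HDF5 | CC BY 4.0 | Open |
NIST ICSD (subset) | Inorganic structures | ~290k structures | Experimental | CIF | Proprietary | Restricted |
"""

# First number in the cell, optionally followed directly by k, M, or B.
SIZE_PATTERN = re.compile(r"(\d[\d,]*(?:\.\d+)?)([kKMB])?\b")
SCALE = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def split_row(line):
    """Split one pipe-delimited row into stripped cells."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_table(text):
    """Return the table rows as a list of dicts keyed by the header cells."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # skip the header and the ---|--- separator
        cells = split_row(line)
        cells += [""] * (len(header) - len(cells))  # pad short rows
        rows.append(dict(zip(header, cells)))
    return rows


def parse_size(text):
    """Approximate count from a Size cell, or None if it has no number."""
    match = SIZE_PATTERN.search(text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    return round(number * SCALE.get(suffix, 1))


rows = parse_table(TABLE)
# Keep only open-access datasets, largest first.
open_rows = [r for r in rows if r["Access"] == "Open"]
open_rows.sort(key=lambda r: parse_size(r["Size"]) or 0, reverse=True)
for row in open_rows:
    print(row["Dataset"], parse_size(row["Size"]))
```

Size cells are free text (for example "Private" or "Tens of thousands of examples"), so `parse_size` returns `None` or a rough number for some rows; treat the result as a sorting aid rather than an exact count.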
## Contributing
Want to add a new dataset or improve metadata?
- Fork the repository
- Edit the appropriate dataset list or add a new entry
- Submit a pull request with a brief description and source
- Use the following fields:
  - Dataset Name
  - Domain
  - Type (`Computational`, `Experimental`, `Literature-mined`)
  - Size
  - Access (Open/Restricted/Proprietary)
  - Format (JSON, CSV, CIF, HDF5, SMILES, etc.)
  - License
  - Access Link
  - Notes or Use Cases
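Before opening a pull request, it can help to check that an entry has every field listed above. The sketch below is one possible way to do that and is not tooling shipped with this repository: the `DatasetEntry` class, its method names, and the example dataset are all hypothetical. It also assumes that the access link is attached to the dataset name as a Markdown link, and that the allowed Type and Access values are exactly the ones listed above (the existing tables also use other Type labels such as `LLM Training`).

```python
from dataclasses import dataclass

# Allowed values taken from the field list above (assumed to be strict).
ALLOWED_TYPES = {"Computational", "Experimental", "Literature-mined"}
ALLOWED_ACCESS = {"Open", "Restricted", "Proprietary"}


@dataclass
class DatasetEntry:
    name: str
    domain: str
    dataset_type: str
    size: str
    access: str
    file_format: str
    license: str
    link: str
    notes: str = ""

    def problems(self):
        """Return a list of issues; an empty list means the entry looks complete."""
        issues = []
        if self.dataset_type not in ALLOWED_TYPES:
            issues.append(f"Type must be one of {sorted(ALLOWED_TYPES)}")
        if self.access not in ALLOWED_ACCESS:
            issues.append(f"Access must be one of {sorted(ALLOWED_ACCESS)}")
        required = {
            "Dataset Name": self.name,
            "Domain": self.domain,
            "Size": self.size,
            "Format": self.file_format,
            "License": self.license,
            "Access Link": self.link,
        }
        for label, value in required.items():
            if not value.strip():
                issues.append(f"{label} is empty")
        return issues

    def to_row(self):
        """Render the entry in the column order of the dataset tables."""
        cells = [
            f"[{self.name}]({self.link})",  # assumption: link goes on the name
            self.domain,
            self.size,
            self.dataset_type,
            self.file_format,
            self.license,
            self.access,
        ]
        return " | ".join(cells) + " |"


# Hypothetical entry, for illustration only.
entry = DatasetEntry(
    name="ExampleDB",
    domain="Inorganic crystals",
    dataset_type="Computational",
    size="10k structures",
    access="Open",
    file_format="CIF",
    license="CC BY 4.0",
    link="https://example.org/exampledb",
)
print(entry.problems())
print(entry.to_row())
```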
## Datasets
### Computational Datasets
Dataset | Domain | Size | Type | Format | License | Access |
---|---|---|---|---|---|---|
OMat24 (Meta) | Inorganic crystals | 110M DFT entries | Computational | JSON/HDF5 | CC BY 4.0 | Open |
OMol25 (Meta) | Molecular chemistry | 100M+ DFT calculations | Computational | LMDB | CC BY 4.0 | Open |
Materials Project (LBL) | Inorganic crystals | 500k+ compounds | Computational | JSON/API | CC BY 4.0 | Open |
Open Catalyst 2020 (OC20) | Catalysis (surfaces) | 1.2M relaxations | Computational | JSON/HDF5 | CC BY 4.0 | Open |
AFLOW | Inorganic materials | 3.5M materials | Computational | REST API | Open | Open |
OQMD | Inorganic solids | 1M+ compounds | Computational | SQL/CSV | Open | Open |
JARVIS-DFT (NIST) | 3D/2D materials | 40k+ entries | Computational | JSON/API | Open | Open |
Carolina Materials DB | Hypothetical crystals | 214k structures | Computational | JSON | CC BY 4.0 | Open |
NOMAD | Various DFT/MD | >19M calculations | Computational | JSON | CC BY 4.0 | Open |
MatPES | DFT Potential Energy Surfaces | ~400,000 structures from 300K MD simulations | Computational | JSON | Open | |
Vector-QM24 | Small organic and inorganic molecules | 836k conformational isomers | Computational | JSON | Placeholder | Open |
AIMNet2 Dataset | Non-metallic compounds | 20M hybrid DFT calculations | Computational | JSON | Open | Open |
RDB7 | Barrier height and enthalpy for small organic reactions | 12k CCSD(T)-F12 calculations | Computational | CSV | Open | Open |
RDB19-Rad | ΔG of activation and of reaction for organic reactions in 40 common solvents | 5.6k DFT + COSMO-RS calculations | Computational | CSV | Open | Open |
QCML | Small molecules consisting of up to 8 heavy atoms | 14.7B Semi-empirical + 33.5M DFT calculations | Computational | TFDS | CC BY-NC 4.0 | Open |
QM9 | Small organic molecules | 134k molecules with quantum properties | Computational | SDF/CSV | CC BY 4.0 | Open |
QM7/QM7b | Small molecules | 7k molecules with atomization energies | Computational | SDF/CSV | CC BY 4.0 | Open |
### Experimental Datasets
Dataset | Domain | Size | Type | Format | License | Access |
---|---|---|---|---|---|---|
Crystallography Open Database (COD) | Crystal structures | ~525k entries | Experimental | CIF/SMILES | CC0 1.0 | Open |
NIST ICSD (subset) | Inorganic structures | ~290k structures | Experimental | CIF | Proprietary | Restricted |
CSD (Cambridge) | Organic crystals | ~1.3M structures | Experimental | CIF | Proprietary | Restricted |
opXRD | Crystal structures | 92,552 (2,179 labeled) | Experimental | JSON | CC BY 4.0 | Open |
MDR SuperCon | Superconductivity | Legacy superconductor database with material composition, structure, properties, and processes | Mixed | | CC BY 4.0 | Open |
ChEMBL | Bioactive molecules | 2.3M+ compounds with bioactivity data | Experimental | JSON/SDF | CC BY-SA 3.0 | Open |
MoleculeNet | Molecular properties | 700k+ compounds across 17 datasets | Mixed | CSV/SDF | Various | Open |
ESOL | Aqueous solubility | 1,128 compounds with solubility data | Experimental | CSV | Open | Open |
FreeSolv | Hydration free energy | 643 molecules with experimental data | Experimental | CSV | CC BY 4.0 | Open |
Lipophilicity | Octanol/water distribution | 4,200 compounds with logD values | Experimental | CSV | Open | Open |
PCBA | Bioassay screening | 400k+ compounds, 128 bioassays | Experimental | CSV | Open | Open |
HIV | Antiviral screening | 41k compounds with HIV inhibition data | Experimental | CSV | Open | Open |
BACE | Beta-secretase inhibitors | 1,522 compounds with IC50 data | Experimental | CSV | Open | Open |
BBBP | Blood-brain barrier permeability | 2,053 compounds with permeability data | Experimental | CSV | Open | Open |
Tox21 | Toxicity screening | 8k compounds, 12 toxicity targets | Experimental | CSV | Open | Open |
ToxCast | High-throughput toxicity | 8k compounds, 600+ assays | Experimental | CSV | Open | Open |
SIDER | Drug side effects | 1,427 drugs with adverse reactions | Experimental | CSV | Open | Open |
ClinTox | Clinical trial toxicity | 1,491 compounds with FDA approval status | Experimental | CSV | Open | Open |
PDBbind | Protein-ligand binding | 19k complexes with binding affinities | Experimental | PDB/SDF | Open | Open |
BindingDB | Protein-ligand binding | 2.8M+ binding data points | Experimental | CSV/SDF | CC BY 4.0 | Open |
ProtBENCH | Drug-target interactions | Protein family-specific datasets | Experimental | CSV | GPL-3.0 | Open |
PDBench | Protein sequence design | 595 protein structures, 40 architectures | Experimental | PDB | MIT | Open |
PDB-Struct | Structure-based protein design | Comprehensive protein design benchmark | Experimental | PDB | Open | Open |
### LLM Training Datasets
Dataset | Domain | Size | Type | Format | License | Access |
---|---|---|---|---|---|---|
ChemPile | Chemistry | 75B+ tokens | LLM Training | Mixed | Open | Open |
SmolInstruct | Small molecules | 3.3M samples | LLM Training | JSON | CC BY 4.0 | Open |
CAMEL | Chemistry | 20K problem-solution pairs | LLM Training | JSON | Open | Open |
ChemNLP | Chemistry | Extensive, many combined datasets | LLM Training | JSON | Open | Open |
ChemQA | Chemistry | Multimodal QA dataset | LLM Training | JSON | Open | Open |
ChemLLMBench | Chemistry | 8 chemistry tasks benchmark | LLM Training | JSON | Open | Open |
ChemistryQA | Chemistry | 4,500 questions across 200 topics | LLM Training | JSON | Open | Open |
MaScQA | Materials Science | 640 QA pairs | LLM Training | XLSX | Open | Open |
SciCode | Research Coding in Physics, Math, Material Science, Biology, and Chemistry | 338 subproblems | LLM Training | JSON | Open | Open |
ChemData 700K | Chemistry (9 core tasks) | 730K Q-A instruction pairs | LLM Training | JSON | CC BY-NC 4.0 | Open |
MatSci-Instruct (HoneyBee) | Materials science | ≈55K verified instructions | LLM Training | JSON | CC BY 4.0 | Open |
MoleculeQA | Molecular properties & safety | 62K multiple-choice QA pairs | LLM Training | JSON | MIT | Open |
BioInstruct 25K | Biomedical / biochemistry | 25K GPT-4 generated instructions | LLM Training | JSON | MIT | Open |
Lab-Bench | Biology | 2,400+ questions for biology agents | LLM Training | JSON | Open | Open |
ChemBench 4K | Chemistry competency benchmark | 4,100 single-choice questions | LLM Training | JSON | CC BY-NC 4.0 | Open |
GPQA Diamond | Biology, Physics, Chemistry | 448 multiple-choice questions | LLM Training | JSON | Open | Open |
SciAssess | Scientific literature analysis | Benchmark for LLMs in science | LLM Training | JSON | Open | Open |
ZINC20-ML | Drug-like molecules (SMILES) | ≈1B molecules | LLM Training | SMILES | ZINC License | Open |
PMC Open Access Subset | Biomedical full-text | 3.4M+ articles | LLM Training | XML | Various CC | Open |
MatScholar Task-Schema QA (MatSci-NLP) | Materials science (7 NLP tasks) | Tens of thousands of examples | LLM Training | JSON | CC BY 4.0 | Open |
Mol-Instructions | Chemistry | molecular, protein, and biochemical instructions | LLM Training | HuggingFace Dataset | Open | Open |
USPTO-LLM | Chemical reactions | 247K reactions | LLM Training | JSON/Graph | CC BY 4.0 | Open |
### Literature-mined & Text Datasets
Dataset | Domain | Size | Type | Format | License | Access |
---|---|---|---|---|---|---|
PubChem | Molecules & data | 119M compounds | Literature | SMILES/SDF | Public Domain | Open |
Open Reaction Database (ORD) | Synthetic reactions | ~1M reactions | Experimental/Lit | JSON | CC BY 4.0 | Open |
PatCID (IBM) | Chemical image data | 81M images / 13M molecules | Literature | PNG/SMILES | Open | Open |
MatScholar | NLP corpus (materials) | 5M+ abstracts | Literature | JSON/Graph | Open | Open |
### Proprietary Datasets (for reference)
Dataset | Domain | Size | Access | Use Case Notes |
---|---|---|---|---|
CAS Registry | Chemical substances | 250M+ substances | Proprietary | Industry standard for molecule indexing |
Reaxys (Elsevier) | Reactions & properties | Millions of reactions | Proprietary | Rich curated literature reaction data |
Citrine Informatics DB | Experimental materials | Private | Proprietary | Materials ML platform w/ industry data |
CSD (Cambridge) | Organic crystals | 1.3M+ | Proprietary | Gold-standard X-ray structures |
PoLyInfo | Polymers & properties | 500k+ data points / Experimental | Proprietary | Polymer properties from literature sources |
## Dataset Resources
- The Materials Data Facility: over 100 TB of open materials data. TODO: list some of these in the tables above.
- Foundry-ML (search Foundry): 61 structured datasets ready for download through a Python client. TODO: list some of these in the tables above.
## TODO
- Classify and add CRIPT for polymer data
- Classify and add Polymer Genome and other datasets from Khazana
- Add a dataset on solubilities of gases in polymers: 15,000 experimental measurements of the uptake (0.01–50 wt%) of 79 gases in 102 different polymers, at pressures from 1 × 10⁻³ to 7 × 10² bar and temperatures from 233 to 508 K, covering nearly 500 solvent–polymer systems. Optimized structures of various repeating units are included. Available here: Data
- Add Materials Cloud Datasets
- Classify Atomly (somewhat challenging because its content is not in English)
- Look into adding NOMAD for experimental data as well
- Review Alexandria Materials
- Add A Quantum-Chemical Bonding Database for Solid-State Materials Part 1: https://zenodo.org/records/8091844 Part 2: https://zenodo.org/records/8092187
- Add QM datasets. http://quantum-machine.org/datasets/
- Find link for ChemRxivQuest: `| ChemRxivQuest | Chemistry literature QA | 970 curated QA pairs | LLM Training | JSON | CC BY 4.0 | Open | ChemRxivQuest |`
- Find new link for USPTO-Reactions: `| USPTO Reactions | Organic reactions | 1.8M reactions | Literature | RXN/SMILES | Open | Open |`
## Other Links
## License
This project is licensed under the MIT License. Each dataset listed has its own license, noted in the table. Always check the source's license before using the data in your project.
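As a first pass before reading the actual license text, the License column can be bucketed into rough categories. The sketch below is only a triage aid under stated assumptions: the function name `license_category` and the category labels are hypothetical, and values such as "Open", "Various", or "Placeholder" are treated as unspecified because they do not name a concrete license.

```python
def license_category(license_text):
    """Bucket a License cell into a rough category for manual follow-up."""
    text = license_text.strip().lower()
    # Labels that do not identify a specific license.
    if text in {"", "open", "various", "various cc", "placeholder"}:
        return "unspecified"
    if "proprietary" in text:
        return "proprietary"
    if "-nc" in text:  # e.g. CC BY-NC 4.0: no commercial use
        return "non-commercial"
    if "-sa" in text or text.startswith("gpl"):  # e.g. CC BY-SA 3.0, GPL-3.0
        return "share-alike/copyleft"
    if text in {"cc0 1.0", "public domain"}:
        return "public-domain"
    if text.startswith("cc by") or text == "mit":
        return "attribution/permissive"
    # Anything unrecognized (e.g. a dataset-specific license) needs a manual check.
    return "unspecified"


for cell in ["CC BY 4.0", "CC BY-NC 4.0", "CC BY-SA 3.0", "Open", "ZINC License"]:
    print(f"{cell}: {license_category(cell)}")
```

Anything that lands in "unspecified" should be resolved by reading the terms at the dataset's source.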
## Acknowledgements
Thanks to the open data and research communities, including:
- Meta AI FAIR
- The Materials Data Facility / Foundry-ML
- NIST JARVIS and Materials Project
- LBL, MIT, CCDC, FIZ Karlsruhe
- Contributors to Open Catalyst, PubChem, ORD, and AFLOW
- Developers of open chemistry toolkits (RDKit, Open Babel)
## Citation
If this repository was helpful in your work, feel free to cite or star the repo. You can also reference the underlying dataset publications linked above.