Verifying intelligence in high-stakes scientific domains. Our datasets provide the grounding required for advanced physics, formal math, and biological research.
Our scientific evaluation goes beyond multiple-choice questions to include symbolic derivation and formal proof verification.
High-fidelity data designed to improve model reasoning in domains where "close enough" is not an option.
Step-by-step logical derivations in Lean and Coq, optimized for training models in automated theorem proving.
Multi-step reaction planning and molecular property prediction data curated by PhD chemists for drug discovery.
Training data that respects conservation laws and fluid dynamics, ensuring models ground their outputs in physical reality.
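As an illustration of what "respecting conservation laws" can mean for a single training example, here is a minimal sketch that checks a one-dimensional elastic collision for conservation of momentum and kinetic energy. The `CollisionSample` layout, the function names, and the tolerances are assumptions made for this example, not a description of our actual data schema or pipeline.

```python
import math
from dataclasses import dataclass


@dataclass
class CollisionSample:
    # Hypothetical record layout, assumed for illustration only.
    masses: tuple[float, float]    # kg
    v_before: tuple[float, float]  # m/s, one-dimensional
    v_after: tuple[float, float]   # m/s, one-dimensional


def momentum(masses, velocities):
    """Total linear momentum: sum of m * v."""
    return sum(m * v for m, v in zip(masses, velocities))


def kinetic_energy(masses, velocities):
    """Total kinetic energy: sum of 0.5 * m * v^2."""
    return sum(0.5 * m * v ** 2 for m, v in zip(masses, velocities))


def conserves_momentum_and_energy(sample, rel_tol=1e-9, abs_tol=1e-12):
    """Return True if both quantities match before and after, within tolerance."""
    p_ok = math.isclose(
        momentum(sample.masses, sample.v_before),
        momentum(sample.masses, sample.v_after),
        rel_tol=rel_tol,
        abs_tol=abs_tol,
    )
    e_ok = math.isclose(
        kinetic_energy(sample.masses, sample.v_before),
        kinetic_energy(sample.masses, sample.v_after),
        rel_tol=rel_tol,
        abs_tol=abs_tol,
    )
    return p_ok and e_ok


# Equal masses exchanging velocities: a valid elastic collision.
good = CollisionSample((1.0, 1.0), (2.0, 0.0), (0.0, 2.0))
# Same setup, but the outgoing velocities add momentum from nowhere.
bad = CollisionSample((1.0, 1.0), (2.0, 0.0), (1.0, 2.0))
```

A filter of this kind would accept `good` and reject `bad`. Real physical data would need unit handling and looser tolerances for measured or simulated values.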
Measuring graduate-level reasoning and scientific accuracy with zero-shot and chain-of-thought protocols.
400+ graduate-level science questions designed to be "un-googleable," requiring deep domain expertise to solve correctly.
12,500 competition-level mathematics problems testing calculus, geometry, and number theory with step-wise verification.
Comprehensive suite for molecular biology and genetics, evaluating reasoning across biological systems and pharmaceutical research.
Understanding how we verify scientific accuracy and domain expertise in our STEM datasets.
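The difference between the zero-shot and chain-of-thought protocols used across these benchmarks can be sketched as two prompt templates for the same question. The `build_prompt` helper and the exact prompt wording below are hypothetical, chosen only to show the contrast. They are not the templates used in any specific benchmark harness.

```python
def build_prompt(question: str, mode: str = "zero_shot") -> str:
    """Format one benchmark question under an assumed prompting protocol."""
    if mode == "zero_shot":
        # Zero-shot: no worked examples and no reasoning cue, just the question.
        return f"Question: {question}\nAnswer:"
    if mode == "cot":
        # Chain-of-thought: ask for intermediate reasoning before the answer.
        return (
            f"Question: {question}\n"
            "Let's think step by step, then state the final answer.\n"
            "Reasoning:"
        )
    raise ValueError(f"unknown mode: {mode!r}")


question = "What is the derivative of x**3 with respect to x?"
zero_shot_prompt = build_prompt(question, "zero_shot")
cot_prompt = build_prompt(question, "cot")
```

Scoring the same question set under both templates shows how much a model's accuracy depends on being prompted to reason explicitly.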