Designing experiments and interpreting results are core scientific competencies, particularly in the natural sciences, where researchers design precise perturbations of complex systems to uncover their underlying mechanisms. Recent efforts to evaluate the scientific capabilities of large language models (LLMs) fail to test these competencies because wet-lab experimentation is prohibitively expensive in expertise, time, and equipment. We introduce SCIGYM, a first-in-class benchmark that assesses LLMs' iterative experiment design and analysis abilities in open-ended scientific discovery tasks.
SCIGYM overcomes the cost of wet-lab experimentation by running a dry lab of biological systems. These systems, encoded in the Systems Biology Markup Language, are efficient to simulate, making them ideal testbeds for experimentation on realistically complex systems. We evaluated six frontier LLMs on 137 small systems and released a total of 350 systems. Our evaluation shows that while more capable models demonstrated superior performance, all models' performance declined significantly as system complexity increased, suggesting substantial room for improvement in the scientific capabilities of LLM agents.
Systems Biology Markup Language (SBML) is an XML-based standard for representing computational models of biological processes. It provides a formal specification language for describing dynamic systems in systems biology, enabling researchers to create mathematical models that capture the behavior of complex biological networks through ordinary differential equations (ODEs).
SBML can represent diverse biological phenomena, including metabolic networks, gene regulatory circuits, cell signaling pathways, and pharmacokinetic models. As an open standard, SBML has gained widespread adoption across the systems biology community, with extensive software support and active development by researchers and tool developers worldwide. The language's standardized format facilitates model sharing, reproducibility, and interoperability between different simulation platforms. Popular simulation tools like libRoadRunner and COPASI can efficiently execute SBML models to generate time-series data, making it straightforward for researchers to analyze system dynamics and test biological hypotheses computationally.
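To make the ODE connection concrete, here is a minimal sketch of what SBML simulators like libRoadRunner and COPASI do under the hood: integrate mass-action rate laws to produce time-series data. The toy reaction network (a single conversion A → B with rate constant k) and the forward-Euler scheme are illustrative assumptions, not a model from BioModels or the integrator those tools actually use.

```python
def simulate(k: float, a0: float, b0: float, t_end: float, steps: int):
    """Forward-Euler integration of dA/dt = -k*A, dB/dt = +k*A (toy A -> B system)."""
    dt = t_end / steps
    a, b = a0, b0
    trajectory = [(0.0, a, b)]
    for i in range(1, steps + 1):
        rate = k * a          # mass-action rate law for A -> B
        a -= rate * dt
        b += rate * dt
        trajectory.append((i * dt, a, b))
    return trajectory

# Generate a time series: A decays while B accumulates, total mass conserved.
traj = simulate(k=0.5, a0=10.0, b0=0.0, t_end=10.0, steps=1000)
```

A real SBML model compiles many such rate laws into a coupled ODE system; dedicated simulators use adaptive integrators rather than fixed-step Euler.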
SCIGYM contains 350 curated SBML models from BioModels, a public repository of manually-curated models from published literature. The models span diverse areas of biology including metabolic pathways, gene regulatory networks, cell signaling, and epidemiological models. Each model provides a realistic biological system for agents to discover through systematic experimentation. To prevent memorization, we preprocessed models by anonymizing identifiers and shuffling components while preserving the underlying biological structure.
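The anonymization step above can be sketched as a rewrite of species identifiers and every reference to them. The XML fragment below is a simplified SBML-like example (real SBML uses namespaces and many more attributes), and the `s1, s2, ...` naming scheme is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified SBML-like fragment with biologically meaningful identifiers.
sbml = """<model>
  <listOfSpecies>
    <species id="Glucose"/>
    <species id="ATP"/>
  </listOfSpecies>
  <reaction id="hexokinase">
    <speciesReference species="Glucose"/>
    <speciesReference species="ATP"/>
  </reaction>
</model>"""

root = ET.fromstring(sbml)
mapping = {}
for i, sp in enumerate(root.iter("species"), start=1):
    mapping[sp.get("id")] = f"s{i}"   # Glucose -> s1, ATP -> s2
    sp.set("id", f"s{i}")
# Rewrite every reference so the model stays internally consistent.
for ref in root.iter("speciesReference"):
    ref.set("species", mapping[ref.get("species")])

anonymized = ET.tostring(root, encoding="unicode")
```

Anonymizing identifiers removes the textual cues an LLM could use to recall a published model, while the reaction structure and dynamics are untouched.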
SCIGYM simulates all key facets of end-to-end scientific discovery: forming hypotheses, planning and executing experiments, analyzing results, and drawing conclusions. Starting with a partial SBML model where reactions have been removed, agents must design perturbation experiments, analyze time-series data, and propose mechanistic models that explain the observed biological system behavior.
SBML models serve as data simulators for experimental perturbations. When agents propose perturbations (such as changing initial concentrations), our environment applies the perturbation to the reference SBML model, simulates the modified system, and returns time-series data. This creates a realistic "dry lab" where agents can iteratively test hypotheses without expensive wet-lab experimentation.
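The experiment loop above can be sketched as follows. The dict-based interface and the hidden two-species reference model are assumptions for illustration; SciGym itself applies perturbations to full SBML models and simulates them with standard tools.

```python
def run_experiment(perturbation: dict, t_end=5.0, steps=500):
    """Apply an agent-proposed perturbation to the hidden reference model
    (toy A -> B system) and return the resulting time series."""
    state = {"A": 10.0, "B": 0.0}    # reference initial concentrations
    state.update(perturbation)       # apply the agent's perturbation
    k, dt = 0.8, t_end / steps
    series = [dict(state, t=0.0)]
    for i in range(1, steps + 1):
        rate = k * state["A"]
        state["A"] -= rate * dt
        state["B"] += rate * dt
        series.append(dict(state, t=i * dt))
    return series

# The agent doubles the initial concentration of A and observes the response.
data = run_experiment({"A": 20.0})
```

Because the reference model is hidden from the agent, each returned time series plays the role of a wet-lab measurement: informative about the mechanism but never revealing it directly.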
We implemented a ReAct-style agent using a Thought-Action-Observation framework. In each iteration, the agent chooses one of three available actions.
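A minimal sketch of such a loop is below. The action names (`run_experiment`, `analyze`, `submit`) and the `llm`/`env` interfaces are illustrative assumptions, not SciGym's actual API.

```python
def react_loop(llm, env, max_iters=10):
    """ReAct-style Thought-Action-Observation loop (hypothetical interface)."""
    history = []
    for _ in range(max_iters):
        thought, action, args = llm(history)      # model reasons, then picks an action
        if action == "run_experiment":
            observation = env.perturb(args)       # returns simulated time-series data
        elif action == "analyze":
            observation = env.execute_code(args)  # e.g. fit or summarize the data
        elif action == "submit":
            return args                           # proposed mechanistic (SBML) model
        history.append((thought, action, observation))
    return None  # budget exhausted without a final answer
```

The key property of this pattern is that every observation is appended to the history, so later thoughts can condition on all earlier experimental results.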
Measuring progress on scientific discovery is challenging. SCIGYM evaluates complementary aspects of discovery performance:
- Reaction Matching Score (RMS): evaluates recovery of ground-truth reaction structures by matching identical sets of reactants, products, and modifiers.
- Simulation Trajectory Error (STE): compares dynamic behavior through time-series trajectories using the Symmetric Mean Absolute Percentage Error (SMAPE).
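The two metric families can be sketched as follows, under stated assumptions: reactions are compared as sets of reactant, product, and modifier identifiers, and SMAPE follows its standard definition. SciGym's exact matching and averaging details may differ.

```python
def reaction_key(rxn):
    """A reaction matches iff reactants, products, and modifiers all coincide as sets."""
    return (frozenset(rxn["reactants"]), frozenset(rxn["products"]),
            frozenset(rxn["modifiers"]))

def reaction_f1(predicted, truth):
    """Precision/recall/F1 over exactly-matched reaction structures."""
    p_keys = {reaction_key(r) for r in predicted}
    t_keys = {reaction_key(r) for r in truth}
    tp = len(p_keys & t_keys)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p_keys), tp / len(t_keys)
    return 2 * precision * recall / (precision + recall)

def smape(pred, true):
    """Symmetric Mean Absolute Percentage Error in [0, 2]; lower is better."""
    terms = [abs(p - t) / ((abs(p) + abs(t)) / 2)
             for p, t in zip(pred, true) if abs(p) + abs(t) > 0]
    return sum(terms) / len(terms) if terms else 0.0
```

SMAPE's symmetric denominator keeps the error bounded even when species concentrations span orders of magnitude, which is common in biological trajectories.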
We evaluated six frontier LLMs from three model families (Gemini, Claude, GPT) on SCIGYM-small, comparing pro and mini variants within each family. The results reveal several important findings about current models' scientific capabilities.
| Model | STE ↓ | RMS Precision (w/ mod.) | RMS Recall (w/ mod.) | RMS F1 (w/ mod.) | RMS Precision (w/o mod.) | RMS Recall (w/o mod.) | RMS F1 (w/o mod.) |
|---|---|---|---|---|---|---|---|
| Gemini-2.5-Flash | 0.4181 | 0.1527 | 0.1071 | 0.1217 | 0.2399 | 0.1839 | 0.2005 |
| GPT-4.1-mini | 0.6007 | 0.1516 | 0.1253 | 0.1320 | 0.2530 | 0.2313 | 0.2322 |
| Claude-3.5-Haiku | 0.6281 | 0.0858 | 0.0421 | 0.0530 | 0.1454 | 0.0805 | 0.0987 |
| Gemini-2.5-Pro | **0.3212** | **0.2138** | 0.1664 | **0.1817** | **0.3781** | **0.3219** | **0.3383** |
| GPT-4.1 | 0.4611 | 0.2067 | 0.1597 | 0.1740 | 0.3517 | 0.2888 | 0.3038 |
| Claude-3.7-Sonnet | 0.3615 | 0.1780 | **0.1698** | 0.1688 | 0.3160 | 0.3170 | 0.3047 |
Table 1: Pro models outperform their mini counterparts in SCIGYM, with Gemini-2.5-Pro achieving the best overall performance. Bold values indicate best performance in each metric.
Our evaluation revealed several important insights about current LLMs' scientific capabilities:
Across all model families, professional variants achieved both lower simulation trajectory errors and higher reaction matching scores, suggesting that enhanced model capabilities benefit scientific discovery tasks.
All models showed significant performance decline as the number of reactions increased from 2 to 10, with simulation errors increasing from ~0.1 to 0.55 for top-performing models.
Models performed substantially better at discovering reactant-product relationships than modifier relationships. Even the best model achieved F1 scores 5× higher for reactant-product versus modifier connections.
Proposed mechanisms often failed to generalize to different initial conditions, indicating that agents overfit to specific experimental data rather than capturing fundamental biological properties.
We believe this work has immediate relevance for LLM-based autonomous experiment planning in self-driving labs, where selecting appropriate AI systems can enhance search efficiency and accelerate scientific discovery. To our knowledge, SCIGYM is the first benchmark to evaluate LLMs on the full cycle of scientific experimentation, enabling the study of agentic scientific decision-making.
SCIGYM is designed to evolve alongside scientific progress. As systems biologists continue advancing their field and contributing new models to databases like BioModels, SCIGYM will correspondingly grow more robust and comprehensive. The framework's modular design enables straightforward extension to additional experimental modalities and domains where protocols can be precisely specified and computationally simulated.
We envision SCIGYM as a living testbed that will continue to evolve, incorporating increasingly sophisticated experimental scenarios and evaluation metrics to better capture the nuanced complexities of authentic scientific discovery processes.
@article{duan2025measuring,
title={Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab},
author={Duan, Haonan and Lu, Stephen Zhewen and Harrigan, Caitlin Fiona and Desai, Nishkrit and Lu, Jiarui and Koziarski, Micha{\l} and Cotta, Leonardo and Maddison, Chris J},
journal={arXiv preprint arXiv:2507.02083},
year={2025}
}