SCIGYM: Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

1University of Toronto, 2SickKids, 3Axiom, 4Mila, 5Vector Institute
TL;DR
SCIGYM provides a framework that tests AI agents on end-to-end scientific discovery capabilities. We use simulated biological systems to generate experimental data, enabling the agent to iteratively design experiments and discover mechanisms while avoiding the cost of wet-lab experimentation.

Abstract

Designing experiments and interpreting their results are core scientific competencies, particularly in the natural sciences, where researchers design precise experiments that perturb complex systems in order to uncover their underlying mechanisms. Recent efforts to evaluate the scientific capabilities of large language models (LLMs) fail to test these competencies because wet-lab experimentation is prohibitively expensive in expertise, time, and equipment. We introduce SCIGYM, a first-in-class benchmark that assesses LLMs' iterative experiment design and analysis abilities in open-ended scientific discovery tasks.

SCIGYM overcomes the cost of wet-lab experimentation by running a dry lab of simulated biological systems. These systems, encoded in the Systems Biology Markup Language (SBML), can generate simulated experimental data efficiently, making them ideal testbeds for experimentation on realistically complex systems. We evaluated six frontier LLMs on 137 small systems and release a total of 350 systems. Our evaluation shows that while more capable models demonstrated superior performance, all models' performance declined significantly as system complexity increased, suggesting substantial room for improvement in the scientific capabilities of LLM agents.

What is Systems Biology Markup Language (SBML)?

Systems Biology Markup Language (SBML) is an XML-based standard for representing computational models of biological processes. It provides a formal specification language for describing dynamic systems in systems biology, enabling researchers to create mathematical models that capture the behavior of complex biological networks through ordinary differential equations (ODEs).

SBML can represent diverse biological phenomena, including metabolic networks, gene regulatory circuits, cell signaling pathways, and pharmacokinetic models. As an open standard, SBML has gained widespread adoption across the systems biology community, with extensive software support and active development by researchers and tool developers worldwide. The language's standardized format facilitates model sharing, reproducibility, and interoperability between different simulation platforms. Popular simulation tools like libRoadRunner and COPASI can efficiently execute SBML models to generate time-series data, making it straightforward for researchers to analyze system dynamics and test biological hypotheses computationally.
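Conceptually, a simulator like libRoadRunner compiles an SBML model's reactions into a system of ODEs and integrates them to produce time-series data. The self-contained sketch below illustrates that idea on a hypothetical two-reaction mass-action chain A → B → C (the rate constants and explicit Euler integration are illustrative choices, not the benchmark's actual solver):

```python
import numpy as np

def simulate(k1=0.5, k2=0.3, a0=10.0, t_end=20.0, n=2000):
    """Euler-integrate a toy mass-action chain A -k1-> B -k2-> C,
    mimicking the ODEs an SBML simulator derives from reactions."""
    dt = t_end / n
    a, b, c = a0, 0.0, 0.0
    rows = []
    for i in range(n + 1):
        rows.append((i * dt, a, b, c))
        v1, v2 = k1 * a, k2 * b  # mass-action reaction rates
        a += -v1 * dt
        b += (v1 - v2) * dt
        c += v2 * dt
    return np.array(rows)  # columns: time, [A], [B], [C]

traj = simulate()
# Total mass A + B + C is conserved along the trajectory.
```

A production tool would use a stiff ODE solver and parse the kinetics from the SBML document, but the shape of the output (a matrix of time points by species concentrations) is the same.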

Benchmark Design and Data Curation

SCIGYM contains 350 curated SBML models from BioModels, a public repository of manually-curated models from published literature. The models span diverse areas of biology including metabolic pathways, gene regulatory networks, cell signaling, and epidemiological models. Each model provides a realistic biological system for agents to discover through systematic experimentation. To prevent memorization, we preprocessed models by anonymizing identifiers and shuffling components while preserving the underlying biological structure.
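The anonymization step can be sketched as a simple identifier rewrite. The helper below is a simplified illustration (the benchmark's actual preprocessing also shuffles components and handles the full SBML schema); it only renames ids found in `species id` attributes to opaque names:

```python
import re

def anonymize_sbml(sbml_xml: str) -> str:
    """Replace human-readable species ids (e.g. 'glucose') with opaque
    names like 'S1', so models cannot be recalled from training data."""
    ids = re.findall(r'species id="([^"]+)"', sbml_xml)
    mapping = {old: f"S{i}" for i, old in enumerate(ids, 1)}
    for old, new in mapping.items():
        # \b keeps substrings of other ids from being rewritten
        sbml_xml = re.sub(rf'\b{re.escape(old)}\b', new, sbml_xml)
    return sbml_xml

doc = '<species id="glucose"/><speciesReference species="glucose"/>'
anonymized = anonymize_sbml(doc)
```

Rewriting every occurrence of each id (not just the declaration) matters because SBML references species by id throughout reactions and rate laws.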

End-to-end Scientific Discovery Framework

SCIGYM simulates all key facets of end-to-end scientific discovery: forming hypotheses, planning and executing experiments, analyzing results, and drawing conclusions. Starting with a partial SBML model where reactions have been removed, agents must design perturbation experiments, analyze time-series data, and propose mechanistic models that explain the observed biological system behavior.

SBML models serve as data simulators for experimental perturbations. When agents propose perturbations (such as changing initial concentrations), our environment applies the perturbation to the reference SBML model, simulates the modified system, and returns time-series data. This creates a realistic "dry lab" where agents can iteratively test hypotheses without expensive wet-lab experimentation.
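The experiment loop described above can be sketched as follows. The names (`run_experiment`, the toy one-reaction `simulate`) are hypothetical stand-ins for the environment's interface to the reference SBML model:

```python
import numpy as np

def simulate(init, k=0.4, t_end=10.0, n=1000):
    """Toy one-reaction system A -> B; stands in for integrating
    the reference SBML model with an ODE solver."""
    dt = t_end / n
    a, b = init["A"], init["B"]
    rows = []
    for i in range(n + 1):
        rows.append((i * dt, a, b))
        v = k * a
        a -= v * dt
        b += v * dt
    return np.array(rows)  # columns: time, [A], [B]

def run_experiment(reference_init, perturbation):
    """Apply an agent-requested perturbation (e.g. {"A": 5.0}) to the
    reference initial conditions and return the simulated time series."""
    init = {**reference_init, **perturbation}
    return simulate(init)

baseline = run_experiment({"A": 1.0, "B": 0.0}, {})
perturbed = run_experiment({"A": 1.0, "B": 0.0}, {"A": 5.0})
```

Comparing the baseline and perturbed trajectories is exactly the kind of evidence an agent can use to infer which reactions couple the species.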

Agent Framework and Action Space

We implemented a ReAct-style agent using a Thoughts-Actions-Observations framework. In each iteration, agents can choose from three actions:

  • Writing code: Analyze experimental results using Python with access to libraries like pandas, numpy, and custom SBML simulation tools
  • Conducting experiments: Request perturbations such as changing initial species concentrations
  • Submitting models: Provide final SBML model with discovered reactions
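The Thoughts-Actions-Observations loop over these three actions can be sketched as below. The `agent` and `env` interfaces are hypothetical stand-ins for the LLM and the dry lab, not the benchmark's actual API:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str     # "code" | "experiment" | "submit"
    payload: str  # Python source, perturbation spec, or SBML model

def react_loop(agent, env, max_iters=30):
    """Minimal ReAct-style loop: observe, think, act, repeat until the
    agent submits a model or the iteration budget runs out."""
    observation = env.initial_state()  # the partial SBML model
    for _ in range(max_iters):
        thought, action = agent.step(observation)
        if action.kind == "code":
            observation = env.run_code(action.payload)
        elif action.kind == "experiment":
            observation = env.perturb(action.payload)
        elif action.kind == "submit":
            return env.score(action.payload)  # final SBML evaluated
    return None  # budget exhausted without a submission
```

Each observation (code output or experimental time series) is fed back into the agent's context, so the loop accumulates evidence across iterations.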

Evaluation Metrics

Measuring progress on scientific discovery is challenging. SCIGYM evaluates complementary aspects of discovery performance:

Reaction Matching Score (RMS)

Evaluates recovery of ground truth reaction structures by matching identical sets of reactants, products, and modifiers.
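A minimal sketch of this matching, assuming each reaction is represented as a dict of reactant/product/modifier id lists (the paper's exact canonicalization may differ):

```python
def reaction_key(rxn, with_modifiers=True):
    """Canonical key: two reactions match iff their reactant and product
    sets (and optionally modifier sets) are identical."""
    key = (frozenset(rxn["reactants"]), frozenset(rxn["products"]))
    if with_modifiers:
        key += (frozenset(rxn.get("modifiers", [])),)
    return key

def reaction_matching_score(predicted, truth, with_modifiers=True):
    """Set-level precision, recall, and F1 over matched reactions."""
    pred = {reaction_key(r, with_modifiers) for r in predicted}
    true = {reaction_key(r, with_modifiers) for r in truth}
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Scoring with and without modifiers (the `with_modifiers` flag) gives the two RMS column groups reported in the results table.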

Simulation Trajectory Error (STE)

Compares dynamic behavior through time-series trajectories using Symmetric Mean Absolute Percentage Error (SMAPE).
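One common SMAPE convention is sketched below (the benchmark's exact normalization and aggregation over species and time points may differ):

```python
import numpy as np

def smape(y_true, y_pred, eps=1e-12):
    """Symmetric Mean Absolute Percentage Error:
    mean of |a - b| / ((|a| + |b|) / 2), with eps guarding
    against division by zero when both trajectories are zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(np.mean(np.abs(y_true - y_pred) / (denom + eps)))
```

SMAPE is a natural choice here because species concentrations span orders of magnitude across systems, and a symmetric relative error keeps large-concentration species from dominating the score.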

Experimental Results

We evaluated six frontier LLMs from three model families (Gemini, Claude, and GPT) on SCIGYM-small, comparing the pro and mini variants within each family. The results reveal several important findings about current models' scientific capabilities.

| Model | STE ↓ | Precision (w/ mod.) | Recall (w/ mod.) | F1 (w/ mod.) | Precision (w/o mod.) | Recall (w/o mod.) | F1 (w/o mod.) |
|---|---|---|---|---|---|---|---|
| Gemini-2.5-Flash | 0.4181 | 0.1527 | 0.1071 | 0.1217 | 0.2399 | 0.1839 | 0.2005 |
| GPT-4.1-mini | 0.6007 | 0.1516 | 0.1253 | 0.1320 | 0.2530 | 0.2313 | 0.2322 |
| Claude-3.5-Haiku | 0.6281 | 0.0858 | 0.0421 | 0.0530 | 0.1454 | 0.0805 | 0.0987 |
| Gemini-2.5-Pro | **0.3212** | **0.2138** | 0.1664 | **0.1817** | **0.3781** | **0.3219** | **0.3383** |
| GPT-4.1 | 0.4611 | 0.2067 | 0.1597 | 0.1740 | 0.3517 | 0.2888 | 0.3038 |
| Claude-3.7-Sonnet | 0.3615 | 0.1780 | **0.1698** | 0.1688 | 0.3160 | 0.3170 | 0.3047 |

Table 1: Pro models outperform their mini counterparts in SCIGYM, with Gemini-2.5-Pro achieving the best overall performance. Columns are grouped into Reaction Matching Score (RMS) with and without modifiers; bold values indicate the best performance on each metric.

Key Findings

Our evaluation revealed several important insights about current LLMs' scientific capabilities:

๐Ÿ† Pro Models Consistently Outperform Mini Variants

Across all model families, professional variants achieved both lower simulation trajectory errors and higher reaction matching scores, suggesting that enhanced model capabilities benefit scientific discovery tasks.

📈 Performance Degrades with System Complexity

All models showed significant performance decline as the number of reactions increased from 2 to 10, with simulation errors increasing from ~0.1 to 0.55 for top-performing models.

🔬 Modifier Relationships Pose Major Challenge

Models performed substantially better at discovering reactant-product relationships than modifier relationships. Even for the best model, F1 scores for reactant-product connections were roughly 5× higher than for modifier connections.

โš ๏ธ Overfitting to Experimental Data

Proposed mechanisms often failed to generalize to different initial conditions, indicating that agents overfit to specific experimental data rather than capturing fundamental biological properties.

Impact and Future Directions

We believe this work has immediate relevance for LLM-based autonomous experiment planning in self-driving labs, where selecting appropriate AI systems can enhance search efficiency and accelerate scientific discovery. To our knowledge, SCIGYM is the first benchmark to evaluate LLMs on the full cycle of scientific experimentation, and it enables the study of agentic scientific decision-making.

SCIGYM is designed to evolve alongside scientific progress. As systems biologists continue advancing their field and contributing new models to databases like BioModels, SCIGYM will correspondingly grow more robust and comprehensive. The framework's modular design enables straightforward extension to additional experimental modalities and domains where protocols can be precisely specified and computationally simulated.

We envision SCIGYM as a living testbed that will continue to evolve, incorporating increasingly sophisticated experimental scenarios and evaluation metrics to better capture the nuanced complexities of authentic scientific discovery processes.

BibTeX

@article{duan2025measuring,
  title={Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab},
  author={Duan, Haonan and Lu, Stephen Zhewen and Harrigan, Caitlin Fiona and Desai, Nishkrit and Lu, Jiarui and Koziarski, Micha{\l} and Cotta, Leonardo and Maddison, Chris J},
  journal={arXiv preprint arXiv:2507.02083},
  year={2025}
}