fabscore

FabScore: Fine-Grained Evaluation of Fabrications in Automated AI Research

Introduction

FabScore is a fine-grained evaluation framework that measures the extent to which AI-generated papers contain fabrications. Given a research paper and its associated code, FabScore uses a coding agent to execute the following four stages: 1. Result Extraction; 2. Static Analysis; 3. Code Execution; 4. Verdict Generation.
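The four stages above can be sketched as a minimal pipeline. Note that the function names, data shapes, and verdict labels below are illustrative assumptions, not the actual FabScore implementation:

```python
# Illustrative sketch of the four-stage pipeline; stage functions,
# data shapes, and verdict labels are hypothetical stand-ins.

def extract_results(paper_text):
    # Stage 1: pull quantitative claims out of the paper text
    # (here, naively, any line containing a percentage).
    return [{"claim": line.strip()} for line in paper_text.splitlines()
            if "%" in line]

def static_analysis(claims, code_files):
    # Stage 2: check whether the code could plausibly produce each claim.
    for c in claims:
        c["code_found"] = bool(code_files)
    return claims

def execute_code(claims):
    # Stage 3 (placeholder): rerun experiments and record whether
    # each claimed result was reproduced.
    for c in claims:
        c["reproduced"] = c["code_found"]
    return claims

def generate_verdicts(claims):
    # Stage 4: map the collected evidence to a per-claim verdict.
    return ["supported" if c["reproduced"] else "fabricated" for c in claims]

paper = "We reach 91.3% accuracy.\nTraining takes two hours."
claims = extract_results(paper)
claims = static_analysis(claims, code_files=["train.py"])
claims = execute_code(claims)
print(generate_verdicts(claims))  # prints ['supported']
```

In the real framework each stage is a separate script driven by a coding agent (see Usage below); this sketch only conveys the data flow between stages.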

Figure 1: An overview of the FabScore framework, illustrating our four-stage evaluation pipeline.


There are six verdict categories:

Evaluation Results

Evaluation Data

We conduct a comprehensive evaluation on 144 papers with accompanying code from multiple sources, including AI Scientist, MLR-Agent, Agents4Science, and FARS. For Agents4Science, we collect all 27 accepted submissions with available code and additionally sample 27 rejected submissions to balance accepted and rejected papers. For AI Scientist, MLR-Agent, and FARS, we collect 30 papers each.

Overall Performance

As shown in Figure 2, the overall fabrication rate reaches 21.2%, with experiment fabrication accounting for the majority.

Figure 2: Proportion of each verdict category among 6,978 extracted claims from 144 AI-generated papers.


Claim-level and Paper-level Performance

As shown in Figure 3, claim-level fabrication rates range from 0.4% to 53.6% and paper-level rates from 10.0% to 81.5%. 70.4% of the 54 real conference submissions contain fabrications.

Figure 3: Claim-level verdict distribution and paper-level fabrication frequency across five data sources, where paper-level fabrication frequency is defined as the proportion of papers containing at least one fabrication.
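Both metrics in Figure 3 follow directly from per-claim verdicts. The toy verdict data below is an illustrative assumption, used only to show the computation:

```python
# Compute the claim-level fabrication rate and the paper-level
# fabrication frequency (share of papers containing at least one
# fabricated claim). The verdicts below are made-up toy data.

verdicts_by_paper = {
    "paper_a": ["supported", "fabricated", "supported"],
    "paper_b": ["supported", "supported"],
    "paper_c": ["fabricated", "fabricated", "supported"],
}

# Claim-level: fabricated claims over all claims.
all_verdicts = [v for vs in verdicts_by_paper.values() for v in vs]
claim_rate = all_verdicts.count("fabricated") / len(all_verdicts)

# Paper-level: papers with at least one fabricated claim over all papers.
papers_with_fab = sum(
    any(v == "fabricated" for v in vs) for vs in verdicts_by_paper.values()
)
paper_rate = papers_with_fab / len(verdicts_by_paper)

print(f"claim-level: {claim_rate:.1%}")   # 3/8 claims -> 37.5%
print(f"paper-level: {paper_rate:.1%}")   # 2/3 papers -> 66.7%
```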


Review Interface

We have developed a unified interface to support human review. If you would like to check out the evaluation results using our interface, please click this link.

Installation

We use uv to manage the environment of this repository. Here are the commands for initializing uv in this project.

1. uv init
2. uv venv
3. uv add requests  # add requirements to pyproject.toml
4. uv lock
5. uv sync  # update installed packages

Install fabscore as a package:

uv pip install -e .

Before running the following steps, ensure you have activated the virtual environment:

source .venv/bin/activate

You can also skip manual activation and run commands through uv run, which is the recommended style.

Usage

The evaluation pipeline consists of 4 modular steps. You can run them all at once using the main orchestrator, or individually for more control.

Run the entire 4-step process automatically:

uv run python main.py --task_path <path_to_task_directory> --paper_filename <paper_filename_or_relative_path> [--judge_type claude]

Key optional arguments:

--judge_type: the judge backend to use (e.g. claude, as shown above).

Required arguments:

--task_path: path to the task directory containing the paper and its code.
--paper_filename: the paper's filename, or its path relative to the task directory.

Individual Step Usage

You may also execute each stage individually by running:

1. uv run python fabscore/eval/extraction.py --task_path <task_dir> --paper_filename <relative_paper_path> [--judge_type claude] [--model_name <model>]  # Result Extraction
2. uv run python fabscore/eval/analysis.py --task_path <task_dir> --paper_file <relative_paper_path> [--judge_type claude] [--model_name <model>]  # Static Analysis
3. uv run python fabscore/eval/execution.py --task_path <task_dir> --paper_file <relative_paper_path> [--analysis_path <analysis_json>] [--extracted_path <extracted_json>] [--judge_type claude]  # Code Execution
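The per-stage commands above can also be assembled programmatically. The stage-to-script mapping below mirrors the documented commands, while the helper itself is an illustrative sketch, not part of FabScore:

```python
# Build the argument list for each evaluation stage.
# Script paths and flags mirror the commands documented above;
# the helper function itself is hypothetical.

STAGE_SCRIPTS = {
    "extraction": ("fabscore/eval/extraction.py", "--paper_filename"),
    "analysis": ("fabscore/eval/analysis.py", "--paper_file"),
    "execution": ("fabscore/eval/execution.py", "--paper_file"),
}

def build_stage_command(stage, task_path, paper, judge_type=None):
    script, paper_flag = STAGE_SCRIPTS[stage]
    cmd = ["uv", "run", "python", script,
           "--task_path", task_path, paper_flag, paper]
    if judge_type:
        cmd += ["--judge_type", judge_type]
    return cmd

cmd = build_stage_command("analysis", "tasks/demo", "paper.pdf", "claude")
print(" ".join(cmd))
# prints: uv run python fabscore/eval/analysis.py --task_path tasks/demo --paper_file paper.pdf --judge_type claude
```

One could then pass the list to subprocess.run from the repository root; the tasks/demo path and paper.pdf name are placeholders, not files shipped with the repo.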

Citation

Please cite our paper if you find our work helpful:

@article{chen2026fabscore,
      title={FabScore: Fine-Grained Evaluation of Fabrications in Automated AI Research}, 
      author={Chen, Hui and Zhao, James Xu and Jiang, Dongfu and Guo, Qianyun and Chen, Jiefeng and Wang, Yiwei and Chen, Muhao and Ng, See-Kiong and Koh, Pang Wei and Hooi, Bryan},
      url={https://github.com/chchenhui/fabscore},
      year={2026}
}

Please feel free to contact chchenhui233@gmail.com if you have any questions.