FabScore is a fine-grained evaluation framework that measures the extent to which AI-generated papers contain fabrications. Given a research paper and its associated code, FabScore drives a coding agent through four stages: 1. Result Extraction; 2. Static Analysis; 3. Code Execution; 4. Verdict Generation.
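The four stages above can be sketched as a simple sequential pipeline. This is a minimal illustration only; the function names, fields, and signatures below are hypothetical, not FabScore's actual API:

```python
# Minimal sketch of a four-stage sequential pipeline.
# All function and field names are illustrative, not FabScore's real API.

def extract_results(task):
    # Stage 1: pull reported claims/numbers out of the paper
    task["claims"] = ["claim extracted from paper"]
    return task

def static_analysis(task):
    # Stage 2: inspect the accompanying code without running it
    task["analysis"] = "code inspected"
    return task

def execute_code(task):
    # Stage 3: run the code and record its actual outputs
    task["execution"] = "code executed"
    return task

def generate_verdicts(task):
    # Stage 4: compare claims against evidence and assign verdicts
    task["verdicts"] = ["verdict per claim"]
    return task

STAGES = [extract_results, static_analysis, execute_code, generate_verdicts]

def run_pipeline(task):
    for stage in STAGES:
        task = stage(task)
    return task
```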
FabScore assigns each claim one of six verdict categories.
We conduct a comprehensive evaluation on 144 papers with accompanying code from multiple sources, including AI Scientist, MLR-Agent, Agents4Science, and FARS. For Agents4Science, we collect all 27 accepted submissions with available code and additionally sample 27 rejected submissions to balance accepted and rejected papers. For AI Scientist, MLR-Agent, and FARS, we collect 30 papers each.
As shown in Figure 2, the overall fabrication rate reaches 21.2%, where experiment fabrication accounts for the majority.
As shown in Figure 3, claim-level fabrication rates range from 0.4% to 53.6% and paper-level rates from 10.0% to 81.5%. 70.4% of the 54 real conference submissions contain fabrications.
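For intuition, claim-level and paper-level fabrication rates can be computed from per-claim verdicts as follows. The data below is made up purely for illustration and does not reflect FabScore's reported results:

```python
# Hypothetical per-paper verdicts: True = claim judged fabricated.
# The data is illustrative only, not FabScore's actual results.
papers = {
    "paper_a": [False, False, True, False],
    "paper_b": [False, False, False, False],
    "paper_c": [True, True, False, True],
}

# Claim-level rate: fraction of all claims judged fabricated.
all_claims = [v for verdicts in papers.values() for v in verdicts]
claim_level_rate = sum(all_claims) / len(all_claims)

# Paper-level rate: fraction of papers containing at least one fabricated claim.
paper_level_rate = sum(any(v) for v in papers.values()) / len(papers)
```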
We have developed a unified interface to support human review. If you would like to check out the evaluation results using our interface, please click this link.
We use uv to manage the environment of this repository. Here are the commands for initializing uv in this project:

```shell
uv init
uv venv
# add requirements to pyproject.toml
uv add requests
uv lock
# sync the environment with the lockfile (also updates packages)
uv sync
```
Install fabscore as a package:

```shell
uv pip install -e .
```
Before running the following steps, ensure you have activated the virtual environment:

```shell
source .venv/bin/activate
```
You can also skip manual activation and run commands through `uv run`, which is the recommended style.
The evaluation pipeline consists of 4 modular steps. You can run them all at once using the main orchestrator, or individually for more control.
Run the entire 4-step process automatically:

```shell
uv run python main.py --task_path <path_to_task_directory> --paper_filename <paper_filename_or_relative_path> [--judge_type claude]
```
Key optional arguments:

- `--judge_type` — Agent to use: `claude` or `codex` (default: `claude`)
- `--model_name` — Model name override (e.g. `claude-sonnet-4-6`)
- `--extraction_only` — Stop after extraction
- `--analysis_only` — Stop after extraction + static analysis
- `--execution_only` — Stop after extraction + static analysis + execution, and skip the final summarization writeout

Required arguments:

- `--task_path` — Task root directory
- `--paper_filename` — Paper filename or relative path inside the task directory, for example `paper.pdf`, `results/paper.md`, or `data_augmentation_grokking.pdf`

You may also execute each stage individually by running:
```shell
# 1. Result Extraction
uv run python fabscore/eval/extraction.py --task_path <task_dir> --paper_filename <relative_paper_path> [--judge_type claude] [--model_name <model>]
# 2. Static Analysis
uv run python fabscore/eval/analysis.py --task_path <task_dir> --paper_file <relative_paper_path> [--judge_type claude] [--model_name <model>]
# 3. Code Execution
uv run python fabscore/eval/execution.py --task_path <task_dir> --paper_file <relative_paper_path> [--analysis_path <analysis_json>] [--extracted_path <extracted_json>] [--judge_type claude]
```
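A minimal argparse setup consistent with the flags documented above might look like the following. This is a sketch only; the actual parser in `main.py` may differ:

```python
import argparse

def build_parser():
    # Sketch of a CLI matching the documented flags; the real main.py may differ.
    p = argparse.ArgumentParser(description="FabScore evaluation pipeline")
    p.add_argument("--task_path", required=True, help="Task root directory")
    p.add_argument("--paper_filename", required=True,
                   help="Paper filename or relative path inside the task directory")
    p.add_argument("--judge_type", choices=["claude", "codex"], default="claude",
                   help="Coding agent to use as the judge")
    p.add_argument("--model_name", default=None, help="Model name override")
    p.add_argument("--extraction_only", action="store_true",
                   help="Stop after extraction")
    p.add_argument("--analysis_only", action="store_true",
                   help="Stop after extraction + static analysis")
    p.add_argument("--execution_only", action="store_true",
                   help="Stop after extraction + static analysis + execution")
    return p
```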
Please cite our paper if you find our work helpful:
```bibtex
@article{chen2026fabscore,
  title={FabScore: Fine-Grained Evaluation of Fabrications in Automated AI Research},
  author={Chen, Hui and Zhao, James Xu and Jiang, Dongfu and Guo, Qianyun and Chen, Jiefeng and Wang, Yiwei and Chen, Muhao and Ng, See-Kiong and Koh, Pang Wei and Hooi, Bryan},
  url={https://github.com/chchenhui/fabscore},
  year={2026}
}
```
Please feel free to contact chchenhui233@gmail.com if you have any questions.