BIRD-CRITIC

Hi! BIRD-CRITIC

We are grateful that BIRD-SQL 2023 has brought insights and contributions to the Text-to-SQL community in the LLM era, under the supervision of DB & DM experts. We also sincerely thank the community for all its feedback! Our work has been featured by DeepMind and OpenAI.

This year, in collaboration with Google Cloud, we are launching BIRD-SQL 2025, which will cover a wide range of professional DBs and their knowledge in real-world applications.

BIRD-CRITIC (a.k.a. SWE-SQL), the first SQL diagnostic benchmark, is released to answer the question: Can large language models (LLMs) fix user issues in real-world database applications? The benchmark comprises 600 tasks for development and 200 held-out out-of-distribution (OOD) tests. BIRD-CRITIC 1.0 is built on realistic user issues across 4 prominent open-source SQL dialects: MySQL, PostgreSQL, SQL Server, and Oracle. It expands beyond simple SELECT queries to cover a wider range of SQL operations, reflecting actual application scenarios. Finally, an optimized execution-based evaluation environment is included for rigorous and efficient validation.
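To make "execution-based evaluation" concrete, here is a minimal sketch of the general idea: run a candidate fix against a fresh database and compare the resulting state with what a test case expects. This is not the benchmark's actual harness. It uses Python's built-in sqlite3 as a stand-in for the supported dialects, and the function name `passes_test_case`, the `orders` table, and both SQL statements are hypothetical illustrations.

```python
import sqlite3


def passes_test_case(setup_sql, candidate_sql, check_sql, expected_rows):
    """Run candidate SQL on a fresh in-memory DB and compare the checked rows.

    Hypothetical helper: the real BIRD-CRITIC test cases are distributed by
    the authors and may be structured differently.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)          # build schema and seed data
        try:
            conn.executescript(candidate_sql)  # apply the model's proposed fix
        except sqlite3.Error:
            return False                       # SQL that fails to run does not pass
        rows = conn.execute(check_sql).fetchall()
        return rows == expected_rows
    finally:
        conn.close()


# Illustrative user issue: "my UPDATE archives every order, not just paid ones".
setup = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'paid', 10.0), (2, 'pending', 5.0), (3, 'paid', 7.5);
"""
buggy_sql = "UPDATE orders SET status = 'archived';"
fixed_sql = "UPDATE orders SET status = 'archived' WHERE status = 'paid';"
check = "SELECT id, status FROM orders ORDER BY id;"
expected = [(1, "archived"), (2, "pending"), (3, "archived")]

buggy_ok = passes_test_case(setup, buggy_sql, check, expected)
fixed_ok = passes_test_case(setup, fixed_sql, check, expected)
print(buggy_ok, fixed_ok)
```

Checking the database state after execution, rather than comparing SQL text, is what lets this style of evaluation handle statements other than SELECT, since two differently written fixes can both be accepted if they produce the expected result.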

bird-critic-1.0-flash-exp: A lite version containing 200 tasks specifically for PostgreSQL. This is a good starting point for quick experimentation.
bird-critic-1.0-open: The full version of BIRD-CRITIC, comprising 570 tasks covering open-source SQL dialects: PostgreSQL, MySQL, SQL Server, and Oracle. This allows for cross-dialect evaluation.
bird-critic-1.0-postgresql: A full version containing 530 tasks, all in PostgreSQL. This allows for focused analysis within a single dialect.
bird-critic-1.0-bigquery: A full version containing 200 tasks in BigQuery.

News

July 9, 2025: We have released human performance scores on the BIRD-CRITIC datasets! The scores displayed across all three leaderboards reflect human evaluators (database experts) who were allowed to use standard tools (database textbooks, official documentation, or IDEs) but not AI assistants. When another group with the same expertise was permitted to use AI tools (ChatGPT, Claude, or Gemini), the performance increased to 83.33 on Open, 87.90 on PG, and 90.00 on Flash, demonstrating the significant potential of human-AI collaboration in SQL problem-solving.
June 8, 2025: We have released bird-critic-1.0-pg (530 tasks focused on PostgreSQL). Check out the data on Hugging Face. Our next release will include a new set of 300 efficiency-focused tasks. Stay tuned!
Apr. 23, 2025: We have released bird-critic-1.0-open (570 tasks across 4 dialects). Check out the data on Hugging Face and the newest code on GitHub. The full PostgreSQL set will be released one week later. BIRD-CRITIC appears to be a challenging reasoning task for text-to-SQL, since all top-performing models are reasoning-based models. Have fun! Thanks!
Feb. 4, 2025: BIRD CRITIC 1.0 SQL (Flash), a lite version, has been released. Please fill out this form to receive the GT solution SQLs and test case functions via email. This can help to prevent automated crawling, which is critical for mitigating data leakage problems.

GT SQLs & Test Cases

To mitigate data leakage, please feel free to email bird.bench25@gmail.com for the solution SQLs and test cases. Delivery is quite fast.

BIRD-CRITIC Example

Submission

BIRD 2025 will accept more flexible submission pipelines. Please check the Submission Guideline (below) and contact bird.bench25@gmail.com if you have any questions.

Subscribe to BIRD Update

BIRD is a long-term research project aimed at bridging the gap between semantic parsing models and the success of database applications. To receive the latest updates on the dataset, you can leave your email address.


Citation

@article{li2025swe,
  title={SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications},
  author={Li, Jinyang and Li, Xiaolong and Qu, Ge and Jacobsson, Per and Qin, Bowen and Hui, Binyuan and Si, Shuzheng and Huo, Nan and Xu, Xiaohan and Zhang, Yue and others},
  journal={arXiv preprint arXiv:2506.18951},
  year={2025}
}

@article{li2024can,
  title={Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

Leaderboard - BIRD-CRITIC-1.0-Open (570)

Rank	Model	SR (%)	Date	Tier
1	o1-preview-2025-04-20	35.5	2025-04-20	🏆 Leading
2	deepseek-reasoner (r1)	32.0	2025-04-20	🌟 Elite
3	gpt-4o-2024-11-20	27.5	2024-11-20	🌟 Elite
4	o1-mini	25.0	2025-04-20	💎 Superior
5	deepseek-V3	24.5	2025-04-20	💎 Superior
6	phi-4	22.0	2025-04-20	💎 Superior
7	claude-3-5-sonnet	21.5	2025-04-20	🔸 Advanced
8	gemini-2.0-pro	21.0	2025-04-20	🔸 Advanced
9	Qwen2.5-Coder-32B-Instruct	20.0	2025-04-20	🔸 Advanced
10	gemini-2.0-pro-thinking	17.5	2025-04-20	🔸 Advanced
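
The SR (%) column is not defined on this page. Assuming it stands for success rate, i.e. the percentage of tasks whose test cases all pass, it could be computed from per-task outcomes roughly as follows. The function name and the task IDs are hypothetical, and the official scoring script may differ.

```python
def success_rate(results):
    """Percentage of tasks that passed, rounded to two decimals.

    `results` maps a task ID to True (all test cases passed) or False.
    Assumption: SR on the leaderboard is a plain pass percentage.
    """
    if not results:
        raise ValueError("no task results to score")
    passed = sum(1 for ok in results.values() if ok)
    return round(100.0 * passed / len(results), 2)


# Made-up outcomes for four tasks, for illustration only.
demo_results = {"task_0": True, "task_1": False, "task_2": False, "task_3": True}
demo_sr = success_rate(demo_results)
print(demo_sr)  # 50.0
```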