Hi! BIRD-CRITIC
We are grateful that BIRD-SQL 2023 has brought insights and contributions to the Text-to-SQL community in the LLM era, under the supervision of DB & DM experts.
We also sincerely thank the community for all of its feedback! Our work has been featured by DeepMind and OpenAI.
This year, in collaboration with Google Cloud, we are launching BIRD-SQL 2025, which will cover a wide range of professional DBs and their knowledge in real-world applications.
BIRD-CRITIC (a.k.a. SWE-SQL), the first SQL diagnostic benchmark, is released to answer one question: can large language models (LLMs) fix user issues in real-world database applications?
The benchmark comprises 600 tasks for development and 200 held-out out-of-distribution (OOD) tests.
BIRD-CRITIC 1.0 is built on realistic user issues across 4 prominent SQL dialects: MySQL, PostgreSQL, SQL Server, and Oracle.
It expands beyond simple SELECT queries to cover a wider range of SQL operations, reflecting actual application scenarios.
Finally, an optimized execution-based evaluation environment is included for rigorous and efficient validation.
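To make "execution-based evaluation" concrete, here is a minimal sketch of the idea, not the benchmark's actual harness. It assumes Python's built-in `sqlite3` as a stand-in for the real database environments, and the task fields (`setup`, `solution`, `check`) and function names are hypothetical. A predicted fix passes if, after running it on a fresh database, a check query returns the same rows as it does after running the reference solution.

```python
import sqlite3

# Hypothetical task layout: `setup` builds the database, `solution` is the
# reference fix, and `check` is a query that observes the resulting state.
TASK = {
    "setup": """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
        INSERT INTO orders VALUES (1, 'new'), (2, 'new');
    """,
    "solution": "UPDATE orders SET status = 'shipped' WHERE id = 2;",
    "check": "SELECT id, status FROM orders ORDER BY id;",
}


def run_and_check(setup_sql, candidate_sql, check_sql):
    """Run setup and candidate SQL on a fresh in-memory database.

    Returns the rows produced by the check query, or None if any step fails.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        conn.executescript(candidate_sql)
        return conn.execute(check_sql).fetchall()
    except sqlite3.Error:
        return None
    finally:
        conn.close()


def passes(task, predicted_sql):
    """A prediction passes if it leaves the database in the reference state."""
    expected = run_and_check(task["setup"], task["solution"], task["check"])
    actual = run_and_check(task["setup"], predicted_sql, task["check"])
    return actual is not None and actual == expected


def pass_rate(tasks, predictions):
    """Percentage of tasks whose prediction passes."""
    passed = sum(passes(t, p) for t, p in zip(tasks, predictions))
    return 100.0 * passed / len(tasks)


if __name__ == "__main__":
    good = "UPDATE orders SET status = 'shipped' WHERE id IN (2);"
    bad = "UPDATE orders SET status = 'shipped';"  # missing WHERE clause
    print(pass_rate([TASK, TASK], [good, bad]))  # 50.0
```

Comparing database state rather than SQL text means that differently written but equivalent fixes are accepted, while a fix that runs without error but changes the wrong rows is rejected.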
- bird-critic-1.0-flash-exp: A lite version containing 200 tasks specifically for PostgreSQL. This is a good starting point for quick experimentation.
- bird-critic-1.0-open: The full version of BIRD-CRITIC, comprising 600 tasks covering open-source SQL dialects: PostgreSQL, MySQL, SQL Server, and Oracle. This allows for cross-dialect evaluation.
- bird-critic-1.0-postgresql: A full version containing 600 tasks, all in PostgreSQL. This allows for focused analysis within a single dialect.
- bird-critic-1.0-bigquery: A full version containing 200 tasks in BigQuery.
News
- Feb. 4, 2025: bird-critic-1.0-flash-exp has been updated! We've added the issue_type label, which classifies issues into 4 main categories: Query, Management, Personalization, and Efficiency. Please download the newest version through Hugging Face Datasets!
- Feb. 4, 2025: BIRD CRITIC 1.0 SQL (Flash), a lite version, has been released. Please fill out this form to receive the GT solution SQLs and test case functions via email. This can help to prevent automated crawling, which is critical for mitigating data leakage problems.
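As a rough illustration of how the issue_type label might be used, the sketch below groups and filters task records by category. The sample records and the `task_id` field are hypothetical, and the exact string values stored in the dataset are an assumption based on the four category names above.

```python
from collections import Counter

# Hypothetical records; only the `issue_type` field name and the four
# category names come from the release note above.
SAMPLE_TASKS = [
    {"task_id": 0, "issue_type": "Query"},
    {"task_id": 1, "issue_type": "Management"},
    {"task_id": 2, "issue_type": "Query"},
    {"task_id": 3, "issue_type": "Efficiency"},
    {"task_id": 4, "issue_type": "Personalization"},
]


def count_by_issue_type(tasks):
    """Count how many tasks fall into each issue category."""
    return Counter(task["issue_type"] for task in tasks)


def filter_by_issue_type(tasks, issue_type):
    """Keep only the tasks labeled with the given category."""
    return [task for task in tasks if task["issue_type"] == issue_type]


if __name__ == "__main__":
    print(count_by_issue_type(SAMPLE_TASKS))
    print(filter_by_issue_type(SAMPLE_TASKS, "Query"))
```

Rows loaded with the Hugging Face `datasets` library iterate as dictionaries, so the same helpers should work on the downloaded data, provided the field is stored under this name.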
GT SQLs & Test Cases
To mitigate data leakage, please feel free to email bird.bench25@gmail.com for the solution SQLs and test cases.
Delivery is fast.
BIRD-CRITIC Example
![](./img/example_img/example_1.png)
![](./img/example_img/example_2.png)
Submission
BIRD 2025 will accept more flexible submission pipelines. Please check the Submission Guideline (below) and contact bird.bench25@gmail.com if you have any questions.
Subscribe to BIRD Update
BIRD is a long-term research project aimed at bridging the gap between semantic parsing models and the success of database applications. To receive the latest updates on the dataset, you can leave your email address.
Citation
@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
Rank | Model | Pass Rate (%) | Institute | Link | Date | Tier |
---|---|---|---|---|---|---|
1 | o1-preview-2025-01-30 | 38.5 | | | 2025-01-30 | 🏆 Leading |
2 | deepseek-reasoner (r1) | 34.0 | | | 2025-01-30 | 🌟 Elite |
3 | gpt-4o-2024-11-20 | 29.0 | | | 2024-11-20 | 🌟 Elite |
4 | o1-mini | 28.0 | | | 2025-01-30 | 💎 Superior |
5 | deepseek-V3 | 27.5 | | | 2025-01-30 | 💎 Superior |
6 | phi-4 | 24.5 | | | 2025-01-30 | 💎 Superior |
7 | claude-3-5-sonnet | 24.0 | | | 2025-01-30 | 🔸 Advanced |
8 | gemini-2.0-flash-exp | 24.0 | | | 2025-01-30 | 🔸 Advanced |
9 | Qwen2.5-Coder-32B-Instruct | 23.5 | | | 2025-01-30 | 🔸 Advanced |
10 | gemini-2.0-flash-thinking-exp | 19.5 | | | 2025-01-30 | 🔸 Advanced |
11 | Meta-Llama-3.3-70B-Instruct | 18.5 | | | 2025-01-30 | 💫 Standard |
12 | Codestral-22B-v0.1 | 18.0 | | | 2025-01-30 | 💫 Standard |
13 | gemma-2-27b-it | 18.0 | | | 2025-01-30 | 💫 Standard |
14 | QwQ-32B-Preview | 17.5 | | | 2025-01-30 | 💫 Standard |
15 | starcoder2-15b-instruct-v0.1 | 15.5 | | | 2025-01-30 | 💫 Standard |
16 | DeepSeek-Coder-V2-Lite-Instruct | 12.5 | | | 2025-01-30 | ⚪ Basic |
17 | Mixtral-8x7B-Instruct-v0.1 | 11.5 | | | 2025-01-30 | ⚪ Basic |
18 | gemma-2-9b-it | 10.5 | | | 2025-01-30 | ⚪ Basic |
19 | Yi-1.5-34B-Chat-16K | 10.5 | | | 2025-01-30 | ⚪ Basic |
20 | CodeLlama-34b-Instruct-hf | 10.0 | | | 2025-01-30 | ⚪ Basic |
21 | CodeLlama-13b-Instruct-hf | 8.5 | | | 2025-01-30 | ⚪ Basic |
22 | Mistral-7B-Instruct-v0.2 | 3.5 | | | 2025-01-30 | ⚪ Basic |
Tier Classification (By Ranking):
- 🏆 Leading: The Best!
- 🌟 Elite: Top 15%
- 💎 Superior: Top 30%
- 🔸 Advanced: Top 45%
- 💫 Standard: Top 70%
- ⚪ Basic: Bottom 30%
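The tier rule above can be sketched as a small function. This is one possible reading, assuming a model's position is measured as rank divided by the number of ranked models and that a cutoff is inclusive. The page does not state how boundaries are rounded, so ranks that sit near a cutoff may be assigned differently on the published leaderboard. The names `TIER_CUTOFFS` and `tier_for_rank` are hypothetical.

```python
# Cumulative cutoffs taken from the tier list; the inclusive boundary
# handling is an assumption, not something the page specifies.
TIER_CUTOFFS = [
    (0.15, "Elite"),
    (0.30, "Superior"),
    (0.45, "Advanced"),
    (0.70, "Standard"),
]


def tier_for_rank(rank, total):
    """Map a 1-based leaderboard rank to a tier name."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    if rank == 1:
        return "Leading"
    fraction = rank / total
    for cutoff, name in TIER_CUTOFFS:
        if fraction <= cutoff:
            return name
    return "Basic"


if __name__ == "__main__":
    for rank in (1, 3, 6, 15, 22):
        print(rank, tier_for_rank(rank, 22))
```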