
BIRD-CRITIC

Can LLMs Fix User Issues in Real-World Database Applications?

Hi! BIRD-CRITIC

We are grateful that BIRD-SQL 2023 has brought insights and contributions to the Text-to-SQL community in the LLM era, under the supervision of DB & DM experts. We also sincerely thank the community for all of its feedback! Our work has been featured by DeepMind and OpenAI.

This year, in collaboration with Google Cloud, we are launching BIRD-SQL 2025, which will cover a wide range of professional databases and the knowledge they require in real-world applications.

BIRD-CRITIC (a.k.a. SWE-SQL), the first SQL diagnostic benchmark, is released to answer one question: can large language models (LLMs) fix user issues in real-world database applications? The benchmark comprises 600 tasks for development and 200 held-out out-of-distribution (OOD) tests. BIRD-CRITIC 1.0 is built on realistic user issues across 4 prominent SQL dialects: MySQL, PostgreSQL, SQL Server, and Oracle. It goes beyond simple SELECT queries to cover a wider range of SQL operations, reflecting actual application scenarios. Finally, an optimized execution-based evaluation environment is included for rigorous and efficient validation.

  • bird-critic-1.0-flash-exp: A lite version containing 200 tasks specifically for PostgreSQL. This is a good starting point for quick experimentation.
  • bird-critic-1.0-open: The full version of BIRD-CRITIC, comprising 600 tasks covering four SQL dialects: PostgreSQL, MySQL, SQL Server, and Oracle. This allows for cross-dialect evaluation.
  • bird-critic-1.0-postgresql: A full version containing 600 tasks, all in PostgreSQL. This allows for focused analysis within a single dialect.
  • bird-critic-1.0-bigquery: A full version containing 200 tasks in BigQuery.
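
To make the idea of execution-based evaluation concrete, here is a minimal sketch. It is not the benchmark's actual harness: it uses Python's built-in sqlite3 as a stand-in for the real dialects, and the toy schema, the helper names (`run_query`, `passes_test_case`), and the pass criterion (same rows as a reference solution) are all assumptions made for illustration.

```python
import sqlite3


def run_query(conn, sql):
    """Execute a query and return its rows as a list of tuples."""
    return conn.execute(sql).fetchall()


def passes_test_case(conn, predicted_sql, reference_sql):
    """Hypothetical test case: a fix passes if it executes without error
    and returns the same rows as the reference solution."""
    try:
        predicted = run_query(conn, predicted_sql)
    except sqlite3.Error:
        return False
    return predicted == run_query(conn, reference_sql)


# Toy database standing in for a real application schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'ann', 10.0), (2, 'bob', 25.0), (3, 'ann', 5.0);
""")

reference_sql = (
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
)

candidates = {
    # Correct fix: groups per customer (the alias does not change the rows).
    "good_fix": "SELECT customer, SUM(amount) AS total FROM orders "
                "GROUP BY customer ORDER BY customer",
    # Runs, but still misses the GROUP BY, so the rows differ.
    "bad_fix": "SELECT customer, SUM(amount) FROM orders ORDER BY customer",
    # Does not execute at all: the table name is misspelled.
    "broken_fix": "SELECT customer, SUM(amount) FROM orderz GROUP BY customer",
}

results = {
    name: passes_test_case(conn, sql, reference_sql)
    for name, sql in candidates.items()
}
print(results)
```

Comparing executed results rather than SQL text means a fix that is written differently from the reference can still pass, while a fix that merely looks plausible but returns the wrong rows fails.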

News

  • Feb. 4, 2025: bird-critic-1.0-flash-exp has been updated! We've added the issue_type label, which classifies issues into 4 main categories: Query, Management, Personalization, and Efficiency. Please download the newest version through Hugging Face Datasets!
  • Feb. 4, 2025: BIRD CRITIC 1.0 SQL (Flash), a lite version, has been released. Please fill out this form to receive the GT solution SQLs and test case functions via email. Distributing them by email helps prevent automated crawling, which is critical for mitigating data leakage.
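
As an illustration of how the issue_type label might be used, the sketch below counts tasks per category. Only the field name `issue_type` and the four category names come from the note above; the sample records, the `task_id` field, and the helper `count_by_issue_type` are hypothetical.

```python
from collections import Counter

# The four categories named in the release note.
ISSUE_TYPES = ("Query", "Management", "Personalization", "Efficiency")

# Hypothetical task records; real tasks carry more fields than shown here.
tasks = [
    {"task_id": 0, "issue_type": "Query"},
    {"task_id": 1, "issue_type": "Efficiency"},
    {"task_id": 2, "issue_type": "Query"},
    {"task_id": 3, "issue_type": "Management"},
]


def count_by_issue_type(records):
    """Count tasks per category, rejecting labels outside the four known ones."""
    counts = Counter(record["issue_type"] for record in records)
    unknown = set(counts) - set(ISSUE_TYPES)
    if unknown:
        raise ValueError(f"unexpected issue_type values: {sorted(unknown)}")
    # Report every category, including those with no tasks.
    return {name: counts.get(name, 0) for name in ISSUE_TYPES}


category_counts = count_by_issue_type(tasks)
print(category_counts)
```

Reporting a per-category breakdown alongside the overall score shows whether a model is weak on one kind of issue, for example Efficiency, even when its aggregate number looks reasonable.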

GT SQLs & Test Cases

To mitigate data leakage, please feel free to email bird.bench25@gmail.com to request the solution SQLs and test cases. The delivery is quite fast.

BIRD-CRITIC Example

Submission

BIRD 2025 will accept more flexible submission pipelines. Please check the Submission Guideline (below) and contact bird.bench25@gmail.com if you have any questions.

Subscribe to BIRD Update

BIRD is a long-term research project aimed at bridging the gap between semantic parsing models and the success of database applications. To receive the latest updates on the dataset, you can leave your email address.


Citation

@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
              
Leaderboard - BIRD-CRITIC-Flash

| Rank | Model                           | Pass Rate (%) | Institute        | Date       | Tier        |
|------|---------------------------------|---------------|------------------|------------|-------------|
| 1    | o1-preview-2025-01-30           | 38.5          | openai_reasoning | 2025-01-30 | 🏆 Leading  |
| 2    | deepseek-reasoner (r1)          | 34.0          | deepseek-reason  | 2025-01-30 | 🌟 Elite    |
| 3    | gpt-4o-2024-11-20               | 29.0          | openai           | 2024-11-20 | 🌟 Elite    |
| 4    | o1-mini                         | 28.0          | openai_reasoning | 2025-01-30 | 💎 Superior |
| 5    | deepseek-V3                     | 27.5          | deepseek         | 2025-01-30 | 💎 Superior |
| 6    | phi-4                           | 24.5          | microsoft        | 2025-01-30 | 💎 Superior |
| 7    | claude-3-5-sonnet               | 24.0          | Anthropic        | 2025-01-30 | 🔸 Advanced |
| 8    | gemini-2.0-flash-exp            | 24.0          | google           | 2025-01-30 | 🔸 Advanced |
| 9    | Qwen2.5-Coder-32B-Instruct      | 23.5          | qwen             | 2025-01-30 | 🔸 Advanced |
| 10   | gemini-2.0-flash-thinking-exp   | 19.5          | google           | 2025-01-30 | 🔸 Advanced |
| 11   | Meta-Llama-3.3-70B-Instruct     | 18.5          | meta             | 2025-01-30 | 💫 Standard |
| 12   | Codestral-22B-v0.1              | 18.0          | mistral          | 2025-01-30 | 💫 Standard |
| 13   | gemma-2-27b-it                  | 18.0          | google           | 2025-01-30 | 💫 Standard |
| 14   | QwQ-32B-Preview                 | 17.5          | qwen-reason      | 2025-01-30 | 💫 Standard |
| 15   | starcoder2-15b-instruct-v0.1    | 15.5          | bigcode          | 2025-01-30 | 💫 Standard |
| 16   | DeepSeek-Coder-V2-Lite-Instruct | 12.5          | deepseek         | 2025-01-30 | ⚪ Basic    |
| 17   | Mixtral-8x7B-Instruct-v0.1      | 11.5          | mistral          | 2025-01-30 | ⚪ Basic    |
| 18   | gemma-2-9b-it                   | 10.5          | google           | 2025-01-30 | ⚪ Basic    |
| 19   | Yi-1.5-34B-Chat-16K             | 10.5          | yi               | 2025-01-30 | ⚪ Basic    |
| 20   | CodeLlama-34b-Instruct-hf       | 10.0          | meta             | 2025-01-30 | ⚪ Basic    |
| 21   | CodeLlama-13b-Instruct-hf       | 8.5           | meta             | 2025-01-30 | ⚪ Basic    |
| 22   | Mistral-7B-Instruct-v0.2        | 3.5           | mistral          | 2025-01-30 | ⚪ Basic    |
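
To put the pass rates in concrete terms, the sketch below converts between a pass rate and a task count. It assumes the leaderboard is scored on the 200 tasks of the Flash set and that each task is simply pass or fail; the helper names `pass_rate` and `tasks_passed` are hypothetical.

```python
FLASH_TASKS = 200  # size of the Flash set (assumed to be the scoring set)


def pass_rate(passed, total):
    """Share of tasks passed, as a percentage rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


def tasks_passed(rate_percent, total):
    """Invert a reported pass rate back into a whole number of tasks."""
    return round(rate_percent * total / 100.0)


top = tasks_passed(38.5, FLASH_TASKS)     # rank 1 on the leaderboard
second = tasks_passed(34.0, FLASH_TASKS)  # rank 2 on the leaderboard

print(top, second, top - second)
print(pass_rate(top, FLASH_TASKS))
```

Under these assumptions each task is worth 0.5 percentage points, so the gap between the first and second entries corresponds to 9 tasks.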

Tier Classification (By Ranking):

  • 🏆 Leading: The Best!
  • 🌟 Elite: Top 15%
  • 💎 Superior: Top 30%
  • 🔸 Advanced: Top 45%
  • 💫 Standard: Top 70%
  • ⚪ Basic: Bottom 30%