USABench

The Definitive Government Data Analysis Benchmark for LLMs

459 Evaluation Questions
100% API Execution Success
3 Government Data Sources
47.4% Top Model Accuracy

Comprehensive Framework for Public Data

Open-source benchmark specifically designed for AI evaluations on government data

Extensible, Open Standard

Designed for use with any public dataset

Transparent Methodology

Reproducible model comparison with function-calling evaluation

Community Driven

Supporting ongoing evaluation contributions from researchers and developers

Explore the Leaderboard

Performance Leaderboard

Model rankings with tier-specific breakdowns and comprehensive performance metrics

Methodology Transparency

Comprehensive evaluation framework with reproducible protocols and validation measures

Evaluation Framework Overview

  • Ragas integration with a function-calling evaluation methodology
  • 4-component binary scoring: Function Selection, Parameter Accuracy, Execution Success, Result Accuracy (scoring sketch after this list)
  • Direct LiteLLM integration without framework dependencies
  • Real-time API execution with BLS and BEA endpoints
  • Comprehensive error analysis and debugging protocols
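
Concretely, the 4-component scoring might look like the sketch below: a minimal illustration assuming, as one plausible reading, that a question counts as correct only when all four binary checks pass. The class and field names are illustrative, not USABench's actual API.

# Minimal sketch of the 4-component binary scoring described above.
# Names are illustrative, not the benchmark's actual API.
from dataclasses import dataclass

@dataclass
class QuestionScore:
    function_selection: bool  # did the model choose the correct function?
    parameter_accuracy: bool  # were the call arguments correct?
    execution_success: bool   # did the live API call succeed?
    result_accuracy: bool     # did the final answer match the reference?

    def passed(self) -> bool:
        # Assumption: a question is correct only if every check passes
        return (self.function_selection and self.parameter_accuracy
                and self.execution_success and self.result_accuracy)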

Government Data Source Integration

  • Federal agency API integration spanning OMB, BLS, and BEA datasets (example request after this list)
  • Cross-source analytical capability requirements
  • Data quality assurance and validation frameworks
  • Standardized dataset repository with 459 unified records
  • Multi-temporal coverage (2014-2024) ensuring data relevance
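
To illustrate what live execution against a federal endpoint involves, the sketch below queries the public BLS time-series API for the CPI-U series (series ID CUUR0000SA0). The payload shape and response handling follow the public BLS v2 API documentation and may differ from USABench's own request code; the registration key is a free credential you supply yourself.

import requests

# Hedged sketch of a live BLS request: CUUR0000SA0 is CPI-U
# (all items, U.S. city average). The v2 API expects a free
# registration key in the request body.
payload = {
    "seriesid": ["CUUR0000SA0"],
    "startyear": "2014",
    "endyear": "2024",
    "registrationkey": "YOUR_BLS_KEY",  # https://data.bls.gov/registrationEngine/
}
resp = requests.post(
    "https://api.bls.gov/publicAPI/v2/timeseries/data/",
    json=payload,
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["Results"]["series"]:
    for obs in series["data"][:3]:  # print the first few observations
        print(obs["year"], obs["periodName"], obs["value"])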

Complexity Tier Definitions

  • Easy (30%): Basic data retrieval and simple aggregations (question schema sketched after this list)
  • Medium (50%): Multi-table joins and statistical analysis
  • Hard (20%): Complex temporal analysis and cross-source synthesis
  • Geographic, demographic, and sectoral analysis requirements
  • Real-world analytical scenario representation
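
One way to represent tiered questions is sketched below, using pydantic (already in the dependency list). The schema is illustrative only; the benchmark's actual question format is defined in the repository.

from enum import Enum
from pydantic import BaseModel

class Tier(str, Enum):
    EASY = "easy"      # ~30%: basic retrieval, simple aggregations
    MEDIUM = "medium"  # ~50%: multi-table joins, statistical analysis
    HARD = "hard"      # ~20%: temporal analysis, cross-source synthesis

class BenchmarkQuestion(BaseModel):
    question_id: str
    source: str            # "OMB", "BLS", or "BEA"
    tier: Tier
    prompt: str
    reference_answer: str  # ground truth used for the Result Accuracy check

# Example instance (values are placeholders, not a real benchmark item)
q = BenchmarkQuestion(
    question_id="bls-0001",
    source="BLS",
    tier=Tier.EASY,
    prompt="What was the CPI-U for January 2024?",
    reference_answer="<expected value>",
)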

Performance Measurement

  • Binary accuracy scoring with execution validation (aggregation sketch after this list)
  • Statistical significance verification protocols
  • Comparative analysis across model architectures
  • Historical performance tracking and trend analysis
  • Comprehensive benchmark integrity assurance
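
A minimal sketch of the per-tier aggregation implied above: each question contributes a binary pass/fail, and tier-level accuracy is the pass rate within that tier. The function is illustrative, not the harness's actual code.

from collections import defaultdict

def accuracy_by_tier(results):
    """results: iterable of (tier_label, passed) pairs, one per question."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tier, passed in results:
        totals[tier] += 1
        passes[tier] += int(passed)
    # Binary accuracy: pass rate within each tier
    return {tier: passes[tier] / totals[tier] for tier in totals}

# Toy example: two easy questions pass, one medium question fails
print(accuracy_by_tier([("easy", True), ("easy", True), ("medium", False)]))
# -> {'easy': 1.0, 'medium': 0.0}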

Data Source Foundation

Real government economic data from authoritative federal agencies

  • OMB (Office of Management and Budget): federal budget data, economic forecasts; 2014-2024 coverage
  • BLS (Bureau of Labor Statistics): Employment Cost Index, CPI, productivity; 2014-2024 coverage
  • BEA (Bureau of Economic Analysis): GDP by industry, regional personal income; 2023-2024 coverage

Community Contribution Process

Join the evaluation ecosystem and contribute to AI progress in government data analysis

1. Model Evaluation Execution

Use the provided SDK and standardized protocols to run the comprehensive benchmark suite against your model

2. Result Validation

Performance results undergo validation and statistical significance verification to ensure benchmark integrity

3. Community Review

Submitted results are reviewed by the community alongside model documentation and technical specifications

4. Leaderboard Integration

Approved submissions are integrated into the leaderboard once community review is complete

Getting Started

Repository Access & SDK

# Clone the repository
git clone https://github.com/usabench/usabench
# Install core dependencies
pip install litellm sqlparse pydantic numpy pandas
# Run an evaluation (mixed evaluation type) against your model
python3 -m USABench --model your-model --evaluation-type mixed
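
Because the harness talks to models directly through LiteLLM (see the framework overview above), a function-calling request looks roughly like the sketch below. The bls_timeseries tool schema is hypothetical, included only to show the shape of the exchange; litellm.completion mirrors the OpenAI chat interface.

import litellm

# Hypothetical tool schema: illustrates the shape of a USABench-style
# function-calling exchange; the real tool definitions live in the repo.
tools = [{
    "type": "function",
    "function": {
        "name": "bls_timeseries",
        "description": "Fetch a BLS time series by ID and year range.",
        "parameters": {
            "type": "object",
            "properties": {
                "series_id": {"type": "string"},
                "start_year": {"type": "string"},
                "end_year": {"type": "string"},
            },
            "required": ["series_id", "start_year", "end_year"],
        },
    },
}]

# The provider API key is read from the environment (e.g. OPENAI_API_KEY)
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "How much did the CPI-U change from 2014 to 2024?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)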

Technical support and community engagement available through GitHub Issues and community forums

Strategic Positioning & Impact

Establishing the industry standard for government data analysis AI evaluation

Industry Standard Establishment

USABench establishes an authoritative evaluation framework for systems that use LLMs to access public datasets, providing systematic capability assessment tools for the AI research community. The benchmark addresses critical gaps in specialized domain evaluation while remaining accessible to independent researchers.

Community Ecosystem Development

This project provides a transparent methodology, reproducible evaluation protocols, and community support in an effort to further the conversation around AI and government data analysis. Regular model submissions and performance updates keep the benchmark relevant and establish ongoing capability measurement standards.

Government Data Analysis Advancement

USABench accelerates AI development in critical government data domains, supporting enhanced analytical capabilities across federal datasets. The benchmark enables systematic progress measurement and competitive development across model architectures and approaches.

USAFacts Sponsorship

USAFacts, a nonpartisan organization dedicated to government transparency through data, led the development of USABench and continues to review submissions. Note: USAFacts does not endorse any specific model, organization, or political party, and provides no warranty or guarantee regarding the accuracy or reliability of the benchmark or the underlying data and code. See the disclaimers in the GitHub repository.

Join the USABench Community

Be part of establishing the definitive standard for AI evaluation in government data analysis