USABench

The Definitive Government Data Analysis Benchmark for LLMs

459 Evaluation Questions
100% API Execution Success
3 Government Data Sources
47.4% Top Model Accuracy

Comprehensive Framework for Public Data

Open-source benchmark specifically designed for AI evaluations on government data

Extensible, Open Standard

Designed for use with any public dataset

Transparent Methodology

Reproducible model comparison with function-calling evaluation

Community Driven

Supporting ongoing evaluation contributions from researchers and developers

Explore the Leaderboard

Performance Leaderboard

Model rankings with tier-specific breakdowns and comprehensive performance metrics

Methodology Transparency

Comprehensive evaluation framework with reproducible protocols and validation measures

Evaluation Framework Overview

  • Ragas integration with a function-calling evaluation methodology
  • 4-component binary scoring: Function Selection, Parameter Accuracy, Execution Success, Result Accuracy (scoring sketch after this list)
  • Direct LiteLLM integration without framework dependencies
  • Real-time API execution with BLS and BEA endpoints
  • Comprehensive error analysis and debugging protocols
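
Concretely, the 4-component scoring might look like the sketch below: a minimal illustration assuming, as one plausible reading, that a question counts as correct only when all four binary checks pass. The class and field names are illustrative, not USABench's actual API.

# Minimal sketch of the 4-component binary scoring described above.
# Names are illustrative, not the benchmark's actual API.
from dataclasses import dataclass

@dataclass
class QuestionScore:
    function_selection: bool  # did the model choose the correct function?
    parameter_accuracy: bool  # were the call arguments correct?
    execution_success: bool   # did the live API call succeed?
    result_accuracy: bool     # did the final answer match the reference?

    def passed(self) -> bool:
        # Assumption: a question is correct only if every check passes
        return (self.function_selection and self.parameter_accuracy
                and self.execution_success and self.result_accuracy)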

Government Data Source Integration

  • Federal agency API integration spanning OMB, BLS, and BEA datasets (example request after this list)
  • Cross-source analytical capability requirements
  • Data quality assurance and validation frameworks
  • Standardized dataset repository with 459 unified records
  • Multi-temporal coverage (2014-2024) ensuring data relevance
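
To illustrate what live execution against a federal endpoint involves, the sketch below queries the public BLS time-series API for the CPI-U series (series ID CUUR0000SA0). The payload shape and response handling follow the public BLS v2 API documentation and may differ from USABench's own request code; the registration key is a free credential you supply yourself.

import requests

# Hedged sketch of a live BLS request: CUUR0000SA0 is CPI-U
# (all items, U.S. city average). The v2 API expects a free
# registration key in the request body.
payload = {
    "seriesid": ["CUUR0000SA0"],
    "startyear": "2014",
    "endyear": "2024",
    "registrationkey": "YOUR_BLS_KEY",  # https://data.bls.gov/registrationEngine/
}
resp = requests.post(
    "https://api.bls.gov/publicAPI/v2/timeseries/data/",
    json=payload,
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["Results"]["series"]:
    for obs in series["data"][:3]:  # print the first few observations
        print(obs["year"], obs["periodName"], obs["value"])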

Complexity Tier Definitions

  • Easy (30%): Basic data retrieval and simple aggregations (question schema sketched after this list)
  • Medium (50%): Multi-table joins and statistical analysis
  • Hard (20%): Complex temporal analysis and cross-source synthesis
  • Geographic, demographic, and sectoral analysis requirements
  • Real-world analytical scenario representation
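
One way to represent tiered questions is sketched below, using pydantic (already in the dependency list). The schema is illustrative only; the benchmark's actual question format is defined in the repository.

from enum import Enum
from pydantic import BaseModel

class Tier(str, Enum):
    EASY = "easy"      # ~30%: basic retrieval, simple aggregations
    MEDIUM = "medium"  # ~50%: multi-table joins, statistical analysis
    HARD = "hard"      # ~20%: temporal analysis, cross-source synthesis

class BenchmarkQuestion(BaseModel):
    question_id: str
    source: str            # "OMB", "BLS", or "BEA"
    tier: Tier
    prompt: str
    reference_answer: str  # ground truth used for the Result Accuracy check

# Example instance (values are placeholders, not a real benchmark item)
q = BenchmarkQuestion(
    question_id="bls-0001",
    source="BLS",
    tier=Tier.EASY,
    prompt="What was the CPI-U for January 2024?",
    reference_answer="<expected value>",
)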

Performance Measurement

  • Binary accuracy scoring with execution validation (aggregation sketch after this list)
  • Statistical significance verification protocols
  • Comparative analysis across model architectures
  • Historical performance tracking and trend analysis
  • Comprehensive benchmark integrity assurance
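
A minimal sketch of the per-tier aggregation implied above: each question contributes a binary pass/fail, and tier-level accuracy is the pass rate within that tier. The function is illustrative, not the harness's actual code.

from collections import defaultdict

def accuracy_by_tier(results):
    """results: iterable of (tier_label, passed) pairs, one per question."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tier, passed in results:
        totals[tier] += 1
        passes[tier] += int(passed)
    # Binary accuracy: pass rate within each tier
    return {tier: passes[tier] / totals[tier] for tier in totals}

# Toy example: two easy questions pass, one medium question fails
print(accuracy_by_tier([("easy", True), ("easy", True), ("medium", False)]))
# -> {'easy': 1.0, 'medium': 0.0}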

Data Source Foundation

Real government economic data from authoritative federal agencies

  • OMB (Office of Management and Budget): federal budget data, economic forecasts; 2014-2024 coverage
  • BLS (Bureau of Labor Statistics): Employment Cost Index, CPI, productivity; 2014-2024 coverage
  • BEA (Bureau of Economic Analysis): GDP by industry, regional personal income; 2023-2024 coverage

Community Contribution Process

Join the evaluation ecosystem and contribute to AI progress in government data analysis

1. Model Evaluation Execution

Use the provided SDK and standardized protocols to run the comprehensive benchmark suite against your model

2. Result Validation

Performance results undergo validation and statistical significance verification to ensure benchmark integrity

3. Community Review

Submitted results are reviewed by the community alongside model documentation and technical specifications

4. Leaderboard Integration

Approved submissions are integrated into the leaderboard once community review is complete

Getting Started

Repository Access & SDK

# Clone the repository
git clone https://github.com/usabench/usabench
# Install core dependencies
pip install litellm sqlparse pydantic numpy pandas
# Run an evaluation (mixed evaluation type) against your model
python3 -m USABench --model your-model --evaluation-type mixed
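
Because the harness talks to models directly through LiteLLM (see the framework overview above), a function-calling request looks roughly like the sketch below. The bls_timeseries tool schema is hypothetical, included only to show the shape of the exchange; litellm.completion mirrors the OpenAI chat interface.

import litellm

# Hypothetical tool schema: illustrates the shape of a USABench-style
# function-calling exchange; the real tool definitions live in the repo.
tools = [{
    "type": "function",
    "function": {
        "name": "bls_timeseries",
        "description": "Fetch a BLS time series by ID and year range.",
        "parameters": {
            "type": "object",
            "properties": {
                "series_id": {"type": "string"},
                "start_year": {"type": "string"},
                "end_year": {"type": "string"},
            },
            "required": ["series_id", "start_year", "end_year"],
        },
    },
}]

# The provider API key is read from the environment (e.g. OPENAI_API_KEY)
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "How much did the CPI-U change from 2014 to 2024?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)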

Technical support and community engagement available through GitHub Issues and community forums

Strategic Positioning & Impact

Establishing the industry standard for government data analysis AI evaluation

Industry Standard Establishment

USABench establishes an authoritative evaluation framework for systems that use LLMs to access public datasets, providing systematic capability assessment tools for the AI research community. The benchmark addresses critical gaps in specialized domain evaluation while remaining accessible to independent researchers.

Community Ecosystem Development

This project provides a transparent methodology, reproducible evaluation protocols, and community support in an effort to further the conversation around AI and government data analysis. Regular model submissions and performance updates keep the benchmark relevant and establish ongoing capability measurement standards.

Government Data Analysis Advancement

USABench accelerates AI development in critical government data domains, supporting enhanced analytical capabilities across federal datasets. The benchmark enables systematic progress measurement and competitive development across model architectures and approaches.

USAFacts Sponsorship

USAFacts, a nonpartisan organization dedicated to government transparency through data, led the development of USABench and continues to review submissions. Note: USAFacts does not endorse any specific model, organization, or political party, and provides no warranty or guarantee regarding the accuracy or reliability of the benchmark or the underlying data and code. See the disclaimers in the GitHub repository.

Join the USABench Community

Be part of establishing the definitive standard for AI evaluation in government data analysis