Kaggle launches Community Benchmarks to evaluate modern AI
Hacker News

Kaggle has launched a new Community Benchmarks feature that lets the global AI community design, run and share custom AI model evaluations, moving beyond static accuracy scores to better reflect real-world model behavior.


Introducing Community Benchmarks on Kaggle

Jan 14, 2026

Today’s AI models require more than static accuracy scores. Community Benchmarks, a new capability on Kaggle, enables the global AI community to design, run and share custom evaluations that better reflect real-world model behavior.


General summary

Kaggle launched Community Benchmarks so you can design and share custom benchmarks for evaluating AI models. You can build tasks to test model performance on specific problems. Group those tasks into a benchmark to evaluate leading AI models and track their performance on a leaderboard.


Today, Kaggle is launching Community Benchmarks, which lets the global AI community design, run and share their own custom benchmarks for evaluating AI models. This is the next step after last year's launch of Kaggle Benchmarks, which provides trustworthy and transparent access to evaluations from top-tier research groups, such as Meta's MultiLoKo and Google's FACTS suite.

Why community-driven evaluation matters

AI capabilities have evolved so rapidly that it’s become difficult to evaluate model performance. Not long ago, a single accuracy score on a static dataset was enough to determine model quality. But today, as LLMs evolve into reasoning agents that collaborate, write code and use tools, those static metrics and simple evaluations are no longer sufficient.

Kaggle Community Benchmarks provide developers with a transparent way to validate their specific use cases and bridge the gap between experimental code and production-ready applications.

These real-world use cases demand a more flexible and transparent evaluation framework. Kaggle's Community Benchmarks provide a more dynamic, rigorous and continuously evolving approach to AI model evaluation — one shaped by the users building and deploying these systems every day.

How to build your own benchmarks on Kaggle

Benchmarks start with building tasks, which can range from evaluating multi-step reasoning and code generation to testing tool use or image recognition. Once you have tasks, you can add them to a benchmark to evaluate and rank selected models by how they perform across the tasks in the benchmark.
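The task-to-benchmark-to-leaderboard flow described above can be sketched in plain Python. This is an illustrative model only, not the kaggle-benchmarks SDK: the `Task` and `Benchmark` classes, their fields, and the grading convention are all assumptions made for this sketch.

```python
# Illustrative sketch only: this is NOT the kaggle-benchmarks SDK API,
# just a minimal model of the task -> benchmark -> leaderboard flow.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    """A single evaluation: a prompt plus a grader that scores a response."""
    name: str
    prompt: str
    grade: Callable[[str], float]  # returns a score in [0.0, 1.0]


@dataclass
class Benchmark:
    """A named collection of tasks that models are ranked against."""
    name: str
    tasks: List[Task]

    def evaluate(self, models: Dict[str, Callable[[str], str]]) -> List[Tuple[str, float]]:
        """Run every model on every task and rank by mean task score."""
        scores = {}
        for model_name, generate in models.items():
            task_scores = [t.grade(generate(t.prompt)) for t in self.tasks]
            scores[model_name] = sum(task_scores) / len(task_scores)
        # Leaderboard: highest average score first.
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy usage with stand-in "models" (plain functions instead of real LLMs).
arith = Task("arithmetic", "What is 2 + 2?", lambda r: 1.0 if "4" in r else 0.0)
echo = Task("echo", "Say hello", lambda r: 1.0 if "hello" in r.lower() else 0.0)
bench = Benchmark("toy-bench", [arith, echo])

leaderboard = bench.evaluate({
    "model-a": lambda p: "4" if "2 + 2" in p else "hello",
    "model-b": lambda p: "I don't know",
})
print(leaderboard)  # model-a ranks first with a mean score of 1.0
```

The real platform handles model access, execution and the public leaderboard for you; the sketch only shows the shape of the idea, where tasks carry their own grading logic and a benchmark aggregates them into a ranking.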


These capabilities are powered by the new kaggle-benchmarks SDK.

How we’re shaping the future of AI evaluation

The future of AI progress depends on how models are evaluated. With Kaggle Community Benchmarks, Kagglers are no longer just testing models; they're helping shape the next generation of intelligence.

Ready to build? Try Community Benchmarks today.



Related articles

  1. Measuring progress toward AGI: A cognitive framework

    Google DeepMind · about 1 month ago

  2. Advancing AI benchmarking through game arenas

    3 months ago

  3. Rethinking how we measure AI intelligence

    Google DeepMind · 9 months ago