News
Microsoft's Debug-Gym is a Python-driven framework aimed at assessing the capabilities of AI agents in handling practical ...
PolyBench, a groundbreaking multi-language benchmark that exposes critical limitations in AI coding assistants across Python, JavaScript, TypeScript, and Java while introducing new metrics beyond ...