We push models beyond isolated function generation. Our custom SWE-bench evaluations require agents to navigate full codebases, identify the relevant context, and generate patches that pass deterministic, Dockerized unit tests. We evaluate true software engineering capability, not just text prediction: the pipeline verifies functional correctness rather than accepting plausible-looking output.
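To make the setup concrete, a single task instance might look like the record below. This is a sketch only; the field names and values are illustrative assumptions, not our actual schema.

```python
# Hypothetical shape of one evaluation task; all fields are assumptions.
task = {
    "repo": "example-org/example-lib",   # repository the container is built from
    "base_commit": "3f9c2ab",            # pinned commit with the bug present
    "issue": "Parser crashes on nested config blocks",   # problem statement shown to the agent
    "test_cmd": "pytest tests/test_parser.py -q",        # command whose exit code decides success
}
```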
The agent is provided a secure, isolated Docker container containing the full repository, dependencies, and build environment.
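As a rough sketch of that provisioning step (the image tag, container name, and flags here are our assumptions, not the real harness), launching one task could look like this:

```python
import subprocess

# Sketch only: spin up an isolated, network-less container for one task.
subprocess.run(
    ["docker", "run", "--rm", "-d",
     "--network=none",                  # no internet access during the run
     "--name", "swe-task-001",          # hypothetical container name
     "eval-image:example-lib-3f9c2ab"], # hypothetical pre-built image
    check=True,                         # fail loudly if provisioning breaks
)
```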
The model must autonomously execute `grep`, navigate files, view AST structures, and iteratively generate unified diff patches.
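The AST-inspection step, for instance, can be performed with Python's standard `ast` module; the file path below is an assumed example.

```python
import ast
import pathlib

# List every function definition in a source file so the agent can
# locate the code it needs to patch. The path is illustrative.
source = pathlib.Path("src/example_module.py").read_text()
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.FunctionDef):
        print(f"{node.name}  (line {node.lineno})")
```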
Patches are applied and strict test suites (pytest, jest, cargo test) are executed. Only fully passing runs, with zero failures or errors, are recorded as successes.
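A minimal sketch of that grading step, assuming the candidate diff has already been copied to `/tmp/candidate.patch` inside the container (the helper name, paths, and default test command are ours, not the real harness):

```python
import subprocess

def grade(container: str, test_cmd: str = "pytest -x -q") -> bool:
    """Apply the candidate patch, run the suite, and accept only exit code 0."""
    applied = subprocess.run(
        ["docker", "exec", container, "git", "apply", "/tmp/candidate.patch"],
        capture_output=True,
    )
    if applied.returncode != 0:
        return False                    # malformed or non-applying patch fails outright
    tests = subprocess.run(
        ["docker", "exec", container, *test_cmd.split()],
        capture_output=True,
    )
    return tests.returncode == 0        # success only on a fully green run
```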
We don't inflate scores with simple syntax fixes. Our proprietary evaluation set consists of complex logic bugs and architectural refactors meticulously sourced from real production environments, and it rests on two foundations:
- High-quality, specialized codebases used for testing.
- Issue-to-resolution mappings rigorously checked by engineers.