We push models beyond isolated function generation. Our custom SWE-bench evaluations require agents to navigate full codebases, identify the relevant context, and generate patches that pass deterministic, Dockerized unit tests. We evaluate true software engineering capability, not just text prediction: the pipeline verifies functional correctness rather than accepting plausible-looking output.
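To make the setup concrete, a single task instance might look like the record below. This is a sketch only; the field names and values are illustrative assumptions, not our actual schema.

```python
# Hypothetical shape of one evaluation task; all fields are assumptions.
task = {
    "repo": "example-org/example-lib",   # repository the container is built from
    "base_commit": "3f9c2ab",            # pinned commit with the bug present
    "issue": "Parser crashes on nested config blocks",   # problem statement shown to the agent
    "test_cmd": "pytest tests/test_parser.py -q",        # command whose exit code decides success
}
```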
The agent is provided a secure, isolated Docker container containing the full repository, dependencies, and build environment.
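As a rough sketch of that provisioning step (the image tag, container name, and flags here are our assumptions, not the real harness), launching one task could look like this:

```python
import subprocess

# Sketch only: spin up an isolated, network-less container for one task.
subprocess.run(
    ["docker", "run", "--rm", "-d",
     "--network=none",                  # no internet access during the run
     "--name", "swe-task-001",          # hypothetical container name
     "eval-image:example-lib-3f9c2ab"], # hypothetical pre-built image
    check=True,                         # fail loudly if provisioning breaks
)
```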
The model must autonomously execute `grep`, navigate files, view AST structures, and iteratively generate unified diff patches.
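The AST-inspection step, for instance, can be performed with Python's standard `ast` module; the file path below is an assumed example.

```python
import ast
import pathlib

# List every function definition in a source file so the agent can
# locate the code it needs to patch. The path is illustrative.
source = pathlib.Path("src/example_module.py").read_text()
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.FunctionDef):
        print(f"{node.name}  (line {node.lineno})")
```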
Patches are applied and strict test suites (pytest, jest, cargo test) are executed. Only fully passing runs, with zero failures or errors, are recorded as successes.
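A minimal sketch of that grading step, assuming the candidate diff has already been copied to `/tmp/candidate.patch` inside the container (the helper name, paths, and default test command are ours, not the real harness):

```python
import subprocess

def grade(container: str, test_cmd: str = "pytest -x -q") -> bool:
    """Apply the candidate patch, run the suite, and accept only exit code 0."""
    applied = subprocess.run(
        ["docker", "exec", container, "git", "apply", "/tmp/candidate.patch"],
        capture_output=True,
    )
    if applied.returncode != 0:
        return False                    # malformed or non-applying patch fails outright
    tests = subprocess.run(
        ["docker", "exec", container, *test_cmd.split()],
        capture_output=True,
    )
    return tests.returncode == 0        # success only on a fully green run
```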
We don't inflate scores with simple syntax fixes. Our proprietary evaluation set consists of complex logic bugs and architectural refactors meticulously sourced from real production environments, and it rests on two foundations:
- High-quality, specialized codebases used for testing.
- Issue-to-resolution mappings rigorously checked by engineers.