Evaluation Protocols

Deep technical specifications for executing autonomous benchmarks, STEM validation, and deterministic sandboxing.

1. SWE-bench Infrastructure & Isolation

Evaluating an LLM's ability to autonomously solve GitHub issues requires executing untrusted, model-generated code. Acadify handles this securely using Ephemeral Deterministic Sandboxes.

The Container Lifecycle

When you initiate a SWE-bench evaluation via the API, the following sequence occurs within our Kubernetes cluster:

  1. Provisioning: A fresh, isolated container is provisioned running the specified python_version.
  2. Cloning & Checkout: The target GitHub repository is cloned, and the exact base_commit hash is checked out to guarantee identical starting states across models.
  3. Execution Phase: The model is given access to an interactive bash shell within the container. It begins exploring the codebase and writing patches.
  4. Evaluation Phase: The hidden test suite is executed against the model's modified codebase.
  5. Teardown: The container is instantly destroyed to prevent cross-contamination. Logs and trajectory data are exported to your Dashboard.
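The ordering of the five phases above, and the guarantee that teardown runs even when the agent fails, can be illustrated with a small sketch. This is only a model of the sequence, not Acadify's implementation: the names `sandbox`, `run_evaluation`, and `PHASES` are hypothetical, and each phase is reduced to a log entry.

```python
from contextlib import contextmanager

# Hypothetical phase names mirroring the five steps listed above.
PHASES = ["provision", "clone_and_checkout", "execute", "evaluate", "teardown"]


@contextmanager
def sandbox(log):
    """Record provisioning on entry and always record teardown on exit."""
    log.append("provision")
    try:
        yield log
    finally:
        # Runs whether or not the agent raised, so no state leaks between runs.
        log.append("teardown")


def run_evaluation(agent_step, log=None):
    """Walk through the phases in order; agent_step stands in for the model's shell session."""
    log = [] if log is None else log
    with sandbox(log):
        log.append("clone_and_checkout")
        log.append("execute")
        agent_step()  # may raise; teardown still happens
        log.append("evaluate")
    return log


print(run_evaluation(lambda: None))
```

Putting teardown in a `finally` clause is the design point: a crash or timeout during the execution phase skips evaluation but never skips cleanup.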

Endpoint: Initialize Sandbox

POST /v1/sandbox/init

curl -X POST https://api.acadifysolution.com/v1/sandbox/init \
  -H "Authorization: Bearer aca_live_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "repository": "django/django",
    "base_commit": "11a681373",
    "python_version": "3.11",
    "timeout_minutes": 45
  }'
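
For teams calling the API from Python rather than curl, the same request might be built as in the sketch below, using only the standard library. The endpoint, headers, and body fields are taken from the curl example; the helper name `build_init_request` is hypothetical, and the response format is not documented here, so the sketch stops before parsing any reply.

```python
import json
import urllib.request

API_BASE = "https://api.acadifysolution.com/v1"


def build_init_request(api_key, repository, base_commit,
                       python_version="3.11", timeout_minutes=45):
    """Build (but do not send) the POST /v1/sandbox/init request."""
    payload = {
        "repository": repository,
        "base_commit": base_commit,
        "python_version": python_version,
        "timeout_minutes": timeout_minutes,
    }
    return urllib.request.Request(
        f"{API_BASE}/sandbox/init",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_init_request("aca_live_abc123", "django/django", "11a681373")
# To send it: urllib.request.urlopen(req, timeout=30)
```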

2. GPQA (Google-Proof Q&A) STEM Validation

GPQA is a benchmark of PhD-level multiple-choice questions spanning biology, physics, and chemistry. The questions are written to be "Google-proof": skilled non-experts struggle to answer them even with unrestricted internet access, and domain experts still find them difficult.

LaTeX Formatting Constraints

To successfully evaluate a model's logical reasoning trace, the output must be strictly formatted. Our parsing engine requires final mathematical answers to be enclosed in LaTeX block equations (\[ ... \]) and intermediate chain-of-thought logic to be contained within <reasoning> XML tags.

<!-- Example of a perfectly formatted model response -->
<reasoning>
To calculate the Hamiltonian for the perturbed system, we first identify the unperturbed state...
Applying first-order perturbation theory, the energy shift is given by the integral over the wavefunctions...
</reasoning>

The final energy shift is:
\[ \Delta E^{(1)} = \frac{e^2 \hbar B}{2m_e} \]
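
To check a response against these constraints before submitting it, a pre-validation step along the following lines may help. This is a minimal sketch, not Acadify's parsing engine: it assumes the standard LaTeX display delimiters (a single backslash followed by a square bracket), treats the last display block as the final answer, and the function name `parse_response` is hypothetical.

```python
import re

# Non-greedy matches so multiple blocks in one response are kept separate.
REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
ANSWER_RE = re.compile(r"\\\[(.*?)\\\]", re.DOTALL)


def parse_response(text):
    """Return the reasoning trace and final answer, or None if the format is violated."""
    reasoning = REASONING_RE.search(text)
    answers = ANSWER_RE.findall(text)
    if reasoning is None or not answers:
        return None
    return {
        "reasoning": reasoning.group(1).strip(),
        # Assumption: the last display-math block is the final answer.
        "answer": answers[-1].strip(),
    }


sample = "<reasoning>\nApply first-order perturbation theory.\n</reasoning>\n\\[ x = 1 \\]"
print(parse_response(sample))
```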

Human-in-the-Loop Override

If the deterministic LaTeX parser fails to match the model's output to the strict benchmark key (e.g., due to an equivalent algebraic form that the regex missed), the trajectory is automatically flagged for SME Review. One of our STEM PhD network members will manually verify the mathematical equivalence.
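
The fallback described above can be pictured as a two-way triage: accept answers that match the key after cosmetic normalization, and route everything else to a reviewer. The sketch below is an illustration under stated assumptions only; the normalization rules, the function names, and the status strings `match` and `sme_review` are hypothetical and do not describe the production parser.

```python
import re


def normalize(latex):
    """Drop whitespace and purely cosmetic LaTeX tokens (\\left, \\right, spacing commands)."""
    latex = re.sub(r"\\(?:(?:left|right)(?![A-Za-z])|[,;!])", "", latex)
    return re.sub(r"\s+", "", latex)


def triage(model_answer, key_answer):
    """Return 'match' for a normalized string match, otherwise 'sme_review'."""
    if normalize(model_answer) == normalize(key_answer):
        return "match"
    # Algebraically equivalent forms (e.g. \frac{1}{2} vs 0.5) cannot be
    # confirmed by string comparison, so they go to a human reviewer.
    return "sme_review"


print(triage(r"\frac{ a }{ b }", r"\frac{a}{b}"))
print(triage(r"\frac{1}{2}", "0.5"))
```

A string-level check like this never produces a false accept for a genuinely different answer, which is why the remaining uncertain cases are deferred to a person rather than guessed at.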

3. CI/CD Continuous Evaluation

Enterprise teams cannot wait for manual benchmarking. Acadify supports seamless integration into modern CI/CD pipelines, automatically running evaluation suites against new model checkpoints before deployment.

GitHub Actions Integration

You can trigger an Acadify evaluation directly from your workflow file. Use our official GitHub Action to block PR merges if the new model checkpoint suffers a regression in SWE-bench scores.

name: Acadify Model Eval
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run SWE-bench Suite
        uses: acadify/eval-action@v1
        with:
          api_key: ${{ secrets.ACA_LIVE_KEY }}
          model_endpoint: "https://api.your-company.com/v1/inference"
          benchmark: "swe-bench-lite"
          regression_threshold: 0.95
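
The exact semantics of regression_threshold are not spelled out here. One plausible reading, shown purely as an assumption, is that the new checkpoint must retain at least that fraction of the baseline score. The function name `passes_regression_gate` and the notion of a stored baseline score are hypothetical.

```python
def passes_regression_gate(baseline_score, candidate_score, regression_threshold=0.95):
    """Return True if the candidate keeps at least `regression_threshold` of the baseline.

    Assumption: the threshold is a ratio of candidate score to baseline score,
    not an absolute score floor.
    """
    if not 0.0 <= regression_threshold <= 1.0:
        raise ValueError("regression_threshold must be between 0 and 1")
    return candidate_score >= regression_threshold * baseline_score


# With a baseline resolve rate of 0.40, the gate admits scores of 0.38 or higher.
print(passes_regression_gate(0.40, 0.39))
print(passes_regression_gate(0.40, 0.37))
```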

4. Trajectory Scoring Algorithms

How do we decide if an autonomous agent "passed" a SWE-bench evaluation? It is not just about whether the final test suite passes; it is about efficiency, cost, and safety.

The Composite Score Matrix

Metric               | Weight | Description
Functional Pass Rate | 70%    | Did the model's final git patch resolve the target issue without breaking existing tests?
Token Efficiency     | 15%    | How many tokens were consumed exploring the codebase? Fewer tokens indicate better reasoning.
Execution Safety     | 15%    | Did the model attempt to execute destructive bash commands (e.g., rm -rf /)?
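
As a worked example of how the weights combine, the sketch below computes a weighted sum. It assumes each metric has already been normalized to the range 0 to 1 (with 1 being best), which the table does not specify; the dictionary keys and the function name `composite_score` are hypothetical.

```python
# Weights taken from the table above.
WEIGHTS = {
    "functional_pass_rate": 0.70,
    "token_efficiency": 0.15,
    "execution_safety": 0.15,
}


def composite_score(metrics):
    """Weighted sum of the three metrics; each value is assumed to lie in [0, 1]."""
    for name in WEIGHTS:
        value = metrics[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)


# A patch that passes all tests, with middling token efficiency and no unsafe commands:
# 0.70 * 1.0 + 0.15 * 0.5 + 0.15 * 1.0
score = composite_score({
    "functional_pass_rate": 1.0,
    "token_efficiency": 0.5,
    "execution_safety": 1.0,
})
print(round(score, 3))
```

Because functional correctness carries 70% of the weight, a run that fails its tests cannot score above 0.30 under this reading, however efficient or safe it was.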

Fetching Trajectory Logs

You can pull the exact step-by-step terminal history of what your model did inside the sandbox.

GET /v1/eval/{job_id}/trajectory
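
A request for this endpoint might be built as in the sketch below. It assumes the trajectory endpoint lives on the same host and uses the same Bearer authentication as the sandbox endpoint, neither of which is stated for this route; the helper name `build_trajectory_request` and the sample job ID are hypothetical, and the response schema is not documented here.

```python
import urllib.parse
import urllib.request

# Assumption: same base URL as the /v1/sandbox/init endpoint.
API_BASE = "https://api.acadifysolution.com/v1"


def build_trajectory_request(api_key, job_id):
    """Build (but do not send) the GET /v1/eval/{job_id}/trajectory request."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    safe_id = urllib.parse.quote(job_id, safe="")
    return urllib.request.Request(
        f"{API_BASE}/eval/{safe_id}/trajectory",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


traj_req = build_trajectory_request("aca_live_abc123", "job_123")
# To send it: urllib.request.urlopen(traj_req, timeout=30)
```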